Recently, I had to create a Disaster Recovery (DR) Plan for my organization. It’s actually quite surprising none had ever been created. Part of that is on me since I’ve been here long enough to have done so, and am responsible for most of the systems we would need to recover in the event of a disaster. But let’s be honest, creating DR Plans isn’t one of those “exotic” tasks a Systems Architect likes to do. Working with and playing with the tech we manage is what we get excited about! Regardless, it still *needs* to be done.
One may ask, why does a DR Plan need to be created at all? The answer may seem obvious to most, but to some it may not. I thought I’d answer that question here, as well as provide a “skeleton” list of items to consider when building out your own DR Plan.
First, having a DR Plan brings a sense of “calm” for the IT staff. In the event of a disaster of any sort (a site outage, malicious intent such as ransomware or a disgruntled employee, a major hardware failure, etc.), chaos can surely take over. Having a DR Plan can lessen this chaos, because you’ve already thought about such scenarios and considered the various systems impacted, and you have a working process for recovering your systems and moving forward through the disaster. Second, it gives IT a procedural model to follow when attempting to recover your systems. Your DR Plan may not be perfect, but at least it’ll provide a solid groundwork to begin your recovery processes. When you get through your disaster scenario, you can then revisit your DR Plan and make adjustments and/or corrections where needed. Lastly, having a DR Plan can give your “C-Levels” a bit of assurance the business can recover and continue operations in the event of a disaster.
Now, what items should you consider when devising your DR Plan? I’ll list some items below:
- Before devising your DR Plan, decide what audience will read the Plan. For example, when discussing IT technologies such as DNS, LDAP, DHCP, etc., if there is a possibility your C-Levels will glance at your document, you’ll need to give a brief overview of what those are so they understand where they fall in your service levels and why they matter. If the Plan won’t be used by anyone but the IT Staff, it’s a pretty safe bet you don’t have to be as descriptive
- First & foremost, depending on the extent of your DR scenario, keep in mind you may not be able to use your phone or email systems. As such, you’ll need to include an alternative means your business will use to communicate with each other, vendors, security professionals, and your customers. Use of your personal devices or personal email may be the only way to communicate until you can recover your systems
- Service Level Agreements - engage pertinent IT Staff, as well as other relevant departmental staff, to determine the recovery time objective (RTO), the time within which your business expects certain services and applications to be recovered. Then model your recovery processes off those SLAs. It’s also relevant to discuss recovery point objectives (RPO), the amount of data loss you’re willing to deal with on your systems. Backup RPO is more than likely going to be different than DR RPO. There are a few exceptions, depending on how you have your BU/DR environment set up, but the difference should be explained up front to those who may have unrealistic expectations
- Network - if recovery is needed in a secondary (DR) site, is the network already set up so all you have to worry about is system recovery? If it isn’t set up, what needs to be done prior to system recovery? Depending on what is needed here, you may need to create an ‘Appendix’ item and detail what needs to be done for this step. Or, if you have a ‘flat’ Layer 2 network spanning both sites, not much, if anything, would need to be done
- Recovery Details - personally, I don’t agree with adding descriptive detail of how to recover your systems. A high-level overview should be good enough. But I do recommend providing links to the recovery task details published by the vendors you use, to help in your recovery efforts. For example, you can place links to Veeam’s Helpcenter User Guides on how to perform Replica Failover, Failback, etc. in your DR Plan document so you don’t have to spend extra time looking up details if they’re needed.
- TEST! This goes without saying, but once you’ve recovered your systems, you of course need to test your systems and applications to verify they’re up & working ok
- Failback - if you lost your production/CO site, make sure to include high-level steps required to fail your systems and applications back to their original state once you’ve restored your CO site. Again, details probably aren’t prudent here, but do provide resources (links) where you can look if needed
- TEST! Yes, again. I recommend performing a minor mock DR scenario if you can so you know at least a little of what to expect when the ‘real thing’ hits your organization. Doing so can help alleviate a little bit of the great anxiety you’ll be feeling when a real-world DR situation occurs
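
If you want something more concrete for the Service Level Agreements and TEST items above, here’s a minimal Python sketch of how you might write down the agreed RTO/RPO per service and compare them against what a mock DR test actually measured. To be clear, this is only an illustration: the service names, the target values, and the helper names (`ServiceTarget`, `DrillResult`, `find_misses`) are all hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ServiceTarget:
    """Agreed recovery targets for one service (values below are made up)."""
    name: str
    rto: timedelta  # maximum acceptable time to restore the service
    rpo: timedelta  # maximum acceptable window of data loss


@dataclass(frozen=True)
class DrillResult:
    """What a mock DR test measured for one service."""
    name: str
    recovery_time: timedelta
    data_loss: timedelta


def find_misses(targets, results):
    """Return (service, metric, measured, target) for every missed target."""
    by_name = {t.name: t for t in targets}
    misses = []
    for r in results:
        t = by_name[r.name]
        if r.recovery_time > t.rto:
            misses.append((r.name, "RTO", r.recovery_time, t.rto))
        if r.data_loss > t.rpo:
            misses.append((r.name, "RPO", r.data_loss, t.rpo))
    return misses


# Hypothetical targets agreed with the business.
targets = [
    ServiceTarget("dns", rto=timedelta(hours=1), rpo=timedelta(hours=24)),
    ServiceTarget("erp", rto=timedelta(hours=4), rpo=timedelta(minutes=15)),
]

# Hypothetical measurements from a mock DR test.
results = [
    DrillResult("dns", recovery_time=timedelta(minutes=30), data_loss=timedelta(hours=12)),
    DrillResult("erp", recovery_time=timedelta(hours=6), data_loss=timedelta(minutes=10)),
]

for name, metric, measured, target in find_misses(targets, results):
    print(f"{name}: missed {metric} (measured {measured}, target {target})")
```

With these made-up numbers, only the "erp" service misses its RTO. The point isn’t the script itself, but that writing targets down next to real test measurements makes it obvious which expectations need revisiting.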
Well, I think that’ll provide a good start for anyone needing to come up with a DR Plan on their own, or who occasionally reviews their Plan and is always looking for items they may have missed initially. What other tasks or processes do you all think are pertinent for a DR Plan that I left out above?
Cheers!