At the recent VeeamOn Tour, held in London, I was lucky enough to be asked to sit on a Veeam Vanguard panel on the topic of data security, which led to the question: "Everyone should have a recovery plan, but how do you ensure it is reliable?" Let's go through the points I offered on the day in a little more detail.
When talking to customers about recovery plans, there are four points I like to discuss:
Understand your valuable data/core systems & processes
"Kind of obvious, Craig," I hear you say. Well, yes and no... it's not always the usual suspects. Everyone would immediately point to production data as their most valuable data, and that's not wrong. It's just that there's more to valuable data than production data alone.
Companies need to be able to take a step back from production and look at their data estate holistically. Yes, we want production data/systems protected and back up and running ASAP, but what about the data/systems that sit upstream or downstream from production?
For example, you're a manufacturing business taking online orders and you've encountered a major issue. Invoking your recovery plan, you fail over your business to run from your DR site. There's no point failing over to DR if, for example, the dispatch department can't print address labels because their label printing PC and courier scheduling software are offline. Or you fail over, but neglected to include your Payroll (not a production system... right?). I'm sure you would want the ability to pay your employees.
Tip: Before writing a recovery plan, run a table top exercise with stakeholders from each area of the business and walk through scenarios: "What if we lost x/y/z system? What data is impacted? Who is impacted? What upstream/downstream services are affected?"
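If you want to capture the output of that table top exercise somewhere more useful than a whiteboard, here is a minimal sketch of the "what is downstream of x?" question in Python. The system names and the dependency map are hypothetical, loosely based on the manufacturing example above; your own map would come from the stakeholders in the room.

```python
from collections import deque

# Hypothetical map: each system -> the systems that directly depend on it (downstream).
DOWNSTREAM = {
    "online_orders": ["order_database"],
    "order_database": ["label_printing_pc", "courier_scheduling", "invoicing"],
    "label_printing_pc": ["dispatch"],
    "courier_scheduling": ["dispatch"],
    "invoicing": [],
    "dispatch": [],
    "hr_system": ["payroll"],
    "payroll": [],
}


def impacted_by(lost_system, downstream=DOWNSTREAM):
    """Return every system affected, directly or indirectly, if lost_system goes offline."""
    seen = set()
    queue = deque([lost_system])
    while queue:
        current = queue.popleft()
        for dependant in downstream.get(current, []):
            if dependant not in seen:  # also guards against circular dependencies
                seen.add(dependant)
                queue.append(dependant)
    return sorted(seen)


if __name__ == "__main__":
    for system in ("order_database", "hr_system"):
        print(f"If we lost {system}: {impacted_by(system)}")
```

Even a simple map like this tends to surface the "not production... right?" systems, such as the label printing PC, that would otherwise be missed.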
Ensure your data is clean
Another obvious one, but there's no point having a data protection/recovery plan if the data you're recovering isn't clean. So how do you know your data is clean? There's an old sys admin saying, "backups are your last line of defence", and never more so than in today's environment. Backups should be one part of a multi-layered data protection strategy. Leverage the features of your backup vendor that allow you to perform checks on the data; for example, Veeam offer SureBackup/SureReplica and Secure Restore.
Tip: Incorporate regular testing/scanning of your backup data; it should also be considered part of your valuable data.
Have a clean place to put your recovered data
There's little point having clean data to recover if you don't have a clean place to recover to. A consideration for your recovery plan: "Do I have adequate resources available to recover?" Again, it's not always the usual suspects when it comes to resources:
Time: Something we can't buy from a vendor or create, especially when the CEO is stood by our desk during an incident shouting "GET IT BACK ONLINE NOW". We can leverage automation to speed up recovery.
Infrastructure: Not everyone has infrastructure ready to go. Look to leverage 3rd party MSPs and/or hyperscalers to provision the required resources.
People: Do the staff have the necessary skills/process knowledge to recover within the timescales required?
Tip: Include your DR infrastructure/systems in your security scanning/patching schedule.
Backup plan for your recovery plan
To quote Mike Tyson, "Everyone has a plan, until they get punched in the mouth", which is a nice way of saying you can do all the planning you like, but until the event you're planning for actually happens, you'll never know whether the plan holds up. Fortunately, technology today means we can simulate most scenarios and automate recovery where possible.
However, automation should not mean teams are off the hook for knowing the full recovery process. If you have a 100 step automated recovery plan and it fails at step 57 during a live event, your staff need to know how to progress that recovery plan from step 57 onwards... have a backup ;)
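To illustrate the "fails at step 57" point, here is a minimal sketch of an automated plan that reports exactly where it stopped and can be resumed from that step once staff have fixed the problem by hand. The `run_plan` function and the three steps are hypothetical and do not represent any particular orchestration product.

```python
def run_plan(steps, start_at=1):
    """Run (name, action) steps in order, beginning at start_at (1-based).

    Returns None if every step succeeds, otherwise the number of the
    step that failed, so staff know exactly where to pick up.
    """
    for number, (name, action) in enumerate(steps, start=1):
        if number < start_at:
            continue  # already completed in an earlier run
        try:
            action()
            print(f"step {number} ok: {name}")
        except Exception as error:
            print(f"step {number} FAILED: {name} ({error})")
            return number
    return None


if __name__ == "__main__":
    state = {"dns_fixed": False}

    def update_dns():
        # Simulated failure until someone resolves the issue manually.
        if not state["dns_fixed"]:
            raise RuntimeError("DNS update rejected")

    steps = [
        ("power on replica VMs", lambda: None),
        ("update DNS records", update_dns),
        ("start application services", lambda: None),
    ]

    failed_at = run_plan(steps)          # stops at step 2
    state["dns_fixed"] = True            # staff fix the problem by hand
    run_plan(steps, start_at=failed_at)  # resume from the failed step
```

The automation only helps here because someone knows what step 2 does and how to fix it manually; the resume option is no substitute for that knowledge.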
OK, you have your recovery plan. How often are you testing it? Once a year? Every 6 months? Are you performing full failovers? Much like your backups, you should consider incremental recovery failovers. Still perform your annual/bi-annual "big bang" failover, but also look to run regular single server/system recoveries. These will be less overwhelming for staff and easier to run on a more frequent basis. Pick a different service/system each time.
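One simple way to "pick a different service/system each time" is to always test whichever system has gone longest without a recovery test. A minimal sketch, assuming you keep a record of last test dates; the systems and dates below are made up for illustration.

```python
from datetime import date


def next_system_to_test(last_tested):
    """Pick the system whose last recovery test is oldest.

    last_tested maps system name -> date of last test, or None if the
    system has never been tested (never-tested systems come first).
    """
    return min(last_tested, key=lambda name: last_tested[name] or date.min)


if __name__ == "__main__":
    # Hypothetical test history.
    history = {
        "order_database": date(2024, 3, 1),
        "payroll": None,
        "label_printing_pc": date(2023, 11, 15),
    }
    target = next_system_to_test(history)
    print(f"Next recovery test: {target}")
    history[target] = date.today()  # record the test once it has been run
```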
Tip: Your recovery plan should *NOT* be a static process/document. It should evolve with your environment, so tie it in with your Change Management system. Add a consideration to your change process: "Does this impact our ability to recover with our existing recovery plan?"
For example, you create a recovery plan in January, but need to "dust it off" in September to recover from an incident. That's 9 months' worth of infrastructure/application/process changes that could impact the ability to recover. Keep your recovery plan up to date.
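As a rough sketch of what that tie-in could look like, the check below lists every change made since the plan was last reviewed that touches a system the plan covers. The record layout (`date`, `system`, `summary`) and all of the sample data are assumptions for illustration; a real Change Management system will have its own export format.

```python
from datetime import date


def changes_since_review(plan_reviewed_on, plan_systems, changes):
    """Return changes made after the plan's last review that touch systems the plan covers."""
    return [
        change
        for change in changes
        if change["date"] > plan_reviewed_on and change["system"] in plan_systems
    ]


if __name__ == "__main__":
    # Hypothetical plan details and change records.
    reviewed_on = date(2024, 1, 15)
    covered = {"order_database", "payroll"}
    change_log = [
        {"date": date(2023, 12, 1), "system": "payroll", "summary": "patching"},
        {"date": date(2024, 3, 10), "system": "payroll", "summary": "moved to new host"},
        {"date": date(2024, 5, 2), "system": "intranet", "summary": "theme update"},
    ]
    for change in changes_since_review(reviewed_on, covered, change_log):
        print(f"Review plan: {change['system']} changed on {change['date']} ({change['summary']})")
```

If that list is long when you come to dust the plan off, the plan is telling you it is out of date.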
There we have it: a few discussion/talking points to help you create a reliable, relevant and actionable recovery plan. The list is by no means exhaustive. Do you agree or disagree with any of the above points? Is there anything you would add?