Everyone should have a recovery plan, but how do you ensure it is reliable?

  • 18 September 2023


At the recent VeeamON Tour in London, I was lucky enough to be asked to sit on a Veeam Vanguard panel to talk about data security, which led to the question "Everyone should have a recovery plan, but how do you ensure it is reliable?". Let's go through the points I offered on the day in a little more detail.

When talking to customers about recovery plans, there are four points I like to discuss:

Understand your valuable data/core systems & processes

"Kind of obvious Craig" I hear you say. Well yes and no…it's not always the usual suspects. Everyone would immediately point to production data as their most valuable data, and it's not wrong. It just there's more to valuable data than just production data.

Companies need to be able to take a step back from production and look at their data estate holistically. Yes, we want production data and systems protected and up and running ASAP, but what about the data and systems that sit upstream or downstream of production?

For example, you're a manufacturing business taking online orders and you've encountered a major issue. Invoking your recovery plan, you fail over the business to run from your DR site. There's no point failing over to DR if, for example, the dispatch department can't print address labels because their label-printing PC and courier-scheduling software are offline. Or you fail over but neglect to include Payroll (not a production system… right?); I'm sure you would still want the ability to pay your employees.

Tip: Before writing a recovery plan, run a tabletop exercise with stakeholders from each area of the business and work through scenarios: "What if we lost x/y/z system? What data is impacted? Who is impacted? What upstream/downstream services are affected?"
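To make that tabletop exercise concrete, here's a minimal sketch (in Python, with entirely hypothetical system names and dependencies) of how you might map which services sit downstream of a given system and answer the "what's affected?" question programmatically:

```python
# Hypothetical sketch: a minimal dependency map to support a tabletop exercise.
# System names and dependency relationships are invented for illustration.

# Map each system to the systems that depend on it (its downstream consumers).
DEPENDENTS = {
    "erp": ["order-portal", "dispatch-labels", "payroll"],
    "order-portal": ["courier-scheduling"],
    "dispatch-labels": [],
    "courier-scheduling": [],
    "payroll": [],
}

def impacted_by(failed_system: str) -> set[str]:
    """Return every system directly or indirectly affected if `failed_system` is lost."""
    impacted = set()
    queue = [failed_system]
    while queue:
        current = queue.pop()
        for dependent in DEPENDENTS.get(current, []):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted

if __name__ == "__main__":
    # "What if we lost the ERP system?"
    print(sorted(impacted_by("erp")))
    # -> ['courier-scheduling', 'dispatch-labels', 'order-portal', 'payroll']
```

Even a simple map like this makes the workshop output reusable: the next time someone asks "who is impacted if x goes down?", the answer doesn't live only in people's heads.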

 

Ensure your data is clean

Another obvious one, but there's no point having a data protection/recovery plan if the data you're recovering isn't clean. So how do you know your data is clean? There's an old sysadmin saying, "backups are your last line of defence", and never more so than in today's environment. Backups should be one part of a multi-layered data protection strategy, so leverage the features of your backup vendor to perform checks on the data; for example, Veeam offers SureBackup/SureReplica and Secure Restore.

Tip: Incorporate regular testing/scanning of your backup data; it should also be considered part of your valuable data.
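As a generic illustration of routine backup verification (not a substitute for vendor features like SureBackup/SureReplica or malware scanning), here's a minimal sketch that checks backup files against a previously stored checksum manifest. The paths and manifest format are assumptions for the example.

```python
# Hypothetical sketch: a basic integrity check over backup files using a stored
# checksum manifest. Generic illustration only; it complements, not replaces,
# vendor verification and security scanning.

import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large backup files don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_backups(backup_dir: Path, manifest_file: Path) -> list[str]:
    """Compare current file hashes against the manifest; return any problems found."""
    manifest = json.loads(manifest_file.read_text())  # {"filename": "expected sha256", ...}
    problems = []
    for name, expected in manifest.items():
        candidate = backup_dir / name
        if not candidate.exists():
            problems.append(f"missing: {name}")
        elif sha256_of(candidate) != expected:
            problems.append(f"checksum mismatch: {name}")
    return problems

if __name__ == "__main__":
    # Example paths are hypothetical.
    for issue in verify_backups(Path("/backups/weekly"), Path("/backups/weekly/manifest.json")):
        print(issue)
```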

 

Have a clean place to put your recovered data

There's little point having clean data to recover if you don't have a clean place to recover to. A key consideration for your recovery plan is "Do I have adequate resources available to recover?". Again, it's not always the usual suspects when it comes to resources (a small capacity-check sketch follows the list below):

a. Time

Something we can't buy from a vendor or create, especially when the CEO is standing by our desk during an incident shouting "GET IT BACK ONLINE NOW". We can leverage automation to speed up recovery.

b. Hardware

Not everyone has infrastructure ready to go. Look to leverage third-party MSPs and/or hyperscalers to provision the required resources.

c. People

Do the staff have the necessary skills and process knowledge to recover within the timescales required?
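Here's the capacity-check sketch mentioned above: a hypothetical example that totals the resource needs of the systems in the recovery plan and compares them with what the DR site can offer. All workload names and figures are invented.

```python
# Hypothetical sketch: check whether the DR site has enough capacity to host the
# workloads named in the recovery plan. Workload names and numbers are invented.

from dataclasses import dataclass

@dataclass
class Capacity:
    vcpus: int
    ram_gb: int
    storage_tb: float

# Resource needs of the systems the recovery plan says must come back.
REQUIRED = {
    "erp": Capacity(vcpus=16, ram_gb=64, storage_tb=2.0),
    "order-portal": Capacity(vcpus=8, ram_gb=32, storage_tb=0.5),
    "payroll": Capacity(vcpus=4, ram_gb=16, storage_tb=0.2),
}

# What the DR site actually has available.
DR_SITE = Capacity(vcpus=24, ram_gb=96, storage_tb=3.0)

def shortfall(required: dict[str, Capacity], available: Capacity) -> Capacity:
    """Return how much extra capacity is needed (zero or negative means enough)."""
    total = Capacity(
        vcpus=sum(c.vcpus for c in required.values()),
        ram_gb=sum(c.ram_gb for c in required.values()),
        storage_tb=sum(c.storage_tb for c in required.values()),
    )
    return Capacity(
        vcpus=total.vcpus - available.vcpus,
        ram_gb=total.ram_gb - available.ram_gb,
        storage_tb=round(total.storage_tb - available.storage_tb, 2),
    )

if __name__ == "__main__":
    print(shortfall(REQUIRED, DR_SITE))
    # -> Capacity(vcpus=4, ram_gb=16, storage_tb=-0.3): short on CPU and RAM, fine on storage
```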

Tip: Include your DR infrastructure/systems in your security scanning/patching schedule.

 

Have a backup plan for your recovery plan

To quote Mike Tyson, "Everyone has a plan until they get punched in the mouth", which is a nice way of saying you can do all the planning you like, but until the event you're planning for happens, you'll never know. Fortunately, technology today means we can simulate most scenarios and automate recovery where possible.

However, automation shouldn't let teams off the hook for knowing the full recovery process. If you have a 100-step automated recovery plan and it fails at step 57 during a live event, your staff need to know how to progress that recovery plan from step 57 onwards… have a backup ;)
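As a rough illustration of that idea, here's a minimal sketch of a runbook runner that records each completed step to a checkpoint file, so if the automation stops part-way through, staff can see exactly which step to continue from manually. Step names, functions and the checkpoint format are assumptions made up for the example.

```python
# Hypothetical sketch: a recovery runbook that checkpoints completed steps, so a
# failure mid-run leaves an obvious hand-off point for manual recovery.

import json
from pathlib import Path

CHECKPOINT = Path("recovery_checkpoint.json")

def restore_storage():
    print("storage restored")

def restore_databases():
    print("databases restored")

def restore_app_tier():
    raise RuntimeError("application tier failed to start")  # simulate a failure

RUNBOOK = [
    ("01-restore-storage", restore_storage),
    ("02-restore-databases", restore_databases),
    ("03-restore-app-tier", restore_app_tier),
]

def run(runbook):
    done = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else []
    for name, step in runbook:
        if name in done:
            continue  # already completed on a previous attempt
        try:
            step()
        except Exception as error:
            print(f"FAILED at {name}: {error} -- continue manually from this step")
            return
        done.append(name)
        CHECKPOINT.write_text(json.dumps(done))  # record progress after each step
    print("recovery complete")

if __name__ == "__main__":
    run(RUNBOOK)
```

The point isn't the tooling; it's that the automation leaves a trail humans can pick up from, rather than failing silently at "step 57".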

OK, you have your recovery plan, but how often are you testing it? Once a year? Every six months? Are you performing full failovers? Much like your backups, you should consider incremental recovery failovers. Still perform your annual/bi-annual "big bang" failover, but also look to run regular single server/system recoveries. They are less overwhelming for staff and easier to run on a more frequent basis. Pick a different service/system each time.
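One simple way to make sure nothing gets skipped is to track when each system's recovery was last exercised and always test the one that has waited longest. A hypothetical sketch (invented names and dates):

```python
# Hypothetical sketch: choose which single system to exercise in the next
# small-scale recovery test, cycling through the estate so nothing is missed.

from datetime import date

LAST_TESTED = {
    "erp": date(2023, 3, 1),
    "order-portal": date(2023, 6, 12),
    "payroll": date(2023, 1, 20),
    "dispatch-labels": date(2023, 7, 3),
}

def next_test_candidate(last_tested: dict[str, date]) -> str:
    """Return the system whose recovery has gone longest without a test."""
    return min(last_tested, key=last_tested.get)

if __name__ == "__main__":
    print(next_test_candidate(LAST_TESTED))  # -> payroll
```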

Tip: Your recovery plan should *NOT* be a static process/document. It should evolve with your environment, so tie it in with your Change Management system. Add a consideration to your change process: "Does this impact our ability to recover with our existing recovery plan?"

For example, you create a recovery plan in January but need to "dust it off" in September to recover from an incident. That's nine months' worth of infrastructure/application/process changes that could impact the ability to recover. Learn to keep your recovery plan up to date.

There we have it: a few discussion/talking points to help you create a reliable, relevant and actionable recovery plan. The list is by no means exhaustive. Do you agree or disagree with any of the above points? Is there anything you would add?


6 comments


Nice, well-rounded recovery plan talking points to consider, Craig. Thanks for sharing!


This was a great article, Craig. I actually read it on your blog when I got the email update. Some really great points here for recovery. 👍


Great point. I make a point of using Veeam Labs to do testing every once in a while myself to confirm the backups are good. 

There’s also the fact I have to do frequent restores because someone always deletes something they shouldn’t hahah

 


I had a discussion with a delegate at the event, and they asked how they could incorporate their third-party security software into the testing. I told them they have two options:

  1. Install the agents onto the VMs and back up/replicate the VMs; the agent is then available in the virtual labs.
  2. Add a VM to the virtual lab when it's running in test, and put one interface into the lab network. Remember, you're not constrained to just the confines of the backup data/jobs.

Great points @Cragdoo. Misplaced confidence will still prevent many organisations from following/implementing the best practices you have highlighted.


That is definitely solid advice, and I can definitely agree on needing the appropriate Time, Hardware, and People. Below are some points of experience on why having each of those resources is useful, from working at a company that didn't really have any of them (humorous to read perhaps, though probably not necessary).

At the last company I worked at, the CEO would absolutely come and stand by my desk (and other people's on the team) and insist it should all be working again by now, along the lines of "we have 'blah blah' software, so why isn't it working?". He never seemed to fully comprehend the human component of IT support and disaster recovery. This is the same CEO who repeatedly turned down attempts to automate processes because "I don't trust scripts, you never know what they're going to do", and then, when things didn't work fast enough, would step in and try to "help" but usually just got in the way, because someone from the team now had to supervise the CEO, who didn't really know much about how everything worked but wouldn't stay out of the way. That someone was usually me, as the person in charge of things there and the most knowledgeable about everything, which meant the person who could most quickly take over and fix things was unavailable to actually do so.

Hardware is good, and redundant hardware is definitely good to have, whether that's an offline spare that can be physically set up if needed or (preferably) an online spare that's standing by until it's needed. Of course, this sort of thing costs money, and it can be harder to justify the expense of something that just sits there until the primary becomes unusable for one reason or another. The company I was at had more of an "if it works, we're keeping it" policy on old hardware, so our "spares" were always 5+ years older than our production hardware; whenever things were upgraded we just kept the old kit around to use as "failover" hardware. As the person in charge of handling any such "failover" processes, I can say that was not fun. I definitely recommend any intended "failover" hardware be as identical to "production" hardware as possible. These days cloud failover can be more appealing because you don't have to buy hardware that then does "nothing" until a failover, but the expense can certainly be less appealing if operations can't be restored back to standard production hardware relatively quickly after an incident. (The company I was at was an MSP in a situation where "cloud" services couldn't be used due to (very outdated) compliance requirements; virtually everything had to be hosted at our customers' offices.)

People are great too. Unfortunately, I can say this all too well: it is important to have the right people, and even with the right people, it's important to be able to work cooperatively and to be aware that not every individual can take over and do everything by themselves. I can attest very well to how minimum-wage staff who haven't had any IT work experience before can't just be dropped into doing things. I'm all for the notion that nearly any position can be someone's first position, but if someone doesn't have experience they need to be able to rely on their team for assistance. The same prior company I worked at was keen on hiring people with no experience, saying they could learn as they went, but the CEO disapproved of teamwork: if people didn't know something, they should ask the CEO or their supervisor, not their coworkers (who had better things to do). Of course, the CEO didn't actually know that much, and the supervisors were all non-IT "management staff" whose little technical knowledge came from consulting the CEO when something wasn't working.

(The company had very high turnover at all levels of the IT staff; most people quit within 90 days of getting hired, so virtually everyone was "new". And I know from talking with them that most people cited the CEO as the reason for quitting.)
