World Backup Day 2024: SLAs & Exceeding Them!



Hi, for this year’s World Backup Day I’ll be providing two posts on topics that I believe are important for data protection. This is the first one: SLAs & Exceeding Them!

So, what is an SLA? An SLA (Service Level Agreement) is typically an accepted standard for a service. In the case of data protection, SLAs are often expressed as RPOs (Recovery Point Objectives) or RTOs (Recovery Time Objectives). These metrics are already well defined elsewhere, so I won’t go over old ground here, but they are important for the topic at hand.

Typically any organisation will have an RPO: the maximum tolerance for data loss when restoring from backup. There might be multiple RPOs depending on the “tier” of the service being protected, for example 24 hours for a reporting server but 1 hour for a SQL database.

Whenever I engage with a customer and look at configuring backups, I’ll ask how frequently they want to back up their data. The frequency I’m given always equals the RPO for the service, so the next question I ask is “why?”. This often takes the customer by surprise, but consider this:

When designing a virtualisation cluster, we measure the number of nodes within the cluster that we can lose without impacting service. When a salesperson reaches their target for the month, they look to exceed it, not simply stop and wait for the target to reset.

But in data protection, we stop at the border of acceptable. If we need to back something up once a day, that is what we do. But where is the fault tolerance in this? Should you have an unexpected power cut, a server crash, or a backup repository that is unexpectedly low on space, and a backup fails as a result, then you have violated your SLA.

There are no other words for it: if you’ve agreed with the organisation to make sure the data loss from a major incident doesn’t exceed 24 hours, and then your nightly backup fails, you’ve broken that agreement.

But it doesn’t have to be this way!

Instead, consider halving your backup intervals, thereby doubling your backup frequency. With incremental backups, each run won’t take as long and typically won’t result in much higher data consumption (unless your servers generate a LOT of temporary data that you can’t exclude).

In this scenario, a single backup failure still leaves you half of the time defined within your SLA to resolve the issue before the SLA is breached. And should the issue surrounding your backups look like it will impact your SLA, you can proactively warn the business of exceptional circumstances before you’ve violated the SLA, which is always a better position to be in.
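To put rough numbers on this, here is a minimal Python sketch of the arithmetic. It rests on simplifying assumptions that are mine rather than anything measured: backups run at fixed intervals and complete instantly, and the function names are purely illustrative.

```python
def worst_case_data_loss_hours(interval_hours, consecutive_failures=0):
    """Worst-case age of the newest good backup, in hours.

    Assumption: backups run at fixed intervals and complete instantly.
    Each failed run adds one more interval of exposure.
    """
    return interval_hours * (consecutive_failures + 1)


def meets_rpo(rpo_hours, interval_hours, consecutive_failures=0):
    """True if the worst-case data loss stays within the RPO."""
    return worst_case_data_loss_hours(interval_hours, consecutive_failures) <= rpo_hours


rpo = 24  # hours, the example RPO used in this post
for interval in (24, 12):
    for failures in (0, 1):
        loss = worst_case_data_loss_hours(interval, failures)
        status = "within RPO" if meets_rpo(rpo, interval, failures) else "RPO violated"
        print(f"interval={interval}h, failed runs={failures}: worst case {loss}h ({status})")
```

With a 24-hour interval, a single failed run pushes the worst case to 48 hours. With a 12-hour interval, the same failure still lands on the 24-hour RPO.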


Making This Advice Practical

Now, these words would be useless without something to take away, so let’s consider what we can do:

Firstly, when designing new data protection platforms, qualifying the SLAs you will be required to meet, and your solution’s ability to exceed them, is an essential building block. There is a tendency for backup repositories to be “cheap and deep”, i.e. high capacity but low performance, and these architectures are focused on maximising retention. In reality, having a smaller, high-performance “landing zone” will help you to exceed your RPOs and RTOs by completing backups faster and restoring data faster. Options for this could be flash-based block storage or high-performance object storage within the same location as your production workloads. I add this location caveat because AWS S3 is extremely high performance, but your WAN connection will nearly always be the bottleneck that stops you from consuming the speeds available.
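The WAN caveat can be illustrated with a back-of-the-envelope estimate. The sketch below assumes the slower of the storage and the network link sets the pace, and it ignores protocol overhead, compression and deduplication. The function name and all figures are hypothetical examples, not benchmarks.

```python
def transfer_hours(data_gb, storage_mb_per_s, network_mb_per_s):
    """Estimate hours needed to move data_gb of backup data.

    Assumptions: the slower component is the bottleneck, throughput is
    sustained, and 1 GB is treated as 1000 MB.
    """
    effective_mb_per_s = min(storage_mb_per_s, network_mb_per_s)
    seconds = (data_gb * 1000) / effective_mb_per_s
    return seconds / 3600


# Hypothetical 2 TB restore.
# Local landing zone: storage and LAN both sustain about 1000 MB/s.
local = transfer_hours(2000, storage_mb_per_s=1000, network_mb_per_s=1000)
# Remote object storage: fast storage behind a 1 Gbit/s WAN (about 125 MB/s).
remote = transfer_hours(2000, storage_mb_per_s=5000, network_mb_per_s=125)

print(f"local landing zone: {local:.1f} h")
print(f"over the WAN:       {remote:.1f} h")
```

However fast the remote storage is, the estimate for the WAN case is governed entirely by the link speed, which is the point of keeping the landing zone next to production.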

Secondly, reviewing data protection methods. We’re all familiar with backups, but with the ability to replicate data, such as via Continuous Data Protection integrations, and to take immutable storage snapshots, it might be possible to create a layered SLA structure, based on the types of failures you are combating, to complement your existing infrastructure.

Thirdly, testing. SLAs are worthless if they can’t be met. Validate your recovery times against their objectives, and test how many backups you can take within the RPO time window. Once you’re aware of the bottlenecks of your solution, you can address these directly, potentially reducing the amount of budget required to fix any shortfalls.
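Once you have measured durations from a test run, the two checks above reduce to simple arithmetic. This is a minimal sketch using made-up measurements. The function names are illustrative only, and it assumes backups run back to back with no overlap.

```python
def backups_per_rpo_window(rpo_minutes, backup_minutes):
    """How many complete, back-to-back backup runs fit in one RPO window."""
    if backup_minutes <= 0:
        raise ValueError("backup_minutes must be positive")
    return rpo_minutes // backup_minutes


def rto_margin_minutes(rto_minutes, restore_minutes):
    """Positive means headroom; negative means the RTO was missed."""
    return rto_minutes - restore_minutes


# Hypothetical measurements from a test run.
print(backups_per_rpo_window(rpo_minutes=60, backup_minutes=25))  # 2
print(rto_margin_minutes(rto_minutes=240, restore_minutes=330))   # -90
```

In this made-up case, two backups fit inside the one-hour RPO, so there is room for one failure, but the restore overruns a four-hour RTO by 90 minutes. That points at restore performance as the bottleneck to address first.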

Finally, communicate! If you aren’t able to meet your agreed SLAs, you open your business to risk. Mitigate or acknowledge that risk by communicating with the key business stakeholders. This can lead to mutually beneficial outcomes such as “don’t worry about backing up my SQL Server every 15 minutes, just perform log shipping that often and back it up daily”. Framing your issues in an outcome-focused way can be beneficial too: “Hi boss, I’m concerned that we haven’t got much margin for error in our backup solution currently. We’ve agreed to back up the servers daily and we’re achieving that most of the time, but if there’s an error we could be looking at two days of data loss. I propose we <implement X technology/purchase Y solution>”. The conversation shows that things aren’t currently a disaster, but that there’s an element of unmanaged risk that should be addressed. The key stakeholders can then look at investing to meet the SLAs or adjusting expectations to match the reality of the current solution.

I won’t pretend that IT departments aren’t being asked to do “more with less”, but keeping expectations aligned with reality is an essential part of risk management, which is a key reason to back up in the first place. Underpromise and overdeliver: set realistically achievable SLAs and then smash those targets!


3 comments


Really great article Michael, and definitely things to take into account.


Nice post Michael. For sure things to consider when devising a recovery strategy and a DR Plan. 


Nice post @MicoolPaul !

Actually, I had never thought of an SLA violation as such a major problem, provided of course that it is time-limited.

In my opinion, the probability of a problem occurring in the backup infrastructure and then in a component that needs a restore is lower...

On the other hand, if something very serious happens to the infrastructure, then of course the DR plan is triggered.

In any case, these are all factors to consider, so thank you for sharing them!
