Do we overengineer for uptime and underprepare for recovery?

  • May 22, 2026

Madi.Cristil

I think part of the confusion around high availability comes from how universally we try to apply it.

Once it works well at the application layer, the instinct is to push the same thinking everywhere else. If something matters, it should be highly available. Reduce downtime, remove single points of failure, make everything seamless.

That logic starts to break down when you look at systems that aren’t serving users directly, like backup systems.

There’s a discussion in the Veeam community where someone asked a straightforward question: how do you make Veeam for AWS highly available?

One of the answers pointed in a different direction. Instead of treating it as something that must always stay up, the focus was on backing up its configuration and, if needed, redeploying a new instance and restoring it.

At first, that feels counterintuitive. If everything else is designed for high availability, why not this as well?

But it makes more sense once you think about what problem you’re actually trying to solve.

High availability is about minimizing interruption. Fault tolerance goes a step further and aims to keep things running even through failure. Both assume that keeping the system alive is the priority.

With a backup system, the priority is slightly different. It’s not primarily about whether the service itself is always available. It’s about whether it can be brought back quickly, in a known state, without losing its role in the bigger picture.

In other words, it’s not only an availability problem. It’s a recovery problem.

That same distinction shows up in AWS as well, just in a different form. AWS does an excellent job of keeping infrastructure available. Services are distributed, failures are isolated, and most of the time things keep running even when parts of the system fail.

But the moment the issue is not the infrastructure — when it’s data, access, or configuration — that model doesn’t help in the same way.

Replication and availability mechanisms are designed to maintain the current state of a system. That includes both healthy data and unintended changes. That’s why additional mechanisms like versioning and backups are needed: to recover from deletion, corruption, or human error.
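To make that concrete, here is a minimal toy sketch in Python. It is not how AWS or Veeam implement anything; the `ReplicatedStore` and `VersionedStore` classes and the `restore_previous` method are hypothetical names invented for illustration. It only shows the idea that a mirrored delete removes every copy, while a store that keeps history can go back to the value before the mistake.

```python
class ReplicatedStore:
    """Toy primary/replica pair: every write and delete is mirrored at once."""

    def __init__(self):
        self.primary = {}
        self.replica = {}

    def put(self, key, value):
        self.primary[key] = value
        self.replica[key] = value

    def delete(self, key):
        # The delete is replicated just as faithfully as a good write.
        self.primary.pop(key, None)
        self.replica.pop(key, None)


class VersionedStore:
    """Toy versioned store: keeps every past value; a delete only adds a marker."""

    _DELETED = object()

    def __init__(self):
        self.history = {}

    def put(self, key, value):
        self.history.setdefault(key, []).append(value)

    def delete(self, key):
        self.history.setdefault(key, []).append(self._DELETED)

    def get(self, key):
        versions = self.history.get(key, [])
        if not versions or versions[-1] is self._DELETED:
            return None
        return versions[-1]

    def restore_previous(self, key):
        """Re-publish the version that was current before the latest change."""
        versions = self.history.get(key, [])
        if len(versions) < 2:
            raise KeyError(f"no earlier version of {key!r} to restore")
        versions.append(versions[-2])


# Same mistake in both stores: someone deletes the configuration.
replicated = ReplicatedStore()
replicated.put("config", "v1")
replicated.delete("config")      # both copies are now gone

versioned = VersionedStore()
versioned.put("config", "v1")
versioned.delete("config")
versioned.restore_previous("config")   # "v1" is current again
```

The replicated pair stayed perfectly "available" the whole time and still lost the data. The versioned store is the one that had something clean to return to.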

I think not everything needs to stay up at all costs. Some components benefit more from being recoverable than from being always-on.

That’s where the role of platforms like Veeam becomes clearer.

They don’t replace cloud availability mechanisms. They operate in a different space — focusing on recovery, control, and the ability to return to a trusted point in time.

You still design your applications for high availability. Multi-AZ, load balancing, scaling — all of that remains essential.

But alongside that, you also design for the scenarios where things don’t just fail — they drift, they break silently, or they become unreliable in ways availability alone doesn’t address. And those scenarios aren’t solved by keeping systems running. They’re solved by having something clean to return to.

That’s probably the simplest way to look at it.

High availability keeps things up. Fault tolerance keeps them running through failure.

But not everything needs to be forced into that model.

Some parts of the architecture — especially the ones responsible for protection — are better designed around how easily they can be rebuilt, restored, and trusted again when you need them most.

What do you think? Let me know in the comments! 

6 comments

AndrePulia
  • Veeam Vanguard
  • May 22, 2026

Hi @Madi.Cristil, I completely agree with this view. We often try to apply high availability/fault tolerance everywhere, but backup systems are a different case. In many scenarios, being recoverable quickly and reliably matters more than being always-on. Great perspective.


eblack
  • Influencer
  • May 22, 2026

Design for recovery, not uptime, is the right model for protection infrastructure in my opinion. We measure Always-On and Availability Groups in millisecond failover and response times. Applying HA thinking to a backup system can actually create false confidence in some cases. The backup system being available is not the same as the backup system being functional and trustworthy.

I enjoyed the read, nice post. 


Madi.Cristil
  • Author
  • Principal Community Manager
  • May 22, 2026

Thank you, @AndrePulia and @eblack!


Chris.Childerhose

In my opinion, you need a bit of both: uptime and recovery. I know that when doing designs you tend to lean toward uptime, but you need to look at the whole picture to ensure you can also recover fast enough to get things back on track. I look at both of these when doing designs for Veeam.


Madi.Cristil
  • Author
  • Principal Community Manager
  • May 22, 2026

I think that's the right approach! 


Geoff Burke
  • Veeam Vanguard
  • May 22, 2026

This gives me flashbacks to customers saying of their Windows servers, “We are a 24/7 shop, no downtime or reboots!” But what about Windows updates? “We can’t afford them.” Until, of course, they could not afford not to do them!