I think part of the confusion around high availability comes from how universally we try to apply it.
Once it works well at the application layer, the instinct is to push the same thinking everywhere else. If something matters, it should be highly available. Reduce downtime, remove single points of failure, make everything seamless.
That logic starts to break down when you look at systems that aren’t serving users directly, like backup systems.
There’s a discussion in the Veeam community where someone asked a straightforward question: how do you make Veeam for AWS highly available?
One of the answers pointed in a different direction. Instead of treating the backup appliance as something that must always stay up, it focused on backing up the appliance’s configuration and, if needed, deploying a new instance and restoring that configuration onto it.
At first, that feels counterintuitive. If everything else is designed for high availability, why not this as well?
But it makes more sense once you think about what problem you’re actually trying to solve.
High availability is about minimizing interruption. Fault tolerance goes a step further and aims to keep things running even through failure. Both assume that keeping the system alive is the priority.
With a backup system, the priority is slightly different. It’s not primarily about whether the service itself is always available. It’s about whether it can be brought back quickly, in a known state, without losing its role in the bigger picture.
In other words, it’s not only an availability problem. It’s a recovery problem.
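The redeploy-and-restore approach from that community answer can be sketched in a few lines. This is a toy model under stated assumptions, not Veeam's actual API: the `BackupAppliance` class, its `export_config` and `restore` methods, and the policy names are all hypothetical, and a real appliance would keep the exported configuration somewhere outside the instance.

```python
import json


class BackupAppliance:
    """Hypothetical stand-in for a backup server; not a real Veeam API."""

    def __init__(self, instance_id, config=None):
        self.instance_id = instance_id
        self.config = dict(config or {})
        self.healthy = True

    def export_config(self):
        """Serialize the configuration so it can be stored off-instance."""
        return json.dumps(self.config, sort_keys=True)

    @classmethod
    def restore(cls, instance_id, exported):
        """Deploy a fresh instance and load a previously exported config."""
        return cls(instance_id, json.loads(exported))


# Normal operation: export the configuration and keep it somewhere else.
primary = BackupAppliance(
    "appliance-a", {"policies": ["daily-ec2"], "retention_days": 30}
)
saved = primary.export_config()

# Failure: the appliance is lost. There is no standby to fail over to.
primary.healthy = False

# Recovery: deploy a new instance and restore the saved configuration.
replacement = BackupAppliance.restore("appliance-b", saved)
same_role = replacement.config == primary.config
```

The point of the sketch is that nothing here tries to keep `appliance-a` alive. What matters is that the exported configuration outlives the instance, so the replacement can take over the same role.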
That same distinction shows up in AWS as well, just in a different form. AWS does an excellent job of keeping infrastructure available. Services are distributed, failures are isolated, and most of the time things keep running even when parts of the system fail.
But when the problem is not the infrastructure but the data, access, or configuration, that model doesn’t help in the same way. Nothing has failed from the platform’s point of view, so there is nothing to fail over from.
Replication and availability mechanisms are designed to maintain the current state of a system, whatever that state is. A deletion or a corrupted write is replicated just as faithfully as healthy data. That’s why additional mechanisms like versioning and backups are needed: to recover from deletion, corruption, or human error.
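A small simulation makes the difference concrete. It is a minimal sketch with made-up classes (`ReplicatedStore` and `VersionedStore` are illustrative, not any AWS or Veeam interface): the replicated store copies every operation to all replicas, while the versioned store keeps earlier values and records a delete as a marker.

```python
class ReplicatedStore:
    """Toy key-value store that mirrors every operation to all replicas."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]

    def put(self, key, value):
        for replica in self.replicas:
            replica[key] = value

    def delete(self, key):
        # The delete is replicated exactly like a write.
        for replica in self.replicas:
            replica.pop(key, None)

    def get(self, key, replica=0):
        return self.replicas[replica].get(key)


class VersionedStore:
    """Toy store that keeps every version; a delete only adds a marker."""

    def __init__(self):
        self.history = {}

    def put(self, key, value):
        self.history.setdefault(key, []).append(value)

    def delete(self, key):
        self.history.setdefault(key, []).append(None)  # delete marker

    def get(self, key):
        versions = self.history.get(key, [])
        return versions[-1] if versions else None

    def restore_previous(self, key):
        """Return the most recent value that is not a delete marker."""
        for value in reversed(self.history.get(key, [])):
            if value is not None:
                return value
        return None


# An accidental delete reaches every replica: highly available, but gone.
replicated = ReplicatedStore()
replicated.put("report.csv", "v1")
replicated.delete("report.csv")
gone_everywhere = all(
    replicated.get("report.csv", i) is None for i in range(3)
)

# With versioning, the same mistake is recoverable.
versioned = VersionedStore()
versioned.put("report.csv", "v1")
versioned.delete("report.csv")
recovered = versioned.restore_previous("report.csv")
```

All three replicas stay perfectly in sync, which is exactly the problem: they agree that the object no longer exists. Only the store that kept history has something to return to.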
I don’t think everything needs to stay up at all costs. Some components benefit more from being recoverable than from being always-on.
That’s where the role of platforms like Veeam becomes clearer.
They don’t replace cloud availability mechanisms. They operate in a different space — focusing on recovery, control, and the ability to return to a trusted point in time.
You still design your applications for high availability. Multi-AZ, load balancing, scaling — all of that remains essential.
But alongside that, you also design for the scenarios where things don’t just fail — they drift, they break silently, or they become unreliable in ways availability alone doesn’t address. And those scenarios aren’t solved by keeping systems running. They’re solved by having something clean to return to.
That’s probably the simplest way to look at it.
High availability keeps things up. Fault tolerance keeps them running through failure.
But not everything needs to be forced into that model.
Some parts of the architecture — especially the ones responsible for protection — are better designed around how easily they can be rebuilt, restored, and trusted again when you need them most.
What do you think? Let me know in the comments!
