Do we overengineer for uptime and underprepare for recovery?


Madi.Cristil

I think part of the confusion around high availability comes from how universally we try to apply it.

Once it works well at the application layer, the instinct is to push the same thinking everywhere else. If something matters, it should be highly available. Reduce downtime, remove single points of failure, make everything seamless.

That logic starts to break down when you look at systems that aren’t serving users directly, like backup systems.

There’s a discussion in the Veeam community where someone asked a straightforward question: how do you make Veeam for AWS highly available?

One of the answers pointed in a different direction. Instead of treating it as something that must always stay up, the focus was on backing up its configuration and, if needed, redeploying a new instance and restoring it.

At first, that feels counterintuitive. If everything else is designed for high availability, why not this as well?

But it makes more sense once you think about what problem you’re actually trying to solve.

High availability is about minimizing interruption. Fault tolerance goes a step further and aims to keep things running even through failure. Both assume that keeping the system alive is the priority.

With a backup system, the priority is slightly different. It’s not primarily about whether the service itself is always available. It’s about whether it can be brought back quickly, in a known state, without losing its role in the bigger picture.

In other words, it’s not only an availability problem. It’s a recovery problem.

That same distinction shows up in AWS as well, just in a different form. AWS does an excellent job of keeping infrastructure available. Services are distributed, failures are isolated, and most of the time things keep running even when parts of the system fail.

But the moment the issue is not the infrastructure — when it’s data, access, or configuration — that model doesn’t help in the same way.

Replication and availability mechanisms are designed to maintain the current state of a system. That includes both healthy data and unintended changes. That’s why additional mechanisms like versioning and backups are needed: to recover from deletion, corruption, or human error.
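To make that concrete, here is a minimal sketch of the difference. It is a toy model, not how AWS or Veeam implement anything: `ReplicatedStore` and `VersionedStore` are hypothetical names invented for this illustration, and the "accidental change" is just a bad config value.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicatedStore:
    """Hypothetical model: two copies kept in sync on every write."""
    primary: dict = field(default_factory=dict)
    replica: dict = field(default_factory=dict)

    def put(self, key, value):
        # Replication mirrors the current state, whether it is good or bad.
        self.primary[key] = value
        self.replica[key] = value


@dataclass
class VersionedStore:
    """Hypothetical model: keeps every version so an older one can be restored."""
    history: dict = field(default_factory=dict)

    def put(self, key, value):
        self.history.setdefault(key, []).append(value)

    def get(self, key):
        return self.history[key][-1]

    def restore(self, key, version):
        # Restoring re-writes the chosen old value as the newest version.
        self.put(key, self.history[key][version])


replicated = ReplicatedStore()
replicated.put("config", "retention=30d")
replicated.put("config", "retention=0d")   # accidental change
# Both copies now hold the bad value, so failing over does not help.

versioned = VersionedStore()
versioned.put("config", "retention=30d")
versioned.put("config", "retention=0d")    # the same accidental change
versioned.restore("config", 0)             # return to the trusted version
```

The replica stays perfectly available and perfectly wrong, while the versioned store has something clean to return to.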

I think not everything needs to stay up at all costs. Some components benefit more from being recoverable than from being always-on.

That’s where the role of platforms like Veeam becomes clearer.

They don’t replace cloud availability mechanisms. They operate in a different space — focusing on recovery, control, and the ability to return to a trusted point in time.

You still design your applications for high availability. Multi-AZ, load balancing, scaling — all of that remains essential.

But alongside that, you also design for the scenarios where things don’t just fail — they drift, they break silently, or they become unreliable in ways availability alone doesn’t address. And those scenarios aren’t solved by keeping systems running. They’re solved by having something clean to return to.

That’s probably the simplest way to look at it.

High availability keeps things up. Fault tolerance keeps them running through failure.

But not everything needs to be forced into that model.

Some parts of the architecture — especially the ones responsible for protection — are better designed around how easily they can be rebuilt, restored, and trusted again when you need them most.

What do you think? Let me know in the comments! 

10 comments

AndrePulia
  • Veeam Vanguard
  • May 22, 2026

Hi @Madi.Cristil, I completely agree with this view. We often try to apply high availability/fault tolerance everywhere, but backup systems are a different case. In many scenarios, being recoverable quickly and reliably matters more than being always-on. Great perspective.


eblack
  • Influencer
  • May 22, 2026

Design for recovery, not uptime, is the right model for protection infrastructure in my opinion. We measure Always-On and Availability Groups in millisecond failover and response times. Applying HA thinking to a backup system can actually create false confidence in some cases. The backup system being available is not the same as the backup system being functional and trustworthy.

I enjoyed the read, nice post. 


Madi.Cristil
  • Author
  • Principal Community Manager
  • May 22, 2026

Thank you, @AndrePulia and @eblack!


Chris.Childerhose

In my opinion, you need a bit of both: uptime and recovery. I know that when doing designs you tend to lean in the uptime direction, but you need to look at the whole picture to ensure you can also recover fast enough to get things back on track. I look at both of these when doing designs for Veeam.


Madi.Cristil
  • Author
  • Principal Community Manager
  • May 22, 2026

I think that's the right approach! 


Geoff Burke
  • Veeam Vanguard
  • May 22, 2026

This brings back flashbacks of customers saying of their Windows servers, “We are a 24/7 shop, no downtime or reboots!” But what about Windows updates? “We can’t afford them.” Until, of course, they could not afford not to do them!


HangTen416
  • Influencer
  • May 22, 2026

This statement is up for DEBATE!

“High availability keeps things up. Fault tolerance keeps them running through failure.”

As some have already discussed here, the cost and complexity of having HA configured for all components is high and it’s valid to question whether it’s worth it if the issue is a misconfiguration or corruption on the primary system. That will simply be copied to the secondary.

As for fault tolerance, we also need a way to detect faults. For a real-world example, VB365 does not have a backup health check like VBR does, so there is no way to know whether the backups are recoverable at every restore point until a restore is actually initiated. You don’t know you have a problem until you’re trying to restore to fix a problem.

So to answer your question we do over-engineer uptime and under-engineer for recovery.
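The detection gap described above can be sketched generically. This is not the VBR health check and it does not use any VB365 API: `verify_restore_points`, the restore point layout, and the sample data are all hypothetical, and the idea is simply to compare each restore point against a checksum recorded at backup time, before anyone needs a restore.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 digest of the stored bytes, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_restore_points(restore_points):
    """Return the ids of restore points whose data no longer matches
    the checksum recorded at backup time (hypothetical record layout)."""
    failed = []
    for point in restore_points:
        if checksum(point["data"]) != point["recorded_sha256"]:
            failed.append(point["id"])
    return failed


good = b"mailbox export, day 1"
restore_points = [
    {"id": "rp-1", "data": good, "recorded_sha256": checksum(good)},
    # Simulated silent corruption: stored bytes differ from what was recorded.
    {
        "id": "rp-2",
        "data": b"mailbox expor\x00, day 2",
        "recorded_sha256": checksum(b"mailbox export, day 2"),
    },
]

print(verify_restore_points(restore_points))  # prints ['rp-2']
```

A matching checksum only shows the stored bytes are unchanged. It does not prove the data can actually be restored, so a scheduled test restore remains the stronger check.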


Scott
  • Veeam Legend
  • May 23, 2026

This is an amazing point and I couldn’t agree more. Not to mention how expensive FT environments can get, and how many resources end up wasted on systems that don’t actually require that level of protection.

With increasing storage, memory, licensing, and infrastructure costs, organizations really need to evaluate requirements and spend money where it makes sense.

There are 4 things I often find myself repeating. 

  • HA and FT are different technologies for different purposes

  • HA/FT are not backups or DR, though they can be part of a DR strategy

  • Backups are not DR either, but are often one of the most important parts of it

  • Storage snapshots are not backups, but can help with recovery in some situations including DR

Fault tolerance means exactly what it says: systems continue running through a fault with little to no interruption. FT systems are perfect for mission-critical services that need 24/7 uptime, but they are also usually the most expensive. You’re often doubling hardware, storage, networking, licensing, datacenter space, and more. FT also has to be designed properly, otherwise you’re just introducing new single points of failure. What is the point of an FT system on the same piece of hardware, or connected to a single switch?

Highly Available systems are usually more of an active/passive setup where services fail over or restart automatically with minimal downtime.

Both HA and FT have their place depending on business requirements. The issue is when people assume every system needs that level of protection. 

Another important point is that HA and FT environments will often replicate corruption, ransomware, application issues, or human mistakes. They protect against hardware failures and help minimize downtime, but they are not substitutes for backups.

For less critical systems, backups with Veeam can provide DR capabilities at a fraction of the cost without requiring expensive HA.

People very often overengineer their plans and waste money by treating every system as critical. In our case, if the main site is down, I know for a fact that many areas cannot work at all. Do their systems need to be operational if no one is using them?

On the other hand, what if just the SAN fails, or the sprinklers go off? That could be a disaster that requires bringing those systems online at another site so the business can keep running. That doesn’t mean they need to be HA/FT, but a disaster recovery plan for situations like that, using replication or backups, should still be thought out.

 

For myself, I usually categorize systems something like:

  • Critical FT systems that truly cannot go down

  • Critical HA systems with minimal acceptable downtime

  • Systems restored from backups/replication

  • Lower priority systems restored best effort

  • Systems where the business has accepted the risk
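As a rough sketch, the categories above can be written down as a mapping from tolerable downtime to a protection tier. Everything here is an assumption made for illustration: the `classify` function, the tier names, and especially the thresholds, which would come from the business rather than from code.

```python
from enum import Enum


class Tier(Enum):
    FT = "critical, fault tolerant"
    HA = "critical, highly available"
    RESTORE = "restored from backups/replication"
    BEST_EFFORT = "lower priority, restored best effort"
    ACCEPTED_RISK = "business has accepted the risk"


def classify(rto_minutes, risk_accepted=False):
    """Map a tolerable downtime (RTO, in minutes) to a protection tier.
    Thresholds are illustrative assumptions, not recommendations."""
    if risk_accepted:
        return Tier.ACCEPTED_RISK
    if rto_minutes == 0:
        return Tier.FT           # truly cannot go down
    if rto_minutes <= 15:
        return Tier.HA           # minimal acceptable downtime
    if rto_minutes <= 24 * 60:
        return Tier.RESTORE      # recover within a day
    return Tier.BEST_EFFORT


# Hypothetical systems and RTOs, for illustration only.
systems = {
    "payment gateway": classify(0),
    "file server": classify(4 * 60),
    "internal wiki": classify(3 * 24 * 60),
    "test lab": classify(0, risk_accepted=True),
}

for name, tier in systems.items():
    print(f"{name}: {tier.value}")
```

Writing the tiers down this way forces the question for each system: how long can it really be down, and who agreed to that?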

In a real disaster, outages should honestly be expected. Multi-site FT with synchronous replication absolutely has its place, but it gets extremely expensive and complex very quickly. For most businesses, that level of protection only makes sense for a small number of truly critical systems.

 

END RANT! - Kidding, but after many years of explaining and dealing with this, and with the recent increases in hardware costs, there have been many meetings about it. Also, test your backups, DR plans, FT, and HA systems frequently, in every scenario you can, because if you don’t, you will have a very bad time at 4 AM on a Saturday.


Madi.Cristil
  • Author
  • Principal Community Manager
  • May 24, 2026

Great addition to my blog! Thanks for that, @Scott!


Madi.Cristil
  • Author
  • Principal Community Manager
  • May 24, 2026

It seems like you really enjoyed the debate initiative, @HangTen416! 😎😁