Do we overengineer for uptime and underprepare for recovery?


Madi.Cristil

I think part of the confusion around high availability comes from how universally we try to apply it.

Once it works well at the application layer, the instinct is to push the same thinking everywhere else. If something matters, it should be highly available. Reduce downtime, remove single points of failure, make everything seamless.

That logic starts to break down when you look at systems that aren’t serving users directly, like backup systems.

There’s a discussion in the Veeam community where someone asked a straightforward question: how do you make Veeam for AWS highly available?

One of the answers pointed in a different direction. Instead of treating it as something that must always stay up, the focus was on backing up its configuration and, if needed, redeploying a new instance and restoring it.

At first, that feels counterintuitive. If everything else is designed for high availability, why not this as well?

But it makes more sense once you think about what problem you’re actually trying to solve.

High availability is about minimizing interruption. Fault tolerance goes a step further and aims to keep things running even through failure. Both assume that keeping the system alive is the priority.

With a backup system, the priority is slightly different. It’s not primarily about whether the service itself is always available. It’s about whether it can be brought back quickly, in a known state, without losing its role in the bigger picture.

In other words, it’s not only an availability problem. It’s a recovery problem.

That same distinction shows up in AWS as well, just in a different form. AWS does an excellent job of keeping infrastructure available. Services are distributed, failures are isolated, and most of the time things keep running even when parts of the system fail.

But the moment the issue is not the infrastructure — when it’s data, access, or configuration — that model doesn’t help in the same way.

Replication and availability mechanisms are designed to maintain the current state of a system. That includes both healthy data and unintended changes. That’s why additional mechanisms like versioning and backups are needed: to recover from deletion, corruption, or human error.
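To make that distinction concrete, here is a minimal toy sketch in Python. The class names `ReplicatedStore` and `VersionedStore` are hypothetical and do not correspond to any AWS or Veeam API; they only illustrate that a mirror faithfully copies a mistaken delete, while a version history still has something to return to.

```python
class ReplicatedStore:
    """Toy primary/replica pair: every change is mirrored immediately."""

    def __init__(self):
        self.primary = {}
        self.replica = {}

    def put(self, key, value):
        self.primary[key] = value
        self.replica[key] = value  # replication copies the current state

    def delete(self, key):
        self.primary.pop(key, None)
        self.replica.pop(key, None)  # the mistake is replicated as well


class VersionedStore:
    """Toy versioned store: each change appends a version instead of overwriting."""

    def __init__(self):
        self.history = {}  # key -> list of values, oldest first (None marks a delete)

    def put(self, key, value):
        self.history.setdefault(key, []).append(value)

    def delete(self, key):
        self.history.setdefault(key, []).append(None)

    def get(self, key):
        versions = self.history.get(key, [])
        return versions[-1] if versions else None

    def restore_previous(self, key):
        """Re-apply the most recent non-deleted version, if one exists."""
        for value in reversed(self.history.get(key, [])):
            if value is not None:
                self.put(key, value)
                return value
        return None


replicated = ReplicatedStore()
replicated.put("config", "v1")
replicated.delete("config")  # simulated human error
print("config" in replicated.replica)  # False: the replica lost it too

versioned = VersionedStore()
versioned.put("config", "v1")
versioned.delete("config")
print(versioned.restore_previous("config"))  # v1
```

The replicated pair stays perfectly available throughout, which is exactly why availability alone does not help here.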

I think not everything needs to stay up at all costs. Some components benefit more from being recoverable than from being always-on.

That’s where the role of platforms like Veeam becomes clearer.

They don’t replace cloud availability mechanisms. They operate in a different space — focusing on recovery, control, and the ability to return to a trusted point in time.

You still design your applications for high availability. Multi-AZ, load balancing, scaling — all of that remains essential.

But alongside that, you also design for the scenarios where things don’t just fail — they drift, they break silently, or they become unreliable in ways availability alone doesn’t address. And those scenarios aren’t solved by keeping systems running. They’re solved by having something clean to return to.

That’s probably the simplest way to look at it.

High availability keeps things up. Fault tolerance keeps them running through failure.

But not everything needs to be forced into that model.

Some parts of the architecture — especially the ones responsible for protection — are better designed around how easily they can be rebuilt, restored, and trusted again when you need them most.

What do you think? Let me know in the comments! 

19 comments

AndrePulia
  • Veeam Vanguard
  • May 22, 2026

Hi @Madi.Cristil, I completely agree with this view. We often try to apply high availability/fault tolerance everywhere, but backup systems are a different case. In many scenarios, being recoverable quickly and reliably matters more than being always-on. Great perspective.


eblack
  • Influencer
  • May 22, 2026

Design for recovery, not uptime, is the right model for protection infrastructure in my opinion. We measure Always-On and Availability Groups in millisecond failover and response times. Applying HA thinking to a backup system can actually create false confidence in some cases. The backup system being available is not the same as the backup system being functional and trustworthy.

I enjoyed the read, nice post. 


Madi.Cristil
  • Author
  • Principal Community Manager
  • May 22, 2026

Thank you, ​@AndrePulia  and ​@eblack !


Chris.Childerhose

In my opinion, you need to have a bit of both - uptime + recovery.  I know when doing designs you tend to want to lean in the uptime direction, but you need to look at the whole picture to ensure you can also recover fast enough to get things back on track.  I look at both these when doing designs for Veeam.


Madi.Cristil
  • Author
  • Principal Community Manager
  • May 22, 2026

I think that's the right approach! 


Geoff Burke
  • Veeam Vanguard
  • May 22, 2026

This gives me flashbacks to customers saying about their Windows servers, “We are a 24/7 shop, no downtime or reboots!” But what about Windows updates? “We can’t afford them.” Until, of course, they could not afford not to do them!


HangTen416
  • Influencer
  • May 22, 2026

This statement is up for DEBATE!

“High availability keeps things up. Fault tolerance keeps them running through failure.”

As some have already discussed here, the cost and complexity of having HA configured for all components is high and it’s valid to question whether it’s worth it if the issue is a misconfiguration or corruption on the primary system. That will simply be copied to the secondary.

As for fault tolerance, we also need a way to detect faults. For a real-world example, VB365 does not have a backup health check like VBR does, and there is no way to know whether the backups are recoverable at every restore point until a restore is actually initiated. So you don’t know you have a problem until you’re trying to restore to fix a problem.

So, to answer your question: we do over-engineer for uptime and under-engineer for recovery.
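The fault-detection gap described above can be narrowed with a periodic integrity check. What follows is only a rough sketch under stated assumptions: it is not how VBR or VB365 work internally, and the function names, the flat directory of backup files, and the manifest of SHA-256 digests recorded at backup time are all hypothetical. A matching checksum only shows that the file has not changed since it was written; it does not prove the backup restores, so a real test restore is still needed.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore_points(manifest: dict[str, str], backup_dir: Path) -> dict[str, str]:
    """Return a status per restore point: 'ok', 'missing', or 'corrupt'.

    manifest maps a (hypothetical) backup file name to the digest recorded
    when that backup was created.
    """
    results = {}
    for name, expected in manifest.items():
        candidate = backup_dir / name
        if not candidate.is_file():
            results[name] = "missing"
        elif sha256_of(candidate) != expected:
            results[name] = "corrupt"
        else:
            results[name] = "ok"
    return results
```

Run on a schedule, a check like this surfaces a missing or altered restore point before the day someone actually needs it.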


Scott
  • Veeam Legend
  • May 23, 2026

This is an amazing point and I couldn’t agree more. Not to mention how expensive FT environments can get, and how many resources end up wasted on systems that don’t actually require that level of protection.

With increasing storage, memory, licensing, and infrastructure costs, organizations really need to evaluate requirements and spend money where it makes sense.

There are 4 things I often find myself repeating. 

  • HA and FT are different technologies for different purposes

  • HA/FT are not backups or DR, though they can be part of a DR strategy

  • Backups are not DR either, but are often one of the most important parts of it

  • Storage snapshots are not backups, but can help with recovery in some situations including DR

Fault Tolerance means exactly what it says: systems continue running through a fault with little to no interruption. FT systems are perfect for mission-critical services that need 24/7 uptime, but they are also usually the most expensive. You’re often doubling hardware, storage, networking, licensing, datacenter space, and more. FT also has to be designed properly, otherwise you’re just introducing new single points of failure. What is the point of an FT system that runs on a single piece of hardware or connects to a single switch?

Highly Available systems are usually more of an active/passive setup where services fail over or restart automatically with minimal downtime.

Both HA and FT have their place depending on business requirements. The issue is when people assume every system needs that level of protection. 

Another important point is that HA and FT environments will often replicate corruption, ransomware, application issues, or human mistakes. They protect against hardware failures and help minimize downtime, but they are not substitutes for backups.

For less critical systems, backups with Veeam can provide DR capabilities at a fraction of the cost without requiring expensive HA.

People very often overengineer plans and waste money by treating every system as critical. In my case, if our main site is down, I know for a fact that many areas cannot work. Do their systems need to be operational if no one is using them?

On the other hand, what if just the SAN fails, or the sprinklers go off? That could be a disaster that requires those systems to be brought online at another site so the business can keep running. That doesn’t mean they need to be HA/FT, but a disaster recovery plan for situations like that, using replication or backups, should still be thought out.

 

For myself, I usually categorize systems something like:

  • Critical FT systems that truly cannot go down

  • Critical HA systems with minimal acceptable downtime

  • Systems restored from backups/replication

  • Lower priority systems restored best effort

  • Systems where the business has accepted the risk
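As a rough illustration of the categories above, a classification like this can be written down so it is applied consistently. This is a sketch only: the `System` fields, the `classify` function, and especially the 15-minute threshold are placeholder assumptions, not values from this thread. Real cut-offs come from the business requirements.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    FAULT_TOLERANT = "Critical FT: truly cannot go down"
    HIGHLY_AVAILABLE = "Critical HA: minimal acceptable downtime"
    RESTORE = "Restored from backups/replication"
    BEST_EFFORT = "Lower priority, restored best effort"
    RISK_ACCEPTED = "Business has accepted the risk"


# Placeholder assumption: downtime up to this many minutes still counts as "minimal".
HA_MAX_DOWNTIME_MINUTES = 15


@dataclass
class System:
    name: str
    # Downtime the business says it can tolerate; None means no target was agreed.
    tolerable_downtime_minutes: Optional[float]
    risk_accepted: bool = False


def classify(system: System) -> Tier:
    """Map a system to one of the five tiers from the list above."""
    if system.risk_accepted:
        return Tier.RISK_ACCEPTED
    if system.tolerable_downtime_minutes is None:
        return Tier.BEST_EFFORT
    if system.tolerable_downtime_minutes <= 0:
        return Tier.FAULT_TOLERANT
    if system.tolerable_downtime_minutes <= HA_MAX_DOWNTIME_MINUTES:
        return Tier.HIGHLY_AVAILABLE
    return Tier.RESTORE


for s in [System("payments", 0), System("intranet wiki", 480), System("lab VM", None)]:
    print(f"{s.name}: {classify(s).value}")
```

The useful part is less the code than the forced conversation: every system needs an agreed downtime figure, or an explicit decision that there isn't one.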

In a real disaster, outages should honestly be expected. Multi-site FT with synchronous replication absolutely has its place, but it gets extremely expensive and complex very quickly. For most businesses, that level of protection only makes sense for a small number of truly critical systems.

 

END RANT! Kidding, but after many years of explaining and dealing with this, and with the recent increases in hardware costs, there have been many meetings about it. Also, test your backups, DR plans, FT, and HA systems frequently, in every possible scenario, because if you don’t, you will have a very bad time at 4 AM on a Saturday.


Madi.Cristil
  • Author
  • Principal Community Manager
  • May 24, 2026

Great addition to my blog! Thanks for that, ​@Scott !


Madi.Cristil
  • Author
  • Principal Community Manager
  • May 24, 2026

It seems like you really enjoyed the debate initiative , ​@HangTen416 ! 😎😁


marco_s
  • Veeam Vanguard
  • May 25, 2026

Hi Madi, interesting discussion.
I’ve bumped a thread on the R&D forum about VBA’s high availability.
One of our clients has multi-region applications on AWS, and for compliance and regulatory reasons, they need to ensure that files and documents can be restored from backup very quickly.
I agree that backup infrastructures shouldn’t have the same fault tolerance as a company’s most critical systems, but sometimes there are cases like this, or cases where production systems such as databases are dependent on things like log truncation, for example.
Also, today's data protection software has an ever-increasing set of features. I'm thinking, for example, of data security posture management or threat analysis. In the event of an attack, the backup infrastructure needs to be available as quickly as possible, so as to reduce RTO and assist with the analysis.
It’s a truly complex topic! 😋


Madi.Cristil
  • Author
  • Principal Community Manager
  • May 25, 2026

Hey ​@marco_s ! Thanks for sharing — really good points, especially around compliance and how broad data protection has become. Definitely adds more nuance to the discussion 🙂


leduardoserrano

Hello Madi! Congratulations on presenting this great point of reflection!

I believe they are complementary and not mutually exclusive strategies. Depending on the application, MVB with low RTO through replication is very valuable for minimizing downtime costs.

Modern replication solutions (including some services available in the public cloud) can provide users with a complete timeline of recovery points, allowing a consistent restoration of applications to a specific point in time before data corruption, deletion, misconfiguration, or even ransomware encryption.
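A minimal sketch of what "restoring to a point in time before the corruption" means in practice, assuming the recovery points are available as a plain list of timestamps. The function name `latest_clean_point` is hypothetical and not part of any replication product's API, and the sketch assumes you already know when the incident started, which in a real case is often the hard part.

```python
from datetime import datetime
from typing import Optional, Sequence


def latest_clean_point(
    recovery_points: Sequence[datetime], incident_time: datetime
) -> Optional[datetime]:
    """Return the newest recovery point taken strictly before the incident.

    Returns None when every available point is at or after the incident,
    which means there is nothing clean to return to.
    """
    candidates = [point for point in recovery_points if point < incident_time]
    return max(candidates) if candidates else None


points = [
    datetime(2026, 5, 20, 2, 0),
    datetime(2026, 5, 21, 2, 0),
    datetime(2026, 5, 22, 2, 0),
]
incident = datetime(2026, 5, 21, 14, 30)  # example time the corruption began
print(latest_clean_point(points, incident))  # 2026-05-21 02:00:00
```

The `None` case is the one worth planning for: if retention is shorter than the time it takes to notice a problem, every point on the timeline is already affected.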

All applications, including MVB and less critical ones, must be protected with backups and fast recovery strategies when they drift, break, are compromised, or become unreliable.

Basically, the Replication+Backup strategy involves creating multiple layers of resilience based on the application's criticality and business needs. Of course, backup must always be present, and the level of replication adoption is always variable.

Sincerely,

Luiz.

 


Jean.peres.bkp

While reading your article, I was fitting the pieces together within the AWS Well-Architected framework.

They are:

Your concept | AWS Well-Architected
HA vs Recovery | Reliability
Rebuild vs Always-On | Operational Excellence
Backup vs Replication | Security
Not everything needs HA | Cost Optimization

 

@Madi.Cristil  congratulations on the article!


coolsport00
  • Veeam Legend
  • May 26, 2026

Really well-written, well thought-out post Madi! 🙌🏻

I tend to agree with Chris here...there are times when a little bit of both HA & FT are needed. For my org specifically, we don’t really require FT for critical systems; minimal downtime is ok, so long as I can recover fairly quickly. Of course, there are a lot of businesses which can’t handle downtime, thus FT comes into play there.

You also bring up another good point I really think has gotten lost over the past handful of yrs or so → Replication. I really don’t hear much discussion around one of Veeam’s most precious features imo (Replication). Occasionally I’ll see a post/question about CDP or VRO, but are orgs really not using it anymore? If not, why? I certainly do. Having VMs replicated at a cold site & ready to go provides me the best opportunity to minimize downtime in the event of a “CO” outage, since I have a 2nd VBR server which handles replications solely also at my DR site. 

Anyway...many good points made Madi!


Madi.Cristil
  • Author
  • Principal Community Manager
  • May 26, 2026

Maybe you should write one of those good blogs of yours on replication this time 😉


Madi.Cristil
  • Author
  • Principal Community Manager
  • May 26, 2026

Thank you, ​@Jean.peres.bkp ! That table is a really good visual point! 


coolsport00
  • Veeam Legend
  • May 26, 2026

“Maybe you should write one of those good blogs of yours on replication this time 😉”

ha...I already did like 3yrs ago 😉

Maybe a VRO one will be due….whenever I take/finish the course 🤷🏻‍♂️🙂


Madi.Cristil
  • Author
  • Principal Community Manager
  • May 26, 2026

VRO would be nice too ;)