As promised in Part XV, today we will look at common environment architecture topics. These are essential considerations when planning and operating a secure backup environment.
Of course, the backup environment cannot be viewed in isolation from the rest of the IT landscape; everything interacts with and influences everything else. There are specific requirements for backup, but most of the considerations in this text also apply to the production environment. Nevertheless, I will refer to the backup environment throughout this text.
But – as always – no single aspect alone leads to a secure environment. Therefore, in this context we will also look at some related side aspects.
Designing an effective backup environment architecture is pivotal for maintaining data integrity, availability, and security in today's data-centric world.
What are the main threats to backup data?
- Wrong placement of backup environment components
- Hardware damage to backup systems
- Power outages or disasters at one location
- Power outages or disasters at multiple locations
- External locations for backup data
- Ransomware attacks
- Theft of access data and unauthorized access
- Intentional or unintentional damage by insiders
While writing this text, I noticed that this topic keeps getting bigger. So, we will break it down into several parts of the Data Backup Basics series.
In this part we will have a look at the first two threats in the list above. These are the more organizational points and the technical issues that arise from wear and tear.
In the next part of this series, we will look more closely at the other threats in order to make the (backup) environment more resilient and secure. Those are the points more likely to be caused by external natural factors, criminal activity, or human error.
1. Wrong placement of backup environment components
In many cases, a backup environment will be operated in a dedicated environment in a company's own data center. Backup today also covers parts of the infrastructure running in one or more clouds or consumed as software-as-a-service, but many companies still have their own IT infrastructure.
This also includes backup servers and accompanying systems such as backup proxies, backup repositories, etc.
We're starting from the "pure theory" here and want to make the backup environment as independent as possible from the production environment to be backed up. For this reason, many of these components are usually physical machines. There are sensible reasons to virtualize some parts of the backup infrastructure, but by no means all of them.
Backup servers: They contain all the backup logic, perform the backups, and manage the backup data in the repositories.
With this component, it is indeed debatable whether it should run on physical or virtual hardware. Physical hardware has the advantage of complete independence from the environment to be backed up. The advantages of virtualization are easier system handling and faster, easier recovery. However, I see a few constraints that must be met if the backup server is to run virtually on the same platform as the systems to be backed up:
- All repositories must be located either on dedicated systems or in the cloud, never directly on the backup server.
- The backup application's configuration database must be backed up regularly (at least daily, but preferably more often) to dedicated systems outside of the production systems or in the cloud – securely encrypted and on immutable storage (a minimal upload sketch follows this list).
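To make the second constraint concrete, here is a minimal sketch in Python using boto3 against an S3-compatible bucket with Object Lock enabled. The bucket name, key layout, and 30-day retention are hypothetical examples, and encryption of the dump is assumed to happen before upload.

```python
# Minimal sketch: upload an (already encrypted) configuration database dump
# to an S3-compatible bucket with Object Lock, i.e. immutable storage.
# Bucket name, key layout, and the 30-day retention are hypothetical examples;
# the bucket must have been created with Object Lock enabled.
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")  # credentials come from the environment / instance profile


def upload_config_backup(dump_path: str, bucket: str = "backup-config-vault") -> str:
    key = f"config-db/{datetime.now(timezone.utc):%Y-%m-%dT%H%M%SZ}.dump.enc"
    retain_until = datetime.now(timezone.utc) + timedelta(days=30)
    with open(dump_path, "rb") as f:
        s3.put_object(
            Bucket=bucket,
            Key=key,
            Body=f,
            ObjectLockMode="COMPLIANCE",             # object cannot be altered or deleted
            ObjectLockRetainUntilDate=retain_until,  # until this date has passed
        )
    return key


if __name__ == "__main__":
    print(upload_config_backup("/var/backup/config-db.dump.enc"))
```

In COMPLIANCE mode, even an administrator – or an attacker with stolen administrator credentials – cannot delete the object or shorten its retention before the retention date has passed.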
If there are no financial constraints and you are operating several backup servers, a dedicated virtualization environment for your backup servers is an option, too. This combines the advantages of the other two options.
The other question is the placement of this central component: as close as possible to the production systems, or in a different location? This is certainly debatable. I usually say: in the most secure location that is also least affected by production.
Backup Repository: This component stores the backup data. It is a very important and central part of the backup environment.
This function should always be placed on a system independent of production, be it a physical system or in the cloud. There are several hardware options: a dedicated server with an appropriate amount of disk space and a hardened configuration, an on-premises object storage system, an enterprise-grade deduplication appliance, and more.
I would place it as close as possible to the systems whose data it stores, while making it as independent and secure as possible. Due to its proximity to the original data, the risk of simultaneous destruction of the original and backup data is particularly high. Therefore, a second repository to which a copy of the data in the first repository is written is essential. This copy repository should be as far away from the original data as possible, be it a different location or a cloud. If the first repository is already in a cloud, the second repository can be placed in a different cloud or at least in a different region of the same cloud.
Backup Proxy: Consolidates and accelerates backups.
This component depends on the intended use. It should be as close to, and as deeply integrated with, the production data as possible: a proxy with direct access to a virtual SAN or similar for backing up virtual machines, or a machine with direct access to the SAN if backups are to be made directly from SAN storage. The appropriate placement of this component must be considered on a case-by-case basis.
For reasons of length, I will not consider other components here.
2. Hardware damage to backup systems
To avoid outages due to hardware damage, consider these recommendations. There may be other options and different kinds of hardware; you will have to decide which ones fit the requirements of your specific environment.
Redundant Components
Modern backup servers consistently rely on redundant hardware to avoid single points of failure. Typical measures include:
- Dual/multi-PSU: If one power supply fails, the second takes over seamlessly.
- UPS and redundant power feeds: Servers should be connected to redundant power circuits or UPS systems to absorb power disruptions.
- Redundant fans: Multiple fans ensure cooling even if one fails; many servers allow hot-swapping them.
- Multiple controllers: Dual RAID controllers or dual HBA ports; multipath I/O (MPIO) is often used for SAN storage.
- Redundant network interfaces: At least two network cards per server with NIC teaming or link aggregation. If one connection fails, another takes over (a small health-check sketch follows this list).
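Redundancy only helps while it is intact, so it pays to check it. As a small illustration, here is a hedged sketch that reads the Linux bonding driver's status file; it assumes Linux NIC bonding with a bond named bond0 (a hypothetical name), while teaming setups or vendor tools would need a different check.

```python
# Minimal sketch: verify that all legs of a Linux NIC bond are still healthy.
# Assumes the Linux bonding driver and a bond named "bond0" (hypothetical);
# NIC teaming or vendor-specific tools would need a different check.
from pathlib import Path


def bond_slave_status(bond: str = "bond0") -> dict[str, str]:
    """Return the MII status reported for each slave interface in the bond."""
    text = Path(f"/proc/net/bonding/{bond}").read_text()
    status: dict[str, str] = {}
    slave = None
    for line in text.splitlines():
        if line.startswith("Slave Interface:"):
            slave = line.split(":", 1)[1].strip()
        elif line.startswith("MII Status:") and slave:
            # (the bond-level "MII Status" line appears before any slave and is skipped)
            status[slave] = line.split(":", 1)[1].strip()
            slave = None
    return status


if __name__ == "__main__":
    for iface, state in bond_slave_status().items():
        # A "down" slave means redundancy is already lost: alert here.
        print(f"{iface}: {state}")
```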
RAID Configurations and Data Security
RAID 1 with two SSDs is suitable for the operating system, the backup application, and the configuration database. This provides very fast storage with a high I/O rate, supporting the high-performance functionality of the backup application. The data is written to both SSDs simultaneously, so the failure of one SSD does not interrupt operation.
RAID 6 with HDDs is particularly suitable for backup repositories: RAID 6 distributes two parity blocks across n hard drives (usable capacity: n-2 drives), so it can withstand the failure of two drives. Due to the double parity calculation, write performance is slightly lower than with other RAID configurations, but in my opinion, this is offset by the increased reliability.
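As a quick illustration of this capacity arithmetic, here is a small sketch; the drive counts and sizes are arbitrary examples.

```python
# Worked example of the capacity arithmetic above for the two RAID levels
# discussed here (RAID 1 and RAID 6); drive counts and sizes are arbitrary.
def raid_usable_tb(level: int, drives: int, drive_tb: float) -> float:
    if level == 1:
        return drive_tb                  # mirrored: capacity of a single drive
    if level == 6:
        if drives < 4:
            raise ValueError("RAID 6 needs at least 4 drives")
        return (drives - 2) * drive_tb   # two drives' worth of capacity holds parity
    raise ValueError("only RAID 1 and RAID 6 are covered in this sketch")


print(raid_usable_tb(1, 2, 1.92))  # 1.92 TB  (two mirrored SSDs for the OS)
print(raid_usable_tb(6, 8, 16.0))  # 96.0 TB  (eight 16 TB HDDs, two may fail)
```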
In combination with RAID, the use of hot spare drives is very useful: in the event of a failure, a standby hard drive immediately takes over the function of the defective drive, without manual intervention.
Additional security is provided by a protected cache on the RAID controllers: battery (or capacitor) buffers preserve the cache contents in the event of a power failure so that the data can still be written to disk.
Selecting Hard Drives and Hardware
Enterprise drives (SAS or enterprise SATA) are typically used for backup servers. Compared to consumer hard drives, they are designed for continuous operation (24x7) and offer higher MTBF ratings: SAS drives typically have an MTBF of approximately 1.6 million hours (SATA approximately 1.2 million) – see the sketch after the list below for what this means as an annual failure rate.
- SAS vs. SATA: SAS drives deliver higher performance and reliability (dual-port, advanced error correction), while SATA offers high capacities at lower costs.
- Hard drive speed: Large, slow HDDs are often sufficient for backup targets, as data is written in sequential streams. SSDs are used as cache (read/write caching) or for OS installation. Many systems, for example, allow NVMe or SATA SSD cache modules to accelerate HDD RAID arrays.
- Controller/HBA: Enterprise RAID controllers (with battery backup/supercapacitor) or HBA cards (e.g., LSI/Broadcom for SAS) with large cache capacities are standard. They should support at least RAID 6 and hot spare.
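To put the MTBF figures mentioned above into perspective, here is a rough sketch that converts them into an annualized failure rate (AFR). This is a simplification: MTBF is a statistical fleet-level value, not a lifespan guarantee for an individual drive.

```python
# Rough conversion of the MTBF figures above into an annualized failure rate
# (AFR), which is often more intuitive when planning spare drives. The
# approximation AFR ≈ hours_per_year / MTBF only holds while the MTBF is much
# larger than one year of operation.
HOURS_PER_YEAR = 24 * 365  # 8760

def afr_percent(mtbf_hours: float) -> float:
    return HOURS_PER_YEAR / mtbf_hours * 100

print(f"SAS  (1.6M h MTBF): {afr_percent(1_600_000):.2f}% per year")  # ~0.55%
print(f"SATA (1.2M h MTBF): {afr_percent(1_200_000):.2f}% per year")  # ~0.73%
```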
ECC RAM and Memory Integrity
Server main memory should always be equipped with ECC (Error-Correcting Code) RAM. ECC modules automatically detect and correct single-bit errors, which significantly increases data integrity. This is particularly important in data-intensive backup workloads, as memory errors could otherwise silently corrupt stored data. Software solutions such as ZFS and professional backup systems strongly recommend ECC memory.
Some servers also support memory mirroring, which keeps a redundant copy of the memory contents so that operation can continue in the event of a RAM module failure.
Monitoring and Early Warning Systems
A backup server must monitor its own health. This is crucial for all of the redundancy options above to work at an optimal level. If you don't notice that a component has stopped working and its replacement has taken over, the redundancy is silently lost, and the next outage of this component takes your server out of operation. You need to be able to react to such an event as soon as possible.
- IPMI/Redfish: Hardware sensors are available via out-of-band management. CPU temperature, voltage, fan speed, and PSU status can be read. Tools like PRTG or Checkmk use IPMI/Redfish to immediately trigger alerts for critical values.
- SNMP and server management: Server management agents (Dell iDRAC, HPE iLO, Lenovo XClarity) deliver SNMP traps or email alerts in the event of errors (temperature, memory errors, hardware failure).
- SMART monitoring: Hard drive SMART values should be read regularly. Warnings about increasing sector errors or other failure indicators allow for timely replacement before a total failure (a small check sketch follows this list).
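As a hedged illustration of the SMART point, the following sketch calls smartctl from smartmontools and extracts the overall health verdict; the device names are examples, and a real deployment would alert on anything other than PASSED/OK rather than print.

```python
# Minimal sketch of a regular SMART health check, assuming smartmontools
# (smartctl) is installed and the script runs with sufficient privileges.
# Device names are examples; in practice this would run via cron/systemd and
# feed its results into the monitoring system instead of printing them.
import subprocess


def smart_health(device: str) -> str:
    """Return the overall SMART health verdict reported by smartctl."""
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True,
        check=False,  # smartctl uses nonzero exit bits even for mere warnings
    )
    for line in result.stdout.splitlines():
        # ATA drives report "...overall-health self-assessment...: PASSED",
        # SAS/SCSI drives report "SMART Health Status: OK"
        if "overall-health" in line or "SMART Health Status" in line:
            return line.split(":", 1)[1].strip()
    return "unknown"


if __name__ == "__main__":
    for dev in ("/dev/sda", "/dev/sdb"):
        print(dev, smart_health(dev))
```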
In the next part of this series, I will discuss disaster scenarios in different infrastructure environments.