As today is World Backup Day 2024, I thought I’d share a tale from back in the day. This was from my first proper job after university, working as a First Line Analyst (i.e., Helpdesk). I was, to say the least, ‘wet behind the ears,’ as the saying goes. It was also my first time navigating the world of an office environment and working with remote offices.
Let me start by setting the scene. Windows Server 2003 and Windows Server 2003 R2 were the Server Operating Systems installed on bare metal boxes. Windows XP was THE Operating System of choice, and Windows 7 had just been released. The word ‘virtualisation’ was not even something we came across. We pretty much had dedicated hardware for each server role required. Domain controller? Physical box running that role. File Server? Another dedicated box for that role. SQL Server? Yet another dedicated box. You get the picture. This, in turn, required a large amount of space to house all the servers. Additionally, installing servers meant sitting with a CD and manually performing the installation. Although there were options to automate it, we didn’t install servers often enough to warrant spending the time automating it. We were constantly snowed under with user IT issues, upgrades, and remote office work.
When it came to Cyber Security, ransomware was unheard of, HTTP was the norm, and getting infected meant taking a user device offline and running some anti-malware scans on it while checking the registry.
Connectivity between remote offices was based on MPLS connections and bonded ADSL with very poor speeds. There were no cloud solutions and no cloud backups.
Our environment was architected to perform almost real-time replication of files to one of our remote offices. RAID 5 and VSS snapshots were the norm. NT Backup was the backup of choice as it came bundled with Windows Server 2003. Storage was expensive, which meant very limited backup versions were kept. In hindsight, relying on replication and keeping backups on-site was meant to get us out of trouble in case of failure, but it turned out to be a bad idea.
Anyway, on the day of the incident, we had a power surge that caused the UPS to go out with a bang, filling the server room with acrid black smoke and effectively taking all the servers offline. Considering that our London office was the hub of the network, all the remote offices went offline too.
While we waited for an electrician to come and check the electrics, we were pretty much down. The remote offices were able to carry out some work using the cached version of files they had on their local servers. Needless to say, we had some really unhappy users while we waited for the all-clear from the electrician.
Once we got the all-clear, we bypassed the UPS and started powering up the servers, and that’s when we knew we were in serious trouble. Our bonded ADSL routers had taken a hit and would need to be replaced. Luckily, we had support available, and as fate would have it, the engineer was in the vicinity and got us back up within an hour. That was pretty much the only luck we had.
The servers, however, were a different matter. The box running as the domain controller was completely fried. The second domain controller started up fine, as did our Exchange servers. We then began bringing the File Server online. On boot, the Check Disk scan ran to completion. It found some errors and tried to fix them but failed. Windows Server 2003 eventually booted to the login screen. Unfortunately, with Internet connectivity working again, the File Server started replicating file changes to the one remote office that was supposed to take over, but had failed to do so.
As users started to log in again and tried to access files, they kept receiving error messages saying the file was corrupt or could not be opened. Essentially, the files were corrupt, and we realised too late that the changes had already been replicated.
Remember those backups? Well, they were residing on the fried DC box.
Key takeaways from this incident:
- Have off-site backups.
- Have offline backups.
- Verify your backups.
- Snapshots, including VSS snapshots, are not backups.
- RAID 5 is not a backup solution.
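To make 'verify your backups' a little more concrete, here is a minimal sketch of the kind of check we could have done with. It is not anything we ran back then: the function names (`sha256_of`, `verify_backup`) and the folder layout are made up for illustration, and it assumes the backup is a plain copy of the files rather than an NT Backup archive. It simply compares SHA-256 checksums of every file in a source folder against its counterpart in the backup folder.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir: Path, backup_dir: Path) -> dict[str, list[str]]:
    """Compare every file under source_dir with its copy under backup_dir.

    Hypothetical helper: returns the relative paths that are missing from
    the backup and those whose contents differ from the source.
    """
    report = {"missing": [], "mismatched": []}
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        backup_file = backup_dir / relative
        if not backup_file.is_file():
            report["missing"].append(relative.as_posix())
        elif sha256_of(source_file) != sha256_of(backup_file):
            report["mismatched"].append(relative.as_posix())
    return report
```

Bear in mind that a check like this only proves the copy matches the source. If the source is already corrupt, as ours was, a faithful copy of it passes just fine, so it is no substitute for regular test restores from an older, known-good version.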
So, how did we end up recovering files? We relied on archived copies of files on CDs that used to be burned monthly. We also had to trawl through the files saved on the local drives of remote workstations and in emails. Despite our best efforts, there was still quite significant data loss.
So, 'Who needs backups anyway?' Well, on that day, we could have done with quite a few backup copies of our data.