Hello everyone
A long time ago, I was employed as a programmer at a small hosting company. About eight weeks in, I had just returned to the office after a two-week absence due to family matters. Of the three Linux sysadmins we used to have, only one was left.
That remaining sysadmin was tasked with replacing a RAID controller on a faulty server overnight. When I came in the next morning, I got a call from the customer asking why their server was still down. Being the programmer in the company, I contacted the owner and the Windows sysadmin to see if they knew why. They gave me the engineer's cellphone number and I reached out. All I got from him was a snarled “I am still working on it!”
The Windows sysadmin and I decided to go to the data center and see if we could help him somehow. When we entered the suite, the only thing he could say was “Everything is gone, I can no longer find the partition tables”. He gave us some details on what he had done. We decided to send him home to get some rest, and I would try to get things working again.
This server was the (only) web and email server for that company, so the data was really important to them.
From the details he gave me, I learned that the RAID controller would crash whenever a lot of data was read from or written to the disks, and the only way to recover was to power-cycle the server. Because of this, it had been decided to disable the backup of that server until the RAID controller was replaced. That decision was made around 6 months before the actual replacement of the controller. If I remember correctly, it took some time for the new controller to arrive due to a hardware shortage.
The machine was built with a RAID1 for the OS and a RAID10 for data. Before replacing the controller, the engineer decided to break the RAID1 and RAID10 sets in the RAID controller so that he would have a “backup” of the data in case things went south.
You should know that most RAID controllers store metadata about the RAID set on the disks themselves. When a RAID set is deleted, that metadata is deleted with it, and the data on the disks becomes inaccessible.
We ended up contacting a data recovery company. The sysadmin got fired, and I became the new sysadmin for that company.
When the server was back up and running late in the evening that same day, I immediately enabled the backup again. At that time, they were using the backup software from “Team Dark Blue”. During my time there, I worked on a proposal to replace that software and tested several different backup products. I never finished that project.
Now, many, many, many years and saved bacon later, I am still a sysadmin. I do some programming in my spare time when I feel like it.
- RAID is not a backup.
- Even though your backup might cause issues on a machine, NEVER disable the backup.
- Having to reboot a server a couple of times a week is far less painful than having no data.
- Having a good and tested backup is really important. You might lose your bacon!
- Test it multiple times. Testing once is not a test. Use the capabilities available in Veeam, like SureBackup and Veeam Recovery Orchestrator.
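
To make the "test it" point concrete, here is a minimal sketch of one generic way to check a test restore: hash every file in the original directory and in the restored copy, then report what is missing, extra, or different. This is not how SureBackup or Veeam Recovery Orchestrator work internally, and it does not replace them. The function names and both directory paths are hypothetical and only there for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(source: Path, restored: Path) -> dict[str, list[str]]:
    """Compare a source tree with a restored copy of it.

    Both paths are hypothetical examples, e.g. the live data directory
    and the directory a test restore was written to.
    """
    src = tree_digests(source)
    dst = tree_digests(restored)
    return {
        "missing": sorted(set(src) - set(dst)),   # in source, not restored
        "extra": sorted(set(dst) - set(src)),     # restored, not in source
        "changed": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }
```

Calling `verify_restore(Path("/srv/data"), Path("/mnt/restore-test/data"))` (both paths made up) returns three lists, and a clean restore gives three empty ones. Keep in mind that files modified after the backup ran will show up as "changed", so a check like this is most meaningful against a snapshot or otherwise quiet data set, and it should be repeated regularly rather than run once.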
I hope this might save your bacon one day.
Backup like a boss,
Maurice