I really struggled with sharing this story because it’s quite embarrassing. But if it helps even one person, it’s worth it. And not all ‘save the day’ stories are about us being flawless Systems Admins/Architects...we are human. Some are about how well we’ve set up and configured our recovery processes. Thanks to
I did end up saving the day, but it was my own initial negligence, albeit an honest mistake, that caused the issue in the first place. Interestingly enough, a fellow former Veeam Vanguard (now a Veeam employee) and I had discussed this very configuration mishap potential via direct message in the Vanguard Slack channel before it actually happened to me.
Catastrophe
So, what did I actually do to cause the catastrophe at my organization? First, let me start by giving a high-level overview of my backup environment. I use physical servers, running Windows, as combination Repo/Proxy devices. I also use Nimble arrays as my production and backup storage (separate arrays). Lastly, ever since Nimble was added as a supported storage vendor for the Backup from Storage Snapshots (BfSS) feature (since v9.5, I believe), the supported way to set this up has been to configure your Proxies similarly to DirectSAN. Without going into the minute details, you may be asking, ‘how are Proxies configured to use DirectSAN transport mode?’ There was a technical paper written by Bill Roth from Nimble that I utilized when configuring my environment to integrate Nimble with Veeam. I was corresponding with him directly at the time, but you can view the documentation here. Part of the configuration to use DirectSAN is to expose your production datastore Volumes to the Proxy, so the Proxy can “directly” access your VM data. Are you seeing what I’m getting at from a potential mishap standpoint???
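To make the hazard concrete, here’s a rough, made-up example of what ‘list disk’ can look like on a Proxy configured this way (the disk numbers, sizes, and arrow annotations are mine for illustration, not from my actual environment). Nothing in this view tells you which disks are backup repositories and which are production datastore LUNs:

```
DISKPART> list disk

  Disk ###  Status         Size     Free     Dyn  Gpt
  --------  -------------  -------  -------  ---  ---
  Disk 0    Online          279 GB      0 B        *    <- local OS disk
  Disk 1    Online           40 TB      0 B        *    <- existing backup Repo volume
  Disk 2    Online           10 TB  1024 KB        *    <- production VMFS datastore LUN
  Disk 3    Online           10 TB  1024 KB        *    <- another production datastore LUN
```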
Fast-forward to a day, 6-7 years ago, when I needed to add a Volume from my storage array to my Repo as a new Backup Repository. The way I configure storage on my Windows Repo is with Diskpart: I list all the disks, select my new disk, create a partition, then format it for use. All good. Well, Windows disk numbering when adding new Volumes can be...well, not “next-in-line” sometimes. In other words, if you have 5 Volumes already on your Windows Repo, the next Volume (disk) you add may not show up as disk “6” in Disk Management or in Diskpart. As such, when I went to add/partition/format this new Volume as a new Repo using Diskpart, I selected the wrong disk number. It was the last number in the list, but it happened to be a datastore Volume hosting our most CRITICAL PRODUCTION DATA. When I ran my Diskpart commands (create partition primary…..blah, blah, blah), I ended up wiping the datastore Volume hosting our critical data!!! CATASTROPHE! Honest mistake, sure; but what really bothered me was that pretty much every previous time I did this process, I would VERIFY the disk number I selected was actually the new Volume, both in Disk Management and in Nimble Connection Manager, to prevent this very thing from happening. For whatever reason, this *one* time I didn’t verify the disk number I was working with. Ugh.
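For anyone who hasn’t done this before, the sequence looks roughly like the transcript below. The disk number, filesystem, label, and drive letter are placeholders, and the arrows are my annotations; the ‘detail disk’/‘list partition’ check in the middle is exactly the verification step I skipped that one time:

```
DISKPART> list disk          <- note the NEW disk's number; don't assume it's the last one
DISKPART> select disk 6      <- placeholder number; confirm it in Disk Management and Nimble Connection Manager too
DISKPART> detail disk        <- verify the vendor, size, and ID match the Volume you just presented
DISKPART> list partition     <- a brand-new Volume should have NO existing partitions
DISKPART> create partition primary
DISKPART> format fs=ntfs unit=64k quick label="Repo01"
DISKPART> assign letter=R
```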
Recovery
Even though this had happened, I was oblivious to it because I still thought I had partitioned and formatted my new Volume rather than my datastore Volume. How did I find out I had wiped out all our critical production VMs? Our Helpdesk got inundated with calls from users saying they had lost connection to our critical system. When Helpdesk staff approached me, I looked in my vSphere environment and noticed all my critical VMs showed italicized and as “disconnected”. Ooops!!!
Because of how I had set up my backup environment, there were multiple ways I could’ve recovered my data. Even though using Veeam Instant Recovery (IR) probably would’ve been quicker, I chose to use Volume-level recovery from my Nimble arrays. Using this method meant I didn’t have to go through the tedious process of performing IR for multiple VMs one by one. The downside of the Nimble Volume restore method was (unknowingly to me at the time) the length of time it takes Hosts to remove the Volume (datastore). Don’t ask me why, but when removing a wiped datastore from an ESXi Host, it takes about 20-30mins for the datastore to actually be removed (Host processing or something..not sure). Even so, I think this is the cleanest restore method for this situation. Long story short, after about 3-4hrs of going through the process of removing the datastore from all my ESXi Hosts, I then recovered the same Volume to vSphere via its array-based snapshot. Because I use Veeam to orchestrate my array snapshots, and because this was (and is) critical data I back up every 30mins, the Volume snapshot I used for the recovery was taken shortly before my mishap, so there was only about 10mins of data loss (bright side!).
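For reference, the per-Host cleanup essentially boils down to unmounting the dead datastore and detaching its backing device before the restored Volume is presented back. A rough esxcli sketch of those steps is below; the datastore label and device ID are placeholders, and this isn’t necessarily the exact way I did it at the time:

```
# List mounted VMFS filesystems and note the wiped datastore's label/UUID
esxcli storage filesystem list

# Unmount the dead datastore from this Host
esxcli storage filesystem unmount -l ProdDatastore01

# Find the backing device and detach it so the Host stops trying to use it
esxcli storage core device list
esxcli storage core device set --state=off -d naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Once the restored Volume is presented back, rescan the storage adapters
esxcli storage core adapter rescan --all
```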
Lessons Learned
As Systems Admins/Architects, doing repetitive tasks can lead to us becoming negligent in verifying those same tasks we accomplish so often. So don’t be complacent when doing those tasks…..expect, and be prepared for, the unexpected. I know most of us use automation to streamline repetitive tasks, but even using automation here may have led to the same outcome.
Second, do you have a Disaster Recovery plan in place? Sure, this was no natural disaster (as Marco shared in his story) or ransomware/malware attack (as Luis shared), but it was a critical recovery scenario regardless. As such, a viable DR plan was needed. So make sure to review your recovery processes regularly. We can’t foresee every disaster scenario, but we can have a general, high-level recovery plan in place to help guide us when various situations occur.
Lastly → mitigate the potential of the issue reoccurring in the future. Back to what I shared earlier about discussing this configuration with a fellow Vanguard: in testing BfSS further, it became apparent to us that exposing datastore Volumes to the Proxy can potentially be hazardous (oh really???) and is not necessary, as Veeam just needs to interact with the Volume snapshot, not the Volume itself. As such, obviously after what had just happened to me, I removed all my datastore Volumes from my Proxies. Keep in mind, when doing so, your BfSS still works; but if for some reason this transport mode fails (though in the 5+ years I’ve been using it, it hasn’t failed), your jobs won’t fail over to DirectSAN mode, but rather to NBD, because the datastore Volume isn’t allocated to the Proxy. For me, I’m completely ok with this. Another configuration you can enable to help mitigate this potential self-inflicted disaster is within Diskpart: set the SAN policy to “Offline Shared” using the san command, so SAN-attached Volumes like your datastores stay offline and read-only to your Proxies. Whether you can enable this option is dependent on the storage vendor used, so verify via their documentation.
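Checking and changing that policy is quick in Diskpart; the output below is approximate, and as I understand it the policy only applies to newly discovered disks, so any datastore LUNs already online on the Proxy would still need to be taken offline or unmapped separately:

```
DISKPART> san
SAN Policy  : Online All

DISKPART> san policy=OfflineShared
DiskPart successfully changed the SAN policy for the current operating system.
```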
Well, that’s my story. What an embarrassing and humbling experience! Just to make me feel better, have any of you, in some form or another, wiped out critical data by mistake? I will say, before I became a Sr SA/SE myself, while I was working at a previous healthcare organization, a systems admin co-worker, while walking through our datacenter, ended up tripping over a power supply (don’t ask me why it was in a walking lane) and unplugged power to a whole rack of servers. And this was before virtualization was mainstream. Not good.