Rewind about two years. We were short-staffed (still are), behind on updates and upgrades, and trying our best to get rid of legacy and insecure infrastructure. After months of work, things were getting close to the finish line. Systems were running great and we were feeling quite safe and secure.
It all came to a sudden halt when someone said the service desk phone was ringing off the hook and no one could log in. I tried myself, and not only could I not log in, I couldn't ping or look up most of the VMs on the network. I rebooted one of my workstations and received a 169.254 (APIPA) address. I knew right away I was in trouble.
I could tell DHCP was having issues on my desktop, but on my second PC I could still ping most of the servers with static IPs, just not by name. This told me DNS was also down. Great, I thought, at 4 PM on a Friday.
We continued troubleshooting and it turned out to be much worse. Someone had decided to put a VERY large file in the SYSVOL folder to replicate it, without confirming there was room for it. All four Domain Controllers went offline, and worse, they ended up in a VERY bad state. After some more troubleshooting it was time to do my first DC restore.
Veeam being non-domain-joined was a lifesaver; I had no issues logging in. That led me to my next problem: the restores were failing each time. I think the stress of six people hovering over me and a few thousand users unable to work had me rushing, which wasted about ten more minutes before I realized everything in Veeam was configured with hostnames and relied on DNS.
Lucky for me, I always fear the worst and had an offline copy of the IP addresses of the physical servers, file servers, DCs, and other important infrastructure on a physical printout. After modifying the hosts file I was able to restore the DCs. Veeam support was excellent, helping me get the first DC up and running and restoring the ability for users to start working again. After a bit more research I found that a security guy had added a huge password file to SYSVOL for a password policy on the DCs. I mean a HUGE file. He learnt a valuable lesson that day.
I learnt something that day as well. Even when everything is done using best practices, things can always surprise you and be improved. The ultimate best practice is learning from an incident and preparing for the next time.
I have now added a few things to my DR/backup planning that I'd recommend to everyone as part of a best-practice solution.
-Have a list of physical and other critical infrastructure IP addresses on a physical piece of paper.
That IPAM application doesn't do well when you can't log into the PC or have forgotten its IP address.
-Create hosts files listing all critical infrastructure (a sample is sketched after this list). Save multiple copies on your Veeam servers to save significant time in an outage. I store a copy on the proxies, repos, and main Veeam server. You don't have to use it all the time, but just having it is important. Make sure to include at least the following:
-ESXi hosts
-vCenter Servers
-Veeam Proxies
-Veeam server
-Veeam Repos
-SQL servers used by Veeam
(Include both FQDNs and short hostnames if you are extra paranoid.)
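For reference, here is a minimal sketch of what such a hosts file could look like. The IP addresses and names below are made-up placeholders, not anyone's real environment, so substitute your own:

# Sample hosts file entries - hypothetical addresses and names
192.168.10.11   esxi01.corp.example.com        esxi01
192.168.10.12   esxi02.corp.example.com        esxi02
192.168.10.20   vcenter01.corp.example.com     vcenter01
192.168.10.30   veeam01.corp.example.com       veeam01
192.168.10.31   veeamproxy01.corp.example.com  veeamproxy01
192.168.10.32   veeamrepo01.corp.example.com   veeamrepo01
192.168.10.40   sql01.corp.example.com         sql01

On Windows the live file sits at C:\Windows\System32\drivers\etc\hosts, so copying a pre-built version over it during an outage only takes a few seconds.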
Those last two steps will shave off 2-3 hours of wasted time if I ever need to do this again. Even if I have to put a static IP on my workstation to access Veeam, now I can. When you don't know the IP of your IP management database, or of anything else, things will not end well.
Thanks again to Veeam, and Veeam support. I came out of this one a hero.