Some weeks ago the backup monitoring team of my company had an issue with the backups of one of our customers…
Where did it start, and where did it end?
If you want to know, keep on reading.
The infrastructure of the customer :
The customer had a Hyper-V infrastructure with a couple of Hyper-V hosts and roughly 30 VMs, spread across those hosts and running on shared storage (several Cluster Shared Volumes).
The customer uses both backup jobs and replication jobs in Veeam Backup & Replication.
The backup issue :
The backups were running fine, except for one VM.
I no longer remember the exact error in Veeam…
The engineers tried several things to resolve this issue, but nothing helped.
Eventually they asked me whether I had an idea about this issue.
I took a quick look at the error and pointed out that its cause was not related to the backup itself: there was a Hyper-V issue with this particular VM.
I suggested two things :
- first, try a live migration of that particular VM to another host and retry the backup
- if that does not work: ask the customer when the VM can be taken down, perform a quick migration of the VM to another host, and run the backup again
The engineer tried to live-migrate that particular VM, and then the real trouble started…
The whole Hyper-V host started to crash, just from moving one VM to another host. Unbelievable but true!
The issue was escalated to a tier 2 engineer. He also tried to solve the problem, but couldn’t find the cause or a solution. It was so bad that he thought the whole cluster needed to be reconfigured… At that time all VMs were running on one host (just barely possible with the available memory per host, but of course with the CPU overcommitted).
In the end the case was escalated to me, as the Veeam expert who also has a lot of experience with Hyper-V clusters.
I performed a root-cause analysis of the situation.
I was convinced that it was a Hyper-V issue, but what exactly?
At a certain moment I decided to create a new VM (with a different name) to replace this particular VM, and I first copied the .vhdx disks so I could roll back if necessary. At some point I wanted to delete (or rename) the original disks, but that was not possible because they were still in use…?
Then I noticed something: in Failover Cluster Manager, the VM with that name was registered as running on host 1. On Hyper-V host 1 I saw the VM with that name, which was normal, but I saw the same VM on Hyper-V host 2 as well!
OK, of course: if we then perform a live migration from host 1 to host 2 through the failover cluster, the whole setup gets confused, because that VM is already registered on the target host.
Hyper-V does not cope well with such a situation.
So I had found the cause. Even worse, this was not the only VM registered twice: there was another one, plus two VMs that were not registered in the failover cluster at all, so automatic failover was not possible for those...
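The check I did by hand here can be sketched in a few lines of Python. This is only a minimal illustration under assumptions: the host names, the VM names, and the function `audit_vm_registrations` are hypothetical, and the inventories are hard-coded. In a real environment the lists would first have to be exported, for example with `Get-VM` on each host and `Get-ClusterGroup` on the cluster.

```python
from collections import defaultdict


def audit_vm_registrations(host_inventory, cluster_roles):
    """Find VMs registered on more than one host and VMs missing from the cluster.

    host_inventory: dict mapping a host name to the VM names registered on it.
    cluster_roles:  set of VM names that exist as roles in the failover cluster.
    """
    hosts_by_vm = defaultdict(list)
    for host, vms in host_inventory.items():
        for vm in vms:
            hosts_by_vm[vm].append(host)

    # A VM should be registered on exactly one host at a time.
    duplicates = {
        vm: sorted(hosts) for vm, hosts in hosts_by_vm.items() if len(hosts) > 1
    }
    # A VM that is known to a host but not to the cluster cannot fail over.
    unclustered = sorted(vm for vm in hosts_by_vm if vm not in cluster_roles)
    return duplicates, unclustered


# Hypothetical example data, not the customer's real environment.
host_inventory = {
    "host1": ["vm-app", "vm-db", "vm-file"],
    "host2": ["vm-app", "vm-web"],
}
cluster_roles = {"vm-app", "vm-db", "vm-web"}

duplicates, unclustered = audit_vm_registrations(host_inventory, cluster_roles)
print(duplicates)   # {'vm-app': ['host1', 'host2']}
print(unclustered)  # ['vm-file']
```

Anything that shows up in the first result is a candidate for the live-migration trouble described above, and anything in the second result has no automatic failover.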
I removed all registrations of those VMs and registered them again as they should be. Afterwards everything ran fine: live migration, no issue; replication, no issue; backups, no issue…
So this case started with a single VM not being backed up, but days later it turned out that several VMs were not properly configured in the cluster.
In the end, the customer and my colleagues were very happy that it was finally solved, and of course the Veeam software was working just as it should!