I'm seeing the error "Failed to process VM: Failed to create MachineMutex" when taking a storage snapshot. The first try fails but the second try succeeds, and the issue comes and goes.
Failed to create snapshots on primary storage: Failed to wait mutex sandiscover7c39d713-bdbb-40f1-ac8d-af87774de299_b72e5bb7-87e2-4802-9043-2d6a76f9052d: timeout 600 sec exceeded
Failed to create storage snapshot for datastore abcdef: Failed to wait mutex sandiscover7c39d713-bdbb-40f1-ac8d-af87774de299_b72e5bb7-87e2-4802-9043-2d6a76f9052d: timeout 600 sec exceeded
Who has seen this issue and cause/resolve info?
Best answer by MicoolPaul
As the issue is a timeout, is the storage under stress?
Can you provide the following details:
Are there many virtual machines on the datastore? Is it always the same datastore? Let’s start with this.
Appreciate the reply!
Are there many virtual machines on the datastore? There are 3 VMs on 2 volumes. Two of the VMs are huge (45TB). And yes, this issue is always on the same datastores.
Is it the datastore with the huge VMs?
Both datastores have one huge VM. It happens a lot on one of the datastores, but I also see it happening on both at the same time. One datastore has 2 VMs (one of them 45 TB) and the other has 1 VM (45 TB). These two huge VMs are part of SQL Always On.
I've had this a lot with IBM SVC, we received a private fix IIRC, but I think it was only really fixed after we updated the storage hardware.
How is the storage presented to the hosts and Veeam: iSCSI, FC, or NFS? Are they in separate igroups?
To take an educated guess, as it’s a timeout intermittently, I’d look at the NetApp logs when this happens and compare them to when it works. Can you share here?
I’ll work with our storage admin to get the necessary logs and will post them. Hopefully we can find relevant info to share. The volumes are NFS.
I’ve just sat down and gone through this again and here’s what we can discern:
Terminology: a mutex is a “mutually exclusive” section of code; only one processing thread is allowed to access it at a time. With that in mind, we can focus on the other words around it: this is a Storage Snapshot, and then a SAN Discover process times out. I believe what is happening is:
Veeam is requesting that a snapshot is created for storage snapshot.
Veeam then performs a SAN discovery process to find the storage snapshot to mount it. This process is allowed to take up to 10 minutes, and then times out.
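To make the mutex/timeout behavior concrete, here is a minimal sketch of the pattern the error describes. This is purely illustrative, not Veeam's actual code: the lock name, function, and message are assumptions, and the timeout is shortened from the 600 seconds in the log.

```python
import threading

# Hypothetical illustration: a lock guards the SAN-discovery step so only
# one worker runs it at a time; waiters give up after a fixed timeout
# (600 s in the Veeam log; shortened here for the demo).
san_discover_lock = threading.Lock()

def run_san_discovery(timeout_sec: float) -> str:
    # acquire() blocks until the lock is free or the timeout elapses.
    acquired = san_discover_lock.acquire(timeout=timeout_sec)
    if not acquired:
        # This is the situation the log reports: the previous holder
        # (e.g. a discovery stalled by a slow SAN) never released in time.
        return "Failed to wait mutex sandiscover: timeout exceeded"
    try:
        return "discovery completed"
    finally:
        san_discover_lock.release()

print(run_san_discovery(timeout_sec=1.0))
```

In other words, the mutex itself is not the problem; it just surfaces whatever is making the holder slow, which is why the error points back at storage latency.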
Is the NetApp presenting many other shares such as SMB/NFS shares, or anything else that could be putting the SAN under heavy stress?
My gut says this comes down to the SAN being under intermittent stress during that discovery window.
Hope this helps.
Still trying to get to the bottom of the issue. It's quite difficult because the issue is intermittent. I’ll open a ticket with support and will share whatever cause/solution we come up with. I would tend to agree that the SAN may be under temporary stress due to the size of the VMs.
I'm seeing the same issue and am interested in whether there is a patch or fix for this. Thank you.
I didn’t get a chance to open a ticket, but we have recently been getting a lot of errors that I think are related to this one. I don’t see the MUTEX error anymore; it got replaced by a lot of the following error. I have opened a ticket with Veeam and will share the resolution.
12:24:45 AM :: Error: Failed to prepare VM for processing: [MachineSemaphore] Failed to wait for semaphore Global\PREPARING_SAN_VM_757ef7ad-998a-4b3a-9b21-243b68fa5ea0: timeout 10800000 ms exceeded
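The new error names a semaphore rather than a mutex. A semaphore generalizes a mutex by allowing up to N concurrent holders instead of one. A hedged sketch of the difference, with an assumed permit count (the names and count here are illustrations, not Veeam internals); note the log's 10800000 ms timeout works out to 3 hours:

```python
import threading

# The log's timeout, converted from milliseconds to hours.
TIMEOUT_MS = 10_800_000
TIMEOUT_HOURS = TIMEOUT_MS / 1000 / 3600  # 3.0

# Hypothetical: a semaphore capping how many VMs are prepared for SAN
# processing at once. The permit count of 2 is an assumption for the demo.
prepare_vm_sem = threading.Semaphore(2)

def prepare_vm_for_processing(timeout_sec: float) -> str:
    # Unlike a mutex, up to 2 callers can hold a permit simultaneously;
    # a third waits, and times out if no permit frees up in time.
    if not prepare_vm_sem.acquire(timeout=timeout_sec):
        return "Failed to wait for semaphore: timeout exceeded"
    try:
        return "VM prepared"
    finally:
        prepare_vm_sem.release()

print(TIMEOUT_HOURS)
print(prepare_vm_for_processing(timeout_sec=1.0))
```

So the symptom changed shape, but the underlying pattern is the same: a worker waiting on a shared resource gives up after a long timeout, which again points at something upstream being slow rather than at the synchronization primitive itself.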