Slow VM snapshot deletion on NFS volumes on ESXi hosts

  • 7 December 2020


Because VMware snapshots are always a topic for backups, I want to share my latest experience with VM snapshots on NFS.

I had to troubleshoot a system time problem inside a VMware vSphere VM. During backup, the VM froze for more than 30 seconds. While it was frozen, the system time in the VM also stopped, then jumped to the current time after the freeze. This behavior caused massive problems at the application layer. During troubleshooting we found that VM snapshot deletion on NFS volumes on the ESXi hosts was very slow.

Environment

In this setup, we were running:

  • HPE SimpliVity hyper-converged infrastructure, running the current version.
    • Note: SimpliVity uses NFS v3 to present volumes to its hosts.
  • VMware vSphere 6.7 in a fairly current version.
  • Current version of Veeam Backup & Replication.

Symptoms

  • Slow VM snapshot deletion on NFS volumes on ESXi hosts.
    • Snapshot removal for comparable VMs running on block storage takes about 1-2 seconds. Deleting a snapshot hosted on a SimpliVity volume (NFS v3) takes at least 40 seconds.
  • During the snapshot deletion period, no additional IOPS can be observed.
  • System time problems within the VM.
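The time jump described above can be made visible from inside the guest with a small watcher. The following is only a minimal sketch, assuming a guest with Python available; the function names `find_stalls` and `watch` and the 5-second threshold are my own choices for illustration, not part of any VMware or Veeam tooling.

```python
import time
from typing import List, Tuple


def find_stalls(timestamps: List[float], interval: float,
                threshold: float) -> List[Tuple[float, float]]:
    """Return (timestamp_before_gap, extra_seconds) for each gap between
    consecutive samples that exceeds interval + threshold."""
    stalls = []
    for before, after in zip(timestamps, timestamps[1:]):
        extra = (after - before) - interval
        if extra > threshold:
            stalls.append((before, extra))
    return stalls


def watch(interval: float = 1.0, threshold: float = 5.0,
          samples: int = 3600) -> None:
    """Sample the wall clock every `interval` seconds and print a line
    whenever it moved forward by much more than the interval."""
    previous = time.time()
    for _ in range(samples):
        time.sleep(interval)
        now = time.time()
        extra = (now - previous) - interval
        if extra > threshold:
            print(f"{time.ctime(now)}: clock jumped {extra:.1f} s "
                  f"beyond the {interval:.0f} s interval")
        previous = now
```

Calling `watch()` in the guest during a backup window should print one line per freeze, provided the guest clock is corrected after the stun as described above. `find_stalls` applies the same check to timestamps that were already collected, for example from an application log.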

Root cause

The combination of backup transport mode and NFS version causes the problem: using NFS v3 (with any storage solution) together with hot-add transport mode (with any backup solution; it uses a virtual appliance to mount the VMDKs) leads to unresponsive VMs during creation and removal of snapshots.
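The condition can be written down as a small check, for example to flag affected jobs in an inventory script. This is a sketch under my own assumptions: the `BackupPath` type, the transport mode labels, and the `proxy_on_vm_host` flag are hypothetical names, and the flag reflects the "appliance on every single host" workaround listed in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupPath:
    nfs_version: int                 # 3 or 4
    transport_mode: str              # hypothetical labels: "hotadd", "nbd", "direct_nfs"
    proxy_on_vm_host: bool = False   # hot-add appliance runs on the same host as the VM


def is_affected(path: BackupPath) -> bool:
    """True when the combination described in the root cause applies:
    NFS v3 plus hot-add, with the appliance on a different host."""
    return (path.nfs_version == 3
            and path.transport_mode.lower() == "hotadd"
            and not path.proxy_on_vm_host)
```

For example, `is_affected(BackupPath(3, "hotadd"))` is true, while the same job with NBD or with NFS v4 is not flagged.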

Workaround

There are a few workarounds available, but none of them is desirable:

  • Use NFS v4 instead of v3.
    Nice, but most often we do not get to choose the protocol version. SimpliVity, for example, offers only v3.
  • Use another transport mode. Direct access and NBD (Network Block Device) are available.
    • For direct access, the backup solution needs access to the NFS datastore. This is sometimes not possible (SimpliVity does not support it) or not desirable.
    • NBD is the slowest of all transport modes. The backup reads data through the management uplink of the host, and the ESXi host limits the throughput. It can perform reasonably on 10 Gbit links with jobs running in parallel.
  • Continue to use hot-add mode, but with an appliance on every single host. See KB article 2010953.
  • For VMs and applications that suffer from this, a workaround can be to disable all time synchronization between host and VM. A few operations, such as vMotion, creating or removing a snapshot, and expanding a VMDK, trigger a re-synchronization of the VM's system time. You can try to disable all these triggers and keep the VM's time current with services like NTP or w32tm. See KB 1189 for how to disable them.
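For the last workaround, the triggers are disabled through advanced settings in the VM's VMX file. The sketch below lists the option names as I recall them from VMware KB 1189 and merges them into the text of a VMX file. Please verify the names against the KB before using them; the helper `apply_options` is a hypothetical name of my own, and it matches keys case-sensitively for simplicity.

```python
from typing import Dict

# Option names as recalled from VMware KB 1189 (assumption: verify against the KB).
TIME_SYNC_OPTIONS: Dict[str, str] = {
    "tools.syncTime": "0",
    "time.synchronize.continue": "0",
    "time.synchronize.restore": "0",
    "time.synchronize.resume.disk": "0",
    "time.synchronize.shrink": "0",
    "time.synchronize.tools.startup": "0",
    "time.synchronize.tools.enable": "0",
    "time.synchronize.resume.host": "0",
}


def apply_options(vmx_text: str, options: Dict[str, str]) -> str:
    """Return vmx_text with every option set: existing entries are
    replaced in place, missing ones are appended at the end."""
    seen = set()
    out = []
    for line in vmx_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key in options:
            out.append(f'{key} = "{options[key]}"')
            seen.add(key)
        else:
            out.append(line)
    out.extend(f'{key} = "{value}"'
               for key, value in options.items() if key not in seen)
    return "\n".join(out) + "\n"
```

The function only transforms text; reading and writing the actual VMX file, and the procedure around it, should follow the KB. With time synchronization disabled this way, the guest depends entirely on NTP or w32tm to keep its clock correct.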

Notes

  • This issue is not tied to any particular backup or storage solution. In my opinion, the problem is VMware related. Please correct me if I am wrong.
  • The issue is a problem for time-sensitive applications. For most workloads it will not matter.
  • Read VMware KB article 2010953 for more details.
  • For the issue description and workaround for Veeam B&R, see KB article 1681.
  • Some rather old, but for the most part still correct details about SimpliVity

4 comments

Userlevel 7
Badge +11

Excellent walkthrough Wolfgang, though I will say that using virtual appliance mode (hot-add) on NFS was always on Veeam's “not recommended” list for exactly the reasons you have mentioned. The natural step would be to leverage Direct NFS as you mentioned, but of course if SimpliVity blocks this, that is a no-go.

I am not sure if the new vSphere 7 precision clock device could help, but maybe something to potentially test out: https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.vm_admin.doc/GUID-4E6AE904-75C6-475F-8732-07E4542D7798.html

Userlevel 7
Badge +13

Good point about using the precision clock in vSphere 7. Because the VM is stunned during snapshot deletion, this will probably not be the answer. But it is worth a try! Up to now, no customer has intended to use this feature.

Userlevel 7
Badge +6

Very good topic, I'm learning every day.

Userlevel 7
Badge +11

Another idea: maybe use the pref.timeLagInMilliseconds setting in the VMX file(s) and set it to a very low value such as 10 ms. VMware KB: https://kb.vmware.com/s/article/2108828
