[Post digest] Performance Best Practices for VMware Snapshots


Userlevel 7
Badge +6
  • Veeam Legend, Veeam Vanguard
  • 694 comments

Because backup of a vSphere VM almost always involves taking a vSphere snapshot, this VMware blog post will be interesting for every backup administrator. 

https://blogs.vmware.com/performance/2021/06/performance-best-practices-for-vmware-snapshots.html

VMware has tested the performance impact of snapshots. The baseline is a VM without a snapshot; the tests are then repeated with one, two, and more snapshots. The workloads were standard I/O tests and Java application performance (SPECjbb), and the datastore types covered were vVOL, VMFS, and vSAN.

Test Results:

  • Impact on vVOL depends on the storage system, because snapshots are taken there.
  • VMs on VMFS have a huge performance penalty even with one snapshot. To be more exact: the first snapshot has the greatest impact!
  • vSAN does not suffer much from snapshots under a sequential workload. To be honest, I think this is interesting to know but has little meaning in practice.
  • SPECjbb does not show worse performance at all.

Recommendations:

  • Keep snapshots for as short a time as possible.
  • vVOL snapshots have less impact, but how much depends on the storage array.
  • Keep the snapshot chain as short as possible.
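The two "as short as possible" recommendations can be turned into a simple check. The following is only a minimal sketch: the `SnapshotInfo` record, the `flag_snapshots` helper, and the thresholds (24 hours, chain length 1) are assumptions made for illustration and do not come from the VMware post. In practice the records would be filled from whatever inventory export you already have.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SnapshotInfo:
    """Hypothetical record describing the snapshot state of one VM."""
    vm_name: str
    created: datetime      # creation time of the oldest snapshot in the chain
    chain_length: int      # number of snapshots in the chain


def flag_snapshots(snapshots, now, max_age=timedelta(hours=24), max_chain=1):
    """Return (vm_name, reasons) for every VM that breaks one of the limits.

    max_age and max_chain are illustrative defaults, not vendor guidance.
    """
    flagged = []
    for snap in snapshots:
        reasons = []
        if now - snap.created > max_age:
            reasons.append("older than max_age")
        if snap.chain_length > max_chain:
            reasons.append("chain longer than max_chain")
        if reasons:
            flagged.append((snap.vm_name, reasons))
    return flagged


if __name__ == "__main__":
    now = datetime(2021, 6, 10, 12, 0)
    inventory = [
        SnapshotInfo("vm-a", datetime(2021, 6, 10, 11, 0), 1),
        SnapshotInfo("vm-b", datetime(2021, 6, 8, 12, 0), 3),
    ]
    for vm_name, reasons in flag_snapshots(inventory, now):
        print(vm_name, "->", ", ".join(reasons))
```

A report like this is mainly useful as a reminder for the colleague who forgets to clean up; the right thresholds depend on the workload and the datastore type.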

I recommend reading the post carefully (it takes about 5 minutes). At the very least, the graphs in the post are good to show to the guy who always forgets to delete his snapshots.


18 comments

Userlevel 7
Badge +4

Hey, finally some decent data about the impact of snapshots. Good arguments against the colleagues who always claim that snapshots do not matter… :sunglasses::thumbsup_tone3:

Userlevel 7
Badge +5

Really great article, and it gives you a better understanding of snapshotting for sure.

Userlevel 6
Badge +1

Thanks for sharing. Unfortunately, many admins forget to delete snapshots, and they do the same with AMIs. Some even use snapshots as a backup method.
- Key point: keep snapshots for as short a time as possible!

Userlevel 6
Badge +1

I agree but we cannot do without it in most cases. We just have to keep the chain as short as possible as @vNote42 suggested.

Userlevel 7
Badge +4

Thanks for sharing, it’s a topic that doesn’t get enough attention!

Userlevel 7
Badge +3

Thx for sharing @vNote42. As a recap: create the backups (snapshots) as fast as possible. To that end, put the backup server as close to the hypervisor as possible and use the best transport mode when backing up vSphere. Direct SAN is the usual recommendation, but it calls for a physical backup server connected directly to the shared storage. If that is not possible, use the hot-add transport mode by deploying one or more virtual servers as proxies. Avoid the NBD method: it is the slowest, so the snapshots are kept open for a longer period...
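The preference order in this recap can be written down as a small decision helper. This is a sketch only: the function name, its two boolean inputs, and the returned labels are hypothetical and are not Veeam configuration values; it just encodes "direct SAN first, hot-add second, NBD last".

```python
def pick_transport_mode(physical_proxy_on_san: bool,
                        can_deploy_virtual_proxy: bool) -> str:
    """Pick a transport mode in the order of preference given in the recap.

    The returned strings are illustrative labels, not product settings.
    """
    if physical_proxy_on_san:
        # Physical backup server attached directly to the shared storage.
        return "direct-san"
    if can_deploy_virtual_proxy:
        # One or more virtual proxies that can hot-add the VM disks.
        return "hot-add"
    # Slowest option, so the snapshot stays open longest.
    return "nbd"


if __name__ == "__main__":
    print(pick_transport_mode(False, True))
```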

Userlevel 7
Badge +3

With Veeam Quick Backup there's no need to keep snapshots at all, apart from the temporary one used to create the backup 😉

Userlevel 7
Badge +6

But “temporary” can also last a few hours! I have already had long-running backup jobs cause a lot of problems at snapshot deletion. These kinds of problems can be solved or reduced by vVOL or storage integration.

Userlevel 7
Badge +4

I don’t know about anyone else here but I’m still a vVOL sceptic, just because each storage vendor seems to have different implementations and every now and again we hear of a storage corruption story. @Gostev was tweeting about this only last month saying the same thing:

 

 

 

I REALLY want to use VVOLs because of the benefits, but data integrity needs to come first 😩 feels like having a write cache without a battery backup, it’ll work 99% of the time but that 1% is gonna hurt a LOT.

 

Anyone here using VVOLs as their primary daily workload driver? What vendor? How has your experience been?

Userlevel 7
Badge +3

You're right; it really comes down to the application/VM and its change rate. I also know systems that are very latency sensitive or under high load where a snapshot deletion causes stuns of multiple seconds, even if the snapshot only existed for a few minutes.

 

@MicoolPaul Unfortunately (or luckily) I have only seen VVOLs in labs, never in production. It would be great if someone here had some long-term experience to share.

Userlevel 6
Badge +2

Thank you @vNote42, very interesting!
I hate vVOL, Nutanix & Storage Spaces Direct or any kind of hyperconvergence. :yum:
Only in the lab; too many HW/software constraints.

Userlevel 7
Badge +5

I remember testing vvols on Nimble storage back in the day but it did not hold production data due to that small percentage of possible issues with data. Only seen it in labs of late.

Userlevel 7
Badge +4

Someone I used to work with suggested EVERY customer move to VVOLs immediately; there was extremely vocal disagreement.

 

@Link State I know what you mean about hyper convergence. I LOVE vSAN, but I love it pretty much exclusively for VDI. My reason being that ESXi cares FAR too much about what vSAN has to say on the matter.

 

I’ve had scenarios whereby vSAN decides instead of a degraded performance state it will stop processing IO for a workload entirely.

 

Example: With vSAN you can pair NVMe with spinny disk. Now assume you have NVMe that can do 2GBps and a spinny hard drive that can do 100MBps. If you have a sustained burst of IO, the writes are committed to NVMe (still across multiple nodes), but by default the cache of writes waiting to be demoted to spinny is exceptionally small. If you breach this small threshold, vSAN won’t throttle performance or anything like that. It marks the disk group as unhealthy and tries to rebalance all objects off it. If you then have another failure, a maintenance window, etc., you can find yourself with a vSAN that is unable to migrate objects and just stops answering IO requests, causing an outage. I can understand the attempts to rebalance, but when that’s not possible, surely throttled IO is better than an outage?
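To put rough numbers on that example: with 2GBps coming in and 100MBps being demoted, the backlog grows by about 1,900MBps. The sketch below only does that arithmetic. The backlog threshold is a made-up figure (the comment does not state one), and the function is hypothetical; it does not model real vSAN behaviour.

```python
def seconds_until_threshold(threshold_mb: float,
                            ingest_mb_per_s: float,
                            destage_mb_per_s: float) -> float:
    """Time until the backlog of not-yet-demoted writes reaches threshold_mb.

    Assumes constant ingest and destage rates, which is a simplification.
    """
    net_growth = ingest_mb_per_s - destage_mb_per_s
    if net_growth <= 0:
        # The capacity tier keeps up, so the backlog never grows.
        return float("inf")
    return threshold_mb / net_growth


if __name__ == "__main__":
    # 2GBps NVMe and 100MBps spinning disk are the rates from the example;
    # the 19,000MB threshold is an assumed value for illustration only.
    print(seconds_until_threshold(19_000, 2_000, 100))
```

With these rates, even a threshold in the tens of gigabytes is reached within seconds of sustained writes, which fits the point that the gap between the two tiers leaves very little headroom.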

 

If it was a normal SAN, ESXi would complain about write latency in such an event, but for vSAN it always feels that your workload comes second to vSAN’s health metrics.

Userlevel 7
Badge +5

I like vSAN, but in my lab it uses too many resources, so I just use my Synology DS920+ with iSCSI and NFS.

Userlevel 7
Badge +6

I am really a fan of the idea of vVOL. But for my taste it depends too much on each vendor's implementation of vVOL. Just take the VASA provider: some vendors put it in the controller; for others you need a separate VM or server. Some VASA providers can't be made redundant. When you think how important this component is, it should always be highly available.

Userlevel 7
Badge +5

This is definitely a good point; if the implementation were standard across vendors, it might be better. Let's see what the future holds for vVOLs.

Userlevel 7
Badge +6

waiting … waiting … :sleeping:

Still waiting for a vVOL implementation that supports transparent failover with synchronous replication 

Userlevel 7
Badge +3

@vNote42 : Thanks for sharing !
