Veeam: Storage Snapshots & VMware vSphere Boot from SAN – A Cautionary Tale



 

Today’s post is a direct result of experience in the field, and it was certainly an interesting one!

I was recently working with a customer that wanted to migrate to the Veeam Availability Suite. They had an Enterprise SAN solution that they were using for booting their VMware ESXi hosts, in addition to storing their workloads. The customer’s virtual estate was large, with some highly transactional servers, making them great candidates for storage snapshots.

Everything was going well until I performed the storage integration with Veeam and hit a severe issue. After performing the storage rescan, all of the ESXi hosts within the cluster suffered a PSOD near-simultaneously. I’ve certainly never seen that before, and upon asking all of my peers in the wider Veeam community, and even some Veeam technical contacts, NOBODY had seen this before.

 

The Culprit:

 

Thankfully, the issue didn’t require much troubleshooting to discover what was going wrong: each of the ESXi hosts was complaining about seeing duplicate filesystem UUIDs. So, how did this happen?

Veeam’s documentation suggests that a separate initiator group should be used when utilising the Storage Snapshot functionality; however, if you have Boot from SAN on the same storage array, this really should be treated as mandatory.

 

Why Can this Combination Break VMware?

 

If you’re not putting your Veeam proxies into their own initiator group, then you’re putting them into somebody else’s, and if you’re working with VMware, odds are that’s the initiator group your ESXi hosts use.

As a bit of background: when you integrate the storage array, Veeam scans all of the LUNs within your included scope. Veeam will also use any API integrations to look for existing snapshots against these LUNs and request that the storage array present them to your initiator group. Veeam’s documentation also states that it will create and delete snapshots on these LUNs during the discovery process, though in reality I’ve seen this to be inconsistent.

The important point to take from the above is that because Veeam requests the LUN be mounted to the entire initiator group to which its proxies belong, the ESXi servers receive the snapshot(s) too. Once ESXi sees a snapshot of its own filesystem presented back to it, it recognises the duplicate UUID and halts itself for protection.
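To make the mechanism concrete, here’s a minimal Python sketch of the failure mode. All names and the array model here are hypothetical and purely illustrative; real arrays and ESXi behave far more subtly, but the visibility logic is the point:

```python
# Illustrative model only: a storage array presents LUNs to initiator groups,
# and every member of a group sees every LUN mapped to that group.

def visible_uuids(array_mappings, host):
    """Return the filesystem UUIDs a given host can see across its groups."""
    uuids = []
    for group, (members, luns) in array_mappings.items():
        if host in members:
            uuids.extend(luns)
    return uuids

def has_duplicate_uuid(uuids):
    """ESXi-style sanity check: the same UUID seen twice means a snapshot
    of our own filesystem has been presented back to us."""
    return len(uuids) != len(set(uuids))

# Shared initiator group: ESXi hosts and the Veeam proxy together.
# Veeam asks the array to present a snapshot of the boot LUN to the group,
# so the snapshot (same filesystem UUID!) lands on the ESXi hosts too.
shared = {
    "prod-group": (["esxi01", "esxi02", "veeam-proxy"],
                   ["boot-uuid", "vmfs-uuid", "boot-uuid"]),  # snapshot added
}
print(has_duplicate_uuid(visible_uuids(shared, "esxi01")))    # True -> PSOD territory

# Dedicated group for the proxy: the snapshot is only visible to the proxy.
dedicated = {
    "prod-group":  (["esxi01", "esxi02"], ["boot-uuid", "vmfs-uuid"]),
    "veeam-group": (["veeam-proxy"], ["boot-uuid"]),  # snapshot, proxy only
}
print(has_duplicate_uuid(visible_uuids(dedicated, "esxi01")))  # False
```

The toy model shows why the fix is zoning, not software: in the dedicated layout, the same snapshot exists, but only the proxy ever sees the duplicate UUID.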

I’ve found two specific scenarios that trigger this behaviour. Firstly, if you have any snapshots on your boot LUNs, Veeam will read them during a storage rescan. Secondly, Veeam can create snapshots via its API permissions, and does so at seemingly random intervals during its storage rescan steps, which themselves run on a regular schedule.

Even if you don’t use Boot from SAN, you really should use a dedicated initiator group for your Veeam proxies; then you won’t suddenly stumble upon this issue in the future.
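If you want to audit for this layout before Veeam ever rescans the array, the rule reduces to: no initiator group may contain both a Veeam proxy and an ESXi host. A hedged sketch of such a check (the group names and inventory format are hypothetical; feed it whatever your array’s CLI or API reports):

```python
def shared_groups(initiator_groups, proxies, esxi_hosts):
    """Flag initiator groups containing both a Veeam proxy and an ESXi host,
    since a snapshot presented for the proxy would reach the hosts too."""
    offenders = []
    for name, members in initiator_groups.items():
        members = set(members)
        if members & set(proxies) and members & set(esxi_hosts):
            offenders.append(name)
    return offenders

groups = {
    "esxi-boot":   ["esxi01", "esxi02"],
    "mixed-group": ["esxi03", "veeam-proxy01"],  # the dangerous layout
    "veeam-only":  ["veeam-proxy02"],
}
print(shared_groups(groups,
                    ["veeam-proxy01", "veeam-proxy02"],
                    ["esxi01", "esxi02", "esxi03"]))
# -> ['mixed-group']
```

An empty result means your proxies are properly isolated; anything flagged is a group where a Veeam-triggered snapshot presentation could land on an ESXi host.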


8 comments


It is amazing that snapshots can save you on many things but then cause so much grief elsewhere.  Great field story for sure.


I’ve rarely heard anything good about using boot from SAN, aside from it making it easier to perform deployments and upgrades because (as I recall) the OS is on the SAN.  But with that said, if you’re working with VMware at scale, it’s not really that hard to use local disks to host the OS and use host profiles to create consistency between your hosts.  Boot from SAN is something that I thought would be fun to play with, but in all honesty, I haven’t seen a need and I no longer work with systems involving more than 5 hosts or so.  I guess if I did, maybe I’d look at it, but my old environment was closer to 250-300 hosts and we still used local storage for boot.


> It is amazing that snapshots can save you on many things but then cause so much grief elsewhere.  Great field story for sure.

Snapshots are amazing.  But yeah...there’s some pain there if done incorrectly.


Yikes. Nothing like a PSOD for a bit of excitement. 

 

 


> Yikes. Nothing like a PSOD for a bit of excitement.
Not just one PSOD… PSODs en masse!  Ouch.


A really unique problem I would say. Would be great if Veeam would check which LUNs are actually datastores and only process those.

On the other hand, I rarely see Boot from SAN in the wild. In your case, the ESXi bootbank and the VMs were on the same storage? So losing the storage also means losing the hosts? @MicoolPaul 


> A really unique problem I would say. Would be great if Veeam would check which LUNs are actually datastores and only process those.
>
> On the other hand, I rarely see Boot from SAN in the wild. In your case, the ESXi bootbank and the VMs were on the same storage? So losing the storage also means losing the hosts? @MicoolPaul

As I recall, the OS has to be on a certain LUN ID in order to boot from SAN.  Generally it’s probably a good idea to keep that LUN away from where the VMs sit.  And I suppose it wouldn’t necessarily have to be the same SAN, but could be.


> A really unique problem I would say. Would be great if Veeam would check which LUNs are actually datastores and only process those.
>
> On the other hand, I rarely see Boot from SAN in the wild. In your case, the ESXi bootbank and the VMs were on the same storage? So losing the storage also means losing the hosts? @MicoolPaul

Therein actually lies the problem: it needs to mount the snapshot to read it, which is exactly when it creates the PSOD event 😆
