Today’s post is a direct result of experience in the field, and it was certainly an interesting one!
I was recently working with a customer that wanted to migrate to the Veeam Availability Suite. They had an enterprise SAN that they used both for booting their VMware ESXi hosts and for storing their workloads. The customer’s virtual estate was large, with some highly transactional servers that were great candidates for storage snapshots.
Everything was going well until I performed the storage integration with Veeam and hit a severe issue. After the storage rescan, all of the ESXi hosts within the cluster suffered a PSOD near-simultaneously. I’ve certainly never seen that before, and when I asked my peers in the wider Veeam community, and even some Veeam technical contacts, nobody had seen it before either.
The Culprit:
Thankfully, the issue didn’t require much troubleshooting to diagnose. Each of the ESXi hosts was complaining about seeing duplicate filesystem UUIDs. So, how did this happen?
Veeam’s documentation suggests that separate initiator groups should be used when utilising Storage Snapshot functionality. However, if you have Boot from SAN on the same storage array, this really should be mandatory.
Why Can This Combination Break VMware?
If you’re not putting your Veeam proxies into their own initiator group, then you’re putting them into an existing one, and if you’re working with VMware, odds are that the group you’ve chosen to share with Veeam is the one containing your ESXi hosts.
As a bit of background information, when you integrate the storage array, all of the LUNs within your included scope will be scanned by Veeam. Veeam will also use any API integrations to look for existing snapshots of these LUNs, and ask the storage array to present them to your initiator group. Veeam documentation also states it will create and delete snapshots on these LUNs during the discovery process, though in reality I’ve seen this to be inconsistent.
The important point to take from the above is that Veeam asks for the snapshot to be presented to the entire initiator group to which its proxies belong. If the ESXi servers share that group, they receive the snapshot(s) too. Once an ESXi host is presented with a snapshot of its own filesystem, it recognises a duplicate UUID and halts itself for protection.
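To make the failure condition concrete, here is a minimal Python sketch of the check a host is effectively failing: the same filesystem UUID showing up on more than one presented LUN. It is purely illustrative. The host names, LUN names and UUIDs are made up, and nothing here calls Veeam, VMware or any storage array API.

```python
from collections import defaultdict


def hosts_seeing_duplicate_uuids(presentations):
    """Return hosts that see one filesystem UUID on more than one LUN.

    `presentations` is an iterable of (host, lun_id, filesystem_uuid)
    tuples. This is a hypothetical data shape for illustration only.
    """
    # host -> filesystem UUID -> set of LUNs carrying that UUID
    seen = defaultdict(lambda: defaultdict(set))
    for host, lun_id, fs_uuid in presentations:
        seen[host][fs_uuid].add(lun_id)

    affected = {}
    for host, uuids in seen.items():
        duplicates = {
            fs_uuid: sorted(luns)
            for fs_uuid, luns in uuids.items()
            if len(luns) > 1
        }
        if duplicates:
            affected[host] = duplicates
    return affected


# Hypothetical example: esx01 is presented a snapshot of its own boot LUN.
example = [
    ("esx01", "boot-lun-01", "uuid-A"),
    ("esx01", "snap-of-boot-lun-01", "uuid-A"),
    ("esx02", "boot-lun-02", "uuid-B"),
]
print(hosts_seeing_duplicate_uuids(example))
```

In the scenario described above, every host in the shared initiator group would appear in the result, which matches the near-simultaneous failure across the cluster.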
I’ve found two specific scenarios that trigger this behaviour. Firstly, if you have any snapshots on your boot LUNs, these will be read by Veeam during a storage rescan. Secondly, Veeam can create snapshots via its API permissions, and does so at seemingly random intervals during its storage rescan steps, which themselves are triggered at regular intervals.
Even if you don’t use Boot from SAN, you really should use a dedicated initiator group for your Veeam proxies, so that you don’t suddenly stumble upon this issue in the future.
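If you want to sanity-check an existing environment, the audit itself is simple: flag any initiator group that contains both an ESXi initiator and a Veeam proxy initiator. Below is a rough Python sketch of that logic. It assumes you have already exported group membership from your array by whatever means your vendor provides; the group names and initiator identifiers are hypothetical.

```python
def shared_initiator_groups(groups, esxi_initiators, proxy_initiators):
    """Find initiator groups containing both ESXi hosts and Veeam proxies.

    `groups` maps a group name to an iterable of initiator identifiers
    (IQNs or WWPNs). All inputs are assumed to be exported by hand;
    no storage array API is called here.
    """
    esxi = set(esxi_initiators)
    proxies = set(proxy_initiators)

    risky = {}
    for name, members in groups.items():
        members = set(members)
        esxi_members = members & esxi
        proxy_members = members & proxies
        # A group is risky only when both kinds of initiator share it.
        if esxi_members and proxy_members:
            risky[name] = {
                "esxi": sorted(esxi_members),
                "proxies": sorted(proxy_members),
            }
    return risky


# Hypothetical export of initiator group membership.
groups = {
    "ig-vmware": ["iqn.esx01", "iqn.esx02", "iqn.proxy01"],
    "ig-veeam": ["iqn.proxy02"],
}
print(shared_initiator_groups(
    groups,
    esxi_initiators=["iqn.esx01", "iqn.esx02"],
    proxy_initiators=["iqn.proxy01", "iqn.proxy02"],
))
```

Any group this reports is one where a snapshot presented for a proxy would also land on the ESXi hosts.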