Does an ESXi host survive the loss of a persistent boot device?


Yes, this is not a backup topic. But because we already discussed the reason for my testing here, I thought it could be interesting.

 

In this post I investigate what happens when a VMware vSphere ESXi host loses its boot device. In this test, the device is a persistent one. For non-persistent devices such as USB sticks and SD cards, the behavior is quite clear: the whole ESXi OS runs in memory, and no mass-write operations should be directed to the device. When it breaks, ESXi does not miss it and keeps running.

With a persistent device, I was convinced that ESXi would die when the device broke. BUT: ESXi survives. The behavior is not as clean as with non-persistent devices, but the host survived.

 

Reason for testing

There is a concrete reason for this testing. I want to answer the question of whether it is safe to boot an ESXi host from a single disk: no RAID, just a single disk connected to an HBA. This would be an additional option for the ESXi boot device. Why? Because VMware and server vendors no longer recommend using non-persistent boot devices. And there are more reasons for this:

USB devices and SD cards tend to fail within a rather short time after installing or upgrading to vSphere 7. Therefore, other devices that are persistent and durable are needed. This post answers the question of whether a single disk is an option: yes, it is!

 

Positive test results

  • To exclude a temporary phenomenon, I kept the host running for more than 10 days with a broken boot device.
  • vMotion was available at any time.
  • vSAN Cluster continued to function and kept using physical disks of this host.
  • Backing up VMs using NBD aka. Appliance Mode worked without problems.
  • I was able to start and stop services (like SSH).

 

What does it look like

It is quite unspectacular when the disk is broken. Everything continues to work. There is a warning in vCenter about the lost connectivity.

Logs are no longer written, so the latest entries show approximately the time of failure. This happens because the log locations are no longer available. The device shows a dead status when you inspect it with esxcfg-scsidevs.
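To make this status check scriptable, here is a minimal Python sketch that scans the text output of esxcfg-scsidevs -l for devices reported as dead. It assumes that each device block starts with an unindented identifier line and contains a "Status: <state>" field. The function name, the sample listing and the device names are hand-written for illustration and were not captured from a real host.

```python
from __future__ import annotations

import re


def find_dead_devices(listing: str) -> list[str]:
    """Return identifiers of device blocks that contain 'Status: dead'.

    Assumption: an unindented line starts a new device block and the
    indented lines below it describe that device.
    """
    dead = []
    current = None
    for line in listing.splitlines():
        if not line.strip():
            continue  # skip blank lines between blocks
        if not line[0].isspace():
            current = line.strip()  # new device block begins
            continue
        match = re.search(r"Status:\s*(\w+)", line)
        if match and current and match.group(1).lower() == "dead":
            dead.append(current)
    return dead


# Hypothetical sample text, written by hand for this sketch.
SAMPLE = """\
mpx.vmhba0:C0:T0:L0
   Device Type: Direct-Access
   SCSI Level: 2  Is Pseudo: false Status: dead
mpx.vmhba0:C0:T1:L0
   Device Type: Direct-Access
   SCSI Level: 2  Is Pseudo: false Status: on
"""
```

Calling find_dead_devices(SAMPLE) returns only the first identifier. If your real output is laid out differently, adjust the block detection accordingly.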

And what is shown when looking at mount point usage with the df command? Errors:

A look into the root directory does not look good either. On the left side the disk is broken; on the right side no failure was introduced.
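The df check can be approximated programmatically. The following is a minimal Python sketch, assuming a Python interpreter is available where the check runs. It queries each path with os.statvfs and records an error instead of aborting when a location is not reachable. The function name and the example paths are my own choices for illustration and may differ on your host.

```python
from __future__ import annotations

import os


def check_mount_points(paths: list[str]) -> dict[str, str]:
    """Report free space per path, or the OS error if the path fails."""
    results = {}
    for path in paths:
        try:
            stats = os.statvfs(path)
        except OSError as exc:
            # A missing or failed location ends up here instead of
            # stopping the whole check.
            results[path] = f"error: {exc.strerror}"
        else:
            free_mib = stats.f_bavail * stats.f_frsize // (1024 * 1024)
            results[path] = f"ok: {free_mib} MiB free"
    return results
```

For example, check_mount_points(["/bootbank", "/scratch"]) would return one entry per path. Both paths are only examples here.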

 

Keep in mind

  • Booting a host from a single disk is still not a recommended option. As you can see, it works, but it is not appropriate for every environment.
  • A spinning hard disk has a higher probability of failure than an SSD. Therefore I would highly recommend using an SSD in such a configuration.
  • If you implement such a solution without redundancy, I recommend scheduling an ESXi configuration backup script. With it you can easily restore the configuration when the device breaks. You can use the PowerCLI command Get-VMHostFirmware with the -BackupConfiguration parameter to do so.
    Note: If the boot device has already failed, a backup can no longer be made!
  • After the device is gone, you can no longer change advanced settings. So it is not possible to redirect logging to another disk or to an external log server.
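For the scheduled configuration backup mentioned above, here is a minimal Python sketch of a wrapper that a scheduler such as cron could call. It builds a small PowerCLI script around Get-VMHostFirmware -BackupConfiguration and passes it to PowerShell. Assumptions: pwsh and the VMware PowerCLI module are installed on the machine that runs it, Connect-VIServer can log in without prompting (for example via stored credentials), and all function names, host names and paths are hypothetical placeholders.

```python
from __future__ import annotations

import subprocess


def build_backup_script(vcenter: str, esxi_host: str, destination: str) -> str:
    """Build the PowerCLI commands that back up one host's configuration."""

    def quote(value: str) -> str:
        # PowerShell single-quoted strings escape ' by doubling it.
        return "'" + value.replace("'", "''") + "'"

    return "; ".join([
        f"Connect-VIServer -Server {quote(vcenter)} | Out-Null",
        f"Get-VMHostFirmware -VMHost {quote(esxi_host)} "
        f"-BackupConfiguration -DestinationPath {quote(destination)}",
    ])


def run_backup(vcenter: str, esxi_host: str, destination: str) -> int:
    """Run the backup through pwsh and return its exit code."""
    script = build_backup_script(vcenter, esxi_host, destination)
    completed = subprocess.run(["pwsh", "-NoProfile", "-Command", script])
    return completed.returncode
```

A scheduled job could then call run_backup("vcenter.example.local", "esx01.example.local", "/backups/esx01") and alert on a non-zero return code. Keep the resulting bundle somewhere other than the host itself, since the backup cannot be made once the boot device has failed.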

 

Conclusion

When it comes to vSphere 7, boot devices such as USB drives and SD cards are to be avoided. In this post I have shown that it is safe to boot from a single disk if you consider a few points. My order of recommended boot devices is therefore:

  1. RAID 1 of SSDs (preferred) or HDDs.
  2. Server vendor’s recommended boot device, like HPE’s OS Boot Devices.
  3. Single disk (SSD preferred) as shown here.

 

Notes

  • The tested ESXi version was 7 U1.
  • I tested in a fully virtualized environment. But because the boot device of the ESXi VM is represented as a VMDK, it is comparable to a disk in a physical server.
  • When troubleshooting SAN storage, my post about Permanent Device Loss could help.

9 comments

Userlevel 7
Badge +3

Extremely interesting post.  I have a homelab with 4 NUCs and they boot from USB devices for the OS as I use the SSDs (2) inside for vSAN and vFlash Cache.  I never like using SD cards at all even if there were two of them in mirror mode like some servers have.  They tend to fail much faster than most other devices.  Always loved HDD or better SSD for booting.  Better performance as well over SD/USB devices.

Also like the tip about the backup configuration of a host and will take a look at that.  At least all my VMs run on a Synology NAS so if a host goes down it is easy to rebuild.  :joy:

Userlevel 7
Badge +5

Yes, just re-install and restore config-backup :thumbsup_tone3:

Userlevel 7
Badge +4

...

For non-persistent devices like USB- and SD-card, behavior is quite clear: whole ESXi OS runs in memory, no mass-write operations should be directed to the device. When it breaks, ESXi isn’t missing it and keeps running.

...

Unfortunately it is not that clear with non-persistent devices.

In one of our environments, two SD cards died after the upgrade to ESXi 7.02a in a ten-host cluster. The two servers with the failed SD cards became unresponsive. The VMs running on these hosts kept working but could not be moved away from the two hosts. vSAN had some problems after the two unresponsive servers were taken out of the cluster. We had to restore some, but not all, of the VMs.

I was glad that we had backups of all of these VMs, otherwise there would have been data loss.

Userlevel 7
Badge +3

Thanks for posting your results @vNote42 . It looks like the future is SSDs in RAID1, as they tend to be the most stable.

But, just like @JMeixner we've had mixed results with failing SD cards. In many cases a failed SD card also took down the ESXi management services; VMs were still running but you couldn't migrate them anymore.

 

Userlevel 7
Badge +4

Yes, we will switch to SSDs in RAID1. Seems to be the most reliable option at the moment.

Userlevel 7
Badge +2

Great post @vNote42! I’m not a big fan of using SD/USB in a production environment. @Chris.Childerhose , your homelab is not much different from mine ;-). I also use several NUCs and a Synology NAS, but not all my VMs run on the shared Synology NAS. Some run locally on SSD and NVMe in the NUCs, and I use Veeam replication to other NUCs for the most critical VMs :-). I used USB sticks for ESXi before, but one of them broke. I then installed ESXi on the SSD, which is also used as a VMFS datastore. I love using NUCs for a home lab: great performance and low energy cost!

Userlevel 7
Badge

@vNote42 : Always like your posts, amazing !

Userlevel 7
Badge +3

Nice!!

Userlevel 7
Badge +5

Unfortunately it is not that clear with non-persistent devices.

In one of our environments, two SD cards died after the upgrade to ESXi 7.02a in a ten-host cluster. The two servers with the failed SD cards became unresponsive. The VMs running on these hosts kept working but could not be moved away from the two hosts. vSAN had some problems after the two unresponsive servers were taken out of the cluster. We had to restore some, but not all, of the VMs.

I was glad that we had backups of all of these VMs, otherwise there would have been data loss.

That does not sound great, agreed! Normally I have not run into such problems with failed SD cards. Maybe this can be explained by “the exception proves the rule”? :grin:

But with vSAN, other rules were already in place:

https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.vsphere.vsan-planning.doc/GUID-B09CE19D-A3F6-408C-AE69-35F65CBE66E1.html

 
