World Backup Day 2024: Being SMART and Predicting Hardware Failures

Hello again! On to my second post for World Backup Day, this time looking at the moments leading up to a disaster. In this post we’re going to look at ways to proactively prevent failures. After all, the best backups are the (tested) ones that didn’t need to be used because everything is fine.

Let’s kick this off with a trip down memory lane. In the days when all storage media was a traditional hard drive with spinning disks, it was typically pretty obvious when a hard drive was failing. We could hear issues such as failed bearings or the scratching of a read/write head, as these were mechanical faults that were developing. However, flash storage doesn’t contain any mechanical parts, and being unable to hear problems developing can result in people being taken by surprise when storage fails.

Nearly 30 years ago, Compaq released S.M.A.R.T. into the public domain, providing a standard of storage health for all vendors. S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology, often stylised as SMART) continuously monitors storage metrics to determine overall device health and probability of failure.

There are various tools available to monitor this device health system for yourself. If you’re using a SAN or NAS device, it will typically gather the SMART metrics continuously from each individual storage device to proactively report issues. If you are utilising a server-grade system, it will typically include some form of “out of band management” such as HPE’s iLO or Dell’s iDRAC. Centralised aggregation of metrics and management for fleets of servers exists in solutions such as HPE’s InfoSight or Dell’s OpenManage, and these solutions will typically monitor these metrics for you as well.

Finally we come to the endpoints. Whilst OEMs have started to include their own software to read and report on these values, if your device doesn’t include such software you’re not out of luck yet! Packages such as smartmontools exist for Ubuntu to monitor SMART health for each drive, whilst on Windows free third-party tools exist such as CrystalDiskInfo. Although I mention these tools for reference, perform your own due diligence before acquiring/using any software.
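For anyone who would rather script an endpoint check, here is a minimal Python sketch. It assumes smartmontools 7.0 or newer is installed (for smartctl's `--json` output) and that the script runs with enough privileges to query the drive. The function names and the choice of watched attributes are my own illustration, and the JSON keys reflect my understanding of smartctl's output, so verify them against a report from your own drive.

```python
import json
import subprocess


def read_smart_report(device: str) -> dict:
    """Return smartctl's JSON report for one device.

    Assumes smartmontools 7.0+ (for --json) and sufficient privileges.
    smartctl uses its exit status as a bitmask of findings, so a non-zero
    status is not treated as an error here.
    """
    result = subprocess.run(
        ["smartctl", "--json", "-H", "-A", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return json.loads(result.stdout)


# Attributes commonly watched on SATA drives; this selection is illustrative.
WATCHED_ATTRIBUTES = ("Reallocated_Sector_Ct", "Current_Pending_Sector")


def summarise(report: dict) -> dict:
    """Reduce a smartctl JSON report to a few health indicators."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    raw_values = {row["name"]: row["raw"]["value"] for row in table}
    nvme_log = report.get("nvme_smart_health_information_log", {})
    return {
        # True/False from the drive's own self-assessment, None if absent.
        "passed": report.get("smart_status", {}).get("passed"),
        "watched": {
            name: raw_values[name]
            for name in WATCHED_ATTRIBUTES
            if name in raw_values
        },
        # NVMe drives report wear as a percentage instead of ATA attributes.
        "nvme_percentage_used": nvme_log.get("percentage_used"),
    }
```

On a Linux machine this could be called as `summarise(read_smart_report("/dev/sda"))`, where the device path is only an example.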

By leveraging SMART, we can get notified when our storage is starting to fail and proactively replace it before it does, reducing the risk of unexpected outages and downtime. How you receive notifications will depend on how you are accessing your SMART metrics. Endpoint applications will typically send emails or play audible alerts, whereas iLO/iDRAC and SAN/NAS devices can provide email alerts, SNMP notifications and syslog outputs. I strongly recommend configuring these notifications to provide some level of assurance you’ll be notified if a device is reaching the end of its useful life.
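If your tooling doesn’t send notifications for you, the idea can be sketched in a few lines of Python. This is only an illustration: the function name, the email addresses and the reallocated-sector limit are all hypothetical, and real monitoring tools use their own (often vendor-specific) thresholds.

```python
from email.message import EmailMessage
from typing import Optional


def build_alert(
    device: str,
    passed: Optional[bool],
    reallocated_sectors: int,
    max_reallocated: int = 0,  # hypothetical limit, tune for your drives
) -> Optional[EmailMessage]:
    """Return an email describing the problem, or None if nothing looks wrong."""
    reasons = []
    if passed is False:
        reasons.append("overall SMART health check failed")
    if reallocated_sectors > max_reallocated:
        reasons.append(
            f"{reallocated_sectors} reallocated sectors (limit {max_reallocated})"
        )
    if not reasons:
        return None

    message = EmailMessage()
    message["Subject"] = f"SMART warning for {device}"
    message["From"] = "smart-monitor@example.com"  # hypothetical sender
    message["To"] = "admin@example.com"  # hypothetical recipient
    message.set_content("Reasons:\n" + "\n".join(f"- {r}" for r in reasons))
    return message
```

The returned message could then be handed to `smtplib.SMTP(...).send_message(message)` from a scheduled task, assuming you have a mail relay available. Building the message separately from sending it keeps the decision logic easy to test.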

Beyond SMART, however, we can use metrics from other areas to detect storage failures. This next example comes straight from my own machine recently. I was noticing unexpectedly poor performance from my device, despite it being a reasonable specification system with only flash storage. Light IO workloads were causing the PC to become completely unresponsive, and tasks as simple as opening ‘Task Manager’ were taking the best part of a minute to execute. My CPU utilisation was high, but it wasn’t actually doing anything; it was in a ready & waiting state.

I started to look at the storage itself. The SMART health was showing around 60%, so I suspected I might be seeing the start of a problem here. Upon further investigation I found that when I was interacting with the flash storage of my boot volume, I was seeing IO latency reaching 4-5 seconds, far higher than the microseconds that it used to respond within, and far slower than I would expect from a typical spinning disk under low utilisation.

Armed with this, I performed a backup of my system and restored it to a brand new replacement flash device. The results were immediate: I was no longer able to replicate this storage performance impact, even under far more taxing workloads. I formatted the old device, placed it back into my system and attempted to download some files to the drive, and immediately saw IO spikes despite the downloads being the only IO taking place. I’ve since decommissioned this drive entirely from my system. Thanks to SMART I could be proactive in my migration to a new storage device, instead of being without a working PC when I may have been urgently attempting to carry out a task. Thanks SMART!
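If you want to put a rough number on this kind of latency yourself, a small Python sketch like the one below can help. It is not the tool I used above. It simply times small synchronous writes to a temporary file on the volume you point it at, so it measures write-plus-flush latency through the filesystem rather than raw device latency. The function names and the 0.5 second threshold are arbitrary choices for illustration.

```python
import os
import tempfile
import time
from typing import List


def probe_write_latency(
    directory: str, samples: int = 20, block_size: int = 4096
) -> List[float]:
    """Time small synchronous writes in `directory`; returns seconds per write."""
    payload = os.urandom(block_size)
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, payload)
            # Ask the OS to flush to the device rather than stop at the cache.
            os.fsync(fd)
            latencies.append(time.perf_counter() - start)
    finally:
        # Always clean up the temporary file, even if a write fails.
        os.close(fd)
        os.remove(path)
    return latencies


def slow_samples(
    latencies: List[float], threshold_seconds: float = 0.5
) -> List[float]:
    """Return the samples slower than the (arbitrary) threshold."""
    return [value for value in latencies if value > threshold_seconds]
```

Running `slow_samples(probe_write_latency("."))` on an otherwise idle machine should normally return an empty list on healthy flash storage. Repeated multi-second samples would be a reason to look more closely at the drive.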


Userlevel 7
Badge +6

I appreciate the short history lesson on SMART. I didn’t realize this was a Compaq innovation. And yes, when performance is low and CPU and RAM are normal, check disk queueing, latency, etc. Latency shouldn’t be more than a few hundred milliseconds typically, and is often much shorter. Times vary by device, such as individual arrays, RAID sets, storage arrays, etc.

It was amazing how often I would hear drives clicking, even in a datacenter. When I was performing remote hands work for a client in their cage, I could warn them ahead of time that a drive failure was pending.

SSDs make that all so much harder, for which I am thankful for Dell’s OpenManage/iDRAC/SupportAssist capabilities. I’m also thankful that their PowerEdge servers and storage arrays can all phone home automatically, and sometimes my first notification of a failure is Dell contacting me about an alert they received from my (or my client’s) hardware.

If you’re a Dell shop and you’re not using it, I’d highly recommend looking at their CloudIQ product for monitoring and trend analysis. With the app on my phone, I get push notifications when things are going south before clients even know what’s up.

Userlevel 7
Badge +21

Another great article, Michael, for World Backup Day tomorrow.