AMD EPYC ‘Rome’ Processors Can Crash after Nearly 3 Years Uptime



Hi everyone,

 

I thought this was a good conversation point I wanted to share, as the more I thought about it, the more my thoughts on the subject changed.

 

AMD have released an errata listing that details how their AMD EPYC Rome (7xx2) processors can, under specific conditions, crash. I’ve included the statement in its entirety below:

I’m conflicted on this. My initial reaction was “wow, they won’t fix this?!”, then I thought, well, I guess a fix would involve a new stepping, and the chips were released in 2019, so that’s not too unreasonable.

The security-conscious part of me thought, oh boy, who hasn’t patched their servers in nearly 3 years? But even that thought then pivoted: with some of the live/hot patching options now available, OS-level patching is becoming less tied to a reboot. Then there are the use cases where these servers are isolated and patching becomes less mission-critical. But it’s safe to say none of these systems are getting firmware updates in any case!

 

So, what are everyone’s thoughts on this? 🤷‍♂️




My initial reaction was “my servers don’t use AMD, so I’m good” 😂 But, tbh I do see their point of not providing a fix. Those are 5yrs old. How many who use those specific processors encountered the error/bug? I’m sure if it was more widespread early on, they would’ve had a fix out pretty quickly, though from the looks of it, it seems to take about 3yrs to even surface? 


Well at some point you should have to reboot your servers. If not for OS updates, maybe firmware updates, maintenance, replacement or something else. And if not, then you'll have a random crash at some point 😁

Edit: But that makes me wonder how you discover such issues if they're not widespread. I'm sure everyone here has had an issue where the cause couldn't be found. Maybe we see such bugs more often than we think.


I’ve found that if you’re running Linux or a Linux-based OS, rebooting is far less common. Still, most patching requires some form of rebooting, so to me it’s pretty much a non-issue. That said, hot patching is becoming a bit more common. But if after 3 years you haven’t rebooted, unless you’re in some sort of non-stop, always-on configuration, you probably need to be rebooting. If anything, an occasional reboot is a good idea, and in some cases, it’s a necessity (looking at you, Citrix). So to me, it’s a non-issue unless you’re in one of those special use cases.
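For anyone who wants to keep an eye on this, here is a minimal Python sketch that flags a host whose uptime is getting close to the limit. The thread only says "nearly 3 years"; the 1044-day default below is the figure commonly reported for this erratum and should be treated as an assumption, as should the 90-day warning window. The function names are made up for illustration, and `/proc/uptime` only exists on Linux.

```python
from pathlib import Path

# Assumption: ~1044 days is the commonly reported figure for this erratum.
# The thread itself only says "nearly 3 years", so adjust to taste.
ASSUMED_LIMIT_DAYS = 1044.0
SECONDS_PER_DAY = 86_400


def read_uptime_seconds(proc_uptime: str = "/proc/uptime") -> float:
    """Return seconds since boot on a Linux host.

    The first whitespace-separated field of /proc/uptime is the uptime
    in seconds.
    """
    return float(Path(proc_uptime).read_text().split()[0])


def days_remaining(uptime_seconds: float,
                   limit_days: float = ASSUMED_LIMIT_DAYS) -> float:
    """Days left before uptime reaches the assumed limit (negative if past it)."""
    return limit_days - uptime_seconds / SECONDS_PER_DAY


def should_schedule_reboot(uptime_seconds: float,
                           warn_days: float = 90.0,
                           limit_days: float = ASSUMED_LIMIT_DAYS) -> bool:
    """True once the host is within warn_days of the assumed limit."""
    return days_remaining(uptime_seconds, limit_days) <= warn_days
```

On a Linux box, `should_schedule_reboot(read_uptime_seconds())` could be dropped into whatever monitoring you already run, so a reboot gets planned in a maintenance window rather than happening on its own.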



Funnily enough, the article I read this originally from stated that AMD had under 40 errata listed with that generation of processors, vs nearly 200 from Intel, but we can’t see the true picture with AMD as they close errata that they’ve fixed, whether via a new stepping or firmware etc.

 

Appreciate everyone’s input on this. It reminds me of other edge cases that have appeared in the news before, such as Intel’s SSDs becoming inoperable if you enabled a BIOS drive password and then tried to change or disable it: upon reboot, the drive was dead. Or the infamous SanDisk issue whereby their SSDs would die after 40,000 power-on hours (yep, not active use, just being powered on).

 

Only a couple of weeks ago I was talking about SanDisk data loss: Using a SanDisk Extreme Portable SSD? Beware Data Corruption! | Veeam Community Resource Hub. I also recall a while ago @vNote42 talking about Apple Time Capsules having a design defect in their hard drives: Apple Time Capsule may break soon | Veeam Community Resource Hub



 

Any time we get Apple into the talks, and we’re talking about things that stop working, I have to wonder if it’s not planned obsolescence. However, it’s not fair to pick on Apple, as others do this as well, but the “batterygate” incident with Apple really opened a lot of eyes about what planned obsolescence is and how it affects folks.

Back to the point, it’s pretty amazing how even accidental bugs and quirks can cause hardware to fail because of software issues.


Yes, there have been a few times when things have slipped through, like the infamous left-handed antenna issues with the iPhone 4.

 

Having worked at Apple though, I’ve got to say, I was thoroughly impressed with their attention to detail and QA process, though I won’t be talking about that in a public forum...
