Facebook investigates silent data corruption by CPU


Userlevel 7
Badge +13

You can think what you want about Facebook, but they operate a massive number of compute and storage units. And it seems they have some scientists investigating interesting phenomena like this. To be honest, this is not primarily backup-related, but on second look it is restore-related 😀

Briefly summarized:

They found that CPUs can perform computations incorrectly, which results in silent data corruption. According to their observations, these failures are reproducible and not transient. When you think about data-reduction technologies like compression, this really can cause problems. As the following article describes, these corruptions occur at scale. How they test their hardware is also interesting.
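To make the compression point a bit more concrete, here is a minimal sketch of a round-trip check. It is my own illustration, not what Facebook runs: the function names `compress_with_digest` and `restore_verified` and the sample payload are made up, and zlib plus SHA-256 are just stand-ins for whatever a real product uses.

```python
import hashlib
import zlib


def compress_with_digest(data: bytes) -> tuple[bytes, bytes]:
    """Compress data and return (compressed bytes, SHA-256 of the original)."""
    digest = hashlib.sha256(data).digest()
    packed = zlib.compress(data)
    # Decompress right away: a miscomputed compression shows up here,
    # not months later during a restore.
    if hashlib.sha256(zlib.decompress(packed)).digest() != digest:
        raise RuntimeError("round-trip mismatch right after compression")
    return packed, digest


def restore_verified(packed: bytes, digest: bytes) -> bytes:
    """Decompress and refuse to return data that does not match the digest."""
    restored = zlib.decompress(packed)
    if hashlib.sha256(restored).digest() != digest:
        raise RuntimeError("restored data does not match the stored digest")
    return restored


payload = b"backup block " * 1000  # made-up sample data
packed, digest = compress_with_digest(payload)
print(restore_verified(packed, digest) == payload)  # True
```

Since the failures are described as reproducible, a check that runs on the same faulty core could in principle repeat the same mistake, so verifying on a different machine is probably the safer variant.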

Read more details here:

https://blocksandfiles.com/2022/03/18/facebook-investigates-silent-data-corruption/


8 comments

Userlevel 7
Badge +20

Thanks for sharing @vNote42, I love seeing this kind of stuff as it highlights just how complex these processors are.

 

For example, there’s a game emulation project called “Dolphin” that emulates the Nintendo GameCube and Nintendo Wii, and every now and again I read their progress reports. One of their outstanding bugs affected online multiplayer between real Wii consoles and Dolphin: under specific scenarios the game would desync between the two endpoints and roll back to the state before the desync.

 

After a huge amount of investigation, they found the issue was related to the ‘nmadd’ and ‘nmsub’ instructions. These are negated versions of ‘madd’ and ‘msub’, flipping the sign of the result within the same instruction. Each of these instructions is really a sequence of mathematical operations packed into a single instruction. But with simplicity can cause …

It turns out the desync was caused by the negation being performed at a different step within the sequence, depending on the CPU architecture.

Focusing on nmadd instructions:

The Wii’s PowerPC architecture would do: -(A * C ± B)

When emulating on an x86-64 processor though, the equation is: -(A * C) ± B

And finally, when using AArch64, the equation is: -(A * C) - B (which is actually x86-64’s nmsub equation, so who knows what’s going on here!)

 

The full breakdown of this can be found at https://dolphin-emu.org/blog/2021/09/07/dolphin-progress-report-august-2021/ but the root cause was identified as PowerPC giving a result of -0 where x86-64 and AArch64 would give +0.
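Here is a small sketch of how the two orderings can disagree on the sign of zero. It is plain Python arithmetic, not Dolphin’s code and not the real fused instructions (those round only once), and the function names and input values are my own. I compare the PowerPC form with the -(A * C) - B form because those two are algebraically the same, so any difference comes purely from where the negation happens.

```python
import math
import struct


def nmadd_powerpc(a: float, c: float, b: float) -> float:
    # PowerPC form from the post: add first, negate last.
    return -(a * c + b)


def nmadd_negate_first(a: float, c: float, b: float) -> float:
    # -(A * C) - B form from the post: negate the product, then subtract.
    return -(a * c) - b


# Inputs picked so that A * C + B is exactly zero.
a, c, b = 1.0, 1.0, -1.0
ppc = nmadd_powerpc(a, c, b)
other = nmadd_negate_first(a, c, b)

print(ppc == other)                      # True, -0.0 and +0.0 compare equal
print(math.copysign(1.0, ppc))           # -1.0, the sign bit is set
print(math.copysign(1.0, other))         # 1.0
print(struct.pack(">d", ppc).hex())      # 8000000000000000
print(struct.pack(">d", other).hex())    # 0000000000000000
```

The two results compare equal, yet their bytes differ, which is exactly the kind of difference that a checksum, a compressor, or a netplay state comparison will notice.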

 

The reason I mention this is that these tiny, theoretically insignificant differences can cause all sorts of problems in code. When that code is manipulating data, as with the compression @vNote42 talks about, it’s easy to see how corruption can occur. There truly is no substitute for backup validation.

 

My final story to add to this is the Intel Pentium FDIV bug. When Intel was producing the original P5 Pentiums, 5 of the 1,066 array cells were incorrectly downloaded into the etching equipment, resulting in these values being etched as 0 instead of +2 and causing floating-point inaccuracies.

With Intel’s track record of making a bad problem worse, they acknowledged the flaw but originally claimed the end user would have to prove they were affected by it. This led to significant bad press and to Intel offering to replace all impacted processors. But of course, Intel being Intel, they still said that OEMs and resellers couldn’t participate in this, so the end user had to make the claim themselves…

Read more about this here: https://en.wikipedia.org/wiki/Pentium_FDIV_bug
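For anyone who wants to see what the FDIV bug looked like in practice, this is the well-known division check from back then, written as a small Python sketch. The function name `fdiv_residual` is my own; the two operands are the famous test pair.

```python
def fdiv_residual(x: float = 4195835.0, y: float = 3145727.0) -> float:
    """Return x - (x / y) * y, which should be 0 if the division is correct."""
    return x - (x / y) * y


# On a correct FPU this prints 0.0; flawed Pentiums famously produced 256.
print(fdiv_residual())
```

It is the same idea as the hardware testing in the original post: run a computation with a known answer and see whether the CPU agrees.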

Userlevel 7
Badge +13

Cool information, @MicoolPaul ! 

There are also examples that more people are likely to run into. Take PowerShell: foreach (the statement) and ForEach-Object (the cmdlet) do not always return the same results! See here - search for “Attention”

I remember the Pentium bug; it was a topic in my first job interview 👴🏼

Userlevel 7
Badge +20

Thanks for sharing, definitely a different topic.

Userlevel 7
Badge +7

Thanks @vNote42 & @MicoolPaul for sharing. Really interesting topics. I’ve always thought corruption would be caused at the storage and memory level, due to the dynamic nature of the data being processed, but it just goes to show it can happen anywhere.

Userlevel 7
Badge +13

… as Michael said: it is complex … 

Userlevel 7
Badge +7

Yep, it is indeed

Userlevel 7
Badge +13

This is outstanding. When I read Dolphin my mind just slipped to the virus 😅

Google is investigating this too:

https://www.reddit.com/r/sysadmin/comments/nrypj6/cpus_silent_errors_causing_problems_for_google/

Userlevel 7
Badge +4

Thanks for sharing @vNote42
