Skip to main content

Facebook investigates silent data corruption by CPU


vNote42
Forum|alt.badge.img+13
  • On the path to Greatness
  • 1246 comments

You can think what you want about Facebook, but they operate a massive amount of compute and storage unites. And it seems, they have some scientist investigating interesting phenomenon like this. To be honest, it is not primary backup related but in the second instance it is restore related 😀

Briefly summarized:

They found out, CPU can perform computations incorrectly. So it comes to silent data corruption by CPU. According to their observations these failures are reproducible and not transient. When you think about data-reduction technologies like compression this really can cause problems. As the following article describes, these corruptions occur at scale. Interesting is also how they test their hardware.

Read more details here:

https://blocksandfiles.com/2022/03/18/facebook-investigates-silent-data-corruption/

8 comments

MicoolPaul
Forum|alt.badge.img+23
  • 2362 comments
  • April 20, 2022

Thanks for sharing @vNote42, I love seeing this kind of stuff as it highlights just how complex these processors are.

 

For example, there’s a game emulation project called “Dolphin” that emulates the Nintendo Gamecube & Nintendo Wii, every now and again I see their progress reports. One of their outstanding bugs was online multiplayer between Wii consoles and Dolphin emulation, under specific scenarios the game would desync between the two endpoints and rollback to before the desync.

 

After a huge amount of investigation, they found the issue was related to the ‘nmadd’ and ‘nmsub’ instructions. These are negations of ‘madd’ and ‘msub’, changing a positive to a negative or vice versa within the same instruction. These instructions are really a sequence of mathematical equations in a single instruction. But with simplicity can cause

Turns out that the reason for the desync was due to the negation being performed at a different step within the sequence depending on CPU architecture.

Focusing on nmadd instructions:

The Wii’s PowerPC architecture would do: -(A * C ± B)

When emulating on an x86-64 processor though, the equation is: -(A * C) ± B

And finally, when using AArch64, the equation is: -(A * C) - B  - (Which is actually x86-64’s nmsub equation so who knows what’s going on here!)

 

The full breakdown of this can be found at https://dolphin-emu.org/blog/2021/09/07/dolphin-progress-report-august-2021/ but the root cause of the issue was identified that PowerPC would give a result of -0 but x86-64 and AArch64 would give +0.

 

The reason I mention this is because these tiny, theoretically indifferent values, can cause all sorts of problems in code, and when code is manipulating data, such as the compression @vNote42 talks about, it’s easy to see how corruption can occur. There truly is no substitute for backup validation.

 

My final story to add to this, is the Intel Pentium FDIV bug. When Intel were producing the original P5 Pentiums, 5 of the 1,066 array cells were incorrectly downloaded into the etching equipment, resulting in these values being etched at 0 instead of +2. Causing floating point inaccuracies.

With Intel’s track record of making a bad problem worse, they acknowledged the flaw but originally claimed the end-user would have to prove they were affected by this. Leading to significant bad press and Intel offering to replace all impacted processors, but of course, Intel being Intel, still said that OEMs and resellers couldn’t participate in this, the end user had to make the claim themselves…

Read more about this here: https://en.wikipedia.org/wiki/Pentium_FDIV_bug


vNote42
Forum|alt.badge.img+13
  • Author
  • On the path to Greatness
  • 1246 comments
  • April 20, 2022

Cool information, @MicoolPaul ! 

There are also examples more people can suffer from. For example PowerShell: foreach (statement) and ForEach-Object (cmdlet) does not always return the same stuff! See here - search for “Attention”

Can remember the Pentium-bug, it was a topic of my first job interview 👴🏼


Chris.Childerhose
Forum|alt.badge.img+21
  • Veeam Legend, Veeam Vanguard
  • 8512 comments
  • April 20, 2022

Thanks for sharing definitely a different topic.


dips
Forum|alt.badge.img+7
  • Veeam Legend
  • 808 comments
  • April 20, 2022

Thanks @vNote42 & @MicoolPaul for sharing. Really interesting topics. I’ve always thought corruption would always be caused at the storage and memory level due to the dynamic nature or data being processed but just goes to show it can happen anywhere. 


vNote42
Forum|alt.badge.img+13
  • Author
  • On the path to Greatness
  • 1246 comments
  • April 20, 2022
dips wrote:

Thanks @vNote42 & @MicoolPaul for sharing. Really interesting topics. I’ve always thought corruption would always be caused at the storage and memory level due to the dynamic nature or data being processed but just goes to show it can happen anywhere. 

… as Michael said: it is complex … 


dips
Forum|alt.badge.img+7
  • Veeam Legend
  • 808 comments
  • April 20, 2022
vNote42 wrote:
dips wrote:

Thanks @vNote42 & @MicoolPaul for sharing. Really interesting topics. I’ve always thought corruption would always be caused at the storage and memory level due to the dynamic nature or data being processed but just goes to show it can happen anywhere. 

… as Michael said: it is complex … 

Yep, it is indeed


marcofabbri
Forum|alt.badge.img+13
  • On the path to Greatness
  • 990 comments
  • April 20, 2022
MicoolPaul wrote:

Thanks for sharing @vNote42, I love seeing this kind of stuff as it highlights just how complex these processors are.

 

For example, there’s a game emulation project called “Dolphin” that emulates the Nintendo Gamecube & Nintendo Wii, every now and again I see their progress reports. One of their outstanding bugs was online multiplayer between Wii consoles and Dolphin emulation, under specific scenarios the game would desync between the two endpoints and rollback to before the desync.

 

After a huge amount of investigation, they found the issue was related to the ‘nmadd’ and ‘nmsub’ instructions. These are negations of ‘madd’ and ‘msub’, changing a positive to a negative or vice versa within the same instruction. These instructions are really a sequence of mathematical equations in a single instruction. But with simplicity can cause

Turns out that the reason for the desync was due to the negation being performed at a different step within the sequence depending on CPU architecture.

Focusing on nmadd instructions:

The Wii’s PowerPC architecture would do: -(A * C ± B)

When emulating on an x86-64 processor though, the equation is: -(A * C) ± B

And finally, when using AArch64, the equation is: -(A * C) - B  - (Which is actually x86-64’s nmsub equation so who knows what’s going on here!)

 

The full breakdown of this can be found at https://dolphin-emu.org/blog/2021/09/07/dolphin-progress-report-august-2021/ but the root cause of the issue was identified that PowerPC would give a result of -0 but x86-64 and AArch64 would give +0.

 

The reason I mention this is because these tiny, theoretically indifferent values, can cause all sorts of problems in code, and when code is manipulating data, such as the compression @vNote42 talks about, it’s easy to see how corruption can occur. There truly is no substitute for backup validation.

 

My final story to add to this, is the Intel Pentium FDIV bug. When Intel were producing the original P5 Pentiums, 5 of the 1,066 array cells were incorrectly downloaded into the etching equipment, resulting in these values being etched at 0 instead of +2. Causing floating point inaccuracies.

With Intel’s track record of making a bad problem worse, they acknowledged the flaw but originally claimed the end-user would have to prove they were affected by this. Leading to significant bad press and Intel offering to replace all impacted processors, but of course, Intel being Intel, still said that OEMs and resellers couldn’t participate in this, the end user had to make the claim themselves…

Read more about this here: https://en.wikipedia.org/wiki/Pentium_FDIV_bug

This is outstanding. When I read Dolphin my mind just slipped to the virus 😅

Google too is investigating about that:

https://www.reddit.com/r/sysadmin/comments/nrypj6/cpus_silent_errors_causing_problems_for_google/


Forum|alt.badge.img+4
  • Experienced User
  • 576 comments
  • April 22, 2022

Thanks for sharing @vNote42