Facebook investigates silent data corruption by CPU

MicoolPaul
On the path to Greatness
20 April 2022

Thanks for sharing @vNote42, I love seeing this kind of stuff as it highlights just how complex these processors are.

For example, there’s a game emulation project called “Dolphin” that emulates the Nintendo GameCube and Nintendo Wii, and every now and again I read their progress reports. One of their outstanding bugs affected online multiplayer between real Wii consoles and Dolphin: under specific scenarios the game would desync between the two endpoints and roll back to before the desync.

After a huge amount of investigation, they found the issue was related to the ‘nmadd’ and ‘nmsub’ instructions. These are negated versions of ‘madd’ and ‘msub’: they flip the sign of the result within the same instruction. Each one really performs a sequence of mathematical operations in a single instruction. But that apparent simplicity can cause problems.

It turns out the desync was caused by the negation being performed at a different step within the sequence depending on the CPU architecture.

Focusing on nmadd instructions:

The Wii’s PowerPC architecture would do: -(A * C ± B)

When emulating on an x86-64 processor though, the equation is: -(A * C) ± B

And finally, when using AArch64, the equation is: -(A * C) - B (which is actually x86-64’s nmsub equation, so who knows what’s going on here!)

The full breakdown can be found at https://dolphin-emu.org/blog/2021/09/07/dolphin-progress-report-august-2021/ but the root cause was that PowerPC would give a result of -0 where x86-64 and AArch64 would give +0.
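
To make the -0 versus +0 point concrete, here is a minimal Python sketch. It is not Dolphin's code and it does not use fused multiply-add; it only assumes ordinary IEEE 754 doubles in the default round-to-nearest mode, and the function names are made up for illustration. It shows that negating after the add and negating before the subtract agree mathematically yet produce zeros with different sign bits:

```python
import math
import struct


def negate_after_add(a: float, c: float, b: float) -> float:
    """PowerPC-style nmadd ordering from the post: -(A * C + B)."""
    return -(a * c + b)


def negate_before_subtract(a: float, c: float, b: float) -> float:
    """AArch64-style ordering from the post: -(A * C) - B."""
    return -(a * c) - b


def raw_bits(x: float) -> str:
    """Hex dump of the 64-bit pattern, as a byte-level comparison would see it."""
    return struct.pack(">d", x).hex()


# Inputs chosen so that A * C and B cancel exactly.
a, c, b = 2.0, 3.0, -6.0
ppc = negate_after_add(a, c, b)          # -(6.0 + -6.0)
arm = negate_before_subtract(a, c, b)    # -6.0 - (-6.0)

print(ppc == arm)                                        # numeric comparison
print(math.copysign(1.0, ppc), math.copysign(1.0, arm))  # sign of each zero
print(raw_bits(ppc), raw_bits(arm))                      # underlying bit patterns
```

Code that compares the two results with == treats them as identical, while anything that hashes or transmits the raw bytes does not, which is the kind of mismatch that can surface as a desync.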

The reason I mention this is that these tiny, theoretically equivalent values can cause all sorts of problems in code, and when code is manipulating data, such as the compression @vNote42 talks about, it’s easy to see how corruption can occur. There truly is no substitute for backup validation.
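
As a rough illustration of what validation means at its simplest, the Python sketch below hashes a source file and a copy of it and compares the digests. It is not how Veeam or any other backup product implements verification; the function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read until read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source: Path, copy: Path) -> bool:
    """Return True only if both files produce the same SHA-256 digest."""
    return sha256_of(source) == sha256_of(copy)
```

A single flipped bit anywhere in the copy changes the digest, so a check like this catches corruption that a file-size or timestamp comparison would miss.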

My final story to add to this is the Intel Pentium FDIV bug. When Intel were producing the original P5 Pentiums, 5 of the 1,066 array cells were not correctly downloaded into the etching equipment, so those cells were etched as 0 instead of +2, causing floating-point division inaccuracies.

With Intel’s track record of making a bad problem worse, they acknowledged the flaw but originally claimed the end user would have to prove they were affected by it. This led to significant bad press and to Intel offering to replace all impacted processors. But of course, Intel being Intel, they still said that OEMs and resellers couldn’t participate: the end user had to make the claim themselves…

Read more about this here: https://en.wikipedia.org/wiki/Pentium_FDIV_bug
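
For anyone curious, the division that became the standard test for this flaw is easy to try. The operands below are the widely quoted test case (they are not from the post above), and the function name is made up for the example. With a correctly rounded IEEE 754 division the residual is exactly zero; affected Pentiums were reported to return 256 instead.

```python
def fdiv_residual(x: float = 4195835.0, y: float = 3145727.0) -> float:
    """Return x - (x / y) * y, which is 0.0 when the division is correct."""
    return x - (x / y) * y


print(fdiv_residual())
```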

vNote42
Author
Veeam Vanguard
20 April 2022

Cool information, @MicoolPaul !

There are also examples that more people may run into. For example, in PowerShell, foreach (the statement) and ForEach-Object (the cmdlet) do not always return the same thing! See here - search for “Attention”

I can remember the Pentium bug; it was a topic in my first job interview 👴🏼

Chris.Childerhose
Veeam Legend, Veeam Vanguard
20 April 2022

Thanks for sharing, definitely a different topic.

dips
Veeam Legend
20 April 2022

Thanks @vNote42 & @MicoolPaul for sharing. Really interesting topics. I’d always thought corruption would be caused at the storage and memory level due to the dynamic nature of the data being processed, but it just goes to show it can happen anywhere.

vNote42
Author
Veeam Vanguard
20 April 2022

… as Michael said: it is complex …

dips
Veeam Legend
20 April 2022

Yep, it is indeed

marcofabbri
Veeam Legend
20 April 2022

This is outstanding. When I read Dolphin my mind just slipped to the virus 😅

Google is investigating this too:

https://www.reddit.com/r/sysadmin/comments/nrypj6/cpus_silent_errors_causing_problems_for_google/

Inder
Experienced User
22 April 2022

Thanks for sharing @vNote42
