Facebook investigates silent data corruption by CPU

+23

MicoolPaul
2370 comments
3 years ago
April 20, 2022

Thanks for sharing @vNote42, I love seeing this kind of stuff as it highlights just how complex these processors are.

For example, there’s a game emulation project called “Dolphin” that emulates the Nintendo GameCube and Nintendo Wii, and every now and again I read their progress reports. One of their outstanding bugs involved online multiplayer between real Wii consoles and Dolphin: under specific scenarios the game would desync between the two endpoints and roll back to before the desync.

After a huge amount of investigation, they found the issue was related to the ‘nmadd’ and ‘nmsub’ instructions. These are negated forms of ‘madd’ and ‘msub’: they flip the sign of the result within the same instruction. Each of these instructions is really a short sequence of arithmetic operations (multiply, add or subtract, negate) packed into a single instruction. But that simplicity can cause problems.

It turns out the desync was caused by the negation being performed at a different step within that sequence depending on the CPU architecture.

Focusing on nmadd instructions:

The Wii’s PowerPC architecture would do: -(A * C ± B)

When emulating on an x86-64 processor though, the equation is: -(A * C) ± B

And finally, when using AArch64, the equation is: -(A * C) - B (which is actually x86-64’s nmsub equation, so who knows what’s going on here!)

The full breakdown can be found at https://dolphin-emu.org/blog/2021/09/07/dolphin-progress-report-august-2021/ but the root cause identified was that PowerPC would give a result of -0 where x86-64 and AArch64 would give +0.
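
To make the -0 versus +0 point concrete, here is a minimal Python sketch. It is not Dolphin's code: the function names and the input values are made up for illustration, and plain Python rounds after the multiply and again after the add, whereas the real instructions are fused. It only shows how moving the negation changes the sign of a zero result.

```python
import math
import struct


def nmadd_negate_last(a: float, c: float, b: float) -> float:
    """-(A * C + B): add first, then negate the sum (the PowerPC ordering above)."""
    return -(a * c + b)


def nmadd_negate_first(a: float, c: float, b: float) -> float:
    """-(A * C) - B: negate the product, then subtract B (the AArch64 ordering above)."""
    return -(a * c) - b


def bits(x: float) -> str:
    """Hex dump of the 64-bit IEEE 754 encoding (big-endian)."""
    return struct.pack(">d", x).hex()


# Hypothetical inputs, chosen so that A * C + B is exactly zero.
a, c, b = 1.0, 1.0, -1.0

negate_last = nmadd_negate_last(a, c, b)    # -(1.0 + -1.0) = -(+0.0) = -0.0
negate_first = nmadd_negate_first(a, c, b)  # -1.0 - (-1.0) = +0.0

# The two results compare equal, yet their bit patterns differ.
print(negate_last == negate_first)
print(bits(negate_last), bits(negate_first))
print(math.copysign(1.0, negate_last), math.copysign(1.0, negate_first))
```

An ordinary equality check treats the two zeros as the same value, but anything that compares or hashes the raw bytes (a netplay state checksum, a compressed block, a backup digest) sees two different values.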

The reason I mention this is that these tiny, theoretically insignificant differences can cause all sorts of problems in code, and when code is manipulating data, such as the compression @vNote42 talks about, it’s easy to see how corruption can occur. There truly is no substitute for backup validation.

My final story to add to this is the Intel Pentium FDIV bug. When Intel were producing the original P5 Pentiums, 5 of the 1,066 cells in the divider’s lookup array were incorrectly downloaded into the etching equipment, so those cells were etched as 0 instead of +2, causing floating-point division inaccuracies.
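
For anyone curious what "inaccuracies" looked like in practice, below is a small Python sketch of the widely circulated FDIV check. The operand pair is the well-known public test case, not something taken from the post above, and the variable names are just for illustration. On correct hardware the remainder is exactly zero; an affected Pentium is widely reported to have returned 256 instead.

```python
# Classic FDIV sanity check: divide, multiply back, and compare with the original.
x = 4195835.0
y = 3145727.0

quotient = x / y             # about 1.33382 on a correct FPU
residual = x - quotient * y  # exactly 0.0 when the division is correctly rounded

print(quotient)
print(residual)
```

The division itself looks plausible either way, which is exactly why this kind of error stays silent unless something checks the result.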

True to Intel’s track record of making a bad problem worse, they acknowledged the flaw but originally claimed the end user would have to prove they were affected by it. This led to significant bad press and to Intel offering to replace all impacted processors. But of course, Intel being Intel, they still said that OEMs and resellers couldn’t participate, so the end user had to make the claim themselves…

Read more about this here: https://en.wikipedia.org/wiki/Pentium_FDIV_bug

Michael Paul - Opinions are my own and do not necessarily reflect the opinion of Veeam | https://micoolpaul.com | Mastodon: @micoolpaul@masto.nu | Bluesky: @micoolpaul.com

+13

vNote42
Author
On the path to Greatness
1246 comments
3 years ago
April 20, 2022

Cool information, @MicoolPaul !

There are also examples that more people can suffer from. For example, in PowerShell, foreach (statement) and ForEach-Object (cmdlet) do not always return the same results! See here - search for “Attention”

I can remember the Pentium bug; it was a topic of my first job interview 👴🏼

Wolfgang | vnote42.net | @vNote42

+21

Chris.Childerhose
Veeam Legend, Veeam Vanguard
8596 comments
3 years ago
April 20, 2022

Thanks for sharing, definitely a different topic.

+7

dips
Veeam Legend
814 comments
3 years ago
April 20, 2022

Thanks @vNote42 & @MicoolPaul for sharing. Really interesting topics. I’ve always thought corruption would be caused at the storage and memory level due to the dynamic nature of data being processed, but it just goes to show it can happen anywhere.

Dipen N. K. | IT Security Specialist | Veeam Legend 2022 - 2025 | Cyber Security Space VUG Leader

+13

vNote42
Author
On the path to Greatness
1246 comments
3 years ago
April 20, 2022

dips wrote:

Thanks @vNote42 & @MicoolPaul for sharing. Really interesting topics. I’ve always thought corruption would be caused at the storage and memory level due to the dynamic nature of data being processed, but it just goes to show it can happen anywhere.

… as Michael said: it is complex …

Wolfgang | vnote42.net | @vNote42

+7

dips
Veeam Legend
814 comments
3 years ago
April 20, 2022

vNote42 wrote:

dips wrote:

Thanks @vNote42 & @MicoolPaul for sharing. Really interesting topics. I’ve always thought corruption would be caused at the storage and memory level due to the dynamic nature of data being processed, but it just goes to show it can happen anywhere.

… as Michael said: it is complex …

Yep, it is indeed

Dipen N. K. | IT Security Specialist | Veeam Legend 2022 - 2025 | Cyber Security Space VUG Leader

+13

marcofabbri
On the path to Greatness
990 comments
3 years ago
April 20, 2022

This is outstanding. When I read Dolphin my mind just slipped to the virus 😅

Google is investigating this too:

https://www.reddit.com/r/sysadmin/comments/nrypj6/cpus_silent_errors_causing_problems_for_google/

Backups are like pizza, I love them. | Linkedin: @marco-fabbri-it

+4

Inder
Experienced User
576 comments
3 years ago
April 22, 2022

Thanks for sharing @vNote42

Inder | www.thenetworkdna.com | Twitter @inder8588
