Thanks for sharing @vNote42, I love seeing this kind of stuff as it highlights just how complex these processors are.
For example, there’s a game emulation project called “Dolphin” that emulates the Nintendo GameCube and Nintendo Wii, and every now and again I read their progress reports. One of their outstanding bugs involved online multiplayer between real Wii consoles and Dolphin: under specific scenarios the game would desync between the two endpoints and roll back to before the desync.
After a huge amount of investigation, they found the issue was related to the ‘nmadd’ and ‘nmsub’ instructions. These are negated versions of ‘madd’ and ‘msub’, flipping the sign of the result from positive to negative or vice versa within the same instruction. Each one really performs a sequence of mathematical operations in a single instruction. But that simplicity can cause problems.
It turns out the desync was caused by the negation being performed at a different step within the sequence depending on the CPU architecture.
Focusing on nmadd instructions:
The Wii’s PowerPC architecture would do: -(A * C ± B)
When emulating on an x86-64 processor though, the equation is: -(A * C) ± B
And finally, when using AArch64, the equation is: -(A * C) - B (which is actually x86-64’s nmsub equation, so who knows what’s going on here!)
The full breakdown of this can be found at https://dolphin-emu.org/blog/2021/09/07/dolphin-progress-report-august-2021/ but the root cause identified was that PowerPC would give a result of -0 where x86-64 and AArch64 would give +0.
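To make the -0 versus +0 point concrete, here is a minimal Python sketch. The function names are my own, and plain Python floats do not model the fused rounding of the real instructions, so treat it as an illustration of the sign-of-zero effect only, not of what Dolphin actually does. It compares negating after the add with negating the product first, for inputs where the sum is exactly zero.

```python
import math

def nmadd_negate_last(a, c, b):
    """Multiply, add, then negate: -(a * c + b)."""
    return -(a * c + b)

def nmadd_negate_first(a, c, b):
    """Negate the product first, then subtract b: -(a * c) - b.
    Algebraically identical to the version above."""
    return -(a * c) - b

def sign_of(x):
    """Return '-' or '+' using the sign bit, so -0.0 and +0.0 differ."""
    return "-" if math.copysign(1.0, x) < 0 else "+"

# Chosen so that a * c + b is exactly zero.
a, c, b = 1.0, 1.0, -1.0
late = nmadd_negate_last(a, c, b)    # -(+0.0)      -> -0.0
early = nmadd_negate_first(a, c, b)  # -1.0 + 1.0   -> +0.0

print(sign_of(late), sign_of(early))  # - +
print(late == early)                  # True: == cannot tell them apart
```

The two results compare equal, which is exactly why this kind of mismatch is so hard to spot: the difference only shows up once something looks at the sign bit or the raw bytes.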
The reason I mention this is that these tiny, theoretically insignificant differences can cause all sorts of problems in code, and when code is manipulating data, such as the compression @vNote42 talks about, it’s easy to see how corruption can occur. There truly is no substitute for backup validation.
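As a rough illustration of why a "harmless" difference like -0 versus +0 matters once data is stored, the sketch below packs both zeros into bytes and checksums them. This is only an assumption-laden toy (the helper name is mine, and CRC32 stands in for whatever validation a real backup product uses), but it shows that values which compare equal in code are not the same on disk.

```python
import struct
import zlib

def float_bytes(x):
    """Pack a float as 8 little-endian bytes (IEEE 754 double)."""
    return struct.pack("<d", x)

pos = float_bytes(0.0)
neg = float_bytes(-0.0)

print(0.0 == -0.0)                         # True: numerically equal
print(pos.hex())                           # 0000000000000000
print(neg.hex())                           # 0000000000000080 (sign bit set)
print(zlib.crc32(pos) == zlib.crc32(neg))  # False: checksums differ
```

A numeric comparison would pass here, while a byte-level check catches the difference, which is the argument for validating what was actually written.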
My final story to add to this is the Intel Pentium FDIV bug. When Intel were producing the original P5 Pentiums, 5 of the 1,066 array cells were incorrectly downloaded into the etching equipment, resulting in these values being etched as 0 instead of +2 and causing floating-point inaccuracies.
With Intel’s track record of making a bad problem worse, they acknowledged the flaw but originally claimed the end user would have to prove they were affected by it. This led to significant bad press and to Intel offering to replace all impacted processors but, of course, Intel being Intel, they still said that OEMs and resellers couldn’t participate in this; the end user had to make the claim themselves…
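For anyone curious what "floating-point inaccuracies" looked like in practice, the widely quoted check divides 4195835 by 3145727 and multiplies back. The sketch below is just that check in Python (the function name is mine). On a correct FPU the residual should be zero or negligibly small, whereas affected Pentiums famously returned 256.

```python
def fdiv_residual(x=4195835.0, y=3145727.0):
    """Divide, multiply back, and return what is left over.
    On a correct FPU this should be (essentially) zero."""
    return x - (x / y) * y

quotient = 4195835.0 / 3145727.0
residual = fdiv_residual()

print(quotient)  # roughly 1.33382 on a correct FPU
print(residual)
```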
Read more about this here: https://en.wikipedia.org/wiki/Pentium_FDIV_bug