ReFS issues with latest Windows Server Updates (KB5009624, KB5009557, KB5009555)


Userlevel 7
Badge +12

Just a quick post: the latest Windows Server updates for 2012R2, 2019 and 2022 (haven’t seen 2016) can cause ReFS issues. After the installation, ReFS volumes are shown as RAW and are no longer accessible; so if you’re using ReFS for your repositories, be aware of this when installing the update. Besides that, these updates can also cause boot loops for domain controllers and break Hyper-V.
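(A quick way to check repository volumes after patching, before backup jobs start failing; Get-Volume is built into Server 2012 R2 and later, and an affected ReFS volume typically no longer reports a file system:)

# List volumes with their file systems and health; look for ReFS volumes that suddenly report RAW/Unknown
Get-Volume | Select-Object DriveLetter, FileSystemLabel, FileSystem, HealthStatus, Size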

Depending on your Windows Server version one of the following updates could have caused the issue: KB5009624, KB5009557, KB5009555

If you’re affected, removing the mentioned update should solve the issue. Don’t (!) try to repair the ReFS volume, because this could cause data loss.
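(For reference, a minimal sketch of how such an update can be removed from an elevated prompt; the KB number is just an example for Server 2019, and the DISM package name is a placeholder you’d take from the list output:)

# Uninstall the update by KB number
wusa /uninstall /kb:5009557 /norestart

# If wusa can't remove it (combined SSU/LCU packages), locate the package and remove it with DISM
dism /online /get-packages /format:table | findstr 5009557
dism /online /remove-package /packagename:<package name from the previous command> /norestart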

This should remind everyone why 3-2-1 for Backups is so important. :wink:

Further information:

https://www.bleepingcomputer.com/news/microsoft/new-windows-server-updates-cause-dc-boot-loops-break-hyper-v/

https://forums.veeam.com/veeam-backup-replication-f2/beware-possible-raw-refs-volumes-after-installing-january-updates-t78634.html

Update #1:

Microsoft has pulled the updates, thanks @Mildur 

https://www.bleepingcomputer.com/news/microsoft/microsoft-pulls-new-windows-server-updates-due-to-critical-bugs/amp/

Update #2:

Microsoft has released the updates again and it looks like they didn't fix them. (Thanks @MicoolPaul )

https://www.bleepingcomputer.com/news/microsoft/microsoft-resumes-rollout-of-january-windows-server-updates/

Update #3:

Microsoft has released an out-of-band update, which can possibly resolve the issues:

https://www.bleepingcomputer.com/news/microsoft/microsoft-releases-emergency-fixes-for-windows-server-vpn-bugs/

 


106 comments

Userlevel 5
Badge

Okay, so the only way to get to the new ReFS version is to create a new disk and copy the data from the 1.2 to the 3.x volume? There’s no way to manually upgrade to the new ReFS version?

Userlevel 7
Badge +17

Okay, so the only way to get to the new ReFS version is to create a new disk and copy the data from the 1.2 to the 3.x volume? There’s no way to manually upgrade to the new ReFS version?

When you copy the data from volume to volume you will lose all block cloning savings….
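(For anyone going the copy route anyway, a minimal robocopy sketch, assuming D: is the old ReFS 1.2 volume and E: is the freshly formatted 3.x volume; the data lands fully rehydrated, so the target needs enough raw capacity for it:)

# Mirror the old repository folder to the new ReFS 3.x volume, keeping timestamps and ACLs
robocopy D:\Backups E:\Backups /MIR /COPYALL /DCOPY:DAT /R:1 /W:5 /LOG:C:\Temp\refs-migration.log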

Userlevel 5
Badge

Yep, but this specific volume fortunately doesn’t contain any Veeam backup data.

Userlevel 1

Hello all, just wanted to throw my 2 cents in. Please delete this if it’s not OK.

This is the thread I’ve seen with an active conversation about these issues.

I don’t use Veeam, but MABS v3.

 

My server is a Dell R720 with one RAID array containing 3 volumes: OS (NTFS), Recovery (NTFS), Data (ReFS). Windows Server 2019 Standard.

 

Up until the 20th Jan this server was operating fine (at least as far as I could tell, I was using it).

It installed KB5009557 automatically on the 21st at 4am, and since then it has been BSODing during boot with:

SYSTEM THREAD EXCEPTION NOT HANDLED

ReFS.SYS

Checking the MEMORY.DMP file shows this error has code 0x0000007e.

 

What was stranger was that I couldn’t boot from a Server 2019 ISO, as this also BSODed.

Used a Server 2016 ISO to boot and renamed refs.sys to refs.sys.bak, which allowed me to boot Windows without ReFS support, so the volume showed as RAW.
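(For anyone in the same spot, the rename itself is a one-liner from the setup media’s repair command prompt (Shift+F10); the drive letter is an assumption, the OS volume often isn’t C: when booted from install media:)

# Disable the ReFS driver by renaming it out of the way
ren D:\Windows\System32\drivers\refs.sys refs.sys.bak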

 

Created a support case with Microsoft, more on that later.

 

Found this article, and discovered the OOB update KB5010791.

Tricky to install, as I needed refs.sys in place for the install to succeed, but then had to rename it away again to boot Windows. It did update the file though: 10.0.17763.2458.
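(Checking which refs.sys build is actually on disk is a one-liner, useful for comparing before/after the OOB update:)

# Show the file version of the installed ReFS driver
(Get-Item C:\Windows\System32\drivers\refs.sys).VersionInfo.FileVersion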

But still BSOD.

 

Used refsutil.exe to assess the volume. Looked all fine:

PS C:\windows\system32> refsutil salvage -QS F: C:\refsutil
Microsoft ReFS Salvage [Version 10.0.11070]
Copyright (c) 2015 Microsoft Corp.

Local time: 1/24/2022 9:53:14

ReFS version: 3.4
Boot sector checked.
Cluster Size: 4096 (0x1000).
Cluster Count: 3265642496 (0xc2a5c000).
Superblocks checked.
Checkpoints checked.
4558 container table entry pages processed (0 invalid page(s))
1 container index table entry pages processed (0 invalid page(s)).
Container Table checked.

Processing 1403 of 1404 object table pages (99%)...

Object Table checked.

Examining identified metadata disk data for versioning and consistency.
134777 disk clusters analyzed (100%)...

Examining volume with signature 45246377 for salvageable files.
4558 container table entry pages processed (0 invalid page(s)).
1 container index table entry pages processed (0 invalid page(s)).
Validating discovered table roots on volume with signature 45246377.

36926 table roots validated (100%).
Enumerating files from discovered tables on volume with signature 45246377.

36926 tables enumerated (100%).
Command Complete.

Run time = 611 seconds.

Tried uninstalling both KBs.

I also tried manually copying in previous versions of refs.sys (10.0.17763.2452, 10.0.17763.2330). Still BSOD.

 

After waiting a week since creating the case for Microsoft support to give me something useful, I finally took things into my own hands.

Used refsutil to copy anything that might be useful to keep, deleted the ReFS volume, reinstated refs.sys, rebooted, and the BSOD went away.
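(If I remember the syntax right, the “copy anything useful” step looks roughly like the sketch below: -FA is the full-scan automatic mode, F: the damaged volume, C:\refsutil a working directory and D:\recovered the target for salvaged files. All paths are examples, and I’d check refsutil salvage /? before running anything:)

# Full-scan salvage: copy whatever refsutil can recover from F: into D:\recovered
refsutil salvage -FA F: C:\refsutil D:\recovered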

Updated fully with both KB5009557 + KB5010791. Still no BSOD.

Recreated the ReFS volume, configured MABS to use it again. Still no BSOD.

 

Microsoft support have been about as helpful as a wet towel, giving me basic instructions that I’ve already tried myself (sfc, dism, uninstall the KB). They did get my MEMORY.DMP and claim to have been analysing it since, but so far have only given me the stop code 0x7e, which I’d already found myself. They are slow to respond to emails, and even ‘strongly suggest’ I contact Microsoft Premier Support Services. No, I’m not spending more money on MS. I tried to get them to confirm whether ReFS and fixed disks were a known issue they were working on; no word on that.

 

So this leaves me wondering.

Did something happen to my volume, a corruption maybe, that then broke refs.sys?

Or did their update break something in my volume, which then broke refs.sys?

I was using this volume for other things (a Windows share), so I’ve now split those workloads out to separate NTFS volumes, but surely that can’t be it.

I read that ReFS volumes can fall over when they get high in used space. Is that true, or still true?

Userlevel 7
Badge +13

Okay, so the only way to get to the new ReFS version is to create a new disk and copy the data from the 1.2 to the 3.x volume? There’s no way to manually upgrade to the new ReFS version?

When you copy the data from volume to volume you will lose all block cloning savings….

True, but I can’t see other solutions...

Userlevel 7
Badge +13

I read that ReFS volumes can fall over when they get high in used space. Is that true, or still true?

Never heard of this, let’s see what community have to say about it.

Userlevel 1

I read that ReFS volumes can fall over when they get high in used space. Is that true, or still true?

Never heard of this, let’s see what community have to say about it.

Yeah I just re-found the article. ServerFault. May 2019. “When ReFS volumes get filled up to 60-80% they tend to lock up. If you have ReFS user data hashing enabled situation turns even worse.”

So I guess people have gone over that with no issues … ?

Userlevel 7
Badge +12

Welcome to the Veeam Community @ianv. The community is open for everyone, even if you’re not a Veeam user :wink:

It’s hard to say what happened in your case and whether you experienced something different. I think the BSODs weren’t caused by this update; “only” the ReFS volumes turned inaccessible (RAW).

‘strongly suggest’ I contact Microsoft Premier Support Services.

I really love their support for such suggestions. They introduce an issue but won’t help you in any way or admit that it could be a bug without going through their premier support, which is, by the way, not easily accessible at all.

Userlevel 7
Badge +20

Welcome to the Veeam Community @ianv. The community is open for everyone, even if you’re not a Veeam user :wink:

It’s hard to say what happened in your case and whether you experienced something different. I think the BSODs weren’t caused by this update; “only” the ReFS volumes turned inaccessible (RAW).

‘strongly suggest’ I contact Microsoft Premier Support Services.

I really love their support for such suggestions. They introduce an issue but won’t help you in any way or admit that it could be a bug without going through their premier support, which is, by the way, not easily accessible at all.

I‘ve actually never had a successful resolution via Microsoft Premier Support, everything has always been out of scope or no root cause found, and I’ve always been given my credits back! 😆 I must need to get Microsoft Super Premier Justice League Support for that…

Userlevel 7
Badge +12

The first problem is that we’re all probably at a very high knowledge level, perhaps just a bit below or the same as premier support. So we have already tried everything and hope that they’ll be able to work wonders. And second, regarding bugs, they won’t have any better insights or connections. So you’ll need to spend a lot of time and have some luck to really get to the point where someone can help you. And at that point the problem has either magically disappeared (I’m looking at you, Windows Update) or you’ve set up a fresh system because you couldn’t wait. Don’t get me wrong, the premier support people are great and very knowledgeable, but like I said, they’re not magicians.

 

Has anyone found a solution for this on Windows 2012 R2 with the ReFS volume showing as RAW? I tried removing the bad patch and even installed the out-of-band patch, still no luck.

Userlevel 2

If the volume is still RAW after removing the January patch (patches?), that suggests there is some real ReFS corruption.
If on VMware, try the disable-hotplug method as a double check.
And/or try attaching the disk to an unpatched system (any OS version) as another check.

If it is real corruption, it is *probably* a coincidence that it happened at the time of patching.
There are some 3rd party tools that can scan disks/volumes and that understand ReFS,
e.g.

https://www.r-studio.com/

https://www.isobuster.com/

https://www.diskgenius.com/

For WS2019 there is a built-in refsutil tool (which can also be copied to and run on WS2016, but I doubt it will work on WS2012R2).

Note: only attempt a recovery as a last resort, after confirming you’re not hitting the known issues that make the volume appear RAW even though there is no corruption.

 

Badge

Apparently, this is still an issue. I had 6 updates automatically applied (KB5011564, KB5011560, KB5010462, KB5010419, KB5010395, KB5009595). I thought I had figured out which one was the culprit, but I was mistaken and did not have the time to narrow it down, so I removed all 6, rebooted, and the ReFS drive no longer showed as RAW. I do not find any of the KBs I listed above matching the others in this thread; has anyone else experienced this and found out which one it was?
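(One way to narrow it down without guessing: list the most recently installed updates, then remove them one at a time, rebooting and checking the volume in between. A minimal PowerShell sketch:)

# List installed updates, newest first, to identify candidates for removal
Get-HotFix | Sort-Object InstalledOn -Descending | Select-Object HotFixID, InstalledOn, Description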

Userlevel 7
Badge +20

Apparently, this is still an issue. I had 6 updates automatically applied (KB5011564, KB5011560, KB5010462, KB5010419, KB5010395, KB5009595). I thought I had figured out which one was the culprit, but I was mistaken and did not have the time to narrow it down, so I removed all 6, rebooted, and the ReFS drive no longer showed as RAW. I do not find any of the KBs I listed above matching the others in this thread; has anyone else experienced this and found out which one it was?

We have not applied any of these updates to our ReFS servers due to this problem.  I am surprised MS has not addressed it yet and we cannot have our Veeam services down.

Badge

Apparently, this is still an issue. I had 6 updates automatically applied (KB5011564, KB5011560, KB5010462, KB5010419, KB5010395, KB5009595). I thought I had figured out which one was the culprit, but I was mistaken and did not have the time to narrow it down, so I removed all 6, rebooted, and the ReFS drive no longer showed as RAW. I do not find any of the KBs I listed above matching the others in this thread; has anyone else experienced this and found out which one it was?

We have not applied any of these updates to our ReFS servers due to this problem.  I am surprised MS has not addressed it yet and we cannot have our Veeam services down.

We have been employing the same approach as you and these shouldn’t have been applied but someone else had a different idea. I will need to isolate which update caused it, thanks for the response.

Userlevel 7
Badge +20

Apparently, this is still an issue. I had 6 updates automatically applied (KB5011564, KB5011560, KB5010462, KB5010419, KB5010395, KB5009595). I thought I had figured out which one was the culprit, but I was mistaken and did not have the time to narrow it down, so I removed all 6, rebooted, and the ReFS drive no longer showed as RAW. I do not find any of the KBs I listed above matching the others in this thread; has anyone else experienced this and found out which one it was?

We have not applied any of these updates to our ReFS servers due to this problem.  I am surprised MS has not addressed it yet and we cannot have our Veeam services down.

We have been employing the same approach as you and these shouldn’t have been applied but someone else had a different idea. I will need to isolate which update caused it, thanks for the response.

No problem - we use a patch management system so we don’t approve those updates for the specific servers so they never get installed.

Badge

Apparently, this is still an issue. I had 6 updates automatically applied (KB5011564, KB5011560, KB5010462, KB5010419, KB5010395, KB5009595). I thought I had figured out which one was the culprit, but I was mistaken and did not have the time to narrow it down, so I removed all 6, rebooted, and the ReFS drive no longer showed as RAW. I do not find any of the KBs I listed above matching the others in this thread; has anyone else experienced this and found out which one it was?

We have not applied any of these updates to our ReFS servers due to this problem.  I am surprised MS has not addressed it yet and we cannot have our Veeam services down.

We have been employing the same approach as you and these shouldn’t have been applied but someone else had a different idea. I will need to isolate which update caused it, thanks for the response.

No problem - we use a patch management system so we don’t approve those updates for the specific servers so they never get installed.

We do too, but another engineer decided to release them. It is what it is and won’t happen again if I have anything to say about it!

Userlevel 7
Badge +12

I haven't heard any bad news for some time, so it will be interesting to see which update is causing your problem.

Userlevel 2

What OS?
If 2012R2, and if on a system like VMware where the disks have the characteristics of removable media, you will always have issues.
With the patches removed and the volume visible, can you check the ReFS version number:
fsutil fsinfo refsinfo x:
If it's v1.2, then that is the old version, which will never work on 'removable' drives now.
v3.x should be OK with the latest updates (since Feb).

If this is a VMware system with ReFS v1.2 disks, then the only option is a VMware configuration change to disable hotplug:

Disabling the HotAdd/HotPlug capability in virtual machines (vmware.com)
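(Per that VMware article, the per-VM fix boils down to one advanced configuration parameter. A hedged PowerCLI sketch, assuming the VM is called 'backup-repo' and is powered off while the setting is applied:)

# Add devices.hotplug = FALSE to the VM's advanced configuration (same effect as editing the .vmx)
New-AdvancedSetting -Entity (Get-VM -Name 'backup-repo') -Name 'devices.hotplug' -Value 'FALSE' -Confirm:$false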


If none of the above, please see if the ReFS event log (under Applications and Services Logs) says anything, e.g. about the version number.
E.g. if somehow the v3.x ReFS disks have been attached to a later OS and upgraded to a later v3.x version, they will no longer be readable if you put them back on the earlier OS.
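(If you're not sure of the exact channel name, PowerShell can find the ReFS-related logs for you and dump the most recent entries; the wildcard keeps the sketch from depending on the precise log name:)

# Locate ReFS-related event logs and show the latest entries from each
Get-WinEvent -ListLog '*ReFS*' -ErrorAction SilentlyContinue |
    ForEach-Object { Get-WinEvent -LogName $_.LogName -MaxEvents 20 -ErrorAction SilentlyContinue }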
 

Userlevel 4

hi guys, I have a very bad experience with ReFS on WS2016. One of my customers has around 60TB+ of backups on a ReFS volume, and after some Windows update and reboot the volume was “RAW”. Since then I’m not using ReFS on 2012R2, 2016, 2019… I cannot believe it. Maybe the support on WS2022 is better. In the past I wrote a post that ReFS is not supported for storage arrays which are connected via iSCSI. Is that still true with 2022?

tom.

Userlevel 7
Badge +3

@regnor thank you for the heads up on this!!!

Userlevel 7
Badge +12

@TKA The storage or RAID controller needs to be certified by Microsoft or listed in the Windows Server catalog. In general you should at least have a battery-backed RAID controller in place; without it, a power loss could corrupt your ReFS volume. I'm not sure about iSCSI, but I would say if the storage is decent it shouldn't be an issue.

For your case; what storage do you have? Is it a NAS?

Userlevel 4

@TKA The storage or RAID controller needs to be certified by Microsoft or listed in the Windows Server catalog. In general you should at least have a battery-backed RAID controller in place; without it, a power loss could corrupt your ReFS volume. I'm not sure about iSCSI, but I would say if the storage is decent it shouldn't be an issue.

For your case; what storage do you have? Is it a NAS?

Now the customer has a “new backup server” with WS2019 on a Dell R720 with 10 GbE ethernet. The backup target is a storage array, an Infortrend DS1024 (dual controller, 4 GB RAM per controller, with BBU). But after my bad experience with ReFS and losing around 60TB+ of backups, we are still on NTFS :) Now we have ordered a new array with 180 TB capacity, so maybe it's time to change to ReFS: 60 TB for the “performance tier” and 120 TB for an object storage volume as “archive” :)

Userlevel 7
Badge +17

Up to now I have had no bad experiences with ReFS on Windows Server 2016 and 2019. I am using ReFS with local disks and iSCSI targets….

Userlevel 7
Badge +6

What OS?
If 2012R2, and if on a system like VMware where the disks have the characteristics of removable media, you will always have issues.
With the patches removed and the volume visible, can you check the ReFS version number:
fsutil fsinfo refsinfo x:
If it's v1.2, then that is the old version, which will never work on 'removable' drives now.
v3.x should be OK with the latest updates (since Feb).

If this is a VMware system with ReFS v1.2 disks, then the only option is a VMware configuration change to disable hotplug:

Disabling the HotAdd/HotPlug capability in virtual machines (vmware.com)


If none of the above, please see if the ReFS event log (under Applications and Services Logs) says anything, e.g. about the version number.
E.g. if somehow the v3.x ReFS disks have been attached to a later OS and upgraded to a later v3.x version, they will no longer be readable if you put them back on the earlier OS.
 

 

Funny that this popped up again. A couple of days ago I ran across another 2012 R2 machine that was showing the ReFS volumes as RAW, and disabling hotplug worked perfectly without having to uninstall the offending KB that was applied to the server. In the past, I had a client this happened to a couple of times and I did both at the time, so I wasn't sure which fixed it. Can confirm disabling hotplug worked perfectly!
