Solved

Troubleshooting a very weird issue right now


Userlevel 7
Badge +3

I’ve randomly had two backups corrupt at one of my clients. Not sure why it happened, but they did. On one of them, I deleted from disk all of the restore points on the local datastore, and took a new full. I left the cloud data alone, and as we speak, the new restore points are peacefully transferring offsite, no issue. The second backup has been a different story…

 

Because of the nature of the server, I decided to take a new full and leave the old data in place, just in case. The new full local backup was fine, but the offsite failed. I tried running an active full of the offsite, but same deal; the offsite failed. So now I am trying a new strategy. I took a look at the restore points and 99% were corrupted (I believe this was the offsites), so I decided to just wipe everything and start from scratch, because I think the corrupted restore points are causing issues with the new data going offsite. The offsite failures have been throwing an error I’ve never seen before. I’ll paste it below. I’ll keep you all updated on what ultimately happens with this. The new local full is running as I type this.

 

Processing *server name* Error: Bad Data.
Failed to call CryptDecrypt
AesAlg failed to decrypt, keySet: ID: 45ee103e03b54e69630b13affc7ecb56 (session), keys: 1, repair records: 1 (master keys: 1057cad7009d95a034d2d62be82b7040)
Failed to upload disk. Skipped arguments: [servername_1-flat.vmdk];
Agent failed to process method {DataTransfer.SyncDisk}.
Exception from server: Bad Data.
Failed to call CryptDecrypt
AesAlg failed to decrypt, keySet: ID: 45ee103e03b54e69630b13affc7ecb56 (session), keys: 1, repair records: 1 ( 

Best answer by bp4JC 10 February 2023, 14:18

18 comments

Userlevel 7
Badge +20

Hey, that error looks like bad data at the source. What NAS/SAN are the VMs running on?

Userlevel 7
Badge +3

I believe they’re on a QNAP.

Userlevel 7
Badge +3

The thing that doesn’t make sense is that the local backups succeed. It’s the offsite copy job that fails.

Userlevel 7
Badge +20

Is the copy job configured to use a different source than you’re expecting?

 

I wouldn’t expect the copy job to be talking about VMDKs, but rather the VBK/VIB files.

Userlevel 7
Badge +20

Can you do a SureBackup test on your local backups to be 100% sure they’re good? I don’t trust QNAP/Synology in production after my scars 😔

Userlevel 7
Badge +3

Just double-checked the target and it’s correct. It’s our offsite datastore.

Userlevel 7
Badge +3

I can try that. Haven’t done it before. A new full is running currently. I am starting from scratch. I cleared out everything.

Userlevel 7
Badge +20

Sorry I meant the source for the BCJ 🙂

Userlevel 7
Badge +20

If you’re new to SureBackup, don’t worry about roles, ping tests, etc. You just want to see a heartbeat (the OS has booted and VMware Tools is responding) and to monitor the console view within VMware to confirm that the VM has booted to the OS successfully. Then you can build from there.

Userlevel 7
Badge +3

I checked the job settings and it’s pulling from the backup job for this server and it’s moving to our offsite datastore. Nothing has changed. This copyjob has been running fine for over a year. I’m not sure what’s happening. Is there a specific place in the BCJ’s settings I should check?

Userlevel 4
Badge +1

Exception from server: Bad Data.
Failed to call CryptDecrypt

 

I’ve had such things in the past, and it looks like corrupt data… I’ve also seen these issues in combination with QNAPs. Frankly speaking, I would never buy such devices again for backups; they are just not resilient enough for that purpose. Also, if I’m not mistaken, they implement RAID at the software level, which seems to be why silent data corruption happens every now and then.

 

Has anybody seen corrupted data on a hardware RAID? I haven’t and so for me it’s clear: Don’t use NAS devices any longer (which is btw recommended by Anton Gostev).

 

I know this doesn’t help in the current situation, but it probably will in the future ;)

Userlevel 7
Badge +20

That’s correct, it’s only software RAID. I had a scheduled power outage at a DC, took 4x of them down gracefully, 2x of them showed RAW file systems upon power up, all the data was gone. Last time I used them!

Userlevel 4
Badge +1

Exactly, Michael, your last words are the crucial ones: you just can’t accept that level of quality for sensitive data.

Userlevel 5
Badge +3

Few inputs here.

 

@bp4JC, I’d strongly recommend reviewing this in a support case. It’s probable the backups do have some corruption, but let the support engineers guide you through confirming it. If you want to try it yourself, use the Veeam Validator with the /backup flag (/file will try to make its own connection to the target you give it, so if it’s a UNC path for a network share, it tries to start its own session instead of using what’s in VBR; this can work, but it very often causes headaches). Support can help you check which restore point in particular is suspect and give you more targeted commands.

 

@MicoolPaul Backup Copy will get deep inside the backup file to get at blocks, it doesn’t parse on just the top level storage container. Backup Copy is pretty “intelligent” in that it’s not just copying restore points, it opens the source backup file, parses its contents, and moves truly only the relevant data; this is why you’ll see it references FIBs (File in Backup) like VMDK, VHDX, etc.

 

There is a slight chance that “Failed to call CryptDecrypt” points to an interruption between the gateway and the source share, but I want to be clear that it can equally mean there is indeed file-level corruption.

(

Imagine the call stack of Gateway => NAS looking like

Gateway: “Hey NAS friend, I need to read this offset from a file for CryptDecrypt”,

NAS:“here you go, one block coming right up!”,

cosmic rays or something malform the packet/payload and the CryptDecrypt gets this malformed payload

Gateway: “wtf m8?”

)
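To make that exchange a bit more concrete, here is a minimal sketch of why the reader can’t tell the two causes apart. It is not how Veeam works internally; it just assumes a digest is recorded when a block is written and checked again when the block is read. The names `checksum`, `flip_bit`, and `read_is_intact` are made up for the illustration.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Digest recorded when the block is first written (hypothetical scheme)."""
    return hashlib.sha256(block).hexdigest()


def flip_bit(block: bytes, bit_index: int) -> bytes:
    """Simulate silent corruption by flipping a single bit in the block."""
    damaged = bytearray(block)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


def read_is_intact(block_as_read: bytes, expected_digest: str) -> bool:
    """The reader only sees bytes; it cannot know where they were damaged."""
    return checksum(block_as_read) == expected_digest


original = b"backup block payload" * 64
expected = checksum(original)

clean_read = original                      # NAS returned the block unchanged
damaged_read = flip_bit(original, 1000)    # one bit changed on disk OR in transit

print(read_is_intact(clean_read, expected))    # True
print(read_is_intact(damaged_read, expected))  # False
```

Whether the bit flipped on the NAS disks or somewhere between the NAS and the gateway, the failed check looks identical on the receiving side, which is why the error message alone doesn’t settle it.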

 

But I think it’s best to check it out with Support to confirm what you’re dealing with.

 

Userlevel 7
Badge +3

Alright, update/resolution for everyone. I now have a successful offsite, but it took getting rid of everything and starting from scratch. I had to Delete From Disk all of my restore points from the cloud datastore. Of note, every one of them was listed as corrupt (with the exception of the incompletes that had just tried to go out from the new full and the subsequent incrementals that ran).

 

I’m seeing some interesting input on QNAP and NAS use in backups, and I’ll definitely take that into consideration. Thank you so much for your input, everyone! This community is wonderful!

Userlevel 7
Badge +20

Just an extra comment on this: depending on whether you’re going “bargain basement” QNAP level or decent spec, you might find a low-end Dell or HPE server to be far more cost effective. The other year I had a quote for a Synology 10GbE model end up more expensive than a Dell NX series server. The only thing I disliked about those was that they used 2x 600GB disks in RAID1 for the boot OS, whereas you could use a Dell PowerEdge R740XD/XD2 to get a lot of spindles. (And if you REALLY need a lot of spindles, the HPE Apollos are great.)

Userlevel 7
Badge +3

I don’t do any of the purchasing for us, but I know we always try to provide the best quality we can for the budgets our clients have. I emailed my CEO and CTO about the comments regarding QNAPs and NAS use, and the CEO responded. I really appreciate that input from you all.

Userlevel 7
Badge +10

3-2-1 Rule.