Solved

Backup Copies to Synology "Stg.CheckMetadataCorrupt"


Userlevel 2

Good morning. I have been having some problems with our current setup. I'm OK with smashing my hardware and starting over, but I want something I know will work! I am having issues with a backup copy job that moves our backups off-site to another location.

We have Veeam installed on a server in our primary data center; this server has 64TB, and all our primary backups are stored here. We have a backup copy job that pushes backups off-site over a VPN to a Synology DS416play. At first, we set this up as an NFS share connected to a repository on the Veeam server; this was very slow, and backups failed. Then we changed it to an SMB share connected to a repository on the Veeam server; again, this was slow, and jobs would never finish.

Then we created a LUN and attached it to the Veeam server over iSCSI, creating a drive letter and pointing the repository to it. The iSCSI method increased the speed; jobs were flying and working! But over time, we started getting "Stg.CheckMetadataCorrupt" errors. Doing an Active Full would resolve this, but only for a few days or weeks; this has become a maintenance nightmare!

Before moving to the Synology, we had a large USB HD connected to a Windows 10 box; this worked well for years, but I never liked the thought of a single drive and the limited space. We had the Synology DS416play sitting around, so we deployed it, thinking this would give us some redundancy and more space.

My goal is to develop a better solution; maybe this problem is linked to the DS416play or how we are setting this up? Maybe Veeam? I know it's not the WAN; we used a slow POS USB drive for years and never had a problem. Is there another lower-cost storage appliance with a Veeam integration that would be better for what I am trying to do?

Best answer by Amore514 14 July 2022, 17:42

16 comments

Userlevel 4
Badge +2

Just some input -- the 100% best improvement you could make would be to put a Gateway/Repository server next to the NAS on the remote site and connect the NAS via either:

  1. An iSCSI mount to the server, used as a Repository Server. You can even get away with just Windows 10/11 here as long as it’s sized appropriately for the concurrent tasks: https://helpcenter.veeam.com/docs/backup/vsphere/system_requirements.html?ver=110#backup-repository-server
  2. An SMB/NFS share that mounts to the gateway server. A nice advantage here is that you can use Linux for an NFS gateway and save the cost of a Windows license.

The reason having a Repository Server/Gateway on the DR site helps is that the Veeam datamover agents get deployed there, so the topology changes. I’m guessing previously with NFS/SMB the gateway was on the production site, right? So you have the WAN in-between and any sensitivity there breaks the connection because of how these protocols work. With the datamover agent, Veeam has resiliency built in, doubly so for Backup Copies, and it knows how to handle temporary “blips” in the network and you can even tweak it pretty heavily.

Basically, putting a Veeam infrastructure component on the remote site helps to handle the WAN in between. You can still station-wagon the NAS to the local site and import the backups that way in a disaster, and all it would cost is just a simple server on the DR side.
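To make the "handles temporary blips" idea concrete, here is a minimal Python sketch of a chunked copy that retries a failed chunk with exponential backoff instead of failing the whole job. This is only an illustration of the general technique, not Veeam's actual data mover logic; `FlakyLink`, `copy_with_retries`, and all parameters are hypothetical names made up for this example.

```python
import time


class FlakyLink:
    """Hypothetical WAN link that fails on chosen send attempts (1-based)."""

    def __init__(self, fail_on_attempts):
        self.fail_on_attempts = set(fail_on_attempts)
        self.attempts = 0
        self.received = []

    def send(self, chunk):
        self.attempts += 1
        if self.attempts in self.fail_on_attempts:
            raise ConnectionError("simulated WAN blip")
        self.received.append(chunk)


def copy_with_retries(chunks, link, max_retries=5, base_delay=0.0):
    """Send chunks in order; retry a failed chunk with exponential backoff."""
    for chunk in chunks:
        for attempt in range(max_retries + 1):
            try:
                link.send(chunk)
                break  # chunk delivered, move on to the next one
            except ConnectionError:
                if attempt == max_retries:
                    raise  # the link stayed down too long, give up
                time.sleep(base_delay * (2 ** attempt))
    return len(chunks)


if __name__ == "__main__":
    link = FlakyLink(fail_on_attempts={2, 3})
    copy_with_retries([b"a", b"b", b"c"], link)
    print(link.received, "after", link.attempts, "attempts")
```

A plain SMB/NFS/iSCSI mount across the WAN has no equivalent of this retry loop at the backup-application level, which is one way to read the point above about putting a Veeam component on the remote side.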

Userlevel 7
Badge +8

That makes more sense 🙂

 

I’d be sure to test this before you need it, however. Speaking purely from experience:

 

You’ll need to be able to move the host and network (VLANs etc.) in addition to the storage in order to access the iSCSI storage when such a situation arises. Plus, I found out the hard way that these NAS systems really aren’t all that great. I was helping a company complete an office move with four QNAP NASes across two different models. We gracefully shut down the OS and the QNAPs in the appropriate order and migrated to the new site, and two of the four QNAPs were showing RAW storage to the OS; the LUNs had been corrupted. The failures weren’t confined to one model either.

 

If this is a cost-conscious thing, have you considered tape and just shipping those tapes to the other site?

Userlevel 2

@Amore514: Just an idea for a different approach. As you’re already copying your backups offsite via WAN/VPN, have you thought about offloading your backups to cloud object storage (S3)? This would probably be more stable and offers you additional security, for example immutable backups.

Yep, we are using Amazon S3 storage. This works great, but restoring is slow. Because of the slow restore, I wanted something a bit more local and a 3rd set just in case.

Playing devil’s advocate, would your WAN not be the bottleneck anyway in this scenario? Of course, I don’t know the details of your platform and might be completely wrong. Have you tested a recovery to compare the speed against S3?

Trust me, I am open to all advice! Yes, you are correct, restoring over the WAN would be SLOW! But this site is only 15 miles away; if I needed to restore everything, I would pick it up and bring it to the data center :-)
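For a rough sense of why driving the NAS over can beat a full restore across the WAN, here is a back-of-the-envelope calculation in Python. The data size (10 TB), the link speeds, and the 80% efficiency factor are all assumptions chosen for illustration; none of them come from this thread.

```python
def transfer_hours(size_tb, link_mbit_s, efficiency=0.8):
    """Hours to move size_tb (decimal terabytes) over a link of link_mbit_s.

    efficiency is an assumed fraction of the nominal line rate that the
    transfer actually achieves (protocol overhead, VPN, other traffic).
    """
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_mbit_s * 1e6 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Assumed example values, not measurements from the poster's environment.
    for link in (50, 100, 1000):
        print(f"10 TB over {link} Mbit/s: {transfer_hours(10, link):.1f} h")
```

Under those assumptions, 10 TB over a 100 Mbit/s link takes roughly 278 hours, so a 15-mile drive is hard to beat for a full restore.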

Userlevel 2

I am almost a week into having the write cache turned off, and so far everything is still working!

Userlevel 7
Badge +6

Looking forward to your feedback @Amore514 to see if disabling the write cache solved the issue. In general, those NAS systems tend not to be very reliable or stable. In my opinion, the smaller consumer-grade ones especially aren’t built for use cases like a backup target. So far I haven’t seen any problems with the bigger ones like the Synology rack systems.

Do you know of any smaller, inexpensive NAS systems that have a direct Veeam integration?

Unfortunately, I don't know of any. I can only say that the RackStations haven't caused any problems so far, though I suspect that the smaller ones also only do software RAID. If I could, I would go with a small server and install Windows/Linux on it.

Userlevel 7
Badge +8

Thanks for the update, you should find it’s fine now that write cache is disabled. Have you also confirmed you have header & data digests configured?
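For context on the digest question: iSCSI can optionally protect each PDU's header and data segment with a CRC32C checksum, so corruption in transit is detected rather than silently written. Below is a small pure-Python sketch of CRC32C and how a mismatch flags a flipped bit. It only illustrates the checksum idea; it does not show how to enable digests on any particular initiator or NAS, and the payload is a made-up example.

```python
def _make_crc32c_table():
    """Build the lookup table for CRC-32C (Castagnoli), reflected form."""
    poly = 0x82F63B78
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    """Return the CRC-32C of data as an unsigned 32-bit integer."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    payload = b"backup block contents"      # hypothetical data segment
    digest = crc32c(payload)                 # sender attaches this digest

    damaged = bytearray(payload)
    damaged[3] ^= 0x01                       # one bit flipped in transit
    print("intact ok: ", crc32c(payload) == digest)
    print("damaged ok:", crc32c(bytes(damaged)) == digest)
```

Digests only cover the trip over the wire; they would not catch a write that was acknowledged by the NAS and then lost from a volatile cache.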

Userlevel 2

Hi @Amore514, write cache being disabled will certainly help. I had this with QNAPs years ago and had to remove this to resolve the problems.

 

Unfortunately, these QNAP/Synology-grade systems use software-defined RAID with no battery-backed write cache, meaning that writes can and do get lost, leading to the corruption you’re seeing here. Disabling the write cache will ‘hurt’ performance, but your writes will actually hit the storage before they are acknowledged. It’s better to take the performance hit and not have to worry about running an active full because of corruption; any time saved by the cache is immediately lost there!
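A toy Python model of the failure mode described above: with a volatile write cache, the disk acknowledges a write before it reaches persistent storage, so a power loss or crash can drop data the backup software believes is safely written. The `Disk` class and its methods are invented purely for illustration and do not model any real NAS firmware.

```python
class Disk:
    """Toy disk with an optional volatile write cache (no battery backup)."""

    def __init__(self, write_cache_enabled):
        self.write_cache_enabled = write_cache_enabled
        self.platter = {}  # persistent storage
        self.cache = {}    # volatile, lost on power failure

    def write(self, block, data):
        """Store data and return True, i.e. acknowledge the write."""
        if self.write_cache_enabled:
            self.cache[block] = data    # acknowledged before it is persistent
        else:
            self.platter[block] = data  # acknowledged only once persistent
        return True

    def flush(self):
        """Move cached writes to persistent storage."""
        self.platter.update(self.cache)
        self.cache.clear()

    def power_loss(self):
        """Anything still in the volatile cache is gone."""
        self.cache.clear()

    def read(self, block):
        return self.cache.get(block, self.platter.get(block))


def lost_blocks_after_power_loss(write_cache_enabled):
    """Write four blocks, cut power before any flush, list the lost blocks."""
    disk = Disk(write_cache_enabled)
    acked = {}
    for block in range(4):
        data = f"metadata-{block}"
        if disk.write(block, data):
            acked[block] = data
    disk.power_loss()
    return [b for b, d in acked.items() if disk.read(b) != d]


if __name__ == "__main__":
    print("cache on :", lost_blocks_after_power_loss(True))
    print("cache off:", lost_blocks_after_power_loss(False))
```

With the cache on, every acknowledged block is missing after the simulated power loss; with it off, none are. A battery-backed cache would sit in between: fast acknowledgements, but the cached data survives the outage.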

 

I also want to clarify: are you using iSCSI over your VPN, or is it iSCSI to a local system that is accessed over the VPN?

 

If you’re trying to use iSCSI over VPN, this will be your biggest issue. iSCSI clients don’t like latency or fragmented/reordered packets, and you’ll be hitting a lot of both with iSCSI over VPN.
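To put a number on the latency point: block I/O is bounded by how much data can be in flight per round trip, so throughput cannot exceed roughly I/O size times queue depth divided by round-trip time. The Python sketch below works that out; the 64 KiB I/O size, queue depth, and round-trip times are assumed example values, not measurements from this setup.

```python
def max_throughput_mb_s(io_size_kib, queue_depth, rtt_ms):
    """Upper bound on throughput (MB/s) when each I/O waits one round trip.

    At most queue_depth I/Os of io_size_kib each can be outstanding, and
    each needs at least one round trip (rtt_ms) before it completes.
    """
    bytes_in_flight = io_size_kib * 1024 * queue_depth
    return bytes_in_flight / (rtt_ms / 1000) / 1_000_000


if __name__ == "__main__":
    # Assumed values for illustration only.
    for label, rtt in (("LAN, 0.5 ms", 0.5), ("VPN, 20 ms", 20.0)):
        for depth in (1, 8):
            rate = max_throughput_mb_s(64, depth, rtt)
            print(f"{label}, queue depth {depth}: {rate:.1f} MB/s")
```

With these assumed figures, a single outstanding 64 KiB I/O tops out near 131 MB/s on the LAN but only about 3.3 MB/s across a 20 ms VPN, before packet loss or reordering makes things worse.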

At first we were connecting our Veeam server directly to the iSCSI target over the VPN. After the 3rd time seeing the "Stg.CheckMetadataCorrupt" error, we installed a small jump box at the remote site, linked that to the iSCSI target, and created a repository on that box. Making this change did increase performance, but over time the "Stg.CheckMetadataCorrupt" error came back.

I wish the Veeam software could see this error and automatically start an active full, kind of like self-healing.

 

Userlevel 2

Not sure, but maybe the difference is that I am backing up over a WAN (hardware VPN, Cisco to Cisco). But as I also said in my OP, I did this for years to a USB drive without any problem.

Can you check if you have write cache turned ON or OFF? I would love to know. 

 

iSCSI is the way (unless you can use local disks): better performance than SMB and NFS, I believe mostly due to multipathing. NFS would be preferred over SMB if you had to use a file-sharing protocol. That said, how is your volume formatted? ReFS has been known to occasionally cause issues on iSCSI volumes on NASes, which can be seen in the health checks at the end of the backups, or so I’ve read per @Gostev. With that said, I haven’t personally experienced it (to my knowledge... some of that is semi-transparent), but going forward, any NAS repos I have will be using NTFS and not ReFS. I might possibly try XFS with a Linux repo server, but I haven’t done the research and tried it out yet.

 

Edit: Just read your comment above and noted you’re trying to disable caching on the drives. To my knowledge, I haven’t seen that issue, and I have several Synology and QNAP NASes in place across my client base. Not to say it wouldn’t happen, but if you need me to check any of my Synology units, I certainly can, to see if any settings differ. My internal backups use a virtual backup server with a Synology NAS connected as an RDM disk in VMware, which is passed along to the Server 2012 R2 repo server with an NTFS-formatted volume. I’m guessing that’s going to be a pretty similar configuration.

 

Userlevel 2

After posting this, I found this link

https://forums.veeam.com/veeam-backup-replication-f2/synology-nas-as-repo-t77177.html

My iSCSI drive is formatted as NTFS, but I had the write cache turned ON for each disk. I just turned this off; I will report back in a few days. I will be surprised if this works, but I'm willing to give it a shot!
