Struggling with our datacenter backups - How are you approaching this?


Userlevel 1

Good day… We’re a small MSP with a cloud environment of about 120 VMs, mostly on Hyper-V (our target platform), with perhaps 40% still on VMware. We run a 10Gbps core with Veeam on a dedicated server, sub-interfaces into each customer network for guest interaction, and 12 × 14TB spindles behind a MegaRAID 9361-8i controller. A separate set of 12 × 2TB disks serves our internal systems.

We’re on the hook for 90 days of backups per SLA, plus the last set offsite (we try to keep at least the last 30 days, but we’ve been adjusting this). We’re running on an older-generation server with two Xeon E5-2620 v2s (6 cores @ 2.1GHz), 96GB of RAM, and ReFS on the repos. We’ve tried 1TB of CacheCade on the MegaRAID, but that size seems too small and it actually hurts performance, so we’ve turned it off for now.

We have been struggling to keep up with our backup workloads, despite the box seeming okay from a resource perspective at the OS level. We’re right at the concurrent process limits based on the number of repos, OS, proxy, etc. Jobs run long, then the copy jobs get hung up, and since the copy jobs hang, the next set of backup jobs fails. We can’t find the sweet spot between keeping backups reliable (via consistency checks) and keeping them from growing out of hand: active fulls we can’t retain due to space, and synthetic fulls that don’t really seem to work (they still consume tons of space). We ran ShadowProtect for the longest time with no issues, but its model is completely different (we ran in-guest only and offloaded copies using ImageManager), and it kept up with collapses and transfers without incident.

We’re trying to figure out where to focus next: add another 6 or 12 spindles to grow the RAID 50 (currently 2 × 6-disk RAID 5), add another volume as a new extent in the scale-out repository, swap the procs for more clock speed and cores, upgrade the link to our secondary datacenter from 1Gb to 10Gb, or get rid of ReFS? We can’t seem to get it right, and I’m wondering whether others have struggled this way. Tech support has not made many recommendations, and IMO most of the guidance is “squishy”. Anyone out there in a similar situation, and how is your platform working? Thanks in advance!


13 comments

Userlevel 7
Badge +17

Most Veeam processes use a processor core exclusively. You have 2 processors with 6 cores each in your backup server. This is not much…

Without knowing your environment better, my first assumption is that you don’t have enough cores to run all your processes.

Userlevel 1

Thanks @JMeixner for the quick suggestion. We’ve ordered two new 8-core @ 3.3GHz CPUs as a test (that’s as far as we can go on the underlying system board). From an OS perspective, we rarely see the current cores utilized much over 50%. Perhaps that’s due to constraints elsewhere, but we’re going after CPU first as it’s a relatively cheap fix. I should note that the source data comes off a mostly-SSD hybrid SAN, multipathed over two 10Gb NICs on each host. We’ve considered the off-host proxy, but the license cost to get the SAN support that feature needs is prohibitive (and we’re not sure how much it would help). We feel like ReFS is a problem, but mostly because we rarely use it and it’s difficult to tell whether it’s actually working.
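As a rough way to check whether 12 cores is the ceiling here: community sizing guidance for Veeam proxy/repository roles is often quoted as about one physical core per concurrent task. A minimal sketch of that arithmetic, assuming that rule of thumb and a small OS/services reservation (both are assumptions, not official sizing):

```python
# Rough sanity check: concurrent-task capacity vs. available cores.
# Assumption (hedged): the commonly cited rule of thumb of ~1 physical
# core per concurrent Veeam task; exact guidance varies by version,
# so treat these numbers as illustrative only.

def max_concurrent_tasks(cores: int, reserved_for_os: int = 2) -> int:
    """Cores left for backup tasks after reserving some for the OS/VBR services."""
    return max(cores - reserved_for_os, 0)

# The box in question: 2 x Xeon E5-2620 v2 = 12 physical cores.
cores = 2 * 6
tasks = max_concurrent_tasks(cores)
print(f"{cores} cores -> ~{tasks} concurrent tasks")  # 12 cores -> ~10 concurrent tasks

# With 25 backup jobs averaging 4-5 VMs each, peak demand can easily
# exceed this, so jobs queue and copy jobs back up behind them.
```

On this math the proxy, repository, and VBR roles sharing 12 older cores would explain jobs queuing even while per-core utilization looks modest.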

Userlevel 7
Badge +17

There is a really cool session on the VeeamON site:

Architect’s Desk: Sizing of Veeam Backup & Replication, Proxies and Repositories, by Tim Smith and Billy Cashwell.

This session explains the sizing of Veeam environments. Perhaps you can find a point that is misconfigured in your environment.

Browse to www.veeamon.com and register, then search for this session.

Userlevel 7
Badge +20

It’s amazing how quickly SSD sources can overwhelm the IO queue on a spinning-disk array like that. Do your backup job statistics show the target as the bottleneck?

 

How many concurrent jobs at peak time and how many concurrent tasks are they running?

 

You may have multiple bottlenecks here, and fixing one may not make much difference until the others are addressed too. It definitely sounds like insufficient CPU, and probably light on RAM as well. ReFS sizing guidance is roughly 1GB of RAM per 1TB of storage, then add the OS and the other Veeam roles on top… I would definitely investigate whether it’s your storage, though.

 

Also, is the OS at least Windows Server 2016, with the ReFS volume formatted at a 64KB allocation unit size? If not, you won’t be benefiting from fast clone.
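To put that 1GB-per-1TB rule against the numbers from the original post, here is a worked sketch. The usable capacities are my assumptions derived from the stated drive counts; the internal array's layout in particular is a guess:

```python
# Illustrative arithmetic for the ~1 GB RAM per 1 TB ReFS rule of thumb.
# Capacities are assumptions from the original post's drive counts
# (RAID 50 = two 6-disk RAID 5 groups of 14 TB drives).

def raid5_usable_tb(disks: int, disk_tb: float) -> float:
    # RAID 5 loses one disk's capacity to parity per group.
    return (disks - 1) * disk_tb

primary = 2 * raid5_usable_tb(6, 14)   # the 12 x 14 TB backup repo
internal = 2 * raid5_usable_tb(6, 2)   # assuming the 12 x 2 TB set is laid out the same way

refs_ram_gb = primary + internal       # ~1 GB RAM per TB of ReFS
print(f"~{primary:.0f} TB + ~{internal:.0f} TB of ReFS -> ~{refs_ram_gb:.0f} GB RAM "
      f"for ReFS alone, before the OS and Veeam roles")
```

By this estimate the ReFS volumes alone want on the order of 160GB, against 96GB installed, which fits the "light on RAM" suspicion above.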

Userlevel 7
Badge +7

Check these:

Veeam Goodies - rhyshammond.com

VSE (veeambp.com)

Userlevel 7
Badge +13

If you suspect the problem is with ReFS, test IO performance there directly. I would recommend tools like IOmeter or DiskSpd. To check performance between the proxy and the storage, run IO tests on the storage from the proxy as well: create a LUN for the proxy and test performance against that LUN directly on the proxy.

But normally, jobs don’t fail because of poor performance; they are just slow.

Also check these tips from @MicoolPaul:

  • Is Windows Server at least 2016, with current patches installed?
  • Is ReFS formatted with a 64k allocation unit size?
Userlevel 7
Badge +7

Test ReFS:

 

Userlevel 7
Badge +17

Fine example @Link State . Thank you :sunglasses::thumbsup_tone3:

Userlevel 1

Appreciate the guidance. We’re using 64k blocks on ReFS, but I believe also on the array (as opposed to the 256k that I’ve now read is suggested). I also didn’t realize the RAM overhead of ReFS, so that’s a good point too. Going to check out the sizing video and report back. Much appreciate the quick and detailed community support!

Another engineer working on this same issue here. It seems the issue might be related to ReFS corruption. Looking at another thread, it appears ReFS can detect corruption but only repairs it when backed by S2D.

We also see event log entries for ReFS failures. About a month ago, a failing memory module caused this machine to bluescreen repeatedly, and it ended up corrupting some backup files. We thought that was fully resolved after health checks, removing the offending backup file (and its backward chain), and restarting with full backups.

Event log message: ReFS Event ID 133
Detail: “The file system detected a checksum error and was not able to correct it. The name of the file or folder is ...”

Anybody run into ReFS checksum issues? We are thinking we need to completely rebuild the volume.


Answering some of the questions above:

  1. Windows Server at least 2016? Current patches installed?
    1. Windows Server 2019 w/ 1809, all critical/recommended patches installed
      1. Refs.sys version 10.0.17763.1971
  2. ReFS formatted with 64k allocation unit size?
    1. fsutil fsinfo refsinfo d:
      REFS Volume Serial Number :       0xc800ca4e00ca4364
      REFS Version   :                  3.4
      Number Sectors :                  0x0000003faa1e0000
      Total Clusters :                  0x000000007f543c00
      Free Clusters  :                  0x0000000031bcb961
      Total Reserved :                  0x000000000040085a
      Bytes Per Sector  :               512
      Bytes Per Physical Sector :       4096
      Bytes Per Cluster :               65536
      Checksum Type:                    CHECKSUM_TYPE_NONE

    2. fsutil fsinfo volumeinfo d:
      Volume Name : PrimaryBackupArray
      Volume Serial Number : 0xca4364
      Max Component Length : 255
      File System Name : ReFS
      Is ReadWrite
      Not Thinly-Provisioned
      Supports Case-sensitive filenames
      Preserves Case of filenames
      Supports Unicode in filenames
      Preserves & Enforces ACL's
      Supports Sparse files
      Supports Reparse Points
      Returns Handle Close Result Information
      Supports Named Streams
      Supports Open By FileID
      Supports USN Journal
      Supports Integrity Streams
      Supports Block Cloning
      Supports Sparse VDL
      Supports File Ghosting

  3. We have a total of 25 backup jobs and 3 copy jobs

    1. Within the backup jobs, the number of attached VMs varies between 1 and 12; the average is around 4-5 VMs

Let me know if I can provide any more details.
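For what it's worth, the 64k question can be confirmed straight from that `fsutil` listing; below is a small sketch that parses output like the one above. The comment on `Checksum Type` is my reading of the listing, worth verifying (e.g. with `Get-FileIntegrity` in PowerShell):

```python
# Quick parser for 'fsutil fsinfo refsinfo' output like the listing above,
# to confirm the 64 KB cluster size that fast clone expects.
sample = """\
REFS Version   :                  3.4
Bytes Per Sector  :               512
Bytes Per Physical Sector :       4096
Bytes Per Cluster :               65536
Checksum Type:                    CHECKSUM_TYPE_NONE
"""

def refs_cluster_bytes(fsutil_output: str) -> int:
    for line in fsutil_output.splitlines():
        if line.startswith("Bytes Per Cluster"):
            return int(line.split(":")[1].strip())
    raise ValueError("no 'Bytes Per Cluster' line found")

cluster = refs_cluster_bytes(sample)
print(f"cluster size: {cluster // 1024} KB, 64K OK: {cluster == 65536}")
# prints: cluster size: 64 KB, 64K OK: True

# CHECKSUM_TYPE_NONE in the listing suggests integrity streams are not
# enabled for file data on this volume (my reading), which would be
# consistent with Event ID 133 detecting but not correcting the error.
```

So the allocation unit size itself looks correct here; the corruption question is separate from the 64k formatting question.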

Userlevel 7
Badge +17

Normally ReFS should correct most problems itself.

I found some discussions about similar issues. For example:

https://social.technet.microsoft.com/Forums/ie/en-US/54480c20-3bd9-4144-a734-b719c7aa58ac/refs-the-file-system-detected-a-checksum-error-and-was-not-able-to-correct-it?forum=winserverfiles

Did you check the SMART data of your disks? All ok there?

Userlevel 7
Badge +7


Check:

Repository Design - Veeam Backup & Replication Best Practice Guide

Block Repositories - Veeam Backup & Replication Best Practice Guide

Lo zen e l’arte del corretto dimensionamento (“Zen and the art of correct sizing”) (veeam.com)

Userlevel 7
Badge +12

@SMExi ReFS can detect file corruption, but as you write, it can only repair such corruption if you utilize S2D. If the health check or the backup validator detects errors in your backup files, I would recommend rebuilding the volumes, or at least starting independent backup chains with active full backups. I would also recommend creating new jobs, to be sure that no existing block cloning is reused.

https://community.veeam.com/blogs-and-podcasts-57/veeam-utilities-veeam-backup-validator-353
