Largest VM you back up?






R&D Test Archive file server:  720 TB allocated; 526 TB consumed, but running 45x 16TB ZFS compressed volumes…   “du” shows 985 TB used when all volumes are added up…  Cleanup scripts blow away 100TB of data per week based on various retention policies.   If I ever have to start this one from scratch again, I’ll be rather upset…  I’ve been baby-sitting this since it ran on an HP-UX ServiceGuard cluster with 8x 2TB volumes, through migrations to a RHEL 6.x VM, Solaris 10 physical, a site relocation (without downtime - thanks, Veeam Replication!), and finally a migration to a RHEL 7 Veritas cluster (physical)…  It just keeps growing and won’t die!

Runner-Up:  Level 4 support/developer NFS server: 625TB Allocated, 552TB used - SINGLE logical volume spanning dozens of PVs!

Current beasts of my environment (this is just one of five major sites for our team):

  • 61 TB (80 TB Allocated) CentOS 6.10 NFS Server
  • 5x 9-14 TB MS SQL Server clusters (2x servers each with independent SAN storage - these are physical workloads)
  • 24 TB (40 TB Allocated) CentOS 6.10 NFS Server
  • 26 TB (32 TB Allocated) CentOS 6.10 NFS Server
  • 74 TB (120 TB Allocated) Alma Linux 9 NFS Server (just rebuilt the OS disk and moved data disks from the former CentOS 6.10 system)
  • Single job with over two dozen build servers each with 6-20 TB of workspaces (81 TB used of 200 TB Allocated)
  • 30+ MySQL and related Linux database servers totaling 31 TB
  • Single job with 5 Windows file servers; three of those are 20-24TB (48TB Allocated each)
  • And the “beasts” are only 60% of this site’s protected capacity…   Not to mention double-digit PB of object storage!

I have found that with these large file servers, I have to start the jobs out with just one or two of their mount points, then remove the exclusion on another mount point for each backup run until I get the whole system backing up and CBT nice and happy.
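For anyone curious what that staging looks like on paper, here is a minimal Python sketch. The mount point names and per-run counts are made up, and it doesn’t call any Veeam API; it just plans which mount points stay excluded on each run:

```python
# Hypothetical staging plan for onboarding a big file server into a backup job:
# start with a couple of mount points and un-exclude one more on each run, so
# CBT and the first fulls only ever deal with a manageable chunk at a time.
mount_points = ["/", "/data01", "/data02", "/data03", "/data04", "/data05"]

def plan_runs(mounts, initial=2, per_run=1):
    """Yield (run_number, included, still_excluded) for each backup run."""
    included = list(mounts[:initial])
    remaining = list(mounts[initial:])
    run = 1
    yield run, list(included), list(remaining)
    while remaining:
        run += 1
        for _ in range(min(per_run, len(remaining))):
            included.append(remaining.pop(0))   # un-exclude one more mount point
        yield run, list(included), list(remaining)

for run, included, excluded in plan_runs(mount_points):
    print(f"Run {run}: include {included}; still excluded: {excluded}")
```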

@ejfarrar That is impressive.

On that 522TB server with the SINGLE volume, how big are your PVs?

What do you size your Windows file servers’ volumes/disks to as well?

 

Those are some VERY large servers to back up. Are they accessed pretty heavily, or is it mostly archive data?

 

Now that I am hitting the 100TB range for a few, I’m finding smaller VMDKs at least allow me some concurrency in Veeam for the backups.  My coworker would rather just size the VMDKs to 64 TB and forget about it lol :)


@ejfarrar That is impressive.

On that 522TB server with the SINGLE volume, how big are your PVs?

Sure - make me go look!   For the 552TB server (522 was a typo?), the VG for that LV has 40x 16TB PVs (640TB) for that particular volume, plus a second VG of 4x 64TB PVs for a 256TB LV…   So my numbers were off on that physical server…  I didn’t build that server personally, so I try not to mess with it outside of backups...

What do you size your Windows file servers’ volumes/disks to as well?

These Windows file servers have various disk sizes - each share is its own VMDK, sized to the approved project requirements for each of those teams.  This is an AD-integrated set of servers with strict retention policies…  Most volumes are at most 2TB, but a couple of the broad-audience volumes are 8TB.  I do guest filesystem indexing on these (most of my systems do, except for the beasts).

Those are some VERY large servers to back up. Are they accessed pretty heavily, or is it mostly archive data?

If it is “big”, it is heavily used in my world.  The test archive server for R&D gets hit hard with writes during automated testing overnight and on weekends, and mostly reads during the day.  It’s kind of like a yo-yo; anywhere from 6 to 40TB gets written on any given night, and every week 40-200TB gets cleaned up.

Now that I am hitting the 100TB range for a few, I’m finding smaller VMDKs at least allow me some concurrency in Veeam for the backups.  My coworker would rather just size the VMDKs to 64 TB and forget about it lol :)

There are advantages and disadvantages to each method.  Active full backups really take a performance hit (regardless of storage-integrated or network-based transfers) on the larger VMDKs.  The arrays have a hard time cleaning up and compacting them, but at the same time an array using dedupe/compression can make good use of large volumes in those respects.

As far as Veeam is concerned, my experience is that smaller VMDKs are going to give you the best performance during backups and recovery. I’m not an array or SAN expert, but the arrays behind my protected systems as well as my repositories do just fine with whatever size volumes are presented; cleanup/reclaim/defrag/compacting is obviously quicker on smaller volumes.  I’ve been pushing Veeam B&R to find breaking points since version 7…  One thing that has always made me reconsider using single large VMDKs is that if you ever need to do a recovery from storage snapshots, you’re only going to get the VMDK of the first hard drive on a VM easily - anything more than that and you are in for a lengthy manual process… If you have compute/storage capacity, sandbox your scenarios and test them out - Veeam can do what you need it to; you just have to iron out the details of your one-offs (hopefully not under pressure)...
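To put rough numbers behind the concurrency point, here is a quick back-of-the-envelope sketch. The 500 MB/s per-task rate and the assumption that all tasks really run in parallel are mine, purely for illustration:

```python
# Back-of-the-envelope: one 64TB VMDK processed as a single task vs. the same
# data split into 8x 8TB VMDKs processed concurrently. The per-task rate and
# the "everything runs fully in parallel" assumption are illustrative only.
TB = 1024**4
per_task_rate = 500 * 1024**2            # assumed ~500 MB/s per processing task
total_bytes = 64 * TB

single_vmdk_hours = total_bytes / per_task_rate / 3600

tasks = 8                                # 8x 8TB VMDKs, all processed at once
split_vmdk_hours = (total_bytes / tasks) / per_task_rate / 3600

print(f"1x 64TB VMDK : ~{single_vmdk_hours:.1f} h")
print(f"8x 8TB VMDKs : ~{split_vmdk_hours:.1f} h (only if proxy/repo can feed all 8 tasks)")
```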

 


@ejfarrar Thanks for that very interesting writeup and response.   I agree, management vs. performance vs. requirements is always a challenge for sizing.  I’ve been messing with queue depth and performance a lot lately, tuning some of our analytics VMs and databases, but there is also something nice about having a larger volume from a management perspective.

As a SAN guy, if you run dedupe too, having large volumes on the back end is beneficial as well. 

There really is no right way to do it or right answer, there is however a wrong way when things become unmanageable or the performance is bad. 😄


These impressive numbers gave me a headache but it is very interesting.

I'm curious how you back up that R&D file server. Agent? File share jobs?
On what type of repo and with what retention? Object storage, if I read that right?

How long does the active full that starts the backup chain take?

That one uses the agent on one of the two nodes of the cluster.  My wallet cringes at the new licensing model for file share jobs when I think of my servers (even though it technically isn’t my money, keeping costs down enhances profits, which enhances bonuses and salaries).  I have moved to purely scale-out repositories for our block systems, but we are about to re-assess since our on-prem S3 object storage is about to double in size (it’s already in the double-digit PB range)…  We are exploring various scenarios where block and object storage are used for all types of workloads…

I have about 800TB of scale-out repository capacity plus about the same on our dedupe appliance where our backup copies go…

All except the >130TB systems are now running reverse-incremental backups with a once-a-month active full.  Retention for on-disk restore points is 7 days; backup copy retention is 5x weekly, 3x monthly, 4x quarterly.  The exception to that is basically all of the big *nix file servers.  They have a 4-day retention policy on disk (offset by 7-day retention + 1 weekly on storage snapshots) and 2x weekly on backup copy.  A couple of the large *nix file servers are under legal retention policy for over a decade, so those have some longer backup copy restore points kept.

For big systems whose active full backups don’t run frequently enough to trim the restore points to our policy, a synthetic full backup runs (usually once a week) so we aren’t locked into old full backups that don’t line up with the retention policies…
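If anyone wants to rough out the on-disk footprint of a chain like that, here is a small sketch. The reduction ratio and daily change rate are placeholder assumptions, not my actual numbers:

```python
# Rough reverse-incremental footprint: newest full plus (retention - 1) rollback
# increments, so roughly full + (days - 1) * daily_change. All inputs below are
# placeholder assumptions, not figures from this environment.
def reverse_incremental_footprint_tb(source_tb, reduction=0.5,
                                     daily_change=0.03, retention_days=7):
    full = source_tb * reduction                              # compressed full
    increments = (retention_days - 1) * source_tb * daily_change * reduction
    return full + increments

for size_tb in (30, 60, 130):
    print(f"{size_tb} TB server -> ~{reverse_incremental_footprint_tb(size_tb):.1f} TB on disk")
```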


Awesome topic @Scott!

 

Some great conversation here around handling backup & restoration of these beasts. I’d chime in that unless it’s ransomware, I’d generally use quick rollback if I needed to restore a volume. If it was ransomware and the customer didn’t have space for an extra copy, I’d ask them to engage their infosec response teams on whether keeping the OS disk and purging the rest is sufficient. Mileage may vary.

 

As for largest VM, I had a really interesting one for sizing. The customer had somewhere between 100-150TB of data on a file server. What’s that big, you might ask? It was high-resolution, lossless images of the ocean floor and other such ocean-related imagery.

 

Because it was lossless, Veeam’s compression algorithm had a whale of a time (pun intended 🐳) compressing the data. IIRC the full backup was 20-30TB. The main kicker was that the imagery was updated periodically, so that entire 100-150TB of data was wiped from their production storage and new data sets were supplied. But again, it was 99.9% uncompressed images. Other system data churn was about 1-2GB per day.

 

Fast Clone kept the rest of the data requirements in check to avoid unnecessary repository consumption, and the backup repo could handle the data overall easily. Retention was also a saving grace as although they relied on their backups for “oh can we compare this to 2-3 weeks ago”, IIRC they only retained about 6 weeks of imagery. As for restorations, once this wasn’t the live dataset they only wanted to compare specific images, not entire volumes.
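Just restating the arithmetic from the numbers above (nothing new here, only the ratio spelled out):

```python
# Re-deriving the reduction ratio from the figures above: ~100-150TB of source
# imagery landing in a 20-30TB full works out to roughly 5:1 either way.
for source_tb, backup_tb in ((100, 20), (150, 30)):
    print(f"{source_tb} TB source -> {backup_tb} TB full (~{source_tb / backup_tb:.0f}:1)")
```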

 

Glad I talked them into a physical DAS style repo instead of the 1Gbps SMB Synology design they were going for!

Very cool.

 

I didn’t want to ask “What type of data” as I know there are many rules and regulations, but it’s very interesting finding out what people have to store and keep.   I have a TON of photos and videos that make up most of my data. I can’t really get into what for, but it doesn’t compress very well. Lucky for me I have a ton of databases that compress really well to make up for it. 

 

It’d be pretty neat looking at those ocean floor images. 

 

Good point with Fast Clone if you need a quick restore. I find in most cases I’m doing file-level restores off these monster file servers as people delete things, or more often accidentally copy something and lose track of where it went. We’ve all had a mouse slip. To go one step further, ADAudit is a great tool for finding those. Every time someone requests a restore, I do a search to see who deleted the file, and when and where it was deleted from. Half the time someone slipped and dropped the folder another level down; I move it back, saving duplicate data.

 

Another thing I try to do is keep low-churn data on the same servers; that way some of the really big archive servers don’t keep growing.  I find organizing by year in subfolders is a great way to go to a department/division and say, “Can we archive everything pre-2015?” or “How many years of X do we need to keep for legal purposes?”
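If the data isn’t already organized by year, a quick sketch like this (the path is a placeholder) can size up the “everything pre-2015” conversation by totaling file sizes per modification year:

```python
# Total file sizes by modification year under a share, to answer
# "how much could we archive if we cut off at year X?"
# The path below is a placeholder - point it at the share in question.
import os
from collections import defaultdict
from datetime import datetime

def size_by_year(root):
    totals = defaultdict(int)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            try:
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue                  # skip files we can't stat
            totals[datetime.fromtimestamp(st.st_mtime).year] += st.st_size
    return totals

for year, size in sorted(size_by_year("/mnt/archive_share").items()):
    print(f"{year}: {size / 1024**4:.2f} TB")
```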

 

 

 


My biggest VMs being backed up were a “giant” 2TB SQL Server and a 1.5TB Oracle server.

They were not so big, but they were very critical, and both ran on mechanical disks, so the backups took a long time to perform and sometimes the machines jammed. They were backed up at night, outside of working hours, and every time we made a change, fingers crossed not to mess up the full backup. 😂

 

I’m a huge fan of redundancy. I used to have some fiber / SAN infrastructure I inherited from the guy before me that would cause outages every upgrade, change, reboot etc.  Backups would cause systems to halt too.

 

I basically started fresh and rebuilt it all from the ground up. It works fine now but I still get nervous. lol


What kind of workload is it for you that “needs” those monster VMs?

For our customers with VMs of ~30TB max. it’s mostly fileservers that have gone nuts during decades of lacking governance… 😉

There we usually have the discussion about NAS backup being an alternative, with much better parallelism throughout the whole job run. Especially when V12 brings us NAS2tape.

Monster VMs tend to be the ones rolling in via a single thread at the end of a VM backup job.

The challenge with NAS backup for Windows/Linux servers is that it gets waaaaaay more expensive: 1 VM = 1 VUL vs. 30TB = 60 VUL…

I looked into NAS backup. Can’t afford it.

 

Let’s say I have 17 file servers at 50 TB each.

I currently have socket licensing on 2-CPU hosts; I could handle all those servers on 1-2 hosts, so 4 sockets max.

Even at a 6:1 conversion ratio, that is 24 VUL.

 

Now, if I switched to VUL licensing, full VM backups = 17 VUL, but I could probably handle another 50 VMs on these hosts, so I’m still ahead with sockets; those are paid upfront, though, so VULs could potentially work here at a slight increase.

 

17*50=850TB.  I’m going to let you guess how much that comes out to in NAS backup / VUL licenses.

At 500GB a VUL, that is like 1700 instead of 17.          
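Spelling that math out, with the 6:1 socket conversion and 500GB-per-VUL figures treated as the working assumptions from above:

```python
# License count comparison for 17 file servers at 50TB each.
servers, tb_each = 17, 50
total_tb = servers * tb_each                   # 850 TB

sockets = 4                                    # 1-2 dual-CPU hosts
socket_conversion_vul = sockets * 6            # assumed 6:1 socket-to-VUL conversion

per_vm_vul = servers * 1                       # VM backup: 1 VUL per VM

nas_vul = total_tb * 1000 / 500                # NAS backup: 1 VUL per 500 GB (1 TB = 1000 GB here)

print(f"Total data        : {total_tb} TB")
print(f"Socket conversion : {socket_conversion_vul} VUL")
print(f"Per-VM licensing  : {per_vm_vul} VUL")
print(f"NAS per 500 GB    : ~{nas_vul:.0f} VUL")
```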

 

Our file servers have crazy growth, mostly from video and legal requirements for how much data we have to keep. Some of it is 60+ years; some I have been told is “forever”.

 

The DB, Application, Web, and other servers are all reasonable in my environment.

 

With our DR and backup requirements too, it means keeping multiple copies of these on multiple SANs and tape.   My vendors all love me and probably owe me a few more lunches lol.



My biggest VMs being backed up were a “giant” 2TB SQL Server and a 1.5TB Oracle server.

They were not so big, but they were very critical, and both ran on mechanical disks, so the backups took a long time to perform and sometimes the machines jammed. They were backed up at night, outside of working hours, and every time we made a change, fingers crossed not to mess up the full backup. 😂

 

I was made aware of an 18TB Oracle backup (with the RMAN plugin)


My biggest VMs being backed up were a “giant” 2TB SQL Server and a 1.5TB Oracle server.

They were not so big, but they were very critical, and both ran on mechanical disks, so the backups took a long time to perform and sometimes the machines jammed. They were backed up at night, outside of working hours, and every time we made a change, fingers crossed not to mess up the full backup. 😂

 

I was made aware of an 18TB Oracle backup (with the RMAN plugin)

OK, my Oracle databases are not that big. The biggest is around 5-6 TB. But it is growing 😎


My largest is about a 120TB file server… which is getting ready to grow as the client is starting to put a lot of 4K video on it.  It’s been rough getting it through error checking and backup defrags.  Unfortunately, we don’t have enough space on the repo to set up incremental fulls due to the size.  It’s a process… and we’re still trying to find a better way for it.

The 120TB file server: is that a VM, a NAS, or a Linux/Windows system?


Well, I just got told I need to add another 50TB to this server, and I have another 300TB incoming.   It’s quite critical too.

 

I’m going to have to create a few servers most likely to spread the load, but maybe I’ll do a test in Veeam after to see how long it takes.

 

*Sigh* I’m going to need a few more SANs just for backup data, and a few in production. lol.   


Well, I just got told I need to add another 50TB to this server, and I have another 300TB incoming.   It’s quite critical too.

 

I’m going to have to create a few servers most likely to spread the load, but maybe I’ll do a test in Veeam after to see how long it takes.

 

*Sigh* I’m going to need a few more SANs just for backup data, and a few in production. lol.   

Whoa, Scott! Pushing it!


Ok, let’s turn this around ….. smallest VM?

Some tiny Linux VM with something between 3 and 6 GB… Was some very small Linux variant, don’t remember which one though….


Ok, let’s turn this around ….. smallest VM?

Well in Rickatron Labbin, I do many powered off "empty disk" VMs but I have backed up the SureBackup network appliance that has no disks.


Oh God 😲! 
How much time does it take to back up this one?

For me the max was around 10TB.

Daily… And the SAP logs at least every hour…

Backup is not the big problem after the initial full backup, I am afraid of a complete restore…

I have told the VM owner to split their disks into several VMDKs of at most 1 TB each, so we can restore with several sessions instead of one for these big VMs…
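A rough sketch of why the 1TB split helps on restore; the per-session throughput is an assumed number, purely for illustration:

```python
# Why the 1TB-per-VMDK split matters for restores: one big VMDK restores as a
# single stream, while 1TB VMDKs can restore in several sessions side by side.
# The per-session throughput below is an assumption for illustration only.
TB = 1024**4
per_session_rate = 300 * 1024**2               # assumed ~300 MB/s per restore session

def restore_hours(total_tb, vmdk_tb, parallel_sessions):
    vmdks = -(-total_tb // vmdk_tb)            # ceiling division
    waves = -(-vmdks // parallel_sessions)     # rounds of parallel sessions
    return waves * (vmdk_tb * TB) / per_session_rate / 3600

print(f"10 TB as a single VMDK : ~{restore_hours(10, 10, 1):.1f} h")
print(f"10 TB as 1TB VMDKs x4  : ~{restore_hours(10, 1, 4):.1f} h with 4 parallel sessions")
```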

 

I have this issue at work with growth. DFS has been a lifesaver for moving stuff, but it’s a struggle trying to get other people to keep the VMDKs down.  Every time I look, someone seems to have created 25TB+ VMDK files lol.

 

My biggest issue is tape.   Even with 50 VMDKs, it’s still a single VBK file.

 

Due to my many file servers, I do weekly fulls to tape, as it takes a few days going to 8 drives.  If I were to do incrementals, they usually fail during that time.  I like having a weekly full in case I lose a tape or something catastrophic happens.

 

I hope in future versions of Veeam there is the ability to split a VBK file into multiple files for tape backup performance using multiple drives. 
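For what it’s worth, the “few days” lines up with the raw drive math. The per-drive rate below is an assumed LTO-8-ish native figure, not a measurement, and the total is just in the ballpark of the file-server numbers earlier in the thread:

```python
# Why a weekly full to tape takes days: even streaming flat out to 8 drives,
# hundreds of TB is a multi-day job. The per-drive rate is an assumed
# LTO-8-ish native figure, not a measurement.
TB = 1024**4
data_tb = 850                                  # assumed total, roughly the estate above
drives = 8
native_mb_s = 300                              # assumed native throughput per drive

seconds = data_tb * TB / (drives * native_mb_s * 1024**2)
print(f"{data_tb} TB to {drives} drives at {native_mb_s} MB/s each: ~{seconds / 86400:.1f} days")
```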

Are you using GFS at all?


My largest is about a 120TB file server.  It’s been rough getting it through error checking backup defrags.  Unfortunately, we don’t have enough space on the repo to setup incremental full’s due to size.  It’s a process…..and we’re still trying to find a better way for it.

Out of curiosity, how long do the backups and health checks, etc take?

On my 115TB server I had to turn the health checks off completely. To be fair though, the Veeam SAN had no flash disks or SSDs, so it wasn’t a monster of a SAN.

I may just need to disable health checks.  I don’t like that idea, but the alternative is running checks/maintenance that takes days or weeks and causes several backups to be missed.
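For context on why those checks run so long on a volume that size, a quick estimate (the repository read rates below are assumptions):

```python
# Why a health check on a ~120TB chain runs for days: it has to re-read the
# backup data, so it's bounded by repository sequential read speed.
# The read rates below are assumptions for illustration.
TB = 1024**4
backup_tb = 120

for read_mb_s in (200, 500, 1000):
    days = backup_tb * TB / (read_mb_s * 1024**2) / 86400
    print(f"{backup_tb} TB at {read_mb_s} MB/s: ~{days:.1f} days")
```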


Oh God 😲! 
How much time does it take to back up this one?

For me the max was around 10TB.

Daily… And the SAP logs at least every hour…

Backup is not the big problem after the initial full backup, I am afraid of a complete restore…

I have told the VM owner to split their disks into several VMDKs of at most 1 TB each, so we can restore with several sessions instead of one for these big VMs…

 

I have this issue at work with growth. DFS has been a lifesaver for moving stuff, but it’s a struggle trying to get other people to keep the VMDKs down.  Every time I look, someone seems to have created 25TB+ VMDK files lol.

 

My biggest issue is tape.   Even with 50 VMDKs, it’s still a single VBK file.

 

Due to my many file servers, I do weekly fulls to tape, as it takes a few days going to 8 drives.  If I were to do incrementals, they usually fail during that time.  I like having a weekly full in case I lose a tape or something catastrophic happens.

 

I hope in future versions of Veeam there is the ability to split a VBK file into multiple files for tape backup performance using multiple drives. 


With V12 you get real per-VM backup files. This may help with your tape problem...

 

Pretty sure you should get separate files if you split the VMDKs into separate jobs.  It’s a bit of a pain, but might be helpful.



Ok, let’s turn this around ….. smallest VM?

64 Bytes?  x17,000 copies…  1.3 GB Thin Provisioned (1 GB of that is the ISO used by all 17,000 VMs):

 

Stress-testing some network services (DHCP, ARP tables, etc.) using PowerCLI to clone a single VM 1,000 times per run of the script - this is where I found that I could actually hit the maximum number of VMs per ESXi host AND per cluster (not to mention VM folders, virtual distributed switches, etc.).  Started this off with a “TinyCore Linux” Live CD VM, with just a couple of tweaks to the boot ISO to make the VM generate a hostname on bootup based on its MAC address and the date/time it booted…  This 2MB thin-provisioned VM with 1 vCPU, 128MB RAM, and 2MB video RAM was converted to a template (leaving the datastore ISO disk as part of the VM), and that template was used to deploy these VMs.  Top couple of rows of the report:

Name                     Status   Start time   End time     Size    Read  Transferred  Duration  Details
cxo-nimnetadm-vbr-A618   Success  12:14:02 AM  12:15:29 AM  1.3 GB  0 B   64 B         0:01:27
cxo-nimnetadm-vbr-A617   Success  12:14:02 AM  12:15:33 AM  1.3 GB  0 B   64 B         0:01:31
cxo-nimnetadm-vbr-A616   Success  12:14:02 AM  12:15:18 AM  1.3 GB  0 B   64 B         0:01:16
cxo-nimnetadm-vbr-A615   Success  12:14:02 AM  12:15:35 AM  1.3 GB  0 B   64 B         0:01:33


My largest is about a 120TB file server.  It’s been rough getting it through error checking backup defrags.  Unfortunately, we don’t have enough space on the repo to setup incremental full’s due to size.  It’s a process…..and we’re still trying to find a better way for it.

Out of curiosity, how long do the backups and health checks, etc take?

Good question! @JMeixner raised a concern on restoration. Have you had to deal with this @dloseke?



I was made aware of a 98 TB VM backed up by a customer in South Africa. And some Windows Servers in the ½ PB range as well.

½ PB is pretty good.  I know that VMware, Windows, etc. all support these monsters, but the manageability, portability, and time for backups are crazy.      I guess it’s like the previous generation of techs never allowing volumes over 2TB.  lol. That doesn’t scale today, but either I’m getting older and out of touch, or that is a ton of data in one spot. 🤣



I’ll add, I’m not anti-VUL or anti-NAS-backup 😀

I’ll probably get some VUL licenses for NAS backup going forward, I’m just going to be picky about where I use them.  NAS backup is awesome and I plan to use it, but more for backing up a NAS or share rather than a Windows Server VM.

 

Each backup should be treated its own way when it comes to requirements and pricing. If you have a NAS, it’s the perfect product.


Yep... always going to be Linux.  I think my Pi-hole server is pretty darned small… but it’s still probably a bit oversized at 16GB.  But I have seen VMs down in the 2GB or less range.
