
Veeam Job Design Patterns: Splitting, Grouping, Scheduling, and Consistency

  • April 6, 2026

eblack

 

1. Why Job Design Matters

Anybody can create a backup job. Open the wizard, add VMs, pick a repository, set a schedule, click Finish. The job runs. The question is whether it runs well at scale, whether it makes SureBackup testing practical, whether it handles application consistency correctly, and whether it avoids creating operational headaches six months from now when the environment has grown.

Job design is the difference between a backup environment that runs itself and one that requires constant intervention. Bad job design produces jobs that run too long, compete for proxy and repository resources, break application log chains, make restore operations slow, and make SureBackup configurations a nightmare. This article covers the patterns that work and the anti-patterns that do not.

2. Per-VM Backup Chains vs Per-Job Chains

VBR offers two backup file layout options: per-VM backup chains (one backup file chain per VM) and per-job backup chains (one backup file chain for the entire job). This setting is configured on the repository, not on the job.

Per-VM chains create separate .vbk and .vib files for each VM in the job. Each VM has its own independent chain. You can restore, compact, or health-check one VM without touching the others. Per-VM chains also improve write performance because VBR can write to multiple files in parallel. The tradeoff is more files on the repository file system and slightly higher metadata overhead.

Per-job chains create one .vbk and one .vib for the entire job. All VM data is interleaved in the same file. Synthetic operations, health checks, and backup copy operations cannot start until every VM in the job finishes processing. If one VM takes an hour longer than the rest, everything waits. Per-job chains also mean that corruption in the backup file potentially affects all VMs in the job, not just one.

Per-VM chains are the correct default for almost every deployment. The best practice guide recommends them for any job with more than a handful of VMs. Enable per-VM backup files on the repository before creating jobs that target it.

3. Job Sizing: How Many VMs Per Job

The best practice guide recommends 50 to 200 VMs per job as a good working range. Field experience from Veeam Vanguards and VMCA-certified architects suggests up to 300 VMs per job works well when using per-VM backup chains. Beyond 300, management complexity increases and the risk of a single slow VM delaying post-job synthetic operations becomes significant.

The floor is more important than the ceiling. Do not create one job per VM. One job per VM means hundreds of jobs in a moderately sized environment. Each job creates its own session entry, its own schedule, its own synthetic full window, and its own merge process. The VBR database load from hundreds of individual jobs running concurrently degrades performance. The sweet spot for concurrently running jobs is 80 to 100. If you have 500 VMs in 500 individual jobs, all scheduled at the same time, you are past that limit on day one.

THE LARGE FILE SERVER EXCEPTION

VMs with very large disks (3 TB+) should go in their own job or a job with other similarly large VMs. A 4 TB file server sharing a job with 50 small VMs means those 50 small VMs finish in minutes but the job session stays open for hours waiting for the file server to complete. This delays synthetic operations, health checks, and backup copy for the entire job.
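The sizing rules above can be sketched in a few lines. This is an illustrative Python sketch, not a Veeam API: it splits a VM inventory into jobs of at most 200 VMs and isolates very large VMs (3 TB+) into their own job, per the large file server exception. The inventory names and sizes are hypothetical.

```python
# Sketch: plan job membership from a VM inventory.
# Thresholds mirror the article: max 200 VMs per job, 3 TB+ VMs isolated.

def plan_jobs(vms, max_per_job=200, large_tb=3.0):
    """vms: list of (name, size_tb). Returns a list of jobs (lists of VM names)."""
    large = [name for name, size in vms if size >= large_tb]
    small = [name for name, size in vms if size < large_tb]
    # Chunk the small VMs into jobs of at most max_per_job members.
    jobs = [small[i:i + max_per_job] for i in range(0, len(small), max_per_job)]
    if large:
        jobs.append(large)  # large file servers get their own job
    return jobs

# Hypothetical inventory: 450 small VMs plus one 4 TB file server.
inventory = [(f"vm{i:03d}", 0.2) for i in range(450)] + [("fileserver01", 4.0)]
jobs = plan_jobs(inventory)
print([len(j) for j in jobs])  # → [200, 200, 50, 1]
```

The 4 TB file server lands in its own single-VM job, so its long runtime never holds open the session of the 450 small VMs.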


 

4. Grouping Patterns

How you group VMs into jobs depends on your operational priorities. There are four common patterns, and most environments use a combination.

Group by RPO/SLA Tier

Tier 1 VMs (databases, domain controllers, critical apps) go in a job that runs every 4 hours. Tier 2 VMs (application servers, internal tools) go in a job that runs every 12 hours. Tier 3 VMs (dev, test, utility) go in a job that runs daily. Each tier has a different schedule, different retention, and potentially a different repository (faster storage for Tier 1, cheaper storage for Tier 3). This is the most common pattern in enterprise environments.

Group by Application Stack

Put all VMs in an application stack into the same job. The web server, app server, and database server for a specific application all back up together. This pattern ensures that all components of the application have restore points from the same time window. It also makes SureBackup easier because you can test the entire application stack in a single SureBackup job by pointing it at one backup job.

Group by OS Type

Windows VMs in one job, Linux VMs in another. This improves deduplication ratios because VMs with the same OS share more common blocks (OS files, system libraries). The dedup improvement is most noticeable on the first full backup and diminishes on incrementals where application data dominates the change. This pattern is worth considering if deduplication is a significant factor in your storage cost.

Group by Location/Cluster

VMs on the same ESXi cluster or in the same site go in the same job. This optimizes transport mode selection (hot-add proxies work best when VMs and proxies are on the same cluster) and reduces cross-site network traffic for environments with multiple locations.
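The RPO/SLA tier pattern described above reduces to grouping VMs by a tier attribute and attaching per-tier schedule and retention. The following is a hedged sketch; the tier names, intervals, and retention counts follow the article, while the VM names and the `POLICY` table are hypothetical.

```python
# Sketch: group VMs into per-tier jobs with matching schedule and retention.
from collections import defaultdict

POLICY = {  # tier -> (backup interval in hours, restore points to keep) — example values
    "tier1": (4, 60),
    "tier2": (12, 30),
    "tier3": (24, 14),
}

def group_by_tier(vms):
    """vms: list of (name, tier). Returns {tier: {"vms", "interval_h", "retention"}}."""
    members = defaultdict(list)
    for name, tier in vms:
        members[tier].append(name)
    return {t: {"vms": v, "interval_h": POLICY[t][0], "retention": POLICY[t][1]}
            for t, v in members.items()}

vms = [("sql01", "tier1"), ("app01", "tier2"), ("dev01", "tier3"), ("dc01", "tier1")]
plan = group_by_tier(vms)
print(plan["tier1"]["vms"])         # → ['sql01', 'dc01']
print(plan["tier1"]["interval_h"])  # → 4
```

The same shape works for the other patterns: swap the tier attribute for application stack, OS type, or cluster, or combine keys (e.g. tier plus cluster) when you use several patterns at once.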

5. Scheduling: Parallel vs Chaining vs Staggered

Parallel (recommended). Schedule all jobs to start at the same time (or within a few minutes of each other). Let VBR's built-in task scheduler handle resource allocation. VBR queues tasks based on available proxy and repository task slots. If you have 10 jobs with 50 VMs each and your proxies can handle 20 concurrent tasks, VBR runs 20 VMs at a time and queues the rest. This is the fastest approach because VBR fills every available proxy and repository slot continuously.

Chaining (not recommended). Job B starts only when Job A finishes. If Job A runs long or fails, Job B is delayed. The backup window extends. If you have five chained jobs and the first one takes twice as long as expected, the last job starts hours late. The best practice guide explicitly recommends against chaining because it defeats VBR's intelligent load balancing. The one exception is very large VMs (50+ TB) where synthetic operations on one job can saturate the repository I/O and you need to serialize those specific jobs.

Staggered. Job A starts at 8:00 PM, Job B at 8:05 PM, Job C at 8:10 PM. This is a legacy pattern from products that could not manage concurrency. VBR handles concurrency natively. Staggering adds no benefit over parallel scheduling when proxy and repository task limits are configured correctly. The only argument for staggering is if you want visual separation in the console log, which is not worth the tradeoff of a longer backup window.
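Why chaining loses to parallel scheduling can be shown with a toy simulation. This is a simplified model of slot-based scheduling, not VBR's actual scheduler: with parallel starts, a global pool of proxy task slots stays full, while chaining leaves slots idle whenever the tail of one job blocks the start of the next. The job compositions and the 4-slot pool are hypothetical.

```python
# Toy model: per-VM durations in hours, processed by a pool of task slots.
import heapq

def parallel_finish(jobs, slots):
    """All jobs start together; one shared pool of `slots` processes VMs greedily."""
    tasks = sorted((d for job in jobs for d in job), reverse=True)  # longest first
    heap = [0.0] * slots  # time at which each slot becomes free
    for d in tasks:
        t = heapq.heappop(heap)       # earliest-free slot
        heapq.heappush(heap, t + d)   # occupy it for this VM
    return max(heap)

def chained_finish(jobs, slots):
    """Job B starts only when Job A's last VM finishes."""
    t = 0.0
    for job in jobs:
        t += parallel_finish([job], slots)
    return t

# Job A has one slow 5-hour VM; Job B is four quick 1-hour VMs.
jobs = [[5, 1, 1, 1], [1, 1, 1, 1]]
print(parallel_finish(jobs, slots=4))  # → 5.0
print(chained_finish(jobs, slots=4))   # → 6.0
```

In the parallel case Job B's VMs run on the three slots freed while the slow VM grinds on, so the window is bounded by the single slowest VM. Chained, Job B waits the full five hours before it can start.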

6. Backup Copy Job Design

Backup copy jobs move restore points from the primary repository to a secondary target (another repository, cloud, tape). The design question is whether to create one backup copy job per backup job or consolidate multiple backup jobs into fewer copy jobs.

One copy job per backup job is the simplest design. Each backup job has a matching copy job. Easy to understand, easy to manage, easy to troubleshoot. The downside is more jobs in the console and more copy sessions running concurrently.

Consolidated copy jobs reduce the number of copy jobs by pointing multiple backup jobs at one copy job. This requires that the backup jobs use per-VM backup chains (the copy job pulls individual VM restore points, not entire job files). This pattern works well when the copy target is a slow link (WAN, cloud) and you want to control how many concurrent copy streams hit the link.

Schedule backup copy jobs to start after the primary backup window. If your backup jobs run from 8 PM to 2 AM, schedule the copy job window to start at 3 AM. This avoids primary and copy jobs competing for the same repository I/O simultaneously.

7. SQL Always-On Availability Groups

All nodes in a SQL Always-On Availability Group must be in the same backup job. This is not optional. VBR coordinates transaction log processing across AG nodes. If the nodes are in different jobs, log chain consistency breaks. The restore will fail or produce an inconsistent database state.

VBR detects which node is the primary and which are secondaries. For secondary nodes, VBR uses a copy-only VSS backup type (VSS_BT_COPY) to avoid interfering with the AG's native log chain. For nodes that are primary for all their AGs, it uses a full VSS backup type (VSS_BT_FULL). The COPY flag applies per node, not per database. If a node is secondary for even one AG, VBR sets VSS_BT_COPY for the entire node. In active/active configurations where both nodes host primary AGs and secondary AGs, both nodes get copy-only backups. This is by design and does not break transaction log processing, but it means native SQL maintenance plans that rely on VSS_BT_FULL will not see Veeam's backup as a full.
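The per-node decision described above can be expressed as a small predicate. This is an illustrative sketch of the rule, not VBR's actual implementation: a node gets copy-only treatment if it is secondary for even one AG it participates in, and full treatment only when it is primary for all of them. The node and AG names are hypothetical.

```python
# Sketch: which VSS backup type a node receives under the per-node rule.

def vss_backup_type(node, ag_primaries):
    """node: node name. ag_primaries: {ag_name: primary_node} for every AG
    this node participates in. Returns the VSS backup type for the node."""
    if all(primary == node for primary in ag_primaries.values()):
        return "VSS_BT_FULL"   # primary for all of its AGs
    return "VSS_BT_COPY"       # secondary for at least one AG

# Active/active: sqlnode1 is primary for AG1 but secondary for AG2,
# so the whole node is processed copy-only.
print(vss_backup_type("sqlnode1", {"AG1": "sqlnode1", "AG2": "sqlnode2"}))  # → VSS_BT_COPY
print(vss_backup_type("sqlnode1", {"AG1": "sqlnode1"}))                     # → VSS_BT_FULL
```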

Enable application-aware processing for the job. Enable transaction log backup if you need point-in-time recovery. Increase the cluster timeout values (SameSubnetThreshold, CrossSubnetThreshold) to prevent failover during the snapshot creation window. KB1744 covers the specific timeout values to set.

DISTRIBUTED AVAILABILITY GROUPS

AlwaysOn Distributed Availability Groups (DAGs spanning multiple failover clusters) are not supported for transaction log processing in the current version. AlwaysOn Clusterless Availability Groups are also not supported. If you are running either of these configurations, use Veeam Agent with a failover cluster job type or back up each node separately and accept crash-consistent rather than application-consistent backups for the AG databases.


 

8. Exchange DAG Consistency

Exchange Database Availability Groups follow the same rule as SQL AGs: all DAG nodes must be in the same job. VBR coordinates with the Exchange VSS Writer to freeze only passive database copies, leaving active copies untouched. If the DAG nodes are in different jobs, the VSS coordination fails and you risk freezing an active database copy, which can trigger a failover.

Increase the cluster timeout values before your first backup run. The default timeout values are aggressive enough that the brief VM freeze during snapshot creation can trigger a failover in some environments. The Veeam best practice guide and KB1744 specify the recommended values: SameSubnetThreshold to 20, SameSubnetDelay to 2000, CrossSubnetThreshold to 40, CrossSubnetDelay to 4000.
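The KB1744 values quoted above fit in a small checkable table. This sketch only compares numbers; reading the live cluster settings is environment-specific (typically `Get-Cluster` in PowerShell) and outside the sketch, and the "current" values shown are hypothetical examples, not claimed defaults.

```python
# The recommended heartbeat values from KB1744, as cited in the article.
KB1744 = {
    "SameSubnetThreshold": 20,
    "SameSubnetDelay": 2000,
    "CrossSubnetThreshold": 40,
    "CrossSubnetDelay": 4000,
}

def below_recommendation(current):
    """Return {setting: (current, recommended)} for every value below KB1744."""
    return {k: (current.get(k, 0), v) for k, v in KB1744.items()
            if current.get(k, 0) < v}

# Hypothetical pre-tuning values — every setting needs raising:
gaps = below_recommendation({"SameSubnetThreshold": 10, "SameSubnetDelay": 1000,
                             "CrossSubnetThreshold": 20, "CrossSubnetDelay": 1000})
print(sorted(gaps))
```

A cluster already at or above the recommended values produces an empty result, which makes this easy to drop into a pre-backup health check.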

For Exchange DAGs on virtual machines, use image-level backup with application-aware processing. For physical Exchange servers or VMs with RDM disks, use Veeam Agent with a failover cluster job type.

9. SureBackup Testability

Job design directly affects how easy or hard it is to configure SureBackup. SureBackup tests restore points from a specific backup job. If your job design groups VMs by application stack, you can create a SureBackup job that tests the entire stack: boot the database server, verify it responds, boot the app server, verify it connects to the database, boot the web server, verify it responds on port 443.

If your job design scatters the VMs of a single application across multiple jobs (the database server is in "Tier 1 Backup" and the web server is in "Tier 2 Backup"), testing the full application stack in SureBackup requires configuring SureBackup to pull from multiple backup jobs. This works but is more complex to set up and maintain.

The practical advice: group VMs by application stack in the backup job if you plan to use SureBackup for full-stack validation. If you group by RPO tier instead, accept that your SureBackup tests will be per-tier rather than per-application, or build more complex SureBackup configurations that reference multiple jobs.

10. The Anti-Patterns

One VM per job. Creates hundreds of jobs. Overloads the VBR database. Makes the console unusable. Every synthetic operation runs as a separate session. The scheduler cannot optimize because each job is an independent entity. Use per-VM backup chains on the repository instead of per-VM jobs.

One massive job with every VM. A single job with 1,000 VMs means one session, one synthetic window, and one failure domain. If synthetic full takes 12 hours on the combined chain, no backup copy can start for 12 hours. Split into multiple jobs of 50 to 200 VMs each.

Job chaining for resource control. Use proxy and repository task slot limits instead. VBR's scheduler handles concurrency. Chaining defeats it. If Job A fails, everything downstream stops. Set task limits on the proxy (max concurrent tasks) and repository (max concurrent tasks) and let VBR queue work automatically.

SQL AG nodes in separate jobs. Transaction log chain breaks. Restore produces inconsistent state. All AG nodes must be in the same job. No exceptions.

Mixing 4-hour RPO and 24-hour RPO VMs in the same job. The job schedule is set to 4 hours to meet the Tier 1 RPO. The Tier 3 VMs in the same job get backed up every 4 hours even though they only need daily. This wastes proxy cycles, repository storage, and backup window time on unnecessary restore points. Split by RPO tier.

Ignoring the synthetic full window. Synthetic full operations are I/O intensive on the repository. If three jobs with synthetic full enabled on Saturday all target the same repository, Saturday becomes the day the repository is saturated for hours. Stagger synthetic full days across jobs targeting the same repository. Job A gets synthetic full on Saturday, Job B on Sunday, Job C on Monday.
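The stagger described above is a round-robin assignment of synthetic full days per repository. A minimal sketch, with hypothetical repository and job names:

```python
# Sketch: rotate synthetic full days across the jobs targeting each repository
# so no two of them hit the same repository with a synthetic full on one day.
from itertools import cycle

def stagger_synthetic_fulls(jobs_by_repo,
                            days=("Sat", "Sun", "Mon", "Tue", "Wed", "Thu", "Fri")):
    """jobs_by_repo: {repo: [job names]}. Returns {job: synthetic full day}."""
    plan = {}
    for repo, jobs in jobs_by_repo.items():
        # Rotation restarts per repository; only same-repo jobs need separating.
        for job, day in zip(jobs, cycle(days)):
            plan[job] = day
    return plan

plan = stagger_synthetic_fulls({"repo1": ["JobA", "JobB", "JobC"]})
print(plan)  # → {'JobA': 'Sat', 'JobB': 'Sun', 'JobC': 'Mon'}
```

With more than seven jobs on one repository the days wrap around, at which point the week simply cannot isolate every job and you should consider splitting the repository load instead.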

Key Takeaways

Per-VM backup chains are the correct default. Enable them on the repository. They improve parallel write performance, isolate failures per VM, and allow independent restore/compact/health-check per VM.

Job sizing sweet spot: 50 to 200 VMs per job. Up to 300 with per-VM chains. Do not create one job per VM (overloads the database) or one massive job for everything (one failure domain, one synthetic window).

Group VMs by RPO/SLA tier for schedule alignment, by application stack for SureBackup testability, by OS type for deduplication, or by cluster/location for transport mode optimization. Most environments combine patterns.

Schedule jobs in parallel. Let VBR's task scheduler handle concurrency via proxy and repository task limits. Job chaining is a legacy pattern that defeats intelligent load balancing.

All SQL Always-On AG nodes must be in the same job. All Exchange DAG nodes must be in the same job. VBR coordinates VSS behavior across nodes within a job. Splitting them across jobs breaks log chain consistency.

Increase cluster timeout values (KB1744) before the first backup of SQL AGs or Exchange DAGs to prevent failover during snapshot operations.

Design jobs with SureBackup in mind. VMs grouped by application stack in the backup job make full-stack SureBackup validation straightforward. VMs scattered across tier-based jobs require more complex SureBackup configurations.

Stagger synthetic full days across jobs that target the same repository to avoid saturating repository I/O on a single day.


 

Published on anystackarchitect.com

9 comments

Chris.Childerhose

Great points to consider for jobs definitely.  Nice way to think about things for your jobs.  Great post.


coolsport00
  • Veeam Legend
  • April 6, 2026

Good post Eric. Really good things to consider when creating various Jobs. Well done 👍🏻


eblack
  • Author
  • Influencer
  • April 6, 2026

Good post Eric. Really good things to consider when creating various Jobs. Well done 👍🏻

Thanks!


kciolek
  • Influencer
  • April 6, 2026

great article ​@eblack! thanks for sharing!


Jason Orchard-ingram micro

Awesome Breakdown ​@eblack.


Tommy O'Shea
  • Veeam Legend
  • April 7, 2026

This is a great writeup that encapsulates exactly what I see in the field when optimizing Veeam implementations. Great advice all around. 

The only thing I can think to add is that for the parallel job processing, jobs that must start as soon as possible after their scheduled start can be configured to use the “High Priority” checkbox. This ensures that during the resource allocation stage, those jobs get brought to the front of the line.


lukas.k
  • Influencer
  • April 7, 2026

Very nice writeup and a good coverage on all the important parameters.

 

My 2 cents on this based on my field experience:

Afaik there is no longer a “hard” VM limit per job. I’ve seen deployments with around 800 VMs per job in the field working fine. It is really important to put enough effort into the component sizing here, because the components become the limiting factor.

 

There are always good reasons for specific job designs or specific settings. In general, I don’t recommend using per-job backup chains anymore (you would give up a lot of flexibility), and the same goes for group-by-OS jobs. Veeam has been able to mix Windows and Linux VMs in the same job for years now, so, if there is no specific reason for it, I would no longer pay attention to that for “generic” VMs.

 

Again - there are always reasons for exceptions, as we all know from the field. :)


eblack
  • Author
  • Influencer
  • April 7, 2026

This is a great writeup that encapsulates exactly what I see in the field when optimizing Veeam implementations. Great advice all around. 

The only thing I can think to add is that for the parallel job processing, jobs that must start as soon as possible after their scheduled start can be configured to use the “High Priority” checkbox. This ensures that during the resource allocation stage, those jobs get brought to the front of the line.

Very good point. 


Jason Orchard-ingram micro

Great article. Gaining a clear understanding of the limitations surrounding concurrent job processing is key, as these factors can significantly affect the total time required to complete backups.