Veeam v13: Disaster Recovery Runbooks and Documentation

  • March 13, 2026

eblack

 

Most Veeam environments are technically solid. The backup jobs run. The hardened repo is configured. Immutability is on. But when someone asks "can you walk me through exactly what happens if the backup server goes down at 2am on a Saturday," the answer is usually a pause followed by "well, we would figure it out."

That gap is what this article is about. A DR runbook is not a Veeam config export. It is an operational document that a trained engineer who has never touched your environment can pick up and execute under pressure. This article covers how to build one, what it needs to contain, how to test it, and how to produce the kind of audit ready documentation that satisfies a compliance reviewer or a customer asking for proof of recoverability.

 

1. What a DR Runbook Actually Is

A runbook is a step by step operational procedure for a specific failure scenario. It is not architecture documentation. It is not a Veeam best practices guide. It is a document that answers one question: given this specific failure, what do I do, in what order, and how do I know it worked.

For Veeam environments, you need at minimum one runbook per critical failure scenario. The scenarios that matter most in production:

 

Scenario | Scope | RTO Target
VBR server total loss | Rebuild or restore the backup server itself | 4 to 8 hours
Backup repository corruption or loss | Recover from offsite copy or immutable backup | 2 to 4 hours to restore operations
Ransomware event | Isolate, assess, restore from clean immutable restore point | Scenario dependent, document the decision tree
Primary site loss (DR failover) | Failover replicas or restore to DR site | Per SLA, typically 1 to 4 hours
Single VM or workload recovery | Restore specific VM or files from backup | 15 to 60 minutes
Cloud Connect tenant data recovery | MSP specific: restore tenant workloads from cloud repository | Per tenant SLA

 

 

A runbook you have never tested is a hypothesis, not a procedure. Every runbook in this article needs a test date and a test result before it goes into production use.
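The rule above can be checked mechanically if the runbook library is tracked as data. This is a minimal sketch, not a Veeam feature: the `RunbookStatus` type, the sample dates, and the stricter reading that the recorded result must be a pass are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class RunbookStatus:
    scenario: str
    last_tested: Optional[date]  # None means the runbook has never been tested
    last_result: Optional[str]   # "pass", "fail", or None


def production_ready(rb: RunbookStatus) -> bool:
    """Assumption: a runbook qualifies only with a test date AND a passing result."""
    return rb.last_tested is not None and rb.last_result == "pass"


# Made-up sample entries; scenario names are taken from the table above.
runbooks = [
    RunbookStatus("VBR server total loss", date(2026, 2, 1), "pass"),
    RunbookStatus("Ransomware event", None, None),
    RunbookStatus("Primary site loss (DR failover)", date(2026, 1, 15), "fail"),
]

not_ready = [rb.scenario for rb in runbooks if not production_ready(rb)]
print(not_ready)
```

Anything in `not_ready` is still a hypothesis and should not be relied on in an incident.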

 

2. VBR Server Recovery Runbook

The VBR server is the most critical single point of failure in most Veeam environments. Losing it does not lose your backup data, but it does lose your ability to restore until it is rebuilt. This runbook covers a full rebuild from the Veeam configuration backup.

Prerequisites Before You Need This Runbook

  • Veeam configuration backup is scheduled and running to a location outside the VBR server (network share, object storage, or separate repo)
  • VBR installer media is accessible offline (ISO or downloaded installer stored separately)
  • License file or license portal credentials are documented and stored in your password manager
  • PostgreSQL credentials for the VBR configuration database are documented (v13 migrated to PostgreSQL; the database superuser password set during install, for the postgres account rather than a SQL Server style SA login, is required for restore)
  • Service account credentials for VBR (the account VBR services run under) are documented
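The first prerequisite is the one that quietly goes stale. A small check like the following, run from a machine other than the VBR server, can confirm that a recent configuration backup file actually exists at the target location. It is a sketch under assumptions: the folder you pass in and the age threshold are yours to choose, and it only looks at file modification times of `.bco` files (the configuration backup file extension). It does not validate the file contents.

```python
import time
from pathlib import Path
from typing import Optional


def newest_config_backup(folder: Path) -> Optional[Path]:
    """Return the most recently modified .bco file in folder, or None if there is none."""
    candidates = list(folder.glob("*.bco"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def is_fresh(path: Path, max_age_hours: float, now: Optional[float] = None) -> bool:
    """True if the file was modified within the last max_age_hours."""
    now = time.time() if now is None else now
    age_hours = (now - path.stat().st_mtime) / 3600
    return age_hours <= max_age_hours
```

For a daily configuration backup, calling `is_fresh(newest, 26)` on the result of `newest_config_backup(...)` leaves a little slack for job runtime. A `None` result means the prerequisite is not met at all.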

 

Recovery Steps

  1. Provision replacement server. Match or exceed original hardware/VM specs. Install Windows Server (same version as original). Join to domain if applicable. Apply current patches.
  2. Install Veeam VBR v13. Run the installer. Select the same installation path as the original. When prompted for PostgreSQL, use the same PostgreSQL superuser (postgres) password as the original installation. Do not configure any infrastructure during setup.
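  3. Quiesce the server before importing config. Make sure no jobs are running or about to start before running the configuration restore. The restore is launched from the console, which needs the Veeam Backup Service running, so do not stop the Veeam services by hand; the restore wizard stops what it needs to. Running a config restore while jobs are active causes conflicts.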
  4. Run configuration restore. Open VBR console. Go to Home tab, click the VBR menu (top left), select Configuration Backup, then Restore. Point to the most recent configuration backup file. Enter the encryption password if the config backup was encrypted (it should be).
  5. Verify infrastructure reconnection. After restore completes, open Backup Infrastructure. Verify all managed servers show as connected. Re-enter credentials for any server that shows as disconnected. This is common for VSA connected servers where the Analytics Service needs to re-register.
  6. Verify repository access. Open Backup Repositories. Confirm all repositories are accessible and backup chains are visible under each repo. If using a hardened Linux repo, re-enter the single use credentials to re-establish the connection.
  7. Run a test restore. Select a non critical VM. Run an Instant Recovery to verify the full restore path is working before declaring recovery complete.
  8. Re-register with Veeam ONE. If Veeam ONE is in use, re-add the rebuilt VBR server in Veeam ONE configuration. The Analytics Service will reinstall automatically.

 

The configuration backup encryption password is the single most common recovery blocker. If this password is lost, the config backup cannot be restored. Store it in your password manager and in a sealed physical document in a secure location, never in the same system the config backup protects.

 

3. Ransomware Response Runbook

Ransomware runbooks are different from other DR runbooks because the first phase is not restoration. It is isolation and assessment. Restoring before you know what was hit and whether the infection vector is closed is how you restore infected data and extend the incident.

Phase 1: Isolate

  1. Isolate the VBR server from the network immediately if there is any indication it was reached. Veeam ONE malware detection alarms are the first signal in most environments.
  2. Do not shut down affected systems. Memory forensics may be needed. Isolate at the network switch or firewall level first.
  3. Verify the hardened repository is intact. SSH to the hardened repo server. Confirm the Veeam service user account has not been modified and immutability flags are set on backup files.
  4. Pull the active alarm list from Veeam ONE. Document every alarm that fired in the 72 hours before detection. This establishes the timeline.
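Step 4 is easy to get wrong by hand at 2am. Assuming the alarm list has been exported into records with a timestamp, the 72 hour window can be cut like this. The record shape (`name`, `fired_at`) and the sample alarms are hypothetical, not a Veeam ONE export format.

```python
from datetime import datetime, timedelta


def alarms_in_window(alarms, detected_at, hours=72):
    """Return alarms fired in the `hours` before detection, oldest first."""
    start = detected_at - timedelta(hours=hours)
    in_window = [a for a in alarms if start <= a["fired_at"] <= detected_at]
    return sorted(in_window, key=lambda a: a["fired_at"])


# Made-up sample data for illustration only.
detected = datetime(2026, 3, 7, 2, 0)
sample_alarms = [
    {"name": "example alarm A", "fired_at": datetime(2026, 3, 1, 9, 0)},    # too old
    {"name": "example alarm B", "fired_at": datetime(2026, 3, 5, 22, 15)},
    {"name": "example alarm C", "fired_at": datetime(2026, 3, 4, 3, 30)},
    {"name": "example alarm D", "fired_at": datetime(2026, 3, 7, 2, 30)},   # after detection
]

timeline = alarms_in_window(sample_alarms, detected)
print([a["name"] for a in timeline])
```

The sorted output is the start of the incident timeline. Alarms outside the window are not discarded from the source export, only from this view.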

 

Phase 2: Assess

  1. Identify the last clean restore point for each affected workload using the VBR console. Look for restore points that predate the earliest indicators of compromise.
  2. Use Veeam ONE Alarm History to identify the first backup job that may have backed up encrypted data. Work backward from that point: restore points created at or after it are suspect.
  3. Check SureBackup results history. The most recent successful SureBackup run gives you a verified clean restore point baseline.
  4. Engage your incident response process. DR runbook execution is parallel to, not a replacement for, security incident response.
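The selection logic in step 1 is simple enough to write down, which helps when several workloads have to be assessed quickly. A minimal sketch, assuming you have a list of restore point timestamps per workload and an agreed earliest indicator of compromise (IOC) time. Both inputs and the function name are illustrative; a timestamp check is a starting point for the assessment, not proof that a restore point is clean.

```python
from datetime import datetime
from typing import Optional, Sequence


def last_clean_restore_point(
    restore_points: Sequence[datetime], earliest_ioc: datetime
) -> Optional[datetime]:
    """Return the newest restore point strictly older than the earliest IOC, or None."""
    candidates = [rp for rp in restore_points if rp < earliest_ioc]
    return max(candidates) if candidates else None


# Made-up sample data for illustration only.
points = [
    datetime(2026, 3, 3, 22, 0),
    datetime(2026, 3, 4, 22, 0),
    datetime(2026, 3, 5, 22, 0),
    datetime(2026, 3, 6, 22, 0),
]
ioc = datetime(2026, 3, 5, 14, 30)

print(last_clean_restore_point(points, ioc))
```

A `None` result means no restore point in the list predates the compromise, and the offsite or archive copies need to be checked next.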

 

Phase 3: Restore

  1. Restore to an isolated network segment first. Do not restore directly to production until the infection vector is confirmed closed.
  2. Use Secure Restore for all workloads if antivirus scanning is configured. This scans the restore point before mounting it.
  3. Restore in priority order per your workload priority matrix (covered in Section 6 of this article).
  4. Document every restore action with timestamp, operator, restore point date, and target. This is your incident record.
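The incident record in step 4 is more reliable when it is appended by a script than typed from memory afterwards. This sketch writes one JSON line per restore action with the four fields named above plus the workload name. The file location, field names, and function are assumptions for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def log_restore_action(
    log_path: Path,
    operator: str,
    workload: str,
    restore_point: str,
    target: str,
    when: Optional[datetime] = None,
) -> dict:
    """Append one restore action to a JSON Lines file and return the entry."""
    entry = {
        "timestamp": (when or datetime.now(timezone.utc)).isoformat(),
        "operator": operator,
        "workload": workload,
        "restore_point": restore_point,  # date of the restore point used
        "target": target,                # where the workload was restored to
    }
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```

Append only writes keep earlier entries intact, which matters when the log is later handed over as evidence. Keep the log file somewhere that is not part of the affected environment.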

 

4. Replica Failover Runbook

This runbook applies to environments using Veeam replication to a DR site. It covers planned failover (maintenance or migration) and unplanned failover (primary site loss).

Planned Failover

  1. In VBR console, go to Home, then Replicas, then Ready. Identify the VMs to fail over.
  2. In VBR, right click the replica and select Planned Failover. This synchronizes one final delta before switching, minimizing data loss.
  3. Confirm the replica powers on at the DR site. Verify network connectivity and application health before proceeding.
  4. Update DNS or load balancer entries to point to the DR site IPs.
  5. Notify stakeholders that failover is complete and applications are running from DR.

 

Unplanned Failover

  1. In VBR console, go to Home, then Replicas, then Ready. Identify affected VMs.
  2. In VBR, right click the replica and select Failover Now. Select the most recent restore point or a specific point in time if the most recent point may be suspect.
  3. Verify replica is running. Test application connectivity before updating DNS.
  4. Document the restore point used and the timestamp of failover.
  5. Begin failback planning immediately. Unplanned failover means your DR site is now your primary. This is a temporary state.
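"Test application connectivity before updating DNS" is worth pinning down per application. As a minimal sketch, a TCP connect check against the replica's DR site address catches the most basic failures before any DNS change is made. It only proves a port is accepting connections, not that the application is healthy. Host names, ports, and function names here are placeholders.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def failed_checks(checks, probe=tcp_reachable):
    """checks is an iterable of (name, host, port). Returns the names that failed."""
    return [name for name, host, port in checks if not probe(host, port)]
```

An empty list from `failed_checks` is the minimum bar for moving on to the DNS or load balancer update. The `probe` parameter exists so the logic can be exercised in a tabletop or unit test without touching the network.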

 

 

Failover without a documented failback plan is an incomplete runbook. Every failover procedure needs a corresponding failback procedure or you will be making it up under pressure at the worst possible time.

 

5. Testing Your Runbooks

A runbook that has never been executed is not a runbook. It is a draft. Every runbook needs a test protocol and a documented test history.

Test Types and Cadence

 

Test Type | What It Validates | Recommended Cadence
Tabletop exercise | Team knows the runbook, roles are clear, decision points are understood | Quarterly
SureBackup automated verification | Backup data is restorable, application starts correctly | Weekly per job
Instant Recovery test | Full VM restore path works end to end | Monthly, rotating workloads
Full runbook execution in isolated environment | Entire procedure works as documented | Annually minimum, twice a year for critical workloads
Replica failover test | Replicas are current and failover completes within RTO | Twice a year
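A cadence only helps if something flags when it slips. The sketch below turns the cadence table into maximum intervals in days and reports which test types are overdue. The day counts are my approximations of "quarterly", "monthly", and so on, and the last run dates would come from your own test records.

```python
from datetime import date

# Maximum days between runs, approximated from the cadence table above.
CADENCE_DAYS = {
    "Tabletop exercise": 92,                  # quarterly
    "SureBackup automated verification": 7,   # weekly per job
    "Instant Recovery test": 31,              # monthly
    "Full runbook execution": 365,            # annually minimum
    "Replica failover test": 183,             # twice a year
}


def overdue_tests(last_run: dict, today: date) -> list:
    """Return test types never run, or last run longer ago than their cadence allows."""
    overdue = []
    for test_type, max_days in CADENCE_DAYS.items():
        last = last_run.get(test_type)
        if last is None or (today - last).days > max_days:
            overdue.append(test_type)
    return overdue


# Made-up sample dates for illustration only.
sample_last_run = {
    "Tabletop exercise": date(2026, 1, 10),
    "SureBackup automated verification": date(2026, 3, 1),
    "Instant Recovery test": date(2026, 3, 1),
    "Replica failover test": date(2025, 6, 1),
}
print(overdue_tests(sample_last_run, date(2026, 3, 13)))
```

A test type with no recorded run is treated as overdue, which matches the position that an unexecuted runbook is a draft.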

 

Documenting Test Results

Every test needs a record. Minimum fields for each test record:

  • Date and time of test
  • Operator who performed the test
  • Runbook version tested
  • Workloads included in the test
  • RTO achieved vs RTO target
  • Steps that failed or deviated from the runbook
  • Runbook updates made as a result
  • Sign off by team lead or manager

 

This test record is what you hand to an auditor. It is also what tells you whether your RTO targets are realistic before you discover they are not during an actual incident.
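If the test records are kept as structured data rather than free text, the RTO comparison falls out for free. One possible shape for the minimum fields listed above is sketched here. The class name, field names, and the `audit_ready` rule are assumptions, not a standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class DrTestRecord:
    started_at: datetime
    operator: str
    runbook_version: str
    workloads: List[str]
    rto_target: timedelta
    rto_achieved: timedelta
    deviations: List[str] = field(default_factory=list)       # steps that failed or deviated
    runbook_updates: List[str] = field(default_factory=list)  # changes made as a result
    signed_off_by: str = ""

    def met_rto(self) -> bool:
        """True if the achieved recovery time is within the target."""
        return self.rto_achieved <= self.rto_target

    def audit_ready(self) -> bool:
        """Assumption: a record needs a named operator and a sign off to be handed over."""
        return bool(self.operator) and bool(self.signed_off_by)
```

A record where `met_rto()` is false is still a useful record. It is the early warning that the target is unrealistic.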

 

6. Workload Priority Matrix

Not all workloads recover in the same order. A priority matrix documents which systems come back first, who owns the decision, and what the dependency chain looks like. Without this, recovery order is decided by whoever is loudest on the phone.

 

Tier | Description | Examples | RTO Target
Tier 1 | Infrastructure dependencies. Nothing else recovers without these. | Domain controllers, DNS, core networking, authentication | Under 1 hour
Tier 2 | Business critical applications. Revenue or operations stop without these. | ERP, core databases, primary file servers, email | 1 to 4 hours
Tier 3 | Important but not immediately blocking. | Secondary applications, reporting systems, dev environments used in production | 4 to 8 hours
Tier 4 | Can wait. Recovery can be deferred until Tier 1 to 3 are stable. | Dev, test, sandbox, non production workloads | Next business day

 

The priority matrix should be reviewed and signed off by business stakeholders, not just IT. The business owns the priority decision. IT implements it.
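Tiers and dependency chains together define a recovery order, and that order can be computed rather than argued about. The sketch below uses Python's standard `graphlib.TopologicalSorter` (Python 3.9 or later) so that dependencies always come first, and among workloads that are ready it picks the lowest tier. The workload names and dependencies are invented examples.

```python
import heapq
from graphlib import TopologicalSorter


def recovery_order(workloads: dict) -> list:
    """workloads maps name -> (tier, [dependency names]).

    Every dependency must itself be a key in workloads.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    sorter = TopologicalSorter({n: deps for n, (_tier, deps) in workloads.items()})
    sorter.prepare()
    ready, order = [], []
    while sorter.is_active():
        # Queue everything whose dependencies are already recovered.
        for name in sorter.get_ready():
            heapq.heappush(ready, (workloads[name][0], name))
        # Recover the lowest tier workload that is currently unblocked.
        _tier, name = heapq.heappop(ready)
        order.append(name)
        sorter.done(name)
    return order


# Invented example workloads for illustration only.
example = {
    "domain-controller": (1, []),
    "dns": (1, []),
    "erp-db": (2, ["domain-controller", "dns"]),
    "erp-app": (2, ["erp-db"]),
    "reporting": (3, ["erp-db"]),
    "sandbox": (4, []),
}
print(recovery_order(example))
```

The sandbox has no dependencies and could technically start first, but the tier ordering holds it back until everything more important is done. The tier values are the business decision; the code only enforces it.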

 

7. Compliance and Audit Documentation

For MSPs, enterprise environments under compliance frameworks (SOC 2, ISO 27001, HIPAA, PCI DSS), and any organization that has made contractual SLA commitments, runbooks are not optional. They are evidence. Here is what auditors and compliance frameworks actually ask for:

 

Document | What It Proves | Framework Relevance
DR runbook with version history | Documented recovery procedures exist and are maintained | SOC 2 CC9 and A1, ISO 27001 A.17, HIPAA 164.308(a)(7)
Test records with RTO results | Recovery procedures have been validated | SOC 2 CC9.1, PCI DSS 12.10.2
Workload priority matrix with stakeholder sign off | Recovery prioritization is defined and approved | ISO 27001 A.17.1.2
Backup job success logs (30 to 90 days) | Backups are running and completing successfully | SOC 2 A1, HIPAA 164.310(d)(2)
SureBackup verification reports | Backup data is verified restorable, not just present | SOC 2 A1.2, PCI DSS 12.10
Encryption key management documentation | Backup encryption keys are stored and accessible | SOC 2 CC6, HIPAA 164.312(a)(2)(iv)

 

Veeam ONE scheduled reports cover most of the operational evidence automatically. The Protected VMs report, Failed Job History, and SureBackup results can all be scheduled to email to a compliance inbox on a regular cadence, building an evidence trail without manual effort.

 

8. Runbook Template Structure

Every runbook in your library should follow a consistent structure so any engineer can pick it up without having to understand a new format under pressure.

 

STANDARD RUNBOOK TEMPLATE STRUCTURE

RUNBOOK: [Scenario Name]

Version: [x.x] | Last Tested: [Date] | Owner: [Team/Name]

Last Updated: [Date] | Classification: [Internal/Confidential]

 

SCENARIO

Brief description of the failure condition this runbook addresses.

 

SCOPE

Which systems, workloads, and sites are covered.

 

PREREQUISITES

What must be in place before executing this runbook.

Include: access requirements, tool locations, credential sources.

 

RTO TARGET

The recovery time objective this runbook is designed to meet.

 

DECISION CRITERIA

Under what conditions should this runbook be invoked?

Who has authority to invoke it?

 

STEPS

1. [Action] -- [Expected outcome] -- [Verification]

2. ...

 

ESCALATION

If step X fails or RTO is exceeded, contact: [Name, Role, Contact]

 

ROLLBACK

If recovery cannot proceed, what is the fallback state?

 

TEST HISTORY

Date | Operator | Result | RTO Achieved | Notes

[Date] | [Name] | Pass/Fail | [Time] | [Notes]
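Keeping the STEPS and TEST HISTORY lines in exactly the format above is easier if they are generated rather than typed. A small sketch that renders both line formats from structured input follows. The `Step` type and function names are illustrative, and the sample values are placeholders.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    action: str
    expected: str
    verification: str


def render_steps(steps: List[Step]) -> str:
    """Render numbered lines as: N. Action -- Expected outcome -- Verification."""
    return "\n".join(
        f"{i}. {s.action} -- {s.expected} -- {s.verification}"
        for i, s in enumerate(steps, start=1)
    )


def render_history_row(test_date: str, operator: str, passed: bool,
                       rto_achieved: str, notes: str = "") -> str:
    """Render one TEST HISTORY row in the pipe separated template format."""
    result = "Pass" if passed else "Fail"
    return f"{test_date} | {operator} | {result} | {rto_achieved} | {notes}"


steps = [
    Step("Provision replacement server", "Server is online", "Remote login succeeds"),
    Step("Run configuration restore", "Restore completes", "Managed servers show as connected"),
]
print(render_steps(steps))
print(render_history_row("[Date]", "[Name]", True, "[Time]", "[Notes]"))
```

Forcing every step through the three field shape makes it obvious when a step has no verification, which is the part most often left out.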

 

KEY TAKEAWAYS

✓  A runbook is an operational procedure for a specific failure scenario, not architecture documentation. It must be executable by a trained engineer under pressure.

✓  Cover at minimum: VBR server loss, ransomware response, replica failover, and single workload recovery.

✓  The ransomware runbook starts with isolation and assessment, not restoration. Restoring before the vector is closed extends the incident.

✓  Every runbook needs a test record. RTO targets that have never been validated are guesses.

✓  A workload priority matrix owned by the business, not IT, prevents recovery chaos when multiple systems are down simultaneously.

✓  Veeam ONE scheduled reports (Protected VMs, Failed Job History, SureBackup results) build your compliance evidence trail automatically.

✓  The config backup encryption password is the single most common recovery blocker. Store it outside the system it protects.

 

Published on anystackarchitect.com  |  Author: Eric Black  |  Veeam v13 Series

6 comments

coolsport00
  • Veeam Legend
  • March 13, 2026

Nice article ​@eblack . Kinda like a DR plan (ish) 😊

Appreciate the share!


kciolek
  • Influencer
  • March 13, 2026

nice article! thanks for the DR plan!


  • New Here
  • March 13, 2026

I appreciate the formula approach to the DR plan. Easily repeatable, it's far too easy to get lost into the weeds when you have so many different platforms as part of the DR matrix. Kudos to the table in Section 7. Compliance and Audit Documentation, simple and easy to ingest. Well done, Eric.


eblack
  • Author
  • Experienced User
  • March 13, 2026

Excellent points, thanks!


NicBackup
  • Veeam Vanguard
  • March 13, 2026

Very nice, thanks for sharing!


Chris.Childerhose

Thanks for sharing this.  Great reference and something I am looking to put together for my company.