Most Veeam environments are technically solid. The backup jobs run. The hardened repo is configured. Immutability is on. But when someone asks "can you walk me through exactly what happens if the backup server goes down at 2am on a Saturday," the answer is usually a pause followed by "well, we would figure it out."
That gap is what this article is about. A DR runbook is not a Veeam config export. It is an operational document that a trained engineer who has never touched your environment can pick up and execute under pressure. This article covers how to build one, what it needs to contain, how to test it, and how to produce the kind of audit-ready documentation that satisfies a compliance reviewer or a customer asking for proof of recoverability.
1. What a DR Runbook Actually Is
A runbook is a step-by-step operational procedure for a specific failure scenario. It is not architecture documentation. It is not a Veeam best practices guide. It is a document that answers one question: given this specific failure, what do I do, in what order, and how do I know it worked?
For Veeam environments, you need at minimum one runbook per critical failure scenario. The scenarios that matter most in production:
| Scenario | Scope | RTO Target |
| --- | --- | --- |
| VBR server total loss | Rebuild or restore the backup server itself | 4 to 8 hours |
| Backup repository corruption or loss | Recover from offsite copy or immutable backup | 2 to 4 hours to restore operations |
| Ransomware event | Isolate, assess, restore from clean immutable restore point | Scenario dependent, document the decision tree |
| Primary site loss (DR failover) | Failover replicas or restore to DR site | Per SLA, typically 1 to 4 hours |
| Single VM or workload recovery | Restore specific VM or files from backup | 15 to 60 minutes |
| Cloud Connect tenant data recovery | MSP specific: restore tenant workloads from cloud repository | Per tenant SLA |

A runbook you have never tested is a hypothesis, not a procedure. Every runbook in this article needs a test date and a test result before it goes into production use.
2. VBR Server Recovery Runbook
The VBR server is the most critical single point of failure in most Veeam environments. Losing it does not cost you your backup data, but it does cost you the ability to restore until the server is rebuilt. This runbook covers a full rebuild from the Veeam configuration backup.
Prerequisites Before You Need This Runbook
- Veeam configuration backup is scheduled and running to a location outside the VBR server (network share, object storage, or separate repo)
- VBR installer media is accessible offline (ISO or downloaded installer stored separately)
- License file or license portal credentials are documented and stored in your password manager
- PostgreSQL credentials for the VBR configuration database are documented (PostgreSQL has been the default configuration database since v12; the superuser password set during install, referred to here as the SA password, is required for restore)
- Service account credentials for VBR (the account VBR services run under) are documented
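
The first prerequisite is easy to assume and rarely checked. The following is a minimal sketch of a freshness check, assuming configuration backups land as `.bco` files in a directory readable from a host other than the VBR server. The function names, the directory, and the age threshold are illustrative assumptions, not Veeam settings.

```python
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional


def newest_config_backup(backup_dir: Path) -> Optional[Path]:
    """Return the most recently modified .bco file in backup_dir, or None."""
    candidates = list(backup_dir.glob("*.bco"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def config_backup_is_fresh(backup_dir: Path, max_age: timedelta,
                           now: Optional[datetime] = None) -> bool:
    """True if the newest config backup is no older than max_age."""
    newest = newest_config_backup(backup_dir)
    if newest is None:
        # No config backup at all is the worst case, so report it as stale.
        return False
    now = now or datetime.now()
    age = now - datetime.fromtimestamp(newest.stat().st_mtime)
    return age <= max_age
```

Calling `config_backup_is_fresh(Path("/mnt/veeam-config"), timedelta(hours=26))` from a scheduled task on a separate host would flag a missing or stale file. The path and the 26 hour window (a daily schedule plus slack) are placeholders for your own values.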
Recovery Steps
- Provision replacement server. Match or exceed original hardware/VM specs. Install Windows Server (same version as original). Join to domain if applicable. Apply current patches.
- Install Veeam VBR v13. Run the installer. Select the same installation path as the original. When prompted for PostgreSQL, use the same PostgreSQL SA password as the original installation. Do not configure any infrastructure during setup.
- Stop Veeam services before importing config. In Services, stop all Veeam services before running the configuration restore. Running a config restore against a live VBR instance causes conflicts.
- Run configuration restore. Open VBR console. Go to Home tab, click the VBR menu (top left), select Configuration Backup, then Restore. Point to the most recent configuration backup file. Enter the encryption password if the config backup was encrypted (it should be).
- Verify infrastructure reconnection. After restore completes, open Backup Infrastructure. Verify all managed servers show as connected. Re-enter credentials for any server that shows as disconnected. This is common for VSA connected servers where the Analytics Service needs to re-register.
- Verify repository access. Open Backup Repositories. Confirm all repositories are accessible and backup chains are visible under each repo. If using a hardened Linux repo, re-enter the single-use credentials to re-establish the connection.
- Run a test restore. Select a non-critical VM. Run an Instant Recovery to verify the full restore path is working before declaring recovery complete.
- Re-register with Veeam ONE. If Veeam ONE is in use, re-add the rebuilt VBR server in Veeam ONE configuration. The Analytics Service will reinstall automatically.

Warning: The configuration backup encryption password is the single most common recovery blocker. If this password is lost, the config backup cannot be restored. Store it in your password manager and in a sealed physical document in a secure location, not in the same system the config backup protects.
3. Ransomware Response Runbook
Ransomware runbooks are different from other DR runbooks because the first phase is not restoration. It is isolation and assessment. Restoring before you know what was hit and whether the infection vector is closed is how you restore infected data and extend the incident.
Phase 1: Isolate
- Isolate the VBR server from the network immediately if there is any indication it was reached. Veeam ONE malware detection alarms are the first signal in most environments.
- Do not shut down affected systems. Memory forensics may be needed. Isolate at the network switch or firewall level first.
- Verify the hardened repository is intact. SSH to the hardened repo server. Confirm the Veeam service user account has not been modified and immutability flags are set on backup files.
- Pull the active alarm list from Veeam ONE. Document every alarm that fired in the 72 hours before detection. This establishes the timeline.
Phase 2: Assess
- Identify the last clean restore point for each affected workload using the VBR console. Look for restore points that predate the earliest indicators of compromise.
- Use Veeam ONE Alarm History to identify the first backup job that may have backed up encrypted data. Work backward from that point: only restore points created before it are candidates for a clean restore.
- Check SureBackup results history. The most recent successful SureBackup run gives you a verified clean restore point baseline.
- Engage your incident response process. DR runbook execution is parallel to, not a replacement for, security incident response.
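
The selection rule in Phase 2 (newest restore point that predates the earliest indicator of compromise) can be sketched as follows. This is an illustration only: the function name and the timestamps are made up, and in practice the restore point list would come from your VBR console or an export. A time-based cut gives you a candidate, not a verdict, so the chosen point still needs scanning before it is trusted.

```python
from datetime import datetime
from typing import Iterable, Optional


def last_clean_restore_point(restore_points: Iterable[datetime],
                             earliest_ioc: datetime) -> Optional[datetime]:
    """Return the newest restore point strictly older than the earliest IoC."""
    clean = [rp for rp in restore_points if rp < earliest_ioc]
    return max(clean) if clean else None


# Hypothetical nightly restore points for one workload.
points = [
    datetime(2025, 3, 1, 22, 0),
    datetime(2025, 3, 2, 22, 0),
    datetime(2025, 3, 3, 22, 0),
]
# Hypothetical earliest indicator of compromise from the incident timeline.
ioc = datetime(2025, 3, 3, 9, 30)

print(last_clean_restore_point(points, ioc))  # 2025-03-02 22:00:00
```

A return value of `None` means no restore point predates the compromise, which is itself a decision point the runbook should escalate rather than work around.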
Phase 3: Restore
- Restore to an isolated network segment first. Do not restore directly to production until the infection vector is confirmed closed.
- Use Secure Restore for all workloads if antivirus scanning is configured. This scans the restore point's data with antivirus before the workload is restored to its target.
- Restore in priority order per your workload priority matrix (covered in Section 6 of this article).
- Document every restore action with timestamp, operator, restore point date, and target. This is your incident record.
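
The incident record in the last step is easier to keep if logging a restore is a single call. Here is a minimal sketch that appends each action to a CSV file. The field names, the added `workload` column, and the function name are assumptions for illustration, not a Veeam feature.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

FIELDS = ["timestamp_utc", "operator", "workload", "restore_point", "target"]


def log_restore_action(log_path: Path, operator: str, workload: str,
                       restore_point: str, target: str) -> dict:
    """Append one restore action to a CSV incident log and return the row."""
    row = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "operator": operator,
        "workload": workload,
        "restore_point": restore_point,
        "target": target,
    }
    # Write the header only when the file is new or empty.
    write_header = not log_path.exists() or log_path.stat().st_size == 0
    with log_path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
    return row
```

Keep the log file somewhere the incident cannot reach, for the same reason the config backup lives off the VBR server.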
4. Replica Failover Runbook
This runbook applies to environments using Veeam replication to a DR site. It covers planned failover (maintenance or migration) and unplanned failover (primary site loss).
Planned Failover
- In VBR console, go to Home, then Replicas, then Ready. Identify the VMs to fail over.
- In VBR, right-click the replica and select Planned Failover. This synchronizes one final delta before switching, minimizing data loss.
- Confirm the replica powers on at the DR site. Verify network connectivity and application health before proceeding.
- Update DNS or load balancer entries to point to the DR site IPs.
- Notify stakeholders that failover is complete and applications are running from DR.
Unplanned Failover
- In VBR console, go to Home, then Replicas, then Ready. Identify affected VMs.
- In VBR, right-click the replica and select Failover Now. Select the most recent restore point or a specific point in time if the most recent point may be suspect.
- Verify replica is running. Test application connectivity before updating DNS.
- Document the restore point used and the timestamp of failover.
- Begin failback planning immediately. Unplanned failover means your DR site is now your primary. This is a temporary state.

Failover without a documented failback plan is an incomplete runbook. Every failover procedure needs a corresponding failback procedure or you will be making it up under pressure at the worst possible time.
5. Testing Your Runbooks
A runbook that has never been executed is not a runbook. It is a draft. Every runbook needs a test protocol and a documented test history.
Test Types and Cadence
| Test Type | What It Validates | Recommended Cadence |
| --- | --- | --- |
| Tabletop exercise | Team knows the runbook, roles are clear, decision points are understood | Quarterly |
| SureBackup automated verification | Backup data is restorable, application starts correctly | Weekly per job |
| Instant Recovery test | Full VM restore path works end to end | Monthly, rotating workloads |
| Full runbook execution in isolated environment | Entire procedure works as documented | Annually minimum, twice a year for critical workloads |
| Replica failover test | Replicas are current and failover completes within RTO | Twice a year |
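
A cadence only helps if something notices when it slips. The sketch below turns the table above into a check for overdue tests. The day counts are one interpretation of the cadences (for example, "quarterly" as 92 days, and the annual minimum for full runbook execution), and the key names are invented for this example.

```python
from datetime import date, timedelta

# Maximum days allowed between runs, interpreted from the cadence table.
CADENCE_DAYS = {
    "tabletop": 92,            # quarterly
    "surebackup": 7,           # weekly per job
    "instant_recovery": 31,    # monthly
    "full_runbook": 366,       # annually minimum
    "replica_failover": 183,   # twice a year
}


def overdue_tests(last_run: dict, today: date) -> list:
    """Return test types whose last run is missing or older than its cadence."""
    overdue = []
    for test_type, max_days in CADENCE_DAYS.items():
        last = last_run.get(test_type)
        if last is None or today - last > timedelta(days=max_days):
            overdue.append(test_type)
    return overdue


# Hypothetical test history: two test types have never been run.
history = {
    "tabletop": date(2025, 4, 1),
    "surebackup": date(2025, 5, 20),
    "instant_recovery": date(2025, 5, 10),
}
print(overdue_tests(history, today=date(2025, 6, 1)))
# ['surebackup', 'full_runbook', 'replica_failover']
```

Treating a test that has never run as overdue is deliberate: an untested runbook should show up on the list, not stay invisible.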
Documenting Test Results
Every test needs a record. Minimum fields for each test record:
- Date and time of test
- Operator who performed the test
- Runbook version tested
- Workloads included in the test
- RTO achieved vs RTO target
- Steps that failed or deviated from the runbook
- Runbook updates made as a result
- Sign off by team lead or manager
This test record is what you hand to an auditor. It is also what tells you whether your RTO targets are realistic before you discover they are not during an actual incident.
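
The minimum fields listed above map directly onto a small record type. This is a sketch under assumed names: the class, field names, and the two derived properties are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class RunbookTestRecord:
    tested_at: datetime
    operator: str
    runbook_version: str
    workloads: List[str]
    rto_target: timedelta
    rto_achieved: timedelta
    deviations: List[str] = field(default_factory=list)
    runbook_updates: List[str] = field(default_factory=list)
    signed_off_by: str = ""

    @property
    def met_rto(self) -> bool:
        """True when the measured recovery time is within the target."""
        return self.rto_achieved <= self.rto_target

    @property
    def audit_ready(self) -> bool:
        """A record counts as complete only once someone has signed it off."""
        return bool(self.signed_off_by.strip())
```

Storing the target and the achieved time side by side is the point: a record that says "pass" without both numbers cannot tell you whether the RTO target is realistic.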
6. Workload Priority Matrix
Not all workloads recover in the same order. A priority matrix documents which systems come back first, who owns the decision, and what the dependency chain looks like. Without this, recovery devolves into whoever is loudest on the phone getting their system first.
| Tier | Description | Examples | RTO Target |
| --- | --- | --- | --- |
| Tier 1 | Infrastructure dependencies. Nothing else recovers without these. | Domain controllers, DNS, core networking, authentication | Under 1 hour |
| Tier 2 | Business critical applications. Revenue or operations stop without these. | ERP, core databases, primary file servers, email | 1 to 4 hours |
| Tier 3 | Important but not immediately blocking. | Secondary applications, reporting systems, dev environments used in production | 4 to 8 hours |
| Tier 4 | Can wait. Recovery can be deferred until Tier 1 to 3 are stable. | Dev, test, sandbox, non production workloads | Next business day |
The priority matrix should be reviewed and signed off by business stakeholders, not just IT. The business owns the priority decision. IT implements it.
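
Tiers alone do not settle the order when workloads in the same or adjacent tiers depend on each other. One way to combine the two is a dependency-aware sort that prefers lower tiers whenever more than one workload is ready. The sketch below uses Python's standard `graphlib` module. The workload names, tiers, and dependencies are hypothetical.

```python
import heapq
from graphlib import TopologicalSorter  # Python 3.9+


def recovery_order(tiers: dict, depends_on: dict) -> list:
    """Order workloads so dependencies come first, lowest tier first among ready ones."""
    for workload, deps in depends_on.items():
        for name in [workload, *deps]:
            if name not in tiers:
                raise ValueError(f"{name} has no tier assigned")

    sorter = TopologicalSorter()
    for workload in tiers:
        sorter.add(workload, *depends_on.get(workload, ()))
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies

    ready, order = [], []
    while sorter.is_active():
        for workload in sorter.get_ready():
            heapq.heappush(ready, (tiers[workload], workload))
        _, workload = heapq.heappop(ready)  # lowest tier number wins
        order.append(workload)
        sorter.done(workload)
    return order


# Hypothetical environment.
tiers = {"dc01": 1, "dns01": 1, "sql01": 2, "erp01": 2, "reports01": 3, "dev01": 4}
depends_on = {
    "sql01": ["dc01", "dns01"],
    "erp01": ["sql01"],
    "reports01": ["sql01"],
}
print(recovery_order(tiers, depends_on))
# ['dc01', 'dns01', 'sql01', 'erp01', 'reports01', 'dev01']
```

The sort produces a proposed order only. The sign-off described above still belongs to the business.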
7. Compliance and Audit Documentation
For MSPs, enterprise environments under compliance frameworks (SOC 2, ISO 27001, HIPAA, PCI DSS), and any organization that has made contractual SLA commitments, runbooks are not optional. They are evidence. Here is what auditors and compliance frameworks actually ask for:
| Document | What It Proves | Framework Relevance |
| --- | --- | --- |
| DR runbook with version history | Documented recovery procedures exist and are maintained | SOC 2 CC9 and A1, ISO 27001 A.17, HIPAA 164.308(a)(7) |
| Test records with RTO results | Recovery procedures have been validated | SOC 2 CC9.1, PCI DSS 12.10.2 |
| Workload priority matrix with stakeholder sign off | Recovery prioritization is defined and approved | ISO 27001 A.17.1.2 |
| Backup job success logs (30 to 90 days) | Backups are running and completing successfully | SOC 2 A1, HIPAA 164.310(d)(2) |
| SureBackup verification reports | Backup data is verified restorable, not just present | SOC 2 A1.2, PCI DSS 12.10 |
| Encryption key management documentation | Backup encryption keys are stored and accessible | SOC 2 CC6, HIPAA 164.312(a)(2)(iv) |
Veeam ONE scheduled reports cover most of the operational evidence automatically. The Protected VMs report, Failed Job History, and SureBackup results can all be scheduled to email to a compliance inbox on a regular cadence, building an evidence trail without manual effort.
8. Runbook Template Structure
Every runbook in your library should follow a consistent structure so any engineer can pick it up without having to understand a new format under pressure.

STANDARD RUNBOOK TEMPLATE STRUCTURE

RUNBOOK: [Scenario Name]
Version: [x.x] | Last Tested: [Date] | Owner: [Team/Name]
Last Updated: [Date] | Classification: [Internal/Confidential]

SCENARIO
Brief description of the failure condition this runbook addresses.

SCOPE
Which systems, workloads, and sites are covered.

PREREQUISITES
What must be in place before executing this runbook.
Include: access requirements, tool locations, credential sources.

RTO TARGET
The recovery time objective this runbook is designed to meet.

DECISION CRITERIA
Under what conditions should this runbook be invoked?
Who has authority to invoke it?

STEPS
1. [Action] -- [Expected outcome] -- [Verification]
2. ...

ESCALATION
If step X fails or RTO is exceeded, contact: [Name, Role, Contact]

ROLLBACK
If recovery cannot proceed, what is the fallback state?

TEST HISTORY
| Date | Operator | Result | RTO Achieved | Notes |
| [Date] | [Name] | Pass/Fail | [Time] | [Notes] |
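
A consistent structure is easier to keep when it is checked rather than remembered. The following sketch flags runbooks that are missing any of the template's section headings. It assumes runbooks are stored as plain text with each heading on its own line. The function name is invented for this example.

```python
REQUIRED_SECTIONS = [
    "SCENARIO", "SCOPE", "PREREQUISITES", "RTO TARGET",
    "DECISION CRITERIA", "STEPS", "ESCALATION", "ROLLBACK", "TEST HISTORY",
]


def missing_sections(runbook_text: str) -> list:
    """Return required headings that do not appear on a line of their own."""
    # Strip whitespace and table pipes so boxed or indented headings still match.
    present = {line.strip(" |\t").upper() for line in runbook_text.splitlines()}
    return [section for section in REQUIRED_SECTIONS if section not in present]


# Hypothetical draft that is missing most of the template.
draft = """RUNBOOK: VBR server total loss
SCENARIO
Backup server is unrecoverable.
SCOPE
Primary site VBR server.
STEPS
1. Provision replacement server
"""
print(missing_sections(draft))
# ['PREREQUISITES', 'RTO TARGET', 'DECISION CRITERIA', 'ESCALATION', 'ROLLBACK', 'TEST HISTORY']
```

A check like this only proves the headings exist. Whether the content under them is executable is what the tests in Section 5 are for.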

KEY TAKEAWAYS
- A runbook is an operational procedure for a specific failure scenario, not architecture documentation. It must be executable by a trained engineer under pressure.
- Cover at minimum: VBR server loss, ransomware response, replica failover, and single workload recovery.
- The ransomware runbook starts with isolation and assessment, not restoration. Restoring before the vector is closed extends the incident.
- Every runbook needs a test record. RTO targets that have never been validated are guesses.
- A workload priority matrix owned by the business, not IT, prevents recovery chaos when multiple systems are down simultaneously.
- Veeam ONE scheduled reports (Protected VMs, Failed Job History, SureBackup results) build your compliance evidence trail automatically.
- The config backup encryption password is the single most common recovery blocker. Store it outside the system it protects.

Published on anystackarchitect.com | Author: Eric Black | Veeam v13 Series
