1. This Is Not Troubleshooting
Troubleshooting is what you do when a job fails and you want to fix it. Forensics is what you do when a job has been failing for three weeks and nobody noticed, a VM dropped out of protection and nobody caught it, or a ransomware event hit and you need to prove your last clean restore point. The question is not "how do I fix this job." The question is "what exactly happened, when did it start, what was affected, and can I prove it."
Forensics requires a different mindset than break-fix. You are not looking for the current error. You are reconstructing a timeline. You need session history, task-level detail, bottleneck data, and alarm records across a window of days or weeks. VBR stores all of this. The trick is knowing where to look and how to correlate it.
2. Where the Logs Live
VBR keeps two categories of evidence: the configuration database (PostgreSQL in v13; earlier versions used Microsoft SQL Server, typically the bundled Express edition) and the file system logs.
Database (Session and Task Records)
Every job run creates a session record. Every VM processed within that session creates a task record. These records include start time, end time, result (Success, Warning, Failed), data transferred, processing rate, bottleneck statistics, and the reason text for warnings and failures. You access this data through PowerShell (Get-VBRBackupSession, Get-VBRTaskSession) or the REST API (/api/v1/sessions). This is your primary forensic data source.
File System Logs
The on-disk logs live at %ProgramData%\Veeam\Backup on the VBR server. On the Linux VSA, the path is /var/log/VeeamBackup. Inside that directory, VBR creates subfolders organized by job name. Each subfolder contains log files for every session of that job. These are the detailed logs that Veeam Support asks for when you open a case. They contain the per-second operational detail that the database session records summarize.
| Log Location | Content | Forensic Use |
| %ProgramData%\Veeam\Backup\<JobName> | Per-session job logs. Agent and task logs within each session folder. | Detailed timeline of each job run. Exact error messages. Snapshot creation/removal timing. Data mover operations. |
| %ProgramData%\Veeam\Backup\Svc.VeeamBackup.log | Main Veeam Backup Service log | Service restarts, scheduler activity, job dispatch records. When jobs were queued vs when they actually started. |
| %ProgramData%\Veeam\Backup\Audit | FLR audit logs (CSV format) | Who restored what, when. Critical for incident response documentation. |
| Proxy and repository server logs | Data mover logs on each managed component | Network transfer issues, storage write failures, timeout details that are not visible in the VBR server logs alone. |
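The FLR audit CSVs in the table above can be searched directly during incident response. A minimal sketch, assuming a Windows VBR server; the account name is a hypothetical placeholder, and because column layouts vary between versions, this matches raw text rather than parsing specific fields:

```powershell
# Search the FLR audit CSVs for any entry mentioning a given account.
# $user is a hypothetical placeholder account; audit CSV columns vary by
# version, so match raw text instead of named columns.
$user = "CONTOSO\jsmith"
Select-String -Path "$env:ProgramData\Veeam\Backup\Audit\*.csv" `
    -SimpleMatch -Pattern $user |
    Select-Object Filename, LineNumber, Line
```

The hits give you the file and line to inspect, which is usually enough to reconstruct who restored what and when.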
LOG RETENTION
VBR rotates log files automatically. The default retention is controlled by the log level and file size settings. For forensic purposes, if you suspect you need logs from more than a few weeks ago, check whether the old log files have been rotated out. If they have, you are left with the database session records as your only timeline source. This is why the automated evidence pipeline (covered in the audit article in this series) matters. Export session data to an archive on a schedule.
3. The Session Hierarchy: Job, Session, Task
Understanding this hierarchy is the foundation of forensic analysis.
Job: A backup job is the configuration object. It defines which VMs are protected, which repository stores the data, the schedule, retention, and processing options. A job is a template. It does not contain results.
Session: Every time a job runs, VBR creates a session. The session has a start time, end time, and aggregate result. A session can be Success (all VMs processed without errors), Warning (some VMs had non-fatal issues), or Failed (one or more VMs could not be processed). The session result is the rollup. It does not tell you which specific VMs had problems.
Task: Within each session, VBR creates one task per VM (or per workload, for agent jobs). The task has its own start time, end time, result, data read, data transferred, processing rate, bottleneck, and reason text. The task is where the forensic detail lives. A session that shows "Warning" might have 49 successful tasks and 1 task with a CBT reset. You need the task level to find the specific VM.
4. Building a Timeline from PowerShell
The first forensic step is always the same: build a timeline of what happened. PowerShell gives you the fastest path to session and task data across a date range.
Pull All Failed and Warning Sessions for the Last 30 Days
# Connect to VBR
Connect-VBRServer -Server localhost

# Get all backup sessions from the last 30 days that were not Success
$cutoff = (Get-Date).AddDays(-30)
$sessions = Get-VBRBackupSession | Where-Object {
    $_.EndTime -ge $cutoff -and $_.Result -ne "Success"
} | Sort-Object CreationTime

# Export the timeline
$sessions | Select-Object JobName, CreationTime, EndTime, Result,
    @{N="Duration";E={$_.EndTime - $_.CreationTime}},
    @{N="Warnings";E={($_ | Get-VBRTaskSession | Where-Object {$_.Status -eq "Warning"}).Count}},
    @{N="Failures";E={($_ | Get-VBRTaskSession | Where-Object {$_.Status -eq "Failed"}).Count}} |
    Export-Csv -Path "C:\temp\forensic-timeline.csv" -NoTypeInformation
This gives you a CSV with every non-successful session, its timestamp, which job it belonged to, how long it ran, and how many VMs had warnings or failures. Sort by CreationTime and you have your timeline.
Drill Into a Specific Job's Task History
# Get the last 14 days of sessions for a specific job
$jobName = "Backup Job - Production SQL"
$cutoff = (Get-Date).AddDays(-14)
$sessions = Get-VBRBackupSession | Where-Object {
$_.JobName -eq $jobName -and $_.EndTime -ge $cutoff
} | Sort-Object CreationTime
# A foreach statement cannot be piped directly into Export-Csv,
# so collect its output in a variable first
$report = foreach ($session in $sessions) {
    $tasks = Get-VBRTaskSession -Session $session
    foreach ($task in $tasks) {
        [PSCustomObject]@{
            SessionStart = $session.CreationTime
            VMName       = $task.Name
            Status       = $task.Status
            Reason       = $task.Info.Reason
            Duration     = $task.Info.Progress.Duration
            ReadSize     = [math]::Round($task.Info.Progress.ReadSize / 1GB, 2)
            TransferSize = [math]::Round($task.Info.Progress.TransferedSize / 1GB, 2)
        }
    }
}
$report | Export-Csv -Path "C:\temp\forensic-job-detail.csv" -NoTypeInformation
Now you can see exactly which VMs failed or warned in each session, what the reason was, how much data was processed, and how long it took. If a VM started failing on a specific date, the timeline shows it. If a VM's transfer size suddenly spiked (indicating a CBT reset forced a full re-read), the data is there.
PERFORMANCE NOTE
Get-VBRBackupSession returns all sessions and can be slow in environments with thousands of sessions. For faster queries, use the .NET method directly: [Veeam.Backup.Core.CBackupSession]::GetByJob($job.Id) and filter on the result. This queries the database by job ID rather than pulling the entire session table into memory.
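A sketch of that faster path follows. Note that CBackupSession is an internal, unsupported .NET class, so the call and its returned properties may change between VBR versions; verify on your build before relying on it:

```powershell
# Query sessions for a single job via the internal .NET API instead of
# enumerating the entire session table. Unsupported API: verify per version.
$job = Get-VBRJob -Name "Backup Job - Production SQL"
$cutoff = (Get-Date).AddDays(-30)
$sessions = [Veeam.Backup.Core.CBackupSession]::GetByJob($job.Id) |
    Where-Object { $_.EndTime -ge $cutoff -and $_.Result -ne "Success" }
```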
5. Reading the Bottleneck Data
VBR tracks bottleneck statistics for every task. The bottleneck is the component in the data path that consumed the most time during processing. It tells you where the slowdown is. The possible values are Source, Proxy, Network, and Target.
| Bottleneck | Meaning | Forensic Implication |
| Source | The source storage or hypervisor is the slowest component. Data is being read slower than the proxy can process it. | Storage latency on the production side. Overloaded ESXi host. vSAN contention. Snapshot consolidation delays. |
| Proxy | The proxy CPU or memory is the constraint. Data arrives faster than the proxy can compress and deduplicate it. | Undersized proxy. Too many concurrent tasks for the core count. Compression set to High or Extreme. |
| Network | The network between proxy and repository is the bottleneck. | Bandwidth saturation. Backup traffic sharing a link with production. Missing 10 GbE upgrade. WAN link capacity for backup copy jobs. |
| Target | The repository is the slowest component. Data arrives faster than the repository can write it. | Slow repository storage. Too many concurrent streams to the repository. Repository disk full or near capacity. Deduplication appliance throttling. |
The bottleneck is displayed in the session statistics in the VBR console and is available in the task session data through PowerShell. For forensic purposes, the bottleneck pattern over time matters more than a single reading. If the bottleneck shifted from Source to Target two weeks ago, something changed on the repository side. Correlate the shift with infrastructure changes: firmware updates, storage migrations, new VMs added to the job, repository disk filling up.
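To see the pattern rather than a single reading, dump the bottleneck per session over the window. A sketch, assuming a hypothetical job name; the `Progress.BottleneckDetails` property path is an assumption and may differ by VBR version (on some builds the bottleneck text only appears in the output of `$session.GetDetails()`):

```powershell
# Trend the reported bottleneck per session over 30 days for one job.
# NOTE: Progress.BottleneckDetails is an assumed property path; fall back
# to parsing $_.GetDetails() if it is not present on your version.
$cutoff = (Get-Date).AddDays(-30)
Get-VBRBackupSession |
    Where-Object { $_.JobName -eq "Backup Job - Production SQL" -and $_.EndTime -ge $cutoff } |
    Sort-Object CreationTime |
    Select-Object CreationTime, Result,
        @{N="Bottleneck";E={$_.Progress.BottleneckDetails}}
```

Scanning the resulting column top to bottom makes a bottleneck shift, and its date, immediately visible.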
6. Reading the Task Logs
When PowerShell session data tells you what failed but not exactly why, the on-disk task logs provide the detail. Navigate to %ProgramData%\Veeam\Backup\<JobName> and find the session folder matching the timestamp of the failure. Inside, you will find task log files for each VM processed in that session.
The task log is a text file with timestamped entries showing every operation VBR performed for that VM: snapshot creation, data mover startup, block read operations, transport mode selection, data transfer, snapshot removal, and any errors encountered. For forensic purposes, the key patterns to search for are:
Snapshot operations: Search for "Creating snapshot" and "Removing snapshot." The time between these two entries is the snapshot lifetime. If the snapshot lifetime is hours instead of minutes, something is wrong with the backup processing speed or the snapshot removal is stuck. Long snapshot lifetimes cause VM performance degradation and can trigger vSphere alarms.
Transport mode fallback: Search for "Using transport mode." If VBR selected NBD when you expected hot-add, the proxy was not available on the correct host or the SCSI hot-add failed. Transport mode fallback is a common cause of unexpectedly slow backups.
CBT resets: Search for "Changed block tracking cannot be enabled" or "CBT is not enabled." A CBT reset forces a full re-read of the VM's disk instead of an incremental. This dramatically increases backup duration and data transfer. If CBT resets are happening repeatedly for the same VM, the underlying cause is usually a snapshot consolidation issue, a storage vMotion, or a hypervisor bug.
Timeout errors: Search for "timed out" or "Operation timed out." Timeouts during snapshot creation indicate an overloaded ESXi host. Timeouts during data transfer indicate network or storage issues. The timestamp in the log tells you exactly when the timeout occurred, which you can correlate with infrastructure monitoring data.
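The four searches above can be run in one pass with Select-String. A sketch, with `<JobName>` left as a placeholder for the actual job folder name:

```powershell
# Scan all task logs in a job folder for the key forensic patterns.
# Replace <JobName> with the actual job folder under %ProgramData%\Veeam\Backup.
$logDir = "$env:ProgramData\Veeam\Backup\<JobName>"
$patterns = "Creating snapshot", "Removing snapshot", "Using transport mode",
            "Changed block tracking cannot be enabled", "timed out"
Select-String -Path (Join-Path $logDir "*.log") -Pattern $patterns |
    Select-Object Filename, LineNumber, Line |
    Export-Csv "C:\temp\log-pattern-hits.csv" -NoTypeInformation
```

Because each log line is timestamped, the CSV doubles as a per-operation timeline you can merge with the session-level timeline from step 4.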
7. Identifying Silent Protection Gaps
The worst forensic finding is not a failed job. It is a VM that was never in a job at all. Or a VM that was removed from a job months ago and nobody noticed. Or a job that has been completing with Warning for weeks because one VM keeps failing but the job result is not Failed because the other 49 VMs succeed.
Find Unprotected VMs
# Get all VMs known to VBR across all managed hypervisors
$allVMs = Find-VBRViEntity -VMsAndTemplates | Where-Object { $_.Type -eq "VM" }
# Get all VMs that are in at least one backup job
$protectedVMs = Get-VBRBackup | ForEach-Object { $_.GetObjects() } |
Select-Object -ExpandProperty Name -Unique
# Find VMs that are NOT in any backup job
$unprotected = $allVMs | Where-Object { $_.Name -notin $protectedVMs }
$unprotected | Select-Object Name, Path | Export-Csv "C:\temp\unprotected-vms.csv" -NoTypeInformation
Write-Host "$($unprotected.Count) unprotected VMs found"
Run this script monthly. If you are doing forensics after an incident, run it immediately. The output tells you which VMs have no backup at all. Combine this with the Veeam ONE Protected VMs report for the same data in a recurring scheduled report.
Find VMs That Have Been Failing Silently
# Find VMs that have failed in every session for the last 7 days
$cutoff = (Get-Date).AddDays(-7)
$allTasks = Get-VBRBackupSession | Where-Object { $_.EndTime -ge $cutoff } |
ForEach-Object { $_ | Get-VBRTaskSession }
$failingVMs = $allTasks | Group-Object Name | Where-Object {
($_.Group | Where-Object { $_.Status -eq "Failed" }).Count -eq $_.Count
} | Select-Object Name, Count
$failingVMs | Export-Csv "C:\temp\silently-failing-vms.csv" -NoTypeInformation
This catches the VMs that fail in every single session but get buried in Warning-level job results because the job as a whole does not fail. These are the VMs that have been unprotected for days or weeks without anyone noticing.
8. Correlating with Veeam ONE Alarm History
Veeam ONE maintains an alarm history that records every alarm that fired, when it fired, and when it was resolved or acknowledged. If Veeam ONE is deployed, this history is your second forensic data source. It answers the question "were we alerted about this problem and if so, when?"
The key alarms for forensic analysis are: "Backup job finished with warnings," "Backup job failed," "VM backup RPO violation," "Repository free space is low," and the malware detection alarms. Export the alarm history for the investigation period and cross-reference it with the session timeline from step 4.
If Veeam ONE alarmed on the failure and nobody responded, the forensic finding is a process gap (nobody is watching the alerts). If Veeam ONE did not alarm because the alarm was disabled or the threshold was too permissive, the finding is a configuration gap. If Veeam ONE is not deployed at all, the finding is a monitoring gap. Each finding has a different remediation.
9. The Forensic Report
After the investigation, document what you found. A forensic report for a backup failure investigation should contain:
1. Timeline: Chronological sequence of events from the first indicator of the problem to detection. Include session timestamps, task results, and any infrastructure events that correlate.
2. Scope of impact: Which VMs were affected. How many restore points were missed. What the actual RPO gap is (the time between the last successful backup and the point of detection).
3. Root cause: The specific technical reason the failure occurred. Not "the job failed." The actual cause: CBT reset due to storage vMotion, proxy overcommitted after new VMs were added, repository filled to 98% and writes failed, credential expired on managed server.
4. Detection gap: How long the problem existed before it was detected. Why it was not detected sooner (no monitoring, alarm suppressed, alert fatigue, nobody checking email notifications).
5. Remediation: What was done to fix the immediate problem and what changes will prevent recurrence (monitoring improvements, capacity planning, job redesign, process changes).
This report is your deliverable. For MSPs, it goes to the customer. For internal teams, it goes to management. For incident response, it goes into the incident documentation. For auditors, it goes into the evidence archive.
10. Preventing the Next One
Forensics teaches you what failed. Prevention keeps it from happening again. Every forensic investigation should produce at least one improvement to the monitoring or operational process.
Enable RPO monitoring in Veeam ONE. The RPO violation alarm fires when a VM's last successful backup is older than a threshold you define. If your RPO is 24 hours, set the alarm to fire at 25 hours. This catches any VM that misses a backup cycle regardless of the job-level result.
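If Veeam ONE is not deployed, the same check can be approximated in PowerShell by finding the newest restore point per workload. A sketch using the 25-hour threshold above; note that Get-VBRRestorePoint enumerates every restore point, so this can be slow in large environments:

```powershell
# Flag workloads whose newest restore point is older than the RPO threshold.
$rpoHours = 25
$threshold = (Get-Date).AddHours(-$rpoHours)
Get-VBRRestorePoint |
    Group-Object Name |
    ForEach-Object {
        $latest = $_.Group | Sort-Object CreationTime | Select-Object -Last 1
        if ($latest.CreationTime -lt $threshold) {
            [PSCustomObject]@{ Name = $_.Name; LastBackup = $latest.CreationTime }
        }
    } | Export-Csv "C:\temp\rpo-violations.csv" -NoTypeInformation
```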
Monitor at the task level, not the job level. A job that completes with Warning is not OK if the same VM is the warning every night. Build a PowerShell report that flags any VM with consecutive failures across 3 or more sessions. Run it daily.
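A minimal sketch of that daily report, flagging any VM whose three most recent tasks in the window all failed (the three-session window and seven-day lookback are assumptions; adjust both to your schedule):

```powershell
# Flag VMs whose three most recent tasks in the window all failed.
$cutoff = (Get-Date).AddDays(-7)
$tasks = Get-VBRBackupSession |
    Where-Object { $_.EndTime -ge $cutoff } |
    Sort-Object CreationTime |
    ForEach-Object { Get-VBRTaskSession -Session $_ }
$tasks | Group-Object Name | Where-Object {
    $recent = $_.Group | Select-Object -Last 3
    $recent.Count -eq 3 -and
    ($recent | Where-Object { $_.Status -eq "Failed" }).Count -eq 3
} | Select-Object Name |
    Export-Csv "C:\temp\consecutive-failures.csv" -NoTypeInformation
```

Unlike the every-session script in section 7, this catches a VM that fails repeatedly even when it occasionally succeeds earlier in the window.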
Run the unprotected VM script monthly. Every new VM added to the environment is unprotected until someone adds it to a backup job. If your environment uses dynamic job inclusion (backup by tag, by folder, by resource pool), verify that the dynamic scope is actually catching new VMs.
Archive session data outside VBR. The VBR database retains session history based on its own housekeeping schedule. If you need to investigate something from 6 months ago, the database might not have it. Export session and task data to a separate store on a weekly schedule. A CSV export to a file share costs nothing and gives you a forensic archive that survives database cleanup.
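A sketch of that weekly export, scheduled via Task Scheduler; the share path is a hypothetical placeholder:

```powershell
# Weekly forensic archive: dump the last 7 days of session records to a
# dated CSV on a file share. \\fileserver\vbr-archive is a placeholder path.
$cutoff = (Get-Date).AddDays(-7)
$dest = "\\fileserver\vbr-archive\sessions-$(Get-Date -Format 'yyyy-MM-dd').csv"
Get-VBRBackupSession |
    Where-Object { $_.EndTime -ge $cutoff } |
    Select-Object JobName, CreationTime, EndTime, Result |
    Export-Csv -Path $dest -NoTypeInformation
```

Extend the Select-Object list with task-level fields (section 4) if you want the archive to support full drill-down later.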
Set repository capacity alarms at 70%, not 90%. By the time a repository hits 90%, you are days away from backup failures. At 70%, you have time to plan capacity additions or offload to the capacity tier. The alarm should create a ticket, not just send an email that gets buried.
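For a scripted version of the 70% check, repositories can be polled from PowerShell. A hedged sketch: GetContainer() and the Cached*Space properties are internal and version-dependent, so verify the property names on your VBR build before trusting the numbers:

```powershell
# Report repositories at or above 70% used. GetContainer() and the
# Cached*Space properties are internal and may differ by version.
Get-VBRBackupRepository | ForEach-Object {
    $c = $_.GetContainer()
    $total = $c.CachedTotalSpace.InBytes
    $free  = $c.CachedFreeSpace.InBytes
    if ($total -gt 0) {
        $used = [math]::Round((($total - $free) / $total) * 100, 1)
        if ($used -ge 70) {
            [PSCustomObject]@{ Repository = $_.Name; UsedPercent = $used }
        }
    }
}
```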
Key Takeaways
✓ Forensics is not troubleshooting. Troubleshooting fixes the current problem. Forensics reconstructs the timeline of what happened, when it started, what was affected, and proves it with evidence.
✓ VBR stores forensic data in two places: the PostgreSQL database (session and task records accessible via PowerShell and REST API) and the file system logs at %ProgramData%\Veeam\Backup (detailed per-session operational logs).
✓ The session hierarchy is Job, Session, Task. The job is the configuration. The session is one run of the job. The task is one VM within that run. Forensic detail lives at the task level.
✓ Build the timeline first. Pull all failed and warning sessions for the investigation window. Then drill into task-level data for specific jobs. Export everything to CSV before the database rotates it out.
✓ Bottleneck data (Source, Proxy, Network, Target) tells you where the slowdown is. The pattern over time matters more than a single reading. A shift in bottleneck indicates an infrastructure change.
✓ Silent protection gaps are worse than failed jobs. A VM that is not in any backup job, or a VM that fails in every session but is hidden by job-level Warning results, is unprotected. Script the detection and run it regularly.
✓ Correlate session data with Veeam ONE alarm history to determine whether the problem was detected and whether anyone responded. The detection gap is often the most important forensic finding.
✓ Every forensic investigation should produce a report (timeline, scope, root cause, detection gap, remediation) and at least one monitoring improvement to prevent recurrence.
Published on anystackarchitect.com
