Question

Veeam Backup Service gets completely stuck at random


Userlevel 4
Badge

I learned a lot about Veeam during the last few weeks and I am kind of excited about this software. I resolved almost all problems, but there is still one issue I can't get a grip on. I have several backup jobs in a chain. No parallel jobs. I have a lightning-fast proxy (Veeam suggests a maximum of 24 parallel jobs).

My persisting problem is that jobs get stuck absolutely randomly. Sure, after hours the stuck job fails, goes into retry1 and finishes successfully. But that's not satisfying… Is there anything I missed about jobs getting stuck? Just for your information: jobs launch and run. They get stuck somewhere after 20%, 50% or 90%. So absolutely random… I noticed that the “Veeam Backup Service” gets completely frozen. This service can't be stopped by anything other than a server restart.


39 comments

Userlevel 7
Badge +6

Chaining is no longer considered best practice as you lose out on the parallel task capabilities.  Also, note that proxy and repository roles are capable of parallel tasks, but it’s going to be dependent on how many CPU cores you have.  I max it to the number of cores available.

Finally, for the failures, grabbing the logs and telling us what the error is when it fails would be helpful as we can’t do much without more detail on the actual failure.

Userlevel 4
Badge

Well, I doubt it's caused by a lack of CPU power, because on some days all jobs run smoothly and succeed without a retry and within an absolutely acceptable time (incremental as well as full). The failures I am facing are absolutely random. But I will collect logs and post them. Thanks in advance :)

Userlevel 4
Badge

I went through some log files but couldn't find anything unusual. The logs basically show that I tried to stop hanging jobs and then restarted the backup server… Because of the tons of log information, could someone advise me which log file(s) are most important?
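When the log folder is too large to read by hand, a small script can narrow it down. The following is a minimal Python sketch, not a Veeam tool: the keyword list and the function name `find_suspicious_lines` are assumptions chosen for illustration, and the log directory has to be supplied by you (on a default installation the backup server logs usually live under `C:\ProgramData\Veeam\Backup`, but verify that on your system).

```python
from pathlib import Path

# Illustrative keywords only (assumption); extend them with whatever
# message text shows up in the job statistics window.
KEYWORDS = ("error", "resource not ready", "pipeline hanged")


def find_suspicious_lines(log_dir, keywords=KEYWORDS):
    """Return (file name, line number, text) for every *.log line containing a keyword."""
    hits: list[tuple[str, int, str]] = []
    for path in sorted(Path(log_dir).rglob("*.log")):
        # errors="replace" keeps the scan going if a log has odd encoding
        with path.open(encoding="utf-8", errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                lowered = line.lower()
                if any(k in lowered for k in keywords):
                    hits.append((path.name, lineno, line.strip()))
    return hits


# Hypothetical usage; adjust the path to your own log folder:
# for name, lineno, text in find_suspicious_lines(r"C:\ProgramData\Veeam\Backup"):
#     print(f"{name}:{lineno}: {text}")
```

Sorting the hits by timestamp and looking at the lines just before the first one is often more telling than the error line itself.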

Userlevel 4
Badge

For example log file for the hanging job says:

  Resource not ready: gateway server
Processing finished with errors at 20.03.2023 21:17:37

 

This would match my perception that the “backup service” got completely stuck.

 

Userlevel 7
Badge +6

Hello @Michail 

Which transport mode does the proxy use? Did you configure an antivirus exclusion list? What is the bottleneck shown in the job progress?

Userlevel 4
Badge

Hello Moustafa, good to see you again :) I use transport mode “Virtual Appliance” and I added all Veeam sources to the exclusions. The bottleneck on the failed job was: Source 30% > Proxy 8% > Network 10% > Target 6%. The bottleneck on the successful retry was: Load: Source 0% > Proxy 7% > Network 0% > Target 3%
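To compare bottleneck statistics across several runs, the load string can be parsed mechanically. This is only a sketch in Python; the function names `parse_bottleneck` and `busiest` are made up for illustration, and the input is simply the text copied from the job statistics.

```python
import re


def parse_bottleneck(stats: str) -> dict[str, int]:
    """Turn 'Source 30% > Proxy 8% > ...' into {'Source': 30, 'Proxy': 8, ...}."""
    # A leading 'Load:' label is skipped because it is not followed by a percentage.
    return {name: int(pct) for name, pct in re.findall(r"(\w+)\s+(\d+)%", stats)}


def busiest(stats: str) -> str:
    """Return the component with the highest reported load."""
    load = parse_bottleneck(stats)
    return max(load, key=load.get)


failed = "Source 30% > Proxy 8% > Network 10% > Target 6%"
retry = "Load: Source 0% > Proxy 7% > Network 0% > Target 3%"

print(busiest(failed), parse_bottleneck(failed))
print(busiest(retry), parse_bottleneck(retry))
```

In both runs every component is far below saturation, which fits a job that stalls rather than one that is merely slow.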

 

   

 

Userlevel 7
Badge +6

Do you have any other solution taking snapshots? SRM or any replication solution? Maybe the job fails or gets stuck because another solution is taking a snapshot at the same time.

Userlevel 4
Badge

No snapshots or replications around. I have a subsequent backup copy job, but this job doesn't start until the restore point of the failed job has appeared...

Userlevel 7
Badge +10

Hi @Michail, are you using a gateway server on a deduplication appliance?
How many concurrent tasks have you set on the storage?

Are you using synthetic fulls?

Thanks 

 

Userlevel 4
Badge

Hi Andanet, I use a very simple and basic setup. I run Veeam on a virtual Windows Server 2022, which is installed on an ESXi 6.5 host. The only software on the server is Veeam B&R 12. This backup server is also my standard gateway. There is no duplication or deduplication.

I have no storage infrastructure on Veeam. I just have backup repositories, one proxy and jobs. I run 2 subsequent backup jobs, one subsequent backup copy job and one subsequent RDX full backup job. All jobs are scheduled one after the other.

Userlevel 4
Badge

After two days of all backup jobs succeeding, tonight one backup job got stuck at 20%, then at 21% in retry1 and now at 24% in retry2. All three attempts together have taken 12 hours so far… The job itself is an incremental backup job of a file server with an SQL server installed (nothing more). Data on the server has been almost unchanged since yesterday, so the throughput shows almost read-only at an average of 114 MB/s. Then the throughput suddenly drops from 185 MB/s to 0 KB/s… I am a bit clueless...

Userlevel 4
Badge

In retry3 the backup job has now got stuck at exactly the same point as in retry2 (573.3 GB processed). Could that mean that the data I am trying to back up is somehow corrupted? As I mentioned above, yesterday and the day before all backups ran smoothly and successfully…

Userlevel 4
Badge

Unfortunately I can't close this topic yet. I recently updated Veeam to version 12.0.0.1420, because I read that the new version solves some issues. Well, it didn't solve mine… I still have the problem that Veeam backup jobs get stuck completely at random. The new version seems to make the problem even worse, because the stuck job doesn't fall into retry. I really optimized the whole infrastructure: I upgraded the memory of the backup server and of the host being backed up. The backup job runs at night (1:30 am), so nobody interferes with it. Still it gets stuck… Is there any good advice out there? Veeam support doesn't do anything and closes all cases immediately...

Userlevel 4
Badge

Since I installed the new version 12.0.0.1420, I get this error message: “Processing xyserveryx Error: Transmission pipeline hanged, aborting process”. This message appears almost 5 hours after the process hung… So is there at least a way to shorten the time until the defective job aborts and falls into retry? I can't find any setting for this...

Userlevel 7
Badge +10

Hi @Michail, with this error message I don't think it is a Veeam issue.

Intermittent pipeline hangs shouldn't affect the backup or restore process… so this points to an issue on the network or in a VM guest OS.

But I need more info about the jobs. Previously you wrote “I run 2 subsequent backup jobs”… How many VMs are there in both jobs?

Do both jobs get stuck? And even on the same VM's backup?

What is the size of xyserveryx? How many disks does this VM have?

Have you checked whether all VSS writers are OK?

Check the datastore folder for that VM to see if there are unused files.

You can try the following backups, one at a time or together:

  1. disabling quiescence and CBT, as described in this reference link: https://helpcenter.veeam.com/docs/backup/vsphere/changed_block_tracking.html?ver=120
  2. using network transport mode

Let me know what happens because I’m curious now :P

Userlevel 4
Badge

Thanks for your quick reply, Andanet. Following several best-practice recommendations, I set up a single backup job for two VMs. Both VMs run on the same physical host and on one virtual disk (2.73 TB). One is a Win Server 2012 R2 with Exchange Server 2013 installed on it (300 GB). The other is a Win Server 2016 with SQL installed on it (2.3 TB). Only one VM gets stuck. In most cases the Exchange VM finishes successfully and the other gets stuck a little later. But I have also had the opposite case.

I checked the VSS writers and they all show “last error: no error”. But I will keep an eye on them (thanks for the advice).

I checked the datastore folder, but unfortunately I don't know how to identify unused files…

Quiescence has never been checked.

CBT has been checked by default; I never changed that.

For the next try, I changed the transport mode from “Virtual Appliance” to “Network”.

 

Userlevel 7
Badge +10

MSSQL Server has high I/O usage, so the issue may be caused by database read/write operations.

I hope you have more than one vdisk for this SQL Server and that no dump is being executed on its disks.

In a multi-disk scenario you can try excluding all disks except 0:0 and running the backup. After that you can add the disks back one at a time.

Userlevel 4
Badge

Dear Andanet,

I doubt that the SQL installation on the Win 2016 server interferes with the backup job, because the backup runs at night (starting at 1:30 am). The database, the guest machine and the host are idle at that time. No users or other machines access the server (except the backup server, of course).

Could you tell me how to identify unused files in the datastore folder? There are several log files in it. The other files seem to belong to the VM.

 

Userlevel 7
Badge +10

Checking for zombie files is not a simple operation… You can use RVTools or Veeam ONE, but it’s very important to read the VMware documentation to avoid damaging the VM. I don't recommend taking any action if you’re not confident.

 

 

Userlevel 4
Badge

Thanks again, Andanet. The first try in network transport mode was successful. The backup job even finished faster than in virtual appliance mode. I will report back after Saturday, when the next full backup job to the Linux NAS has run.

Userlevel 7
Badge +10

Hi @Michail, keep in mind that on a 1 Gb network a HotAdd proxy would be faster than NBD. But on a 10 Gb network, NBD is pretty fast for sure.

BUT

normally HotAdd is the best option for all operations… especially for restore operations.

I suggest verifying all requirements against your infrastructure. A quick checklist is here:

https://www.veeam.com/kb1054

 

Other tips for virtual proxy and hotadd usage:
http://www.veeam.com/kb1882?ad=in-text-link

 

Userlevel 4
Badge

Unfortunately I didn't have to wait until Sunday… Tonight the backup job hung again. It's a bit frustrating, because it's not apparent why the job gets stuck. It happens absolutely randomly, so I am really clueless about what to do.

New today is the error message:

I believe the message occurred because I killed the “Veeam Agent for Microsoft Windows” service. I don't really need the Agent. So is it advisable to uninstall the Agent?

The need for continuous manual intervention is really frustrating. I run a small business, so it's no big deal, but how do large companies handle such problems?

Is there a (built-in) way to shorten the duration of hanging jobs? Like falling into retry1 after 15 minutes of zero data processed or transported?

Is there a detailed log of the reason why a job gets stuck?
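To make the “15 minutes of zero data” idea concrete, here is a minimal Python sketch of such a stall check. It is not a Veeam setting or API: the `Sample` type, the `first_stall` function and the idea of feeding it periodic readings of processed data are all assumptions for illustration, and how those readings would be collected is left open.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    minute: int          # minutes since the job started
    processed_gb: float  # cumulative amount of data processed so far


def first_stall(samples, window_min=15):
    """Return the minute of the last progress before a stall of at least
    window_min minutes, or None if no such stall is found."""
    if not samples:
        return None
    last_progress = samples[0]
    for s in samples[1:]:
        if s.processed_gb > last_progress.processed_gb:
            last_progress = s  # data moved, so reset the stall clock
        elif s.minute - last_progress.minute >= window_min:
            return last_progress.minute
    return None


# Hypothetical readings taken every 5 minutes; progress stops at minute 10.
readings = [Sample(0, 0.0), Sample(5, 50.0), Sample(10, 100.0),
            Sample(15, 100.0), Sample(20, 100.0), Sample(25, 100.0)]
print(first_stall(readings))
```

A check like this could only flag the stall early; stopping and retrying the job would still have to happen through Veeam itself.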

 

 

 

Userlevel 7
Badge +17

Hmm, under normal conditions you don’t have to intervene manually; everything runs smoothly on its own.

Did you open a support call? This is definitely not a normal condition.

Userlevel 7
Badge +10

@Michail I can’t help any further and I agree with @JMeixner.

“An exception of type VEEAM.backup.agentprovider.agentclosedexception was thrown” is not a normal issue and can have a lot of causes.

For example, in https://www.veeam.com/kb2903 the solution is to disable IPv6 on all network interfaces of your proxies…

Only support can give you a definitive answer.

Userlevel 4
Badge

Unfortunately, support immediately closes all opened cases due to a personnel shortage…
