Veeam Backup Service completely stuck randomly


Userlevel 4
Badge

I learned a lot about Veeam during the last few weeks and I am kind of excited about this software. I resolved almost all problems, but there is still one issue I can't get a grip on. I have several backup jobs in a chain. No parallel jobs. I have a lightning-fast proxy (Veeam suggests a maximum of 24 parallel jobs).

My persisting problem is that jobs get stuck absolutely randomly. Sure, after hours the stuck job fails, goes into retry1 and finishes successfully. But that's not satisfying… Is there anything I missed about jobs getting stuck? Just for your information: jobs launch and run. They get stuck somewhere after 20%, 50% or 90%. So absolutely random… I noticed that the “Veeam Backup Service” gets completely frozen. This service can't be stopped by anything other than a server restart.


39 comments

Userlevel 7
Badge +10

For example, the log file for the hanging job says:

  Resource not ready: gateway server
Processing finished with errors at 20.03.2023 21:17:37

 

This would match my perception that the “backup service” got completely stuck.

 

Hi @Michail, are you using a gateway server on a deduplication appliance?
How many concurrent tasks have you set on the storage?

Are you using synthetic fulls?

Thanks 

 

Hi Andanet,

Is there anything special about the setup with Data Domain? I am experiencing drastic performance degradation for backups because of “Resource not ready: gateway server”.

Hello @Kamson

A deduplication appliance needs to have a gateway… normally this is a proxy that has the role of bridging the backup server and the repository.

https://helpcenter.veeam.com/docs/backup/vsphere/gateway_server.html?ver=120

Here you can find all the information you need.

Consider one small thing. Best practice is to assign a dedicated gateway server to every dedup appliance for best performance. It all depends on your infrastructure. If it's not a large environment, you can use a proxy server as a gateway, BUT compute resources (RAM and CPU) will be shared with the regular proxy role (data transport).

How many concurrent tasks should you set?

Check your Data Domain streams (model based) in relation to your concurrent backup/copy jobs and your RPO. After this you can calculate the right number of tasks.

One more piece of info: Veeam documents some limitations for DD used as a performance tier: https://helpcenter.veeam.com/docs/backup/vsphere/performance_tier_limitations.html?ver=120

Cheers  

Userlevel 7
Badge +8

Agreed. As a test, disable Defender, open all ports between Veeam and the infrastructure, and test. If it works, re-enable one thing at a time to find the issue. The list of ports in the Veeam documentation is a good place to start when creating firewall rules.

 

Try a VM without the endpoint agent. If you suspect it's the agent, can you disable it and test? If it's a production machine, can you clone it, rename the clone, remove the endpoint agent and test?

Userlevel 4
Badge

Thanks again Link State, I am aware of that and working on it. I am running Windows Defender and Sophos Endpoint Agent on the hanging VM. As a first step I added all exclusions to Defender. That didn't help. Next I will deactivate Defender completely. If that fails too, I will deactivate Sophos Endpoint Agent.

Userlevel 7
Badge +8

What I forgot: the problem is not only random. The jobs also get stuck out of nowhere. One second the throughput is 100 MB/s+ and the next second it drops to zero and stays there for hours, until retry1 starts.

I am really thankful for all the advice here. I will take a closer look at the ports topic.

 

Can anyone answer my question above: might Sophos firewalls or Sophos Endpoint Agents cause problems?

 

If you have the possibility, try bypassing the firewall, disabling the antivirus or adding exclusions:

KB1999: Antivirus Exclusions for Veeam Backup & Replication

regards

Userlevel 4
Badge

Thanks Link State, I learned a lot again :) I knew that specific ports are important, but I didn't know that Veeam doesn't open the needed ports by default… So I quickly checked the ports on the proxy. The result is:

Port 135 (TCP): Microsoft Windows RPC
Port 139 (TCP): Microsoft Windows netbios-ssn
Port 445 (TCP)
Port 2500 (TCP)
Port 2501 (TCP)
Port 2502 (TCP)
Port 2503 (TCP)
Port 2504 (TCP)
Port 3389 (TCP): Tunnel is Microsoft SChannel TLS: unknown service
Port 5040 (TCP)
Port 6160 (TCP)
Port 6162 (TCP)
Port 6190 (TCP)
Port 6290 (TCP)
Port 11731 (TCP)
Port 19500 (TCP): Tunnel is Microsoft SChannel TLS: Microsoft HTTPAPI httpd 2.0 SSDP/UPnP
Port 49672 (TCP)

From what I have read, the ports above should be sufficient? Or are there any ports I missed that should be opened on the proxy?
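A list like this only shows what is listening on the proxy itself. Whether the ports are reachable from the backup server through the firewall is a separate question. Below is a minimal sketch of such a check, using only the Python standard library. The port selection is an example taken from the list above, not the authoritative Veeam port list, and the host name in the usage note is a placeholder.

```python
import socket

# Example selection from the port list in this thread (an assumption, not the
# complete Veeam port list); compare against the official "Used Ports" page.
PORTS = [135, 445, 2500, 2501, 6160, 6162]


def check_port(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port to True (reachable) or False (closed, filtered or timed out)."""
    return {port: check_port(host, port, timeout) for port in ports}
```

Run from the backup server, `check_ports("proxy.example.local", PORTS)` would return a dictionary such as `{135: True, 445: True, ...}`. A port that is open locally on the proxy but shows `False` here would point at a firewall in between.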

 

 

 

Userlevel 7
Badge +8

Hi @Michail, these errors need to be investigated:

 

 

https://helpcenter.veeam.com/docs/backup/agents/used_ports.html?ver=120

Check whether all necessary TCP ports are open between the infrastructure components.

It seems that the proxy(?) server AW1 does not have all the ports open.

Greetings

 

Userlevel 4
Badge

Well, those logs are huge… As an example I pulled the logs for the following job, which got stuck and failed:

This is the link to the Backup Job log:

https://ipblaw-my.sharepoint.com/:u:/g/personal/bachinger_ipblaw_at/EUaXgM5aG8RBovC0yU_rISQBhJnGNJiUlgD5nKkQn-ra3w?e=IZmyOW

This is the link to the Task log:

https://ipblaw-my.sharepoint.com/:u:/g/personal/bachinger_ipblaw_at/EaYA92AMsy9NiV1955iRhgsBLzOdFAPJ9V55CxAwLzj2ew?e=1vJegQ

 

Userlevel 7
Badge +8

Well, my backup jobs still get stuck. Some people here wrote that firewalls can cause stuck jobs. We have a Sophos XG 85, and Sophos Endpoint Agents are installed on the hanging VM servers. Are there any known problems with Sophos firewalls or Endpoint Agents?

Can you provide the error logs as requested?
They are essential for an analysis if you cannot identify the problem yourself.

Regards.

 

Userlevel 7
Badge +8

Thanks for your quick reply Andanet. Following several best-practice recommendations, I set up a single backup job for two VMs. Both VMs run on the same physical host and one virtual disk (2,73 TB). One is a Win Server 2012 R2 with an Exchange 2013 Server installed on it (300 GB). The other is a Win Server 2016 with SQL installed on it (2,3 TB). Only one VM gets stuck. In most cases the Exchange VM finishes successfully and the other gets stuck a little later. But I have also had the opposite case.

I checked the VSS writers and they all show “last error: no error”. But I will keep an eye on them (thanks for the advice).

I checked the datastore folder, but unfortunately I don't know how to identify unused files…

Quiescence has never been checked.

CBT has been checked by default; I never changed that.

For the next try, I changed the transport mode from “virtual appliance” to “network”.

 

Hello @Michail, welcome to the community.
Can you attach all the parts of the logs that contain errors for the job that randomly fails?


The Veeam job logs can be found under "C:\ProgramData\Veeam\Backup\" as "Job.<JOBNAME>.Backup.log".

Job/backup/replica task log
The tasks represent the virtual machines in the job itself.  
These logs will contain all errors of a virtual machine that fails during the job.


Tools required
- Notepad++: the only downside is that it cannot handle logs >200MB very well
- Vim: for logs >200MB
- WinRAR: for unpacking logs as well as being able to search text within the logs prior to unpacking
- BareTail: mimics the 'tail' option typically found in Unix/Linux systems, where logs can be viewed in real-time
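As an alternative to opening a huge log in an editor, the interesting lines can be filtered out first. The following is a minimal sketch in Python that streams a log file line by line and keeps only lines containing one of a few substrings. The substrings are taken from messages quoted in this thread. The function name and the default pattern list are illustrative assumptions, and no particular Veeam log format is assumed.

```python
# Substrings quoted in this thread, matched case-insensitively; extend as needed.
PATTERNS = (
    "resource not ready",
    "transmission pipeline hanged",
    "agentclosedexception",
    "error",
)


def find_matches(log_path, patterns=PATTERNS, encoding="utf-8"):
    """Yield (line_number, line) for every line containing any of the patterns."""
    lowered = tuple(p.lower() for p in patterns)
    # Iterating over the file handle reads one line at a time, so the whole
    # log never has to fit into memory.
    with open(log_path, "r", encoding=encoding, errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            text = line.rstrip("\n")
            haystack = text.lower()
            if any(p in haystack for p in lowered):
                yield number, text
```

Looping over `find_matches(path_to_log)` and printing each pair gives a short list of line numbers to jump to in Notepad++ or Vim. The encoding is a guess; `errors="replace"` keeps the scan going if it is wrong.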

thanks

Userlevel 7
Badge +17

Mhh, perhaps some connection problems?

I have one customer environment where a firewall gets overloaded one night of the week, and then some (not all) DNS queries do not get an answer in time. This causes some backup jobs, or parts of them, to fail… So every single component in your environment can influence your backup jobs.
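Slow or failing DNS answers like these are easy to miss. A minimal sketch for timing lookups from the backup server or proxy follows, using only the Python standard library. The function names and the one-second threshold are assumptions chosen for illustration. The host names to test would be those of your own vCenter, hosts, proxies and repositories.

```python
import socket
import time


def time_dns_lookup(hostname):
    """Resolve hostname once and return (elapsed_seconds, ok)."""
    start = time.perf_counter()
    try:
        socket.getaddrinfo(hostname, None)
        ok = True
    except socket.gaierror:
        # The name could not be resolved (or the resolver gave up).
        ok = False
    return time.perf_counter() - start, ok


def slow_lookups(hostnames, threshold_seconds=1.0):
    """Return (name, seconds, ok) for lookups that failed or exceeded the threshold."""
    slow = []
    for name in hostnames:
        elapsed, ok = time_dns_lookup(name)
        if not ok or elapsed > threshold_seconds:
            slow.append((name, round(elapsed, 3), ok))
    return slow
```

Run repeatedly during the backup window (for example from a scheduled task), an empty result from `slow_lookups([...])` suggests name resolution is not the bottleneck at that moment. Note that the operating system may cache answers, so a fast lookup does not prove the DNS server itself responded.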

Userlevel 4
Badge

Just for your information: the retry1 of the hanging job was successful without any problems. As you can see, the retry1 took place from 09:10 am to 12:56 pm. At that time the backed-up servers are under much more load than at night.

 

Userlevel 4
Badge

Unfortunately support immediately closes all opened cases due to personnel shortage….

Userlevel 7
Badge +10

Mhh, under normal conditions you don't have to intervene manually; everything runs smoothly on its own.

Did you open a support call? This is definitely not a normal condition.

@Michail I can't help more and agree with @JMeixner

An exception of type VEEAM.backup.agentprovider.agentclosedexception being thrown is not a normal issue and can have many causes.

For example, in https://www.veeam.com/kb2903 the solution is to disable IPv6 on all network interfaces of your proxies.

Only support can help you definitively.

Userlevel 4
Badge

Unfortunately I didn't have to wait until Sunday… Tonight the backup job hung again. It's a bit frustrating, because it's not apparent why the job gets stuck. It happens absolutely randomly, so I am really clueless about what to do.

New today is the error message:

I believe the message occurred because I killed the “Veeam Agent for Microsoft Windows” service. I don't really need the Agent. So is it advisable to uninstall the Agent?

The need for continuous manual intervention is really frustrating. I run a small business, so it's no big deal, but how do large companies handle such problems?

Is there a (built-in) way to shorten the duration of hanging jobs? For example, falling into retry1 after 15 minutes of zero data processed or transported?

Is there a detailed log of the reason why a job gets stuck?
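The "15 minutes of zero data processed" idea can at least be expressed as external monitoring logic. The sketch below is not a Veeam feature and does not call any Veeam API. It only shows how a stall could be detected from a series of progress samples. How those samples are collected is left open, and all names here are hypothetical.

```python
from datetime import datetime, timedelta


def find_stall(samples, max_idle=timedelta(minutes=15)):
    """Return the time progress last changed if it has been flat for at least
    max_idle, otherwise None.

    samples: list of (timestamp, bytes_processed) tuples in chronological order.
    """
    if not samples:
        return None
    last_change_time, last_value = samples[0]
    for timestamp, value in samples[1:]:
        if value != last_value:
            # Progress moved, so remember when and to what value.
            last_change_time, last_value = timestamp, value
    newest_time = samples[-1][0]
    if newest_time - last_change_time >= max_idle:
        return last_change_time
    return None
```

With samples taken every five minutes, a job whose counter stops changing at 01:55 would be reported as stalled once a sample from 02:10 or later arrives. What to do then (send an alert, or stop the job so that the retry kicks in earlier) is a separate decision.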

 

 

 

Userlevel 7
Badge +10

Thanks again Andanet, the first try in network transport mode was successful. The backup job even finished faster than in virtual appliance mode. I will report back after Saturday, when the next full backup job to the Linux NAS has run.

Hi @Michail, you must consider that with a 1GB network a HotAdd proxy would be faster than NBD. But with a 10GB network, NBD is pretty fast for sure.

BUT

normally HotAdd is the best option for all operations…. especially regarding restore operations.

I suggest verifying all requirements against your infrastructure. A quick checklist is here:

https://www.veeam.com/kb1054

 

Other tips for virtual proxy and hotadd usage:
http://www.veeam.com/kb1882?ad=in-text-link

 

Userlevel 7
Badge +10

Checking for zombie files is not a simple operation… you can use RVTools or Veeam ONE, but it's very important to read the VMware documentation to avoid damaging the VM. I don't recommend taking any action if you're not confident.

 

 

Userlevel 4
Badge

Dear Andanet,

I doubt that the SQL installation on the Win 2016 Server interferes with the backup job, because the backup runs at night (starting at 1:30 am). The database, the guest machine and the host are idle at that time. No users or other machines access the server (except the backup server, of course).

Could you tell me how to identify unused files in the datastore folder? There are several log files in it. The other files seem to belong to the VM.

 

Userlevel 7
Badge +10

The other is a Win Server 2016 with SQL installed on it (2,3 TB). Only one VM gets stuck. In most cases the Exchange VM finishes successfully and the other gets stuck a little later. But I also had the opposite case already.

 

MSSQL Server has high I/O usage, so the issue could be caused by database read/write operations.

I hope you have more than 1 vdisk for this SQL Server and that you do not run a dump onto its disks.

In a multi-disk scenario you can try to exclude all disks except 0:0 and run the backup. Afterwards you can add one more disk at a time.

Userlevel 7
Badge +10

Since I installed the new version 12.0.0.1420, I get this error message: “Processing xyserveryx Error: Transmission pipeline hanged, aborting process”. This message appears almost 5 hours after the process hung… So is there at least a possibility to shorten the time until the defective job aborts and falls into retry? I can't find any such setting...

Hi @Michail, with this error message I think it is not a Veeam issue.

Having intermittent pipeline hangs shouldn't affect the backup or restore process… so this means an issue on the network or in a VM guest OS.

But I need more info about the jobs. Previously you wrote “I run 2 subsequent backup jobs”… how many VMs are there in both jobs?

Do both jobs get stuck? And always on the same VM backup?

What is the size of xyserveryx? How many disks does this VM have?

Have you checked whether all VSS writers are OK?

Check the datastore folder for that VM for unused files.

You can try to back them up one by one or together:

  1. removing quiescence and CBT as in this reference link: https://helpcenter.veeam.com/docs/backup/vsphere/changed_block_tracking.html?ver=120
  2. using network transport mode

Let me know what happens because I'm curious now :P
