Question

Veeam Backup Service completely stuck randomly


Userlevel 4
Badge

I learned a lot about Veeam during the last few weeks, and I am kind of excited about this software. I have resolved almost all problems, but there is still one issue I can't get a grip on. I have several backup jobs in a chain, with no parallel jobs. I have a lightning-fast proxy (Veeam suggests a maximum of 24 parallel jobs).

My persistent problem is that jobs get stuck absolutely randomly. Sure, after hours the stuck job fails, goes into retry1 and finishes successfully. But that's not satisfying… Is there anything I missed about jobs getting stuck? Just for your information: jobs launch and run. They get stuck somewhere after 20%, 50% or 90%, so absolutely random… I noticed that the “Veeam Backup Service” freezes completely. This service can't be stopped by anything other than a server restart.


39 comments

Userlevel 4
Badge

Just for your information: the retry1 of the hanging job was successful without any problems. As you can see, the retry1 took place from 09:10 am to 12:56 pm. At that time the backed-up servers have much more load than during the night.

 

Userlevel 7
Badge +17

Mhh, perhaps some connection problems?

I have one customer environment where a firewall gets overloaded one night of the week, and then some (not all) DNS queries do not get an answer in time. This causes some backup jobs, or parts of them, to fail… So every single component in your environment can influence your backup jobs.

Userlevel 7
Badge +7

Thanks for your quick reply Andanet. Following several best-practice recommendations, I set up a single backup job for two VMs. Both VMs run on the same physical host and on one virtual disk (2,73 TB). One is a Win Server 2012 R2 with an Exchange 2013 Server installed on it (300GB). The other is a Win Server 2016 with SQL installed on it (2,3 TB). Only one VM gets stuck. In most cases the Exchange VM finishes successfully and the other gets stuck a little later. But I have also had the opposite case.

I checked the VSS writers and they all show “last error: no error”. But I will keep an eye on them (thanks for the advice).
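For anyone who wants to keep an eye on the writers without reading the list by hand, here is a minimal sketch that parses the text printed by `vssadmin list writers` and flags any writer whose last error is not "No error". The function names and the sample text are made up for illustration, and the second writer's failure is invented to show what a hit looks like.

```python
import re

# Illustrative sample only; the failing writer is invented for the example.
SAMPLE = """\
Writer name: 'Task Scheduler Writer'
   Writer Id: {d61d61c8-d73a-4eee-8cdd-f6f9786b7124}
   State: [1] Stable
   Last error: No error

Writer name: 'SqlServerWriter'
   State: [7] Failed
   Last error: Timed out
"""

def parse_vss_writers(text):
    """Return one dict (name, state, last_error) per writer block."""
    writers = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"Writer name:\s*'(.*)'", line)
        if match:
            current = {"name": match.group(1), "state": None, "last_error": None}
            writers.append(current)
        elif current is not None and line.startswith("State:"):
            current["state"] = line.split(":", 1)[1].strip()
        elif current is not None and line.startswith("Last error:"):
            current["last_error"] = line.split(":", 1)[1].strip()
    return writers

def unhealthy_writers(writers):
    """Writers whose last error is missing or anything other than 'No error'."""
    return [w for w in writers if (w["last_error"] or "").lower() != "no error"]

for writer in unhealthy_writers(parse_vss_writers(SAMPLE)):
    print(writer["name"], "|", writer["state"], "|", writer["last_error"])
```

On the guest itself the text would come from running `vssadmin list writers` in an elevated prompt, for example right after a job has hung.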

I checked the datastore folder, but unfortunately I don't know how to identify unused files…

Quiescence has never been checked.

CBT is checked by default; I never changed that.

For the next try, I changed the transport mode from “virtual appliance” to “network”.

 

Hello @Michail, welcome to the community.
Can you attach all the parts of the logs that contain errors for the job that randomly fails?


The Veeam job logs can be found under "C:\ProgramData\Veeam\Backup\" as "Job.<JOBNAME>.Backup.log".

Job/backup/replica task log
The tasks represent the virtual machines in the job itself.  
These logs will contain all errors of a virtual machine that fails during the job.


Tools required
- Notepad++: the only downside is that it cannot handle logs >200MB very well
- Vim: for logs >200MB
- WinRAR: for unpacking logs, and for searching text within the logs prior to unpacking
- BareTail: mimics the 'tail' command typically found in Unix/Linux systems, where logs can be viewed in real time
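If none of these tools is at hand, a few lines of Python can also pull the interesting lines out of a very large log, because the file is read line by line instead of being loaded as a whole. This is only a sketch: the function name is made up, the search strings are examples, and it assumes the log is UTF-8-compatible text (undecodable bytes are replaced rather than raising an error).

```python
def find_matching_lines(log_path, needles, max_hits=50):
    """Stream a log file and return (line_number, text) for matching lines.

    A line matches if it contains any of the needles, ignoring case.
    Reading stops once max_hits matches have been collected.
    """
    lowered_needles = [needle.lower() for needle in needles]
    hits = []
    with open(log_path, "r", encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            lowered = line.lower()
            if any(needle in lowered for needle in lowered_needles):
                hits.append((number, line.rstrip("\r\n")))
                if len(hits) >= max_hits:
                    break
    return hits

# Example call (the path is a placeholder, not a real job name):
# for number, text in find_matching_lines(r"C:\path\to\Job.<JOBNAME>.Backup.log",
#                                         ["error", "resource not ready"]):
#     print(number, text)
```

The line numbers make it easy to jump to the same spot later in an editor and read the surrounding context.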

thanks

Userlevel 4
Badge

Well, my backup jobs still get stuck. Some people here wrote that firewalls can cause stuck jobs. We have a Sophos XG 85, and Sophos Endpoint Agents are installed on the hanging VM servers. Are there any known problems with Sophos firewalls or Endpoint Agents?

Userlevel 7
Badge +7

Can you provide the error logs as requested?
They are essential for an analysis if you cannot pin down the problem yourself.

Regards.

 

Userlevel 4
Badge

Well, those logs are huge… As an example I took the logs for the following job, which got stuck and failed:

This is the link to the Backup Job log:

https://ipblaw-my.sharepoint.com/:u:/g/personal/bachinger_ipblaw_at/EUaXgM5aG8RBovC0yU_rISQBhJnGNJiUlgD5nKkQn-ra3w?e=IZmyOW

This is the link to the Task log:

https://ipblaw-my.sharepoint.com/:u:/g/personal/bachinger_ipblaw_at/EaYA92AMsy9NiV1955iRhgsBLzOdFAPJ9V55CxAwLzj2ew?e=1vJegQ

 

Userlevel 7
Badge +7

Hi @Michail, these errors need to be investigated:

 

 

https://helpcenter.veeam.com/docs/backup/agents/used_ports.html?ver=120

Check whether all necessary TCP ports are open between the infrastructure components.

It seems that the proxy? server AW1 does not have all the ports open.
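A quick way to test this from the backup server's side is a plain TCP connect to each port. The sketch below is only an illustration: the function name is made up, and which ports are actually required has to be taken from the Veeam documentation linked above. A successful connect only shows that something is listening and that no firewall on the path blocks it.

```python
import socket

def check_tcp_ports(host, ports, timeout=3.0):
    """Try a TCP connect to each port; return {port: True if it succeeded}."""
    results = {}
    for port in ports:
        try:
            # create_connection resolves the name and connects; the socket
            # is closed again as soon as the with-block ends.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts and DNS failures.
            results[port] = False
    return results

# Example call, to be run on the machine that has to reach the proxy.
# "AW1" is the proxy name from the log; the port list is illustrative.
# for port, is_open in check_tcp_ports("AW1", [6160, 6162, 2500, 2501]).items():
#     print(port, "open" if is_open else "NOT reachable")
```

It matters where this runs: a port that answers when tested on the proxy itself can still be blocked for the backup server or the repository.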

greetings

 

Userlevel 4
Badge

Thanks Link State, I learned a lot again :) I knew that specific ports are important, but I didn't know that Veeam doesn't open the needed ports by default… So I quickly checked the ports on the proxy. The result is:

Port 135 (TCP): Microsoft Windows RPC
Port 139 (TCP): Microsoft Windows netbios-ssn
Port 445 (TCP)
Port 2500 (TCP)
Port 2501 (TCP)
Port 2502 (TCP)
Port 2503 (TCP)
Port 2504 (TCP)
Port 3389 (TCP): Tunnel is Microsoft SChannel TLS: unknown service
Port 5040 (TCP)
Port 6160 (TCP)
Port 6162 (TCP)
Port 6190 (TCP)
Port 6290 (TCP)
Port 11731 (TCP)
Port 19500 (TCP): Tunnel is Microsoft SChannel TLS: Microsoft HTTPAPI httpd 2.0 SSDP/UPnP
Port 49672 (TCP)

From what I have read, the ports above should be sufficient? Or are there any ports I missed that should be opened on the proxy?

 

 

 

Userlevel 4
Badge

One thing I forgot: the problem is not only random, the jobs also get stuck out of nowhere. One second the throughput is 100 MB/s+, and the next second it drops to zero and stays there for hours, until retry1 starts.

I am really thankful for all the advice here. I will take a closer look at the ports topic.

 

Can anyone answer my question above, whether Sophos firewalls or Sophos Endpoint Agents might cause problems?

Userlevel 7
Badge +7

If possible, try bypassing the firewall, and disable the antivirus or add exclusions:

KB1999: Antivirus Exclusions for Veeam Backup & Replication

regards

Userlevel 4
Badge

Thanks again Link State, I am aware of that and working on it. I am running Windows Defender and the Sophos Endpoint Agent on the hanging VM. As a first step I added all exclusions to Defender. That didn't help. Next I will deactivate Defender completely. If that fails too, I will deactivate the Sophos Endpoint Agent.

Userlevel 7
Badge +8

Agreed. As a test, disable Defender, open all ports between Veeam and the infrastructure, and run the job. If it works, re-enable one thing at a time until you find the issue. The list of ports in the Veeam documentation is a good place to start when creating firewall rules.

 

Try a VM without the endpoint agent. If you suspect it's the agent, can you disable it and test? If it's a production machine, can you clone it, rename the clone, remove the endpoint agent and test with that?

For example, the log file for the hanging job says:

  Resource not ready: gateway server
Processing finished with errors at 20.03.2023 21:17:37

 

This would match my impression that the “backup service” got completely stuck.

 

Hi @Michail, are you using a gateway server with a deduplication appliance?
How many concurrent tasks have you set on the storage?

Are you using synthetic fulls?

Thanks 

 

Hi Andanet,

Is there anything special about the setup with Data Domain? I am experiencing drastic performance degradation for backups because of “Resource not ready: gateway server”.

Userlevel 7
Badge +10

Hello @Kamson

A deduplication appliance needs a gateway server… normally this is a proxy that takes the role of bridging the backup server and the repository.

https://helpcenter.veeam.com/docs/backup/vsphere/gateway_server.html?ver=120

Here you can find all the information you need.

One small thing to consider: best practice is to assign a dedicated gateway server to every dedup appliance for the best performance. It all depends on your infrastructure. If it's not a large environment you can use a proxy server as a gateway, BUT its compute resources (RAM and CPU) will be shared with the regular proxy role (data transport).

How many concurrent tasks should you set?

Check the stream limits of your Data Domain (they depend on the model) and relate them to your concurrent backup/copy jobs and your RPO. After that you can calculate the right number of tasks.
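As a rough illustration of that calculation, here is a back-of-the-envelope sketch. Everything in it is an assumption: the function name is made up, the numbers are placeholders, and the idea that one task consumes a fixed number of write streams is a simplification. Take the real stream limit for your model from the appliance vendor's documentation.

```python
def max_concurrent_tasks(stream_limit, reserved_streams=0, streams_per_task=1):
    """Rough upper bound for the repository's concurrent task setting.

    stream_limit     -- write streams the appliance model supports (vendor docs)
    reserved_streams -- streams kept free for other workloads, e.g. copy jobs
    streams_per_task -- assumed streams one task consumes (simplification)
    """
    usable = stream_limit - reserved_streams
    if usable <= 0 or streams_per_task <= 0:
        return 0
    return usable // streams_per_task

# Placeholder numbers, not taken from any real model:
print(max_concurrent_tasks(stream_limit=60, reserved_streams=20))  # 40
```

Starting below the computed value and raising it while watching for “Resource not ready” waits is probably safer than starting at the limit.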

One more thing… Veeam documents some limitations when a DD is used as a performance tier: https://helpcenter.veeam.com/docs/backup/vsphere/performance_tier_limitations.html?ver=120 

Cheers  
