I learned a lot about Veeam during the last few weeks and I am kind of excited about this software. I resolved almost all problems, but there is still one issue I can’t get a grip on. I have several backup jobs in a chain. No parallel jobs. I have a lightning-fast proxy (Veeam suggests a maximum of 24 parallel jobs).
My persisting problem is that jobs get stuck absolutely randomly. Sure, after hours the stuck job fails, goes into retry1 and finishes successfully. But that’s not satisfying… Is there anything I missed about jobs getting stuck? Just for your information: jobs launch and run. They get stuck somewhere after 20%, 50% or 90%. So absolutely random… I noticed that the “veeam backup service” freezes completely. This service can’t be stopped by anything other than a server restart.
Just for your information: the retry1 of the hanging job was successful without any problems. As you can see, the retry1 took place from 09:10 am to 12:56 pm. At that time of day the backed-up servers are under much more load than at night.
Mhh, perhaps some connection problems?
I have one customer environment where a firewall gets overloaded on one night of the week, and then some (not all) DNS queries do not get an answer in time. This causes some backup jobs, or parts of them, to fail… So every single part of your environment can influence your backup jobs.
Thanks for your quick reply, Andanet. Following several best-practice recommendations, I set up a single backup job for two VMs. Both VMs run on the same physical host and one virtual disk (2.73 TB). One is a Win Server 2012 R2 with an Exchange 2013 Server installed on it (300 GB). The other is a Win Server 2016 with SQL installed on it (2.3 TB). Only one VM gets stuck. In most cases the Exchange VM finishes successfully and the other gets stuck a little later. But I have also had the opposite case.
I checked the VSS writers and they all show “last error: no error”. But I will keep an eye on them (thanks for the advice).
I checked the datastore folder, but unfortunately I don’t know how to identify unused files…
Quiescence has never been checked.
CBT has been checked by default; I never changed that.
For the next try, I changed the transport mode from “virtual appliance” to “network”.
Hello @Michail, welcome to the community. Can you attach all the parts of the logs that contain errors for the job that randomly fails?
The Veeam job logs can be found under "C:\ProgramData\Veeam\Backup\", in a file named "Job.<JOBNAME>.Backup.log".
Job/backup/replica task log: the tasks represent the virtual machines in the job itself. These logs will contain all errors of a virtual machine that fails during the job.
Tools required:
- Notepad++: the only downside is that it cannot handle logs >200MB very well
- Vim: for logs >200MB
- WinRAR: for unpacking logs as well as being able to search text within the logs prior to unpacking
- BareTail: mimics the 'tail' option typically found in Unix/Linux systems, where logs can be viewed in real time
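If the logs are too large to open comfortably in an editor, a small script can pull out just the suspicious lines. The following is only a minimal sketch: the keyword list, the function name `iter_matching_lines` and the UTF-8 encoding are assumptions, not something Veeam prescribes, so adjust them to whatever your logs actually contain.

```python
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def iter_matching_lines(
    log_path: Path,
    keywords: Iterable[str] = ("error", "failed", "not ready"),
) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, line) for every line containing a keyword.

    The keywords are hypothetical examples; matching is case-insensitive.
    """
    needles = [keyword.lower() for keyword in keywords]
    # Read line by line so a log of several hundred MB never has to fit in memory.
    # Assumption: the log is UTF-8; undecodable bytes are replaced, not fatal.
    with open(log_path, "r", encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            lowered = line.lower()
            if any(needle in lowered for needle in needles):
                yield number, line.rstrip("\n")
```

Calling `iter_matching_lines(Path(...))` with the path of a job log and printing each `(number, line)` pair gives a short list of lines that is much easier to attach to a post than the full log.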
thanks
Well, my backup jobs still get stuck. Some people here wrote that firewalls can cause stuck jobs. We have a Sophos XG 85, and Sophos Endpoint Agents are installed on the hanging VM servers. Are there any known problems with Sophos firewalls or Endpoint Agents?
Can you provide the error logs as requested? They are essential for analyzing a problem that you cannot pin down yourself.
Regards.
Well those logs are huge… As an example I took out the logs for the following job, which got stuck and failed:
Check whether all necessary TCP ports are open between the infrastructure components.
It seems that the (proxy?) server AW1 does not have all the ports open.
greetings
Thanks Link State, I learned a lot again. I knew that specific ports are important, but I didn’t know that Veeam doesn’t open the needed ports by default… So I quickly checked the ports on the proxy. The result is:
Port 135 (TCP): Microsoft Windows RPC
Port 139 (TCP): Microsoft Windows netbios-ssn
Port 445 (TCP)
Port 2500 (TCP)
Port 2501 (TCP)
Port 2502 (TCP)
Port 2503 (TCP)
Port 2504 (TCP)
Port 3389 (TCP): Tunnel is Microsoft SChannel TLS: unknown service
Port 5040 (TCP)
Port 6160 (TCP)
Port 6162 (TCP)
Port 6190 (TCP)
Port 6290 (TCP)
Port 11731 (TCP)
Port 19500 (TCP): Tunnel is Microsoft SChannel TLS: Microsoft HTTPAPI httpd 2.0 SSDP/UPnP
Port 49672 (TCP)
From what I have read, the ports above should be sufficient? Or are there any ports I missed that should be opened on the proxy?
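A scan on the proxy itself only shows which ports are listening there. Whether they are reachable from the other components can be checked with a short script run on, for example, the backup server. This is just a sketch: the function name `check_tcp_ports` is made up for illustration, and the list of ports to test has to come from the Veeam documentation for your setup, not from this example.

```python
import socket
from typing import Dict, Iterable


def check_tcp_ports(host: str, ports: Iterable[int], timeout: float = 2.0) -> Dict[int, bool]:
    """Return {port: True/False} depending on whether a TCP connect succeeds."""
    results: Dict[int, bool] = {}
    for port in ports:
        try:
            # A completed TCP handshake means something is listening on the
            # port and nothing in between (firewall, endpoint agent) blocks it.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts and name resolution errors.
            results[port] = False
    return results
```

For example, `check_tcp_ports("AW1", [6160, 6162, 2500])` would test three of the ports listed above against the host name mentioned earlier in this thread. A port reported as `False` is either closed on the target or filtered somewhere on the way.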
One thing I forgot: the problem is not only random, the jobs also get stuck out of nowhere. One second the throughput is 100 MB/s+ and the next second it drops to zero and stays there for hours, until retry1 starts.
I am really thankful for all the advice here. I will take a closer look at the ports topic.
Can anyone answer my question above, whether Sophos firewalls or Sophos Endpoint Agents might cause problems?
If you have the possibility, try bypassing the firewall, disabling the antivirus or adding an exception.
Thanks again Link State, I am aware of that and already working on it. I have Windows Defender and the Sophos Endpoint Agent running on the hanging VM. As a first step I added all exclusions to Defender. That didn’t help. Next I will deactivate Defender completely. If that fails too, I will deactivate the Sophos Endpoint Agent.
Agreed. As a test, disable Defender, open all ports between Veeam and the infrastructure, and run the job. If it works, re-enable one thing at a time until you find the issue. The list of ports in the Veeam documentation is a good place to start when creating firewall rules.
Try a VM without the endpoint agent. If you suspect it’s the agent, can you disable it and test? If it’s a production machine, can you clone it, rename the clone, remove the endpoint agent and test?
For example, the log file for the hanging job says:
Resource not ready: gateway server Processing finished with errors at 20.03.2023 21:17:37
This would match my impression that the “backup service” got completely stuck.
Hi @Michail, are you using a gateway server with a deduplication appliance? How many concurrent tasks have you set on the storage?
Are you using synthetic full?
Thanks
Hi Andanet,
Is there anything special to consider in a setup with Data Domain? I am experiencing drastic backup performance degradation because of “Resource not ready: gateway server”.
Hello @Kamson,
a deduplication appliance needs to have a gateway… normally this is a proxy that has the role of bridging the backup server and the repository.
One small thing to consider: best practice is to assign a dedicated gateway server to every dedup appliance for best performance. It all depends on your infrastructure. If it’s not a large environment, you can use a proxy server as a gateway, BUT its compute resources (RAM and CPU) will be shared with the regular proxy role (data transport).
How many concurrent tasks should you set?
Check how many streams your Data Domain supports (this depends on the model) and relate that to your concurrent backup/copy jobs and your RPO. After that you can calculate the right number of tasks.
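As a rough illustration of that calculation, here is a small sketch. It rests on assumptions that are not from this thread: that one concurrent task occupies one stream, that a fixed number of streams is held back for other work such as copy jobs, and that the rest is split evenly across the gateways. The function name and all numbers are hypothetical; the real stream limit has to be looked up for your appliance model.

```python
def tasks_per_gateway(stream_limit: int, reserved_streams: int, gateway_count: int) -> int:
    """Split the usable streams evenly across the gateway servers.

    Assumption (not from any vendor documentation): one task uses one stream.
    """
    if gateway_count < 1:
        raise ValueError("gateway_count must be at least 1")
    usable_streams = stream_limit - reserved_streams
    if usable_streams < 1:
        raise ValueError("no streams left after the reservation")
    # Integer division rounds down so the total never exceeds the limit;
    # every gateway still gets at least one task.
    return max(1, usable_streams // gateway_count)


# Hypothetical numbers: 90 streams, 18 held back for copy jobs, 2 gateways.
print(tasks_per_gateway(stream_limit=90, reserved_streams=18, gateway_count=2))  # prints 36
```

Starting below the calculated value and raising it while watching for “Resource not ready” messages is probably safer than starting at the limit.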