
Hi,

 

Has anyone experienced VBR spinning up a huge number of instances of a process and consuming all available RPC ports?

 

I suspect it’s AV related, since the customer changed AV around the same time, but we’ve got exclusion rules in place for the relevant directories and binaries.

 

I’ve seen VBR suffer port exhaustion exactly once before: that was a VCC installation where the firewall was half-closing sessions, and Veeam wasn’t detecting this and freeing up the ports afterwards.

 

This scenario is different, however.

The VBR server is used only for job orchestration (all backup and repo roles are on physical appliances; the proxy and repo roles have been uninstalled from VBR). The server had been up for less than six hours yesterday, and it had spawned 18,000 threads of VeeamTransportSvc, all looping back to itself. The server then ran out of available ports and/or buffer space, so normal connections such as backup repository availability checks started failing, causing VBR to mark the repositories as offline, which in turn cascaded into jobs failing automatically because repository extents were unavailable.
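For anyone wanting to see the same thing on their own server, something along these lines should do it. This is only a rough sketch, assuming Python 3 and the psutil package are available on the VBR box and that it’s run elevated; it groups open TCP connections by owning process and prints the Windows dynamic (ephemeral) port range for comparison.

```python
# Rough sketch: group open TCP connections by owning process and compare the
# total against the Windows dynamic (ephemeral) port range.
# Assumes Python 3 + psutil; run elevated so connection owners are visible.
import collections
import subprocess

import psutil

conns = psutil.net_connections(kind="tcp")
counts = collections.Counter()
for conn in conns:
    try:
        name = psutil.Process(conn.pid).name() if conn.pid else "unknown"
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        name = "unavailable"
    counts[name] += 1

# Top offenders by connection count.
for name, count in counts.most_common(10):
    print(f"{count:6d}  {name}")
print(f"total TCP connections: {len(conns)}")

# Dynamic port range (defaults to 49152-65535, roughly 16k ports); if the
# total above approaches "Number of Ports", ephemeral ports are the bottleneck.
print(subprocess.run(["netsh", "int", "ipv4", "show", "dynamicport", "tcp"],
                     capture_output=True, text=True).stdout)
```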

 

The physical servers are always fine; it’s only happening on the VBR instance, which is on the latest build of v11a btw. Veeam does eventually close the threads down: over a 30-40 minute period I watched it go from 18k threads down to about 2k across the entire OS. I’ve been capturing netstat -abno outputs, and normally the file is around 1k lines long, but when this happens it’s about 11k lines long.
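To make those netstat snapshots easier to compare, a quick sketch like this (again just illustrative, assuming the netstat -abno output has been redirected to text files) tallies the TCP lines by connection state, so a jump from ~1k to ~11k lines shows up as a specific state piling up rather than just a bigger file:

```python
# Quick sketch: tally TCP connections by state in saved "netstat -abno" dumps,
# so snapshots taken at different times can be compared state-for-state.
import collections
import sys

def tally(path):
    states = collections.Counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            parts = line.split()
            # TCP lines look like: TCP  <local>  <remote>  <state>  <pid>
            if len(parts) >= 5 and parts[0] == "TCP":
                states[parts[3]] += 1
    return states

if __name__ == "__main__":
    for path in sys.argv[1:]:  # e.g. netstat_0900.txt netstat_1400.txt
        states = tally(path)
        print(path, dict(states), "total:", sum(states.values()))
```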

 

So yeah, has anyone experienced anything like this before? I've got a P1 in with Veeam support but I’m also trying to explore any events that could be related.

 

BTW, on the backup job front, hardly anything was running: a couple of object offloads (the VBR server doesn’t perform any of the offloading itself) and a couple of NAS backup jobs, for which, again, the repo and proxy roles sit on the physical servers.

 

Thanks in advance :)

Never. My continued issues with Veeam are related to hotadd and not releasing disks from the hotadd proxies. Will be interested to hear the cause & resolution Michael.


Very strange, never happened.. 🤔


Never seen this before, especially with all AV exclusions in place. Any option to upgrade to v12 and see if the problem remains?


A v12 upgrade isn’t on the cards at present and won’t be until later this year. I strongly suspect it’s AV related, but I just don’t have the evidence yet...


Could you try disabling AV and running a few jobs to test? I once noticed a folder missing from the exclusions which caused some higher CPU usage, but I didn’t need to dig into it too deeply.


It’s the licensed version of Windows Defender, so it might be tricky. It’s on my troubleshooting roadmap, but so far everything seems to be aligned.


I’ve not seen this that I can recall but am very interested in the resolution.


Oh boy, this was a fun one…

 

So, the problem came down to two issues.

Firstly, a lack of AV exclusions, including additional exclusions for the Veeam services so that the sub-processes they spawn wouldn’t be impacted by AV scanning. Adding these was a huge pressure release on the server (there’s a rough check for this sketched after these two points).

Secondly, someone had messed with multiple repository extents’ maximum concurrency. I need to be deliberately vague on this at present, but I’d like to remind people that if you have excessively high or non-existent backup repository concurrent task limits, then whilst your primary backups might be constrained by your proxy, backup copy jobs won’t have that limitation: they can (and will) consume those free task sessions once you’ve got enough individual backups to copy.
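On the first point, here’s a rough way to sanity-check what Defender actually has excluded. It’s just a sketch: it shells out to PowerShell’s Get-MpPreference, and the "expected" Veeam path and process names below are placeholders rather than a complete list (take the authoritative list from Veeam’s antivirus exclusions KB).

```python
# Sketch: dump Windows Defender path/process exclusions via Get-MpPreference
# and flag expected Veeam entries that are missing. The expected entries below
# are placeholders - use the full list from Veeam's antivirus exclusions KB.
import subprocess

def defender_exclusions(prop):
    out = subprocess.run(
        ["powershell", "-NoProfile", "-Command", f"(Get-MpPreference).{prop}"],
        capture_output=True, text=True)
    return {line.strip().lower() for line in out.stdout.splitlines() if line.strip()}

paths = defender_exclusions("ExclusionPath")
procs = defender_exclusions("ExclusionProcess")

expected_paths = [r"C:\Program Files\Veeam"]   # placeholder install path
expected_procs = ["VeeamTransportSvc.exe"]     # placeholder, process name seen above

for item in expected_paths:
    print(("OK  " if item.lower() in paths else "MISS") + f"  path     {item}")
for item in expected_procs:
    print(("OK  " if item.lower() in procs else "MISS") + f"  process  {item}")
```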


Yeah tasks are key for repos. Nice to see you figured this one out.


Thanks for sharing the resolution, Michael.


You know, I don’t think I’ve ever created AV exclusions for Veeam, and we’ve used a host of products over the years. Primarily Vipre (although that was before we had homed in on Veeam and had a big mix of Barracuda and Datto), Cylance, CrowdStrike, and most recently, Windows Defender managed by Datto RMM. I feel like there was another in there as well. I wonder if I would see any difference with exclusions in place. I should send that to my central services admin to create the exclusions and see what happens.

