I guess the first question is: what changed (if anything) on Friday? Since you updated on Sunday and the problem was first detected Friday, it seems like it is not the update. Based on the bottleneck now being the Target (Repo), I would investigate there first. Not sure what else to suggest, as everyone’s environment is different; hopefully support can get to the bottom of it with a log deep dive. Best of luck, and it will be interesting to know what the issue is.
Forgot to share one of the biggest things - no changes were even made! Interestingly, I did make a slight Repo storage and array change to another subset of jobs (1 job actually), and those are working fine. But nothing was done to this perf-problem group of Jobs. Thanks Chris.
Well, as we tell others, it’s best to work with support at this point to narrow things down.
Best of luck.
What does the networking look like underneath the repositories and proxies? Do you see similar throughput during backup windows?
I haven’t yet checked network throughput during backup windows. Going to run the FIO test (Veeam KB) on the Repo, since my bottleneck has changed from the “source” to the “target”. But the mgmt network is different than the storage network, so my assumption is this FIO test will hopefully tell the story. Will update after I run it.
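As a side note, if you want to separate a raw network problem from a disk problem, a quick iperf3 run between the Proxy and Repo over the storage network can give a baseline. This is just a sketch; iperf3 has to be installed on both boxes, and the IP below is a placeholder for the Repo’s storage-network address:
# On the Repo (server side)
iperf3 -s
# On the Proxy (client side), pointed at the Repo's storage-network IP (placeholder)
iperf3 -c 10.10.10.20 -t 30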
Sounds like my fiber SFP issue that I had a while ago. It was very intermittent, causing latency spikes. It only affected 1 database, causing “application issues” which were also intermittent and kept giving different results resembling anything but a storage/SFP issue.
VeeamONE actually triggered some latency alerts finally and saved the day.
Back to your issue, did you have any custom settings or registry changes before the upgrade? Is it possible they have reverted back to the standard vanilla settings?
As far as the “subsets” go, are the networks the same for both Linux environments? It seems odd that only one would be affected.
Does it happen when you run an active full or create a new job using that proxy/repo?
Could it potentially be a failing disk causing latency but not actually failed yet? Do you have any S.M.A.R.T. data on the drives, or something to look into hardware issues, latency, etc.?
Additionally, I’d try running some benchmarks on both of your Linux repos and see how they respond. If it’s poor I’d start looking into disk/server type issues. It may not be the upgrade at all.
If they benchmark fine, I’d look at configs, networking or something further up the chain.
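For the S.M.A.R.T. angle, something like smartctl (from the smartmontools package) is a quick way to look; the device name below is just an example you’d adjust for your own drives:
# Install smartmontools if needed (Debian/Ubuntu shown)
sudo apt install smartmontools
# Overall health verdict for one drive (adjust the device name)
sudo smartctl -H /dev/sda
# Full attribute dump - watch reallocated/pending sector counts
sudo smartctl -a /dev/sda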
Hi Scott...thanks for offering some suggestions. Yeah, the more I look into this, the more I think it’s some kind of network issue. I had a similar issue a year or so ago. Veeam has a cool diskspd tool I used to determine latency on a Proxy/Repo combo box I had, while a 2nd combo box ran fine. That turned out to be a fiber issue (the cable got either cut or bent, but regardless it had to be replaced). This issue wouldn’t be that, as the data is not going to a ‘local offsite’ array but rather an array that’s not only in the same DC, but the same rack.
So what I’m thinking at this point (about to run a Linux FIO test) is maybe a bad twinax cable or something like that...either on the Repo host or somewhere along that path. Will update when I can.
BTW, I checked my array (good suggestion there...hadn’t thought of that), but no bad disks there. The Nimble UI ‘Hardware’ tab would show if one is bad or not running properly. Thanks man.
Hey everyone. Update - I resolved the issue.
I shared above what I did to pinpoint where the issue was coming from → either the Linux Proxy or the Linux Repo that the “problem” subset of Jobs uses...because, well, the other sets of Jobs ran fine.
Some steps to troubleshoot networking in Linux (a quick scripted sweep of these checks follows the list):
- Check the status of the host NICs:
ip a sh
- If using iSCSI on the Repo and Proxy as I am, verify the IQN in /etc/iscsi/initiatorname.iscsi matches what is configured on the storage array(s). Maybe something changed somehow.
- Re-add the storage target:
sudo iscsiadm -m discovery -t sendtargets -p <discovery-ip-addr>
- On the Repo, make sure the Repo Volumes are still connected and space is good:
df -hT
- Verify the storage array has the appropriate access connection type on the Repo Volumes. For Nimble, this is “Snapshot Only”.
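A minimal scripted version of that sweep, assuming iSCSI-backed Repo Volumes; the discovery IP is a placeholder you’d swap for your array’s discovery address:
#!/bin/bash
# Quick sweep of the checks above on a Linux Proxy/Repo
# <discovery-ip-addr> is a placeholder - use your array's discovery IP
ip a sh                                        # NIC/link status
cat /etc/iscsi/initiatorname.iscsi             # IQN the array should be expecting
sudo iscsiadm -m session                       # active iSCSI sessions
sudo iscsiadm -m discovery -t sendtargets -p <discovery-ip-addr>   # re-discover targets
df -hT                                         # Repo Volumes mounted and space OK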
If all the above checks out, as it did for me, do a speed test to see if there is some network connection issue possibly causing the latency and reducing I/O to the Repo Volumes. For both Windows and Linux, use the Veeam KB I shared above. In my case I use Linux, so I used the FIO tool (it needs to be installed; it’s not there by default on most Linux distributions). The cmd I ran was:
fio --name=full-write-test --filename=/path/testfile.dat --size=100G --bs=512k --rw=write --ioengine=libaio --direct=1 --time_based --runtime=600s
Run the above on a Repo which is working fine to compare I/O results between the two. As the tool is running (btw, it takes about 10 mins to complete), you’ll actually see the I/O that writing the .dat file produces. My problem Repo was running at an abysmal 55 kB/s, whereas my good Repo box was running up to 200 MB/s. Bigly difference!
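If you’d rather capture numbers to compare side by side than watch the live output, fio can also write its results to a JSON file. Same test as above, just pointed at an output file (the paths and file names here are placeholders):
fio --name=full-write-test --filename=/path/testfile.dat --size=100G --bs=512k \
    --rw=write --ioengine=libaio --direct=1 --time_based --runtime=600s \
    --output-format=json --output=repo-write-test.json
Run it on both the problem Repo and the known-good one, then compare the write bandwidth figures in the two JSON files.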
So what did I do to resolve the latency? For starters, I replaced the twinax cables running from this server to the iSCSI switch. That did the trick. If it hadn’t, I would’ve tried replacing the same cables connecting my target array to the iSCSI switch, but thankfully I didn’t have to go that far. All is now good. I just provided the details for others who may run into this issue.
Again, thanks to such a great Community for all the suggestions and ideas!
Glad to see you resolved the issue and it was something simple like a cable.