
Hi Community -

Yes..I ask questions too! 😉 

I recently noticed the performance of a subset of my Backup Jobs dropping from hundreds of MB/s to just 5-10MB/s, starting last Friday. I updated my Veeam environment on Sunday and initially thought the update was the cause (I noticed yesterday a few jobs took 4-14hrs to run!), but after further investigation I found the performance drop started first thing Friday morning.

I have two other subsets of Jobs which run fine. Each of the three “sets” of Jobs uses different Proxies and Repositories. The whole subset which uses a certain physical Linux Proxy and physical Linux Repo has this performance issue; another subset which uses a different set of physical Proxy and Repo boxes is fine; and the last subset of Jobs, which uses hotadd, is fine. So I’ve obviously narrowed the issue down to either the physical Proxy or the Repo in this subset. The Job stats went from a bottleneck of Source in the 90% range to Target in the high 90% range.

I’ve checked my Proxy forward and backward. All configs seem fine. My network seems OK. NICs and HBAs are “up” and seemingly working fine. The multipathing on it and on the Repo seems fine. The connections to my prod storage array (Proxy) and backup storage array (Repo) are still there...again, no changes were made there. The network backbone is 10Gb, and the storage network is isolated so no other traffic congestion traverses it.

I am working with Support, but was wondering if anyone has experienced this issue before and what was done to resolve it? I ran into a similar issue once where the cause was the network (fiber, believe it or not) over a short-range internal WAN to our DR site. But this problem subset of Jobs backs up to a ‘local’ array. It could still be network-related I suppose...bad twinax cables maybe? Anyway...going to attempt some kind of speed test to see, but again...curious if others have had an issue like this and what the resolution was.
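For the speed test, something like iperf3 between the Proxy and Repo over the storage network would at least rule the raw network path in or out. A rough sketch, assuming iperf3 is installed on both boxes; the IP below is just a placeholder for the Repo’s storage-network address:

# On the Repo - listen for test traffic
iperf3 -s
# On the Proxy - push 30 seconds of traffic over 4 parallel streams to the Repo's storage NIC (placeholder IP)
iperf3 -c 10.10.10.20 -t 30 -P 4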

Thanks all.

I guess the first question is what changed (if anything) on Friday?  Since you updated on Sunday and the problem was first detected Friday, it seems like it is not the update.  Based on the bottleneck now being the Target (Repo), I would investigate there first.  Not sure what else to suggest as everyone’s environment is different, and hopefully support can get to the bottom of it with a log deep dive.  Best of luck, and it will be interesting to know what the issue is.


Forgot to share one of the biggest things - no changes were even made! 😁 Interestingly, I did make a slight Repo storage and array change to another subset of jobs (1 job actually) which are working fine. But nothing was done to this perf problem group of Jobs 🤷🏻‍♂️ Thanks Chris.


Well, as we tell others, it’s best to work with support at this point to narrow things down.  😋

Best of luck. 👍🏼


What does the networking look like underneath the repositories and proxies? Do you see similar throughput during backup windows?


I haven’t yet checked network throughput during the backup windows. Going to run the FIO test (from the Veeam KB) on the Repo, since my bottleneck has changed from the “source” to the “target”. But the mgmt network is different from the storage network, so my assumption is this FIO test will hopefully tell the story. Will update after I run it.
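When I do check throughput, the plan is something along these lines (assuming the sysstat package is installed; ens1f0 is a placeholder for the storage-facing NIC):

# Per-interface throughput sampled every 2 seconds, 30 samples, while a job is running
sar -n DEV 2 30
# Or just watch the RX/TX byte and error counters on the storage NIC directly (placeholder interface name)
ip -s link show ens1f0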


Sounds like the fiber SFP issue I had a while ago. It was very intermittent, causing latency spikes. It only affected 1 database, causing “Application issues” which were also intermittent and kept giving different results resembling anything but a storage/SFP issue.

VeeamONE actually triggered some latency alerts finally and saved the day.

 

Back to your issue, did you have any custom settings or registry changes before the upgrade? Is it possible they have reverted back to the standard vanilla settings?

 

As far as the “subsets” go, are the networks the same for both Linux environments? It seems odd that only one would be affected.

 

Does it happen when you run an active full or create a new job using that proxy/repo?

 

Could it potentially be a failing disk that’s causing latency but hasn’t actually failed yet? Do you have any S.M.A.R.T. data on the drives, or something to look into hardware issues, latency, etc.?
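For example, something like this would give a quick read on drive health (a sketch assuming the smartmontools package is installed; /dev/sda is a placeholder device):

# Quick overall health verdict for one drive (device name is a placeholder)
sudo smartctl -H /dev/sda
# Full attribute dump - watch Reallocated_Sector_Ct, Current_Pending_Sector, UDMA_CRC_Error_Count
sudo smartctl -a /dev/sda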

 

 


Additionally, I’d try running some benchmarks on both of your Linux repos and see how they respond. If it’s poor I’d start looking into disk/server type issues. It may not be the upgrade at all.
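Even a rough sequential write check is enough to compare the two repos. A minimal sketch, assuming the repo is mounted at /mnt/repo01 (placeholder path) and keeping in mind zeros may compress on the array, so treat it only as a sanity check:

# ~4GiB direct (uncached) sequential write to the repo volume (path is a placeholder)
dd if=/dev/zero of=/mnt/repo01/ddtest.bin bs=1M count=4096 oflag=direct status=progress
rm -f /mnt/repo01/ddtest.bin   # clean up the test file afterwards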

 

If they benchmark fine, I’d look at configs, networking or something further up the chain. 


Hi Scott...thanks for offering some suggestions. Yeah...the more I look into this, the more I think it’s some kind of network issue. I had a similar issue like this a year or so ago. Veeam has a cool KB on the diskspd tool, which I used to determine latency on a Proxy/Repo combo box I had, while a 2nd combo box ran fine. That turned out to be a fiber issue (the cable either got cut or bent, but regardless it had to be replaced). This issue wouldn’t be that, as the data is not going to a ‘local offsite’ array but rather an array not only in the same DC, but the same rack 😂 So what I’m thinking at this point (about to run a Linux FIO test) is maybe a bad twinax cable or something like that...either on the Repo host or somewhere along that path. Will update when I can. BTW...checked my array (good suggestion there...hadn’t thought of that)...but no bad disks there. The Nimble UI ‘Hardware’ tab would show if one is bad or not running properly. Thanks man.


Hey everyone. Update - I resolved the issue.

I shared above what I did to pinpoint where the issue was coming from → either the Linux Proxy or the Linux Repo that the “problem” subset of Jobs uses...because, well, the other sets of Jobs ran fine 😊

Some steps to troubleshoot networking in Linux (pulled together into a one-shot sketch after the list):

  • Run ip a sh to check the status of the host NICs
  • If using iSCSI on the Repo and Proxy as I am, verify the IQN in the /etc/iscsi/initiatorname.iscsi file matches what is configured on the storage array(s). Maybe something changed somehow
  • Re-add the storage target: sudo iscsiadm -m discovery -t sendtargets -p <discovery-ip-addr>
  • On the Repo, make sure the Repo Volumes are still connected and space is good: df -hT
  • Verify the storage array has the appropriate access connection type on the Repo Volumes. For Nimble, this is “Snapshot Only”
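Here’s that checklist as a rough one-shot sketch; the discovery IP, Repo mount point, and device names are placeholders for whatever your environment uses:

# Rough one-shot version of the checks above (IPs and paths are placeholders)
ip a sh                                        # NIC link state and addresses
cat /etc/iscsi/initiatorname.iscsi             # confirm the initiator IQN still matches the array's ACL
sudo iscsiadm -m session                       # active iSCSI sessions to the array
sudo iscsiadm -m discovery -t sendtargets -p 10.10.10.50   # re-run target discovery (placeholder discovery IP)
df -hT /mnt/veeam-repo                         # Repo Volume still connected with free space (placeholder path)
sudo multipath -ll                             # multipath status for the Repo LUNs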

If all the above checks out, as it did for me, do a speed test to see if there is some network connection issue possibly causing latency and reducing I/O to the Repo Volumes. For both Windows and Linux, use the Veeam KB I mentioned above. Since I use Linux, I used the FIO tool (it needs to be installed; it’s not included by default in most Linux distributions). The cmd I ran was:
fio --name=full-write-test --filename=/path/testfile.dat --size=100G --bs=512k --rw=write --ioengine=libaio --direct=1 --time_based --runtime=600s

Run the above on a Repo which is working fine as well, to compare the I/O results between the two. As the tool is running...btw, it takes about 10mins to complete...you’ll actually see the I/O produced by writing the .dat file. My problem Repo was running at an abysmal 55kB/s 😳...whereas my good Repo box was running at up to 200MB/s. Bigly difference!

So what did I do to resolve the latency? For starters, I replaced the twinax cables running from this server to the iSCSI switch. That did the trick. If it hadn’t, I would’ve tried replacing the same cables connecting my target array to the iSCSI switch, but thankfully I didn’t have to go that far. All is now good. I just wanted to provide the details for others who may run into this issue.
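One more check worth mentioning, since it would likely have pointed at the bad cable sooner: NIC error counters. A quick sketch, assuming the storage-facing interface is named ens1f0 (placeholder name):

# Climbing CRC/receive errors on a storage NIC usually point at a bad cable, SFP, or twinax
ethtool -S ens1f0 | grep -iE 'err|crc|drop'
# Confirm negotiated speed and link state on the same interface
ethtool ens1f0 | grep -E 'Speed|Link detected'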

Again, thanks to such a great Community for all the suggestions and ideas! 


Glad to see you resolved the issue and it was something simple like a cable.

