I guess the first question is: what changed (if anything) on Friday? Since you updated on Sunday and the problem was first detected Friday, it seems like it is not the update. Based on the bottleneck now being the Target (Repo), I would investigate there first. Not sure what else to suggest, as everyone’s environment is different; hopefully support can get to the bottom of it with a log deep dive. Best of luck, and it will be interesting to know what the issue is.
Forgot to share one of the biggest things - no changes were even made! Interestingly, I did make a slight Repo storage and array change to another subset of jobs (1 job actually), and those are working fine. But nothing was done to this perf-problem group of Jobs. Thanks Chris.
Well, as we tell others, it’s best to work with support at this point to narrow things down.
Best of luck.
What does the networking look like underneath the repositories and proxies? Do you see similar throughput during backup windows?
I haven’t yet checked network throughput during backup windows. Going to run the FIO test (Veeam KB) on the Repo, since my bottleneck has changed from the “source” to the “target”. But the mgmt network is different than the storage network, so my assumption is this FIO test will hopefully tell the story. Will update after I run it.
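As a side note, if you want to separate a raw network problem from a disk problem, a quick iperf3 run between the Proxy and Repo over the storage network can give a baseline. This is just a sketch; iperf3 has to be installed on both boxes, and the IP below is a placeholder for the Repo’s storage-network address:
# On the Repo (server side)
iperf3 -s
# On the Proxy (client side), pointed at the Repo's storage-network IP (placeholder)
iperf3 -c 10.10.10.20 -t 30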
Sounds like my fiber SFP issue that I had a while ago. It was very intermittent, causing latency spikes. It only affected 1 database, causing “application issues” which were also intermittent and kept giving different results resembling anything but a storage/SFP issue.
VeeamONE actually triggered some latency alerts finally and saved the day.
Back to your issue, did you have any custom settings or registry changes before the upgrade? Is it possible they have reverted back to the standard vanilla settings?
As far as the “subsets” go, are the networks the same for both Linux environments? It seems odd that only one would be affected.
Does it happen when you run an active full or create a new job using that proxy/repo?
Could it potentially be a failing disk causing latency but not actually failed yet? Do you have any S.M.A.R.T. data on the drives, or something to look into hardware issues, latency, etc.?
Additionally, I’d try running some benchmarks on both of your Linux repos and see how they respond. If it’s poor I’d start looking into disk/server type issues. It may not be the upgrade at all.
If they benchmark fine, I’d look at configs, networking or something further up the chain.
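For the S.M.A.R.T. angle, something like smartctl (from the smartmontools package) is a quick way to look; the device name below is just an example you’d adjust for your own drives:
# Install smartmontools if needed (Debian/Ubuntu shown)
sudo apt install smartmontools
# Overall health verdict for one drive (adjust the device name)
sudo smartctl -H /dev/sda
# Full attribute dump - watch reallocated/pending sector counts
sudo smartctl -a /dev/sda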
Hi Scott...thanks for offering some suggestions. Yeah, the more I look into this, the more I think it’s some kind of network issue. I had a similar issue a year or so ago. Veeam has a cool diskspd tool I used to determine latency on a Proxy/Repo combo box I had, while a 2nd combo box ran fine. That turned out to be a fiber issue (the cable got either cut or bent, but regardless it had to be replaced). This issue wouldn’t be that, as the data is not going to a ‘local offsite’ array but rather an array that’s not only in the same DC, but the same rack.
So what I’m thinking at this point (about to run a Linux FIO test) is maybe a bad twinax cable or something like that...either on the Repo host or somewhere along that path. Will update when I can.
BTW, I checked my array (good suggestion there...hadn’t thought of that), but no bad disks there. The Nimble UI ‘Hardware’ tab would show if one is bad or not running properly. Thanks man.
Hey everyone. Update - I resolved the issue.
I shared above what I did to pinpoint where the issue was coming from → either the Linux Proxy or the Linux Repo that the “problem” subset of Jobs uses...because, well, the other sets of Jobs ran fine.
Some steps to troubleshoot networking in Linux (a quick scripted sweep of these checks follows the list):
- Check the status of the host NICs:
ip a sh
- If using iSCSI on the Repo and Proxy as I am, verify the IQN in /etc/iscsi/initiatorname.iscsi matches what is configured on the storage array(s). Maybe something changed somehow.
- Re-add the storage target:
sudo iscsiadm -m discovery -t sendtargets -p <discovery-ip-addr>
- On the Repo, make sure the Repo Volumes are still connected and space is good:
df -hT
- Verify the storage array has the appropriate access connection type on the Repo Volumes. For Nimble, this is “Snapshot Only”.
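A minimal scripted version of that sweep, assuming iSCSI-backed Repo Volumes; the discovery IP is a placeholder you’d swap for your array’s discovery address:
#!/bin/bash
# Quick sweep of the checks above on a Linux Proxy/Repo
# <discovery-ip-addr> is a placeholder - use your array's discovery IP
ip a sh                                        # NIC/link status
cat /etc/iscsi/initiatorname.iscsi             # IQN the array should be expecting
sudo iscsiadm -m session                       # active iSCSI sessions
sudo iscsiadm -m discovery -t sendtargets -p <discovery-ip-addr>   # re-discover targets
df -hT                                         # Repo Volumes mounted and space OK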
If all the above checks out, as it did for me, do a speed test to see if there is some network connection issue possibly causing the latency and reducing I/O to the Repo Volumes. For both Windows and Linux, use the Veeam KB I shared above. In my case I use Linux, so I used the FIO tool (it needs to be installed; it’s not there by default on most Linux distributions). The cmd I ran was:
fio --name=full-write-test --filename=/path/testfile.dat --size=100G --bs=512k --rw=write --ioengine=libaio --direct=1 --time_based --runtime=600s
Run the above on a Repo which is working fine to compare I/O results between the two. As the tool is running (btw, it takes about 10 mins to complete), you’ll actually see the I/O that writing the .dat file produces. My problem Repo was running at an abysmal 55 kB/s, whereas my good Repo box was running up to 200 MB/s. Bigly difference!
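If you’d rather capture numbers to compare side by side than watch the live output, fio can also write its results to a JSON file. Same test as above, just pointed at an output file (the paths and file names here are placeholders):
fio --name=full-write-test --filename=/path/testfile.dat --size=100G --bs=512k \
    --rw=write --ioengine=libaio --direct=1 --time_based --runtime=600s \
    --output-format=json --output=repo-write-test.json
Run it on both the problem Repo and the known-good one, then compare the write bandwidth figures in the two JSON files.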
So what did I do to resolve the latency? For starters, I replaced the twinax cables running from this server to the iSCSI switch. That did the trick. If it hadn’t, I would’ve tried replacing the same cables connecting my target array to the iSCSI switch, but thankfully I didn’t have to go that far. All is now good. I just provided the details for others who may run into this issue.
Again, thanks to such a great Community for all the suggestions and ideas!
Glad to see you resolved the issue and it was something simple like a cable.