I’ve just resolved a case with Veeam Support that I wanted to share.
I was working on a project protecting Oracle databases with RMAN on an IBM AIX platform, when the performance unexpectedly dropped.
During testing, all had been fine, but when we came to the migration, the backups were taking a painfully long time. We backed out of the change and began to review.
This blog post is a write-up of the troubleshooting steps we took, and what the root cause was.
Firstly, we looked at Veeam’s statistics screen to see where Veeam perceived the bottleneck to be. Veeam reported extremely high values for both the source and the network, with the target sitting idle. Since the database team were reporting that RMAN was waiting on further transmissions, we tested RMAN backing up to local disk for comparison and found it extremely fast: a job that took 30 minutes with Veeam completed in 1-2 minutes to local disk. Not good!
This allowed us to rule out the source as the ultimate bottleneck, but we performed one more test: we enabled Veeam’s compression on the backup job, and the job completed faster! Again, this pointed to the network being the problem, as compression meant a lower volume of data had to cross it.
With both parties agreed that the network stack felt like the place to investigate, we pulled in AIX and Networking resources. At this point I’d like to highlight that the network stack was 20Gbps at the source and 80Gbps at the destination, and it was all a layer-2 network dedicated to backups, so bandwidth really shouldn’t have been a problem.
We monitored current utilisation of the source and destination systems’ CPU, RAM, Disk, and Networking to confirm that neither endpoint was busy, and the metrics were as we expected. The systems were sat there on 1-2% CPU utilisation, <10% RAM consumption, and network & disk throughput of a few Mbps.
We started to dive into the network layer a bit further, discussing things such as RFC 1323 (TCP window scaling) and how AIX by default isn’t optimised for multi-gigabit networks. We aligned these settings, confirmed that we had jumbo frames consistently between the source and destination, and saw marginal improvements. And by marginal, I mean a consistent increase of just 0.1-0.2MBps.
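If you’re chasing a similar issue, a raw TCP transfer between the two endpoints is a useful control, because it takes RMAN, the Veeam plug-in, and the repository disks out of the equation entirely. Below is a minimal sketch, assuming Python is available on both hosts; the port, transfer size, and hostname are placeholders. Run the receiver on the repository and the sender on the source, then compare the reported figure with what the backup job achieves.

```python
import socket
import time

PORT = 5201            # placeholder port
CHUNK = 1 << 20        # 1 MiB per send
TOTAL = 512 << 20      # 512 MiB per test run

def receive():
    """Run on the repository side: accept one connection and drain it."""
    with socket.create_server(("", PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            while conn.recv(CHUNK):
                pass

def send(host: str):
    """Run on the source side: push TOTAL bytes and report the achieved rate."""
    payload = b"\0" * CHUNK
    sent = 0
    start = time.monotonic()
    with socket.create_connection((host, PORT)) as sock:
        while sent < TOTAL:
            sock.sendall(payload)
            sent += CHUNK
    elapsed = time.monotonic() - start
    # Rough figure only: the last chunks may still be sitting in socket buffers.
    print(f"{sent / elapsed / (1 << 20):.1f} MBps over {elapsed:.1f}s")

# e.g. receive() on the repository, then send("backup-repo.example") on the source.
```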
The thing I’ve omitted so far is what throughput I was actually seeing: it was consistently between 5-7MBps. I started diving into the logs and found something incredibly peculiar. At the beginning of any job, I’d see the following:
[03.05.2023 07:14:36.099] <139857825363712> cli | (EInt32) KbpsReadBandwidth = 0
[03.05.2023 07:14:36.099] <139857825363712> cli | (EInt32) KbpsWriteBandwidth = 0
But then, once the job started transmitting data, I’d see the below:
[03.05.2023 07:14:36.099] <139857825363712> cli | (EInt32) KbpsReadBandwidth = 5760
[03.05.2023 07:14:36.099] <139857825363712> cli | (EInt32) KbpsWriteBandwidth = 5760
And when the job finished, I’d see the values drop back to zero:
[03.05.2023 07:14:36.099] <139857825363712> cli | (EInt32) KbpsReadBandwidth = 0
[03.05.2023 07:14:36.099] <139857825363712> cli | (EInt32) KbpsWriteBandwidth = 0
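Those KbpsReadBandwidth/KbpsWriteBandwidth values appear to be the bandwidth cap the data mover believes it has been handed. To track them across a long log, a small filter helps; a minimal sketch, assuming the plug-in log lines look like the excerpts above (the log path is a placeholder):

```python
import re
import sys

# Matches lines like:
# [03.05.2023 07:14:36.099] <139857825363712> cli | (EInt32) KbpsReadBandwidth = 5760
PATTERN = re.compile(
    r"^\[?(?P<ts>\S+ \S+)\].*Kbps(?P<dir>Read|Write)Bandwidth = (?P<val>\d+)"
)

def scan(path: str) -> None:
    """Print every bandwidth value the data mover logs, in order."""
    with open(path, errors="replace") as log:
        for line in log:
            match = PATTERN.search(line)
            if match:
                print(f'{match["ts"]}  {match["dir"]:5}  {match["val"]} Kbps')

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else "veeam_plugin.log")
```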
However, I didn’t have a network throttle in place for these endpoints. What I did have was a global network throttling rule.
As some background: if you have a network throttling rule, such as 100Mbps in my case, and two resources are attempting to use a path under the same bandwidth restriction, Veeam will divide the maximum throughput between the two.
I had a capacity tier offload running, which was permitted 100Mbps of WAN bandwidth. With two consumers sharing the rule, 100Mbps/2 = 50Mbps, AKA 6.25MBps; hence the values of 5-7MBps being reported in the statistics.
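The numbers line up once you account for two things: the rule is split evenly between its active consumers, and it’s defined in megabits per second while the job statistics show megabytes per second. A tiny sketch of the arithmetic, using the figures above:

```python
def per_consumer_mbytes(rule_mbps: float, consumers: int) -> float:
    """Even share of a throttling rule, converted from Mbit/s to MByte/s."""
    return rule_mbps / consumers / 8

print(per_consumer_mbytes(100, 2))   # 6.25 MBps   - matches the 5-7MBps observed
print(per_consumer_mbytes(225, 2))   # 14.0625 MBps - the relaxed out-of-hours rule
```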
This seemed to line up really well and was worth exploring. My bandwidth throttle would increase to 225Mbps outside of business hours, so I tested the RMAN backup job then and saw just over a doubling of performance, as the capacity tier offload was still running. I then took it one step further and temporarily disabled the network throttling altogether: the job took only a few minutes! I had found the problem.
But this felt like a bug to me: the rule I was using was ANY:Internet, yet both endpoints were on the same L2 subnet, which didn’t even have a default gateway, so there was no complex asymmetric routing or anything else that could’ve legitimately triggered this rule. So, I dug deeper.
I manually created my own internet rules, but unlike the ANY:Internet rule, I explicitly set the source to only my ‘production/primary/management’ network (however you’d like to call it), as this was the only network with a gateway to the internet, and set the destinations to only the public IP address ranges reachable via the internet. I worked from the RFC1918 list of private networks, plus the other related ‘private’ ranges, to make sure I’d captured everything relevant, and ended up with 7 rules. They looked like the below; for clarity, the backup repositories had neighbouring IP addresses, so the source range covered just those two addresses. A quick sanity check of the destination ranges follows the table:
| Source IP Start | Source IP End | Destination IP Start | Destination IP End |
|---|---|---|---|
| <BackupRepo1> | <BackupRepo2> | 1.0.0.1 | 9.255.255.255 |
| <BackupRepo1> | <BackupRepo2> | 11.0.0.1 | 100.63.255.255 |
| <BackupRepo1> | <BackupRepo2> | 100.128.0.1 | 126.255.255.255 |
| <BackupRepo1> | <BackupRepo2> | 128.0.0.1 | 169.253.255.255 |
| <BackupRepo1> | <BackupRepo2> | 169.255.0.1 | 172.15.255.255 |
| <BackupRepo1> | <BackupRepo2> | 172.32.0.1 | 192.167.255.255 |
| <BackupRepo1> | <BackupRepo2> | 192.169.0.1 | 255.255.255.255 |
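Verifying those boundaries by eye is error-prone, so here is a minimal sketch using Python’s ipaddress module, with the rule boundaries copied from the table above and the private/special-use blocks I set out to exclude:

```python
import ipaddress

# Destination ranges from the seven throttling rules above (start, end).
RULES = [
    ("1.0.0.1", "9.255.255.255"),
    ("11.0.0.1", "100.63.255.255"),
    ("100.128.0.1", "126.255.255.255"),
    ("128.0.0.1", "169.253.255.255"),
    ("169.255.0.1", "172.15.255.255"),
    ("172.32.0.1", "192.167.255.255"),
    ("192.169.0.1", "255.255.255.255"),
]

# RFC1918 plus the related special-use blocks the rules are meant to skip.
PRIVATE = ["10.0.0.0/8", "100.64.0.0/10", "127.0.0.0/8",
           "169.254.0.0/16", "172.16.0.0/12", "192.168.0.0/16"]

def throttled(ip) -> bool:
    """True if the address lands inside one of the rule ranges."""
    addr = ipaddress.ip_address(str(ip))
    return any(ipaddress.ip_address(start) <= addr <= ipaddress.ip_address(end)
               for start, end in RULES)

# No address inside a private block should match a rule...
for net in map(ipaddress.ip_network, PRIVATE):
    assert not throttled(net.network_address)
    assert not throttled(net.broadcast_address)

# ...while ordinary public addresses should.
for ip in ("1.1.1.1", "8.8.8.8"):
    assert throttled(ip)

print("Rules capture public destinations and skip the private ranges.")
```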
At this point, neither the source nor the destination IP address could match any rule. I’d enabled throttling on them all, and I could see they had successfully captured the traffic to the cloud object storage provider and were rate-limiting it correctly; however, when I tested RMAN backups again, the traffic was still getting caught by the rules.
At that point, my workaround was to rate limit at the firewall, and raise a support case with Veeam.
The support team were great; a special thank you to Ivan Gavrilov for collating all the information, getting this passed up to QA, getting it validated as a bug, and getting a hotfix built for it!
It was also great to see the root cause provided for this, quoted below:
The preliminary cause of the issue was determined to be in Veeam plugin data mover inability to obtain its own IP address when running on AIX – under some conditions, the data structure for local IP addresses returned by AIX does not conform to POSIX standard and is thus ignored. Because of this, the data mover was unable to detect that it’s running in local subnet and VBR, in turn, applied the Internet rule to it, including throttling.
Veeam Support
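To make that tangible, here’s a tiny illustration of the decision the root cause describes; to be clear, this is not Veeam’s code, and the subnet and addresses are hypothetical. If the data mover can’t enumerate its own addresses, it has no way to tell that the peer sits on the same subnet, so the traffic falls through to the ANY:Internet rule and picks up the throttle.

```python
import ipaddress

def applicable_scope(peer_ip: str, local_nets) -> str:
    """Classify traffic towards peer_ip, given the local networks we know about."""
    peer = ipaddress.ip_address(peer_ip)
    if any(peer in net for net in local_nets):
        return "local subnet (no Internet throttling)"
    return "ANY:Internet rule (throttled)"

# Healthy case: the data mover knows its own addresses and subnets.
lan = [ipaddress.ip_network("192.168.10.0/24")]    # hypothetical backup subnet
print(applicable_scope("192.168.10.20", lan))      # -> local subnet

# The AIX failure mode: address enumeration returns nothing usable, so even a
# peer on the very same L2 segment is treated as an Internet destination.
print(applicable_scope("192.168.10.20", []))       # -> ANY:Internet, throttled
```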
And there we have it, the root cause identified, a fix issued, and a happy customer at the end of it all!