We have recently inherited a Kasten system and we are trying to (1) understand how Kasten works and (2) figure out why some of our applications are non-compliant. One issue that we are currently investigating is why one application in particular always fails due to a timeout.
We are trying to back up a namespace with a couple of PVCs (provisioned through CSI drivers: Ceph RBD and CephFS). The total backup size is approximately 1.5 TiB.
When we first noticed an issue with this application, we saw that the backup failed due to timeout (10h). We then increased the timeout (to 24h) and the next backup worked! (It took 11h.) However, all subsequent backups have failed due to timeout.
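As a rough sanity check, here is the average throughput these numbers imply. This is only a minimal sketch: it assumes the full 1.5 TiB is transferred on every run, which may not hold if the backups are incremental, and the function name is just illustrative.

```python
# Back-of-the-envelope throughput estimate for the backup described above.
# Assumption: the whole 1.5 TiB is moved on every run (not verified).

TIB_IN_MIB = 1024 * 1024  # 1 TiB expressed in MiB


def throughput_mib_per_s(size_tib: float, hours: float) -> float:
    """Average throughput in MiB/s needed to move size_tib within the given hours."""
    return size_tib * TIB_IN_MIB / (hours * 3600)


for label, hours in [
    ("needed to finish within the 10h timeout", 10),
    ("achieved by the successful 11h run", 11),
    ("needed to finish within the 24h timeout", 24),
]:
    print(f"{label}: {throughput_mib_per_s(1.5, hours):.1f} MiB/s")
```

Under that assumption, the one successful run averaged roughly 40 MiB/s, while the runs that hit the 24h timeout must have averaged below roughly 18 MiB/s, i.e. less than half the rate of the run that worked.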
We have been troubleshooting this for quite some time now but have been unable to find anything conclusive. The worst part is probably that we don’t really understand where in the Kasten deployment to look for relevant logs.
Any help figuring out this particular issue, or general information about best practices for troubleshooting Kasten, is greatly appreciated.
