
We have recently inherited a Kasten system and we are trying to (1) understand how Kasten works and (2) figure out why some of our applications are non-compliant. One issue we are currently investigating is why one application in particular always fails due to a timeout.

 

We are trying to back up a namespace with a couple of PVCs (provisioned through CSI drivers: Ceph RBD and CephFS). The total backup size is approximately 1.5 TiB.

When we first noticed an issue with this application, we saw that the backup failed due to a timeout (10h). We then increased the timeout (to 24h) and the next backup worked! (It took 11h.) However, all subsequent backups have failed due to timeout.

We have been troubleshooting this for quite some time now but have been unable to find anything conclusive. The worst part is probably that we don’t really understand where in the Kasten deployment to look for relevant logs.

 

Any help figuring out this particular issue, or general information about best practices for troubleshooting Kasten, is greatly appreciated.

Hi Erik, 

I had a similar issue exporting large PVCs. You mentioned you increased the timeout. Can you state which timeout in particular you increased?

You can check your timeout settings by running:

k get cm k10-config -n kasten-io -o yaml

Also, are you timing out during initial snapshot or during the export phase? 


You might also try splitting the job into multiple jobs (using labels for testing) to decrease the backup size/time/work/etc. and to see if there's one particular PVC causing the issue.
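For example, a quick way to list each PVC's size and storage class (the namespace is a placeholder) and spot a candidate for its own policy:

# Shows every PVC in the application namespace with its storage class and bound capacity.
kubectl get pvc -n <namespace> -o custom-columns=NAME:.metadata.name,STORAGECLASS:.spec.storageClassName,CAPACITY:.status.capacity.storage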
 



Can you state which timeout in particular you increased?

You can check your timeout settings by running:

k get cm k10-config -n kasten-io -o yaml

 

 

We read through the documentation and found:

 

timeout.jobWait: Specifies the timeout (in minutes) for completing execution of any child job, after which the parent job will be canceled. If no value is set, a default of 10 hours will be used. (Default: None)

 

Hence, we modified the timeout.jobWait parameter in the Helm values.yaml file, which in turn affected the K10TimeoutJobWait parameter in the configmap. The original value was unset and we changed it to 1440 (which we understand to mean "24 hours").
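For reference, an equivalent change applied with --set would look roughly like this (the release name and chart repository are assumptions about our setup):

# Assumes a Helm-managed install with release name "k10" from the kasten/k10 chart;
# --reuse-values keeps the rest of the existing configuration unchanged.
helm upgrade k10 kasten/k10 --namespace kasten-io --reuse-values --set timeout.jobWait=1440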

 

These are all the parameters in the configmap that mention "timeout":

K10TimeoutBlueprintBackup: "45"
K10TimeoutBlueprintDelete: "45"
K10TimeoutBlueprintHooks: "20"
K10TimeoutBlueprintRestore: "600"
K10TimeoutCheckRepoPodReady: "20"
K10TimeoutEFSRestorePodReady: "45"
K10TimeoutJobWait: "1440"
K10TimeoutStatsPodReady: "20"
K10TimeoutWorkerPodReady: "15"
kubeVirtVMsUnFreezeTimeout: 5m
vmWareTaskTimeoutMin: "60"

 

Also, are you timing out during initial snapshot or during the export phase? 

 

The issue seems to be with the initial snapshot. I can retrieve the following from the latest failed run:

phases:
- attempt: 1
  endTime: 2025-10-26T00:01:11Z
  name: Snapshotting Application Components
  startTime: 2025-10-25T00:01:09Z
  state: failed
  updatedTime: 2025-10-26T00:01:11Z

You might also try splitting the job into multiple jobs (using labels for testing) to decrease the backup size/time/work/etc. and to see if there's one particular PVC causing the issue.
 

 

We will look into this.


Hi,

Working with Erik on this issue.

One of the main obstacles here is that we have a hard time finding out what the action was actually doing when it was terminated by the timeout. Is it still calculating things? Waiting for disk I/O? Slow transfers? Stuck because of a lack of resources in the K8s cluster hosting it? Is there any good way of finding this out? I've been sifting through the logs and not found much that relates to the actions, but I only have a very general idea of what I'm looking at.

Any suggestions on what to look for, or where to find more info or logs, are appreciated.


You could try looking at the RunActions and BackupActions.  

 

Look for Failed RunActions inside of the kasten-io namespace:

kubectl get -n kasten-io runactions -o jsonpath='{range .items[?(@.status.state=="Failed")]}{.metadata.name}{"\t"}{.status.state}{"\n"}{end}'


or BackupActions inside of the namespace you're backing up:

kubectl get backupactions -n <namespace>
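Once you have the name of a failed action, pulling its full YAML (the name below is a placeholder) should show which phase it stopped in and any error message attached to it:

kubectl get -n kasten-io runactions <runaction-name> -o yaml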


Are there any errors thrown in Ceph during the job?
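If the cluster runs Rook/ODF with the toolbox deployed, a quick health check might look like this (the namespace and deployment name are assumptions; on an external-mode cluster you would run ceph health detail on the external Ceph cluster instead):

kubectl -n openshift-storage exec deploy/rook-ceph-tools -- ceph health detail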


@ErikThorsellZ Thank you for taking the time to post this topic.

There could be multiple issues that might be causing this. I would probably ask: which action fails due to the timeout? Is it the backup action or the export action?

If it is the backup action, it is usually a failure to get the Kubernetes snapshot into a ready state.

If it is the export action, does it fail after 24 hrs (since you mentioned the timeout increase to 24h) or does it fail very early (say, 15 mins * 3)?

Once we understand at what point it fails, I can point you to where to look for the issues.


@ErikThorsellZ Thank you for taking the time to post this topic.

There could be multiple issues that might be causing this. I would probably ask: which action fails due to the timeout? Is it the backup action or the export action?

If it is the backup action, it is usually a failure to get the Kubernetes snapshot into a ready state.

If it is the export action, does it fail after 24 hrs (since you mentioned the timeout increase to 24h) or does it fail very early (say, 15 mins * 3)?

Once we understand at what point it fails, I can point you to where to look for the issues.

 

I believe I have already answered one of your questions in a previous post. Am I misunderstanding what you're asking for?

 

Concerning snapshot or export:

The issue seems to be with the initial snapshot. I can retrieve the following from the latest failed run:

phases:
- attempt: 1
  endTime: 2025-10-26T00:01:11Z
  name: Snapshotting Application Components
  startTime: 2025-10-25T00:01:09Z
  state: failed
  updatedTime: 2025-10-26T00:01:11Z

 

Concerning timeout:

As can be seen in the code snippet above, the snapshotting starts at 2025-10-25T00:01:09Z and updates (as it fails) at 2025-10-26T00:01:11Z (24 hours and 2 seconds later). So Kasten attempts to do something for 24h and then fails.


Thank you for confirming that it is the backup action that is failing.

I somehow missed the snippets that you added earlier. 

The only actions that Kasten performs during "Snapshotting Application Components" are the volume snapshots and/or blueprint-based backups.

Since you mentioned that you have a couple of PVCs (with both CephFS and Ceph RBD), I would ask you to check the status of the VolumeSnapshot resources created in this namespace when you run the backup. Basically, Kasten just waits for the snapshot to be ready to use (there is a boolean readyToUse field in the VolumeSnapshot status).

If the snapshot does not become ready, there could be some underlying issue in the CSI driver while snapshotting.
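Something along these lines (the namespace is a placeholder) lets you watch the snapshots while the backup runs and see whether readyToUse flips to true:

# Watches VolumeSnapshots in the application namespace and prints the readyToUse status.
kubectl get volumesnapshots -n <namespace> -w -o custom-columns=NAME:.metadata.name,READY:.status.readyToUse,CLASS:.spec.volumeSnapshotClassName,CREATED:.metadata.creationTimestamp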

 


@jaiganeshjk, no worries.

I looked at the volumesnapshots.snapshot.storage.k8s.io in the namespace we are trying to back up and I found almost 150 objects. Some are using the snapshot class related to Ceph RBD and some are using CephFS. The snapshots are 200Gi and 1Gi respectively.

I notice that the newest snapshots are a little bit more than a week old, whereas the oldest ones are almost a year old.

All of the resources have "READYTOUSE" set to true.

Could it be the case that these old snapshots are causing issues?


This is weird. Would you be able to pick one of the volumesnapshots from the latest backup and share the YAML here?

It would also help if you could share the YAML of the corresponding volumesnapshotcontent

 


@jaiganeshjk, as I mentioned, the latest (according to creation date) volumesnapshot is a little bit more than a week old and looks like this:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  creationTimestamp: "2025-10-22T00:03:35Z"
  finalizers:
  - snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
  - snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
  generation: 1
  labels:
    kasten_io_appnamespace: ponyo
    kasten_io_jobid: 188edac4-aeda-11f0-8e76-0a580a811bcc
    kasten_io_manifestid: 1801b77a-aeda-11f0-8e2f-0a580a811bd9
    kasten_io_pvc: data-ponyo-kafka-0
    name: kasten__snapshot-ponyo-ns-2025-10-22t00-00-00z-00
  name: k10-csi-snap-ppslgmt74jmm8m5b
  namespace: ponyo
  resourceVersion: "4827477351"
  uid: 483b435d-9332-4416-877a-80986d61f5ea
spec:
  source:
    persistentVolumeClaimName: data-ponyo-kafka-0
  volumeSnapshotClassName: ocs-external-storagecluster-rbdplugin-snapclass
status:
  boundVolumeSnapshotContentName: snapcontent-483b435d-9332-4416-877a-80986d61f5ea
  creationTime: "2025-10-22T00:03:36Z"
  readyToUse: true
  restoreSize: 200Gi

The corresponding volumesnapshotcontent:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  annotations:
    snapshot.storage.kubernetes.io/deletion-secret-name: rook-csi-rbd-provisioner
    snapshot.storage.kubernetes.io/deletion-secret-namespace: openshift-storage
  creationTimestamp: "2025-10-22T00:03:35Z"
  finalizers:
  - snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection
  generation: 1
  name: snapcontent-483b435d-9332-4416-877a-80986d61f5ea
  resourceVersion: "4827477349"
  uid: f01b3b8b-f810-4bff-8b18-1271b800908e
spec:
  deletionPolicy: Delete
  driver: openshift-storage.rbd.csi.ceph.com
  source:
    volumeHandle: 0001-0011-openshift-storage-000000000000000d-18ba58c8-3115-4416-a423-863b474738fb
  volumeSnapshotClassName: ocs-external-storagecluster-rbdplugin-snapclass
  volumeSnapshotRef:
    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    name: k10-csi-snap-ppslgmt74jmm8m5b
    namespace: ponyo
    resourceVersion: "4827477247"
    uid: 483b435d-9332-4416-877a-80986d61f5ea
status:
  creationTime: 1761091416738609698
  readyToUse: true
  restoreSize: 214748364800
  snapshotHandle: 0001-0011-openshift-storage-000000000000000d-fff534ae-3d36-49a8-8ccb-fa854477f517

Granted, I don't really know what to look for, but I cannot see anything that immediately stands out as erroneous. Can you?


Thank you. I don't see any problem here either. (I was looking to compare the creation timestamps to see if there is a gap between creation and readyToUse.)

Since you mentioned you only see snapshots that are a week old, may I ask if your backups started failing after that?

In that case, Kasten cleans up any snapshot that did not come into a ready state.


I would recommend triggering a backup now and seeing if the new snapshots reach the ready state.

Ideally it shouldn't take more than a minute for a snapshot to become ready. If it takes longer, we could look at the csi-snapshotter logs in Ceph's namespace.
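As a starting point, those logs could be pulled with something like the following; the pod label is an assumption and varies between Rook/ODF versions, so adjust the selector to match your deployment:

# csi-snapshotter runs as a sidecar in the RBD provisioner pods.
kubectl -n openshift-storage logs -l app=csi-rbdplugin-provisioner -c csi-snapshotter --tail=200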


Thank you. I don't see any problem here either. (I was looking to compare the creation timestamps to see if there is a gap between creation and readyToUse.)

 

 

Thank you for clarifying what you were looking at.

 

Since you mentioned you only see snapshots that are a week old, may I ask if your backups started failing after that?

In that case, Kasten cleans up any snapshot that did not come into a ready state.

 

When we identified a problem, the backup policy had not run successfully for a long time (months). We then noticed the 10h timeout in the UI, changed it to 24h and triggered a manual backup once. That manual backup worked, but thereafter all automatic backups have failed.

I cancelled the currently running job and re-triggered it manually. It is still stuck at snapshotting and I cannot see any new volumesnapshots.


Thanks for sharing the background.
I would also be interested in checking the configmaps in the Kasten namespace.

Whenever Kasten runs a backup, it locks the namespace with a configmap whose name is prefixed with k10-nslock-

This is to make sure that there are no simultaneous backup runs for the same namespace at the same time.

Would you be able to check if there is an nslock with the namespace name or the policy name that has been left in the cluster for a long time?

kubectl -n kasten-io get cm | grep -i nslock

The reason I am asking is that, as I understand it, this is currently not just an issue with the snapshotting (because we haven't got to that part yet); it's an issue with running the job itself.


Thanks for sharing the background.
I would also be interested in checking the configmaps in the Kasten namespace.

Whenever Kasten runs a backup, it locks the namespace with a configmap whose name is prefixed with k10-nslock-

This is to make sure that there are no simultaneous backup runs for the same namespace at the same time.

Would you be able to check if there is an nslock with the namespace name or the policy name that has been left in the cluster for a long time?

kubectl -n kasten-io get cm | grep -i nslock

The reason I am asking is that, as I understand it, this is currently not just an issue with the snapshotting (because we haven't got to that part yet); it's an issue with running the job itself.

 

This is interesting, thank you! I see a k10-nslock referencing the troublesome namespace and it’s 7 days old! 


Thank you for confirming.
I would recommend cancelling the backup that is currently running. Make sure that it is cancelled and then delete the configmap. Once deleted, please retry the backup and let me know how it goes.
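For reference, removing the stale lock is just a configmap deletion; the name below is a placeholder for whatever the earlier grep returned:

# Only do this once the running backup has been cancelled.
kubectl -n kasten-io delete cm k10-nslock-<name-from-grep>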


Removing the lock configmap and re-triggering did enable me to restart the policy. This time several workloads were snapshotted and the application configuration was also snapshotted. However, the UI says "Snapshotting Application Components (x2)", which I understand to mean that it has attempted to perform the snapshot twice(?). There's also a 1% in the top left corner of the backup action visual item. The same visual item indicates that the snapshot(?) contains 376 artifacts.


Yes, x2 means that Kasten is making a second attempt at the snapshot, which also means that the first attempt failed for some reason.

You should be able to find the failure reason in the action details YAML or under the phases in the side panel that opens when you click the running action.
If you can share the complete error message, we can try to figure out what is happening.
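If nothing useful shows up in the UI, the phase details can usually be read straight from the action resources as well; this is a sketch that assumes the phases sit under .status.phases, as in the snippets shared earlier:

# Prints each RunAction name followed by its phases (attempts, state, timestamps).
kubectl get -n kasten-io runactions -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{.status.phases}{"\n\n"}{end}'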


The only information I can find (in the UI and in the YAML) referencing “Component” is:

phases:
- attempt: 2
  endTime: null
  name: Snapshotting Application Components
  startTime: 2025-10-29T09:23:49Z
  state: running
  updatedTime: 2025-10-29T09:42:29Z

I see no logs or error messages.


It seems that this might need more involved debugging from the support team.

You could also open an official support case at https://my.veeam.com/open-case/technical-case by selecting "Veeam Kasten for Kubernetes Trial".


Thank you so much for your help. We will take this with support.