Question

Snapshots lying around in Ceph trash

  • 26 February 2023

Userlevel 3

I recently purged everything in my Ceph cluster and started fresh.

I have backups running, but in the two weeks since this new install the number of “trash” snapshots keeps going up. It was 6 at first, now 20, and I can’t delete them.

 

I haven’t done a single restore via Kasten, so I don’t understand why these images aren’t being deleted (there should be no volume/image dependencies from a restore).

Nevertheless, purging them gives: “RBD image has snapshots (error deleting image from trash)”.
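
For context, this is roughly how it looks from the Ceph side (a sketch only; the pool name replicapool is a placeholder for whichever pool the CSI driver actually uses):

# list the images sitting in the RBD trash for the pool
rbd trash ls --pool replicapool

# trying to empty the trash is what produces the error above
rbd trash purge replicapool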

My retention is basically to keep 1 hourly backup; the rest are sent to external storage outside of Ceph.

 

Does Kasten have bugs with snapshots? Before I started fresh, restoring PVCs (creating the clones and restoring them) would also leave loads of unclearable snapshots inside Ceph.


3 comments

Userlevel 5

Hello @voarsh, try having a look at the status of the RetireActions that get created and check whether any of them are failing:

kubectl get retireactions.actions.kio.kasten.io -o custom-columns=NAME:.metadata.name,"CREATED AT":.metadata.creationTimestamp,STATUS:.status.state


Also, what is the deletion policy specified in the VolumeSnapshotClass that the VolumeSnapshot uses?
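
A quick way to check both, as a sketch (the column paths assume the v1 snapshot.storage.k8s.io API):

# deletion policy of each VolumeSnapshotClass
kubectl get volumesnapshotclass -o custom-columns=NAME:.metadata.name,DRIVER:.driver,DELETIONPOLICY:.deletionPolicy

# which class each VolumeSnapshot actually references
kubectl get volumesnapshot -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.volumeSnapshotClassName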
 

Ahmed Hagag

Userlevel 3

Argh. My reply was over the char limit….

An example of a stuck retire job:

kubectl get --raw /apis/actions.kio.kasten.io/v1alpha1/retireactions/retire-r6qdh8jdps/details

{"kind":"RetireAction","apiVersion":"actions.kio.kasten.io/v1alpha1","metadata":{"name":"retire-r6qdh8jdps","uid":"3a637be2-b801-11ed-874e-36f9cb9e1653","resourceVersion":"687634","creationTimestamp":"2023-03-01T07:18:13Z","labels":{"k10.kasten.io/policyName":"chatwoot-backup","k10.kasten.io/policyNamespace":"kasten-io"}},"status":{"state":"Running","startTime":"2023-03-01T07:18:13Z","endTime":null,"restorePoint":{"name":""},"result":{"name":""},"actionDetails":{"phases":[{"endTime":null,"name":"Retiring RestorePoint","startTime":"2023-03-01T07:18:13Z","state":"waiting","updatedTime":"2023-03-01T11:25:13Z"}]},"progress":14},"spec":{"subject":{"apiVersion":"internals.kio.kasten.io/v1alpha1","kind":"Manifest","name":"59c2ae22-b7f6-11ed-874e-36f9cb9e1653"},"scheduledTime":"2023-03-01T07:00:00Z","retireIndex":614}}

I have hundreds of backups in the UI that won’t retire.

I had manually deleted all of the VolumeSnapshots, since Kasten won’t retire them…

Userlevel 3

Also, what is the deletion policy specified in the VolumeSnapshotClass that the VolumeSnapshot uses?

ceph-block (delete) and k10-clone-ceph-block (retain)… though, looking at new snapshots in VolumeSnapshots, the class being used seems to be ceph-block. I can’t speak for the 200+ that I had deleted earlier, but I assume it was the same.

 

As it currently stands, I have 27 snapshots in Ceph trash that I can’t delete… I would need to manually dig into Ceph to see what the parent image is and whether it is safe to delete all the child snapshots, because I think Kasten has definitely messed something up.
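
If I do end up digging in manually, my understanding is it would go roughly like this (a sketch only; replicapool, the image ID, image name and snapshot name are placeholders, and I’d want to confirm nothing depends on the snapshots before purging anything):

# find the trashed image and its ID
rbd trash ls --pool replicapool

# restore it from trash so the normal commands work on it again
rbd trash restore --pool replicapool <image-id>

# list its snapshots and check whether any clones depend on them
rbd snap ls replicapool/<image-name>
rbd children replicapool/<image-name>@<snap-name>

# only if nothing depends on them: unprotect, purge the snapshots, then remove the image
rbd snap unprotect replicapool/<image-name>@<snap-name>
rbd snap purge replicapool/<image-name>
rbd rm replicapool/<image-name>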

 

kubectl get retireactions.actions.kio.kasten.io -o custom-columns=NAME:.metadata.name,"CREATED AT":.metadata.creationTimestamp,STATUS:.status.state

And this does show some as failed, but without much reason why; the output is far too large to post here, and it doesn’t offer much detail.
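
Something like this narrows it down, as a sketch (I’m assuming the failed state is reported literally as “Failed”; adjust if it differs):

# names of the retire actions that report a failed state
kubectl get retireactions.actions.kio.kasten.io -o json | jq -r '.items[] | select(.status.state == "Failed") | .metadata.name'

# full status of one of them, to look for an error message
kubectl get retireactions.actions.kio.kasten.io <name> -o yaml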
