Question

Ceph RBD - Kasten still creates leftover orphaned RBD images



Following up from this:

 

I have my backup policies set to keep only 1 backup (for export) on S3-compatible storage.


I have a reasonably fresh Ceph cluster with 33 images. Earlier I noticed 35 images, so some image(s) created by Kasten K10 haven't been deleted properly (and there's an unexpected increase in disk space usage).
In the Ceph trash I see two images that can't be deleted because of this error: [errno 39] RBD image has snapshots (error deleting image from trash)
So I run this:

for x in $(rbd list --pool ceph-blockpool); do
  echo "Listing snapshots for $x:"
  rbd snap ls ceph-blockpool/$x
done

The output doesn’t show any snaps for images…
When I try and gather more info about the two images that can’t be purged:
rbd status ceph-blockpool/csi-snap-9cafd9dd-7fa2-40cc-b0fd-69c937008228
rbd: error opening image csi-snap-9cafd9dd-7fa2-40cc-b0fd-69c937008228: (2) No such file or directory
rbd status ceph-blockpool/csi-snap-4795fabd-f45c-4e6d-8ec0-53cb3283a5c3
rbd: error opening image csi-snap-4795fabd-f45c-4e6d-8ec0-53cb3283a5c3: (2) No such file or directory

 

I don't understand why the extra space is used (a couple of hundred GBs). There are two images that don't seem to exist, yet I can't remove them because Ceph thinks they have snapshots, even though my first command doesn't show any snapshots in the pool. And when I try to query those two images for more info or associated snapshots, they apparently don't exist.
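For reference, the two stuck entries above come from a trash listing along these lines (just a sketch, using the same pool as my other commands):

# List deferred-deletion entries for the pool; the output columns include each entry's image ID and original name.
rbd trash ls --pool ceph-blockpool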


3 comments


If I remember correctly from when I ran into this, either K10 or the K8S RBD driver takes a copy-on-write clone first and then snapshots the clone. This makes it pretty difficult to track down because you have to find RBD images with a parent ID of the one you are trying to remove. That COW image will then be a root for one or more snapshots which you’ll have to flatten or purge.
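Something along these lines should surface the chain (a rough sketch, using one of the stuck names from your post as an example; I'm assuming rbd info still reports the parent even when that parent sits in the trash, and that your Ceph release supports --all on rbd snap ls to show trash-namespace snapshots):

# Find live images whose parent points at the stuck image, then list their
# snapshots, including any hidden in the trash namespace (--all, recent Ceph).
PARENT=csi-snap-9cafd9dd-7fa2-40cc-b0fd-69c937008228
for img in $(rbd list --pool ceph-blockpool); do
  if rbd info ceph-blockpool/$img 2>/dev/null | grep -q "parent:.*$PARENT"; then
    echo "$img is a clone of $PARENT"
    rbd snap ls --all ceph-blockpool/$img
  fi
done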


@jaiganeshjk 


If I remember correctly from when I ran into this, either K10 or the K8S RBD driver takes a copy-on-write clone first and then snapshots the clone. This makes it pretty difficult to track down because you have to find RBD images with a parent ID of the one you are trying to remove. That COW image will then be a root for one or more snapshots which you’ll have to flatten or purge.

I guess I don’t know if there’s a bug in the K8s snapshot driver or K10.

It keeps happening after a while with unattended hourly snapshot backups. I don't want my cluster filling up with orphaned images that take unnecessary space. My previous Ceph cluster ended up with several TBs of images (several hundred that couldn't be purged) that I couldn't remove through conventional means.

 

Have you got any tips on how I can proceed to find the parent image and delete both the parent and the clone, bearing in mind that I can't even query the image (as per my first post here)?
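For reference, this is the kind of cleanup sequence I understand would be needed once a child clone is identified (a rough sketch with placeholder values; I haven't been able to run it yet because the images don't resolve by name):

# Placeholders: CHILD is a clone found via its parent pointer,
# STUCK_ID is the ID column from "rbd trash ls" for the stuck entry.
CHILD=placeholder-child-image
STUCK_ID=placeholder-image-id

rbd flatten ceph-blockpool/$CHILD       # detach the clone so the parent's snapshot is no longer referenced
rbd trash rm ceph-blockpool/$STUCK_ID   # retry deleting the stuck entry by its ID
# or retry everything pending deletion in the pool:
rbd trash purge ceph-blockpool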
