Hello,

We are seeing events in the kasten-io namespace where several PVCs remain in the Provisioning state, but when we check the cluster, those PVCs are never actually provisioned.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

v1/events
LAST SEEN   TYPE     REASON         OBJECT                                     MESSAGE
3m17s       Normal   Provisioning   persistentvolumeclaim/kanister-pvc-15kkr   External provisioner is provisioning volume for claim "kasten-io/kanister-pvc-15kkr"
 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once the snapshot finishes, the PVC is deleted in Kubernetes correctly, but the orphaned subvolume remains in the CephFS storage:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

2023-03-28T15:43:47.536161813Z I0328 15:43:47.536137       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"kasten-io", Name:"kanister-pvc-15kkr", UID:"e67dfd43-b5c1-40fb-8a13-7c3923c3724c", APIVersion:"v1", ResourceVersion:"2383954963", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "ocs-storagecluster-cephfs": error getting handle for DataSource Type VolumeSnapshot by Name snapshot-copy-6xvmkfwx: error getting snapshot snapshot-copy-6xvmkfwx from api server: volumesnapshots.snapshot.storage.k8s.io "snapshot-copy-6xvmkfwx" not found
 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These events refer to PVCs that are used to take daily snapshots of the Kubernetes cluster. The orphaned subvolumes accumulate in the Ceph cluster, and as a workaround we manually delete the associated subvolumes.
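
For reference, the manual cleanup on the Ceph side looks roughly like this (a sketch; the filesystem name ocs-storagecluster-cephfilesystem and the subvolume group csi are assumptions based on a default ODF/Ceph CSI setup, so substitute your own):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# List the CSI-managed subvolumes in the CephFS filesystem
# (filesystem and group names are assumptions; adjust for your cluster)
ceph fs subvolume ls ocs-storagecluster-cephfilesystem --group_name csi

# Check whether a subvolume still has an in-progress clone before touching it
ceph fs clone status ocs-storagecluster-cephfilesystem <subvolume-name> --group_name csi

# Remove an orphaned subvolume once nothing references it
ceph fs subvolume rm ocs-storagecluster-cephfilesystem <subvolume-name> --group_name csi

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~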

Can you help me clean up those leftover PVC remnants in the kasten-io namespace, please?

Thank you

@jaiganeshjk 


Hi @jesus.rosales, thank you for posting the question here.

The Kanister PVCs you mention are created during an export and are cleaned up when the export completes.

Since you mentioned CephFS, there is a chance that K10 initiates a PVC creation from a VolumeSnapshot source, the operation fails for some reason (possibly a timeout), and K10 cleans up the PVC. But the clone operation that the Ceph CSI driver started in the background keeps running and never gets cleaned up, because the CSI driver doesn't handle cancellation/removal of the clone when the PVC (created from the snapshot source) is removed.

K10 has a pod wait timeout set to 15 minutes by default while it waits for the temporary cloned volume to become ready for export. However, I have seen CephFS take longer than that to clone volumes (the default clone operation in CephFS is a full copy).

The above is my hunch; it could be something else entirely.
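
If slow CephFS clones do turn out to be the cause, raising that timeout may help. A minimal sketch, assuming your K10 Helm chart exposes the kanister.podReadyWaitTimeout value (in minutes); please verify the exact value name against your chart before applying:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# Raise the Kanister pod-ready wait timeout from the 15-minute default to 30 minutes.
# kanister.podReadyWaitTimeout is an assumption; confirm it exists in your
# chart's values (helm show values kasten/k10) before running this.
helm upgrade k10 kasten/k10 \
  --namespace kasten-io \
  --reuse-values \
  --set kanister.podReadyWaitTimeout=30

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~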

Can you confirm what the largest PVC (in terms of utilization) in your environment is? And if you have some time, you can manually create a PVC for that large volume with the VolumeSnapshot as the source and see how long it takes for the PV to be created and bound to the PVC (see the example manifest below).
 

This way we can verify if the above issue is happening in your cluster.
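
Something like this would do it (a sketch; the PVC name, snapshot name, and size are placeholders to be replaced with a real VolumeSnapshot from your cluster):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# Test PVC cloned from an existing VolumeSnapshot; time how long it takes
# to reach the Bound state. Name, snapshot, and size below are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clone-timing-test
  namespace: kasten-io
spec:
  storageClassName: ocs-storagecluster-cephfs
  dataSource:
    name: <existing-volumesnapshot-name>
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: <size-of-the-source-volume>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can then watch it with kubectl get pvc clone-timing-test -n kasten-io -w and note how long it takes to go from Pending to Bound.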


Thank you jaiganeshjk for your answer, we will replicate those tests. Now I have a question: what does the communication flow between K10 (kasten-io) and Ceph look like?

That is, from the creation of the PVC and the taking of the snapshot, through to its removal from Ceph.

Thanks in advance.


K10 doesn't interact with Ceph directly when you are using a CSI driver. K10 creates the corresponding Kubernetes resources (VolumeSnapshot/PVC) and the CSI driver acts on them.

We simply create a VolumeSnapshot resource when a backup action runs. Similarly, we create a temporary PVC with the VolumeSnapshot as its source during the export (which causes Ceph to internally clone the volume).

We delete the temporary PVC as soon as the export completes. We also delete the VolumeSnapshot resource when its restore point crosses the retention period.
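
To make that flow concrete, the snapshot resource we create looks roughly like this (a generic CSI VolumeSnapshot sketch; the names and the VolumeSnapshotClass ocs-storagecluster-cephfsplugin-snapclass are assumptions based on a default ODF install, not the exact objects K10 generates):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# Backup: a VolumeSnapshot pointing at the application PVC.
# The Ceph CSI driver reacts by snapshotting the underlying subvolume.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: snapshot-copy-example          # placeholder name
  namespace: kasten-io
spec:
  volumeSnapshotClassName: ocs-storagecluster-cephfsplugin-snapclass  # assumed ODF default
  source:
    persistentVolumeClaimName: <application-pvc>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

During the export, the temporary kanister-pvc-* is then created with this VolumeSnapshot as its dataSource (as in the timing-test manifest above), Ceph clones the snapshot into a new subvolume behind the scenes, and the PVC is deleted once the export completes.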


Great, one last question… (I hope so, haha). Do you know how the workflow between Kanister, Kopia, and the CSI drivers in Kubernetes works?

 

Thanks in advance.

