
Hello everyone, I'm continuing to dive into Kubernetes and Kasten.

I deployed a new cluster, with NetApp ONTAP as the storage system,

then deployed Kasten and a simple application.

When I try to make a backup, the process freezes at the "Snapshotting Application Components" step and nothing happens after that. The task is not interrupted, and there are no errors either.

At the same time, the snapshot itself apparently gets created successfully:

korp@k8smaster:~$ kubectl get vs -n mysql
NAME                            READYTOUSE   SOURCEPVC                 SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS           SNAPSHOTCONTENT                                    CREATIONTIME   AGE
k10-csi-snap-wj6bljvrhrvzblpm   false        data-my-release-mysql-0                                         trident-snapshotclass   snapcontent-634da5fe-9160-4af1-afb9-d0f9a546d2cd                  6m53s
korp@k8smaster:~$ kubectl get vsc -n mysql
NAME                                               READYTOUSE   RESTORESIZE   DELETIONPOLICY   DRIVER                  VOLUMESNAPSHOTCLASS     VOLUMESNAPSHOT                  VOLUMESNAPSHOTNAMESPACE   AGE
snapcontent-634da5fe-9160-4af1-afb9-d0f9a546d2cd                              Delete           csi.trident.netapp.io   trident-snapshotclass   k10-csi-snap-wj6bljvrhrvzblpm   mysql                     6m58s

Help me figure out where to start looking. Without a clear error, it's hard for me to tell what's wrong.

Kubernetes Version Check:
  Valid kubernetes version (v1.21.14)  -  OK

RBAC Check:
  Kubernetes RBAC is enabled  -  OK

Aggregated Layer Check:
  The Kubernetes Aggregated Layer is enabled  -  OK

CSI Capabilities Check:
  Using CSI GroupVersion snapshot.storage.k8s.io/v1  -  OK

Validating Provisioners:
csi.trident.netapp.io:
  Is a CSI Provisioner  -  OK
  Storage Classes:
    basic
      Valid Storage Class  -  OK
  Volume Snapshot Classes:
    trident-snapshotclass
      Has k10.kasten.io/is-snapshot-class annotation set to true  -  OK
      Has deletionPolicy 'Delete'  -  OK

Validate Generic Volume Snapshot:
  Pod created successfully  -  OK
  GVS Backup command executed successfully  -  OK
  Pod deleted successfully  -  OK

I looked at the executor logs and found many duplicate records:

{"File":"kasten.io/k10/kio/exec/internal/runner/runner.go","Function":"kasten.io/k10/kio/exec/internal/runner.(*Runner).execPhases","Job":{"completeTime":"0001-01-01T00:00:00.000Z","count":1,"creationTime":"2022-10-08T20:23:28.776Z","deadline":"0001-01-01T00:00:00.000Z","errors":null,"id":"11f1200e-4747-11ed-91ee-7e67ff22cc9a","manifest":"11f1f48d-4747-11ed-bc09-06547944e6eb","originatingPolicies":[{"id":"b4302261-e73b-407f-9893-22a228512a76"}],"phases":[{"name":"fanout","progress":100,"scratch":null,"status":"succeeded","weight":1},{"name":"queuingAndWaitingOnChildren","scratch":null,"status":"pending","weight":100}],"scheduledTime":"2022-10-08T20:23:28.760Z","startedTime":"2022-10-08T20:23:31.707Z","status":"running","updatedTime":"2022-10-09T05:53:18.032Z","waitCount":2197,"waitStartTime":"2022-10-08T20:23:31.801Z"},"JobID":"11f1200e-4747-11ed-91ee-7e67ff22cc9a","Line":417,"ManifestID":"11f1f48d-4747-11ed-bc09-06547944e6eb","QueuedJobID":"11f1200e-4747-11ed-91ee-7e67ff22cc9a","RequestID":"74cdfe53-4796-11ed-b322-b6feef83f305","SubjectRef":"kasten-io:mysql-backup","cluster_name":"1f863e67-29fe-4090-a0f0-544851cb0e1f","hostname":"executor-svc-79d874d9d6-cqnwd","level":"info","msg":"Skipping Phase: fanout","time":"20221009-05:53:18.042Z","version":"5.0.10"}

It stays in this state the whole time after the task starts and the snapshot is created, and it is not possible to interrupt the task.


@KorP Thank you for posting this question.

I understand that you are using a NetApp PVC with the Trident CSI driver (`csi.trident.netapp.io`) to snapshot the volumes.

Basically, in the CSI snapshot workflow, K10 creates a VolumeSnapshot and waits for it to become `readyToUse`.

If, for some reason outside of K10, the VolumeSnapshot does not become ready, the problem lies with your CSI driver as it tries to sync the snapshot with the backend (meaning it is unable to create the snapshot on the NetApp array).

I can confirm that this is the case from your volumesnapshot/volumesnapshotcontent output: the `readyToUse` field does not say `true`.
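As a starting point, you can watch that field directly. A minimal sketch (the namespace and snapshot name are the ones from this thread; substitute your own):

```shell
#!/bin/sh
# Sketch: check whether a VolumeSnapshot has become ready, and print any
# error the CSI snapshotter has recorded on it. Requires kubectl access
# to the cluster.
snapshot_status() {
  ns="$1"; vs="$2"
  # .status.readyToUse should eventually flip to "true"
  kubectl -n "$ns" get volumesnapshot "$vs" \
    -o jsonpath='readyToUse: {.status.readyToUse}{"\n"}'
  # .status.error.message is populated when the driver fails to cut the snapshot
  kubectl -n "$ns" get volumesnapshot "$vs" \
    -o jsonpath='error: {.status.error.message}{"\n"}'
}

# Example (names from this thread):
# snapshot_status mysql k10-csi-snap-wj6bljvrhrvzblpm
```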

 

To troubleshoot the issue further, you will need to understand the workflow of CSI snapshots. This documentation will help you understand it better.

 

To point you in the right direction: look through the logs of the `csi-snapshotter` container in the Trident provisioner pod. They will tell you why the snapshot is not ready.
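For example, something along these lines (the `trident` namespace and the label selector below are assumptions; list the pods first and adjust to match your deployment):

```shell
#!/bin/sh
# Sketch: find the Trident controller pod and tail its csi-snapshotter
# sidecar logs. The namespace and label are assumptions -- verify them
# with the first command and adjust.
snapshotter_logs() {
  ns="${1:-trident}"
  # Spot the controller pod that carries the CSI sidecar containers
  kubectl -n "$ns" get pods
  # Errors syncing the snapshot with the backend show up in this sidecar
  kubectl -n "$ns" logs -l app=controller.csi.trident.netapp.io \
    -c csi-snapshotter --tail=100
}
```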
 


@jaiganeshjk Thank you very much for the help!
Indeed, I didn't think of looking into the csi-snapshotter logs (I don't yet have enough knowledge to fully understand the picture). It complained:

the server could not find the requested resource (get volumesnapshotcontents.snapshot.storage.k8s.io)

Although I know for sure that I installed them, because I deployed the cluster using Ansible and they are in the playbook. Nevertheless, after I applied the snapshot CRDs again, the snapshot state immediately changed to READYTOUSE = true. I ran the task and it completed successfully. Now I'll figure out what went wrong with installing the snapshot CRDs.


the server could not find the requested resource (get volumesnapshotcontents.snapshot.storage.k8s.io)

This error is usually due to the apiVersion of the snapshot CRDs: for example, if you had v1beta1 CRDs in the cluster while your driver supported only v1, or vice versa.
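You can check which versions of the snapshot CRDs the cluster actually serves. A minimal sketch:

```shell
#!/bin/sh
# Sketch: print the served versions of the external-snapshotter CRDs.
# If the driver's sidecars expect v1 but only v1beta1 is served (or
# vice versa), you get "the server could not find the requested resource".
snapshot_crd_versions() {
  for kind in volumesnapshots volumesnapshotcontents volumesnapshotclasses; do
    printf '%s: ' "$kind"
    kubectl get crd "$kind.snapshot.storage.k8s.io" \
      -o jsonpath='{range .spec.versions[?(@.served==true)]}{.name}{" "}{end}{"\n"}'
  done
}
```

Reapplying the CRD manifests from the external-snapshotter release that matches the driver, as you did, brings them back in sync.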

