Solved

K10 Kasten - PVCs Snapshot restore fails with context deadline exceeded


Userlevel 2

We are using Kasten K10 v5.5.0 with MinIO as S3 storage. Our application (RabbitMQ) is deployed in an AKS cluster in HA with 3 nodes, and we back up the PVC snapshots successfully. After reinstalling the application to simulate a disaster recovery, we encounter the following error during the restore phase when using the restore point exported to MinIO:

 

Restoring Application Components failure

If instead we use the restore point native to the cluster, the restore succeeds. So the problem only occurs when the restore point is the exported one, outside the cluster.

We don’t think this issue is related to the timeouts in the k10-config ConfigMap, because our restore fails after just a minute and those values are higher.
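For reference, this is roughly how we checked those values (assuming K10 is installed in the default kasten-io namespace):

# Show the timeout-related settings in the k10-config ConfigMap
kubectl -n kasten-io get configmap k10-config -o yaml | grep -i timeout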

 

Do you have any idea about this error?

Thank you in advance for your support and let me know if you need other details.


Best answer by jaiganeshjk 17 May 2023, 13:43


9 comments

Userlevel 7
Badge +20

@jaiganeshjk 

Userlevel 7
Badge +22

Did you say you are re-installing the application? So you have a vanilla application running, and then restore the PVCs? I am guessing here to be honest, but according to the log it is timing out on deleting PVCs (so perhaps trying to overwrite the reinstalled, empty ones). Also, since the data is coming from S3, I think this would be Kanister doing its magic, and that timeout could be due to some permission restrictions in AKS (again, a wild guess). Could you attach the full logs?

Userlevel 2

Hi @Geoff Burke,
sorry, I wasn't very clear about the reinstall. We mean that the release and its PVCs are deleted, and then the application is reinstalled from scratch. After that, we perform the restore in order to simulate a disaster recovery.

Yes, sure. Below is the part of the logs containing the mentioned error (we cannot upload the entire logs because of the comment size limit).
Thanks

{"File":"kasten.io/k10/kio/exec/internal/runner/runner.go","Function":"kasten.io/k10/kio/exec/internal/runner.(*Runner).execPhases","Job":{"completeTime":"0001-01-01T00:00:00.000Z","conflictLockString":"rabbit-ns","count":3,"creationTime":"2023-05-03T10:29:50.733Z","deadline":"0001-01-01T00:00:00.000Z","errors":[{"cause":{"cause":{"cause":{"cause":{"cause":{"message":"context deadline exceeded"},"file":"kasten.io/k10/kio/poll/poll.go:96","function":"kasten.io/k10/kio/poll.waitWithBackoffWithRetriesHelper","linenumber":96,"message":"Context done while polling"},"fields":[{"name":"duration","value":"1m0.000784811s"}],"file":"kasten.io/k10/kio/poll/poll.go:66","function":"kasten.io/k10/kio/poll.waitWithBackoffWithRetries","linenumber":66,"message":"Timeout while polling"},"fields":[{"name":"pvcName","value":"data-rabbitmq-0"}],"file":"kasten.io/k10/kio/exec/phases/phase/restore.go:406","function":"kasten.io/k10/kio/exec/phases/phase.deletePVCsFromNames","linenumber":406,"message":"Failed to delete PVC"},"file":"kasten.io/k10/kio/exec/phases/phase/restore_app_prepare.go:94","function":"kasten.io/k10/kio/exec/phases/phase.(*restoreAppPreparePhase).Run","linenumber":94,"message":"Failed to delete PVCs from specs"},"fields":[],"message":"Job failed to be executed"},{"cause":{"cause":{"cause":{"cause":{"cause":{"message":"context deadline exceeded"},"file":"kasten.io/k10/kio/poll/poll.go:96","function":"kasten.io/k10/kio/poll.waitWithBackoffWithRetriesHelper","linenumber":96,"message":"Context done while polling"},"fields":[{"name":"duration","value":"1m0.000557942s"}],"file":"kasten.io/k10/kio/poll/poll.go:66","function":"kasten.io/k10/kio/poll.waitWithBackoffWithRetries","linenumber":66,"message":"Timeout while polling"},"fields":[{"name":"pvcName","value":"data-rabbitmq-0"}],"file":"kasten.io/k10/kio/exec/phases/phase/restore.go:406","function":"kasten.io/k10/kio/exec/phases/phase.deletePVCsFromNames","linenumber":406,"message":"Failed to delete PVC"},"file":"kasten.io/k10/kio/exec/phases/phase/restore_app_prepare.go:94","function":"kasten.io/k10/kio/exec/phases/phase.(*restoreAppPreparePhase).Run","linenumber":94,"message":"Failed to delete PVCs from specs"},"fields":[],"message":"Job failed to be 
executed"}],"id":"6f743426-e99d-11ed-8bad-a6920edc1e18","manifest":"6f725ff1-e99d-11ed-af63-82f61f06165a","phases":[{"name":"createNamespace","progress":100,"scratch":null,"status":"succeeded","weight":1},{"name":"takeRestorePointHold","progress":100,"scratch":null,"status":"succeeded","weight":1},{"name":"lockKubernetesNamespace","progress":100,"scratch":null,"status":"succeeded","weight":1},{"name":"restorePreHook","progress":100,"scratch":null,"status":"succeeded","weight":1},{"name":"restorePrecheck","progress":100,"scratch":[{"key":"replicas","value":"{}"}],"status":"succeeded","weight":1},{"name":"RestoreSpecsForNamespace","progress":100,"scratch":null,"status":"succeeded","weight":1},{"name":"RestoreSpecsForResources","progress":100,"scratch":null,"status":"succeeded","weight":1},{"name":"restoreVolumes","progress":100,"scratch":null,"status":"succeeded","weight":1},{"name":"restoreAppPrepare","progress":100,"scratch":null,"status":"failed","weight":1},{"name":"restoreApplication","scratch":null,"status":"pending","weight":1},{"name":"RestoreSpecsForCRs","scratch":null,"status":"pending","weight":1},{"name":"restoreSanitize","scratch":null,"status":"pending","weight":1},{"name":"restoreKanister","scratch":null,"status":"pending","weight":1},{"name":"restorePostHook","progress":100,"scratch":null,"status":"succeeded","weight":1},{"name":"dropRestorePointHold","progress":100,"scratch":null,"status":"succeeded","weight":1}],"scheduledTime":"2023-05-03T10:29:51.000Z","startedTime":"2023-05-03T10:29:56.321Z","status":"running","updatedTime":"2023-05-03T10:33:42.609Z","waitStartTime":"0001-01-01T00:00:00.000Z"},"JobID":"6f743426-e99d-11ed-8bad-a6920edc1e18","Line":420,"ManifestID":"6f725ff1-e99d-11ed-af63-82f61f06165a","QueuedJobID":"6f743426-e99d-11ed-8bad-a6920edc1e18","RequestID":"4707b280-e98e-11ed-8376-6649e5777b97","cluster_name":"e8244544-777b-4bab-8e60-0e3545406fb4","hostname":"executor-svc-65dcb49d85-8nztz","level":"info","msg":"Skipping Phase: restoreSanitize","time":"20230503-10:33:42.616Z","version":"5.5.0"}
{"File":"kasten.io/k10/kio/exec/internal/runner/runner.go","Function":"kasten.io/k10/kio/exec/internal/runner.(*Runner).maybeExecJob","JobID":"6f743426-e99d-11ed-8bad-a6920edc1e18","Line":229,"ManifestID":"6f725ff1-e99d-11ed-af63-82f61f06165a","QueuedJobID":"6f743426-e99d-11ed-8bad-a6920edc1e18","RequestID":"4707b280-e98e-11ed-8376-6649e5777b97","cluster_name":"e8244544-777b-4bab-8e60-0e3545406fb4","error":{"message":"Failed to delete PVCs from specs","function":"kasten.io/k10/kio/exec/phases/phase.(*restoreAppPreparePhase).Run","linenumber":94,"file":"kasten.io/k10/kio/exec/phases/phase/restore_app_prepare.go:94","cause":{"message":"Failed to delete PVC","function":"kasten.io/k10/kio/exec/phases/phase.deletePVCsFromNames","linenumber":406,"file":"kasten.io/k10/kio/exec/phases/phase/restore.go:406","fields":[{"name":"pvcName","value":"data-rabbitmq-0"}],"cause":{"message":"Timeout while polling","function":"kasten.io/k10/kio/poll.waitWithBackoffWithRetries","linenumber":66,"file":"kasten.io/k10/kio/poll/poll.go:66","fields":[{"name":"duration","value":"1m0.000242748s"}],"cause":{"message":"client rate limiter Wait returned an error: context deadline exceeded"}}}},"hostname":"executor-svc-65dcb49d85-8nztz","level":"error","msg":"Job failed","time":"20230503-10:33:42.659Z","version":"5.5.0"}

 

Userlevel 6
Badge +2

@giorgionanfa Thanks for your explanation.
Generally, during the restore workflow, K10 deletes the PVC, recreates it based on the backup, and restores the data into it from the external target (MinIO in your case).

From the log messages that you have shared, it appears that K10 is failing to delete the PVC.

 

Since you mentioned that the release and its PVCs were deleted, did you validate whether the PVC actually got removed from the cluster or was it stuck in Terminating state?

If it is stuck in Terminating state, K10 would also fail to delete the PVC.

However, if the namespace was clean, we wouldn’t attempt to delete the PVC as it wouldn’t be there. 

If possible, can you retry the restore to the same namespace after deleting the entire namespace (and ensuring that the resources are removed)?
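For example, something along these lines should show whether the PVC is really gone or stuck (using the rabbit-ns namespace and PVC name from your logs):

# List the PVCs in the application namespace
kubectl -n rabbit-ns get pvc
# A stuck PVC will show a deletionTimestamp and the pvc-protection finalizer;
# a NotFound error means it is really gone
kubectl -n rabbit-ns get pvc data-rabbitmq-0 -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'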

Userlevel 2

@jaiganeshjk Thanks for your reply.

Sure, we validated that the PVCs were deleted and not in a Terminating state before reinstalling the application.
We also tried to delete the entire namespace (instead of just the release and its PVCs), reinstall the application and finally restore, but the error was the same because K10 fails to delete the PVCs.
If we restore the PVC snapshots into an empty namespace, the restore works properly because there are no PVCs to delete.

 

I would like to add a detail to yesterday's comment. We found this log entry after the ones we already shared:

 

{"File":"kasten.io/k10/kio/exec/phases/phase/restore.go","Function":"kasten.io/k10/kio/exec/phases/phase.getPVCNameSet","Line":392,"cluster_name":"e8244544-777b-4bab-8e60-0e3545406fb4","hostname":"executor-svc-65dcb49d85-8nztz","level":"info","msg":"Duplicate PVC name in artifacts","pvcName":"data-rabbitmq-0","time":"20230503-09:06:18.952Z","version":"5.5.0"}

 

It would be great to be able to share the entire logs, but I don't know if I can in my comment.

Thanks

Userlevel 2

UPDATE

Moreover, we observed that if we scale our StatefulSet to 0 replicas, the restore works properly and the PVCs are deleted and recreated. So it seems that K10 fails to delete the PVCs because the pods still exist and the PVCs' metadata finalizers are set to kubernetes.io/pvc-protection.

Starting from these considerations, we implemented a custom blueprint with a restorePrehook that scales the application replicas to 0 and a restorePosthook that scales them back to 3 for HA (see the sketch below). Using this workaround, the PVCs are deleted and then recreated without any problem.
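A minimal sketch of the blueprint we use, relying on Kanister's ScaleWorkload function (the blueprint, action and StatefulSet names here are illustrative; rabbit-ns is the application namespace from our logs):

apiVersion: cr.kanister.io/v1alpha1
kind: Blueprint
metadata:
  name: rabbitmq-scale-hooks     # illustrative name
  namespace: kasten-io           # K10 install namespace (default assumed)
actions:
  restorePrehook:
    phases:
    - func: ScaleWorkload
      name: scaleDownRabbitmq
      args:
        namespace: rabbit-ns     # application namespace
        name: rabbitmq           # StatefulSet name assumed from the PVC data-rabbitmq-0
        kind: statefulset
        replicas: 0
  restorePosthook:
    phases:
    - func: ScaleWorkload
      name: scaleUpRabbitmq
      args:
        namespace: rabbit-ns
        name: rabbitmq
        kind: statefulset
        replicas: 3              # back to HA

The restore then references this blueprint and these two action names as its pre- and post-hooks.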

 

We would still like to fully understand the difference between this MinIO case (where we need this manual workaround) and the case where the restore point is native to the cluster (where we don't need it, because the pods are "automatically" deleted before the PVCs are deleted and recreated, and everything works properly).

Thank you for the support

Userlevel 6
Badge +2

@giorgionanfa Thanks for the updates.

From what I understand, we scale down the workloads during the restore of exported restore points too (the same as we do for local restore points). However, your tests with the restorePrehook and restorePosthook show that in your case it doesn't happen.

I will have to test this locally, try to reproduce it and file a bug.

Userlevel 2

Thanks @jaiganeshjk, let me know if you need other details about our tests.

Userlevel 6
Badge +2

@giorgionanfa Thank you for your time over the call to understand about the issue.

Explaining the issue and resolution below for the sake of the community.

The expected behaviour of K10 during a restore is to delete the PVC before restoring it.

Before we delete the PVC, the workloads are scaled down so that the deletion of the PVCs happens gracefully. This happens only when the workloads (StatefulSets/Deployments/DeploymentConfigs) are part of the backup as well as the restore.

 

However, we found that Giorgio was including only the PVCs in the policy (which effectively excludes all the other resources), and this was causing K10 not to scale down the workload before attempting to delete the PVC. This caused the PVC to stay in Terminating state and the restore eventually failed.

Once Giorgio added the StatefulSets to the policy's include filter, the restore started to complete without any issues.
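For anyone hitting the same behaviour, the include filter should end up looking roughly like this (a fragment of the K10 Policy spec, not the exact policy used here):

spec:
  filters:
    includeResources:
    - resource: persistentvolumeclaims    # previously the only included resource
    - group: apps
      version: v1
      resource: statefulsets              # lets K10 scale the workload down before deleting the PVCs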
