Question

DR restore job aborts with "Failed to scale down Catalog"

  • 27 February 2023


Hello!

We’re facing an issue with the Kasten DR functionality while following the official guide here: https://docs.kasten.io/latest/operating/dr.html#recovering-k10-from-a-disaster

The DR restore job pod logs the following error:

Error: {"message":"Failed to scale down Catalog","function":"kasten.io/k10/kio/tools/restorectl.restoreK10","linenumber":138,"file":"kasten.io/k10/kio/tools/restorectl/restore.go:138","cause":{"message":"Failed waiting for deployment replicas","function":"kasten.io/k10/kio/tools/restorectl/servicescaler.(*deploymentScaler).ScaleAndVerifyWithTimeout","linenumber":80,"file":"kasten.io/k10/kio/tools/restorectl/servicescaler/deployment_scaler.go:80","fields":[{"name":"deployment","value":"catalog-svc"},{"name":"replicas","value":0}],"cause":{"message":"context cancelled","function":"kasten.io/k10/kio/tools/restorectl/servicescaler.waitForDeploymentReplicas","linenumber":61,"file":"kasten.io/k10/kio/tools/restorectl/servicescaler/utils.go:61"}}}

Our setup:

  • EKS cluster version 1.25.6
  • K10 Helm version 5.5.6 using IRSA (IAM role to Service Account)
  • DR activated on an AWS S3 location and a passphrase

In the k10-restore Helm chart, the following values are set:

  • sourceClusterID
  • profile.name points to K10 AWS S3 location name
  • secrets.awsIamRole points to IRSA arn
  • Passphrase is given via provisioned K8s secret

Looking forward to any input on this, as getting it to work is crucial for covering our DR scenario.

 

All the best,

Widura


4 comments


@jaiganeshjk 


Hello @widura, it seems the catalog deployment is already scaled down and the restore pod is trying to scale it down as well, which eventually produces the “Failed to scale down Catalog” error.

Please share the output of the command below:

kubectl get deployments catalog-svc -n kasten-io


Also, please clarify whether you are trying to restore on the same cluster.
If you are reinstalling K10 on the same cluster, it is important to clean up the namespace in which K10 was previously installed before creating the DR passphrase secret.

K10 must also be reinstalled before recovery.

Ahmed Hagag


Hello,

thanks for all the input so far and sorry for the late reply! I was fully booked yesterday.

I tried to reproduce this with a fresh K10 installation with nothing but DR enabled. I deleted K10 completely, including the namespace, stayed on the same cluster to redeploy K10, deployed the DR passphrase as a secret, created a location profile manually using the “Authenticate With AWS IAM Role” option, and deployed the k10restore job as a Helm chart with sourceClusterID, profile.name and secrets.awsIamRole set.

I receive this error message:

{"File":"kasten.io/k10/kio/tools/restorectl/root.go","Function":"kasten.io/k10/kio/tools/restorectl.Execute","Line":24,"cluster_name":"7fe66dd6-231b-48e4-834a-87a128b2d777","error":{"message":"Failed to restore DR snapshot","function":"kasten.io/k10/kio/tools/restorectl.restoreK10","linenumber":160,"file":"kasten.io/k10/kio/tools/restorectl/restore.go:160","cause":{"message":"Failed to restore K10 DR backup","function":"kasten.io/k10/kio/kanister/function.RestoreDataDR","linenumber":135,"file":"kasten.io/k10/kio/kanister/function/kio_restore_data_dr.go:135","cause":{"message":"Failed to generate repository connect command","function":"kasten.io/k10/kio/kopia.ConnectToKopiaRepository","linenumber":536,"file":"kasten.io/k10/kio/kopia/repository.go:536","cause":{"message":"Failed to generate blob store args","function":"kasten.io/k10/kio/kopia.repositoryConnectCommand","linenumber":358,"file":"kasten.io/k10/kio/kopia/repository.go:358","cause":{"message":"Failed to get AWS credentials: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: 337285c9-28e5-4a90-bfc7-daaece35c61f"}}}}},"hostname":"k10-restore-k10restore-bgcft","level":"error","msg":"Failed","time":"20230301-08:40:45.172Z"}
Stream closed EOF for kasten-io/k10-restore-k10restore-bgcft (k10restore)

I noticed that the secret used by the location profile is empty. I tried copying the content from the secret containing the IRSA role ARN into the empty secret, but that had no effect.

I also realized that re-running the k10restore job after this failure leaves the catalog-svc deployment scaled to “0” instead of back to “1”, which produces the issue described in my initial post.

However, I’m now stuck on this authentication issue. I double-checked the Service Account of the k10restore job, and it is correctly annotated with the IRSA role ARN. I’m using the same role for the restore as for creating the DR backup, so that is not the issue either.
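
A check like the one described above could be sketched as follows. This is only an illustration: the helper name show_sa_role_arn is hypothetical, the service account name is a placeholder you supply, and the namespace defaults to kasten-io as in this thread. It prints the value of the standard EKS IRSA annotation (eks.amazonaws.com/role-arn) on a service account.

```shell
# Hypothetical helper: print the IRSA role ARN annotated on a service account.
# $1 = service account name (placeholder), $2 = namespace (defaults to kasten-io).
show_sa_role_arn() {
  sa_name="$1"
  namespace="${2:-kasten-io}"
  # Dots inside the annotation key must be escaped in kubectl JSONPath.
  kubectl get serviceaccount "$sa_name" -n "$namespace" \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
}
```

Usage would be something like: show_sa_role_arn <service_account_name>. Note that a matching annotation alone may not be enough, since the IAM role's trust policy can also restrict which service account name is allowed to assume the role.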

Any input from your end?

 

All the best,

Widura


I have managed to sort it out. I had not referenced the correct service account in the k10restore Helm chart. By default, a new service account is created, which did not match the IRSA policy. Referencing the SA name from the K10 deployment solved the authentication issue. I’ve added these two lines to the list of values in the k10restore Helm chart:

  • serviceAccount.create = false
  • serviceAccount.name = <irsa_service_account_name>
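
Put together with the values mentioned earlier in the thread, the restore invocation might look roughly like the sketch below. The wrapper function run_k10_restore is hypothetical, and the release name k10-restore, the chart reference kasten/k10restore and the kasten-io namespace are assumptions based on the pod name in the log above and a standard Kasten Helm repo setup. All four arguments are placeholders.

```shell
# Hypothetical wrapper around the k10restore Helm install.
# Assumptions: release "k10-restore", chart "kasten/k10restore", namespace "kasten-io".
# $1 = sourceClusterID, $2 = location profile name,
# $3 = IRSA role ARN,   $4 = existing service account name used by K10.
run_k10_restore() {
  source_cluster_id="$1"
  profile_name="$2"
  iam_role_arn="$3"
  irsa_service_account="$4"

  # Reuse the existing IRSA-enabled service account instead of creating a new one.
  helm install k10-restore kasten/k10restore \
    --namespace kasten-io \
    --set sourceClusterID="$source_cluster_id" \
    --set profile.name="$profile_name" \
    --set secrets.awsIamRole="$iam_role_arn" \
    --set serviceAccount.create=false \
    --set serviceAccount.name="$irsa_service_account"
}
```

Usage would be along the lines of: run_k10_restore <source_cluster_id> <profile_name> <irsa_role_arn> <irsa_service_account_name>.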

Thanks again to all for your efforts!

Widura
