I am experimenting with K10 for cross-region disaster recovery with AWS EKS and EBS. To that end I have set up an EKS (1.21.9) cluster in us-east-1 running K10 4.5.13 with a scheduled backup policy that stores metadata in S3, excludes pods, and exports volume snapshots to us-west-1. I have set up a similar cluster in us-west-1 with an on-demand import policy that restores after import, plus a transform applied to StatefulSets to inject an environment variable.
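For reference, this is roughly how I inspect the setup on each cluster (the kubeconfig context names and policy names below are just placeholders for mine):

    # Source cluster (us-east-1): scheduled backup/export policy and the S3 location profile
    kubectl --context us-east-1 get policies.config.kio.kasten.io -n kasten-io
    kubectl --context us-east-1 get profiles.config.kio.kasten.io -n kasten-io
    kubectl --context us-east-1 get policies.config.kio.kasten.io <backup-policy> -n kasten-io -o yaml

    # Destination cluster (us-west-1): on-demand import policy with "restore after import"
    # and the StatefulSet transform, both configured through the dashboard
    kubectl --context us-west-1 get policies.config.kio.kasten.io -n kasten-io
    kubectl --context us-west-1 get policies.config.kio.kasten.io <import-policy> -n kasten-io -o yaml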
I have six StatefulSets. If all of them are scaled up (1/1) in us-east-1 at the time of the backup, the restore into us-west-1 works perfectly: all six come up in us-west-1 with their data intact. For example, an original persistent volume labelled topology.kubernetes.io/region=us-east-1 and topology.kubernetes.io/zone=us-east-1a is recreated as a PV labelled failure-domain.beta.kubernetes.io/region=us-west-1 and failure-domain.beta.kubernetes.io/zone=us-west-1b. (It is annoying that the transformation uses the deprecated label names, but fine.)
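To make the label difference concrete, this is roughly the check I do on each side (the PV names are placeholders):

    # Source cluster (us-east-1): the original PV carries the current topology labels
    kubectl get pv <original-pv> --show-labels
    # ... topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1a

    # Destination cluster (us-west-1): the restored PV carries the deprecated labels
    kubectl get pv <restored-pv> --show-labels
    # ... failure-domain.beta.kubernetes.io/region=us-west-1,failure-domain.beta.kubernetes.io/zone=us-west-1b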
If three are scaled up (1/1) and three are scaled to zero (0/0) at the time of the backup, the K10 dashboard in us-west-1 claims the import from the restore point was successful, listing the six snapshot artifacts as expected. However, kubectl get pv and kubectl get pvc show that only four volumes were actually recreated: the three from the scaled-up StatefulSets, plus one from one of the scaled-down StatefulSets. There is no sign of the other two volumes or volume claims. If I kubectl scale --replicas=1 statefulset one-of-those-two, it starts up with a fresh persistent volume claim and an empty disk, so the backed-up data is effectively lost.
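Roughly what I see and do on the us-west-1 cluster in that case (namespace and StatefulSet names are placeholders):

    # After the "successful" import + restore: only four of the six expected PVCs/PVs exist
    kubectl get pvc -n <app-namespace>
    kubectl get pv

    # Scale up one of the two StatefulSets whose volume was not recreated
    kubectl scale statefulset <missing-sts> -n <app-namespace> --replicas=1

    # The pod starts, but against a brand-new, empty PVC rather than the imported data
    kubectl get pvc -n <app-namespace>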
If all are scaled down at the time of backup, the dashboard claims the import was successful, with six snapshot artifacts, but just one volume is recreated, with five missing. Again there is no apparent error message or explanation.
If all but one are scaled down, the import claims there are six snapshot artifacts, but the restore shows only two volume artifacts. Correspondingly, there are just two persistent volume claims in the application namespace and two persistent volumes not associated with K10 itself: one for the scaled-up StatefulSet and one for one of the scaled-down ones (seemingly arbitrary which). The JSON from the restore phase shows just those two artifacts. The JSON from the import phase, on the other hand, shows all six artifacts and contains the IDs of the EBS snapshots (the copies in us-west-1). I checked those with aws ec2 describe-snapshots --snapshot-ids and got equally valid output for a volume from a scaled-up StatefulSet (which was successfully restored) and for one from a scaled-down StatefulSet (which was not), so there is nothing obviously amiss at the AWS level.
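The snapshot check looks roughly like this (the snapshot IDs below are placeholders for the values I copied out of the import-phase JSON):

    # Each ID is the us-west-1 copy of an EBS snapshot listed in the import artifacts
    for snap in snap-0aaaaaaaaaaaaaaaa snap-0bbbbbbbbbbbbbbbb; do
      aws ec2 describe-snapshots --region us-west-1 --snapshot-ids "$snap" \
        --query 'Snapshots[0].[SnapshotId,State,Progress,VolumeSize]' --output text
    done
    # Output looks equally healthy ("completed", 100%) for restored and non-restored volumes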
Is there something about an EBS volume being currently unmounted that could confuse the import logic and cause it to silently skip subsequent unmounted volumes?