Question

EBS persistent volumes not imported cross-region when stateful set scaled to zero (except once)


I am experimenting with K10 for cross-region disaster recovery in AWS/EBS/EKS. To that end I have set up an EKS (1.21.9) cluster in us-east-1 running K10 4.5.13 with a scheduled backup policy that excludes pods, stores metadata in S3, and exports volume snapshots to us-west-1. I set up a similar cluster in us-west-1 with an on-demand import policy that restores after import, plus a transform applied to stateful sets that injects an environment variable.
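
For reference, the relevant K10 objects can be listed on each cluster with something like the following (the kubectl context names are placeholders for my two clusters, and this assumes K10 is installed in its usual kasten-io namespace):

# backup/export policy and location profiles on the source cluster
kubectl --context us-east-1 -n kasten-io get policies.config.kio.kasten.io
kubectl --context us-east-1 -n kasten-io get profiles.config.kio.kasten.io
# import/restore policy on the destination cluster
kubectl --context us-west-1 -n kasten-io get policies.config.kio.kasten.io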


I have six stateful sets. If all of them are scaled up (1/1) in us-east-1 at the time of the backup, the restore into us-west-1 works perfectly: all six come up in us-west-1 with data intact. For example, an original persistent volume labelled topology.kubernetes.io/region=us-east-1 & topology.kubernetes.io/zone=us-east-1a is recreated as a PV with labels failure-domain.beta.kubernetes.io/region=us-west-1 & failure-domain.beta.kubernetes.io/zone=us-west-1b. (Annoying that the transformation uses the deprecated label names, but fine.)
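
The relabelling is easy to confirm by comparing the labels on the two PV objects, along these lines (the PV names here are placeholders for the actual generated names):

# source volume in us-east-1 vs. the volume recreated by the restore in us-west-1
kubectl --context us-east-1 get pv pvc-aaaa-1111 -o jsonpath='{.metadata.labels}'
kubectl --context us-west-1 get pv pvc-bbbb-2222 -o jsonpath='{.metadata.labels}'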


If three are scaled up at the time of the backup and three are scaled to zero (0/0), the K10 dashboard in us-west-1 claims the import from the restore point was successful, listing the six snapshot artifacts as expected. In fact, kubectl get pv and kubectl get pvc show that only four volumes were recreated: the three from the scaled-up stateful sets, and one from one of the scaled-down stateful sets. There is no sign of the other two volumes or volume claims. If I kubectl scale statefulset one-of-those-two --replicas=1, it starts up with a fresh persistent volume claim and an empty disk, so the backed-up data is effectively lost.
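
This is roughly how the data loss shows up (namespace and names are placeholders):

# only four of the six expected claims exist after the import/restore
kubectl --context us-west-1 -n myapp get pvc
# scaling up a stateful set whose claim was not recreated provisions a brand-new,
# empty volume from the volumeClaimTemplate instead of binding to restored data
kubectl --context us-west-1 -n myapp scale statefulset one-of-those-two --replicas=1
kubectl --context us-west-1 -n myapp get pvc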


If all are scaled down at the time of backup, the dashboard claims the import was successful, with six snapshot artifacts, but just one volume is recreated, with five missing. Again there is no apparent error message or explanation.


If all but one are scaled down, the import claims there are six snapshot artifacts, but the restore shows only two volume artifacts. Again there are just two persistent volume claims in the application namespace and two persistent volumes not associated with K10 itself: one for the scaled-up stateful set and one for one of the scaled-down ones (seemingly arbitrary which). The JSON from the restore phase shows just those two artifacts. The JSON from the import phase shows all six artifacts, including the ids of the EBS snapshot copies in us-west-1. I checked those with aws ec2 describe-snapshots --snapshot-ids and got similarly valid output whether the snapshot belonged to a scaled-up stateful set (which was successfully restored) or to a scaled-down one (which was not), so there is nothing obviously amiss at the AWS level.
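
The snapshot checks looked roughly like this (the ids are placeholders standing in for the real ids pulled out of the import-phase JSON):

# both return healthy, completed snapshots in us-west-1, regardless of whether the
# corresponding stateful set was scaled up or scaled to zero at backup time
aws ec2 describe-snapshots --region us-west-1 --snapshot-ids snap-0aaaaaaaaaaaaaaaa
aws ec2 describe-snapshots --region us-west-1 --snapshot-ids snap-0bbbbbbbbbbbbbbbb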


Is there something about an EBS volume being currently unmounted that could confuse the import logic and cause it to silently skip subsequent unmounted volumes?


2 comments

Another problem I find with dynamic scaling: the backup fails if a stateful set is in the middle of scaling up:

cause:
  cause:
    message: "Specified 1 replicas and only 0 are ready: could not get StatefulSet{Namespace: …, Name: …}: context deadline exceeded"
  fields: …
  file: kasten.io/k10/kio/exec/phases/phase/snapshot.go:410
  function: kasten.io/k10/kio/exec/phases/phase.WaitOnWorkloadReady
  linenumber: 410
  message: Statefulset not in ready state. Retry the operation once Statefulset is ready
message: Job failed to be executed
fields: []

I would expect the backup not to care about the number of ready replicas at all. The stateful set spec can be backed up as a resource, and the EBS volume can be snapshotted; whatever is happening with the pods themselves should be irrelevant.
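
A workaround sketch that avoids tripping this check: wait for every stateful set in the namespace to finish rolling out before triggering the policy run (namespace is a placeholder):

# block until each stateful set reports its replicas ready, then start the backup
for sts in $(kubectl -n myapp get statefulsets -o name); do
  kubectl -n myapp rollout status "$sts" --timeout=5m
done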

Another error came up when running a backup while a new stateful set was still being created:

cause:
  cause:
    cause:
      fields: …
      file: kasten.io/k10/kio/kube/volume.go:586
      function: kasten.io/k10/kio/kube.getPVCInfoHelper
      linenumber: 586
      message: PVC not bound
    file: kasten.io/k10/kio/exec/phases/phase/snapshot.go:201
    function: kasten.io/k10/kio/exec/phases/phase.FetchSnapshotSession
    linenumber: 201
    message: Could not query volume info from PVC
  file: kasten.io/k10/kio/exec/phases/backup/snapshot_data_phase.go:106
  function: kasten.io/k10/kio/exec/phases/backup.(*SnapshotDataPhase).Run
  linenumber: 106
  message: Failed to fetch the snapshot session
message: Job failed to be executed
fields: []
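
Presumably the same kind of pre-check helps here: make sure every claim in the namespace is actually Bound before the run starts (namespace again a placeholder):

# list any PVCs that are not yet Bound, e.g. ones belonging to a stateful set that
# is still being created; the snapshot phase errors out on unbound claims
kubectl -n myapp get pvc --no-headers | awk '$2 != "Bound"'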

