I am experimenting with K10 for cross-region disaster recovery with AWS EKS and EBS. To that end I have set up an EKS (1.21.9) cluster in us-east-1 running K10 4.5.13 with a scheduled backup policy that stores metadata in S3, excludes pods, and exports volume snapshots to us-west-1. I have set up a similar cluster in us-west-1 with an on-demand import policy that restores after import, plus a transform applied to StatefulSets to inject an environment variable.
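For reference, this is roughly how I inspect the setup on each cluster (the kubeconfig context names and policy names below are just placeholders for mine):

    # Source cluster (us-east-1): scheduled backup/export policy and the S3 location profile
    kubectl --context us-east-1 get policies.config.kio.kasten.io -n kasten-io
    kubectl --context us-east-1 get profiles.config.kio.kasten.io -n kasten-io
    kubectl --context us-east-1 get policies.config.kio.kasten.io <backup-policy> -n kasten-io -o yaml

    # Destination cluster (us-west-1): on-demand import policy with "restore after import"
    # and the StatefulSet transform, both configured through the dashboard
    kubectl --context us-west-1 get policies.config.kio.kasten.io -n kasten-io
    kubectl --context us-west-1 get policies.config.kio.kasten.io <import-policy> -n kasten-io -o yaml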
I have six StatefulSets. If all of them are scaled up (1/1) in us-east-1 at the time of the backup, the restore into us-west-1 works perfectly: all six come up in us-west-1 with their data intact. For example, an original persistent volume labelled topology.kubernetes.io/region=us-east-1 and topology.kubernetes.io/zone=us-east-1a is recreated as a PV labelled failure-domain.beta.kubernetes.io/region=us-west-1 and failure-domain.beta.kubernetes.io/zone=us-west-1b. (It is annoying that the transformation uses the deprecated label names, but fine.)
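To make the label difference concrete, this is roughly the check I do on each side (the PV names are placeholders):

    # Source cluster (us-east-1): the original PV carries the current topology labels
    kubectl get pv <original-pv> --show-labels
    # ... topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1a

    # Destination cluster (us-west-1): the restored PV carries the deprecated labels
    kubectl get pv <restored-pv> --show-labels
    # ... failure-domain.beta.kubernetes.io/region=us-west-1,failure-domain.beta.kubernetes.io/zone=us-west-1b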
If three are scaled up (1/1) and three are scaled to zero (0/0) at the time of the backup, the K10 dashboard in us-west-1 claims the import from the restore point was successful, listing the six snapshot artifacts as expected. However, kubectl get pv and kubectl get pvc show that only four volumes were actually recreated: the three from the scaled-up StatefulSets, plus one from one of the scaled-down StatefulSets. There is no sign of the other two volumes or volume claims. If I kubectl scale --replicas=1 statefulset one-of-those-two, it starts up with a fresh persistent volume claim and an empty disk, so the backed-up data is effectively lost.
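Roughly what I see and do on the us-west-1 cluster in that case (namespace and StatefulSet names are placeholders):

    # After the "successful" import + restore: only four of the six expected PVCs/PVs exist
    kubectl get pvc -n <app-namespace>
    kubectl get pv

    # Scale up one of the two StatefulSets whose volume was not recreated
    kubectl scale statefulset <missing-sts> -n <app-namespace> --replicas=1

    # The pod starts, but against a brand-new, empty PVC rather than the imported data
    kubectl get pvc -n <app-namespace>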
If all are scaled down at the time of backup, the dashboard claims the import was successful, with six snapshot artifacts, but just one volume is recreated, with five missing. Again there is no apparent error message or explanation.
If all but one are scaled down, the import claims there are six snapshot artifacts, but the restore shows only two volume artifacts. Correspondingly, there are just two persistent volume claims in the application namespace and two persistent volumes not associated with K10 itself: one for the scaled-up StatefulSet and one for one of the scaled-down ones (seemingly arbitrary which). The JSON from the restore phase shows just those two artifacts. The JSON from the import phase, on the other hand, shows all six artifacts and contains the IDs of the EBS snapshots (the copies in us-west-1). I checked those with aws ec2 describe-snapshots --snapshot-ids and got equally valid output for a volume from a scaled-up StatefulSet (which was successfully restored) and for one from a scaled-down StatefulSet (which was not), so there is nothing obviously amiss at the AWS level.
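The snapshot check looks roughly like this (the snapshot IDs below are placeholders for the values I copied out of the import-phase JSON):

    # Each ID is the us-west-1 copy of an EBS snapshot listed in the import artifacts
    for snap in snap-0aaaaaaaaaaaaaaaa snap-0bbbbbbbbbbbbbbbb; do
      aws ec2 describe-snapshots --region us-west-1 --snapshot-ids "$snap" \
        --query 'Snapshots[0].[SnapshotId,State,Progress,VolumeSize]' --output text
    done
    # Output looks equally healthy ("completed", 100%) for restored and non-restored volumes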
Is there something about an EBS volume being currently unmounted that could confuse the import logic and cause it to silently skip subsequent unmounted volumes?