
I have deployed a sample EFS app in an AWS EKS cluster as described in the AWS documentation. I’m using dynamic provisioning & the sample app is deployed to a dedicated namespace efs-app. The pod is running as expected & writing data to the EFS file system.
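For context, the dynamic provisioning setup follows the AWS sample manifests: a StorageClass backed by the EFS CSI driver plus a PVC in the efs-app namespace. A minimal sketch (the file system ID and storage request below are placeholders, not my actual values):

```sh
# Minimal sketch of the dynamic-provisioning pieces from the AWS sample.
# fileSystemId and the storage request are placeholders.
cat <<'EOF' | kubectl apply -f -
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap      # driver creates an EFS access point per PV
  fileSystemId: fs-0123456789abcdef0
  directoryPerms: "700"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
  namespace: efs-app
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi              # EFS is elastic; a size is required but not enforced
EOF
```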

I created a manual snapshot of this app from the Kasten dashboard & exported it to S3. As described in the Kasten documentation, Kasten used the Shareable Volume Backup and Restore mechanism to back up the contents of the EFS file system. This was verified by observing the pods in the efs-app namespace as Kasten spun up a new pod, mounted the EFS volume & backed up its data.
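The easiest way to see this is to watch the namespace while the backup runs; the temporary Kasten pod shows up next to the app pod for the duration of the export. Something like:

```sh
# Watch pods in the app namespace during the Kasten backup/export;
# a short-lived Kasten pod appears, mounts the EFS PVC & copies its data.
kubectl get pods -n efs-app -w
```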

In another EKS cluster in another AWS region & another AWS account, I created an on-demand Kasten import profile & ran it to successfully import restore points for efs-app from S3. However, when I try restoring the app from the restore point, the restore fails with the error:

Job failed to be executed > Failure in planned phase > Failed to restore workloads > Error waiting for workload to be ready > Pod not in ready state

The Kubernetes events for the namespace show a warning for Kasten’s affinity-pod-0: https://pastebin.com/0p6NW6Xj

I have confirmed that the EFS ID fs-0c3c6fa2945ce52f7 & EFS access point ID fsap-0df2e5c48488629d8 from the above error are valid in this AWS region & account. The access point was auto-created by the EFS CSI driver & is in the active state.
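The check was done with the AWS CLI in the target account & region, roughly:

```sh
# Confirm the file system and the CSI-created access point referenced in the
# error actually exist (and check their state) in this account/region.
aws efs describe-file-systems --file-system-id fs-0c3c6fa2945ce52f7
aws efs describe-access-points --access-point-id fsap-0df2e5c48488629d8
```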

Note that this Kasten restore was only attempted after all prerequisites were met, such as manually creating the EFS file system in the AWS console, creating a mount target for it with a security group providing the cluster access to it, etc. (a CLI sketch of these steps is below). Moreover, if I manually deploy a fresh copy of the sample app in a different namespace in this cluster, it works as expected.
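For completeness, the prerequisites amount to something like the following (I actually did this in the AWS console; the IDs below are placeholders):

```sh
# Create the target EFS file system in the restore account/region.
aws efs create-file-system --tags Key=Name,Value=efs-for-restore

# Create a mount target in each worker subnet, using a security group
# that lets the cluster reach the file system over NFS.
aws efs create-mount-target \
  --file-system-id fs-0123456789abcdef0 \
  --subnet-id subnet-0123456789abcdef0 \
  --security-groups sg-0123456789abcdef0
```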

According to this GitHub issue, this is a limitation of the Bottlerocket OS that we’re using for our EKS clusters. Using Amazon Linux 2 for EKS worker nodes should fix this issue.

For the record, here are a few steps we tried before we knew about the Bottlerocket issue. These, of course, did not work:

  1. Add an ingress rule to the cluster’s security groups to allow all traffic from everywhere. All that is really needed here is to allow NFS traffic on TCP 2049 from the EKS workers’ subnet CIDRs (see the CLI sketch after this list).
  2. Attach the AdministratorAccess IAM policy to:
    1. The Kasten IAM user whose access keys were used to create infrastructure & location profiles in K10.
    2. EKS worker node instance profiles (IAM roles).
    3. The IAM role used by the EFS CSI driver’s service account.
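For item 1, the narrower rule that is actually needed can be expressed as (group ID and CIDR are placeholders):

```sh
# Allow NFS (TCP 2049) from the workers' subnet CIDR into the security
# group attached to the EFS mount targets.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 2049 \
  --cidr 10.0.0.0/16
```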

By the way, the error seen in Kubernetes events can also be found in the efs-csi-node pod logs.
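Assuming the driver is installed in kube-system with its default labels, those logs can be pulled with something like:

```sh
# Tail the EFS CSI node plugin logs (label & container name assume a
# default aws-efs-csi-driver install).
kubectl logs -n kube-system -l app=efs-csi-node -c efs-plugin --tail=100
```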

