Solved

K10 not working with Linkerd service mesh


Userlevel 2

Trying to get K10 working with linkerd service mesh in a test cluster.

Prior to deploying linkerd, k10 was working ok doing csi snapshots (Rook Ceph provider) and mysql logical backups (using the kanister example), with backups going to a minio cluster living in its own namespace.

I have mesh enabled on the k10, minio and vaultwarden namespaces (testing using vaultwarden as I know it quite well and it uses a database). mysql logical backup is temporarily disabled. Unfortunately, k10 export fails:

  • k10 appears to come up fine - no errors - all pods communicate with each other - ui works fine
  • vaultwarden works fine
  • backup csi snapshot works fine
  • backup export to minio fails with error (clip below)
status:
  state: Failed
  startTime: 2023-02-28T09:58:27Z
  endTime: 2023-02-28T09:59:26Z
  restorePoint:
    name: ""
  result:
    name: ""
  error:
    cause: '{"cause":{"cause":{"cause":{"message":"Failed to exec command in pod:
      Internal error occurred: error executing command in container: failed to
      exec in container: failed to start exec
      \"5e49e0fa33a9aa7ca1c6c1ff161305332754f0b7227f21173c51b44c794e8cba\": OCI
      runtime exec failed: exec failed: unable to start container process: exec:
      \"kopia\": executable file not found in $PATH:
      unknown"},"file":"kasten.io/k10/kio/kopia/repository.go:558","function":"kasten.io/k10/kio/kopia.ConnectToKopiaRepository","linenumber":558,"message":"Failed
      to connect to the backup
      repository"},"fields":[{"name":"appNamespace","value":"vaultwarden"}],"file":"kasten.io/k10/kio/exec/phases/phase/export.go:264","function":"kasten.io/k10/kio/exec/phases/phase.prepareKopiaRepoIfExportingData","linenumber":264,"message":"Failed
      to create Kopia repository for data
      export"},"file":"kasten.io/k10/kio/exec/phases/phase/export.go:166","function":"kasten.io/k10/kio/exec/phases/phase.(*exportRestorePointPhase).Run","linenumber":166,"message":"Failed
      to copy artifacts"}'
    message: Job failed to be executed
  progress: 100

Additional information:

  • no authorization restrictions are set in linkerd in these 3 namespaces so it is wide open
  • the K10 UI shows the minio location profile as validated
  • if I switch to sidecar snapshotting, the snapshot gets sent ok to the minio kasten-backup bucket (I can see the data in the Minio ui) so access to minio from the vaultwarden namespace appears to be ok BUT the k10 export still fails
  • while k10 is exporting, I see pods briefly appearing in the k10 namespace presumably to do the export
  • I previously tried the above with istio and k10 failed with communication errors, so linkerd is definitely a step further

In sum, it seems that the k10 ui and the sidecar snapshotter work and can communicate with minio, but the k10 export fails.

Any ideas, even just where to troubleshoot next?

Thanks


Best answer by jaiganeshjk 2 March 2023, 13:52


18 comments

Userlevel 7
Badge +7

@jaiganeshjk 

Userlevel 6
Badge +2

Hi @Garland7362 Thank you for posting the question.

From the error message, it seems that the kopia binary is not available in the pod that is spun up for exports.

It could be either the create-repo or the copy-vol-data pod. Do you see which image it is trying to use?

 

Userlevel 6
Badge +2

I am just checking how k10 runs exec commands in the newly created pods. We could be trying to run k10's kopia commands in the first container, but for some reason they could be running in the linkerd proxy container rather than the kanister-tools container.
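If that is what's happening, the injected pod spec would look roughly like this (a sketch with illustrative container names; Linkerd's proxy-injector places its proxy first, so a default exec would target it instead of the container that actually has kopia):

```yaml
# Hypothetical shape of a mutated Kanister pod spec (names illustrative)
spec:
  containers:
    - name: linkerd-proxy   # injected first by linkerd's admission webhook
    - name: container       # the kanister-tools container where kopia lives
```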

 

Userlevel 2

Hi @jaiganeshjk thanks for your response.

I’ve just checked and it spins up create-repo and backup-data-stats first, and later data-mover-svc, but the export shows a retry (x2) in the console right after create-repo.

Also, I had a thought. When these pods are deployed, linkerd's admission controller will mutate the pod spec to include a linkerd-proxy container. I wonder whether k10 just assumes there is only one container and attempts to exec kopia in the first one, which now happens to be the linkerd-proxy container?

Tim

Userlevel 2

Ha ha - we both thought of the same cause at the same time!

Userlevel 6
Badge +2

I see that you have a support case with us as well. 
Let me confirm if that is the cause of the issue and see what could be done if this is the case.

Userlevel 2

Thanks @jaiganeshjk - much appreciated

Userlevel 6
Badge +2

@Garland7362

I can confirm that the linkerd proxy container is where the commands are being run.
We run the exec commands in the first container, as we don't expect multiple containers in the pods that we dynamically spin up.

We will have to enhance this behaviour for your use case. I have filed an enhancement request to support this configuration. 

I went through the Linkerd issues to find out the reason for adding the proxy container as the first container.
They mention that this change was purposely made to avoid startup issues in this GitHub issue.

It seems that you can use the annotation config.linkerd.io/proxy-await: "disabled" on the workload to disable this feature.
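For anyone following along, that annotation would go on the workload's pod template, something like this (a sketch based on the Linkerd behaviour referenced above):

```yaml
# Pod template metadata fragment: disables Linkerd's proxy-await
# behaviour, so the proxy is not forced to be the first container
metadata:
  annotations:
    config.linkerd.io/proxy-await: "disabled"
```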


Thinking about working around this, you could try using the custom Kanister annotations Helm value to add this annotation to the pods spun up by K10.

This would make sure that the kanister-tools container is the first container, and should resolve your issue for the time being.

https://docs.kasten.io/latest/kanister/override.html#configuring-custom-labels-and-annotations

Please let me know if it works for you.

Userlevel 2

@jaiganeshjk 

Thanks for working through this and for raising the enhancement request.

I am trying to implement the workaround but having issues.

Following the documentation link, I added to my helm values file:

kanisterPodCustomAnnotations: "config.linkerd.io/proxy-await=disabled"

When k10 is deployed, virtually all the pods remain stuck in CreateContainerConfigError. Looking at the events of one of the pods, it shows:

9s          Warning   Failed                   pod/executor-svc-6cf948dd75-crvql             Error: couldn't find key kanisterPodCustomAnnotations in ConfigMap kasten/k10-config

I can see the kanisterPodCustomAnnotations in the k10-config configmap:

apiVersion: v1
data:
  AWSAssumeRoleDuration: 60m
  K10BackupBufferFileHeadroomFactor: "1.1"
  K10ExecutorMaxConcurrentRestoreCsiSnapshots: "3"
  K10ExecutorMaxConcurrentRestoreGenericVolumeSnapshots: "3"
  K10ExecutorMaxConcurrentRestoreWorkloads: "3"
  K10ExecutorWorkerCount: "8"
  K10GCDaemonPeriod: "21600"
  K10GCImportRunActionsEnabled: "false"
  K10GCKeepMaxActions: "1000"
  K10GCRetireActionsEnabled: "false"
  K10LimiterCsiSnapshots: "10"
  K10LimiterGenericVolumeCopies: "10"
  K10LimiterGenericVolumeRestores: "10"
  K10LimiterGenericVolumeSnapshots: "10"
  K10LimiterProviderSnapshots: "10"
  K10MutatingWebhookTLSCertDir: /etc/ssl/certs/webhook
  K10RootlessContainers: "false"
  KanisterBackupTimeout: "45"
  KanisterCheckRepoTimeout: "20"
  KanisterDeleteTimeout: "45"
  KanisterEFSPostRestoreTimeout: "45"
  KanisterHookTimeout: "20"
  KanisterPodCustomAnnotations: config.linkerd.io/proxy-await=disabled
  KanisterPodReadyWaitTimeout: "15"
  KanisterRestoreTimeout: "600"
  KanisterStatsTimeout: "20"
  apiDomain: kio.kasten.io
  concurrentSnapConversions: "3"
  concurrentWorkloadSnapshots: "5"
  efsBackupVaultName: k10vault
  k10DataStoreGeneralContentCacheSizeMB: "0"
  k10DataStoreGeneralMetadataCacheSizeMB: "500"
  k10DataStoreParallelUpload: "8"
  k10DataStoreRestoreContentCacheSizeMB: "500"
  k10DataStoreRestoreMetadataCacheSizeMB: "500"
  kanisterFunctionVersion: v1.0.0-alpha
  kubeVirtVMsUnFreezeTimeout: 5m
  loglevel: info
  modelstoredirname: //mnt/k10state/kasten-io/
  multiClusterVersion: "2"
  version: 5.5.2
  vmWareTaskTimeoutMin: "60"
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: kasten
    meta.helm.sh/release-namespace: kasten
  creationTimestamp: "2023-03-02T13:57:08Z"
  labels:
    app: k10
    app.kubernetes.io/instance: kasten
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: k10
    helm.sh/chart: k10-5.5.2
    heritage: Helm
    release: kasten
  name: k10-config
  namespace: kasten
  resourceVersion: "727624"
  uid: c22f8f9c-4148-457f-90cc-ed8832b45ca2

Checking one of the pods I don’t see the annotation present:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    checksum/config: e9e02307e75a18baf778c4179ed1e3a581d163273e4614c240cb09aa5897ef9d
    checksum/frontend-nginx-config: 44e6086c684885c88e43f79224e688aec51a0672e5f87fc961ed2af9006e60fb
    checksum/secret: 90de018eb29ceff98d4bdbf538a1ec6f1696a830dbd249640db547e571ca8569
    linkerd.io/created-by: linkerd/proxy-injector stable-2.12.4
    linkerd.io/inject: enabled
    linkerd.io/proxy-version: stable-2.12.4
    linkerd.io/trust-root-sha256: 514ac68fae331666e1366e476d3e49e3b72226b9d0b3483f2a15007d510b09bc
    rollme: HLRIq
    viz.linkerd.io/tap-enabled: "true"
  creationTimestamp: "2023-03-02T13:57:10Z"
  generateName: executor-svc-6cf948dd75-
  labels:
    app: k10
    app.kubernetes.io/instance: kasten
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: k10
    component: executor
    helm.sh/chart: k10-5.5.2
    heritage: Helm
    linkerd.io/control-plane-ns: linkerd
    linkerd.io/proxy-deployment: executor-svc
    linkerd.io/workload-ns: kasten
    pod-template-hash: 6cf948dd75
    release: kasten
    run: executor-svc
  name: executor-svc-6cf948dd75-2x45t
  namespace: kasten
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: executor-svc-6cf948dd75
    uid: 25f315ef-968d-47bb-9652-efe12fa72a33
  resourceVersion: "728396"
  uid: 59b7ec3f-c946-45c6-8f11-1505f17c7940
spec:
  ...

 

I’ve tried using “: ” instead of “=” in the annotation but no change.

Is there something I am doing wrong here?

Tim

Userlevel 6
Badge +2

I think I know the answer. It seems to be a bug on our side (I will file a bug and get it fixed).

The key reference (from the configMap) that is added to the workloads is kanisterPodCustomAnnotations,

whereas the key that was added in the configMap is KanisterPodCustomAnnotations.

Notice the uppercase K in the configMap and the lowercase k in the deployments.

You might have to work around it for the time being by editing the configMap and changing the key to kanisterPodCustomAnnotations (with a lowercase k).
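To make the edit concrete, the relevant fragment of the k10-config ConfigMap after the change would look like this (a sketch; only the case of the key's first letter changes):

```yaml
# k10-config data fragment after the manual edit: lowercase-k key so the
# workloads' reference to kanisterPodCustomAnnotations resolves
data:
  kanisterPodCustomAnnotations: config.linkerd.io/proxy-await=disabled
```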

 

Userlevel 2

Thanks, I didn’t spot that. Yes, editing the key name in the configmap solved the startup issue. I’ll now test to see if backup exports work.

Userlevel 2

@jaiganeshjk,

I have now successfully completed one test cycle of my test app: add items to database, backup (including logical mysql backup), delete items, restore.

This appears to have fixed the problem.

I will continue to test over the next few days to confirm.

Thank you - this is excellent.

Are you able to advise timeframe estimates for the k10-config bug and the enhancement for the exec to not run the linkerd container?

Userlevel 6
Badge +2

@Garland7362 Glad that the workaround works for you now.

I don’t have any timelines. It has to go through PM review, and then it will be worked on.

All I can say is that the bug will be fixed before the enhancement.

IMO, your workaround of adding annotations to the Kanister pods will ensure your exports work until we have the enhancement in this workflow, unless something changes in Linkerd with respect to the annotation.

Userlevel 2

Thanks for the workaround and queued bug fix and enhancement.

I have closed the support case.

Userlevel 1

Hi guys,

It looks like in version 5.5.7 the initial bug has been fixed, as deployments now seem to check KanisterPodCustomAnnotations in the k10-config configMap (replacing the K with k now makes the deployments fail, so ...). That’s good news.

The problem is that when I add the annotation sidecar.istio.io/inject=false (in my case I’m using the Istio service mesh), I cannot see that annotation on the kanister-job-* pods that are created, and the istio proxy sidecar container is still injected.

I also tried using the pod-spec-override to add the annotation but without much success.

Ref to the documentation used: https://docs.kasten.io/latest/kanister/override.html

What should I do to have the annotation actually added to the kanister-job pods?

Userlevel 2

Hi @stephw,

I tried to upgrade to 5.5.7 to test this in my cluster, but it causes an issue that will take some time to resolve, so I cannot test at present.

In 5.5.2, once I had changed the “K” to “k” in k10-config, the annotation I had specified was present in the kanister job pods when they were deployed.

One temporary workaround, if you are pressing up against a deadline and have kyverno on the cluster, is to create a kyverno mutate policy that mutates the pod spec at admission time to add the annotation. I have successfully used this to work around problems in other helm charts until they were fixed.
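In case it helps, a policy along these lines is what I mean (a sketch, not tested against this chart; the label selector is an assumption about how the Kanister job pods are labelled, so adjust the match to fit what you see on the pods):

```yaml
# Hypothetical Kyverno ClusterPolicy: adds the proxy-await annotation to
# matching pods at admission time; the createdBy label is an assumed selector
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-linkerd-proxy-await-disabled
spec:
  rules:
    - name: annotate-kanister-pods
      match:
        any:
          - resources:
              kinds:
                - Pod
              selector:
                matchLabels:
                  createdBy: kanister   # assumption: label on Kanister-created pods
      mutate:
        patchStrategicMerge:
          metadata:
            annotations:
              config.linkerd.io/proxy-await: "disabled"
```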

Userlevel 2

@Garland7362 @jaiganeshjk @stephw Any update on this? We’re using Kasten 6.5.1 with Linkerd as our service mesh solution and still experience the problem - adding an annotation for disabling the linkerd proxy injection to either the pod-spec-override ConfigMap:

apiVersion: v1
data:
  override: |
    kind: Pod
    metadata:
      annotations:
        linkerd.io/inject: disabled
kind: ConfigMap
metadata:
  name: pod-spec-override
  namespace: kasten-io

 or as a key in our Helm Chart installation:

KanisterPodCustomAnnotations: "linkerd.io/inject=disabled"

does not seem to work - the key is properly set in k10-config, but kanister jobs are still injected with the Linkerd proxy. We are not planning to use Kyverno. We could disable automatic proxy injection and add the annotation manually to each of our workloads, but we treat this as a last-resort solution.

Userlevel 4
Badge +2

Hello @peterturnip 

Reposting here also since other users could find this post.

The correct option in this case would be to use custom annotations, since pod-spec-override is not intended to add annotations.

We are aware of kanisterPodCustomAnnotations and kanisterPodCustomLabels not being applied to kanister-job pods, and we will be working to fix this as soon as possible. The annotations and labels added in those fields are applied to the pods that run in the kasten-io namespace during backup/export when using blueprints (i.e. datamover, copy-vol-data), but kanister-job pods run in the application's namespace.

I will update this thread as soon as we have the fix.

I would also like to recommend keeping an eye on our release page, where we announce new features and bug fixes.

https://docs.kasten.io/latest/releasenotes.html

Regards,
Rubens
