Solved

K10 not working with Linkerd service mesh


Trying to get K10 working with the Linkerd service mesh in a test cluster.

Prior to deploying Linkerd, K10 was working fine doing CSI snapshots (Rook Ceph provider) and MySQL logical backups (using the Kanister example), with backups going to a MinIO cluster living in its own namespace.

I have the mesh enabled on the k10, minio and vaultwarden namespaces (I am testing with Vaultwarden as I know it quite well and it uses a database). The MySQL logical backup is temporarily disabled. Unfortunately, the K10 export fails:

  • K10 appears to come up fine: no errors, all pods communicate with each other, and the UI works fine
  • Vaultwarden works fine
  • the backup CSI snapshot works fine
  • the backup export to MinIO fails with an error (clip below):
status:
  state: Failed
  startTime: 2023-02-28T09:58:27Z
  endTime: 2023-02-28T09:59:26Z
  restorePoint:
    name: ""
  result:
    name: ""
  error:
    cause: '{"cause":{"cause":{"cause":{"message":"Failed to exec command in pod:
      Internal error occurred: error executing command in container: failed to
      exec in container: failed to start exec
      \"5e49e0fa33a9aa7ca1c6c1ff161305332754f0b7227f21173c51b44c794e8cba\": OCI
      runtime exec failed: exec failed: unable to start container process: exec:
      \"kopia\": executable file not found in $PATH:
      unknown"},"file":"kasten.io/k10/kio/kopia/repository.go:558","function":"kasten.io/k10/kio/kopia.ConnectToKopiaRepository","linenumber":558,"message":"Failed
      to connect to the backup
      repository"},"fields":[{"name":"appNamespace","value":"vaultwarden"}],"file":"kasten.io/k10/kio/exec/phases/phase/export.go:264","function":"kasten.io/k10/kio/exec/phases/phase.prepareKopiaRepoIfExportingData","linenumber":264,"message":"Failed
      to create Kopia repository for data
      export"},"file":"kasten.io/k10/kio/exec/phases/phase/export.go:166","function":"kasten.io/k10/kio/exec/phases/phase.(*exportRestorePointPhase).Run","linenumber":166,"message":"Failed
      to copy artifacts"}'
    message: Job failed to be executed
  progress: 100

Additional information:

  • no authorization restrictions are set in Linkerd in these 3 namespaces, so it is wide open
  • the K10 UI shows the MinIO location profile as validated
  • if I switch to sidecar snapshotting, the snapshot is sent OK to the MinIO kasten-backup bucket (I can see the data in the MinIO UI), so access to MinIO from the vaultwarden namespace appears to be OK, BUT the K10 export still fails
  • while K10 is exporting, I see pods briefly appearing in the k10 namespace, presumably to do the export
  • I previously tried the above with Istio and K10 failed with communication errors, so Linkerd is definitely a step further

In sum, it seems that the K10 UI and the sidecar snapshotter work and can communicate with MinIO, but the K10 export fails.

Any ideas, even just where to troubleshoot next?

Thanks

Best answer by jaiganeshjk (see the accepted answer in the thread below)

19 comments

jaiganeshjk
  • Experienced User
  • 274 comments
  • March 1, 2023

Hi @Garland7362, thank you for posting the question.

From the error message, it seems that the kopia binary is not available in the pod that is spun up for exports.

It could be either the create-repo or the copy-vol-data pod. Do you see which image it is trying to use?
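You can catch it while the export runs with something like this (a rough sketch; adjust the namespace to wherever the ephemeral pods appear):

kubectl -n kasten get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.containers[*]}{.image}{" "}{end}{"\n"}{end}'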

 


jaiganeshjk
  • Experienced User
  • 274 comments
  • March 1, 2023

I am just checking how K10 runs exec commands in the newly created pods. We could be trying to run the kopia commands in the first container, but for some reason they could be running in the linkerd-proxy container rather than the kanister-tools container.

 


  • Author
  • New Here
  • 8 comments
  • March 1, 2023

Hi @jaiganeshjk, thanks for your response.

I’ve just checked: it spins up create-repo and backup-data-stats first, and later data-mover-svc, but the export shows a retry (x2) in the console right after create-repo.

Also, I had a thought. When these pods are deployed, Linkerd’s admission controller will mutate the pod spec to include a linkerd-proxy container. Could it be that, because K10 assumes there is only one container, it simply attempts to exec kopia in the first container, which now happens to be the linkerd-proxy one? A quick check along the lines below should show the container ordering.
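For example (a sketch; substitute the actual pod name from the events):

kubectl -n kasten get pod <create-repo-pod> -o jsonpath='{.spec.containers[*].name}'

If linkerd-proxy is listed first, that would confirm it.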

Tim


  • Author
  • New Here
  • 8 comments
  • March 1, 2023

Ha ha - we both thought of the same cause at the same time!


jaiganeshjk
  • Experienced User
  • 274 comments
  • March 2, 2023

I see that you have a support case with us as well.
Let me confirm whether that is the cause of the issue and see what can be done if it is.


  • Author
  • New Here
  • 8 comments
  • March 2, 2023

Thanks @jaiganeshjk - much appreciated


jaiganeshjk
  • Experienced User
  • 274 comments
  • Answer
  • March 2, 2023

@Garland7362

I can confirm that the linkerd-proxy container is where the commands are being run.
We seem to be running exec commands in the first container, as we don't expect the pods that we dynamically spin up to have multiple containers.

We will have to enhance this behaviour for your use case. I have filed an enhancement request to support this configuration. 

I went through the Linkerd issues to find out why the proxy container is added as the first container.
They mention that this change was made on purpose, to avoid startup issues, in this GitHub issue.

It seems that you can add the annotation config.linkerd.io/proxy-await: "disabled" to the workload to disable this feature.
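On an ordinary workload, that annotation would sit in the pod template, e.g. (a minimal sketch):

spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/proxy-await: "disabled"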


Thinking about working around this, you could try using the custom Kanister annotations Helm value to add this annotation to the pods spun up by K10.

This would make sure that the kanister-tools container is the first container, and should resolve your issue for the time being.

https://docs.kasten.io/latest/kanister/override.html#configuring-custom-labels-and-annotations
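For example (a sketch; the release and chart names here are taken from this thread and may differ in your setup):

# values.yaml
kanisterPodCustomAnnotations: "config.linkerd.io/proxy-await=disabled"

# apply it
helm upgrade kasten kasten/k10 --namespace kasten --reuse-values -f values.yaml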

Please let me know if it works for you.


  • Author
  • New Here
  • 8 comments
  • March 2, 2023

@jaiganeshjk 

Thanks for working through this and for raising the enhancement request.

I am trying to implement the workaround but having issues.

Following the documentation link, I added this to my Helm values file:

kanisterPodCustomAnnotations: "config.linkerd.io/proxy-await=disabled"

When K10 is deployed, virtually all the pods remain stuck in CreateContainerConfigError. Looking at the events for one of the pods, it shows:

9s          Warning   Failed                   pod/executor-svc-6cf948dd75-crvql             Error: couldn't find key kanisterPodCustomAnnotations in ConfigMap kasten/k10-config

I can see the kanisterPodCustomAnnotations in the k10-config configmap:

apiVersion: v1
data:
  AWSAssumeRoleDuration: 60m
  K10BackupBufferFileHeadroomFactor: "1.1"
  K10ExecutorMaxConcurrentRestoreCsiSnapshots: "3"
  K10ExecutorMaxConcurrentRestoreGenericVolumeSnapshots: "3"
  K10ExecutorMaxConcurrentRestoreWorkloads: "3"
  K10ExecutorWorkerCount: "8"
  K10GCDaemonPeriod: "21600"
  K10GCImportRunActionsEnabled: "false"
  K10GCKeepMaxActions: "1000"
  K10GCRetireActionsEnabled: "false"
  K10LimiterCsiSnapshots: "10"
  K10LimiterGenericVolumeCopies: "10"
  K10LimiterGenericVolumeRestores: "10"
  K10LimiterGenericVolumeSnapshots: "10"
  K10LimiterProviderSnapshots: "10"
  K10MutatingWebhookTLSCertDir: /etc/ssl/certs/webhook
  K10RootlessContainers: "false"
  KanisterBackupTimeout: "45"
  KanisterCheckRepoTimeout: "20"
  KanisterDeleteTimeout: "45"
  KanisterEFSPostRestoreTimeout: "45"
  KanisterHookTimeout: "20"
  KanisterPodCustomAnnotations: config.linkerd.io/proxy-await=disabled
  KanisterPodReadyWaitTimeout: "15"
  KanisterRestoreTimeout: "600"
  KanisterStatsTimeout: "20"
  apiDomain: kio.kasten.io
  concurrentSnapConversions: "3"
  concurrentWorkloadSnapshots: "5"
  efsBackupVaultName: k10vault
  k10DataStoreGeneralContentCacheSizeMB: "0"
  k10DataStoreGeneralMetadataCacheSizeMB: "500"
  k10DataStoreParallelUpload: "8"
  k10DataStoreRestoreContentCacheSizeMB: "500"
  k10DataStoreRestoreMetadataCacheSizeMB: "500"
  kanisterFunctionVersion: v1.0.0-alpha
  kubeVirtVMsUnFreezeTimeout: 5m
  loglevel: info
  modelstoredirname: //mnt/k10state/kasten-io/
  multiClusterVersion: "2"
  version: 5.5.2
  vmWareTaskTimeoutMin: "60"
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: kasten
    meta.helm.sh/release-namespace: kasten
  creationTimestamp: "2023-03-02T13:57:08Z"
  labels:
    app: k10
    app.kubernetes.io/instance: kasten
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: k10
    helm.sh/chart: k10-5.5.2
    heritage: Helm
    release: kasten
  name: k10-config
  namespace: kasten
  resourceVersion: "727624"
  uid: c22f8f9c-4148-457f-90cc-ed8832b45ca2

Checking one of the pods, I don’t see the annotation present:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    checksum/config: e9e02307e75a18baf778c4179ed1e3a581d163273e4614c240cb09aa5897ef9d
    checksum/frontend-nginx-config: 44e6086c684885c88e43f79224e688aec51a0672e5f87fc961ed2af9006e60fb
    checksum/secret: 90de018eb29ceff98d4bdbf538a1ec6f1696a830dbd249640db547e571ca8569
    linkerd.io/created-by: linkerd/proxy-injector stable-2.12.4
    linkerd.io/inject: enabled
    linkerd.io/proxy-version: stable-2.12.4
    linkerd.io/trust-root-sha256: 514ac68fae331666e1366e476d3e49e3b72226b9d0b3483f2a15007d510b09bc
    rollme: HLRIq
    viz.linkerd.io/tap-enabled: "true"
  creationTimestamp: "2023-03-02T13:57:10Z"
  generateName: executor-svc-6cf948dd75-
  labels:
    app: k10
    app.kubernetes.io/instance: kasten
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: k10
    component: executor
    helm.sh/chart: k10-5.5.2
    heritage: Helm
    linkerd.io/control-plane-ns: linkerd
    linkerd.io/proxy-deployment: executor-svc
    linkerd.io/workload-ns: kasten
    pod-template-hash: 6cf948dd75
    release: kasten
    run: executor-svc
  name: executor-svc-6cf948dd75-2x45t
  namespace: kasten
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: executor-svc-6cf948dd75
    uid: 25f315ef-968d-47bb-9652-efe12fa72a33
  resourceVersion: "728396"
  uid: 59b7ec3f-c946-45c6-8f11-1505f17c7940
spec:
...

 

I’ve tried using “: ” instead of “=” in the annotation, but no change.

Is there something I am doing wrong here?

Tim


jaiganeshjk
  • Experienced User
  • 274 comments
  • March 2, 2023

I think I know the answer. It seems to be a bug on our side (I will file a bug and get it fixed).

The key reference (from the ConfigMap) that is used in the workloads is kanisterPodCustomAnnotations,

whereas the key that was added to the ConfigMap is KanisterPodCustomAnnotations.

Notice the uppercase K in the ConfigMap and the lowercase one in the deployments.

You might have to work around it for the time being by editing the ConfigMap and changing the key to kanisterPodCustomAnnotations (with a lowercase k).
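A sketch of the rename as a one-off patch (note that Helm will revert manual edits on the next upgrade):

kubectl -n kasten patch configmap k10-config --type=json -p='[
  {"op":"add","path":"/data/kanisterPodCustomAnnotations","value":"config.linkerd.io/proxy-await=disabled"},
  {"op":"remove","path":"/data/KanisterPodCustomAnnotations"}]'

The stuck pods should start on their own once the lowercase key exists.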

 


  • Author
  • New Here
  • 8 comments
  • March 2, 2023

Thanks, I didn’t spot that. Yes, editing the key name in the ConfigMap solved the startup issue. I’ll now test to see if backup exports work.


  • Author
  • New Here
  • 8 comments
  • March 2, 2023

@jaiganeshjk,

I have now successfully completed one test cycle of my test app: add items to the database, back up (including the logical MySQL backup), delete items, restore.

This appears to have fixed the problem.

I will continue to test over the next few days to confirm.

Thank you - this is excellent.

Are you able to advise on timeframe estimates for the k10-config bug fix and for the enhancement so that the exec does not run in the linkerd-proxy container?


jaiganeshjk
  • Experienced User
  • 274 comments
  • March 2, 2023

@Garland7362 Glad that the workaround works for you now.

I don’t have any timelines. It has to go through PM review before it is worked on.

All I can say is that the bug will be fixed before the enhancement.

IMO, your workaround of adding annotations to the Kanister pods will ensure your exports work until we have the enhancement in this workflow, unless something changes in Linkerd with respect to the annotation.


  • Author
  • New Here
  • 8 comments
  • March 3, 2023

Thanks for the workaround and queued bug fix and enhancement.

I have closed the support case.


  • Not a newbie anymore
  • 3 comments
  • March 26, 2023

Hi guys,

It looks like in version 5.5.7 the initial bug has been fixed, as deployments now seem to be reading KanisterPodCustomAnnotations from the k10-config ConfigMap (replacing the K with k now makes the deployments fail, so ...). That’s good news.

The problem is that when I add the annotation sidecar.istio.io/inject=false (in my case I’m using the Istio service mesh), I cannot see that annotation on the kanister-job-* pods that are created, and the Istio proxy sidecar container is still injected.

I also tried using the pod-spec-override to add the annotation but without much success.

Ref to the documentation used: https://docs.kasten.io/latest/kanister/override.html

What should I do to have the annotation actually added to the kanister-job pods?


  • Author
  • New Here
  • 8 comments
  • March 27, 2023

Hi @stephw,

I tried to upgrade to 5.5.7 to test this in my cluster, but it causes an issue that will take some time to resolve, so I cannot test at present.

In 5.5.2, once I had changed the “K” to “k” in k10-config, the annotation I had specified was present in the kanister-job pods when they were deployed.

One temporary workaround, if you are pressing up against a deadline and you have Kyverno on the cluster, is to create a Kyverno mutate policy that mutates the pod spec at admission time to add the annotation (a sketch follows below). I have successfully used this approach to work around problems with other Helm charts until they were fixed.
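A rough sketch of such a policy (untested; the match criteria are illustrative, and webhook ordering relative to the Istio injector may need checking):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: kanister-job-disable-sidecar-inject
spec:
  rules:
    - name: add-inject-annotation
      match:
        any:
          - resources:
              kinds:
                - Pod
              names:
                - "kanister-job-*"
      mutate:
        patchStrategicMerge:
          metadata:
            annotations:
              sidecar.istio.io/inject: "false"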


  • Comes here often
  • 3 comments
  • February 20, 2024

@Garland7362 @jaiganeshjk @stephw Any update on this? We’re using Kasten 6.5.1 with Linkerd as our service mesh solution and still experience the problem. Adding an annotation to disable the Linkerd proxy injection, either to the pod-spec-override ConfigMap:

apiVersion: v1
data:
  override: |
    kind: Pod
    metadata:
      annotations:
        linkerd.io/inject: disabled
kind: ConfigMap
metadata:
  name: pod-spec-override
  namespace: kasten-io

or as a key in our Helm chart installation:

KanisterPodCustomAnnotations: "linkerd.io/inject=disabled"

does not seem to work: the key is properly set in k10-config, but kanister jobs are still injected with the Linkerd proxy. We are not planning to use Kyverno. We could disable automatic proxy injection and add the annotation manually to each of our workloads, but we treat this as a last-resort solution.


FRubens
  • Experienced User
  • 96 comments
  • February 27, 2024

Hello @peterturnip 

Reposting here also since other users could find this post.

The correct option in this case would be to use custom annotations, since pod-spec-override is not intended to add annotations.

We are aware of kanisterPodCustomAnnotations and kanisterPodCustomLabels not being applied to kanister-job pods, and we will be working to fix this as soon as possible. The annotations and labels added in those fields are applied to pods that run in the kasten-io namespace during backup/export when using blueprints (i.e. datamover, copy-vol-data), but kanister-job pods run in the application's namespace.

I will update this thread as soon as we have the fix.

I would also recommend keeping an eye on our release notes page, where we announce new features and bug fixes.

https://docs.kasten.io/latest/releasenotes.html

Regards,
Rubens


FRubens
  • Experienced User
  • 96 comments
  • October 29, 2024

Hello @Garland7362 @peterturnip @stephw,

I would like to share that in Veeam Kasten 7.0.9 we have fixed the issue regarding custom annotations/labels not being applied to Kasten ephemeral pods.

https://docs.kasten.io/latest/releasenotes.html#relnotes-7-0-9

Helm flags global.podLabels and global.podAnnotations were added; they can be used to set labels and annotations on all Veeam Kasten pods globally, including all ephemeral pods created by Veeam Kasten.
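In a Helm values file, that might look like this minimal sketch (reusing the annotation from earlier in the thread):

global:
  podAnnotations:
    linkerd.io/inject: disabled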

Regards

Rubens

