
This morning we received an alert that our Kasten backups were failing on our production Rancher RKE2 cluster (version v1.30.2+rke2r1). In particular, when attempting to snapshot application components and configuration, we encountered the error “Could not encrypt data”.

We are using K10 version 7.0.4, which was updated automatically a few days ago via GitOps with Flux v2, though the error only started occurring this morning.

By inspecting the pods in the kasten-io namespace, we noticed that the crypto-svc pod is stuck in CrashLoopBackOff, while all other pods are running normally.
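The listing below was produced with a plain pod listing, roughly:

kubectl -n kasten-io get pods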

NAME                          READY   STATUS             RESTARTS       AGE

crypto-svc-6f78dcf599-xhrgp   3/4     CrashLoopBackOff   6 (118s ago)   7m57s

Describing the pod revealed that the failed container is bloblifecyclemanager-svc.
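The description and the associated events were pulled with commands roughly along these lines (pod name taken from the listing above):

kubectl -n kasten-io describe pod crypto-svc-6f78dcf599-xhrgp
kubectl -n kasten-io get events --field-selector involvedObject.name=crypto-svc-6f78dcf599-xhrgp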

8m53s       Normal    Pulling            pod/crypto-svc-6f78dcf599-xhrgp    Pulling image "gcr.io/kasten-images/bloblifecyclemanager:7.0.4"

9m15s       Normal    Pulled             pod/crypto-svc-6f78dcf599-xhrgp    Successfully pulled image "gcr.io/kasten-images/bloblifecyclemanager:7.0.4" in 307ms (307ms including waiting). Image size: 112468707 bytes.

9m13s       Normal    Created            pod/crypto-svc-6f78dcf599-xhrgp    Created container bloblifecyclemanager-svc

9m12s       Normal    Started            pod/crypto-svc-6f78dcf599-xhrgp    Started container bloblifecyclemanager-svc

9m13s       Normal    Pulled             pod/crypto-svc-6f78dcf599-xhrgp    Successfully pulled image "gcr.io/kasten-images/bloblifecyclemanager:7.0.4" in 242ms (243ms including waiting). Image size: 112468707 bytes.

4m11s       Warning   BackOff            pod/crypto-svc-6f78dcf599-xhrgp    Back-off restarting failed container bloblifecyclemanager-svc in pod crypto-svc-6f78dcf599-xhrgp_kasten-io(d9b7bb69-b48d-49ca-996e-374479e7679e)

Inspecting the container logs for bloblifecyclemanager-svc in the crypto-svc pod reveals a program panic due to a nil pointer dereference.
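The panic below was captured with a container-scoped logs command along these lines (--previous shows the logs of the crashed container instance):

kubectl -n kasten-io logs crypto-svc-6f78dcf599-xhrgp -c bloblifecyclemanager-svc --previous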

...

panic: runtime error: invalid memory address or nil pointer dereference

[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2c5905f]

goroutine 151 [running]:

kasten.io/k10/kio/bloblifecycle/s3client.(*Store).GetBlobRetention(0xc0000c3808?, {0x54f4a50?, 0xc002724180?}, {0xc002caa710, 0x10}, {0xc0030fc210?, 0xc001755318?}, {0xc002aee980?, 0x53a26d?})

        kasten.io/k10/kio/bloblifecycle/s3client/lifecycler.go:108 +0x93f

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).getBlobRetention.func1({0x54f4a50?, 0xc002724180?})

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:932 +0x63

kasten.io/k10/kio/poll.waitWithBackoffWithRetriesHelper({0x54f4a50, 0xc002724180}, {0x0, 0x0, 0x0, 0x0, 0x0}, 0x5, 0x4cba8d8, 0xc000087770)

        kasten.io/k10/kio/poll/poll.go:99 +0x210

kasten.io/k10/kio/poll.waitWithBackoffWithRetries({0x54f4a50, 0xc002724180}, {0x0, 0x0, 0x0, 0x0, 0x0}, 0x5, 0x4cba8d8, 0xc000087770)

        kasten.io/k10/kio/poll/poll.go:83 +0xde

kasten.io/k10/kio/poll.WaitWithRetries(...)

        kasten.io/k10/kio/poll/poll.go:64

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).getBlobRetention(0xc001413200?, {0x54f4a50?, 0xc002724180?}, {0xc002caa710?, 0x70?}, {0xc0030fc210?, 0xc000087850?}, {0xc002aee980?, 0x73bbaffff9f8?})

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:930 +0xe5

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).preserveSingleBlobVersion(0xc001413200, {0x54f4a50, 0xc002724180}, 0xc0035185c0, {0xc002aee980, 0x20})

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:683 +0x6d

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).manageBlobVersions(0xc002caa710?, {0x54f4a50?, 0xc002724180?}, 0xc0035185c0)

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:513 +0x145

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).manageBlobsInRepo(0xc001413200, {0x54f4a50, 0xc002724180})

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:305 +0x2c9

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).initAndManageBlobsInRepo(0xc001413200, {0x54f4a50, 0xc002724180})

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:278 +0x66

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).manageBlobsInRepoCountActive(0xc001413200, {0x54f4a50, 0xc002724180})

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:269 +0xa5

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).manageBlobsInRepoThrottled(0xc001413200, {0x54f4a50, 0xc002724180}, 0xc0000f47e0)

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:247 +0x93

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).manageBlobsInRepoWithDurationCheck(0xc001413200, {0x54f4a50, 0xc002724180}, 0xc0000f47e0)

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:207 +0xc9

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).runRefreshCycle(0xc001413200, {0x54f4a50?, 0xc002724180?}, 0x0?)

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:143 +0x25

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).run(0xc001413200, {0x54f4a50, 0xc002724180}, 0xc0000f47e0)

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:133 +0xa5

kasten.io/k10/kio/bloblifecycle/lifecycle.(*Manager).runRepoExtender.func1()

        kasten.io/k10/kio/bloblifecycle/lifecycle/manager.go:374 +0x7a

created by kasten.io/k10/kio/bloblifecycle/lifecycle.(*Manager).runRepoExtender in goroutine 113

        kasten.io/k10/kio/bloblifecycle/lifecycle/manager.go:368 +0x138

The full logs are attached as k10-7.0.4-crypto-svc-bloblifecyclemanager-full-log.txt for reference.

We also tried the following, to no avail; the blob lifecycle manager continues to fail with the same nil pointer dereference (the rough commands we used are sketched after the list):

  1. Killing the crypto-svc pod to have it re-created
  2. Downgrading the Helm chart to K10 7.0.3 (logs attached as well)
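Since K10 is managed by Flux in our setup, the downgrade was actually done by pinning the chart version in the HelmRelease; the sketch below shows the pod deletion plus the direct Helm equivalent (release and values file names are ours, adjust as needed):

# 1. Delete the crypto-svc pod so its Deployment re-creates it
kubectl -n kasten-io delete pod crypto-svc-6f78dcf599-xhrgp

# 2. Roll the chart back to 7.0.3 (Helm equivalent of pinning the version in the HelmRelease)
helm upgrade k10 kasten/k10 --namespace kasten-io --version 7.0.3 -f k10-values.yaml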

We haven’t observed the same issue with a (relatively) fresh K10 7.0.3 installation on our other PoC cluster (K3s version v1.29.5+k3s1), but I suppose the program panic itself is a bug worth investigating.

Status update: we deleted K10 and the kasten-io namespace, then suspended GitOps for kasten-io and re-installed K10 7.0.3 with our GitOps Helm chart values. The fresh K10 7.0.3 installation runs normally, but running k10restore afterwards fails with the same crypto-svc CrashLoopBackOff error. It appears that something related to K10 got corrupted over the past two days, preventing restores from working correctly 😥
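For completeness, the suspend-and-reinstall went roughly as follows (the HelmRelease name, release name, and values file are specific to our setup):

# Stop Flux from reconciling the K10 HelmRelease
flux suspend helmrelease k10 -n kasten-io

# Remove K10 and its namespace, then reinstall 7.0.3 with our chart values
helm uninstall k10 -n kasten-io
kubectl delete namespace kasten-io
helm install k10 kasten/k10 --namespace kasten-io --create-namespace --version 7.0.3 -f k10-values.yaml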

Attached are the full logs for the failed K10 DR restore based on the command suggested in the screenshot.

We’ll try restoring from an earlier restore point over the next few hours (good thing we enabled immutable backups), but if that doesn’t work either, our production cluster will no longer be backup-protected ...


We tried restoring from this Monday’s backup and hit the same issue:

helm -n kasten-io install k10-restore kasten/k10restore \
    --set sourceClusterID='b3888a63-c199-45d6-b962-f47f7ad2f249' \
    --set profile.name='rke2-enfinitypoc-wasabi' \
    --set pointInTime="2024-07-15T15:04:05Z"

It seems we may have to ditch the previous backups and re-create all the backup policies and initial backup jobs manually; that way, at least our production cluster should be protected by the most recent backups in case something unexpected occurs.


We just deleted the kasten-io namespace again and re-installed K10 7.0.3 from scratch. This time, we did not run k10restore but instead re-created all the DR and backup policies manually, with a fresh S3-compatible bucket for exporting snapshots. The policies ran to completion successfully; however, after manually deleting the crypto-svc pod as a quick test, the pod entered a CrashLoopBackOff state once again. Could this be a bug in recent versions of Kasten itself?

As an aside, we tried the same thing (deleting crypto-svc) with our PoC single-node K3s cluster and could not reproduce the CrashLoopBackOff issue there.

More information about our production cluster encountering the issue:

  • Distribution: Rancher RKE2
  • K8s version: v1.30.2+rke2r1
  • Storage backend: Rook Ceph

We’ll downgrade our Kasten further to 7.0.2 (perhaps down to 7.0.0) as a next step and see if it resolves our issue.


After performing further downgrades of Veeam Kasten, we discovered the following behavior:

  1. When downgrading from 7.0.3 to 7.0.2, the crypto-svc pod enters the same CrashLoopBackOff state with the same nil pointer dereference
  2. When downgrading from 7.0.2 to 7.0.1, the crypto-svc pod starts normally and survives pod deletions, but the catalog-svc pod now enters an Init:CrashLoopBackOff state, and the schema-upgrade-check init container logs indicate that a model schema downgrade is not possible (the log command we used is sketched after this list)
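The init container logs mentioned above can be read with a command along these lines (using the deployment name rather than the generated pod name):

kubectl -n kasten-io logs deployment/catalog-svc -c schema-upgrade-check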

Based on the above observations, here’s our theory of what happened: the K10 upgrade from 7.0.1 to 7.0.2 involved a model schema upgrade which silently corrupted whatever crypto-svc depends on (perhaps some encryption keys?), causing it to hit a nil pointer dereference once the pod was eventually restarted for whatever reason. This would also explain why we could not reproduce the issue on our relatively new PoC cluster, which had K10 7.0.3 installed from the get-go, so no model schema upgrade was ever performed.

This is a critical issue: it renders all our prior K10 backups useless and prevents us from creating any new backups, so we hope Veeam engineering will look into it.


Hello @donaldleung 

Let me first summarize the issues, and please confirm if my understanding is correct.

Summary of Issues 

Issue Overview:

  • Alert Received: Kasten backups failing on Rancher RKE2 production cluster (version v1.30.2+rke2r1).
  • Error Message: "Could not encrypt data" during snapshot attempts.
  • K10 Version: Using K10 version 7.0.4, updated automatically via GitOps with Flux v2.
  • Pod Status: crypto-svc pod in CrashLoopBackOff state, while other pods are running normally.

Pod Details:

  • Pod: crypto-svc-6f78dcf599-xhrgp
  • Container: bloblifecyclemanager-svc is failing.
  • Logs: Indicate a nil pointer dereference leading to a runtime panic.

Stack Trace:

  • Error: panic: runtime error: invalid memory address or nil pointer dereference
  • Affected Method: GetBlobRetention in the s3client package.

Troubleshooting Steps Taken on your end:

  1. Pod Deletion: Tried killing crypto-svc pod for recreation.
  2. Helm Downgrade: Downgraded to K10 version 7.0.3, issue persisted.
  3. Fresh Installation: Deleted K10 and re-installed version 7.0.3 without success.
  4. Backup Restore: Attempted k10restore, resulting in the same issue.
  5. Manual Re-creation: Re-created backup policies manually; initial success but issue reoccurred after deleting the crypto-svc pod.
  6. Further Downgrades:
    • Downgrading to 7.0.2 resulted in the same CrashLoopBackOff state.
    • Downgrading to 7.0.1 resolved the crypto-svc issue but caused the catalog-svc pod to fail.

 

Current Status:

  • Backup Protection: Unable to rely on previous backups; considering recreating backup policies and initial jobs manually to ensure protection.
  • Next Steps: Downgrading further (not recommended) or awaiting a fix from Veeam engineering.

 

My notes:

  1. Kindly be informed that Kasten does not yet officially support Kubernetes 1.30; your Rancher RKE2 production cluster is running v1.30.2+rke2r1 (but we will do our best to support it).
  2. We do not recommend downgrading the K10 version, as there are schema upgrades between versions that could cause issues.
  3. There is an issue with K10 v7.0.3; we recommend upgrading to v7.0.4 and skipping v7.0.3.
  4. We can assist with restoring K10 DR if you have a valid backup (in case you want to restore your old policies, backups, restore points, etc.). Please open a trial case through my.veeam.com under 'Trial cases' in products. Collect the debug logs (https://docs.kasten.io/latest/operating/support.html#gathering-debugging-information) and upload them to the case. We will get in touch and take a deeper look at what’s going on.




Thanks
Ahmed Hagag



Thanks @Hagag for your summary; your understanding of our issue is very much correct. We’ll try perhaps one or two more troubleshooting steps, and if they don’t resolve our issue, we’ll collect the logs and open a trial case on my.veeam.com as suggested.


We were hoping to upgrade K10 to 7.0.4 again but specifically pin catalog-svc to 7.0.1; unfortunately, this does not appear to be a supported K10 Helm chart option. As such, we upgraded the entire K10 instance from 7.0.1 to 7.0.4 (with the crypto-svc issue reappearing as expected), collected the logs and opened a trial case on my.veeam.com (case #07345803). Thanks @Hagag again for your help!

