
This morning we received an alert that our Kasten backups were failing on our production Rancher RKE2 cluster (version v1.30.2+rke2r1). In particular, when attempting to snapshot application components and configuration, we encountered the error “Could not encrypt data”.

We are using K10 version 7.0.4, which was updated automatically a few days ago via GitOps with Flux v2, though the error only started occurring this morning.

By inspecting the pods in the kasten-io namespace, we noticed that the crypto-svc pod is stuck in CrashLoopBackOff, while all other pods are running normally.
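The listing below was produced with a plain pod listing, roughly:

kubectl -n kasten-io get pods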

NAME                          READY   STATUS             RESTARTS       AGE

crypto-svc-6f78dcf599-xhrgp   3/4     CrashLoopBackOff   6 (118s ago)   7m57s

Describing the pod revealed that the failed container is bloblifecyclemanager-svc.
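The description and the associated events were pulled with commands roughly along these lines (pod name taken from the listing above):

kubectl -n kasten-io describe pod crypto-svc-6f78dcf599-xhrgp
kubectl -n kasten-io get events --field-selector involvedObject.name=crypto-svc-6f78dcf599-xhrgp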

8m53s       Normal    Pulling            pod/crypto-svc-6f78dcf599-xhrgp    Pulling image "gcr.io/kasten-images/bloblifecyclemanager:7.0.4"

9m15s       Normal    Pulled             pod/crypto-svc-6f78dcf599-xhrgp    Successfully pulled image "gcr.io/kasten-images/bloblifecyclemanager:7.0.4" in 307ms (307ms including waiting). Image size: 112468707 bytes.

9m13s       Normal    Created            pod/crypto-svc-6f78dcf599-xhrgp    Created container bloblifecyclemanager-svc

9m12s       Normal    Started            pod/crypto-svc-6f78dcf599-xhrgp    Started container bloblifecyclemanager-svc

9m13s       Normal    Pulled             pod/crypto-svc-6f78dcf599-xhrgp    Successfully pulled image "gcr.io/kasten-images/bloblifecyclemanager:7.0.4" in 242ms (243ms including waiting). Image size: 112468707 bytes.

4m11s       Warning   BackOff            pod/crypto-svc-6f78dcf599-xhrgp    Back-off restarting failed container bloblifecyclemanager-svc in pod crypto-svc-6f78dcf599-xhrgp_kasten-io(d9b7bb69-b48d-49ca-996e-374479e7679e)

Inspecting the container logs for bloblifecyclemanager-svc in the crypto-svc pod reveals a program panic due to a nil pointer dereference.
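The panic below was captured with a container-scoped logs command along these lines (--previous shows the logs of the crashed container instance):

kubectl -n kasten-io logs crypto-svc-6f78dcf599-xhrgp -c bloblifecyclemanager-svc --previous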

...

panic: runtime error: invalid memory address or nil pointer dereference

[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2c5905f]

goroutine 151 [running]:

kasten.io/k10/kio/bloblifecycle/s3client.(*Store).GetBlobRetention(0xc0000c3808?, {0x54f4a50?, 0xc002724180?}, {0xc002caa710, 0x10}, {0xc0030fc210?, 0xc001755318?}, {0xc002aee980?, 0x53a26d?})

        kasten.io/k10/kio/bloblifecycle/s3client/lifecycler.go:108 +0x93f

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).getBlobRetention.func1({0x54f4a50?, 0xc002724180?})

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:932 +0x63

kasten.io/k10/kio/poll.waitWithBackoffWithRetriesHelper({0x54f4a50, 0xc002724180}, {0x0, 0x0, 0x0, 0x0, 0x0}, 0x5, 0x4cba8d8, 0xc000087770)

        kasten.io/k10/kio/poll/poll.go:99 +0x210

kasten.io/k10/kio/poll.waitWithBackoffWithRetries({0x54f4a50, 0xc002724180}, {0x0, 0x0, 0x0, 0x0, 0x0}, 0x5, 0x4cba8d8, 0xc000087770)

        kasten.io/k10/kio/poll/poll.go:83 +0xde

kasten.io/k10/kio/poll.WaitWithRetries(...)

        kasten.io/k10/kio/poll/poll.go:64

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).getBlobRetention(0xc001413200?, {0x54f4a50?, 0xc002724180?}, {0xc002caa710?, 0x70?}, {0xc0030fc210?, 0xc000087850?}, {0xc002aee980?, 0x73bbaffff9f8?})

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:930 +0xe5

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).preserveSingleBlobVersion(0xc001413200, {0x54f4a50, 0xc002724180}, 0xc0035185c0, {0xc002aee980, 0x20})

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:683 +0x6d

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).manageBlobVersions(0xc002caa710?, {0x54f4a50?, 0xc002724180?}, 0xc0035185c0)

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:513 +0x145

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).manageBlobsInRepo(0xc001413200, {0x54f4a50, 0xc002724180})

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:305 +0x2c9

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).initAndManageBlobsInRepo(0xc001413200, {0x54f4a50, 0xc002724180})

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:278 +0x66

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).manageBlobsInRepoCountActive(0xc001413200, {0x54f4a50, 0xc002724180})

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:269 +0xa5

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).manageBlobsInRepoThrottled(0xc001413200, {0x54f4a50, 0xc002724180}, 0xc0000f47e0)

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:247 +0x93

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).manageBlobsInRepoWithDurationCheck(0xc001413200, {0x54f4a50, 0xc002724180}, 0xc0000f47e0)

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:207 +0xc9

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).runRefreshCycle(0xc001413200, {0x54f4a50?, 0xc002724180?}, 0x0?)

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:143 +0x25

kasten.io/k10/kio/bloblifecycle/lifecycle.(*repoBlobExtender).run(0xc001413200, {0x54f4a50, 0xc002724180}, 0xc0000f47e0)

        kasten.io/k10/kio/bloblifecycle/lifecycle/repo_extender.go:133 +0xa5

kasten.io/k10/kio/bloblifecycle/lifecycle.(*Manager).runRepoExtender.func1()

        kasten.io/k10/kio/bloblifecycle/lifecycle/manager.go:374 +0x7a

created by kasten.io/k10/kio/bloblifecycle/lifecycle.(*Manager).runRepoExtender in goroutine 113

        kasten.io/k10/kio/bloblifecycle/lifecycle/manager.go:368 +0x138

The full logs are attached as k10-7.0.4-crypto-svc-bloblifecyclemanager-full-log.txt for reference.

We also tried the following, to no avail; the blob lifecycle manager continues to fail with the same nil pointer dereference (the rough commands we used are sketched after the list):

  1. Killing the crypto-svc pod to have it re-created
  2. Downgrading the Helm chart to K10 7.0.3 (logs attached as well)
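Since K10 is managed by Flux in our setup, the downgrade was actually done by pinning the chart version in the HelmRelease; the sketch below shows the pod deletion plus the direct Helm equivalent (release and values file names are ours, adjust as needed):

# 1. Delete the crypto-svc pod so its Deployment re-creates it
kubectl -n kasten-io delete pod crypto-svc-6f78dcf599-xhrgp

# 2. Roll the chart back to 7.0.3 (Helm equivalent of pinning the version in the HelmRelease)
helm upgrade k10 kasten/k10 --namespace kasten-io --version 7.0.3 -f k10-values.yaml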

We haven’t observed the same issue with a (relatively) fresh K10 7.0.3 installation on our other PoC cluster (K3s version v1.29.5+k3s1), but I suppose the program panic itself is a bug worth investigating.

Status update: we deleted K10 and the kasten-io namespace, then suspended GitOps for kasten-io and re-installed K10 7.0.3 with our GitOps Helm chart values. The fresh K10 7.0.3 installation runs normally, but running k10restore afterwards fails with the same crypto-svc CrashLoopBackOff error. It appears that something related to K10 got corrupted over the past two days, preventing restores from working correctly 😥
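For completeness, the suspend-and-reinstall went roughly as follows (the HelmRelease name, release name, and values file are specific to our setup):

# Stop Flux from reconciling the K10 HelmRelease
flux suspend helmrelease k10 -n kasten-io

# Remove K10 and its namespace, then reinstall 7.0.3 with our chart values
helm uninstall k10 -n kasten-io
kubectl delete namespace kasten-io
helm install k10 kasten/k10 --namespace kasten-io --create-namespace --version 7.0.3 -f k10-values.yaml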

Attached are the full logs for the failed K10 DR restore based on the command suggested in the screenshot.

We’ll try restoring from an earlier restore point over the next few hours (good thing we enabled immutable backups), but if that doesn’t work either, our production cluster will no longer be backup-protected ...


We tried restoring from this Monday’s backup and hit the same issue:

helm -n kasten-io install k10-restore kasten/k10restore \
    --set sourceClusterID='b3888a63-c199-45d6-b962-f47f7ad2f249' \
    --set profile.name='rke2-enfinitypoc-wasabi' \
    --set pointInTime="2024-07-15T15:04:05Z"

It seems we may have to ditch the previous backups and re-create all the backup policies and initial backup jobs manually; that way, at least our production cluster should be protected by the most recent backups in case something unexpected occurs.


We just deleted the kasten-io namespace again and re-installed K10 7.0.3 from scratch. This time, we did not run k10restore but instead re-created all the DR and backup policies manually, with a fresh S3-compatible bucket for exporting snapshots. The policies ran to completion successfully; however, after manually deleting the crypto-svc pod as a quick test, the pod entered a CrashLoopBackOff state once again. Could this be a bug in recent versions of Kasten itself?

As an aside, we tried the same thing (deleting crypto-svc) with our PoC single-node K3s cluster and could not reproduce the CrashLoopBackOff issue there.

More information about our production cluster encountering the issue:

  • Distribution: Rancher RKE2
  • K8s version: v1.30.2+rke2r1
  • Storage backend: Rook Ceph

We’ll downgrade our Kasten further to 7.0.2 (perhaps down to 7.0.0) as a next step and see if it resolves our issue.


After performing further downgrades of Veeam Kasten, we discovered the following behavior:

  1. When downgrading from 7.0.3 to 7.0.2, the crypto-svc pod enters the same CrashLoopBackOff state with the same nil pointer dereference
  2. When downgrading from 7.0.2 to 7.0.1, the crypto-svc pod starts normally and survives pod deletions, but the catalog-svc pod now enters an Init:CrashLoopBackOff state, and the schema-upgrade-check init container logs indicate that a model schema downgrade is not possible (the log command we used is sketched after this list)
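The init container logs mentioned above can be read with a command along these lines (using the deployment name rather than the generated pod name):

kubectl -n kasten-io logs deployment/catalog-svc -c schema-upgrade-check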

Based on the above observations, here’s our theory of what happened: the K10 upgrade from 7.0.1 to 7.0.2 involved a model schema upgrade which silently corrupted whatever crypto-svc depends on (perhaps some encryption keys?), causing it to hit a nil pointer dereference once the pod was eventually restarted for whatever reason. This would also explain why we could not reproduce the issue on our relatively new PoC cluster, which had K10 7.0.3 installed from the get-go, so no model schema upgrade was ever performed.

This is a critical issue: it renders all our prior K10 backups useless and prevents us from creating any new backups, so we hope Veeam engineering will look into it.


Hello @donaldleung 

Let me first summarize the issues, and please confirm if my understanding is correct.

Summary of Issues 

Issue Overview:

  • Alert Received: Kasten backups failing on Rancher RKE2 production cluster (version v1.30.2+rke2r1).
  • Error Message: "Could not encrypt data" during snapshot attempts.
  • K10 Version: Using K10 version 7.0.4, updated automatically via GitOps with Flux v2.
  • Pod Status: crypto-svc pod in CrashLoopBackOff state, while other pods are running normally.

Pod Details:

  • Pod: crypto-svc-6f78dcf599-xhrgp
  • Container: bloblifecyclemanager-svc is failing.
  • Logs: Indicate a nil pointer dereference leading to a runtime panic.

Stack Trace:

  • Error: panic: runtime error: invalid memory address or nil pointer dereference
  • Affected Method: GetBlobRetention in the s3client package.

Troubleshooting Steps Taken on your end:

  1. Pod Deletion: Tried killing crypto-svc pod for recreation.
  2. Helm Downgrade: Downgraded to K10 version 7.0.3, issue persisted.
  3. Fresh Installation: Deleted K10 and re-installed version 7.0.3 without success.
  4. Backup Restore: Attempted k10restore, resulting in the same issue.
  5. Manual Re-creation: Re-created backup policies manually; initial success but issue reoccurred after deleting the crypto-svc pod.
  6. Further Downgrades:
    • Downgrading to 7.0.2 resulted in the same CrashLoopBackOff state.
    • Downgrading to 7.0.1 resolved the crypto-svc issue but caused the catalog-svc pod to fail.

 

Current Status:

  • Backup Protection: Unable to rely on previous backups; considering recreating backup policies and initial jobs manually to ensure protection.
  • Next Steps: Downgrading further (not recommended) or awaiting a fix from Veeam engineering.

 

My notes:

  1. Kindly be informed that Kasten does not yet officially support Kubernetes 1.30; your Rancher RKE2 production cluster is running v1.30.2+rke2r1 (but we will do our best to support it).
  2. We do not recommend downgrading the K10 version, as there are schema upgrades between versions that could cause issues.
  3. There is an issue with K10 v7.0.3; we recommend upgrading to v7.0.4 and skipping v7.0.3.
  4. We can assist with restoring K10 DR if you have a valid backup (in case you want to restore your old policies, backups, restore points, etc.). Please open a trial case through my.veeam.com under 'Trial cases' in products. Collect the debug logs (https://docs.kasten.io/latest/operating/support.html#gathering-debugging-information) and upload them to the case. We will get in touch and take a deeper look at what’s going on.




Thanks
Ahmed Hagag



Thanks @Hagag for your summary; your understanding of our issue is very much correct. We’ll try perhaps one or two more troubleshooting steps, and if they don’t resolve our issue, we’ll collect the logs and open a trial case on my.veeam.com as suggested.


We were hoping to upgrade K10 to 7.0.4 again but specifically pin catalog-svc to 7.0.1; unfortunately, this does not appear to be a supported K10 Helm chart option. As such, we upgraded the entire K10 instance from 7.0.1 to 7.0.4 (with the crypto-svc issue reappearing as expected), collected the logs and opened a trial case on my.veeam.com (case #07345803). Thanks @Hagag again for your help!

