
Hi all,

 

The infrastructure is as follows:

RKE2 Kubernetes cluster provisioned by Rancher

VMware vSphere 7.0.3

MinIO S3 bucket for exports

This is our default storage class, which uses a tag-based placement vSphere storage policy:

apiVersion: v1
items:
- allowVolumeExpansion: true
  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    annotations:
      meta.helm.sh/release-name: rancher-vsphere-csi
      meta.helm.sh/release-namespace: kube-system
      storageclass.kubernetes.io/is-default-class: "true"
    creationTimestamp: "2023-05-11T12:34:14Z"
    labels:
      app.kubernetes.io/managed-by: Helm
    name: vsphere-csi-sc
    resourceVersion: "731"
    uid: d91c57ea-3c0d-4fab-bcca-ae1adbd1d84e
  parameters:
    storagepolicyname: <REDACTED>
  provisioner: csi.vsphere.vmware.com
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
kind: List
metadata:
  resourceVersion: ""
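
As a sanity check on that configuration, the tag-based policy referenced by storagepolicyname can be confirmed from the vSphere side, roughly like this (a minimal sketch; it assumes a configured govc session, and <placement-tag> is a hypothetical placeholder for the real tag name):

# List SPBM storage policies and confirm the (redacted) policy name appears
govc storage.policy.ls

# Check which datastores carry the placement tag used by that policy
govc tags.attached.ls <placement-tag>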

 

When I perform a backup, it completes the first two steps of the “Backup” stage (i.e. “Snapshotting Workload” and “Snapshotting Application Configuration”), but it fails at the “Snapshotting Application Components” stage. The error seems to imply that it cannot find the volume within vSphere.

Below are the logs from the executor-svc pod.

{
  "File": "kasten.io/k10/kio/exec/internal/runner/runner.go",
  "Function": "kasten.io/k10/kio/exec/internal/runner.(*Runner).maybeExecJob",
  "JobID": "f4bfb713-4325-11ee-887a-4e5290b9f5f1",
  "Line": 230,
  "ManifestID": "f4bf47ba-4325-11ee-8358-4eafe08e0033",
  "QueuedJobID": "f4bfb713-4325-11ee-887a-4e5290b9f5f1",
  "RequestID": "8dbb15f3-4314-11ee-91c4-b2e3db1d0aa0",
  "SubjectRef": "kasten-io:nagios-db",
  "cluster_name": "09ad248c-168b-4943-9984-5d4498ee291b",
  "error": {
    "message": "Failed checking jobs in group",
    "function": "kasten.io/k10/kio/exec/phases/phase.(*queueAndWaitChildrenPhase).Run",
    "linenumber": 96,
    "file": "kasten.io/k10/kio/exec/phases/phase/queue_and_wait_children.go:96",
    "fields": [
      { "name": "manifestID", "value": "f4bf47ba-4325-11ee-8358-4eafe08e0033" },
      { "name": "jobID", "value": "f4bfb713-4325-11ee-887a-4e5290b9f5f1" },
      { "name": "groupIndex", "value": 0 }
    ],
    "cause": {
      "message": "Failure in snapshotting workload nagios-db",
      "function": "kasten.io/k10/kio/exec/phases/phase.(*queueAndWaitChildrenPhase).processGroup",
      "linenumber": 196,
      "file": "kasten.io/k10/kio/exec/phases/phase/queue_and_wait_children.go:196",
      "fields": [
        {
          "name": "FailedSubPhases",
          "value": [
            {
              "Phase": "Snapshotting Workload nagios-db",
              "Err": {
                "cause": {
                  "cause": {
                    "cause": {
                      "message": "Failure in snapshotting application components"
                    },
                    "fields": [
                      {
                        "name": "FailedSubPhases",
                        "value": [
                          {
                            "Err": {
                              "cause": {
                                "cause": {
                                  "cause": {
                                    "cause": {
                                      "cause": {
                                        "message": "Failed to query the disk: ServerFaultCode: The object or item referred to could not be found."
                                      },
                                      "fields": [
                                        { "name": "VolumeID", "value": "52ae2e54-cb86-4b0e-9af1-f4be68224e12" }
                                      ],
                                      "file": "kasten.io/k10/kio/exec/phases/phase/snapshot.go:636",
                                      "function": "kasten.io/k10/kio/exec/phases/phase.ProviderSnapshot",
                                      "linenumber": 636,
                                      "message": "Volume unavailable"
                                    },
                                    "fields": [
                                      { "name": "volumeName", "value": "nagios-mariadb" },
                                      { "name": "volumeNamespace", "value": "nagios-mariadb" }
                                    ],
                                    "file": "kasten.io/k10/kio/exec/phases/backup/snapshot_data_phase.go:848",
                                    "function": "kasten.io/k10/kio/exec/phases/backup.basicVolumeSnapshot.func1.1",
                                    "linenumber": 848,
                                    "message": "Error snapshotting volume"
                                  },
                                  "fields": [
                                    { "name": "appName", "value": "nagios-mariadb" },
                                    { "name": "appType", "value": "statefulset" },
                                    { "name": "namespace", "value": "nagios-mariadb" }
                                  ],
                                  "file": "kasten.io/k10/kio/exec/phases/backup/snapshot_data_phase.go:859",
                                  "function": "kasten.io/k10/kio/exec/phases/backup.basicVolumeSnapshot",
                                  "linenumber": 859,
                                  "message": "Failed to snapshot volumes"
                                },
                                "file": "kasten.io/k10/kio/exec/phases/backup/snapshot_data_phase.go:385",
                                "function": "kasten.io/k10/kio/exec/phases/backup.processVolumeArtifacts",
                                "linenumber": 385,
                                "message": "Failed snapshots for workload"
                              },
                              "fields": [],
                              "message": "Job failed to be executed"
                            },
                            "ID": "f50019b9-4325-11ee-8358-4eafe08e0033",
                            "Phase": "Snapshotting Application Components"
                          }
                        ]
                      }
                    ],
                    "file": "kasten.io/k10/kio/exec/phases/phase/queue_and_wait_children.go:196",
                    "function": "kasten.io/k10/kio/exec/phases/phase.(*queueAndWaitChildrenPhase).processGroup",
                    "linenumber": 196,
                    "message": "Failure in snapshotting application components"
                  },
                  "fields": [
                    { "name": "manifestID", "value": "f4fa5fba-4325-11ee-8358-4eafe08e0033" },
                    { "name": "jobID", "value": "f4fbe6ab-4325-11ee-887a-4e5290b9f5f1" },
                    { "name": "groupIndex", "value": 0 }
                  ],
                  "file": "kasten.io/k10/kio/exec/phases/phase/queue_and_wait_children.go:96",
                  "function": "kasten.io/k10/kio/exec/phases/phase.(*queueAndWaitChildrenPhase).Run",
                  "linenumber": 96,
                  "message": "Failed checking jobs in group"
                },
                "fields": [],
                "message": "Job failed to be executed"
              },
              "ID": "f4fa5fba-4325-11ee-8358-4eafe08e0033"
            }
          ]
        }
      ],
      "cause": {
        "message": "Failure in snapshotting workload nagios-db"
      }
    }
  },
  "hostname": "executor-svc-547b97c699-zpggs",
  "level": "error",
  "msg": "Job failed",
  "time": "2023-08-25T09:04:56.848Z",
  "version": "6.0.5"
}

When I use govc to find the volume using the VolumeID from the logs above, it is valid:

govc volume.ls | grep 52ae2e54-cb86-4b0e-9af1-f4be68224e12
52ae2e54-cb86-4b0e-9af1-f4be68224e12 pvc-b6f70289-5c79-4d20-bfa5-bac1d2f60190
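
For completeness, this is roughly how I cross-checked that the PV behind the failing PVC maps to that CNS volume (a minimal sketch; it assumes kubectl access to the cluster and the same govc session, and reuses the PV name and VolumeID from the output above):

# The CSI volume handle on the PV should match the CNS volume ID from the error
kubectl get pv pvc-b6f70289-5c79-4d20-bfa5-bac1d2f60190 -o jsonpath='{.spec.csi.volumeHandle}{"\n"}'

# The same ID should also resolve to a first-class disk (FCD) on the datastore
govc disk.ls 52ae2e54-cb86-4b0e-9af1-f4be68224e12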

It’s also important to note that although the k8s cluster is deployed using Rancher, I did not use Rancher’s partner repository to install Kasten.  It was installed from Kasten’s own Helm repo.
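
For reference, the install followed Kasten's standard Helm instructions, roughly along these lines (a minimal sketch; the release name k10 and the kasten-io namespace are the documented defaults, not necessarily the exact values I used):

helm repo add kasten https://charts.kasten.io/
helm repo update
helm install k10 kasten/k10 --namespace kasten-io --create-namespace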

 

My end goal for this is to export these backups to Veeam Backup & Replication 12.

 

Please let me know if you need any further info.

Any assistance is gratefully appreciated.

 

Matt

Does anyone have any idea how I can fix this issue, or any pointers to get me going in the right direction?

 

Matt


I ended up solving this one myself.

I had orphaned k8s volumes in vSphere that needed to be removed, which I tracked down with:

govc disk.ls -R

I also removed old disk snapshots using govc.
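
For anyone hitting the same thing, the cleanup was roughly along these lines (a minimal sketch; <disk-id> and <snapshot-id> are placeholders, and disk.rm / disk.snapshot.rm are destructive, so check every ID against govc volume.ls and your PVs first):

# Reconcile the datastore's first-class disk (FCD) catalogue and list the disks
govc disk.ls -R

# List snapshots on a given disk, then remove the stale ones
govc disk.snapshot.ls <disk-id>
govc disk.snapshot.rm <disk-id> <snapshot-id>

# Remove a disk that is orphaned (no longer backing any PV/CNS volume)
govc disk.rm <disk-id>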

 

By clearing all these down, the snapshots started working.

