You can review my journey upgrading from 7.0.6 to 7.0.14 in this post.

I have a 200Gi PVC with about 155GB written, made up of tiny little index files. Cloning the snapshot takes quite a bit of time, and Kasten would time out after 15min of waiting. I fixed it by increasing the kanister.backupTimeout (KanisterBackupTimeout) parameter from 45min to 150min and kanister.podReadyWaitTimeout (KanisterPodReadyWaitTimeout) from 15min to 45min.
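
For reference, the helm upgrade I used looked roughly like this (my release is named k10 in the kasten-io namespace; adjust the chart reference to your setup):

helm upgrade k10 kasten/k10 --namespace=kasten-io --reuse-values \
  --set kanister.backupTimeout=150 \
  --set kanister.podReadyWaitTimeout=45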

During my upgrade journey to 7.0.14, I saw that these parameters were deprecated and replaced by timeout.blueprintBackup and timeout.workerPodReady respectively. So naturally I added them to my helm upgrade command, and they showed up in my k10-config ConfigMap.
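
The new flags I added were roughly the following (same values in minutes as the old ones, applied on top of my existing settings):

helm upgrade k10 kasten/k10 --namespace=kasten-io --reuse-values \
  --set timeout.blueprintBackup=150 \
  --set timeout.workerPodReady=45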

My ConfigMap now has both the old and new parameters for the Worker Pod Timeout (copy-vol-data-xxxxx pod) set to 45min, but I am now getting timeout errors: “Pod did not transition into running state. Timeout:15m0s”

I have tried removing the old values from the ConfigMap as well, with no change. I have also tried the helm upgrade command with all four --set options, as well as manually deleting all of the pods, and Kasten is still not respecting the 45min Worker Pod Timeout.
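
For anyone wanting to double-check their own settings, this is how I verified what actually landed in the ConfigMap (plain kubectl, nothing Kasten-specific):

kubectl -n kasten-io get configmap k10-config -o yaml | grep -iE 'timeout|kanister'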


With all the issues I had just upgrading from 7.0.6 to 7.0.14, I am afraid to upgrade to 7.5 before this is fixed. Any help is appreciated.

Error: “Pod did not transition into running state. Timeout:15m0s”

K8s: 1.30.6
Longhorn: 1.7.2
Kasten: 7.0.14
k10-config ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: k10
    meta.helm.sh/release-namespace: kasten-io
  labels:
    app: k10
    app.kubernetes.io/instance: k10
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: k10
    helm.sh/chart: k10-7.0.14
    heritage: Helm
    release: k10
  name: k10-config
  namespace: kasten-io
data:
  AWSAssumeRoleDuration: 60m
  DataStoreFileLogLevel: ""
  DataStoreLogLevel: error
  K10BackupBufferFileHeadroomFactor: "1.1"
  K10DefaultPriorityClassName: ""
  K10EphemeralPVCOverhead: "0.1"
  K10ForceRootInBlueprintActions: "true"
  K10GCActionsEnabled: "false"
  K10GCDaemonPeriod: "21600"
  K10GCKeepMaxActions: "1000"
  K10LimiterCsiSnapshotRestoresPerAction: "3"
  K10LimiterCsiSnapshotsPerCluster: "10"
  K10LimiterDirectSnapshotsPerCluster: "10"
  K10LimiterExecutorThreads: "8"
  K10LimiterGenericVolumeBackupsPerCluster: "10"
  K10LimiterImageCopiesPerCluster: "10"
  K10LimiterSnapshotExportsPerAction: "3"
  K10LimiterSnapshotExportsPerCluster: "10"
  K10LimiterVolumeRestoresPerAction: "3"
  K10LimiterVolumeRestoresPerCluster: "10"
  K10LimiterWorkloadRestoresPerAction: "3"
  K10LimiterWorkloadSnapshotsPerAction: "5"
  K10MutatingWebhookTLSCertDir: /etc/ssl/certs/webhook
  K10PersistenceStorageClass: longhorn
  K10TimeoutBlueprintBackup: "150"
  K10TimeoutBlueprintDelete: "45"
  K10TimeoutBlueprintHooks: "20"
  K10TimeoutBlueprintRestore: "600"
  K10TimeoutCheckRepoPodReady: "20"
  K10TimeoutEFSRestorePodReady: "45"
  K10TimeoutJobWait: ""
  K10TimeoutStatsPodReady: "20"
  K10TimeoutWorkerPodReady: "45" # <<<<<<<<<< NEW TIMEOUT SET
  KanisterBackupTimeout: "150"
  KanisterManagedDataServicesBlueprintsEnabled: "true"
  KanisterPodReadyWaitTimeout: "45" # <<<<<<<<<< OLD TIMEOUT SET
  KanisterToolsImage: gcr.io/kasten-images/kanister-tools:7.0.14
  WorkerPodMetricSidecarCPULimit: ""
  WorkerPodMetricSidecarCPURequest: ""
  WorkerPodMetricSidecarEnabled: "true"
  WorkerPodMetricSidecarMemoryLimit: ""
  WorkerPodMetricSidecarMemoryRequest: ""
  WorkerPodMetricSidecarMetricLifetime: 2m
  WorkerPodPushgatewayMetricsInterval: 30s
  apiDomain: kio.kasten.io
  efsBackupVaultName: k10vault
  excludedApps: kube-system,kube-ingress,kube-node-lease,kube-public,kube-rook-ceph
  k10DataStoreDisableCompression: "false"
  k10DataStoreGeneralContentCacheSizeMB: "0"
  k10DataStoreGeneralMetadataCacheSizeMB: "500"
  k10DataStoreParallelDownload: "8"
  k10DataStoreParallelUpload: "8"
  k10DataStoreRestoreContentCacheSizeMB: "500"
  k10DataStoreRestoreMetadataCacheSizeMB: "500"
  kanisterFunctionVersion: v1.0.0-alpha
  kubeVirtVMsUnFreezeTimeout: 5m
  loglevel: info
  modelstoredirname: //mnt/k10state/kasten-io/
  multiClusterVersion: "2.5"
  quickDisasterRecoveryEnabled: "false"
  version: 7.0.14
  vmWareTaskTimeoutMin: "60"
  workerPodResourcesCRDEnabled: "false"

Please submit a support ticket so we can test and verify the behavior.

Hi Michael, I have submitted case 07537768.


Hi @NPatel,
I have exactly the same problem with the same configuration: Kasten 7.0.14 and Longhorn CSI.
The copy-vol-data-XXXX pods only live for 15 minutes even though I set the timeout.workerPodReady to a higher value.
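
To see what I mean, I simply watch the worker pods while an export runs (plain kubectl, filtering on the pod name prefix):

kubectl -n kasten-io get pods -w | grep copy-vol-data
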
Please let us know the solution provided by support.

Thank you


Hi @smartini @NPatel,
I was able to recreate the issue and will get back to you soon with more details.

Thanks,
Ahmed Hagag


Hi @Hagag @NPatel,
do you have news about this issue?

Thanks

 


Hi @smartini

The fix should be available in the next release, 7.5.2. Please keep monitoring our release notes page and upgrade K10.

https://docs.kasten.io/latest/releasenotes.html
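
Once 7.5.2 is published, the upgrade itself is a standard helm operation; roughly (assuming the usual Kasten helm repo and an existing k10 release in kasten-io):

helm repo update
helm upgrade k10 kasten/k10 --namespace=kasten-io --reuse-values --version=7.5.2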

Thanks


Thank you @Hagag.


Hi @smartini @Hagag

I can confirm the fix for timeout.workerPodReady (K10TimeoutWorkerPodReady) in 7.5.2 is working correctly. I was able to upgrade directly from 7.0.14 to 7.5.2 without issue. My large PV grew 20GB during this ticket, and the now-175GB PV not only exported successfully, but did so 15min faster than before.

