You can review my journey upgrading from 7.0.6 to 7.0.14 in this post.

I have a 200Gi PVC with about 155GB written, made up of tiny index files. Cloning the snapshot takes quite a bit of time, and Kasten would time out after 15 min of waiting. I fixed it by increasing the kanister.backupTimeout (KanisterBackupTimeout) parameter from 45 min to 150 min, and kanister.podReadyWaitTimeout (KanisterPodReadyWaitTimeout) from 15 min to 45 min.

During my upgrade to 7.0.14, I saw that these parameters were deprecated and replaced by timeout.blueprintBackup and timeout.workerPodReady respectively. So naturally I added the new ones to my helm upgrade command, and they were added to my k10-config ConfigMap.
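
For reference, the upgrade was run along these lines (a sketch rather than the exact command: the chart reference kasten/k10 is an assumption about the repo alias, while the release name, namespace, chart version, and timeout values match the ConfigMap shown further down):

# Sketch only: the kasten/k10 chart reference is assumed; release name, namespace,
# and values are taken from the ConfigMap below.
helm upgrade k10 kasten/k10 \
  --namespace kasten-io \
  --version 7.0.14 \
  --reuse-values \
  --set timeout.blueprintBackup=150 \
  --set timeout.workerPodReady=45 \
  --set kanister.backupTimeout=150 \
  --set kanister.podReadyWaitTimeout=45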

My ConfigMap now has both the old and new parameters for the worker pod timeout (the copy-vol-data-xxxxx pod) set to 45 min, but I am still getting timeout errors: “Pod did not transition into running state. Timeout:15m0s”

I have tried removing the old values from the ConfigMap as well, with no change. I have also tried the helm upgrade command with all four --set options, as well as manually deleting all of the pods, and Kasten is still not respecting the 45 min worker pod timeout.
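
The ConfigMap itself does hold the new value, so the 15m0s error points at the running services not picking it up rather than the value being missing. A minimal check, assuming kubectl access to the kasten-io namespace:

# Read the rendered worker pod timeout straight from the k10-config ConfigMap
kubectl -n kasten-io get configmap k10-config \
  -o jsonpath='{.data.K10TimeoutWorkerPodReady}'
# prints: 45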


With all the issues I had just upgrading from 7.0.6 to 7.0.14, I am afraid to upgrade to 7.5 before this is fixed. Any help is appreciated.

Error:

“Pod did not transition into running state. Timeout:15m0s”

K8s: 1.30.6
Longhorn: 1.7.2
Kasten: 7.0.14
k10-config ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: k10
    meta.helm.sh/release-namespace: kasten-io
  labels:
    app: k10
    app.kubernetes.io/instance: k10
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: k10
    helm.sh/chart: k10-7.0.14
    heritage: Helm
    release: k10
  name: k10-config
  namespace: kasten-io
data:
  AWSAssumeRoleDuration: 60m
  DataStoreFileLogLevel: ""
  DataStoreLogLevel: error
  K10BackupBufferFileHeadroomFactor: "1.1"
  K10DefaultPriorityClassName: ""
  K10EphemeralPVCOverhead: "0.1"
  K10ForceRootInBlueprintActions: "true"
  K10GCActionsEnabled: "false"
  K10GCDaemonPeriod: "21600"
  K10GCKeepMaxActions: "1000"
  K10LimiterCsiSnapshotRestoresPerAction: "3"
  K10LimiterCsiSnapshotsPerCluster: "10"
  K10LimiterDirectSnapshotsPerCluster: "10"
  K10LimiterExecutorThreads: "8"
  K10LimiterGenericVolumeBackupsPerCluster: "10"
  K10LimiterImageCopiesPerCluster: "10"
  K10LimiterSnapshotExportsPerAction: "3"
  K10LimiterSnapshotExportsPerCluster: "10"
  K10LimiterVolumeRestoresPerAction: "3"
  K10LimiterVolumeRestoresPerCluster: "10"
  K10LimiterWorkloadRestoresPerAction: "3"
  K10LimiterWorkloadSnapshotsPerAction: "5"
  K10MutatingWebhookTLSCertDir: /etc/ssl/certs/webhook
  K10PersistenceStorageClass: longhorn
  K10TimeoutBlueprintBackup: "150"
  K10TimeoutBlueprintDelete: "45"
  K10TimeoutBlueprintHooks: "20"
  K10TimeoutBlueprintRestore: "600"
  K10TimeoutCheckRepoPodReady: "20"
  K10TimeoutEFSRestorePodReady: "45"
  K10TimeoutJobWait: ""
  K10TimeoutStatsPodReady: "20"
  K10TimeoutWorkerPodReady: "45" # <<< NEW timeout parameter set
  KanisterBackupTimeout: "150"
  KanisterManagedDataServicesBlueprintsEnabled: "true"
  KanisterPodReadyWaitTimeout: "45" # <<< OLD timeout parameter set
  KanisterToolsImage: gcr.io/kasten-images/kanister-tools:7.0.14
  WorkerPodMetricSidecarCPULimit: ""
  WorkerPodMetricSidecarCPURequest: ""
  WorkerPodMetricSidecarEnabled: "true"
  WorkerPodMetricSidecarMemoryLimit: ""
  WorkerPodMetricSidecarMemoryRequest: ""
  WorkerPodMetricSidecarMetricLifetime: 2m
  WorkerPodPushgatewayMetricsInterval: 30s
  apiDomain: kio.kasten.io
  efsBackupVaultName: k10vault
  excludedApps: kube-system,kube-ingress,kube-node-lease,kube-public,kube-rook-ceph
  k10DataStoreDisableCompression: "false"
  k10DataStoreGeneralContentCacheSizeMB: "0"
  k10DataStoreGeneralMetadataCacheSizeMB: "500"
  k10DataStoreParallelDownload: "8"
  k10DataStoreParallelUpload: "8"
  k10DataStoreRestoreContentCacheSizeMB: "500"
  k10DataStoreRestoreMetadataCacheSizeMB: "500"
  kanisterFunctionVersion: v1.0.0-alpha
  kubeVirtVMsUnFreezeTimeout: 5m
  loglevel: info
  modelstoredirname: //mnt/k10state/kasten-io/
  multiClusterVersion: "2.5"
  quickDisasterRecoveryEnabled: "false"
  version: 7.0.14
  vmWareTaskTimeoutMin: "60"
  workerPodResourcesCRDEnabled: "false"

Please submit a support ticket so we can test and verify the behavior.

Hi Michael, I have submitted case 07537768. 


Hi @NPatel,
I have exactly the same problem with the same configuration: Kasten 7.0.14 and Longhorn CSI.
The copy-vol-data-XXXX pods only live for 15 minutes even though I set timeout.workerPodReady to a higher value.
Please let us know the solution provided by support.

Thank you


Hi @smartini, @NPatel,
I was able to recreate the issue and will get back to you soon with more details.

Thanks,
Ahmed Hagag

