Solved

Kasten 7.0.14 not respecting new parameters?


NPatel
  • Comes here often
  • 7 comments

You can review my journey upgrading from 7.0.6 to 7.0.14 in this post.

I have a 200Gi PVC with 155GB written, made up of tiny index files. Cloning the snapshot takes quite a bit of time, and Kasten would time out after 15 minutes of waiting. I fixed that by increasing the kanister.backupTimeout (KanisterBackupTimeout) parameter from 45 to 150 minutes and kanister.podReadyWaitTimeout (KanisterPodReadyWaitTimeout) from 15 to 45 minutes.
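For reference, the change was just a matter of passing the two (now legacy) values to helm, roughly like this; a sketch, with the k10 release name, kasten-io namespace, and kasten/k10 chart assumed from my install, and the values in minutes:

# Sketch: raise the legacy Kanister timeouts (values are minutes)
helm upgrade k10 kasten/k10 --namespace kasten-io --reuse-values \
  --set kanister.backupTimeout=150 \
  --set kanister.podReadyWaitTimeout=45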

During my upgrade journey to 7.0.14, I saw that these parameters were deprecated and replaced by timeout.blueprintBackup and timeout.workerPodReady respectively. So naturally I added them to my helm upgrade command, and they were added to my k10-config ConfigMap.
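The new values I passed were along these lines (again a sketch against the same release, namespace, and chart; values in minutes):

# Sketch: the replacement timeouts introduced in 7.0.14
helm upgrade k10 kasten/k10 --namespace kasten-io --reuse-values \
  --set timeout.blueprintBackup=150 \
  --set timeout.workerPodReady=45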

My ConfigMap now has both the old and new parameters for the Worker Pod timeout (the copy-vol-data-xxxxx pod) set to 45 minutes, but I am still getting timeout errors: “Pod did not transition into running state. Timeout:15m0s”.

I have tried removing the old values from the ConfigMap as well, with no change. I have also run the helm upgrade command with all four --set options and manually deleted all of the pods, and Kasten is still not respecting the 45-minute Worker Pod timeout.
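The manual restart was nothing fancier than deleting the pods in the Kasten namespace so everything re-reads k10-config (a sketch, assuming the default kasten-io namespace):

# Sketch: force the K10 services to restart and pick up k10-config again
kubectl --namespace kasten-io delete pods --all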


With all the issues I had just upgrading from 7.0.6 to 7.0.14, I am afraid to upgrade to 7.5 before this is fixed. Any help is appreciated.

Error (screenshot): “Pod did not transition into running state. Timeout:15m0s”

K8s: 1.30.6
Longhorn: 1.7.2
Kasten: 7.0.14
k10-config ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: k10
    meta.helm.sh/release-namespace: kasten-io
  labels:
    app: k10
    app.kubernetes.io/instance: k10
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: k10
    helm.sh/chart: k10-7.0.14
    heritage: Helm
    release: k10
  name: k10-config
  namespace: kasten-io
data:
  AWSAssumeRoleDuration: 60m
  DataStoreFileLogLevel: ""
  DataStoreLogLevel: error
  K10BackupBufferFileHeadroomFactor: "1.1"
  K10DefaultPriorityClassName: ""
  K10EphemeralPVCOverhead: "0.1"
  K10ForceRootInBlueprintActions: "true"
  K10GCActionsEnabled: "false"
  K10GCDaemonPeriod: "21600"
  K10GCKeepMaxActions: "1000"
  K10LimiterCsiSnapshotRestoresPerAction: "3"
  K10LimiterCsiSnapshotsPerCluster: "10"
  K10LimiterDirectSnapshotsPerCluster: "10"
  K10LimiterExecutorThreads: "8"
  K10LimiterGenericVolumeBackupsPerCluster: "10"
  K10LimiterImageCopiesPerCluster: "10"
  K10LimiterSnapshotExportsPerAction: "3"
  K10LimiterSnapshotExportsPerCluster: "10"
  K10LimiterVolumeRestoresPerAction: "3"
  K10LimiterVolumeRestoresPerCluster: "10"
  K10LimiterWorkloadRestoresPerAction: "3"
  K10LimiterWorkloadSnapshotsPerAction: "5"
  K10MutatingWebhookTLSCertDir: /etc/ssl/certs/webhook
  K10PersistenceStorageClass: longhorn
  K10TimeoutBlueprintBackup: "150"
  K10TimeoutBlueprintDelete: "45"
  K10TimeoutBlueprintHooks: "20"
  K10TimeoutBlueprintRestore: "600"
  K10TimeoutCheckRepoPodReady: "20"
  K10TimeoutEFSRestorePodReady: "45"
  K10TimeoutJobWait: ""
  K10TimeoutStatsPodReady: "20"
  K10TimeoutWorkerPodReady: "45"    #<<<<<<<<<<<<<<<<<<<<<<<<<<<<NEW TIMEOUT SET
  KanisterBackupTimeout: "150"
  KanisterManagedDataServicesBlueprintsEnabled: "true"
  KanisterPodReadyWaitTimeout: "45"    #<<<<<<<<<<<<<<<<<<<<<<<<<OLD TIMEOUT SET
  KanisterToolsImage: gcr.io/kasten-images/kanister-tools:7.0.14
  WorkerPodMetricSidecarCPULimit: ""
  WorkerPodMetricSidecarCPURequest: ""
  WorkerPodMetricSidecarEnabled: "true"
  WorkerPodMetricSidecarMemoryLimit: ""
  WorkerPodMetricSidecarMemoryRequest: ""
  WorkerPodMetricSidecarMetricLifetime: 2m
  WorkerPodPushgatewayMetricsInterval: 30s
  apiDomain: kio.kasten.io
  efsBackupVaultName: k10vault
  excludedApps: kube-system,kube-ingress,kube-node-lease,kube-public,kube-rook-ceph
  k10DataStoreDisableCompression: "false"
  k10DataStoreGeneralContentCacheSizeMB: "0"
  k10DataStoreGeneralMetadataCacheSizeMB: "500"
  k10DataStoreParallelDownload: "8"
  k10DataStoreParallelUpload: "8"
  k10DataStoreRestoreContentCacheSizeMB: "500"
  k10DataStoreRestoreMetadataCacheSizeMB: "500"
  kanisterFunctionVersion: v1.0.0-alpha
  kubeVirtVMsUnFreezeTimeout: 5m
  loglevel: info
  modelstoredirname: //mnt/k10state/kasten-io/
  multiClusterVersion: "2.5"
  quickDisasterRecoveryEnabled: "false"
  version: 7.0.14
  vmWareTaskTimeoutMin: "60"
  workerPodResourcesCRDEnabled: "false"
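(For anyone wanting to double-check what the chart actually rendered, the two worker-pod-ready keys can be read straight from the ConfigMap, something like:)

# Sketch: print the new and the old worker-pod-ready timeout values
kubectl --namespace kasten-io get configmap k10-config \
  -o jsonpath='{.data.K10TimeoutWorkerPodReady}{" / "}{.data.KanisterPodReadyWaitTimeout}{"\n"}'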

Best answer by Hagag

Hi @smartini,

The fix should be available in the next release, 7.5.2. Please keep monitoring our release notes page and upgrade K10.

https://docs.kasten.io/latest/releasenotes.html

Thanks


8 comments

michaelxue
  • Comes here often
  • 13 comments
  • December 9, 2024

Please submit a support ticket so we can test and verify the behavior.


NPatel
  • Author
  • Comes here often
  • 7 comments
  • December 11, 2024
michaelxue wrote:

Please submit a support ticket so we can test and verify the behavior.


Hi Michael, I have submitted case 07537768. 


smartini
  • New Here
  • 3 comments
  • December 12, 2024

Hi @NPatel,
I have exactly the same problem with the same configuration: Kasten 7.0.14 and Longhorn CSI.
The copy-vol-data-XXXX pods only live for 15 minutes even though I set timeout.workerPodReady to a higher value.
Please let us know the solution provided by the support.

Thank you
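P.S. This is how I am watching the pod lifetime, for anyone reproducing this (a sketch that simply filters on the pod name prefix across all namespaces):

# Sketch: watch how long the copy-vol-data worker pods actually live
kubectl get pods --all-namespaces --watch | grep copy-vol-data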


Hagag
  • Experienced User
  • 154 comments
  • December 15, 2024

Hi @smartini, @NPatel,
I was able to recreate the issue and will get back to you soon with more details.

Thanks,
Ahmed Hagag


smartini
  • New Here
  • 3 comments
  • January 9, 2025

Hi @Hagag, @NPatel,
Do you have any news about this issue?

Thanks


Hagag
  • Experienced User
  • 154 comments
  • Answer
  • January 9, 2025

Hi @smartini,

The fix should be available in the next release, 7.5.2. Please keep monitoring our release notes page and upgrade K10.

https://docs.kasten.io/latest/releasenotes.html

Thanks


smartini
  • New Here
  • 3 comments
  • January 9, 2025

Thank you, @Hagag.


NPatel
  • Author
  • Comes here often
  • 7 comments
  • January 19, 2025

Hi @smartini, @Hagag,

I can confirm the fix in 7.5.2 for timeout.workerPodReady (K10TimeoutWorkerPodReady) is working correctly. I was able to upgrade directly from 7.0.14 to 7.5.2 without issue. My large PV has grown by 20GB during this ticket, and the now-175GB PV not only exported successfully but did so 15 minutes faster than before.
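For anyone else holding off, the direct upgrade itself was just a normal helm upgrade pinned to the new chart version (a sketch, assuming the kasten/k10 repo and my k10 release in kasten-io):

# Sketch: upgrade the existing release straight to 7.5.2
helm repo update
helm upgrade k10 kasten/k10 --namespace kasten-io --reuse-values --version 7.5.2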

