Solved

Kasten 7.0.14 not respecting new parameters?


NPatel
  • Comes here often
  • 7 comments

You can review my journey upgrading from 7.0.6 to 7.0.14 in this post.

I have a 200Gi PVC with 155GB written, made up of tiny index files. Cloning the snapshot takes quite a bit of time, and Kasten would time out after 15 minutes of waiting. I fixed that by increasing the kanister.backupTimeout (KanisterBackupTimeout) parameter from 45 to 150 minutes and kanister.podReadyWaitTimeout (KanisterPodReadyWaitTimeout) from 15 to 45 minutes.
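For reference, the change was just a matter of passing the two (now legacy) values to helm, roughly like this; a sketch, with the k10 release name, kasten-io namespace, and kasten/k10 chart assumed from my install, and the values in minutes:

# Sketch: raise the legacy Kanister timeouts (values are minutes)
helm upgrade k10 kasten/k10 --namespace kasten-io --reuse-values \
  --set kanister.backupTimeout=150 \
  --set kanister.podReadyWaitTimeout=45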

During my upgrade journey to 7.0.14, I saw that these parameters were deprecated and replaced by timeout.blueprintBackup and timeout.workerPodReady respectively. So naturally I added them to my helm upgrade command, and they were added to my k10-config ConfigMap.
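The new values I passed were along these lines (again a sketch against the same release, namespace, and chart; values in minutes):

# Sketch: the replacement timeouts introduced in 7.0.14
helm upgrade k10 kasten/k10 --namespace kasten-io --reuse-values \
  --set timeout.blueprintBackup=150 \
  --set timeout.workerPodReady=45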

My ConfigMap now has both the old and new parameters for the Worker Pod timeout (the copy-vol-data-xxxxx pod) set to 45 minutes, but I am still getting timeout errors: “Pod did not transition into running state. Timeout:15m0s”.

I have tried removing the old values from the ConfigMap as well, with no change. I have also run the helm upgrade command with all four --set options and manually deleted all of the pods, and Kasten is still not respecting the 45-minute Worker Pod timeout.
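The manual restart was nothing fancier than deleting the pods in the Kasten namespace so everything re-reads k10-config (a sketch, assuming the default kasten-io namespace):

# Sketch: force the K10 services to restart and pick up k10-config again
kubectl --namespace kasten-io delete pods --all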


With all the issues I had just upgrading from 7.0.6 to 7.0.14, I am afraid to upgrade to 7.5 before this is fixed. Any help is appreciated.

Error (screenshot): “Pod did not transition into running state. Timeout:15m0s”

K8s: 1.30.6
Longhorn: 1.7.2
Kasten: 7.0.14
k10-config ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: k10
    meta.helm.sh/release-namespace: kasten-io
  labels:
    app: k10
    app.kubernetes.io/instance: k10
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: k10
    helm.sh/chart: k10-7.0.14
    heritage: Helm
    release: k10
  name: k10-config
  namespace: kasten-io
data:
  AWSAssumeRoleDuration: 60m
  DataStoreFileLogLevel: ""
  DataStoreLogLevel: error
  K10BackupBufferFileHeadroomFactor: "1.1"
  K10DefaultPriorityClassName: ""
  K10EphemeralPVCOverhead: "0.1"
  K10ForceRootInBlueprintActions: "true"
  K10GCActionsEnabled: "false"
  K10GCDaemonPeriod: "21600"
  K10GCKeepMaxActions: "1000"
  K10LimiterCsiSnapshotRestoresPerAction: "3"
  K10LimiterCsiSnapshotsPerCluster: "10"
  K10LimiterDirectSnapshotsPerCluster: "10"
  K10LimiterExecutorThreads: "8"
  K10LimiterGenericVolumeBackupsPerCluster: "10"
  K10LimiterImageCopiesPerCluster: "10"
  K10LimiterSnapshotExportsPerAction: "3"
  K10LimiterSnapshotExportsPerCluster: "10"
  K10LimiterVolumeRestoresPerAction: "3"
  K10LimiterVolumeRestoresPerCluster: "10"
  K10LimiterWorkloadRestoresPerAction: "3"
  K10LimiterWorkloadSnapshotsPerAction: "5"
  K10MutatingWebhookTLSCertDir: /etc/ssl/certs/webhook
  K10PersistenceStorageClass: longhorn
  K10TimeoutBlueprintBackup: "150"
  K10TimeoutBlueprintDelete: "45"
  K10TimeoutBlueprintHooks: "20"
  K10TimeoutBlueprintRestore: "600"
  K10TimeoutCheckRepoPodReady: "20"
  K10TimeoutEFSRestorePodReady: "45"
  K10TimeoutJobWait: ""
  K10TimeoutStatsPodReady: "20"
  K10TimeoutWorkerPodReady: "45"    #<<<<<<<<<<<<<<<<<<<<<<<<<<<<NEW TIMEOUT SET
  KanisterBackupTimeout: "150"
  KanisterManagedDataServicesBlueprintsEnabled: "true"
  KanisterPodReadyWaitTimeout: "45"    #<<<<<<<<<<<<<<<<<<<<<<<<<OLD TIMEOUT SET
  KanisterToolsImage: gcr.io/kasten-images/kanister-tools:7.0.14
  WorkerPodMetricSidecarCPULimit: ""
  WorkerPodMetricSidecarCPURequest: ""
  WorkerPodMetricSidecarEnabled: "true"
  WorkerPodMetricSidecarMemoryLimit: ""
  WorkerPodMetricSidecarMemoryRequest: ""
  WorkerPodMetricSidecarMetricLifetime: 2m
  WorkerPodPushgatewayMetricsInterval: 30s
  apiDomain: kio.kasten.io
  efsBackupVaultName: k10vault
  excludedApps: kube-system,kube-ingress,kube-node-lease,kube-public,kube-rook-ceph
  k10DataStoreDisableCompression: "false"
  k10DataStoreGeneralContentCacheSizeMB: "0"
  k10DataStoreGeneralMetadataCacheSizeMB: "500"
  k10DataStoreParallelDownload: "8"
  k10DataStoreParallelUpload: "8"
  k10DataStoreRestoreContentCacheSizeMB: "500"
  k10DataStoreRestoreMetadataCacheSizeMB: "500"
  kanisterFunctionVersion: v1.0.0-alpha
  kubeVirtVMsUnFreezeTimeout: 5m
  loglevel: info
  modelstoredirname: //mnt/k10state/kasten-io/
  multiClusterVersion: "2.5"
  quickDisasterRecoveryEnabled: "false"
  version: 7.0.14
  vmWareTaskTimeoutMin: "60"
  workerPodResourcesCRDEnabled: "false"
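(For anyone wanting to double-check what the chart actually rendered, the two worker-pod-ready keys can be read straight from the ConfigMap, something like:)

# Sketch: print the new and the old worker-pod-ready timeout values
kubectl --namespace kasten-io get configmap k10-config \
  -o jsonpath='{.data.K10TimeoutWorkerPodReady}{" / "}{.data.KanisterPodReadyWaitTimeout}{"\n"}'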

Best answer by Hagag

Hi @smartini,

The fix should be available in the next release, 7.5.2. Please keep monitoring our release notes page and upgrade K10.

https://docs.kasten.io/latest/releasenotes.html

Thanks


8 comments

michaelxue
  • Comes here often
  • 13 comments
  • December 9, 2024

Please submit a support ticket so we can test and verify the behavior.


NPatel
  • Author
  • Comes here often
  • 7 comments
  • December 11, 2024
michaelxue wrote:

Please submit a support ticket so we can test and verify the behavior.


Hi Michael, I have submitted case 07537768. 


smartini
  • New Here
  • 3 comments
  • December 12, 2024

Hi @NPatel,
I have exactly the same problem with the same configuration: Kasten 7.0.14 and Longhorn CSI.
The copy-vol-data-XXXX pods only live for 15 minutes even though I set timeout.workerPodReady to a higher value.
Please let us know the solution provided by the support.

Thank you
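P.S. This is how I am watching the pod lifetime, for anyone reproducing this (a sketch that simply filters on the pod name prefix across all namespaces):

# Sketch: watch how long the copy-vol-data worker pods actually live
kubectl get pods --all-namespaces --watch | grep copy-vol-data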


Hagag
  • Experienced User
  • 154 comments
  • December 15, 2024

Hi @smartini, @NPatel,
I was able to recreate the issue and will get back to you soon with more details.

Thanks,
Ahmed Hagag


smartini
  • New Here
  • 3 comments
  • January 9, 2025

Hi @Hagag, @NPatel,
Do you have any news about this issue?

Thanks


Hagag
  • Experienced User
  • 154 comments
  • Answer
  • January 9, 2025

Hi @smartini,

The fix should be available in the next release, 7.5.2. Please keep monitoring our release notes page and upgrade K10.

https://docs.kasten.io/latest/releasenotes.html

Thanks


smartini
  • New Here
  • 3 comments
  • January 9, 2025

Thank you, @Hagag.


NPatel
  • Author
  • Comes here often
  • 7 comments
  • January 19, 2025

Hi @smartini, @Hagag,

I can confirm the fix in 7.5.2 for timeout.workerPodReady (K10TimeoutWorkerPodReady) is working correctly. I was able to upgrade directly from 7.0.14 to 7.5.2 without issue. My large PV has grown by 20GB during this ticket, and the now-175GB PV not only exported successfully but did so 15 minutes faster than before.
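For anyone else holding off, the direct upgrade itself was just a normal helm upgrade pinned to the new chart version (a sketch, assuming the kasten/k10 repo and my k10 release in kasten-io):

# Sketch: upgrade the existing release straight to 7.5.2
helm repo update
helm upgrade k10 kasten/k10 --namespace kasten-io --reuse-values --version 7.5.2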

