Question

Kasten: exit code 137 in block-mode-upload pod on large backup


Userlevel 2

Hi,

 

We would like to back up some large PVCs (over 5 TB) running on vSAN CSI.

Snapshots run fine, but the block-mode export fails after about 20 minutes with exit code 137.

I thought it was a memory problem, but looking at system consumption I don't see any overconsumption.

 

Here are the last log lines from “/tmp/vmware-root/vixDiskLib-42.log” in the block-mode-upload pod:

 

2023-05-10T08:31:20.956Z In(05) host-45 VixDiskLib: VixDiskLib_OpenEx: Open a disk.
2023-05-10T08:31:24.241Z In(05) host-45 NBD_ClientOpen: attempting to create connection to SKIPZ-ha-nfc://[vsanDatastore] 84047962-d2a8-e4d6-f5c8-5cba2c166f30/_00fa/fc40568c02fd4d9d9dfaa0be8bad30db-000008.vmdk@esx-XX.vmware.XXX:902!5292837d-ceaf-3075-0877-4b6381dc970b
2023-05-10T08:31:24.391Z In(05) host-45 Setting NFC log level to 1
2023-05-10T08:31:24.391Z In(05) host-45 Setting NFC log level to 1
2023-05-10T08:31:24.392Z In(05) host-45 NFC Async IO session is established for '[vsanDatastore] 84047962-d2a8-e4d6-f5c8-5cba2c166f30/_00fa/fc40568c02fd4d9d9dfaa0be8bad30db-000008.vmdk' with log level 1.
2023-05-10T08:31:24.393Z In(05) host-45 NfcAioOpenSession: the socket options client snd buffer size 87040, rcv buffer size 131072.
2023-05-10T08:31:24.393Z In(05) host-45 NfcAioOpenSession: the socket options server snd buffer size 753750, rcv buffer size 753750.
2023-05-10T08:31:24.394Z In(05) host-45 Opening file [vsanDatastore] 84047962-d2a8-e4d6-f5c8-5cba2c166f30/_00fa/fc40568c02fd4d9d9dfaa0be8bad30db-000008.vmdk (SKIPZ-ha-nfc://[vsanDatastore] 84047962-d2a8-e4d6-f5c8-5cba2c166f30/_00fa/fc40568c02fd4d9d9dfaa0be8bad30db-000008.vmdk@esx-XX.vmware.XXX:902!5292837d-ceaf-3075-0877-4b6381dc970b)
2023-05-10T08:31:24.489Z In(05) host-45 Nbd_ClientSetCallback: Set callback on '[vsanDatastore] 84047962-d2a8-e4d6-f5c8-5cba2c166f30/_00fa/fc40568c02fd4d9d9dfaa0be8bad30db-000008.vmdk' with type 0.
2023-05-10T08:31:24.489Z In(05) host-45 VixDiskLib: VixDiskLib_GetInfo: Retrieve disk info.
2023-05-10T08:31:24.497Z In(05) host-45 VixDiskLib: VixDiskLib_FreeInfo: Clean up VixDiskLib.
2023-05-10T08:31:24.498Z In(05) host-46 VixDiskLib: VixDiskLib_QueryAllocatedBlocks: Query allocated blocks.
2023-05-10T08:31:25.021Z In(05) host-46 VixDiskLib: VixDiskLib_QueryAllocatedBlocks: Query allocated blocks.
2023-05-10T08:31:25.458Z In(05) host-46 VixDiskLib: VixDiskLib_QueryAllocatedBlocks: Query allocated blocks.
2023-05-10T08:31:25.834Z In(05) host-46 VixDiskLib: VixDiskLib_QueryAllocatedBlocks: Query allocated blocks.
2023-05-10T08:31:26.237Z In(05) host-46 VixDiskLib: VixDiskLib_QueryAllocatedBlocks: Query allocated blocks.
2023-05-10T08:31:26.654Z In(05) host-46 VixDiskLib: VixDiskLib_QueryAllocatedBlocks: Query allocated blocks.
2023-05-10T08:31:27.021Z In(05) host-50 VixDiskLib: VixDiskLib_QueryAllocatedBlocks: Query allocated blocks.
2023-05-10T08:31:27.458Z In(05) host-50 VixDiskLib: VixDiskLib_QueryAllocatedBlocks: Query allocated blocks.
2023-05-10T08:31:27.825Z In(05) host-50 VixDiskLib: VixDiskLib_QueryAllocatedBlocks: Query allocated blocks.
2023-05-10T08:31:28.170Z In(05) host-50 VixDiskLib: VixDiskLib_QueryAllocatedBlocks: Query allocated blocks.
2023-05-10T08:54:13.642Z In(05) host-70 DISKLIB-LIB : numIOs = 50000 numMergedIOs = 0 numSplitIOs = 0
 
command terminated with exit code 137
 
 
Thank you for your help!
 
Have a nice day
 

6 comments

Userlevel 5
Badge +2

Hello @Florian Lacrampe 
When a pod terminates with the "command terminated with exit code 137" error, the container received SIGKILL; most commonly this means it exhausted its resource limits and was OOM-killed. To resolve this issue, you may consider raising its resource limits and trying the command again.
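
As a quick check, the pod's last termination state will say whether the kernel OOM-killed it. A minimal sketch (the pod name below is a placeholder; substitute the actual block-mode-upload pod in the kasten-io namespace):

# Show reason and exit code of the last terminated container state
kubectl -n kasten-io get pod <block-mode-upload-pod> -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{" (exit "}{.lastState.terminated.exitCode}{")"}{"\n"}{end}'

# Or scan the human-readable view for OOMKilled
kubectl -n kasten-io describe pod <block-mode-upload-pod> | grep -i -A3 "last state"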

The following Helm parameters can be used to accomplish this:

“The numbers below are solely for illustrative purposes. Please adjust them according to the available resources on your cluster.”

 

 

--set genericVolumeSnapshot.resources.requests.cpu=100m
--set genericVolumeSnapshot.resources.requests.memory=800Mi
--set genericVolumeSnapshot.resources.limits.cpu=1200m
--set genericVolumeSnapshot.resources.limits.memory=4000Mi
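
For example, applied to an existing install (a sketch only: it assumes K10 was deployed as Helm release k10 from the kasten/k10 chart into the kasten-io namespace; --reuse-values keeps all your other settings):

helm upgrade k10 kasten/k10 --namespace kasten-io --reuse-values \
  --set genericVolumeSnapshot.resources.requests.cpu=100m \
  --set genericVolumeSnapshot.resources.requests.memory=800Mi \
  --set genericVolumeSnapshot.resources.limits.cpu=1200m \
  --set genericVolumeSnapshot.resources.limits.memory=4000Mi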

 

Userlevel 2

Hello @Hagag 

Unfortunately, we see the same behavior with these settings:

genericVolumeSnapshot:
  resources:
    requests:
      memory: "4G"
      cpu: "4000m"
    limits:
      memory: "4G"
      cpu: "4000m"

 

Here are the last log lines from “/tmp/vmware-root/vixDiskLib-42.log” in the block-mode-upload pod:

2023-05-11T06:55:57.243Z In(05) host-45 Opening file [vsanDatastore] 84047962-d2a8-e4d6-f5c8-5cba2c166f30/_00fa/fc40568c02fd4d9d9dfaa0be8bad30db-000011.vmdk (SKIPZ-ha-nfc://[vsanDatastore] 84047962-d2a8-e4d6-f5c8-5cba2c166f30/_00fa/fc40568c02fd4d9d9dfaa0be8bad30db-000011.vmdk@esx-xx.vmware.xx.xx:902!52797fe6-6df5-f216-652f-9e4f6a3bfb94)
******
2023-05-11T07:19:22.871Z In(05) host-68 DISKLIB-LIB : numIOs = 50000 numMergedIOs = 0 numSplitIOs = 0
command terminated with exit code 137

 

When I check RAM usage on the node, I don't see any over-consumption, nor any OOMKilled status.
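
(For reference, this is roughly how I check consumption while the export runs; note that kubectl top requires metrics-server to be available in the cluster:)

kubectl top node
kubectl top pod -n kasten-io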

 

BTW, what do you think about block-mode upload? I'll try to export without block mode; maybe that can fix this issue.

 

Thank you for your help

Userlevel 5
Badge +2

@Florian Lacrampe A block mode export accesses the content of the disk snapshot at the block level using infrastructure-specific APIs. In this case, the snapshot content is read at the block level directly over the network using the VMware VDDK API.

If Changed Block Tracking (CBT) is enabled in the infrastructure for the disk volume, K10 will export just the incremental changes to the export location where possible. Since you have over 5 TB of data, this significantly reduces the amount of data transferred during subsequent exports.

Could you run the block mode export again and verify whether there are any I/O issues? It's important to note that exit code 137 can point to resource-utilization issues beyond just memory.

In the logs, it is mentioned that the specified host (host-68) performed a total of 50,000 input/output operations.
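
If it helps, you can also follow that VDDK log live while the export runs. A sketch, with the pod name as a placeholder and a wildcard because the vixDiskLib log file number can vary:

kubectl -n kasten-io exec <block-mode-upload-pod> -- sh -c 'tail -f /tmp/vmware-root/vixDiskLib-*.log'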

Userlevel 2

I’ll continue to troubleshoot to understand why block-mode-upload receives a SIGKILL signal.

I’ve planned to:

  • Empty my Minio (S3) bucket
  • Remove all backups
  • Recreate all k10 pods

If you have any ideas about log files or any stdout/stderr I can check, tell me ;)

I already checked I/O performance but I don't see any problem.

 

Thank you so much for your help!

Have a nice day

Userlevel 2

Well… the SIGKILL was sent due to the Kanister timeout, which was not set. My bad.
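
(For anyone landing here later: I believe the timeout can be raised through the K10 Helm chart, e.g. kanister.backupTimeout in minutes, but verify the exact name and unit for your chart version with helm show values kasten/k10:)

helm upgrade k10 kasten/k10 --namespace kasten-io --reuse-values \
  --set kanister.backupTimeout=120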

 

For now, I'm continuing to test large backups and also testing vSAN I/O during the K10 Kopia transfer.

 

Thank you for your help!

Userlevel 5
Badge +2

@Florian Lacrampe 

Sharing how you figured out that the Kanister timeout was the issue would be helpful. Typically, this information can be found in the debug logs.
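
For example, something along these lines usually surfaces it (a sketch; the component names match a typical K10 install in kasten-io, so list the pods first to confirm what is running in your cluster):

kubectl -n kasten-io get pods
kubectl -n kasten-io logs deployment/executor-svc --all-containers | grep -i timeout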
