Solved

10-hour timeout backing up 1TB


Userlevel 3

Trying to back up 1TB over a 10M connection with K10. The “Monitoring Actions” task of the backup times out after exactly 10 hours, even though I set `kanister.backupTimeout` to two weeks.

Is this a bug?

Also, if the job is canceled, will a second run discover the data already in the repo where the former left off, or does the entire export have to complete in a single run?
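
For scale, a rough transfer-time estimate (my arithmetic, assuming “10M” means 10 Mbit/s of usable throughput with no protocol overhead):

```sh
# 1 TB = 8 * 10^12 bits; at 10 Mbit/s (10^7 bit/s) that is:
echo $(( 8 * 10**12 / 10**7 / 3600 ))   # => 222 hours, roughly 9 days
```

So even an uninterrupted initial backup needs on the order of nine days, far past a 10-hour ceiling.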


Best answer by Satish 29 April 2022, 00:52


6 comments

Userlevel 3
Badge +1

Hi @Aaron Oneal, thanks for reaching out to us.

Could you provide the output of the following commands?

```sh
helm get values k10 -n kasten-io
helm get all k10 -n kasten-io | grep -i kanister
```

If possible, please also provide the debug logs so we can look at this in detail:

```sh
curl -s https://docs.kasten.io/tools/k10_debug.sh | bash
```

The 10h limit is possibly a hardcoded value, part of a safety check to keep a job from running forever.

Regarding your second question: the entire export has to complete in a single run. Once the job is canceled and re-triggered, it starts from the beginning.

Regards
Satish Valasa

Userlevel 3

@Satish, thanks for responding. I’m not able to run the helm commands because I installed from the Helm template using Kustomize. I don’t think they would be that helpful anyway, because I used the default values and only customized as follows.

```yaml
valuesInline:
  global:
    persistence:
      storageClass: nvme
  auth: ...
  externalGateway:
    create: true
  kanister:
    backupTimeout: 7200
```
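
For completeness, the render-and-apply step looks roughly like this (a sketch; the overlay path is illustrative, and `--enable-helm` assumes a reasonably recent kustomize with the values above in a `helmCharts` entry):

```sh
# Render the K10 chart through kustomize's Helm generator and apply it
kustomize build --enable-helm ./k10-overlay | kubectl apply -f -
```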

There’s too much info in the logs for me to share, and nothing in them indicates anything other than the task aborting after 10 hours. It does appear to be a hardcoded limit.

Seems like that monitoring task should share the `backupTimeout` setting; otherwise long-running jobs will always fail at 10 hours.
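
For anyone managing K10 with Helm directly, the equivalent override would be something like this (a sketch; `kanister.backupTimeout` is the chart value quoted above, but check the chart docs for its unit before choosing a number):

```sh
# Raise the Kanister backup timeout on a Helm-managed install
helm upgrade k10 kasten/k10 -n kasten-io --reuse-values \
  --set kanister.backupTimeout=7200
```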

Also, if subsequent runs of a job really don’t make use of the blobs that were previously uploaded, then K10 seems a non-viable backup solution for large datasets headed to the cloud. With multiple TBs to seed over a 10M link, it’s nearly impossible to maintain a stable connection for that long. I hope you’ll consider a way to support resume or block reuse. I realize a subsequent job is sending a different snapshot, but given that the data is nearly identical to the prior run (aborted or not), I would have expected Kanister/Kopia to figure that out and reuse blobs rather than resend them.

Userlevel 3

Hi @Satish,

  1. Will the team take a look at making that “Monitoring Actions” timeout configurable or tied to `backupTimeout` so we can run jobs longer than 10 hours?
  2. What’s the best way to do a multiple TB backup over an intermittent connection? Any chance checking the “Ignore Exceptions and Continue if Possible” setting allows K10 to leverage previously sent blocks from a failed run? 
Userlevel 3

K10 support—

This is clearly a bug: Monitoring Actions is hardcoded to 10 hours even though it’s possible (and desirable) to configure backups to run longer than that.

Will this be addressed?

Ideally Monitoring Actions should inherit the longest timeout configured.

Userlevel 3

The case I opened was closed without an opportunity to respond, so I will continue the discussion here and open a new case if needed. Support wrote:

> A 10M link to do backup, even with a VM it would fail... It is better to review your architecture, because backup is not the only issue you will have. If you want to resume a backup where it stopped or failed, you will rely on a different snapshot; that’s a real consistency issue.

It’s hard to parse what support is saying.

First of all, my understanding is that Kopia repos store blocks addressed by their hash. So from one snapshot to the next, it doesn’t really matter whether the preceding snapshot finished uploading. While it’s true I can’t use a failed snap1 to restore, its blocks are still in the repo, so snap2 doesn’t have to resend them. Eventually all the data makes it to the repo and snapshots complete.
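
To illustrate the behavior I’d expect, here is what the same pattern looks like with a standalone Kopia repository (a sketch outside of K10; the bucket, credentials, and path are placeholders):

```sh
# Content-addressed blobs: a re-run only uploads what the repo lacks
kopia repository connect s3 --bucket=my-backups \
  --access-key=AKIA... --secret-access-key=...
kopia snapshot create /data   # interrupted run: uploaded blobs stay in the repo
kopia snapshot create /data   # re-run: dedups against existing blobs, sends only what's missing
```

Whether K10’s managed repository behaves the same way is exactly what I’m asking support to confirm.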

Which leads to the point that 1TB over a 10M link taking a while is only true for the initial backup. Once that repo is in place, the incrementals are trivial, because that entire 1TB is not churning or being resent, only the minor deltas.

So it would be extremely helpful to be able to keep the jobs running longer and not have Kasten decide for me that 10 hours is long enough for my initial backup to run.

Userlevel 3

@Hagag -

BTW, this also applies to local backups, not just network. I have a 10TB application getting backed up over USB and it fails at 10 hours as well. Please allow configuration of this limit.
