Solved

10-hour timeout backing up 1TB


Userlevel 3

Trying to back up 1TB over a 10M connection with K10. The “Monitoring Actions” task of the backup times out after exactly 10 hours, even though I set `kanister.backupTimeout` to two weeks.

Is this a bug?

Also, if the job is canceled, will a second run discover the data already in the repo where the former left off, or does the entire export have to complete in a single run?
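
For scale, a rough transfer-time estimate (my arithmetic, assuming “10M” means 10 Mbit/s of usable throughput with no protocol overhead):

```sh
# 1 TB = 8 * 10^12 bits; at 10 Mbit/s (10^7 bit/s) that is:
echo $(( 8 * 10**12 / 10**7 / 3600 ))   # => 222 hours, roughly 9 days
```

So even an uninterrupted initial backup needs on the order of nine days, far past a 10-hour ceiling.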


Best answer by Satish 29 April 2022, 00:52


6 comments

Userlevel 3
Badge +1

Hi @Aaron Oneal, thanks for reaching out to us.

Could you provide the output of the following commands?

```sh
helm get values k10 -n kasten-io
helm get all k10 -n kasten-io | grep -i kanister
```

If possible, please also provide the debug logs so we can look at this in detail:

```sh
curl -s https://docs.kasten.io/tools/k10_debug.sh | bash
```

The 10h limit is possibly a hardcoded value, part of a safety check to keep a job from running forever.

Regarding your second question: the entire export has to complete in a single run. Once the job is canceled and re-triggered, it starts from the beginning.

Regards
Satish Valasa

Userlevel 3

@Satish, thanks for responding. I’m not able to run the helm commands because I installed from the Helm template using Kustomize. I don’t think they would be that helpful anyway, because I used the default values and only customized as follows.

```yaml
valuesInline:
  global:
    persistence:
      storageClass: nvme
  auth: ...
  externalGateway:
    create: true
  kanister:
    backupTimeout: 7200
```
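
For completeness, the render-and-apply step looks roughly like this (a sketch; the overlay path is illustrative, and `--enable-helm` assumes a reasonably recent kustomize with the values above in a `helmCharts` entry):

```sh
# Render the K10 chart through kustomize's Helm generator and apply it
kustomize build --enable-helm ./k10-overlay | kubectl apply -f -
```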

There’s too much info in the logs for me to share, and nothing in them indicates anything other than the task aborting after 10 hours. It does appear to be a hardcoded limit.

Seems like that monitoring task should share the `backupTimeout` setting; otherwise long-running jobs will always fail at 10 hours.
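
For anyone managing K10 with Helm directly, the equivalent override would be something like this (a sketch; `kanister.backupTimeout` is the chart value quoted above, but check the chart docs for its unit before choosing a number):

```sh
# Raise the Kanister backup timeout on a Helm-managed install
helm upgrade k10 kasten/k10 -n kasten-io --reuse-values \
  --set kanister.backupTimeout=7200
```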

Also, if subsequent runs of a job really don’t make use of the blobs that were previously uploaded, then K10 seems a non-viable backup solution for large datasets headed to the cloud. With multiple TBs to seed over a 10M link, it’s nearly impossible to maintain a stable connection for that long. I hope you’ll consider a way to support resume or block reuse. I realize a subsequent job is sending a different snapshot, but given that the data is nearly identical to the prior run (aborted or not), I would have expected Kanister/Kopia to figure that out and reuse blobs rather than resend them.

Userlevel 3

Hi @Satish,

  1. Will the team take a look at making that “Monitoring Actions” timeout configurable or tied to `backupTimeout` so we can run jobs longer than 10 hours?
  2. What’s the best way to do a multiple TB backup over an intermittent connection? Any chance checking the “Ignore Exceptions and Continue if Possible” setting allows K10 to leverage previously sent blocks from a failed run? 
Userlevel 3

K10 support—

This is clearly a bug: Monitoring Actions is hardcoded to 10 hours even though it’s possible (and desirable) to configure backups to run longer than that.

Will this be addressed?

Ideally Monitoring Actions should inherit the longest timeout configured.

Userlevel 3

The case I opened was closed without an opportunity to respond, so I will continue the discussion here and open a new case if needed. Support wrote:

> A 10M link to do backup, even with a VM it would fail... It is better to review your architecture, because backup is not the only issue you will have. If you want to resume a backup where it stopped or failed, you will rely on a different snapshot; that’s a real consistency issue.

It’s hard to parse what support is saying.

First of all, my understanding is that Kopia repos store blocks addressed by their hash. So from one snapshot to the next, it doesn’t really matter whether the preceding snapshot finished uploading. While it’s true I can’t use a failed snap1 to restore, its blocks are still in the repo, so snap2 doesn’t have to resend them. Eventually all the data makes it to the repo and snapshots complete.
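
To illustrate the behavior I’d expect, here is what the same pattern looks like with a standalone Kopia repository (a sketch outside of K10; the bucket, credentials, and path are placeholders):

```sh
# Content-addressed blobs: a re-run only uploads what the repo lacks
kopia repository connect s3 --bucket=my-backups \
  --access-key=AKIA... --secret-access-key=...
kopia snapshot create /data   # interrupted run: uploaded blobs stay in the repo
kopia snapshot create /data   # re-run: dedups against existing blobs, sends only what's missing
```

Whether K10’s managed repository behaves the same way is exactly what I’m asking support to confirm.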

Which leads to the point that 1TB over a 10M link taking a while is only true for the initial backup. Once that repo is in place, the incrementals are trivial, because that entire 1TB is not churning or being resent, only the minor deltas.

So it would be extremely helpful to be able to keep the jobs running longer and not have Kasten decide for me that 10 hours is long enough for my initial backup to run.

Userlevel 3

@Hagag -

BTW, this also applies to local backups, not just network. I have a 10TB application getting backed up over USB and it fails at 10 hours as well. Please allow configuration of this limit.
