Solved

K10 Disaster Recovery fills NFS share


Userlevel 1

Hi! I’m using an NFS share for K10 DR. The retention policy in the DR policy was changed to keep only 1 hourly version, so only one version should be kept. It looks like the old versions get retired, but the space on the NFS volume isn’t freed up. The usage report also shows that the storage is still in use by kasten-io. What can I do to free up the space? Is there a manual disk space reclaim job that needs to be run?


Best answer by Hagag 28 March 2022, 14:14


26 comments

Userlevel 7
Badge +22

Hi ph1l1pp,

 

If you don’t need to keep the old backups, it might be better to delete the policy and create a new one. I am not using NFS, but with S3, if I delete a policy and then rerun my report, it shows that the data is gone:

 

Userlevel 7
Badge +22

Wait, wrong screenshot ;)

Userlevel 7
Badge +22

Actually I took my S3 offline last night in the basement.. bear with me :) 

Userlevel 7
Badge +22

Ok, while waiting I did find this. There could be issues with permissions, but try looking around the CLI too. 

Take a look at this:

 

https://docs.kasten.io/latest/api/restorepoints.html#api-delete-rpc
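
In case it helps, here is a hedged sketch of what the CLI route from that page can look like. The RestorePointContent resource name below matches what appears later in this thread; treat the placeholder name as something to replace with one from your own cluster:

```
# List the cluster-scoped restore point contents K10 has created
kubectl get restorepointcontents.apps.kio.kasten.io

# Delete a single restore point content by name (replace the placeholder)
kubectl delete restorepointcontents.apps.kio.kasten.io <restorepointcontent-name>
```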

Userlevel 7
Badge +22

Ok, so it looks like this is the case after all. I deleted the policy and recreated it, but the old restore points are still there, so next I will try to remove them manually and explore the API to see if there is a purge of some sort. 

 

Userlevel 1

I deleted the policy and reran the report. It still shows me the same amount of data:

After that I deleted all restore points, which triggered a retire action for all remaining restore points:

kubectl delete restorepointcontents.apps.kio.kasten.io --selector=k10.kasten.io/appName=kasten-io
restorepointcontent.apps.kio.kasten.io "kasten-io-scheduled-nm28p" deleted
restorepointcontent.apps.kio.kasten.io "kasten-io-scheduled-8bb5w" deleted
restorepointcontent.apps.kio.kasten.io "kasten-io-scheduled-lw55l" deleted
restorepointcontent.apps.kio.kasten.io "kasten-io-scheduled-pnnjx" deleted
restorepointcontent.apps.kio.kasten.io "kasten-io-scheduled-vf5sw" deleted
restorepointcontent.apps.kio.kasten.io "kasten-io-scheduled-mhvvs" deleted
restorepointcontent.apps.kio.kasten.io "kasten-io-scheduled-bjnzh" deleted

After a while the retire actions stop with state Failed:

status:
  actionDetails: {}
  endTime: "2022-01-13T12:14:23Z"
  error:
    cause: '{"cause":{"cause":{"cause":{"cause":{"Code":1,"Err":{}},"function":"kasten.io/k10/kio/kanister/function.deleteDataPodExecFunc.func1","linenumber":156,"message":"Error
      executing kopia GC"},"function":"kasten.io/k10/kio/kanister/function.DeleteData","linenumber":92,"message":"Failed
      to execute delete data pod function"},"function":"kasten.io/k10/kio/exec/phases/phase.GenericVolumeSnapshotDelete","linenumber":676,"message":"Failed
      to delete Generic Volume Snapshot data"},"function":"kasten.io/k10/kio/exec/phases/phase.(*retireRestorePointPhase).retireGenericVolumeSnapshots","linenumber":435,"message":"Failed
      to retire some of the generic volume snapshots"}'
    message: Job failed to be executed
  plan: {}
  restorePoint:
    name: ""
  result:
    name: ""
  startTime: "2022-01-13T11:40:39Z"
  state: Failed

 

The problem still persists. Any idea what causes this Kopia error?

Userlevel 7
Badge +22

Hi ph1l1pp,

 

You might need someone from Kasten to check that since it is at the Kopia level. You can open a support case even if you are using the Community edition.

cheers

Userlevel 7
Badge +22

Hi again, so I went through all the steps (manually deleting etc) and I saw a result right away in the report.

So I did the manual deletions:

 

And ran the report on demand before and after:

So immediately the report reflects the change.

 

That Kopia error that you found is obviously the cause. Please get back to us if Kasten support figures it out, so we can be aware in case it pops up for us too.

cheers

Geoff

Userlevel 4
Badge +2

Hello @ph1l1pp 

Thank you for using our community and K10.

Could you please raise a ticket with K10 support at http://my.veeam.com/? To speed up the troubleshooting, please attach the debug logs to your case:

https://docs.kasten.io/latest/operating/support.html#gathering-debugging-information
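
While you gather the official debug information, a hedged sketch of pulling logs from the two services involved in retire and export actions may also help; the deployment names below are taken from elsewhere in this thread, so adjust them if your install differs:

```
# Logs from the executor service, which runs the retire/export jobs
kubectl logs -n kasten-io deployment/executor-svc --all-containers --tail=500

# Logs from the catalog service, which tracks restore points
kubectl logs -n kasten-io deployment/catalog-svc --all-containers --tail=500
```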

Best Regards

Fernando R.

Userlevel 1

@Geoff Burke Thanks for going through all the steps on your S3 environment. That’s how I expect it to behave in my environment as well. I have now opened a support case.

After freeing up some space manually, the Kopia error disappeared. It looks like that one was caused by the full disk. But the root cause still persists: the volume starts filling up again and old data isn’t removed.

Userlevel 7
Badge +22

@Geoff Burke Thanks for going through all the steps on your S3 environment. That’s how I expect it to behave in my environment as well. I have now opened a support case.

After freeing up some space manually, the Kopia error disappeared. It looks like that one was caused by the full disk. But the root cause still persists: the volume starts filling up again and old data isn’t removed.

Ah ok, that is good to know. Probably the logs were not clear. I have seen that before, when logs lead people on a wild goose chase and then it turns out to be something like space or permissions :). I am noting this down in case I see it too, thanks.

Userlevel 5
Badge +2

Hello Philipp,

Please check the reply of your case.


I have checked the logs sent from your side. Currently, retire actions cannot reclaim space from external repositories (the NFS volume).
Retire actions, or manually deleting the restore points either from the backend or from the Kasten dashboard, will free local space in the cluster but not outside the cluster.

I checked the screenshot and found that all the snapshots and data were removed; the only data that remains is the Kasten services data (resources data), which is used by the logging, catalog and jobs objects/resources in the kasten-io namespace.
For more details about this data you can check the "More Charts and Alerts" option under the "Usage & Reports" page in the Kasten dashboard.

Back to your first inquiry, whether there is any way to remove the data from the NFS volume that a retire action triggered: unfortunately we don’t have such an option at the moment, but it is a good feature that could come in a future version.

I would like to emphasize that retire actions, or manually deleting restore points, will clean the local space in the cluster (e.g. removing snapshots).

Please let us know if you still have questions or need more clarification from our side.
Best Regards
Ahmed Hagag

Userlevel 7
Badge +22

Hi Ahmed,

 

Is this the same for Object Storage as well? i.e. the data won’t retire automatically? 

 

Thanks 

Userlevel 1

We are facing the same behavior with our NFS location profile. @Hagag, are you saying this is the expected behavior, i.e. exported restore points are never deleted from object/NFS storage?

Userlevel 5
Badge +2

Hello @KelianSB @Geoff Burke 

Sorry for the confusion. Retire actions should remove the data from the external repository (regardless of whether it is object or NFS storage), so it seems something is wrong on our side.
We are checking with the engineering team how it can be fixed and will get back to you soon.

 

Thanks

Ahmed Hagag

Userlevel 7
Badge +22

Hello @KelianSB @Geoff Burke 

Sorry for the confusion. Retire actions should remove the data from the external repository (regardless of whether it is object or NFS storage), so it seems something is wrong on our side.
We are checking with the engineering team how it can be fixed and will get back to you soon.

 

Thanks

Ahmed Hagag

Thanks Ahmed. It will be interesting to see. I will try again later with the latest release as well.

 

cheers

Userlevel 1

Hello @KelianSB @Geoff Burke 

Sorry for the confusion. Retire actions should remove the data from the external repository (regardless of whether it is object or NFS storage), so it seems something is wrong on our side.
We are checking with the engineering team how it can be fixed and will get back to you soon.

 

Thanks

Ahmed Hagag

Hello @Hagag, thanks for your response. Were you finally able to identify the problem?

Userlevel 5
Badge +2

Hi @KelianSB 

We are still checking the case, but let me explain a bit about Kopia, the tool we use for fast and secure backups, and why such an issue can exist.

Kopia marks the blobs of a snapshot as deleted but does not actually delete them until garbage collection runs. It seems garbage collection runs after 1 or 2 days, and if a backup taken before then references the same blobs that were marked deleted, they will be unmarked and the rest will be removed.

It is a kind of fail-safe mechanism used by the Kopia tool, but we are still testing that.
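
For anyone who wants to observe this behavior, here is a hedged sketch using standard Kopia commands; run them against a repository connection like the one described later in this thread, and note that output details vary between Kopia versions:

```
# Show the maintenance configuration, owner, and when quick/full maintenance last ran
kopia maintenance info

# Blob statistics for the repository, including data not yet garbage collected
kopia blob stats

# List the snapshots currently known to the repository
kopia snapshot list --all
```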

 

Could you clarify the status of the policy you are using for backup? If you paused it, the Kopia mechanism I mentioned won’t work.

 

As I mentioned above, we are in the testing phase to confirm the root cause of this issue.

Userlevel 1

We tried with active policies, paused policies, and also with manual exports, and the result is the same each time: the data is not deleted from the NFS storage.

Hello, I have a similar issue with the NFS mount FileStore policy.

I wasn’t able to find any documentation relating to this, but I discovered that the permissions of the create-repo-pod didn’t seem to match up with my NFS export permissions:

```
cause:
  cause:
    cause:
      cause:
        Code: 1
        Err: {}
      file: kasten.io/k10/kio/kopia/repository.go:528
      function: kasten.io/k10/kio/kopia.ConnectToKopiaRepository
      linenumber: 528
      message: Failed to connect to the backup repository
    fields:
      - name: appNamespace
        value: backup
    file: kasten.io/k10/kio/exec/phases/phase/export.go:210
    function: kasten.io/k10/kio/exec/phases/phase.prepareKopiaRepoIfExportingData
    linenumber: 210
    message: Failed to create Kopia repository for data export
  file: kasten.io/k10/kio/exec/phases/phase/export.go:132
  function: kasten.io/k10/kio/exec/phases/phase.(*exportRestorePointPhase).Run
  linenumber: 132
  message: Failed to copy artifacts
message: Job failed to be executed
fields: []
```

Additionally, changing my NFS directory to chmod 777 allowed the repository to be created and written to; however, it is written as nobody/nogroup. When I attempt to delete snapshots/exports, it fails to delete the NFS-exported files (due to permissions). Only after chmoding the directory to 777 did it allow deletion.

Is there any way to configure the `kopia` command to provide a uid/gid (https://kopia.io/docs/reference/command-line/common/repository-connect-filesystem/)?

```
c [0] tcp.0: [1646973928.325036348, {"Command"=>"kopia --log-level=error --config-file=/tmp/kopia-repository --log-dir=/tmp/kopia-log --password=<****> repository connect --no-check-for-updates --cache-directory=/tmp/kopia-cache --content-cache-size-mb=0 --metadata-cache-size-mb=500 --override-hostname=create-repo-pod --override-username=k10-admin filesystem --path=/mnt/data/default/repo/662b98ec-f136-4957-a100-2075a642f128/", "File"=>"kasten.io/k10/kio/kopia/kopia.go", "Function"=>"kasten.io/k10/kio/kopia.stringSliceCommand", "Level"=>"info", "Line"=>132, "Message"=>"kopia command", "Time"=>"2022-03-11T04:45:28.109692023Z", "cluster_name"=>"625a2a72-f41b-4697-8b55-7ab69805021f", "hostname"=>"executor-svc-b549cbdfc-jkvpd", "version"=>"4.5.10"}]
```
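
Not an official answer, but the nobody/nogroup mapping usually comes from root squashing on the NFS server. A hedged workaround on the server side is to squash clients to a uid/gid that owns the export directory; the path, network and ids below are placeholders for illustration:

```
# /etc/exports (example entry; adjust path, network and ids to your environment)
/mnt/data  10.0.0.0/24(rw,sync,no_subtree_check,all_squash,anonuid=1000,anongid=1000)
```

```
# Re-export after editing /etc/exports, and make sure the directory is owned by that uid/gid
sudo exportfs -ra
sudo chown 1000:1000 /mnt/data
```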

 

Userlevel 5
Badge +2

Workaround: run Kopia full maintenance on the NFS share to clear external storage space.

 

Problem Description:

When the restore points for DR expire and are deleted (manually or by retire actions), the space is not reclaimed from external storage (S3, NFS, ...), filling it up.

 

Workaround/Resolution:

It is a bit tricky since NFS is used as a location profile for the DR backup location.
The idea is to mount your NFS share in the kanister-sidecar container of the catalog-svc-xxxx-xxx pod.

1- create a PV and PVC in your kasten-io namespace; check the example below (both manifests can be applied with kubectl, as shown after the second file)

cat test-nfs-pv2.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
   name: test-pv-nfs2
spec:
   capacity:
      storage: 2Gi
   volumeMode: Filesystem
   accessModes:
      - ReadWriteMany
   persistentVolumeReclaimPolicy: Retain
   storageClassName: nfs
   mountOptions:
      - hard
      - nfsvers=4.1
   nfs:
      path: /mnt/backups
      server: NFS-IP-address

 

cat test-nfs-pvc2.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
   name: nfs-pvc2
   namespace: kasten-io
spec:
   storageClassName: nfs
   accessModes:
      - ReadWriteMany
   resources:
      requests:
         storage: 2Gi
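
Both manifests can then be applied with kubectl (file names as used in the examples above):

```
kubectl apply -f test-nfs-pv2.yaml
kubectl apply -f test-nfs-pvc2.yaml

# Confirm the claim is Bound before patching the deployment
kubectl get pvc nfs-pvc2 -n kasten-io
```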

 

2- patch the current catalog-svc deployment by creating a YAML file similar to the one below; a sample patch command follows the file:

cat catalogsvc2.yaml

spec:
  template:
    spec:
      volumes:
        - name: nfs-storage1
          persistentVolumeClaim:
            claimName:  nfs-pvc2
      containers:
        - name: kanister-sidecar
          volumeMounts:
          - mountPath: /mnt/backup
            name: nfs-storage1
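
A hedged example of applying that patch, assuming the deployment is named catalog-svc in the kasten-io namespace as referenced above (the --patch-file flag needs a reasonably recent kubectl):

```
# Strategic merge patch of the catalog-svc deployment with the file above
kubectl patch deployment catalog-svc -n kasten-io --patch-file catalogsvc2.yaml

# Wait for the patched pod to roll out
kubectl rollout status deployment/catalog-svc -n kasten-io
```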

3- you can connect to the Kopia repository from the kanister-sidecar container by issuing commands similar to the ones below
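
To get a shell in the sidecar first, a hedged sketch (deployment and container names as used in this workaround):

```
# Open a shell inside the kanister-sidecar container of the catalog service pod
kubectl exec -it -n kasten-io deployment/catalog-svc -c kanister-sidecar -- /bin/sh
```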

- you need to change the path to match the layout of your NFS share

kopia repository connect filesystem --path /mnt/backup/k10/b8d3ead3-ea44-48a0-ade8-11a746e79f0b/migration/b8d3ead3-ea44-48a0-ade8-11a746e79f0b/k10/repo/

- it will request a password, which you can fetch with the command below

kubectl get secret -n kasten-io k10-dr-secret -o jsonpath='{.data.key}'|base64 -d

 

4- change the owner and run the Kopia full maintenance command; feel free to drop the --safety=none option if you prefer Kopia's default safety checks.
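
To capture the current owner before overriding it, the standard Kopia command below prints it (a hedged sketch; the output format may differ between Kopia versions):

```
# Shows the maintenance owner plus the last quick/full maintenance runs
kopia maintenance info
```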

kopia maintenance set --owner=me    # save the current owner information (see above) before running this command
kopia maintenance run --log-level=debug --full --safety=none

  • once you finish, change the owner back, for example:

kopia maintenance set --owner=k10-admin@31593f98-45b4-41a1-afe8-8ce6291ed242-maintenance

Userlevel 1

Hello @Hagag, is this case resolved? I mean, without using the suggested workaround.

Userlevel 5
Badge +2

@KelianSB still, the fix is in progress.

Hello folks,

We have a similar problem using S3 storage.

k10: 5.5.6

Kanister: 0.89.0

Kopia: 0.12.1

Minio: 2023-03-13T19:46:17Z

Kubernetes: v1.24.9+rke2r2

Do you have any news?

Thank you.

Userlevel 6
Badge +2

We are still improving the maintenance process.

However, if you need help running maintenance for your S3, please open a case with us through my.veeam.com.
