Question

K10 Disaster Recovery fills NFS share


Userlevel 1

Hi! I’m using an NFS share for K10 DR. The retention policy in the DR policy was changed to keep only one hourly version, so only one version should remain. It looks like the old versions get retired, but the space on the NFS volume isn’t freed up. The usage report also shows that the storage is still in use by kasten-io. What can I do to free up the space? Is there a manual disk space reclaim job that needs to be run?
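For reference, this is roughly how I am checking the usage directly on the NFS export (the path is just an example from my setup, adjust it to your export):

# run on the NFS server; /srv/nfs/k10-dr is a placeholder for the exported path
df -h /srv/nfs/k10-dr
du -sh /srv/nfs/k10-dr/*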


13 comments

Userlevel 2

Hello @ph1l1pp 

Thank you for using our community and K10.

Could you please raise a ticket with K10 support at http://my.veeam.com/? To speed up the troubleshooting, please attach the debug logs to your case:

https://docs.kasten.io/latest/operating/support.html#gathering-debugging-information
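As a first look, the individual pod logs from the kasten-io namespace can also be pulled with kubectl (pod names will differ per installation):

kubectl get pods -n kasten-io
kubectl logs -n kasten-io <pod-name> --all-containers > k10-pod.log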

Best Regards

Fernando R.

Userlevel 1

@Geoff Burke Thanks for going through all the steps on your S3 environment. That’s how I expect it to behave in my environment as well. I have now opened a support case.

After freeing up some space manually, the kopia error disappeared. It looks like it was caused by the full disk. But the root cause still persists: the volume starts filling up again and old data isn’t removed.

Hello Philipp,

Please check the reply to your case.


I have checked the logs sent from your side. Currently, retire actions cannot reclaim space from external repositories (the NFS volume).
Retire actions, or manually deleting the restore points either from the backend or the Kasten dashboard, will free the local space in the cluster but not outside the cluster.

I checked the screenshot and found that all the snapshots and data were removed; the only data that remains is the Kasten services data (resources data) used by the logging, catalog and jobs objects or resources in the kasten-io namespace.
For more details about this data you can check the "more charts and alerts" option under the "usage and reports" page in the Kasten dashboard.

Back to your first inquiry, whether there is any way to remove the data from the NFS volume that the retire action triggered: unfortunately we don’t have such an option at the moment, but it is a good feature that could be possible in coming versions.

I would like to emphasize that the retire action, or manually deleting a restore point, will clean the local space in the cluster (e.g. removing snapshots).
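As an illustration, assuming the CSI snapshot CRDs are installed in your cluster, you can verify that the local snapshots behind the retired restore points are gone like this:

kubectl get volumesnapshots -A
kubectl get volumesnapshotcontents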

Please let us know if you still have inquiries or need more clarification from our side.
Best Regards
Ahmed Hagag

Userlevel 7
Badge +8

Hi ph1l1pp,

 

If you don’t need to keep the old backups, it might be better to delete the policy and create a new one. I am not using NFS, but with S3, if I delete a policy and then run my report again, it claims the data is gone:
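If you prefer the CLI over the dashboard, something along these lines should list and delete the policy (the policy name is a placeholder):

kubectl get policies.config.kio.kasten.io -n kasten-io
kubectl delete policies.config.kio.kasten.io <policy-name> -n kasten-io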

 

Userlevel 7
Badge +8

Wait, wrong screenshot ;)

Userlevel 7
Badge +8

Actually I took my S3 offline last night in the basement… bear with me :)

Userlevel 7
Badge +8

OK, while waiting I did find this. There could be issues with permissions, but try looking around the CLI too.

Take a look at this:

 

https://docs.kasten.io/latest/api/restorepoints.html#api-delete-rpc
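Roughly, as I read that page, the flow looks like this (names and namespace are placeholders; check the doc for the exact semantics of RestorePoint vs. RestorePointContent):

kubectl get restorepoints.apps.kio.kasten.io -n <app-namespace>
kubectl delete restorepoints.apps.kio.kasten.io <restorepoint-name> -n <app-namespace>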

Userlevel 7
Badge +8

OK, so it looks like this is the case after all. I deleted the policy and recreated it, but the old restore points are still there, so next I will try to remove them manually and explore the API to see if there is a purge of some sort.

 

Userlevel 1

I deleted the policy and reran the report. It still shows me the same amount of data:

After that I deleted all restore points, which triggered a retire action for all remaining restore points:

kubectl delete restorepointcontents.apps.kio.kasten.io --selector=k10.kasten.io/appName=kasten-io
restorepointcontent.apps.kio.kasten.io "kasten-io-scheduled-nm28p" deleted
restorepointcontent.apps.kio.kasten.io "kasten-io-scheduled-8bb5w" deleted
restorepointcontent.apps.kio.kasten.io "kasten-io-scheduled-lw55l" deleted
restorepointcontent.apps.kio.kasten.io "kasten-io-scheduled-pnnjx" deleted
restorepointcontent.apps.kio.kasten.io "kasten-io-scheduled-vf5sw" deleted
restorepointcontent.apps.kio.kasten.io "kasten-io-scheduled-mhvvs" deleted
restorepointcontent.apps.kio.kasten.io "kasten-io-scheduled-bjnzh" deleted

After a while the retire actions stop in the Failed state:

status:
  actionDetails: {}
  endTime: "2022-01-13T12:14:23Z"
  error:
    cause: '{"cause":{"cause":{"cause":{"cause":{"Code":1,"Err":{}},"function":"kasten.io/k10/kio/kanister/function.deleteDataPodExecFunc.func1","linenumber":156,"message":"Error
      executing kopia GC"},"function":"kasten.io/k10/kio/kanister/function.DeleteData","linenumber":92,"message":"Failed
      to execute delete data pod function"},"function":"kasten.io/k10/kio/exec/phases/phase.GenericVolumeSnapshotDelete","linenumber":676,"message":"Failed
      to delete Generic Volume Snapshot data"},"function":"kasten.io/k10/kio/exec/phases/phase.(*retireRestorePointPhase).retireGenericVolumeSnapshots","linenumber":435,"message":"Failed
      to retire some of the generic volume snapshots"}'
    message: Job failed to be executed
  plan: {}
  restorePoint:
    name: ""
  result:
    name: ""
  startTime: "2022-01-13T11:40:39Z"
  state: Failed

 

The problem still persists. Any idea what causes this kopia error?
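For completeness, this is roughly how I am pulling the failed action details (resource names are from my cluster, yours will differ):

kubectl get retireactions.actions.kio.kasten.io -A
kubectl get retireactions.actions.kio.kasten.io <action-name> -n <namespace> -o yaml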

Userlevel 7
Badge +8

Hi ph1l1pp,

 

You might need someone from Kasten to check that, since it is at the kopia level. You can open a support case even if you are using the community edition.

cheers

Userlevel 7
Badge +8

Hi again, so I went through all the steps (manually deleting, etc.) and I saw a result right away in the report.

So I did manual deletions:

 

And ran the report on demand before and after:

So immediately the report reflects the change.

 

That kopia error you found is obviously the cause. Please get back to us if Kasten support figures it out, so we are aware of it if it pops up again.

cheers

Geoff

Userlevel 7
Badge +8

@Geoff Burke Thanks for going through all the steps on your S3 environment. That’s how I expect it to behave in my environment as well. I have now opened a support case.

After freeing up some space manually, the kopia error disappeared. It looks like it was caused by the full disk. But the root cause still persists: the volume starts filling up again and old data isn’t removed.

Ah, OK, that is good to know. Probably the logs were not clear. I have seen that before, when logs lead people on a wild goose chase and then it turns out to be something like space or permissions :). I am noting this down in case I see it too, thanks.

Userlevel 7
Badge +8

Hi Ahmed,

 

Is this the same for Object Storage as well, i.e. will the data not retire automatically?

 

Thanks 
