Question

Will Kasten use Shallow Read-Only CephFS snapshots?

  • 18 October 2022

Userlevel 2

Hi all,

With release 3.7.0, Ceph-CSI has finally introduced shallow read-only CephFS snapshots, which are pretty much zero-cost compared to the "old" snapshot approach that required a full copy of all data on the volume in order to mount the snapshot for reading. https://github.com/ceph/ceph-csi/blob/devel/docs/design/proposals/cephfs-snapshot-shallow-ro-vol.md

CephFS backups to S3 with Kasten are almost impossible for me at the moment, as creating the volume from the snapshot takes much longer than the configured timeout limits even for a medium amount of data (50 GB) and uses a lot of resources.

I would expect shallow read-only snapshots to solve this, but it seems that Kasten still uses the old approach. From what I gathered from the documentation, as long as the snapshot volume is mounted read-only, ceph-csi should use shallow snapshots.
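
For illustration, this is roughly the kind of restore PVC I would expect to take the shallow path, based on my reading of the design doc (all names, namespaces and sizes below are just placeholders from my side):

```yaml
# Placeholder example: a PVC restored from a VolumeSnapshot with a
# read-only access mode, which - as I read the ceph-csi design doc -
# should result in a shallow, snapshot-backed volume instead of a full copy.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-cephfs-pvc        # placeholder name
  namespace: my-app                # placeholder namespace
spec:
  storageClassName: cephfs-sc      # placeholder CephFS storage class
  accessModes:
    - ReadOnlyMany                 # ROX - the read-only mode the shallow path expects
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: my-cephfs-snapshot       # placeholder snapshot name
  resources:
    requests:
      storage: 50Gi                # must be at least the snapshot's restore size
```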

Is my configuration incorrect or has Kasten not yet implemented this?

 

Thank you very much,

Pascal


7 comments

Userlevel 5
Badge +1

@pascalzero Thank you for posting this question.

As you mentioned, CephFS takes a long time to restore from a snapshot, and we have seen this happen often, causing export failures.

We could help you tweak the timeout for the wait period in this case.

However, I will go through the ceph-csi docs for the 3.7 release and these shallow read-only volumes.

Do you know if this feature requires any changes to the accessModes of the PVC that is created with the VolumeSnapshot as its dataSource?

Currently K10 uses the accessMode of the original PVC for the temporary PVCs created during exports.
If a change in the spec is required to utilise the shallow read-only clone, then we will have to file a feature request.
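
To illustrate what I mean (an illustrative sketch only, not actual K10 output; the names are made up):

```yaml
# Illustrative only - not actual K10 output; names are made up.
# If the application's PVC looks like this:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  storageClassName: cephfs-sc
  accessModes:
    - ReadWriteMany                # RWX on the source PVC
  resources:
    requests:
      storage: 50Gi
# ...then the temporary PVC that K10 creates from the VolumeSnapshot during
# an export also requests ReadWriteMany, so a read-only (shallow) clone is
# never asked for.
```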

Userlevel 2

Hi @jaiganeshjk , thanks for the quick response.

I believe this could be a real game changer for CephFS backups. We would rather not change timeouts for now, as the full copy also puts quite a bit of load on the system, so a proper solution is definitely required.

We have reverted to other software/scripts for the time being, but it would of course be ideal if Kasten handled this nicely.

As I understand the spec, if you simply use read-only access for the temporary PVC, that should already result in a shallow clone being used, as per this item from the design doc (a quick sanity check is sketched below the quote):

  • Volume source is a snapshot, volume access mode is *_READER_ONLY.
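
A quick manual sanity check I have in mind (all names are placeholders, and this assumes the ROX PVC from my first post exists): mount such a snapshot-backed PVC read-only in a throwaway pod; if ceph-csi takes the shallow path, the pod should start almost instantly regardless of the data size.

```yaml
# Hypothetical test pod mounting a snapshot-backed ROX PVC read-only.
# If ceph-csi takes the shallow path, no data is copied, so the pod
# should become Running almost immediately even for large volumes.
apiVersion: v1
kind: Pod
metadata:
  name: shallow-snap-test            # placeholder name
spec:
  containers:
    - name: inspect
      image: busybox
      command: ["sh", "-c", "ls -la /data && sleep 3600"]
      volumeMounts:
        - name: snap-data
          mountPath: /data
          readOnly: true
  volumes:
    - name: snap-data
      persistentVolumeClaim:
        claimName: restored-cephfs-pvc   # the placeholder ROX PVC from my first post
        readOnly: true
```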

I'd be really happy to see this happen any time soon - if there is a way for me to support (e.g. through testing), please let me know.

 

Cheers,

Pascal

Userlevel 5
Badge +1

Thanks Pascal.
Reading about this feature leads me to believe that we can get it working with K10 out of the box.

I will do some further reading and testing to get this working and keep you posted.

Userlevel 2

You got me excited 🙂 Good luck!

Userlevel 5
Badge +1

@pascalzero I went through the testing and found that it cannot work with K10 out of the box.

Unfortunately, this feature needs the PVCs to be created with the accessMode set to `ROX` (which is the only supported accessMode for snapshot-backed volumes).

However, K10 takes the accessMode for the temporary PVC from the original PVC's manifest.

We don’t have a way to override this as of today.

I will open a feature request to support this and keep you informed once it is supported in the product.

Userlevel 2

Hi @jaiganeshjk ,

Thanks a lot for the investigation and getting back on this.

That’s more or less what I expected, but I’m crossing my fingers that it lands soon, as it’s surely a vital feature for a lot of users once Ceph-CSI 3.7 adoption has spread a bit further.

Again, if there’s anything I can do to help with testing, let me know.

Userlevel 2

Hi @jaiganeshjk ,

 

Just wanted to check in to see whether you have any insight into the release planning and whether this can be placed on the timeline yet?

We are still struggling with this issue every night when backups run, as the storage load from the CephFS copying goes up so much that it impacts overall system stability.

 

Thanks a lot!
