Immutable Storage actual disk used per folder

3 years ago
July 28, 2021
15 comments
1521 views

ThePlaneskeeper
Not a newbie anymore
3 comments

Inherency:

Since we bill by actual disk usage for our clients using CloudConnect, the existing reports are not usable for reporting and billing (they report pre-deduplication/reflink data). It took a while to figure out, but there is a way to calculate the actual disk space used per directory on an immutable repository.

Solution:

Using: https://community.veeam.com/blogs-and-podcasts-57/check-reflink-and-spared-space-on-xfs-repositories-244

From that post I have derived a script that gives the disk usage of each folder (i.e. client) on an immutable repo. This isn't "Data used"; that is what Veeam reports. This is "Disk used": the actual size on disk after reflinks (duplicated data is only counted once).

A note about this script: it appears that the original blog entry is wrong about the size of a block. It uses 4096, which is true on disk, but the utility used explicitly reports its numbers in units of 512-byte blocks:
https://linux.die.net/man/8/xfs_bmap
"units of 512-byte blocks"
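
To make the unit difference concrete, here is a minimal sketch of the conversion. The helper name blocks_to_gib and the block count are made up for illustration; the only fact taken from the man page is the 512-byte unit.

```shell
#!/bin/bash
# blocks_to_gib is a hypothetical helper, not part of the script in this post.
# $1 = block count as printed by xfs_bmap, $2 = assumed bytes per block.
blocks_to_gib() {
    awk -v blocks="$1" -v size="$2" \
        'BEGIN { printf "%.2f\n", blocks * size / 1024 / 1024 / 1024 }'
}

blocks_to_gib 2097152 512     # prints 1.00 (the unit xfs_bmap reports in)
blocks_to_gib 2097152 4096    # prints 8.00 (eight times too large)
```

In other words, multiplying the xfs_bmap numbers by 4096 would overstate usage by a factor of eight.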

To use this script, we run it from a cron task and pipe the output to a mail client on the repo itself (e.g. script.bash 2>&1 | mail -s "Immutable storage report for $HOSTNAME" someemail@email.place)

#!/bin/bash

for clientDir in `find /backups/disk-01/backups/ -mindepth 1 -maxdepth 1 -type d`
do
    echo $clientDir
    clientSpaceUsed=$(find $clientDir/*/* -xdev -type f -exec xfs_bmap -l {} + | awk '{ print $3 " " $4 }' | sort -k 1 | uniq | awk '{ print $2 }' | grep -Eo '[0-9]+' | paste -sd+ | bc | awk '{print $1*512/1024/1024/1024}')
    # xfs_bmap reports lengths in 512-byte blocks. Divide by 1024 for KB, again for MB, and again for GB.
    echo "$clientSpaceUsed GB"
done

To break down how this works:

For each client directory in “/backups/disk-01/backups/”:

Output the directory being reported on.
Run xfs_bmap -l (this tells us all about the blocks in question).
Take columns 3 and 4 (they become columns 1 and 2; the rest are discarded).
Sort by column 1.
Remove duplicate rows (reflinks from fast cloning; this keeps a single copy of the data for counting purposes).
Select only column 2 (it becomes column 1).
Remove anything other than numbers.
Add those numbers together.
Multiply by the block size (512).
Divide by 1024 (now KB).
Divide by 1024 (now MB).
Divide by 1024 (now GB).
Output the text.
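
The steps above can be tried without a repository by feeding the same pipeline some made-up input. This is only a sketch: the file paths and extents in sample_bmap are hypothetical and merely follow the layout of xfs_bmap -l output, the sum is done in awk instead of paste and bc, and the digit filter is written as [0-9]+ so that long lengths stay in one piece.

```shell
#!/bin/bash
# Hypothetical sample in the layout of `xfs_bmap -l` output: two files that
# share one reflinked extent (disk blocks 4096..6143).
sample_bmap() {
cat <<'EOF'
/backups/disk-01/backups/clientA/job/full.vbk:
    0: [0..2047]: 4096..6143 2048 blocks
    1: [2048..4095]: 8192..10239 2048 blocks
/backups/disk-01/backups/clientA/job/incr.vib:
    0: [0..2047]: 4096..6143 2048 blocks
    1: [2048..6143]: 20480..24575 4096 blocks
EOF
}

# Same stages as the script: keep block range and length, drop duplicate
# rows, keep the length, keep digits only, then add everything up.
unique_blocks() {
    awk '{ print $3 " " $4 }' | sort -k 1 | uniq | awk '{ print $2 }' |
        grep -Eo '[0-9]+' | awk '{ sum += $1 } END { printf "%.0f\n", sum }'
}

blocks=$(sample_bmap | unique_blocks)
echo "$blocks blocks of 512 bytes"
echo "$blocks" | awk '{ print $1 * 512 / 1024 / 1024 " MiB" }'
```

The four extents add up to 10240 blocks, but the shared extent is counted once, so the sample reports 8192 blocks, which is 4 MiB.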

Chris.Childerhose
Veeam Legend, Veeam Vanguard
8562 comments
3 years ago
July 28, 2021

Great post. Another script to add to my repo. :sunglasses:

JMeixner
On the path to Greatness
2651 comments
3 years ago
July 28, 2021

Interesting script 👍🏼

Do you have a similar solution for ReFS repositories, too?

ThePlaneskeeper
Author
Not a newbie anymore
3 comments
3 years ago
July 29, 2021

See:
http://dewin.me/refs/

He has a tool that will do that for you. Parsing that output into a script shouldn’t be too hard. Should be called “blockstat” or something.

Chris.Childerhose
Veeam Legend, Veeam Vanguard
8562 comments
3 years ago
July 29, 2021

This one is very good for ReFS. Use it all the time.

JMeixner
On the path to Greatness
2651 comments
3 years ago
July 29, 2021

Thank you for your reply.

Yes, I know about blockstat. This tool takes a lot of time to get the results.
I was hoping that there was a faster solution somewhere out there. :sunglasses:

I have a script ready which parses the information via blockstat for each folder in a repository….

marcofabbri
On the path to Greatness
990 comments
3 years ago
September 14, 2021

I had never seen this before, thanks @ThePlaneskeeper

Backups are like pizza, I love them. | Linkedin: @marco-fabbri-it

mvl
New Here
2 comments
2 years ago
December 11, 2022

When running this script, the values I am getting do not match reality. After the first run (so without any fast clone), the output with 512-byte blocks was wrong; when I changed it to 4096, it was fine. Now, after a new full has run with reflink, changing it back to 512 gives me about a third of the usage it should be.

I must be doing something wrong, or does the output only show reflink data, so that we need to count the actual full on top of it?

Example: one job was around 21TB after its first full. Now, with a 2nd full, du -h is showing 49TB usage. When running the script, the output is showing 8931GB.

ThePlaneskeeper
Author
Not a newbie anymore
3 comments
2 years ago
December 28, 2022

It’s going to be hard to troubleshoot that without shell access. Not sure how anyone can help with this one.

mvl
New Here
2 comments
2 years ago
January 4, 2023

Well, I have a test box with the same behavior. Maybe I can reach you directly so we can look at this together? It would be nice for us to have this working :)

JayST
Comes here often
11 comments
2 years ago
February 10, 2023

I'm looking into this as well. I'm not able to make the “used” output of “df -h” match the output of the script that uses xfs_bmap. It doesn’t even come close in some cases; the df -h output is always higher.

We currently suspect that xfs_bmap does not include some space that XFS pre-allocates for the backup files, whereas df -h probably does include it.

The next step is to figure out how to get the two to match, to better understand what is going on.

JayST
Comes here often
11 comments
2 years ago
February 11, 2023

OK, time to let ChatGPT help out and tell me why the df command reports a higher used capacity than xfs_bmap does when summarizing the unique data of all files. (Hard lesson: ask ChatGPT for help MUCH earlier… :))

Here are some answers that were quite interesting. Please note that this is direct ChatGPT output, so be careful about taking it as truth, but I think I learned some things and got some direction for the next step of my search.

I also asked it to write me some scripts for xfs_bmap. I suggest others try the same and see what we come up with.

======

Metadata overhead: The df command reports the total used space, including both data blocks and metadata blocks, whereas xfs_bmap only reports on data blocks. This means that the df command will show a higher used capacity because it is including the overhead from metadata blocks.
Unwritten extents: XFS uses a technique called "delayed allocation" to optimize disk performance. This means that when a file is written, the data blocks are not immediately allocated on the disk. Instead, the blocks are marked as "unwritten" and are only written to disk when the file system needs to reclaim disk space. These unwritten extents are counted as "used" by the df command, but are not counted by xfs_bmap because they have not yet been written to disk.
Disk fragmentation: Over time, disk fragmentation can cause the same data blocks to be scattered across different parts of the disk. This can cause the df command to report a higher used capacity, because it is counting each block used by a file as a separate block, even if it is stored in multiple physical locations on the disk.
Inode allocation: The ‘df’ command reports the used capacity based on the block allocation information, whereas ‘xfs_bmap’ provides the physical disk usage of files. XFS uses inodes to store metadata about files, and these inodes consume disk space. If a large number of inodes are allocated but not used, this will increase the used capacity reported by the ‘df’ command but not by ‘xfs_bmap’.
Delayed allocation: XFS uses a feature called delayed allocation, which allows it to defer allocating disk space until it is actually needed. This can result in the used capacity reported by ‘df’ being higher than the actual physical disk usage reported by ‘xfs_bmap’, since ‘df’ reports the used capacity based on the block allocation information, which does not take into account the blocks that have been deferred.

Timothy Dewin
Comes here often
4 comments
1 year ago
June 14, 2023

The problem with this script is that it does a uniq on the lines. Theoretically, if all blocks were aligned, this would work. But of course you are not block cloning at perfect intervals. E.g. here is some of the output I have from just before uniq:

9892208..9898463 6256
9892208..9898463 6256
9892208..9900311 8104
9902000..9905327 3328
9902000..9905327 3328
9902000..9907471 5472

OK, the script detects that lines 1 and 2 are identical, so it deduplicates them. However, look at the last 3 lines: they all have the same starting block but not all the same end block. After uniq, the script counts two of them (3328 + 5472 blocks), but in reality the last line covers all 3 lines.
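
One possible way to account for this is to merge overlapping block ranges before counting, instead of relying on uniq. The following is only a sketch and has not been validated against a real repository: merged_blocks and sample_rows are hypothetical names, and the input is assumed to be the "start..end length" rows that the script produces just before uniq.

```shell
#!/bin/bash
# The six rows shown above, as they look just before uniq.
sample_rows() {
cat <<'EOF'
9892208..9898463 6256
9892208..9898463 6256
9892208..9900311 8104
9902000..9905327 3328
9902000..9905327 3328
9902000..9907471 5472
EOF
}

# Reads "start..end length" rows on stdin and prints how many distinct
# 512-byte blocks the ranges cover. Rows without a numeric range
# (file names, holes) are skipped.
merged_blocks() {
    awk '$1 ~ /^[0-9]+\.\.[0-9]+$/ { split($1, r, "[.][.]"); print r[1], r[2] }' |
        sort -n -k1,1 -k2,2 |
        awk '
            NR == 1 { start = $1 + 0; end = $2 + 0; next }
            # Range overlaps or touches the current one: extend it.
            ($1 + 0) <= end + 1 { if (($2 + 0) > end) end = $2 + 0; next }
            # Gap found: count the finished range and start a new one.
            { total += end - start + 1; start = $1 + 0; end = $2 + 0 }
            END { if (NR > 0) total += end - start + 1; printf "%.0f\n", total }
        '
}

sample_rows | merged_blocks    # prints 13576
```

For the six rows above, uniq followed by a sum gives 6256 + 8104 + 3328 + 5472 = 23160 blocks, while merging the ranges gives 8104 + 5472 = 13576 blocks. To try it on a repository, the uniq stage and everything after it in the script would be replaced by merged_blocks, followed by the same multiplication by 512.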

ThePlaneskeeper
Author
Not a newbie anymore
3 comments
5 months ago
October 24, 2024

That’s interesting, and I had not seen that yet. That can certainly be accounted for but will be somewhat more complicated.

If there is a significant desire for such a thing, I’ll work on that, but at this time that difference is not notable enough (to my client base) that I’ll be investing time into developing. It’s certainly worth considering for people though. Thanks for bringing that up!

JMeixner
On the path to Greatness
2651 comments
5 months ago
October 25, 2024

It would be great to get this sorted out.

Hardened Repositories are getting more and more common. A way to determine the exact storage usage for each client or each job would be much appreciated.

alfonsrv
New Here
1 comment
5 months ago
November 9, 2024

Great script, thanks a lot! Though for me it somehow still shows a weird storage result – all of my 340TB are used according to df – but according to the script it should only be 200TB. Not sure why.

Anyway – minor adjustment to deal with paths that include spaces:

#!/bin/bash

find "/backups/disk-01/backups/" -mindepth 1 -maxdepth 1 -type d | while IFS= read -r clientDir
do
    echo "$clientDir"
    # Quote the directory but leave the * outside the quotes so the shell still expands it.
    clientSpaceUsed=$(find "$clientDir"/* -xdev -type f -exec xfs_bmap -l {} + | awk '{ print $3 " " $4 }' | sort -k 1 | uniq | awk '{ print $2 }' | grep -Eo '[0-9]+' | paste -sd+ | bc | awk '{print $1*512/1024/1024/1024}')
    # xfs_bmap reports lengths in 512-byte blocks. Divide by 1024 for KB, again for MB, and again for GB.
    echo "$clientSpaceUsed GB"
done
