Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support efficient hashing of sparse files #451

Open
frankdavid opened this issue Feb 19, 2025 · 0 comments
Open

Support efficient hashing of sparse files #451

frankdavid opened this issue Feb 19, 2025 · 0 comments

Comments

@frankdavid
Copy link

frankdavid commented Feb 19, 2025

b3sum is quite slow when operating on sparse files.

To reproduce, create a sparse file with some data and a hole and invoke b3sum:

$ dd bs=1G count=1 if=/dev/random of=/tmp/sparse.dat
$ truncate -s=30G /tmp/sparse.dat

$ ls -lsh /tmp/sparse.dat
# Observe that the file size is 30G but only 1.1G on disk

$ time b3sum /tmp/sparse.dat

real	0m22.195s
user	0m12.049s
sys	3m46.161s

Solution:

  1. Linux provides a way using lseek with SEEK_DATA+SEEK_HOLE to identify where the data resides in a sparse file. This can be used to optimize the implementation.
  2. Before processing a file, we collect the list of data segments using lseek.
  3. While we process the input, we use the list of data segments to know whether we are in a data segment or a hole segment.
  4. When we are in a hole segment, instead of reading from the mmap-ed input, we read from a static zero array of the same size. The benefit of this is that reading from the static zero array does not cause a page fault and a context switch. Furthermore, the static zero array will likely be in L1 cache.

The effect of this optimization if quite significant:

real	0m0.667s
user	0m9.013s
sys	0m0.228s

The POC can be found here. If you're happy with this direction, I'll still refactor the code slightly and add tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant