Support efficient hashing of sparse files #451

frankdavid · 2025-02-19T17:13:27Z

b3sum is quite slow when operating on sparse files.

To reproduce, create a sparse file with some data and a hole and invoke b3sum:

$ dd bs=1G count=1 if=/dev/random of=/tmp/sparse.dat
$ truncate -s 30G /tmp/sparse.dat

$ ls -lsh /tmp/sparse.dat
# Observe that the file size is 30G but only 1.1G on disk

$ time b3sum /tmp/sparse.dat

real	0m22.195s
user	0m12.049s
sys	3m46.161s

Solution:

Linux lets us identify where the data resides in a sparse file by calling lseek with SEEK_DATA and SEEK_HOLE. This can be used to optimize the implementation.
Before processing a file, we collect the list of data segments using lseek.
While we process the input, we use that list to know whether the current offset is in a data segment or in a hole.
When we are in a hole, instead of reading from the mmap-ed input, we read from a static zero array of the same size. The benefit is that reading from the static zero array does not cause a page fault and the switch into the kernel that comes with it. Furthermore, the static zero array will likely be in L1 cache.
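
To make the idea concrete, here is a minimal Python sketch of the same approach. It is not the POC code: it uses os.pread instead of mmap, and hashlib.blake2b stands in for BLAKE3 (which is not in the Python standard library). The names data_segments, feed_zeros, hash_sparse and CHUNK are made up for this illustration. It assumes a platform where os.SEEK_DATA and os.SEEK_HOLE are available (e.g. Linux).

```python
import errno
import hashlib
import os

CHUNK = 1 << 20                      # 1 MiB step size (arbitrary choice for this sketch)
ZEROS = memoryview(bytes(CHUNK))     # shared zero buffer substituted for holes


def data_segments(fd, size):
    """Return [(start, end), ...] byte ranges that contain data, found via lseek."""
    segments = []
    pos = 0
    while pos < size:
        try:
            start = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:   # no data after pos: the rest is a hole
                break
            raise
        end = os.lseek(fd, start, os.SEEK_HOLE)  # EOF counts as an implicit hole
        segments.append((start, end))
        pos = end
    return segments


def feed_zeros(hasher, count):
    """Feed `count` zero bytes to the hasher without touching the file."""
    while count > 0:
        n = min(count, CHUNK)
        hasher.update(ZEROS[:n])
        count -= n


def hash_sparse(path):
    """Hash a file, reading only data segments and substituting zeros for holes."""
    hasher = hashlib.blake2b()       # stand-in for BLAKE3 in this sketch
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        pos = 0
        for start, end in data_segments(fd, size):
            feed_zeros(hasher, start - pos)      # hole before this segment
            offset = start
            while offset < end:
                chunk = os.pread(fd, min(CHUNK, end - offset), offset)
                if not chunk:
                    raise OSError("file shrank while hashing")
                hasher.update(chunk)
                offset += len(chunk)
            pos = end
        feed_zeros(hasher, size - pos)           # trailing hole, if any
    finally:
        os.close(fd)
    return hasher.hexdigest()
```

On a filesystem without hole support, SEEK_DATA and SEEK_HOLE report the whole file as one data segment, so the sketch falls back to a plain sequential read and still produces the same digest.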

The effect of this optimization is quite significant:

real	0m0.667s
user	0m9.013s
sys	0m0.228s

The POC can be found here. If you're happy with this direction, I'll refactor the code slightly and add tests.
