
What are sparse files and how to tell if a file is stored sparsely

Sparse files are files stored in a file system where consecutive data blocks consisting of all zero-bytes (null-bytes) are compressed to nothing. There is often no reason to store lots of empty data, so the file system just records how long the sequence of empty data is instead of writing it out on the storage media. This optimization can save significant amounts of storage space for other purposes.

On most file systems that support sparse files, programs don’t need to do anything special to take advantage of them. It all depends on how individual programs use the lower-level operating system application programming interfaces (APIs) to save the data to a storage medium. Some programs might write out consecutive blocks of null-bytes to the disk. These programs are non-sparse-aware.

For example, every song purchased from Apple Music includes a 0.5 MB block of null-bytes. Apple Music doesn’t take advantage of sparse files and writes them to disk as-is.

Better-written software knows to take advantage of sparse storage, however. Sparse files are commonly supported in BitTorrent clients and virtual hard disk image files for virtualized machines. In both of these scenarios, you don’t have all the data that will eventually go into the file when it’s created. You also won’t be writing data to it sequentially. BitTorrent, for example, downloads files in randomly ordered chunks, so it needs to write the data to disk in an unpredictable order.

There are situations where both approaches, sparse and pre-allocation, make sense. For example, pre-allocation might make sense if you’re leaving empty space in a file because your program knows it will go back and fill it in later.

Sparse storage makes sense when you might write more data to that part of the file at some point in the future. Sparse files are also much faster to write, as you can skip most of the work and delays introduced by writing empty data.
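As a minimal sketch of the sparse approach, the Python snippet below sets a large logical file size without writing any null-bytes, and then writes one small chunk near the end. The function name, file name, and sizes are made up for illustration. Whether the result is actually stored sparsely depends on the file system, and st_blocks is only available on POSIX-style systems.

```python
import os
import tempfile


def create_sparse_file(path, size, chunk=b"data"):
    """Create a file of `size` bytes, physically writing only `chunk` at the end."""
    offset = size - len(chunk)
    with open(path, "wb") as f:
        f.truncate(size)  # set the logical length; no null-bytes are written
        f.seek(offset)    # jump over the untouched region
        f.write(chunk)    # only this chunk needs real storage
    return os.stat(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        st = create_sparse_file(os.path.join(tmp, "example.bin"), 64 * 1024 * 1024)
        print("apparent size in bytes:", st.st_size)
        # POSIX counts st_blocks in 512-byte units, regardless of the real block size.
        print("allocated bytes:", st.st_blocks * 512)
```

On a file system with sparse support, the allocated figure should come out far smaller than the apparent size.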

Which file systems support sparse files

Most modern file systems and operating systems support sparse files. It’s widely recognized as a beneficial feature with many use-cases. Some file systems, especially older ones, don’t support sparse files at all or require special handling to create them.

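OS        File system   Sparse support
Linux     Bcachefs      Supported
          Btrfs         Supported
          Ext4          Supported
          XFS           Supported
          OCFS2         Supported
          OpenZFS       Supported
FreeBSD   ZFS           Supported
          UFS2          Supported
Solaris   Oracle ZFS    Supported
MacOS     APFS          Supported
          HFS+          Unsupported
          exFAT         Unsupported
Windows   FAT32         Unsupported
          NTFS          Requires FSCTL_SET_SPARSE attribute
          ReFS          Requires FSCTL_SET_SPARSE attribute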

There’s some extra work required to use sparse files on Windows. You must first create an empty file, set the FSCTL_SET_SPARSE attribute on it, and close the file pointer. You then need to re-open a file pointer to the file before you can write any data to it sparsely.

It’s not a hugely complicated process, but the added complexity is enough to cause many programs to not bother supporting sparse files on Windows. Notably, closing file handles is slow on Windows because it triggers indexing and virus scans (video).

How to identify sparse files

On FreeBSD, Linux, MacOS, and Solaris, you can check a file’s sparseness using the ls -lsk test.file command and some arithmetic. The command prints several columns: the first contains the sector allocation count (the on-disk storage size in 1024-byte blocks), and the sixth contains the apparent file size in bytes. Take the first number and multiply it by 1024. The file is sparse (or compressed by the file system!) if the resulting number is lower than the apparent file size.
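The same arithmetic can be sketched in Python. The function names and the sample numbers are made up for illustration. Note one difference from ls -lsk: os.stat reports st_blocks in 512-byte units on POSIX systems, so the multiplier is 512 rather than 1024.

```python
import os


def looks_sparse(apparent_bytes, allocated_kib):
    """Apply the ls -lsk rule: first column * 1024 compared to the sixth column."""
    return allocated_kib * 1024 < apparent_bytes


def file_looks_sparse(path):
    """Same check for a real file, based on os.stat (POSIX only)."""
    st = os.stat(path)
    allocated_bytes = st.st_blocks * 512  # st_blocks uses 512-byte units
    return allocated_bytes < st.st_size


if __name__ == "__main__":
    # Hypothetical ls -lsk columns: 4 allocated blocks, 1048576 apparent bytes.
    # 4 * 1024 = 4096, which is lower than 1048576.
    print(looks_sparse(apparent_bytes=1048576, allocated_kib=4))
```

As with the ls method, a true result can also mean the file system compressed the file.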

The above method is easy to memorize and is portable across all file systems and operating systems (except Windows). On Windows, you can check whether a file is sparse using the fsutil sparse queryflag test.file command.

You can also get these numbers everywhere (except Windows) with the du (disk usage) command. However, its arguments and capabilities are different on each operating system, so it’s more difficult to memorize.

Recent versions of FreeBSD, Linux, MacOS, and Solaris include an API that can detect sparse “holes” in files. This is the SEEK_HOLE extension to the lseek function, detailed in its man page.
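To illustrate the API, here is a rough Python sketch that walks a file with SEEK_DATA and SEEK_HOLE and totals up the data and hole regions. It is not the sparseseek program, and the function name is made up. It assumes a platform where Python exposes os.SEEK_DATA and os.SEEK_HOLE. On file systems that can’t report holes, the whole file is reported as data.

```python
import errno
import os


def scan_extents(path):
    """Return (data_bytes, hole_bytes) for a file using SEEK_DATA and SEEK_HOLE."""
    data = holes = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        pos = 0
        while pos < size:
            try:
                # Find the start of the next data region at or after pos.
                start = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError as err:
                if err.errno != errno.ENXIO:
                    raise
                # ENXIO: no more data, so the rest of the file is one hole.
                holes += size - pos
                break
            holes += start - pos
            # Find where this data region ends (the next hole or end of file).
            end = os.lseek(fd, start, os.SEEK_HOLE)
            data += end - start
            pos = end
    finally:
        os.close(fd)
    return data, holes
```

Calling scan_extents("test.file") returns two byte counts that always add up to the apparent file size.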

I built a small program using the above API. My sparseseek program scans files and lists how much of each is stored as sparse holes and how much is data. (I initially named it sparsehole and didn’t notice the problem before reading it aloud.)

Program support

Every program that doesn’t operate at the lowest levels of the file systems should transparently support sparse files. That being said, you generally don’t end up with a sparse file when you make a copy of a sparse file. The same applies if you use a non-sparse aware program to manipulate a sparse file.

For example, a text editor may overwrite the entire contents of the file, including writing out all the null-bytes, if you make a single change to the beginning of the file. Some programs will seek to the correct position in the file, remove what’s there, apply your changes, and close the file without breaking its sparseness.

On Linux, the cp (copy) command will make a sparse copy of a sparse file. You can also specify that it should attempt to make a sparse file from a non-sparse original using cp --sparse=always. The copy command on FreeBSD, MacOS, and Solaris doesn’t support this feature.
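The general idea behind a sparse-creating copy can be sketched in a few lines of Python. This is an illustration of the technique, not how cp is implemented: it reads the source in fixed-size blocks, seeks past any block that contains only null-bytes instead of writing it, and sets the final length at the end. The function name and the 4096-byte block size are assumptions.

```python
import os


def sparse_copy(src, dst, block_size=4096):
    """Copy src to dst, leaving holes wherever a whole block is null-bytes."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            block = fin.read(block_size)
            if not block:
                break
            if not any(block):
                # All null-bytes: move the write position forward without writing.
                fout.seek(len(block), os.SEEK_CUR)
            else:
                fout.write(block)
        # If the file ends in a hole, nothing was written there,
        # so set the final length explicitly.
        fout.truncate(fin.tell())
```

The copy reads back byte-for-byte identical to the original. Whether the skipped regions are stored as holes is up to the destination file system.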

On a copy-on-write (COW) file system, you can clone files instead of copying them. A cloned file will preserve the sparseness (among other attributes) of the original file.

Finder on MacOS will clone files on APFS when you copy them. Nautilus and Dolphin on Linux aren’t file-system aware and will make complete and unsparse copies (even when you specifically choose Clone from their context menus).

Other classic file archival and transfer programs like tar and rsync support sparse copies. On FreeBSD, MacOS, and Linux, you can extract tarballs in a sparse-aware way using the -S argument. On Linux, the same argument can create sparse tarballs as well.

You can convert any file to a sparse file on any operating system (except Windows) using rsync -S test.file copied.file. On Linux, you can make a file sparse in-place using fallocate -d test.file (assuming it contains any blocks that can be stored as sparse).

Sparse file vs file system compression

Contiguous blocks of null-bytes compress well. You can expect about the same on-disk savings for null-byte regions from sparse storage as from a file system with compression enabled. As discussed earlier in the article, it can be difficult to distinguish between compressed and sparse files.

A compressing file system may be better at compressing smaller sets of null-bytes in a file with many small holes. However, the two aren’t equivalent when it comes to performance. It takes significantly more time, even with a fast processor and storage system, to write and compress all those null-bytes than it takes to never write them in the first place.

You can benchmark the difference on your file system using my sparse vs nullwrites.rb program. The test creates two test files: one 20 GB sparse and one that writes out 20 GB of null-bytes. The program also tests that the file system supports sparse storage.

On my workstation, with a fast octa-core processor and an M.2 PCIe NVMe disk, the sparse file is created in hundredths of a second and the full write takes a full 10 seconds. This benchmarking program doesn’t set the required sparse attribute in Windows.
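A comparison along these lines can be sketched in Python. This is not the benchmarking program mentioned above, only a rough illustration of the same idea with made-up function names and a much smaller default size (32 MiB rather than 20 GB). Both functions call fsync so the timing includes getting the data to the storage device.

```python
import os
import tempfile
import time


def time_sparse(path, size):
    """Time creating a file of `size` bytes without writing any data."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        f.truncate(size)
        os.fsync(f.fileno())
    return time.perf_counter() - start


def time_null_writes(path, size, block_size=1024 * 1024):
    """Time creating a file of `size` bytes by writing out null-bytes."""
    zeros = bytes(block_size)
    start = time.perf_counter()
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = min(block_size, remaining)
            f.write(zeros[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())
    return time.perf_counter() - start


if __name__ == "__main__":
    size = 32 * 1024 * 1024
    with tempfile.TemporaryDirectory() as tmp:
        sparse = time_sparse(os.path.join(tmp, "sparse.bin"), size)
        full = time_null_writes(os.path.join(tmp, "nulls.bin"), size)
        print(f"sparse: {sparse:.4f} s, null writes: {full:.4f} s")
```

The numbers will vary with the file system, storage device, and caching, so treat the output as a rough indication only.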