How to distinguish between files, links, and cloned files

Except for the Apple File System (APFS), file systems don’t keep track of whether a file is a clone or has been cloned. The information is inconsequential to how modern file systems work. However, curious people might want to know whether a path is a regular file, a symbolic or hard link, or a clone of another file. If that’s you, this article is for you.

For this article, I’ll assume you’re somewhat familiar with cloning-capable/deduplicating file systems. Read my primer on which file systems support cloning if you’re new to the topic.

It’s hard to identify whether a file is a clone of another file. On Linux, there are a few programs available that claim to do this. I’ve tested five such programs and have concluded that they’re red herrings that fail to definitively answer the question. I won’t name the programs or shame their developers, but I’ll outline the problems, so you know what to look out for.

Four followed symbolic and hard links and wrongly concluded that the links were clones rather than links. They should have inspected the links themselves and not their destinations.
Three only considered identical files to be cloned, and didn’t account for partial clones (files that share some but not all data).
Two only checked the first block of data, and simply assumed the rest of the file to be an identical clone. This approach is much faster than comparing the entire file, but it’s also likely to give you the wrong answer.
One checksummed the file contents and proclaimed them to be cloned if they matched up, regardless of how the files were actually stored in the file system. This approach is naïve but often right.

On Linux, you can compare two file paths to get a definite answer about whether one is a clone of the other. The rest of this article guides you through the process.

First, let’s weed out symbolic links, hard links, and files that are stored on different devices:

Run the command stat file1 file2 (part of GNU Coreutils).
Compare the Device identifier and confirm that they match.

The device identifier uniquely identifies the file system’s mount point. You can only clone files within the same mount point. Btrfs subvolumes, for example, are mounted on different mount points even though they’re the same “file system.” You can’t clone a file from one subvolume to another.

Check that the files are reported as “regular file” and not “symbolic link.”
Compare their inode numbers and confirm that they differ.

Hard links share the same inode number as their destination, whereas clones have their own inodes. This distinction (plus a copy-on-write file system) is what enables clones to act independently of their originals even when modified by programs that aren’t cloning-aware.
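The checks so far can be sketched in a few lines of Python. This is a minimal sketch rather than a definitive tool: the function name classify_pair and its verdict strings are made up for illustration. It relies on os.lstat, which inspects the path itself instead of following a symbolic link to its destination.

```python
import os
import stat
import tempfile


def classify_pair(path_a, path_b):
    """Return a short verdict on whether two paths could be clones.

    Hypothetical helper: the verdict strings are illustrative only.
    """
    # lstat() reports on the path itself and does not follow symbolic links.
    a, b = os.lstat(path_a), os.lstat(path_b)
    if a.st_dev != b.st_dev:
        return "different devices"
    if not (stat.S_ISREG(a.st_mode) and stat.S_ISREG(b.st_mode)):
        return "not regular files"
    if a.st_ino == b.st_ino:
        # Same device and same inode: two names for one file.
        return "hard links"
    # Separate inodes on the same device: cloning is possible, not proven.
    return "clone candidates"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        original = os.path.join(folder, "original")
        hard = os.path.join(folder, "hard")
        soft = os.path.join(folder, "soft")
        with open(original, "w") as handle:
            handle.write("example data\n")
        os.link(original, hard)
        os.symlink(original, soft)
        print(classify_pair(original, hard))
        print(classify_pair(original, soft))
```

A verdict of “clone candidates” only means that the paths passed these checks. It says nothing yet about whether they share any data.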

So far, we’ve confirmed that your paths are [probably] real files. Next, let’s check whether they’re sharing any cloned data between them.

Run the command filefrag -v file1 file2 (part of e2fsprogs).
Compare the files’ physical_offset ranges within the extent rows that have the shared flag set.

The two files share deduplicated/cloned data on the storage drive if they share any identical or overlapping ranges.
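The comparison can also be automated. The sketch below assumes the column layout that filefrag -v typically prints for extent rows (index, logical range, physical range, length, an optional expected offset, then flags); that layout may differ between e2fsprogs versions. The names EXTENT_ROW, shared_physical_ranges, ranges_overlap, and run_filefrag are hypothetical helpers, not part of any existing tool.

```python
import re
import subprocess

# Assumed row layout, for example:
#    0:        0..     255:      34816..     35071:    256:             last,shared,eof
EXTENT_ROW = re.compile(
    r"^\s*\d+:"                   # extent index
    r"\s*(\d+)\.\.\s*(\d+):"      # logical_offset range
    r"\s*(\d+)\.\.\s*(\d+):"      # physical_offset range
    r"\s*\d+:"                    # length
    r"(?:\s*\d+:)?"               # expected offset (often blank)
    r"\s*(.*)$"                   # comma-separated flags
)


def shared_physical_ranges(filefrag_output):
    """Collect (start, end) physical ranges of extents flagged as shared."""
    ranges = []
    for line in filefrag_output.splitlines():
        match = EXTENT_ROW.match(line)
        if not match:
            continue  # header, summary, or other non-extent line
        flags = [flag.strip() for flag in match.group(5).split(",")]
        if "shared" in flags:
            ranges.append((int(match.group(3)), int(match.group(4))))
    return ranges


def ranges_overlap(ranges_a, ranges_b):
    """True if any inclusive range in one list overlaps one in the other."""
    return any(
        a_start <= b_end and b_start <= a_end
        for a_start, a_end in ranges_a
        for b_start, b_end in ranges_b
    )


def run_filefrag(path):
    """Run filefrag -v on a single path so the output is unambiguous."""
    result = subprocess.run(
        ["filefrag", "-v", path], capture_output=True, text=True, check=True
    )
    return result.stdout
```

With these helpers, ranges_overlap(shared_physical_ranges(run_filefrag("file1")), shared_physical_ranges(run_filefrag("file2"))) returns True when the two files have at least one shared extent in common. It also covers partial clones, because a single overlapping range is enough.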

As for determining which is the original and which is the clone: that is almost impossible without a time machine. Luckily, many copy-on-write file systems can act as time machines. Assuming you’re taking frequent file system snapshots, you can compare old ones to the current state to determine which file was created first.

You can alternatively look at the files’ birth times (when the files were created) as reported by the stat command. This isn’t guaranteed to give the definitive answer, but it’s a strong indicator. You can expect to find this field in supported file systems (including Btrfs, Ext4, OCFS2, and XFS) starting in Fedora Linux 33 and Ubuntu 20.10 (both releases expected around October 2020). The birth time field is empty on older Linux distributions.
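As a rough illustration, the birth times can be read and compared programmatically. This sketch assumes GNU stat, whose %W format sequence prints the birth time in seconds since the Epoch, or 0 when it’s unknown. The helper names birth_time and likely_original are hypothetical, and the result is only an indicator, as noted above.

```python
import subprocess


def birth_time(path):
    """Return the birth time in whole seconds, or None if it is unknown."""
    result = subprocess.run(
        ["stat", "--format=%W", path],
        capture_output=True, text=True, check=True,
    )
    value = result.stdout.strip()
    # GNU stat prints 0 when the birth time is unavailable.
    if not value.isdigit() or int(value) == 0:
        return None
    return int(value)


def likely_original(birth_a, birth_b):
    """Guess which file came first; None means the answer is undecidable."""
    if birth_a is None or birth_b is None or birth_a == birth_b:
        return None
    return "first" if birth_a < birth_b else "second"
```

For example, likely_original(birth_time("file1"), birth_time("file2")) returns "first" when file1 is older. Note that %W only has one-second resolution, so files created within the same second can’t be ordered this way.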

I haven’t been able to identify a similar process for identifying cloned files on macOS or Windows. The tools for it simply don’t exist yet. APFS counts the number of times a file has been cloned. This count doesn’t help you identify its clones or whether it’s currently sharing data with any of them.

Strictly speaking, identifying whether a file is a clone or not is only relevant if you’re troubleshooting a file-cloning capable application. However, it shouldn’t be this complicated to debug this file system feature!