
Four P2P distribution tools for Git repositories compared

Git is a version control system that is decentralized by design. Anyone can run git daemon in a repository to start a Git server. You can also host your repository using a regular web server and HTTP infrastructure. More commonly, though, repositories are distributed through centralized hub services like BitBucket, GitHub, and GitLab. It’s quick, easy, and free to “throw your code up on GitHub” and call it a day. However, there is also a growing number of peer-to-peer (P2P) distributed options to consider.

What if you could distribute your Git repository using the BitTorrent P2P protocol without the need for a central server, and without having to depend on a commercial business’ generosity and hosting infrastructure? That’s the idea behind GitTorrent, an experimental Git helper and overlay protocol for transferring Git repositories over the popular P2P protocol.

GitTorrent does away with the idea of a central code distribution server. Instead, it relies on the people who’re using and participating in the project to contribute bandwidth and handle its distribution.

Similar concepts have popped up around other peer-to-peer protocols, including Dat Protocol and IPFS. Each project has made different implementation choices and ended up with a system that appears similar to the others at first glance but has fundamentally different trade-offs and priorities. In this article, I’ll explore these differences in depth and do a comprehensive comparison.

I’ll kick off with the following comparison table with some key features and limitations. There’s a lot to digest in it and I’ll discuss each item in turn below the table.

Comparison table of P2P Git clients
Feature / Client | GitTorrent | HyperGit | igis-remote | ipld-remote
Protocol | BitTorrent | Dat | IPFS
git-remote | gittorrent: | hypergit: | ipns: | ipld:
Project activity | Inactive (2015) | Inactive (2018) | Active (2020) | Active (2019)
Runtime | Node.js | Go
Peer discovery | DHT (not bootstrapped) | Tracking server, mDNS-SD | DHT, mDNS-SD
Repo. updates | Git server, side-channel, DHT | Peer swarm | IPNS | Side-channel
Repo mutability | Mutable | Immutable
Packing strategy | On-demand packing | Unpacked
Data de-duplication | None, compressed | None, un-compressed | Global, un-compressed
File size limit | RAM | No inherent limits. | 2 MB
Data loss | Of course not. | Your repository is deleted when IPFS runs GC.
Hash algorithm | SHA-1 | Ed25519 | SHA-256 | SHA-1/SHA-256

I’ll start by discussing the status of each project and then move on to discuss how they do things differently.

Project activity

GitTorrent saw an initial burst of development, but the project seems to have since been abandoned by its creator. You’ll notably start out with several security and deprecation warnings if you try to install and run it. It requires some tweaks to its dependencies and code to work with today’s version of the Node.js runtime.

HyperGit similarly saw an initial burst of development and also appears to have been abandoned. It also requires some minor fixes to install on a recent version of Node.js. HyperGit seems to be the least polished option of the ones discussed in this article.

IPFS/IPLD has seen steady development. The IPFS/IGIS fork came along later and addressed many of the limitations of the IPFS/IPLD implementation. Both IPFS implementations are excruciatingly slow to process pushes even though they take place locally on your computer.

Peer discovery

GitTorrent uses an implementation of the BitTorrent mainline distributed hash table (DHT) to discover others who’re sharing the repository you want to download. Instead of querying a peer database on a centralized server, you query the other participants in the DHT to discover which peers host the Git repository you’re interested in.
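To make the idea of querying other participants a bit more concrete, here is a minimal sketch of the XOR distance metric that Kademlia-style DHTs such as the BitTorrent mainline DHT use to decide which nodes are “closest” to a key. The peer names and the way IDs are derived from labels are made up for illustration; this isn’t GitTorrent’s code.

```python
import hashlib


def node_id(label: str) -> int:
    """Derive a 160-bit ID from a label (an assumption for this sketch;
    real nodes choose their IDs differently)."""
    return int.from_bytes(hashlib.sha1(label.encode()).digest(), "big")


def closest_nodes(target: int, nodes: dict[str, int], count: int = 3) -> list[str]:
    """Return the names of the `count` nodes nearest to `target` by XOR distance."""
    return sorted(nodes, key=lambda name: nodes[name] ^ target)[:count]


# Hypothetical peers and a hypothetical repository key.
nodes = {f"peer-{i}": node_id(f"peer-{i}") for i in range(20)}
target = node_id("example-repository")

# A lookup asks these nodes next, and they know of nodes that are closer still.
nearest = closest_nodes(target, nodes)
print(nearest)
```

Each lookup step repeats this selection with the nodes learned so far, which is why a client needs at least one reachable node to get started.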

To connect to the DHT, you need to go through what is known as a bootstrap/introduction server. GitTorrent’s bootstrap server has gone offline. I’ve discussed previously how DHT can be made more resilient. You can configure a different mainline DHT-compatible bootstrap server. However, other people using GitTorrent must also manually configure a bootstrap server on the same DHT. This makes it harder to adopt GitTorrent in a project.

HyperGit relies on the Dat Protocol project’s tracking server. A tracking server is a centralized database server that fulfills the same function as a DHT. It keeps track of which clients have which repositories and answers queries from other clients. As evidenced by GitTorrent’s bootstrapping server being offline, the tracking server is a single-point-of-failure in otherwise distributed systems. I’ve recorded over 21 outages of the Dat Protocol’s tracking server in 2018 and 2019.

IPFS also uses a DHT. Its DHT bootstrapping process could benefit from increased resilience the same way the others can. Notably, HyperGit and GitTorrent use the DHT to discover a Git repository, and then query the peers they discover for who has which parts, or “chunks”, of the repository. IPFS, on the other hand, uses a DHT for every single data chunk globally. This design is part of the project’s goal of global data de-duplication (more on that later).

However, IPFS’ architecture creates an enormous overhead of DHT traffic compared to the other protocols. It also fails to benefit from the assumed knowledge that peers who have one chunk of the repository you’re interested in are likely to also have more chunks you’re interested in.

Lastly, Dat and IPFS can discover peers on the immediate local network (LAN) through Multicast DNS Service Discovery (mDNS-SD). This process — also known as zero-configuration networking (Zeroconf), or Apple Bonjour or Rendezvous — can be useful in office settings where everyone interested in the Git repository is connected to the same local network. It’s not as relevant or useful in these times of remote work, however.

Updates and mutability

Dat archives — as used by HyperGit — are append-only file systems. An archive’s creator holds a special private cryptographic key that allows them to append new data to the end of the archive. They can add new files and new revisions of existing files to it, but can’t remove or change an old version from the file system log. Everything is versioned.

Peers in the network announce to each other the newest version they have of an archive, and query for newer versions at the same time.

On the other hand, “archives” on BitTorrent (as used by GitTorrent) and IPFS are immutable. You normally can’t make changes to a “torrent” transfer or an IPFS file once it has been created. Both protocols use a file’s cryptographic hash as its network address. Change the file and you change its hash.
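A minimal sketch of content addressing shows why this is so. I’m using a plain SHA-256 digest as the “address” here for illustration; the real protocols wrap their hashes in their own formats, and the function name is my own.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Use the hash of the content itself as its address (illustrative only)."""
    return hashlib.sha256(data).hexdigest()


original = b"print('hello')\n"
edited = b"print('hello!')\n"

# The same bytes always yield the same address ...
print(content_address(original) == content_address(bytes(original)))
# ... while a one-character edit yields an entirely different one.
print(content_address(original) == content_address(edited))
```

Anyone holding the address can verify the data they receive, but nobody can publish a changed file under the old address.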

GitTorrent solved this by building mutable torrents on top of BEP-44: Storing arbitrary data in the DHT. This “arbitrary data” is signed with the same cryptographic key as the main torrent. To push updates, the private key-holder pushes the hash of the newest commit to the DHT.

Clients initially download the full Git repository as of the time the torrent was created. Clients can then query the DHT to find the latest commit. They can then query other peers for the commits between the latest version they already have and the latest commit they found in the DHT. I’ll discuss this a bit more in the next section.
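As a rough sketch of the idea (not GitTorrent’s actual code), a mutable pointer can be modeled as a record stored under the hash of the publisher’s public key, with a sequence number so that nodes only accept newer values. The ToyDHT class and its field names are made up for illustration, and the signature check that a real BEP-44 implementation requires is left out entirely.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MutableRecord:
    public_key: bytes  # identifies the publisher (placeholder bytes in this sketch)
    seq: int           # increases with every update
    value: bytes       # e.g. the hash of the newest commit


class ToyDHT:
    """In-memory stand-in for a DHT. Signature verification is omitted."""

    def __init__(self) -> None:
        self._items: dict[str, MutableRecord] = {}

    @staticmethod
    def target(public_key: bytes) -> str:
        # The storage key is derived from the public key, so it stays stable
        # while the stored value changes.
        return hashlib.sha1(public_key).hexdigest()

    def put(self, record: MutableRecord) -> bool:
        key = self.target(record.public_key)
        current = self._items.get(key)
        if current is not None and record.seq <= current.seq:
            return False  # stale or replayed update
        self._items[key] = record
        return True

    def get(self, public_key: bytes) -> Optional[MutableRecord]:
        return self._items.get(self.target(public_key))


dht = ToyDHT()
key = b"placeholder-public-key"
dht.put(MutableRecord(key, seq=1, value=b"aaaa"))
dht.put(MutableRecord(key, seq=2, value=b"dddd"))
stale_accepted = dht.put(MutableRecord(key, seq=1, value=b"aaaa"))
print(dht.get(key).value, stale_accepted)
```

The address that people share never changes; only the value stored behind it does.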

The IPFS project has an experimental side-project called the InterPlanetary Name Service (IPNS). It’s like the Domain Name System (DNS) but stored in the DHT. Like DNS, IPNS can translate one address to another type of address. In the case of IPNS, it turns one mutable hash into an immutable hash. You can also use something called DNSLink to piggyback on the same type of look-up using DNS instead of IPNS.

IPNS has been plagued by unreliability and poor performance since its inception. I’d recommend you use DNSLink instead of IPNS with IPFS. Remember, DNS is designed to be decentralized through features like secondary authoritative servers and caching recursive resolvers.

Your IPNS address or your DNSLink-enabled domain name would resolve to the IPFS hash of the repository’s newest commit. The IPFS/IGIS implementation supports doing this automatically for IPNS on Git pushes. The IPFS/IPLD implementation requires you to update your IPNS or DNSLink manually or communicate updates through another side-channel.
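A DNSLink entry is a DNS TXT record whose value looks like dnslink=/ipfs/<hash>. Here is a small sketch of how a client might interpret such a value once it has fetched the record. The parse_dnslink function is my own illustration and the hash in the example is a placeholder, not a real address.

```python
from typing import Optional


def parse_dnslink(txt_value: str) -> Optional[tuple[str, str]]:
    """Split a DNSLink TXT value into (namespace, identifier), or return None
    if the value isn't a DNSLink entry."""
    prefix = "dnslink="
    if not txt_value.startswith(prefix):
        return None
    parts = txt_value[len(prefix):].split("/")
    # Expected shape: ['', 'ipfs' or 'ipns', '<identifier>', ...]
    if len(parts) < 3 or parts[0] != "" or parts[1] not in ("ipfs", "ipns") or not parts[2]:
        return None
    return parts[1], parts[2]


# Placeholder value; a real record would carry the hash of your newest commit.
print(parse_dnslink("dnslink=/ipfs/QmPlaceholderHashOfNewestCommit"))
print(parse_dnslink("v=spf1 -all"))
```

Updating the repository then comes down to publishing a new TXT value, which is the manual step the IPFS/IPLD implementation leaves to you.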

Packing strategy and data de-duplication

Git normally packs individual object files (commits, directory trees, and file contents) into single pack files. These are delta- and deflate-compressed on disk to de-duplicate repeated data within the same pack and shrink their file size. This greatly reduces the disk storage requirement of your Git repository. Git may need to repack these pack files when commits are orphaned (e.g. from a dropped branch), or to improve packing efficiency.
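The effect is easy to demonstrate with a toy example. The sketch below compresses 25 made-up revisions of a file one by one and then all together, using zlib alone as a stand-in. Git’s real pack format additionally stores deltas between similar objects, so this only illustrates the principle and not Git’s implementation.

```python
import zlib

# 25 made-up revisions of the same file that differ only in their last line.
base = "\n".join(f"line {n}: some source code that rarely changes" for n in range(40))
revisions = [(base + f"\nrevision marker {i}\n").encode() for i in range(25)]

# "Unpacked": every revision is compressed on its own.
loose_size = sum(len(zlib.compress(revision)) for revision in revisions)

# "Packed": the compressor sees all revisions and can reference repeated data.
packed_size = len(zlib.compress(b"".join(revisions)))

print(f"compressed individually: {loose_size} bytes")
print(f"compressed together:     {packed_size} bytes")
```

The combined stream comes out smaller because the data shared between revisions only needs to be stored once.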

You don’t want to make changes to existing data in a distributed file system over time, though. Needlessly changing data that everyone already has a copy of requires them to redownload the same data in slightly different packaging. The data payload of a Git commit is supposed to be immutable (unchangeable). This is where an unpacked Git repository comes into play.

You can simply choose to not use object packing within your repository. An unpacked repository stores each commit in a set of separate files rather than packing everything neatly into one compressed file. That might sound like a trivial difference. However, the Git software project repository (as of commit 07d8ea56f2) is 118 MB packed and 2.8 GB unpacked. That’s a massive 2273 % increase in the amount of data people will need to download to retrieve a copy of the Git project repository.
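For anyone who wants to check the arithmetic, here it is spelled out, assuming 1 GB equals 1000 MB:

```python
packed_mb = 118
unpacked_mb = 2800  # 2.8 GB, assuming 1 GB = 1000 MB

ratio = unpacked_mb / packed_mb
increase_percent = (unpacked_mb - packed_mb) / packed_mb * 100

print(f"{ratio:.1f}x the size, a {increase_percent:.0f} % increase")
# prints "23.7x the size, a 2273 % increase"
```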

IPFS objects are content-addressable and immutable. You can’t modify a file without changing its IPFS address. IPFS’ whole deal, however, is global de-duplication of content-addressable data. Two IPFS nodes that add the exact same file would end up with the same content-address for it. This also means that — assuming the Git repository is unpacked — each individual Git object is de-duplicated globally.

Global data de-duplication, in relation to peer-to-peer distribution, is most interesting with regard to forked repositories. An upstart project that forks off from an established project will share commit history (and hosting) with its parent repository for eternity. The more people who are interested in the same content chunks, the greater their availability and longevity in the IPFS network. Every project gains increased availability in the network by having more shared data chunks.

IPFS pinning services can help increase the availability of your Git repository. However, they’re likely to overcharge for duplicated chunks.
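A toy content-addressed store illustrates why a fork costs so little extra. The ContentStore class and the byte strings standing in for Git objects are made up for this sketch; a real IPFS node works with its own block and address formats.

```python
import hashlib


class ContentStore:
    """Toy store that keys every block by the hash of its content."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        self.blocks[address] = data  # identical content lands on the same key
        return address


store = ContentStore()
upstream = [b"commit 1", b"commit 2", b"commit 3"]
fork = upstream + [b"commit 4 (fork only)"]

upstream_addresses = [store.add(obj) for obj in upstream]
fork_addresses = [store.add(obj) for obj in fork]

# Seven objects were added, but only four distinct blocks are stored.
print(len(store.blocks))
```

The fork only contributes the one object that the parent repository doesn’t already have.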

BitTorrent and Dat, on the other hand, are entirely focused around the model of a “torrent” or “Dat archive”. Data is only exchanged around one of these objects and it never crosses over. BitTorrent has a vaguely defined standard for leeching chunks off another somehow-related torrent that the downloader is assumed to maybe have previously downloaded. This isn’t implemented in many clients and it’s not found in GitTorrent either.

However, GitTorrent is smarter than your average BitTorrent client. It can request that peers pack and transfer a set of Git objects it needs. For example, if the last commit it has is commit aaaa and the newest commit is dddd, it can request that a peer packs all the commits between those two commits. The sender will need to spend extra processing time on assembling and compressing a pack for each receiver. However, this approach significantly reduces the disk I/O and network overhead involved in sending loads of tiny files. This is similar to how a “Git smart” web server works.
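The sender’s side of such a request can be sketched as a walk through the commit history. This sketch assumes a strictly linear history where every commit has a single parent; real Git history is a graph with merges, and the commits_between function is my own illustration rather than anything from GitTorrent.

```python
from typing import Optional


def commits_between(parents: dict[str, Optional[str]], have: str, want: str) -> list[str]:
    """List the commits after `have` up to and including `want`, oldest first.

    `parents` maps each commit to its single parent (None for the root commit).
    """
    missing = []
    current: Optional[str] = want
    while current is not None and current != have:
        missing.append(current)
        current = parents.get(current)
    if current is None:
        raise ValueError(f"{have} is not an ancestor of {want}")
    return list(reversed(missing))


history = {"aaaa": None, "bbbb": "aaaa", "cccc": "bbbb", "dddd": "cccc"}
print(commits_between(history, have="aaaa", want="dddd"))
# prints ['bbbb', 'cccc', 'dddd']
```

The sender would then pack the objects belonging to just these commits instead of shipping the whole repository again.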

HyperGit is built on top of HyperDB (an append-only database) and Dat. Existing database entries are immutable. HyperGit stores Git objects directly in the database unpacked. Although the Git objects are “packed” into a single database file, it can’t take advantage of compression. The database itself is also immutable. As with the IPFS implementations, you end up with the same file size bloat, affecting both transfer sizes and storage requirements.

Data loss

P2P can be great for distributing redundant copies of your repositories. Anyone interested in it will, at least temporarily, also participate in hosting and distributing it. Every project collaborator will, at least intermittently, participate in its distribution and has a complete backup copy of the project in its entirety.

However, the IPFS options come with a huge caveat. An IPFS node will cache and distribute all data that passes through it. This can consume a lot of local storage capacity. IPFS nodes use a garbage collector to clean out ephemeral cached data from the local IPFS repository to free up disk space as needed. The garbage collector will delete every object from the repository that hasn’t been “pinned”.

Both the IPLD and IGIS implementations pin Git objects when you initialize or push to an IPFS–Git repository. However, neither pins the root directory of the repository! The root directory is the collection of files that together make up the repository. You don’t lose your data per se, as the Git objects are safe. However, you do lose the primary object that holds it all together. This is also the hash you’d share directly with other contributors for them to pull complete copies of your repository. You might be able to retrieve a copy of the repository root directory from another contributor who hasn’t run the garbage collector.
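The failure mode can be sketched with a toy node that only understands direct pins. The ToyNode class and the block names are made up for illustration; a real IPFS node also has other kinds of pins and a far more involved garbage collector.

```python
class ToyNode:
    """Toy block store with direct pins and a naive garbage collector."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}
        self.pins: set[str] = set()

    def add(self, address: str, data: bytes, pin: bool = False) -> None:
        self.blocks[address] = data
        if pin:
            self.pins.add(address)

    def collect_garbage(self) -> list[str]:
        """Delete every block that isn't pinned and report what was removed."""
        removed = [address for address in self.blocks if address not in self.pins]
        for address in removed:
            del self.blocks[address]
        return removed


node = ToyNode()
for name in ("commit-1", "tree-1", "blob-1"):
    node.add(name, b"git object", pin=True)  # the Git objects get pinned
node.add("repo-root", b"directory listing")   # the root directory does not

removed = node.collect_garbage()
print(removed)
# prints ['repo-root']
```

The objects survive, but the one address that ties them together into a repository is gone.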

File size limit

WebTorrent, which GitTorrent is based upon, stores files in memory. Individual files you download through GitTorrent can’t exceed your available memory capacity.

That might sound bad, but InterPlanetary Linked Data (IPLD), a mapping layer between Git’s object hashes and the corresponding IPFS objects, limits individual files to just under 2 MB. This should be fine for smaller projects as long as you don’t refactor the entire project in one go or add large art or other binary assets to the repository.

Neither HyperGit nor IGIS has any inherent file size limits. HyperGit can randomly produce error messages saying something about 8 MB being the maximum. This is a temporary problem in the underlying Dat implementation and is only tangentially related to your Git repository. Git itself can become slow when dealing with large files, however.

Hash algorithm

The BitTorrent protocol uses SHA-1 to identify and locate file transfers. SHA-1 has been deprecated for years, however. It’s considered a weak hash algorithm at best. It has even been demonstrated that it’s possible to produce a controlled hash collision: an identical hash from different input data.

BitTorrent protocol version 2 migrates the protocol to SHA-256. SHA-256 produces a 256-bit digest instead of SHA-1’s 160 bits, which makes hash collisions vastly less likely. Version 2 has been on the books for years, but there haven’t been many implementations. GitTorrent, being based on the WebTorrent project, uses a WebRTC-variant of protocol version 1.
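To put the two algorithms side by side, this short sketch prints their digest sizes along with the generic birthday bound, which is roughly the number of hashes you’d need to compute before a collision becomes likely for an ideal hash function of that size. The digest_bits helper is my own, and published attacks on SHA-1 need far less work than its generic bound suggests.

```python
import hashlib


def digest_bits(algorithm: str) -> int:
    """Return the digest size of a hashlib algorithm in bits."""
    return hashlib.new(algorithm).digest_size * 8


for name in ("sha1", "sha256"):
    bits = digest_bits(name)
    print(f"{name}: {bits}-bit digest, generic birthday bound of about 2^{bits // 2} hashes")
```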

Git also uses SHA-1 internally to reference commits. However, it’s also possible to sign commits with GPG to shore up security.

IPFS’ sister-project, InterPlanetary Linked Data (IPLD), is a mapping layer between Git’s object hashes and the corresponding IPFS object (SHA-256) for the same data. To replace a Git commit, you’d need to create a collision for both the SHA-1 Git object and the corresponding IPFS object. The InterPlanetary Git Service (IGIS) implementation doesn’t bother with the IPLD translation layer and relies on SHA-256 exclusively.
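The double-hash idea can be sketched by computing both digests over the same Git object. The SHA-1 part follows Git’s own object format (a header with the type and size, a null byte, then the content). How IPLD actually derives and encodes its identifiers isn’t shown here; hashing the same bytes with SHA-256 is a simplifying assumption on my part, and the function names are my own.

```python
import hashlib


def git_object_bytes(kind: str, content: bytes) -> bytes:
    """Build the bytes Git hashes for an object: '<type> <size>\\0<content>'."""
    return f"{kind} {len(content)}".encode() + b"\0" + content


def dual_hashes(kind: str, content: bytes) -> tuple[str, str]:
    """Return (Git's SHA-1 object ID, a SHA-256 digest of the same bytes)."""
    raw = git_object_bytes(kind, content)
    return hashlib.sha1(raw).hexdigest(), hashlib.sha256(raw).hexdigest()


git_id, sha256_digest = dual_hashes("blob", b"hello world\n")

# A toy mapping layer from Git's identifier to the second identifier.
mapping = {git_id: sha256_digest}
print(git_id)
# prints 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

An attacker would need replacement data that matches both entries in the mapping, which is a much taller order than colliding SHA-1 alone.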

Conclusions

Peer-to-peer Git hosting may sound appealing to some. At least, it sounded very appealing to me! However, many of the current implementations sound less appealing after digging deeper into the subject.

I don’t think anyone should use any of the P2P options unless they’re committed to also working to improve the tools. It’s too complicated to get started, and the tools are too hard to understand to deploy them with confidence. You don’t want your project’s distribution method to be so complicated that it becomes an unreasonable burden to its adoption.

GitTorrent seems to have made the best implementation of a P2P overlay for Git. Unlike the other options, it doesn’t have a huge storage and transfer-size overhead from relying on unpacked Git repositories. However, it can’t be used out-of-the-box and it would require some work to resolve security issues and update its dependencies.

Bonus: Decentralized options

So maybe the distributed options for Git aren’t quite there yet. However, you can also consider using a decentralized option instead. Instead of relying on a network of peers, decentralized options rely on one or more servers.

If you just don’t want to host your next project on GitHub, host it on your web server. It’s quick and easy to do and it helps increase the diversity in the Git hosting ecosystem.

There are more decentralized options available than there are P2P options. A decentralized option is easier to implement because one or more servers take care of a lot of the complexity that a P2P implementation would introduce. The two most notable options are Git itself and Secure Scuttlebutt (SSB).

Comparison table of decentralized Git clients
Client | git (+ any web server) | git-ssb
Protocol | Git, HTTPS | HTTPS | Secure Scuttlebutt
git-remote | git:, git+https: | https: (“dumb”) | git-ssb:
Status | Active (2020)
Packing strategy | On-demand packing | Packed or Unpacked | Unpacked
File size limit | Unlimited | 5 MB (soft)