Smarter reusable IPFS chunks from HTML documents

The InterPlanetary File System (IPFS) is a BitTorrent-like protocol for the peer-to-peer distributed web. IPFS splits large files into small chunks. IPFS clients can fetch a chunk from anyone on the IPFS network who has the exact same chunk.

Older peer-to-peer implementations like BitTorrent arrange their peer swarms around file sets (“torrents”), and peers only exchange chunks within a swarm. This means that peers in different torrents don’t exchange data with each other, even if the chunks are identical. Chunks on IPFS are globally deduplicated and exchanged.

I discussed how IPFS stores files in more detail in my article on how IPFS hosting services overcharge for duplicated chunks.

IPFS currently uses a size-based chunker. Files are split into chunks of 256 kilobytes. You generally don’t want your file chunks to be too small for performance reasons. It takes more time and overhead to download more chunks.

When you change one byte of data in place, it changes only the file chunk that contains it. However, adding a single byte at the beginning of the file changes every chunk, since the offset from the beginning of the file changes for everything that follows.
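The effect is easy to reproduce outside IPFS. The following is a minimal Python sketch, not IPFS’s actual chunker: it assumes plain SHA-256 digests as stand-in chunk identifiers (real IPFS identifiers are CIDs), and all helper names are hypothetical.

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # the 256-kilobyte chunk size described above


def fixed_size_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into consecutive chunks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def chunk_ids(data: bytes, size: int = CHUNK_SIZE) -> set[str]:
    """SHA-256 digests as stand-in identifiers (an assumption, not CIDs)."""
    return {hashlib.sha256(c).hexdigest() for c in fixed_size_chunks(data, size)}


def pseudo_random_bytes(length: int) -> bytes:
    """Deterministic filler data so the example is repeatable."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


original = pseudo_random_bytes(4 * CHUNK_SIZE)  # exactly four chunks

# Overwrite one byte in place: only the chunk containing it changes.
edited = bytearray(original)
edited[CHUNK_SIZE + 10] ^= 0x01
in_place = bytes(edited)

# Insert one byte at the start: every later byte shifts by one position.
prepended = b"\x00" + original

print(len(chunk_ids(original) & chunk_ids(in_place)))   # 3 chunks still shared
print(len(chunk_ids(original) & chunk_ids(prepended)))  # 0 chunks still shared
```

The in-place edit leaves three of the four chunks reusable, while the one-byte insertion invalidates all of them.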

IPFS is planning to migrate to a smarter rolling-hash-based chunker in a future version. However, I’d like to discuss the possibility of using even smarter chunking logic for HTML documents.
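To illustrate the general idea behind rolling-hash chunking, here is a rough Python sketch. It is not the algorithm IPFS uses or plans to use; it assumes a simple “gear”-style hash with a randomly generated table, omits the minimum and maximum chunk sizes a real chunker would enforce, and uses hypothetical names throughout.

```python
import hashlib
import random

# One pseudo-random 64-bit value per possible byte value (an assumed table).
_rng = random.Random(2024)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK_64 = (1 << 64) - 1


def rolling_chunks(data: bytes, bits: int = 10) -> list[bytes]:
    """Cut after every position where the low `bits` bits of the hash are zero.

    Those low bits depend only on the last `bits` bytes read, so the cut
    positions follow the content rather than the offset into the file.
    The average chunk size is roughly 2**bits bytes.
    """
    boundary_mask = (1 << bits) - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & MASK_64
        if h & boundary_mask == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # whatever remains after the last cut
    return chunks


def ids(chunks: list[bytes]) -> set[str]:
    """SHA-256 digests as stand-in chunk identifiers."""
    return {hashlib.sha256(c).hexdigest() for c in chunks}


data = bytes(_rng.getrandbits(8) for _ in range(64 * 1024))
before = ids(rolling_chunks(data))
after = ids(rolling_chunks(b"\x00" + data))
print(f"{len(before & after)} of {len(before)} chunks survive a one-byte prepend")
```

Because the cut points are tied to the content, inserting a byte at the start only disturbs the chunks near the insertion; the boundaries further into the file land in the same places as before.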

A webpage consists of an HTML document file plus assets such as styles, scripts, and images. Loading these external assets over the traditional HTTP/1.1 protocol introduces lookup and transfer latency overhead that reduces page-load performance.

Fetching files and chunks from IPFS has a higher performance cost than HTTP/1.1. The majority of the cost comes from the additional overhead involved in locating peers who have the exact chunk you want.

HTTP supports asset caching, which can remove the need to repeatedly fetch the same asset from the same server. As mentioned briefly above, IPFS chunks are globally deduplicated. Clients can cache chunks and reuse them in much the same way HTTP caching works. The more peers that share a specific chunk, the higher its availability in the network.
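As a rough illustration of why content-addressed chunks cache so naturally, here is a small Python sketch. The `ChunkCache` class is hypothetical, a plain dictionary stands in for the network, and SHA-256 digests stand in for real IPFS chunk identifiers.

```python
import hashlib


class ChunkCache:
    """Hypothetical client-side cache keyed by each chunk's content hash."""

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}
        self.network_fetches = 0

    def get(self, chunk_id: str, fetch) -> bytes:
        """Return a cached chunk, calling `fetch` only on a cache miss."""
        if chunk_id not in self._store:
            data = fetch(chunk_id)
            # The identifier is the hash, so data from any peer can be verified.
            if hashlib.sha256(data).hexdigest() != chunk_id:
                raise ValueError("chunk does not match its identifier")
            self._store[chunk_id] = data
            self.network_fetches += 1
        return self._store[chunk_id]


css = b"body {color:inherit}"
page_a = [b"<page A top>", css, b"<page A rest>"]
page_b = [b"<page B top>", css, b"<page B rest>"]

# A dict stands in for "the network": chunk identifier -> chunk bytes.
network = {hashlib.sha256(c).hexdigest(): c for c in page_a + page_b}

cache = ChunkCache()
for page in (page_a, page_b):
    for chunk in page:
        cache.get(hashlib.sha256(chunk).hexdigest(), network.__getitem__)

print(cache.network_fetches)  # 5: the shared stylesheet chunk is fetched once
```

The two pages request six chunks in total, but the chunk they have in common is only fetched the first time.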

Over time, web developers developed best practices for improving page load performance. One such pattern involves inlining critical resources, such as render-blocking stylesheets and scripts, into the HTML document itself. This made the HTML documents bigger, and subsequent page visits didn’t benefit from these resources already being cached in the client.

A quick aside: You should not inline resources over HTTP/2! Use Server-Initiated Push instead.

The average HTML document is unlikely to exceed 256 kilobytes. It will comfortably fit in one IPFS chunk. However, you’ll still need to fetch extra chunks to load the external resources. This is where we can do something clever with the resource-inlining method that was popular with HTTP/1.1.

You can split file chunks at arbitrary points in a file. You can split your HTML documents immediately after the opening <style> tag and again immediately before the closing </style> tag. In the simplified example below, you’d end up with three chunks: the top part of the HTML document, a reusable stylesheet chunk, and then the remainder of the document.

<!doctype html>
<html>
 <head>
  <title>Example Doc</title>
  <style>
   body {color:inherit}
   h1 {text-align:left}
  </style>
 </head>
 <body>
  <h1>Example Doc</h1>
 </body>
</html>

An example HTML document. The ideal chunk boundaries would be placed right after the opening tag and right before the closing tag of the style element.

From IPFS’ perspective, the stylesheet chunk from inside the HTML page would be indistinguishable from the same stylesheet saved in a CSS file, provided the two are byte-for-byte identical. It’s the same chunk whether you load it into the document as an external file or embed it into the document itself. The chunk can still be cached in the client and reused in all your HTML documents.
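The split described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a working IPFS tool: the `split_around_style` helper is hypothetical, it only handles a single attribute-less <style> element, and a SHA-256 digest stands in for the real chunk identifier.

```python
import hashlib

OPEN_TAG, CLOSE_TAG = b"<style>", b"</style>"


def split_around_style(html: bytes) -> list[bytes]:
    """Split into [top incl. <style>, stylesheet, rest starting at </style>]."""
    start = html.find(OPEN_TAG)
    if start == -1:
        return [html]  # no style element: keep the document in one piece
    start += len(OPEN_TAG)
    end = html.find(CLOSE_TAG, start)
    if end == -1:
        return [html]
    return [html[:start], html[start:end], html[end:]]


# The stylesheet exactly as it would be saved in a standalone CSS file.
stylesheet = b"\n   body {color:inherit}\n   h1 {text-align:left}\n  "
document = (
    b"<!doctype html>\n<html>\n <head>\n  <title>Example Doc</title>\n  <style>"
    + stylesheet
    + b"</style>\n </head>\n <body>\n  <h1>Example Doc</h1>\n </body>\n</html>\n"
)

top, css, rest = split_around_style(document)

# Same bytes give the same hash, and therefore the same chunk.
print(hashlib.sha256(css).hexdigest() == hashlib.sha256(stylesheet).hexdigest())
```

Note that the match depends on every byte, including the whitespace between the tags and the style rules. The standalone CSS file would need the same leading and trailing whitespace as the embedded copy.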

Crucially, your IPFS client will discover the chunk much earlier when it’s part of the document than if it first needs to fetch the HTML document and then discover the external dependencies.

You can make changes to the stylesheet without changing the two other chunks in the document. (Reminder: IPFS chunks are immutable.) Clients that have cached and are sharing an older version of the document can still contribute usable chunks to readers of the updated version.

This method also works for other kinds of embedded data in your HTML documents that can be reused as chunks. Embedded template, script, style, and svg elements are good candidates.

You can also chunk your documents around content you expect to update often, e.g. splitting out a Related or Recent articles widget on a blog in a chunk of its own.

You don’t want to overdo the chunking, however. Too many chunks, or chunks that are too small, will badly hurt performance. You can still make good use of smaller chunks when it’s done sparingly and at the right places in the file.
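One way to keep the chunking in check is to only cut around embedded blocks above a certain size. The Python sketch below assumes a hypothetical `min_size` threshold of 1024 bytes (an arbitrary number for illustration, not a recommendation) and uses a regular expression instead of a real HTML parser, so it will be fooled by edge cases such as a closing tag inside a script string. The svg element is left out to keep the sketch short.

```python
import re

# Elements whose inner content is a candidate for a reusable chunk.
OPEN_TAG = re.compile(rb"<(style|script|template)\b[^>]*>", re.IGNORECASE)


def cut_points(html: bytes, min_size: int = 1024) -> list[int]:
    """Return byte offsets to cut at, skipping embedded blocks below min_size."""
    points: list[int] = []
    pos = 0
    while (opening := OPEN_TAG.search(html, pos)) is not None:
        name = re.escape(opening.group(1))
        closing = re.compile(rb"</" + name + rb"\s*>", re.IGNORECASE).search(
            html, opening.end()
        )
        if closing is None:
            break  # unterminated element: stop looking for cut points
        if closing.start() - opening.end() >= min_size:
            points += [opening.end(), closing.start()]
        pos = closing.end()
    return points


def split_at(html: bytes, points: list[int]) -> list[bytes]:
    """Cut `html` at each offset in `points` (assumed sorted and in range)."""
    edges = [0, *points, len(html)]
    return [html[a:b] for a, b in zip(edges, edges[1:]) if a < b]


page = (
    b"<html><head><style>" + b"a{}" * 500 + b"</style>"
    b"<script>var x=1</script></head><body></body></html>"
)
parts = split_at(page, cut_points(page))
print([len(p) for p in parts])
```

In this example the 1,500-byte stylesheet gets a chunk of its own, while the tiny script stays in the same chunk as the markup around it.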

Inlining external dependencies also nicely works around IPFS chunk addressing limitations in the current web browser implementations of IPFS. You’ll often come across webpages without working styles, images, and scripts when you browse the IPFS distributed web. I’ll not go into details here, but let me just say that it was a bad idea to try to replace URLs, a cornerstone component of the web platform, with IPFS NURLs.

This is all just an intriguing concept at this stage. You can achieve custom chunking through object manipulation in the go-ipfs reference client. It’s not easy, but it’s doable. Better tooling would be required for this and other content-boundary-aware chunking strategies to see any adoption.