Web Reading Mode: The non-standard rendering mode

All the leading web browsers, except for Google Chrome, include a separate “reading mode” that extracts the main content from pages, reformats it to be more readable, and hides distractions like advertisements, comments, and even page navigation. This separate rendering mode isn’t governed by any standards, and as such it behaves differently from web browser to web browser. So, what is a web developer to do to properly support this distinctly separate and non-standard rendering mode?

This article is part one in a series on web reading mode and reading mode parsers. The article is broken up into multiple parts as each part is written for a slightly different audience.

The non-standard rendering mode
The history of reading mode, a look at the different parsers we have today and how they came to be, and a small criticism of the Apache 2.0 license.
Determining the main page content
There are many approaches to content analysis and extraction, and most only work well with English content. Why is reader mode so slow to activate, anyway?
Title, author, and date metadata extraction
Visual page inspections, standard metadata, or guesswork? Everyone has their own ideas about how to best determine the metadata describing an article.
Inconsistent and bad reading experience
Encourage publishers to fix their designs, and standardize reading mode now.

What is reading mode?

Reading mode is an alternative web rendering mode where the web browser tries to strip out repeated and irrelevant content, such as page navigation, ads, and distractions. The main article is extracted, brought to the front, and displayed in a clean and consistent page design.

Reading mode is increasingly useful as more and more websites seem to have abandoned any ideas of design and readability, and instead focus on littering their pages with as much noise as they can fit onto a single page.

You can think of reading mode as the ideal printed copy of an article you’re reading online. However, as with printing webpages, reading mode can produce odd results and leave you with a useless page.

I’ll look into how reading mode parsers extract the main content and metadata from webpages in parts two and three of this series. Before that, however, I’d like to start by establishing a sort of history of reading mode parsers and then quickly talk about parser diversity and the lack of standards for reading modes.

Readability.js and the birth of a new rendering mode

Instapaper was first on the scene in 2008 with its ability to extract text from webpages, remove page navigation, ads, and distractions, and even make regular webpages readable on the iPhone, back before the “mobile-first” mentality had set in among web developers. Instapaper also let users bookmark webpages and save them for later. However, Instapaper’s proprietary license makes it little more than a footnote in the history of web reading modes.

Arc90 was inspired by Instapaper, and launched the Readability bookmarklet (an early form of a web browser extension) a year later. Readability.js was a parser that could extract the main text of a webpage and reformat it with large and readable text, in an era where most websites still relied on tiny text to better serve visitors with smaller, older displays. More importantly, Readability.js was released under the Apache 2.0 software license. In brief, this is a permissive license that allows anyone to take the source and build upon it, even for commercial purposes, without sharing the sources of the changes they make to the software.

A few months later, Arc90 launched Readability.com as a reading list and reading mode web service. The service was discontinued in 2016, but I’ll cover reading as a service in more detail in part four.

Readability.com made improvements to Readability.js which gave web publishers more control over the appearance of their content in reading mode, but they didn’t share these changes back to the open-source project and discontinued Readability.js entirely in 2010.

Readability.js makes the leap to web browsers

Apple picked up Readability.js in 2010 (probably contributing to Arc90’s decision to stop maintaining it), bundled it, and gave it its own Reader button in their Safari 5 web browser for MacOS and Windows. Safari Reader made the leap to iOS the following year. Apple has a reputation for clean and uniform designs, and it’s no wonder they preferred the neatness of Safari Reader to the busy and cluttered designs many publishers’ websites come with.

Over the years, Apple has made changes and improvements to Readability.js in their own proprietary fork known as Apple Readability. Apple’s changes are quite extensive, and Apple Readability is probably the most capable reading mode parser available today.

In 2014, Microsoft introduced Reading View in Internet Explorer 11. This reading mode uses Microsoft’s own proprietary parser, loosely inspired by Readability.js. Mozilla didn’t want to be left behind and introduced their own Reader View in Firefox 38 the following year. Mozilla forked the abandoned 2010 version of Readability.js and re-released it as Mozilla Readability. Mozilla Readability has received some updates but no major overhauls. Mozilla Readability is also licensed under the Apache 2.0 license.

Mozilla Readability’s permissive license made it the de facto reading mode parser, found in everything from the Samsung Browser for Android to niche web browsers like GNOME Web, Maxthon, Vivaldi, and Yandex Browser.

Maxthon has also forked Mozilla Readability as their own proprietary Maxthon Reader, with optimizations for parsing Chinese, Japanese, and Korean (CJK) content and some Apple Readability-inspired parsing rules.

Reader mode parser diversity

I’ve been talking a lot about Readability.js so far. There are quite a few other reading mode parsers, however. Here is a quick overview of web browsers and services and which parser they’re known to use:

Vendor	Product	Parser	Environments
Mozilla	Firefox	Mozilla Readability	Desktop and Android
GNOME	Web	Mozilla Readability	Desktop
Vivaldi	Vivaldi	Mozilla Readability	
Yandex	Browser	Mozilla Readability	
Samsung	Browser	Mozilla Readability	Android
Apple	Safari	Safari Reader	MacOS and iOS
Maxthon	Maxthon	Maxthon Reader	Desktop
Microsoft	Edge	EdgeHTML	Windows and Windows Mobile
Microsoft	Edge Mobile	Chrome DOM Distiller	Android
Google	Chrome	Chrome DOM Distiller	Android
Postlight	Mercury Reader	Web Reader	Web / browser extension
Instant Paper	Instapaper	Instaparser	
Mozilla	Pocket	Unknown	

One notable absentee in the above table is the market leader on the desktop: Google Chrome. Google is primarily an advertising company, and they’re not too keen on any technology that aims to hide advertisements and distractions. Chrome for Android does have an optional reading mode, hidden away in the accessibility menu.

No standards, little documentation

One of the big problems with reading mode is that there are no standards for how it’s supposed to work, and I can find little evidence of cooperation between the different reading mode parser vendors. More worryingly, only Microsoft has ever published any documentation about how their reading mode works and how web developers can target it; yet theirs is the parser with the least information available.

Ideally, web developers should be able to deliver a consistent and good user experience for their users in reading mode. However, web developers who want to support reading mode face the tedious job of testing and retesting across all the different reading modes, all of which behave differently and have different and even conflicting ideas about how documents should be structured.

The majority of web browsers don’t set a custom URI for content displayed in reading mode, but all the market-leading browsers do. Yet they’ve been unable to agree on what URI to use. An agreement on what URI to use would mean websites, or even external programs, could link directly to a page in reading mode. This is indeed possible in some browsers, but only Safari and Microsoft Edge have registered their reading mode URIs in the operating system and enabled other programs to open pages directly in their reading modes.

Reading mode is unreliable for users. Even when it has worked on a given website before, they never know whether the button will appear in their browser or whether the whole article will be included once they click it. Clicking on the reading mode button is always a bit of a gamble.

There was almost a standard

After Readability.com went proprietary, they settled on the hNews microformat as the preferred way of parsing pages while keeping what is now the Mozilla Readability implementation as a fallback. hNews is a structured metadata format that adds semantics for identifying titles, publication time, authors, and even the summary and the main content of an article. In other words, there was a defined standard for parsing documents and retrieving both the metadata and the main content.

The hNews microformat did gain some traction, but has since been superseded by the h-entry microformat. h-entry markup is everywhere on the web thanks to its inclusion in the default themes produced by leading content management systems, and it is recognized as a semantic data structure by leading search engines like Bing, Google, and Yandex.
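
As a rough illustration of what parsing by microformat could look like, the sketch below pulls the title, author, publication time, summary, and main content out of an h-entry using only Python’s standard library. The class names (p-name, p-author, dt-published, p-summary, e-content) come from the h-entry vocabulary. The HEntrySketch class, the extract_h_entry function, and the sample markup are made up for this example, and the code assumes well-formed markup where every non-void element is explicitly closed. It is not how Readability.com or any browser’s parser actually worked.

```python
from html.parser import HTMLParser

# Property classes from the h-entry vocabulary; only this subset is handled.
PROPERTIES = {"p-name", "p-author", "dt-published", "p-summary", "e-content"}
# Elements that never have an end tag and must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HEntrySketch(HTMLParser):
    """Hypothetical helper: collects the text of h-entry properties."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.result = {}
        self._stack = []          # per open element: the properties it carries
        self._entry_depth = None  # stack depth at which the h-entry was opened

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        props = set()
        if self._entry_depth is not None:
            props = classes & PROPERTIES
            for prop in props:
                self.result.setdefault(prop, "")
            # Prefer the machine-readable value of <time datetime="...">.
            if "dt-published" in props and attrs.get("datetime"):
                self.result["dt-published"] = attrs["datetime"]
                props.discard("dt-published")
        elif "h-entry" in classes:
            self._entry_depth = len(self._stack)
        self._stack.append(props)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        self._stack.pop()
        # Leaving the h-entry root: stop collecting properties.
        if self._entry_depth is not None and len(self._stack) == self._entry_depth:
            self._entry_depth = None

    def handle_data(self, data):
        # Text belongs to every open element that carries a property.
        for props in self._stack:
            for prop in props:
                self.result[prop] += data


def extract_h_entry(html):
    parser = HEntrySketch()
    parser.feed(html)
    parser.close()
    # Collapse whitespace left over from the markup's indentation.
    return {prop: " ".join(text.split()) for prop, text in parser.result.items()}


# Made-up sample document; the names and dates are placeholders.
SAMPLE = """
<article class="h-entry">
  <h1 class="p-name">Web Reading Mode</h1>
  <p>By <span class="p-author">Jane Doe</span>,
     <time class="dt-published" datetime="2018-03-01">1 March 2018</time></p>
  <p class="p-summary">A short summary.</p>
  <div class="e-content">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</article>
<aside><p class="p-name">Ignored: outside the h-entry</p></aside>
"""

if __name__ == "__main__":
    for prop, value in extract_h_entry(SAMPLE).items():
        print(f"{prop}: {value}")
```

The point of the sketch is that a page marked up this way needs no guesswork: the publisher has already labeled the title, metadata, and main content, so a parser only has to read the labels.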

There was aaalmost a standard in place. The only trouble was that Readability.com had stopped contributing to their open-source Readability.js repository, so these changes never made it into Apple’s or Mozilla’s forks of Readability. The web continued with the legacy Readability implementation, which isn’t oriented around microformats. When you hear someone argue against the use of the Apache 2.0 software license for open-source projects, things like this are what they’re referring to.