Web Reading Mode: Determining the main page content

Humans know full well what the main content of any given page is. Regardless of how many advertisements, social widgets, cookie overlays, or other distractions there are, humans can always spot the actual content. Reading mode, maybe not so much.

This article is part two in a series on web reading mode and reading mode parsers.

The non-standard rendering mode
The history of reading mode, a look at the different parsers we have today and how they came to be, and a small criticism of the Apache 2.0 license.
Determining the main page content
There are many approaches to content analysis and extraction, and most only work well with English content. Why is reader mode so slow to activate, anyway?
Title, author, and date metadata extraction
Visual page inspections, standard metadata, or guesswork? Everyone has their own ideas about how to best determine the metadata describing an article.
Inconsistent and bad reading experience
Encourage publishers to fix their designs, and standardize reading mode now.

Content analysis basics

It hasn’t always been the case, but web developers today can expect that their websites will look and behave mostly the same across the different standards-compliant web browsers. There’s good and plentiful documentation for the standard web; however, this isn’t at all the case with reading mode. This is where I wish I could write up a neat summary of how the different reading mode parsers (see table in part one) detect the main content of the page and provide some useful guidelines on how you can ensure your pages are compliant. Unfortunately, there’s no neat summary or guidelines to work from besides the following algorithm:

All the different reading mode parsers try to identify an HTML element that contains a significant portion of text at nearly the same depth in the Document Object Model (DOM). Each node in the DOM is scored and evaluated against a varied set of criteria. Generally, nodes with little textual content or that consist mostly of links may be ignored entirely (including illustrations and tables). The exact criteria vary greatly from parser to parser.
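The "consists mostly of links" criterion is often expressed as a link density: the share of a node's text that sits inside links. Here's a minimal Python sketch of that idea using only the standard library. The names LinkDensityParser and link_density are my own for illustration, and no parser necessarily computes it exactly this way.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts text characters inside and outside <a> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.link_depth = 0   # greater than 0 while inside an <a> element
        self.total_chars = 0  # all text characters seen
        self.link_chars = 0   # text characters seen inside links

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        length = len(data.strip())
        self.total_chars += length
        if self.link_depth > 0:
            self.link_chars += length


def link_density(html_fragment: str) -> float:
    """Return the share of text characters that sit inside links (0.0 to 1.0)."""
    parser = LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


navigation = '<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>'
paragraph = '<p>Reading mode strips away distractions. See <a href="/part-1">part one</a>.</p>'

print(link_density(navigation))  # 1.0, every character is link text
print(link_density(paragraph))   # a small fraction, mostly plain text
```

A parser could then skip any node whose density exceeds some cutoff. Where that cutoff sits is exactly the kind of detail that differs between implementations.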

The exact text selection algorithms are complex and vary from implementation to implementation. For example, Mozilla Readability evaluates and scores candidate DOM nodes based on the total number of characters in each node and how many commas (“,”) there are in the complete text. Mozilla Readability makes some assumptions here about the average length of a paragraph of text and the use of commas that don’t hold true in every text or in every language. I’ll get back to this in the section on internationalization issues below. Maxthon Reader and Safari Reader work the same way, but adjust the scores significantly based on whether the content is visible on screen at the moment the reading mode button is pressed.
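To make the scoring idea concrete, here's a minimal Python sketch of a character-and-comma heuristic. It's loosely modeled on the description above rather than copied from Mozilla Readability. The function name score_paragraph and the constants MIN_LENGTH, BASE_SCORE, and MAX_LENGTH_BONUS are assumptions I picked for illustration.

```python
# Illustrative constants (assumptions, not values taken from any parser).
MIN_LENGTH = 25        # paragraphs shorter than this are ignored
BASE_SCORE = 1         # every paragraph that qualifies starts here
MAX_LENGTH_BONUS = 3   # cap on the bonus awarded for sheer length


def score_paragraph(text: str) -> int:
    """Score a paragraph: longer text with more commas scores higher."""
    text = text.strip()
    if len(text) < MIN_LENGTH:
        return 0
    score = BASE_SCORE
    score += text.count(",")                           # one point per ASCII comma
    score += min(len(text) // 100, MAX_LENGTH_BONUS)   # one point per 100 characters, capped
    return score


print(score_paragraph("Short."))  # 0
print(score_paragraph("Humans, unlike parsers, can spot the main content at a glance."))  # 3
```

Note how the only punctuation that earns points is the ASCII comma, and how the length bonus assumes a particular character density per paragraph.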

So, what is the main content?

The semantic, HTML5-standard way to determine the main content of a page would be to look for the <main> or <article> elements. Among reading mode parsers, however, there’s little consensus: you either need a significant amount of content near the same DOM level, or some luck.
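For comparison, the semantic lookup really is that simple. This is a minimal Python sketch that collects the text inside <main> or <article> and ignores everything else. The names SemanticContentFinder and semantic_main_text are hypothetical, and as far as I can tell no reading mode parser relies on this alone.

```python
from html.parser import HTMLParser


class SemanticContentFinder(HTMLParser):
    """Collects the text found inside <main> or <article> elements."""

    CONTAINERS = {"main", "article"}

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0    # greater than 0 while inside a container element
        self.chunks = []  # text pieces found inside the containers

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTAINERS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.CONTAINERS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def semantic_main_text(html: str) -> str:
    """Return the text inside <main>/<article>, or an empty string if none exists."""
    finder = SemanticContentFinder()
    finder.feed(html)
    finder.close()
    return " ".join(finder.chunks)


page = (
    '<body><nav><a href="/">Home</a></nav>'
    "<main><article><h1>Title</h1><p>Body text.</p></article></main>"
    "<footer>Contact</footer></body>"
)
print(semantic_main_text(page))  # Title Body text.
```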

Pocket gets confused by articles with <section>s in them, and will often pick the largest section of an article and display that as the entire article, ignoring every other section of the article. Early versions of Microsoft Edge also did this, but this has improved in later versions.

Safari Reader considers any <footer>s inside the main content (such as the sources list at the bottom of this page) to be part of the main content, while Readability explicitly hides footers. Apple Readability also does a good job of hiding inline advertisements and “suggested reading” call-to-action link paragraphs inside articles. However, it sometimes hides short paragraphs that are part of the main content. Web developers can direct this somewhat by applying the instapaper_hide class name to elements they wish to hide. Instaparser and Apple Readability are the only two implementations that support this explicit class name.

Some reading mode parsers work with some video embeds. Some only allow video embeds from youtube.com, and others may only allow videos embedded from youtube-nocookie.com and dailymotion.com. Others still only allow the native audio and video elements, while yet others allow no embedded content at all. Many of the largest video hosting websites on the web aren’t allowed anywhere.
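An embed allowlist like this usually boils down to a hostname check. Here's a minimal Python sketch, assuming a hypothetical allowlist built from the hosts mentioned above. Each real parser ships its own list, and the name is_allowed_embed is mine.

```python
from urllib.parse import urlparse

# Hypothetical allowlist for illustration; real parsers each use a different one.
ALLOWED_EMBED_HOSTS = {"youtube.com", "youtube-nocookie.com", "dailymotion.com"}


def is_allowed_embed(src: str) -> bool:
    """Return True if the embed URL points at an allowlisted host."""
    host = urlparse(src).hostname or ""  # hostname is None for non-URLs
    if host.startswith("www."):
        host = host[len("www."):]
    return host in ALLOWED_EMBED_HOSTS


print(is_allowed_embed("https://www.youtube-nocookie.com/embed/abc"))  # True
print(is_allowed_embed("https://player.vimeo.com/video/1"))            # False
```

Anything that fails the check gets dropped from the reading mode view, which is how a video that's central to an article can silently disappear.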

Chrome DOM Distiller finds the DOM node it thinks is the most likely to be the article, and copies it over into reading mode as-is, which often results in cluttered pages.

Every browser will analyze page content when the page has finished loading, which is why it can take some seconds before the reading mode button appears in the browser’s main toolbar. However, websites that use infinite scrolling or other dynamic loading behavior may load their content after the main document is loaded. Every reading mode implementation is terrible at determining which article is on screen at the moment, meaning that when you click on the reading mode button you may see an unexpected article instead of the article you were reading.

Content analysis internationalization issues

Reading mode is optimized for use with Latin script and European languages or, if we’re being totally honest, it’s optimized exclusively for the English language. I’m not an internationalization expert, but I’ve identified a couple of serious issues with the popular reading mode parsers.

Mozilla Readability scores text blocks based on the total number of characters in each block and how many commas (“,”) there are in the text in total. However, not every language and writing system uses the same character density in sentences or paragraphs, and not every language and writing system uses the same character for the role that the humble comma serves in most European languages.

Comma internationalization support table
Script	Symbol	Unicode	Readability
Most European	,	U+002C	Supported
Arabic	،	U+060C	Unsupported
Chinese and Japanese	、	U+3001	Unsupported

Unicode includes more than 30 additional symbols not included in the above table that serve the same purpose as commas in stylistic settings and in other writing systems and languages. Additionally, some European languages rely more on en dashes (–) or em dashes (—) where other languages would use a comma.
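The effect is easy to demonstrate. This minimal Python sketch compares counting only the ASCII comma against counting the three characters from the table above. The set COMMA_LIKE is deliberately non-exhaustive, and the function names and sample sentences are my own.

```python
ASCII_COMMA = "\u002C"

# Only the three characters from the table; Unicode has many more comma-like symbols.
COMMA_LIKE = {
    "\u002C",  # comma
    "\u060C",  # Arabic comma
    "\u3001",  # ideographic comma
}


def count_ascii_commas(text: str) -> int:
    """Count only the comma used in most European languages."""
    return text.count(ASCII_COMMA)


def count_comma_like(text: str) -> int:
    """Count any character from the small COMMA_LIKE set."""
    return sum(1 for char in text if char in COMMA_LIKE)


samples = {
    "English": "Advertisements, widgets, and overlays.",
    "Arabic": "الإعلانات، والأدوات، والنوافذ",
    "Japanese": "広告、ウィジェット、オーバーレイ",
}

for language, text in samples.items():
    print(language, count_ascii_commas(text), count_comma_like(text))
# English 2 2
# Arabic 0 2
# Japanese 0 2
```

All three sentences contain two list separators, but a parser that only looks for U+002C awards points to just one of them.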

In Chinese, Japanese, and Korean (CJK) languages, you would normally count each character as a complete word and not just a single character. Mozilla Readability’s algorithm doesn’t address this issue at all. The Apple and Maxthon forks of Readability address these issues by relying more on how much visual space the whole article takes up on screen instead of character counts.
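One way a character-based parser could compensate is to measure length in word-like units instead of raw characters. The following Python sketch is a rough approximation under my own assumptions: it treats each kana or CJK ideograph as one word and counts everything else as runs of word characters. The function name and character ranges are illustrative, and it ignores Korean, which separates words with spaces.

```python
import re

# Hiragana, Katakana, and CJK Unified Ideographs (a rough, non-exhaustive range).
CJK_CHAR = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")
WORD = re.compile(r"\w+")


def estimate_word_count(text: str) -> int:
    """Estimate words: one per CJK character, plus runs of other word characters."""
    cjk_words = len(CJK_CHAR.findall(text))
    remainder = CJK_CHAR.sub(" ", text)  # blank out CJK so it isn't counted twice
    other_words = len(WORD.findall(remainder))
    return cjk_words + other_words


print(estimate_word_count("Reading mode is useful"))  # 4
print(estimate_word_count("広告は邪魔です"))            # 7
```

The English sample is 22 characters long and the Japanese one is 7, yet by this estimate the Japanese sentence carries more words. A raw character count gets the comparison backwards.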

Only Safari Reader and Microsoft Edge reliably support right-to-left (RTL) languages like Arabic, and only these browsers have specifically set fonts for Arabic script. Only Safari Reader and Maxthon Reader have consistent and good support for even triggering reading mode for Chinese, Japanese, and Korean (CJK) languages. Only Vivaldi supports vertical text orientation and layouts as used in CJK languages, but it will often not offer reading mode because of the content selection algorithm’s orientation towards Latin script and the English language.

Mozilla Readability determines the document title based on a selection algorithm described in detail in part three. This algorithm always derives titles left-to-right (LTR) which, of course, doesn’t work too well in RTL languages.

Stale content

Instapaper and Pocket store a copy of each reading-mode parsed page with the service when you or anyone else saves a given webpage. This saves these services huge amounts of potentially duplicated work parsing and storing these articles. These savings also create some issues that are unique to them.

The page that gets saved isn’t necessarily the same page you wanted to save. It could be years out of date and lack refinements like spelling corrections and updates to the original article. Even if you’re the only reader who has saved a specific webpage to the service, that copy could still be out of date by the time you want to read it. Say you saved a breaking news story in the morning and then began reading it the next afternoon. You’d miss out on any developments, updates, and corrections that have been added to the original story.

Neither service offers publishers a method to notify them of updates to articles, and they don’t support HTTP cache revalidation or any of the other standard control mechanisms for content delivery on the web. Readers also don’t have a way to force-update a saved page; they have to check the page they originally saved to see if it has been updated since it was first saved with the service.
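For reference, HTTP cache revalidation is not much work for a client. It sends back the validators it stored with its copy (the ETag and Last-Modified response headers) and the server answers 304 Not Modified if nothing changed. Here's a minimal Python sketch of that logic without any network calls. The function names are hypothetical, and this says nothing about how either service is built.

```python
from http import HTTPStatus
from typing import Dict, Optional


def revalidation_headers(etag: Optional[str] = None,
                         last_modified: Optional[str] = None) -> Dict[str, str]:
    """Build conditional request headers from validators stored with a saved copy."""
    headers: Dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def saved_copy_is_current(status_code: int) -> bool:
    """A 304 Not Modified response means the stored copy can be reused as-is."""
    return status_code == HTTPStatus.NOT_MODIFIED


# Example validator values, made up for illustration.
headers = revalidation_headers(etag='"abc123"',
                               last_modified="Wed, 21 Oct 2015 07:28:00 GMT")
print(headers)
print(saved_copy_is_current(304))  # True, keep the saved copy
print(saved_copy_is_current(200))  # False, the article changed and should be re-parsed
```

A service that did this on a schedule, or whenever a reader opens a saved article, would only have to re-parse pages that actually changed.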

Safari Reader also has a reading list and stores articles for later offline consumption, but it will redownload pages if it has been a while since you saved them, to prevent them from going stale.