Mozilla Readability will now extract document metadata from Dublin Core and Open Graph Protocol meta tags instead of trying to guess article titles.
Earlier, I documented how reader mode in web browsers extracts metadata about articles. After learning about the messy state of metadata extraction for reader mode, I sought to improve the extraction logic used in Mozilla Readability. Mozilla Readability was one of the first reader mode parsers, and it’s used in Firefox as well as other web browsers.
Hundreds of thousands of websites already provide high-quality metadata in the document through Dublin Core and Open Graph Protocol meta tags, and many content management systems include these by default. Many social media platforms, including Twitter and Facebook, consume these meta tags, which has pushed their adoption across the web.
These meta tags are a good source of the type of metadata Readability needs, so I submitted a pull request that made them the primary source of metadata for Readability. What follows is an outline of the changes that have been included in Firefox 63, released earlier.
The old preferred method for extracting document titles tried to derive the title by applying a complicated set of rules to the document <title>. This method could produce unexpected results for short titles and right-to-left titles. The new logic trusts web developers to provide accurate titles through metadata. Here is some sample metadata, shown in order of preference:
The blue highlight indicates which title would be used for each of the examples. Note that the old title guessing is still used as a fallback if the meta tags don’t provide a better source.
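As a rough sketch of that order of preference (this is not Readability’s actual code), the selection could look something like the following. The pickTitle function, the normalized key names "dc:title" and "og:title", and the guessFromTitleElement callback are all assumptions made up for illustration.

```javascript
// Sketch only: not Readability's actual implementation.
// `meta` maps normalized meta tag names to their content values.
// The key names below are hypothetical, chosen for illustration.
function pickTitle(meta, guessFromTitleElement) {
  // Dublin Core first, then Open Graph Protocol.
  const preferred = ["dc:title", "og:title"];
  for (const key of preferred) {
    const value = (meta[key] || "").trim();
    if (value) return value;
  }
  // No usable meta tag: fall back to the old <title> guessing.
  return guessFromTitleElement();
}

// Stand-in for the old rule-based guessing.
const guess = () => "Guessed from <title>";

const fromDublinCore = pickTitle(
  { "dc:title": "DC title", "og:title": "OG title" },
  guess
);
const fromOpenGraph = pickTitle({ "og:title": "OG title" }, guess);
const fromFallback = pickTitle({}, guess);
```

The fallback is passed in as a function so the guessing only runs when no meta tag supplies a title.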
Mozilla Readability used to prefer the meta[name="author"] element for sourcing author names. The Open Graph Protocol doesn’t have a platform-neutral way to indicate an author’s name, so the only change is that the Dublin Core creator is now the preferred data source as shown below.
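A minimal sketch of the author lookup, assuming the meta tags have already been collected into a plain object. The pickAuthor function and the key names "dc:creator" and "author" are hypothetical and only illustrate the order described above.

```javascript
// Sketch only: not Readability's actual implementation.
// Key names are hypothetical normalized meta tag names.
function pickAuthor(meta) {
  // Dublin Core creator first, then the generic author meta tag.
  // Open Graph Protocol is absent: it has no platform-neutral author name.
  const preferred = ["dc:creator", "author"];
  for (const key of preferred) {
    const value = (meta[key] || "").trim();
    if (value) return value;
  }
  return null; // no author metadata found
}

const creatorWins = pickAuthor({ "dc:creator": "Ada Example", author: "Someone Else" });
const authorOnly = pickAuthor({ author: "Someone Else" });
const noAuthor = pickAuthor({ "og:title": "A title, but no author" });
```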
Lastly, there’s the document description, which now prefers Dublin Core and Open Graph metadata over the meta[name="description"] element. Many websites use the same description for every page, rather than a snippet or summary of each page, as publishers may never see these descriptions. However, everyone sees a page’s Open Graph Protocol description when sharing links to e.g. Facebook, so the metadata in these tags tends to be better maintained by web publishers.
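The description follows the same pattern. In this sketch, which again is not Readability’s actual code, the pickDescription function and the key names are assumptions for illustration; the generic description is consulted last because it is so often a site-wide boilerplate string.

```javascript
// Sketch only: not Readability's actual implementation.
// Key names are hypothetical normalized meta tag names.
function pickDescription(meta) {
  // Dublin Core, then Open Graph Protocol, then the generic description.
  const preferred = ["dc:description", "og:description", "description"];
  for (const key of preferred) {
    const value = (meta[key] || "").trim();
    if (value) return value;
  }
  return null; // no description metadata found
}

const pageSpecific = pickDescription({
  "og:description": "A summary of this particular article.",
  description: "The same site-wide blurb on every page.",
});
const siteWideOnly = pickDescription({
  description: "The same site-wide blurb on every page.",
});
```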
As far as I know, no web browser uses the description for anything in reader mode, but it’s supported for other users of Readability. Dublin Core is the only format whose vocabulary can express the author name, the article title, and the description on its own, which is why it’s the preferred format. You can use any combination of metadata formats, however.
Just to be clear: you don’t need to include both Dublin Core and Open Graph Protocol meta tags in a document even though I show both used at the same time in the above examples.
For the sake of completeness, I’d also like to mention that Twitter and Sina Weibo metadata tags are also supported, though I don’t want to promote implementations of platform-specific metadata tags. Twitter tags are in widespread use even though Twitter also uses Open Graph Protocol meta tags as a fallback. Sina Weibo’s meta tags are similar to Open Graph Protocol tags and are common on Chinese-language websites.
As I’ve written before, there are no standards for how reader mode works. If you can think of a better way to reliably extract metadata from billions of webpages, I encourage you to take a look at Readability.js and submit changes to Mozilla.