Why are Facebook and Twitter so bad at parsing RDFa metadata?

Facebook and Twitter both chose to use RDFa for their “optimized link sharing” metadata formats. Well, it would seem Twitter didn’t realize that was what they’d done until a later stage. In any case, why are both of these social media platforms with vast engineering resources at their disposal so bad at parsing RDFa data from webpages?

The gist of this article is that Facebook chose RDFa, and Twitter walked backwards into it, yet neither of them implements even a minimal RDFa parser to retrieve metadata from websites.

This means that, for example, a meta tag whose content attribute applies to multiple vocabularies (Twitter Cards and Open Graph) must needlessly be duplicated, and that even the meta elements themselves are redundant. But I’m getting ahead of myself. Let me get back on track.

RDFa “ultra-light”

‘Twitter Cards’ is Twitter’s name for their tiny metadata format for enabling rich snippets in shared links. These snippets can include details such as a thumbnail and the title of the link being shared among the platform’s users. Twitter Cards rely on markup that appears to be RDFa at first glance, but is apparently the result of cargo-cult copying: imitating something that resembles a metadata format without a real understanding of its purpose or its context on the web.

Facebook’s ‘Open Graph Protocol’ (OGP) is more technically capable, if only in theory, as it’s based on an “ultra-light” variant of RDFa. From the available documentation, the main difference between RDFa and whatever OGP is using is that the latter can only be set on meta elements inside the head element. This restriction doesn’t come from RDFa, nor is it necessary for OGP’s purposes; it’s an arbitrary limitation set by Facebook’s engineers.

Speaking of RDFa (RDF in Attributes), you should be at least somewhat familiar with RDFa Core syntax to follow along with this article. Here is a quick refresher: RDFa allows embedding semantically meaningful content within HTML tags. The syntax relevant to this article is limited to embedded key-value pairs of metadata using the property attribute (key) and content attribute (value):

@content: a string, for supplying machine-readable content […].

@property: a white space separated list of [terms.]

Note that there’s no mention of the name attribute, as it holds no significance in RDFa whatsoever. When used in examples in this article, it’s for the benefit of non-RDFa metadata consumers and legacy systems (and to make a point). I’ll get back to this in the next section.
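To make “a white space separated list” concrete, here is a minimal sketch of how a consumer could split a property value into terms. The function names are my own for illustration and are not part of any RDFa library; the whitespace set is the ASCII whitespace that HTML uses for token lists.

```python
import re

# ASCII whitespace as used by HTML for token lists: space, tab, LF, FF, CR.
_HTML_WHITESPACE = re.compile(r"[ \t\n\f\r]+")


def property_terms(value: str) -> list[str]:
    """Split a property attribute value into its individual terms."""
    # Leading or trailing whitespace produces empty strings; drop them.
    return [term for term in _HTML_WHITESPACE.split(value) if term]


def has_term(value: str, term: str) -> bool:
    """Return True if `term` is one of the terms listed in `value`."""
    return term in property_terms(value)


print(property_terms("dc:title og:title twitter:title"))
# ['dc:title', 'og:title', 'twitter:title']
print(has_term("  og:title  ", "og:title"))
# True
```

Matching on whole terms rather than substrings also means that a value such as “og:titles” does not accidentally match “og:title”.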

For brevity, please just imagine that all the examples in this article have already declared the following namespaces. Neither Twitter nor Facebook care about namespaces and only support their own hardcoded namespace prefixes. That is a discussion for another time, however.

<html prefix="dc: http://purl.org/dc/terms/
              og: http://ogp.me/ns#
              twitter: https://dev.twitter.com/cards/markup#">

Twitter hasn’t published or documented a namespace URL for its Twitter Cards vocabulary. However, it has one which is broadly used by convention (as shown in the above example).
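For readers unfamiliar with prefixes, the following is a simplified sketch of what a namespace-aware consumer does with them: it reads the prefix attribute as pairs of “name: IRI” and expands each prefixed term to a full IRI. The helper names are hypothetical, and a real RDFa processor handles more cases (default vocabularies, predefined prefixes, and so on) than this does.

```python
def parse_prefixes(prefix_attr: str) -> dict[str, str]:
    """Parse a prefix attribute value into a {prefix: IRI} mapping."""
    tokens = prefix_attr.split()
    mapping = {}
    # The value is a flat list: a name ending in ":" followed by its IRI.
    for name, iri in zip(tokens[0::2], tokens[1::2]):
        if name.endswith(":"):
            mapping[name[:-1]] = iri
    return mapping


def expand(term: str, prefixes: dict[str, str]) -> str:
    """Expand a prefixed term to a full IRI; unknown prefixes are left as-is."""
    prefix, separator, reference = term.partition(":")
    if separator and prefix in prefixes:
        return prefixes[prefix] + reference
    return term


PREFIXES = parse_prefixes(
    "og: http://ogp.me/ns# twitter: https://dev.twitter.com/cards/markup#"
)
print(expand("og:title", PREFIXES))
# http://ogp.me/ns#title
print(expand("twitter:title", PREFIXES))
# https://dev.twitter.com/cards/markup#title
```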

I’ve done some experiments to see how Twitter and Facebook’s bots process metadata in Twitter Cards and Open Graph Protocol format. The next two sections deal with extracting the page title using these formats.

Let’s observe Twitterbot parsing

Even though Twitter incorrectly recommends specifying the key-values in name attributes, I’ll disregard those in favor of the property attribute in all the examples. Twitter supports both variants.

The property attribute is the correct choice as per the RDFa standard and it’s also supported by Twitterbot. The recommendation to use name is a poor design decision that I suspect is rooted in the engineers not being familiar with RDFa when they designed their RDFa-imitation format.

Let’s just look at some non-working but perfectly valid examples:

<meta content="Great Title" name="title" property="twitter:title">
<meta content="Great Title" property="dc:title twitter:title">
<meta content="Great Title" data-sub="marine" property="twitter:title">

Then please compare the above examples to these working examples:

<meta content="Great Title" property="twitter:title" name="title">
<meta property=" twitter:title " name="title" content="Great Title">

All five examples are perfectly valid and should have worked. Yet, Twitter can only read the metadata from the last two examples. Worryingly, significance seems to be placed in the order of the element attributes. The contents of the property attribute are also treated as an exact-match key rather than a list of tokens.
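As a point of comparison, here is a small sketch that feeds the same five examples to an actual HTML tokenizer, the one in Python’s standard library. The class name is my own, and this is not a full RDFa processor; it only illustrates that attribute order and extra attributes stop mattering once the markup is parsed instead of pattern-matched.

```python
from html.parser import HTMLParser


class MetaPropertyCollector(HTMLParser):
    """Collect content values from meta elements, keyed by property term."""

    def __init__(self) -> None:
        super().__init__()
        self.values: dict[str, list[str]] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attributes arrive as (name, value) pairs. Looking them up by name
        # makes their order in the source irrelevant.
        attributes = dict(attrs)
        content = attributes.get("content")
        if content is None:
            return
        # property is a whitespace-separated list of terms.
        for term in (attributes.get("property") or "").split():
            self.values.setdefault(term, []).append(content)


EXAMPLES = [
    '<meta content="Great Title" name="title" property="twitter:title">',
    '<meta content="Great Title" property="dc:title twitter:title">',
    '<meta content="Great Title" data-sub="marine" property="twitter:title">',
    '<meta content="Great Title" property="twitter:title" name="title">',
    '<meta property=" twitter:title " name="title" content="Great Title">',
]

for markup in EXAMPLES:
    collector = MetaPropertyCollector()
    collector.feed(markup)
    collector.close()
    print(collector.values.get("twitter:title"))  # ['Great Title'] every time
```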

There are only two conclusions that can be drawn from the differences between the working and non-working examples:

  1. Twitter engineers didn’t know anything about HTML and RDFa when they designed Twitter Cards.
  2. Twitterbot is committing the greatest sin imaginable when parsing HTML! They’re regex-soup-matching tags and attributes rather than parsing the markup! They’re doing it quite badly too.

So, what does this mean for the web? More redundant markup. In a perfect world every metadata consumer would have agreed on a common core of metadata descriptors. In lieu of one metadata standard, every metadata consumer should at least be expected to parse RDFa data in a sensible manner. The following example should have covered all bases:

<meta content="Great Title" property="dc:title og:title twitter:title">

Let’s observe Facebot parsing

But wait, what about Facebook? Their Open Graph Protocol implementation seems to be designed around an understanding of RDFa syntax and not just cargo-culting (imitating) the syntax like we see in Twitter Cards. Facebook’s facebookexternalhit bot, lovingly known as Facebot, seems to be somewhat more competent than Twitterbot and its error messages indicate that it’s parsing the markup. However, the implementation still leaves a lot to be desired.

Let’s look at some non-working but perfectly valid examples:

<meta content="Great Title" property="dc.title og:title">
<meta content="Great Title" property="  og:title  ">
<meta content="Great Title" name="title" property="og:title">

Then please compare the above examples to these working examples:

<meta content="Great Title" property="og:title">
<meta content="Great Title" property="og:title" name="title">

Facebot, exactly like Twitterbot, seems to not have got the memo about the property attribute being a space-separated list of terms. Facebot will only do an exact match of property="og:title", without even handling any whitespace stripping. Twitterbot at least handles whitespace stripping even though it doesn’t understand that it can be a space separated list.
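The difference between the three behaviors can be modeled in a few lines. This is only a sketch of what the experiments suggest from the outside, not the bots’ actual code, and the function names are invented for the comparison.

```python
def exact_match(value: str, term: str) -> bool:
    """What Facebot appears to do: compare the raw attribute value."""
    return value == term


def trimmed_match(value: str, term: str) -> bool:
    """What Twitterbot appears to do: strip whitespace, then compare."""
    return value.strip() == term


def token_match(value: str, term: str) -> bool:
    """What RDFa describes: treat the value as a list of terms."""
    return term in value.split()


SAMPLES = ["og:title", "  og:title  ", "dc.title og:title"]

for value in SAMPLES:
    print(
        repr(value),
        exact_match(value, "og:title"),
        trimmed_match(value, "og:title"),
        token_match(value, "og:title"),
    )
# 'og:title' True True True
# '  og:title  ' False True True
# 'dc.title og:title' False False True
```

Only the token-based comparison accepts all three values, and it is no more work to write than the other two.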

Like Twitterbot, Facebot treats the name attribute as significant even though it holds no significance in RDFa parsing. The two bots only have a problem with unexpected attributes if they appear before the property attribute in the element. Again, I’d like to remind readers that the order of attributes holds no significance in either RDFa or HTML.

Wouldn’t you know … the facebookexternalhit bot is also regex-soup-matching tags and attributes rather than properly parsing the metadata. This lazy approach introduces unnecessary incompatibilities with other products. Facebook only gets away with its sloppy work because of its near-monopoly position in the market.

How should this have been solved?

The answer to this question is almost always to use XPath on a parsed XML representation like DOM or XDM. Running XPath against raw HTML documents is unlikely to succeed, as authors write HTML and XPath only operates on well-formed XML documents.

Luckily, there are standardized processes for processing HTML into a parsed XML representation like the DOM or XDM. This document as displayed in your web browser right now is an example of that process. There are plenty of libraries for developers to choose from that can create such representations from HTML documents.

The following XPath will read out Open Graph Protocol title metadata from a document while respecting the current limitations in the OGP standard:

//meta[contains(concat(
  ' ', normalize-space(@property), ' '),
  ' og:title ')]/@content

Or if you want full RDFa support, then this will let you parse the value from either the content attribute or fall back to the element’s text node if no content attribute is set. This can read the data from any element (the title element seems appropriate, or maybe an h1 element somewhere further down on the page) and not just the meta element.

(
  //*[contains(concat(
    ' ', normalize-space(@property), ' '),
    ' og:title ')]/@content
  |
  //*[contains(concat(
    ' ', normalize-space(@property), ' '),
    ' og:title ')]/text()
)[position() = 1]

The above XPath would return the string “Great Title” in accordance with the RDFa standard from each of the following examples:

<title content="Great Title" property="og:title">Okay Title</title>
<meta name="title" property="og:title" content="Great Title">
<meta content="Great Title" name="title" property="dc:title og:title">
<h1 property="og:title">Great Title</h1>
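For those without an XPath engine at hand, the same lookup can be sketched with nothing but Python’s standard library. This is an illustration under simplifying assumptions, not a complete RDFa processor: the class and function names are my own, elements are assumed to be properly nested, and the fallback joins all text inside the matching element.

```python
from html.parser import HTMLParser

# Elements that never have an end tag in HTML.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class PropertyExtractor(HTMLParser):
    """Find the first element carrying `term` and read its value."""

    def __init__(self, term: str) -> None:
        super().__init__()
        self.term = term
        self.value = None   # the extracted value, once found
        self._text = None   # text chunks while inside the matching element
        self._depth = 0     # open child elements inside the matching element

    def handle_starttag(self, tag, attrs):
        if self.value is not None:
            return
        if self._text is not None:
            if tag not in VOID_ELEMENTS:
                self._depth += 1
            return
        attributes = dict(attrs)
        if self.term not in (attributes.get("property") or "").split():
            return
        if attributes.get("content") is not None:
            self.value = attributes["content"]   # content attribute wins
        elif tag not in VOID_ELEMENTS:
            self._text = []                      # fall back to element text

    def handle_data(self, data):
        if self._text is not None and self.value is None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._text is None or self.value is not None or tag in VOID_ELEMENTS:
            return
        if self._depth == 0:
            self.value = "".join(self._text)
        else:
            self._depth -= 1


def read_property(markup: str, term: str):
    extractor = PropertyExtractor(term)
    extractor.feed(markup)
    extractor.close()
    return extractor.value


EXAMPLES = [
    '<title content="Great Title" property="og:title">Okay Title</title>',
    '<meta name="title" property="og:title" content="Great Title">',
    '<meta content="Great Title" name="title" property="dc:title og:title">',
    '<h1 property="og:title">Great Title</h1>',
]
for markup in EXAMPLES:
    print(read_property(markup, "og:title"))  # Great Title, four times
```

A production crawler would be better served by a full HTML tree builder, which also copes with unclosed and misnested tags.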

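The following DOM query manages the same feat in a browser. The ~= attribute selector matches a single term within a whitespace-separated list, and the value is read from the content attribute with the element text as a fallback:

const el = document.querySelector('[property~="og:title"]');
const title = el && (el.getAttribute('content') ?? el.textContent);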

In conclusion

So is XPath the right tool for Twitterbot and Facebot? Well, yeah. Probably. Why don’t they use it? They don’t care in the slightest. Publishers will adjust their markup to work with the weak implementations that are in actual use.

To answer the leading question of the headline: Facebook and Twitter are bad at extracting metadata from websites in their own standard formats because they’re not parsing the data but rather using metadata extraction techniques based on wishful thinking. If their solution had been suggested as an answer to this problem on Stack Overflow, it would have been down-voted to oblivion.

Now that you’ve read this article and are better informed, there’s no excuse for you not to implement proper document parsing in your projects. Don’t repeat the mistakes of Facebook and Twitter.