Many popular feed readers don’t handle plain text containing XML entities correctly

The headline of my recent article, “URLs don’t belong in <meta> elements”, didn’t fare too well when it was syndicated to feed readers and other platforms. So what went wrong with this particular headline?

Let’s start with a quick look at how this particular entry appeared in Ctrl blog’s Atom syndication feed:

<title type="text">
  URLs don’t belong in &lt;meta&gt; elements
</title>

The type="text" attribute is redundant, as text is the default mode for titles in Atom. A second text mode, type="html", is also available, but it has to be set explicitly. I’ve decided to include the type attribute with its default value anyway, as a signal to feed parsers that the content really is plain XML-entity-encoded text and not HTML.
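To make the difference between the two modes concrete, here is a minimal sketch of how the same headline could be escaped for each of them. The variable names are made up for illustration, and the htmlspecialchars() flags are just one reasonable choice, not necessarily what any particular feed generator does.

```php
<?php
// The headline as a human should read it.
$headline = 'URLs don’t belong in <meta> elements';

// type="text": the content is plain text, so it is XML-escaped exactly once.
$as_text = htmlspecialchars($headline, ENT_NOQUOTES | ENT_XML1, 'UTF-8');

// type="html": the content is HTML markup. A literal "<meta>" must first be
// escaped as HTML, and the resulting markup is then XML-escaped for the feed.
$markup  = htmlspecialchars($headline, ENT_NOQUOTES | ENT_HTML5, 'UTF-8');
$as_html = htmlspecialchars($markup, ENT_NOQUOTES | ENT_XML1, 'UTF-8');

echo '<title type="text">', $as_text, "</title>\n";
echo '<title type="html">', $as_html, "</title>\n";
```

The text variant carries &lt;meta&gt;, while the html variant ends up double-escaped as &amp;lt;meta&amp;gt;. A parser that treats the first as if it were the second will see a real HTML tag where none was intended.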

However, it would seem that a number of popular feed readers have never read RFC 4287, The Atom Syndication Format, and don’t know how to parse Atom feeds properly. RFC 4287 even includes an example clarifying how this is supposed to be handled.

The following popular feed readers/feed-based services screwed up the headline and displayed it as “URLs don’t belong in elements” (note that “<meta>” is missing) in their user interfaces:

  • Apple News
  • Google News
  • Feedly (web)
  • Flipboard (apps)
  • Newsblur
  • The Old Reader
  • Aggregator
  • NetNewsWire
  • Vienna
  • Pushbullet

You may recognize some of the above names if you’ve ever seen a list of the most popular feed readers on the market.

Update: Aggregator and NetNewsWire have fixed this problem in response to this article.

Zapier, the feed-to-Twitter gateway that tweets for @CtrlBlog, and my own feed-to-email newsletter delivery system also had problems with this title.

So what did I and all these feed readers get wrong about this title?

I can pinpoint fairly precisely where my own and all these other feed parsers went wrong. To get a plain-text string from an HTML-encoded string delivered in an Atom feed, you first need to XML-entity decode the string and then strip away the resulting HTML tags. The following demonstrates this process using PHP:

// Plain text from a type="html" title: decode the entities, then strip the tags.
strip_tags(
  html_entity_decode(
    $feed->entry->title
  )
);

The problem is, of course, that you can’t process a plain XML-entity-encoded text string with the same method. Doing so turns a string containing an XML-entity-encoded HTML tag (e.g. a tag you want to display as-is rather than have interpreted as HTML) into HTML, and the next step then strips out markup that was never intended to be handled as HTML.
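The failure is easy to reproduce in isolation. The following is a minimal sketch that assumes the title content has been pulled out of the feed as a raw string. The variable names are hypothetical, and real feed parsers will differ in the details.

```php
<?php
// The title content as it is written in the feed (type="text").
$raw = 'URLs don’t belong in &lt;meta&gt; elements';

// Step 1: entity-decode. This is all a type="text" title needs.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Step 2: tag stripping. Only appropriate for type="html" content;
// here it deletes the literal "<meta>" that readers were meant to see.
$stripped = strip_tags($decoded);

var_dump($decoded);   // still contains "<meta>"
var_dump($stripped);  // "<meta>" is gone
```

The second dump no longer contains <meta>, which matches the mangled headline shown by the affected feed readers.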

A few of these feed readers may have skipped the tag-stripping step and included the <meta> element as-is. This element isn’t displayed by default so it would be hidden in the title. I didn’t find any traces of this element in any of the readers that I could inspect, but it’s a distinct possibility.

So what did we all learn here? Feed readers need to put <title [type="text"]/> and <title type="html"/> through separate processes to get a correct text representation. The Atom specification is over a decade old, but many feed readers assume HTML is the default text parsing mode instead of text (or they simply don’t support the default mode).
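One way to keep the two processes apart is to branch on the type attribute before doing anything else. The sketch below assumes the feed is parsed with PHP’s SimpleXML extension, which decodes the XML entities for you. The function name atom_title_to_text() is hypothetical, and type="xhtml" titles are deliberately left out.

```php
<?php
// Hypothetical helper: returns a plain-text title for one Atom <title> element.
function atom_title_to_text(SimpleXMLElement $title): string
{
    // Per RFC 4287, a missing type attribute means type="text".
    $type = isset($title['type']) ? strtolower((string) $title['type']) : 'text';

    // The XML parser has already decoded the XML entities once.
    $value = (string) $title;

    if ($type === 'html') {
        // The content is HTML markup: remove the tags, then decode HTML entities.
        $value = html_entity_decode(strip_tags($value), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    }

    // type="text" needs no further processing; trim the indentation whitespace.
    return trim($value);
}

// The same headline, once in each mode.
$text_title = simplexml_load_string(
    '<title type="text">URLs don’t belong in &lt;meta&gt; elements</title>'
);
$html_title = simplexml_load_string(
    '<title type="html">URLs don’t belong in &amp;lt;meta&amp;gt; elements</title>'
);

var_dump(atom_title_to_text($text_title));
var_dump(atom_title_to_text($html_title));
```

Both calls should produce the same human-readable headline with <meta> intact, because each mode only goes through the steps it actually needs.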

The following feed readers handled the headline correctly:

  • Flipboard (web)
  • Inoreader
  • Liferea
  • Akregator

A round of applause for the spec-compliant feed readers, everyone! 👏

—and yes, I was aware of this issue before I published the affected article. However, I didn’t expect that it would affect my newsletter or Twitter integration.