HTML5 Tidy

HTML5 parsed as HTML 4 with uppercase DOCTYPE in HTML Tidy

HTML Tidy (libtidy) is a small program for identifying problems in HTML, cleaning it up, and producing consistently formatted markup. It has full support for HTML5 parsing mode as well as the many legacy HTML 4 parsing modes. It can be a great aid for HTML experts and novices alike.

You may remember that I ran into another issue with php-tidy (also built on libtidy) when I tried using it with my old PHP-based content management system. I’ve since migrated to a static-file content management system, so that should have been the end of my problems with libtidy. However, I kept running into a strange bug where some of my documents were detected as HTML 5 but processed as if they were HTML 4.

None of the legacy HTML 4 parsing modes allow modern practices like wrapping a block element such as <h1> inside what was formerly defined as an inline element, such as <a>. This works in most old web browsers, as web authors had been doing it for years before HTML 5 became a formal standard. But HTML Tidy in HTML 4 mode would try to clean up the mess and break the document in the process.

An HTML document’s parsing mode is detected from the DOCTYPE declaration on the very first line of the document. In standards-compliant HTML 5, the DOCTYPE should always be the string <!DOCTYPE html>, matched case-insensitively. HTML 4 has multiple modes and many variations on the DOCTYPE declaration, all of which are distinctly different from the HTML 5 DOCTYPE.

After some digging I realized that HTML Tidy version 5.6.0 didn’t do case-insensitive matching of the DOCTYPE string. <!DOCTYPE HTML> (uppercase “HTML”) would always be parsed as HTML 4 whereas <!DOCTYPE html> (lowercase “html”) would be parsed as HTML 5. The HTML 5 standard is very clear that the DOCTYPE should be matched case-insensitively. Tidy would correctly indicate that both variants had been detected as HTML 5, but incorrectly state that both variants had used HTML 5 parsing. The casing of “DOCTYPE” had no impact on this issue.
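To make the difference concrete, here is a minimal Python sketch of the two comparisons. It is not Tidy’s actual code (libtidy is written in C, and I haven’t reproduced its logic here); both function names are made up for illustration, and the input is assumed to be just the name that follows “DOCTYPE” in the declaration.

```python
def is_html5_name_case_sensitive(name: str) -> bool:
    # Hypothetical illustration of the buggy behaviour:
    # only an exact lowercase "html" is accepted.
    return name == "html"


def is_html5_name(name: str) -> bool:
    # Case-insensitive comparison, as the HTML 5 syntax calls for:
    # "html", "HTML", and "Html" are all accepted.
    return name.lower() == "html"


if __name__ == "__main__":
    for name in ("html", "HTML", "Html"):
        print(name, is_html5_name_case_sensitive(name), is_html5_name(name))
```

With the case-sensitive comparison, only the lowercase spelling is treated as HTML 5, which matches the behaviour I saw with uppercase “HTML”.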

The obvious workaround for this bug is to always use a lowercase DOCTYPE when passing markup through HTML Tidy. I wrote a little program for myself that lowercases the DOCTYPE of any HTML file I pass through it, to help me avoid this particular issue in the future. It shouldn’t make any difference to web browsers or any other software that parses the document.
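A rough sketch of what such a program could look like in Python follows. This is an illustration and not the exact tool I use: the function name and the regular expression are assumptions, it only touches the short HTML 5 DOCTYPE, and it leaves the word “DOCTYPE” in uppercase since its casing doesn’t affect the bug.

```python
import re
import sys

# Matches the short HTML 5 doctype in any letter case, e.g. <!DOCTYPE HTML>.
# Legacy HTML 4 doctypes carry a PUBLIC identifier and therefore don't match.
HTML5_DOCTYPE = re.compile(r"<!doctype\s+html\s*>", re.IGNORECASE)


def lowercase_html5_doctype(markup: str) -> str:
    """Rewrite the first HTML 5 doctype as <!DOCTYPE html>; leave the rest alone."""
    return HTML5_DOCTYPE.sub("<!DOCTYPE html>", markup, count=1)


if __name__ == "__main__":
    # Read markup from standard input and write the normalized markup
    # to standard output, so the result can be handed on to Tidy.
    sys.stdout.write(lowercase_html5_doctype(sys.stdin.read()))
```

Reading from standard input and writing to standard output keeps the sketch easy to chain in front of Tidy in a build script.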

This was a really annoying issue, and I spent way too much time narrowing down why some documents behaved correctly while others were reformatted in strange and unexpected ways. I later realized that this “strange” reformatting was the expected result for HTML 4 strict parsing mode.

On a more positive note: I’m really happy with HTML Tidy overall. It has helped me identify and correct dozens of issues with this site already. I’ve also found, reported, or fixed some other minor issues using Tidy with things like HTML5+RDFa extension attributes.

You can follow tidy-html5 issue #815 if you’re interested in updates on the DOCTYPE issue. A tiny patch for the issue is already available. HTML Tidy releases are few and far between, so I don’t expect this patch to make its way into a release, and on to users, any time soon.

Sources

  • The HTML syntax, HTML 5.2, W3C Recommendation, W3C
  • HTML Tidy version 5.6.0, HTACG