The PHP logo side-by-side with the HTML Tidy logo.

PHP HTML Tidy ironically doesn’t tidy up after itself

I learned HTML, and the value of open and readable source code, back in the day by looking at the sources of the websites of the time, and I still value these things. PHP Tidy is an interface to the Tidy HTML library (libtidy), which is used to neaten up HTML documents and pretty-print their source code. However, every time I’ve tried to use it, I’ve had to disable it because it started gobbling up every file processed by PHP, even the ones it was never told to tidy up. I finally took the time to figure out what the issue was.

The HTML Tidy library is meant to be used as a post-processing tool to help uncover formatting issues and beautify the source HTML to increase readability. The tool does an excellent job of that with HTML, but it can’t be used to process other markup languages.

PHP Tidy includes an output buffer callback function called ob_tidyhandler that you can register at the top of your HTML documents to have HTML Tidy post-process the current document before it’s returned to the client. Due to a bug in PHP Tidy, registering the callback for one document would cause every subsequent request handled by the same PHP processor to run through HTML Tidy as well. This would be a destructive action for any XML or binary files returned by PHP.

The heart of the issue is the management and abuse of the global on/off switch for enabling HTML Tidy processing on all documents. If you enable the tidy.clean_output option, then every request going through the PHP processor is given a tidying up, even those returning unsupported file formats. This option should be left off and avoided.
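A quick way to confirm that the option is off on a given server is to read the configured value at runtime. This is only a sketch: example_flag_state is a made-up helper name, and ini_get reports the configured value, so I wouldn’t rely on it to reveal changes made inside the extension while a process is running.

```php
<?php
// Hypothetical helper: interpret the raw result of ini_get() for a boolean option.
// Returns null when the option is unknown (e.g. the tidy extension isn't loaded).
function example_flag_state($raw): ?bool
{
    if ($raw === false) {
        return null;
    }

    // Boolean INI values usually arrive as "1" or ""; "On"/"Off" style strings are accepted too.
    return filter_var($raw, FILTER_VALIDATE_BOOLEAN);
}

$state = example_flag_state(ini_get('tidy.clean_output'));

if ($state === true) {
    // Write a warning to the error log rather than to the response body.
    error_log('tidy.clean_output is enabled; responses will be run through HTML Tidy.');
}
```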

However, the ob_tidyhandler function doesn’t process the buffer as one would expect. Instead, it flips the global tidy.clean_output runtime option to on, causing PHP Tidy to send the whole document (as opposed to just the buffer the function was supposed to work on) to HTML Tidy for post-processing. This behavior doesn’t at all match how the function is documented.

This little cheat was probably fine back in the day, when a PHP processor would be started by the web server for every request and put to rest when it had done its task, and it matched how most developers would have wanted to use the function. Today, the same PHP processor instance is reused for hundreds or thousands of requests, so a switch flipped during one request stays on for the requests that follow.

This is a longstanding issue and I believe that I’ve observed problems caused by this bug around the web for the last decade. Specifically, I’ve repeatedly come across servers that have returned Atom syndication feeds (an XML file format) that have been butchered into a broken HTML document with a <meta name="generator" content="libtidy"> element near the top. Reloading the URL a dozen times would normally return the Atom-formatted feed as expected.
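If you want to check your own feeds for this symptom, something along these lines could work. It’s a rough sketch, not a robust detector: example_looks_tidied is a made-up name, and I’m assuming the mangled output carries a generator meta element that mentions Tidy somewhere in its content attribute.

```php
<?php
// Hypothetical check: does a response body that should be a feed contain
// the generator meta element that Tidy adds to HTML documents?
function example_looks_tidied(string $body): bool
{
    // Assumption: the element has name="generator" followed by a content
    // attribute mentioning "tidy" (matched case-insensitively).
    $pattern = '/<meta\s+name="generator"\s+content="[^"]*tidy[^"]*"/i';

    return preg_match($pattern, $body) === 1;
}
```

Feed it the body fetched from the feed URL a few times in a row, since only some of the responses may be affected.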

My recommendation is clear: don’t use ob_tidyhandler!
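If you still want tidied HTML, one alternative is to write your own output callback that only touches the buffer it’s handed. The sketch below is an assumption on my part and not something I’ve verified across PHP versions: example_tidy_callback is a made-up name and the option values are just examples. It uses tidy_repair_string instead of ob_tidyhandler, which as far as I can tell should leave the global switch alone.

```php
<?php
// Hypothetical output callback: tidy only the buffer it receives.
function example_tidy_callback(string $buffer): string
{
    // Leave the output untouched if the tidy extension isn't available.
    if (!function_exists('tidy_repair_string')) {
        return $buffer;
    }

    // 'indent' and 'wrap' are libtidy options; the values here are only examples.
    $clean = tidy_repair_string($buffer, ['indent' => true, 'wrap' => 0], 'utf8');

    // Fall back to the original output if Tidy couldn't process it.
    return is_string($clean) ? $clean : $buffer;
}

// Register it at the top of an HTML document only, never for feeds or binary output:
// ob_start('example_tidy_callback');
```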

If you’re interested in this issue you can follow PHP bug #77594.

Sources

  • ext/tidy/tidy.c, commit 91ef4124e5, PHP project repository, GitHub
  • ob_tidyhandler, Tidy, Other Basic Extensions, PHP Function Reference, PHP Manual, The PHP Group