Problems to expect with Apache mod_cache

The mod_cache module in the Apache HTTP Server (httpd) is a standard HTTP web cache that you can use to accelerate your website by caching dynamically generated pages (or pages passed through a proxy). This can bypass large parts of the processing chain and fulfill each request much faster than if your server had to regenerate the response each time.

I’ve used mod_cache for various projects over the years, and almost every time I run into the same set of problems. Here is a collection of the problems I’ve run into repeatedly, to help you know what to expect when you decide to use mod_cache.

Some familiarity with HTTP caching and control headers is assumed for this article.

The unlucky visitor

I’ve sung the praises of asynchronous stale-while-revalidate (RFC 5861) on this blog before. Simply put, it’s a standard Cache-Control extension that instructs a caching server to serve a stale/expired version of a resource to visitors while it revalidates and updates the cached version in the background.

mod_cache kind-of-but-not-really supports a primitive version of this system that isn’t controlled by a standard HTTP header. mod_cache will use one visitor request to revalidate the resource while everyone who comes after that first visitor gets served stale content until it can be updated. The one unlucky visitor will of course get a much slower and less satisfactory experience than everyone else.

This is known as “thundering herd mitigation” and is specifically designed to prevent thousands of requests from flooding through to the origin server when a cached resource expires. It’s not designed to give everyone a quick response and a great user experience. This is a relatively small problem for extremely popular websites where every resource is frequently requested. It’s a bigger problem for a smaller website or blog where not every page is super popular. The one unlucky visitor doesn’t care whether you run a small or enormous website; they’ll still have a worse experience than everyone else.
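For reference, the locking behavior described above is controlled by the CacheLock family of directives. The following is a minimal sketch, assuming a Unix-like server where /tmp is writable; the lock path and the five-second limit are illustrative values, not recommendations:

```apache
# Let one request refresh a stale entry while later requests get the stale copy.
CacheLock on
# Directory for lock files (assumed location; any writable directory should do).
CacheLockPath /tmp/mod_cache-lock
# Maximum age of a lock, in seconds, before another request may attempt the refresh.
CacheLockMaxAge 5
```

Note that this only limits how many requests reach the origin server. The request that holds the lock still has to wait for the backend to respond.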

Too many resource Vary-ants

One of the most used HTTP cache control headers is Vary. Read my explainer article for a quick refresher. HTTP Server will include a Vary: Accept-Encoding response header by default when either mod_brotli or mod_deflate is enabled to provide compression.

The default HTTP Server filter chain will store compressed variants of each resource in the cache. This saves processing cycles on the server, as cached resources are ready to be sent to the visitor immediately, without further processing.

However, mod_cache doesn’t remove duplicate HTTP request header values, sort their values, normalize white space characters, or handle quality scores. It does compare values as lower-case and merge duplicate request headers. You’ll see a lot of variation in the request headers sent by browsers, bots, and other clients, depending on the request header in question.

I’ll show a few example Accept-Encoding HTTP headers to demonstrate the problems this causes:

Accept-Encoding: gzip, br, br
Accept-Encoding: br, gzip
Accept-Encoding: gzip,br
Accept-Encoding: br;q=1, gzip;q=1

All of the above headers are perfectly valid in the HTTP standard, and all of them are equivalent and should have been normalized to just one variant:

Accept-Encoding: br, gzip

From those two candidates, HTTP Server should pick the preferred compression algorithm based on which modules are enabled. You should only end up with one variant from the above Vary header. What you get instead is four different variants.

You can use the RequestHeader directive from mod_headers to manually process and normalize the Accept-Encoding request header and override it to your preferred content encoding. Below is an example of such a header normalization configuration (quality qualifier unaware) for normalizing known header values of the Accept-Encoding header:

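# The quick handler must be off for RequestHeader to run before the cache lookup
CacheQuickHandler off
<If "%{HTTP:Accept-Encoding} =~ /br/">
  RequestHeader set Accept-Encoding br early
</If>
<ElseIf "%{HTTP:Accept-Encoding} =~ /gzip/">
  RequestHeader set Accept-Encoding gzip early
</ElseIf>
<Else>
  # Neither encoding was requested, so drop the header entirely
  RequestHeader unset Accept-Encoding early
</Else>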

Similar normalization should be applied to all Vary-ing headers. The thing is, mod_cache really should be doing this automatically to ensure good cache-hit ratios. I can’t think of any situation where request header normalization would not be the right thing to do. However, this does negatively impact performance, as you must turn off the CacheQuickHandler directive to make HTTP Server process RequestHeader directives instead of immediately looking for the request in the cache.

Doesn’t mix too well with mod_rewrite

Incredibly long story short: mod_rewrite rewrites variables and URLs. Even though it’s right there in the name, you don’t often see the effects of rewriting HTTP Server’s processing variables. Let’s look at a super-easy but common example where every URL is handed off to /index.php, which then decides what content to serve for all URLs:

RewriteRule ^(.*)$ /index.php [L,QSA]

The rewrite rule in effect rewrites REQUEST_URI for every request to /index.php; so every page is served the same cached page even though the dynamic script would have handled different URLs differently. Enabling mod_cache at this stage would mean some visitors would get the correct page and some visitors would get the wrong page depending on how lucky they were with their timing and the state of the cache.

You’ll find rewrites like the above in the default configuration for content management systems like WordPress. You can end up with similar problems in a lot of ways, and it can be hard to detect the issue unless you’re aware of this problem. This is expected behavior from HTTP Server’s perspective, but not from a server administrator’s.

You really need to check each and every instance of mod_cache and reevaluate whether you really have to use a RewriteRule directive. If you can’t live without it, be absolutely sure it doesn’t modify any of the parameters that make up the cache lookup key. This includes every part of the request URL as well as any header and corresponding module listed in the Vary header.

When the ‘cache is unwilling’ to cache

The CacheDetailHeader directive can help you verify and test that your cache is behaving the way you expect it to by including additional details from mod_cache about the request. This can be useful when it indicates client-initiated invalidation through control headers.
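As a minimal sketch, the two diagnostic headers can be switched on together. This assumes you only enable them while testing, since they expose cache internals to every client:

```apache
# Adds an X-Cache response header reporting HIT, REVALIDATE, or MISS.
CacheHeader on
# Adds an X-Cache-Detail response header describing the caching decision.
CacheDetailHeader on
```

You can then inspect the response headers with a tool of your choice, for example by requesting a page twice with curl -I and comparing the X-Cache values.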

The cache detail header can also be infuriatingly vague:

cache miss: cache unwilling to store response

The ‘details’ are somewhat sparse on why the cache didn’t want to store the response, and the HTTP Server documentation is no help here. This is a catch-all message for situations where a more specific error message isn’t written. You may see this message in any number of situations, including conflicting or invalid cache control directives, chunked transfer encoding, or, most likely, a missing Content-Length response header.
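One thing that may help narrow it down is raising the log level for the cache modules only, so the error log carries more of the decision-making. This is a sketch that assumes Apache 2.4 (per-module log levels) and that both mod_cache and mod_cache_disk are loaded; drop the cache_disk part if you use another storage provider:

```apache
# Keep the server-wide level at info, but log mod_cache and mod_cache_disk at debug.
LogLevel info cache:debug cache_disk:debug
```

Debug logging is verbose, so it’s best left on only while you reproduce the problem.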

I mention this error specifically because I’ve wasted hours of my life trying to decipher exactly why, this time around, Apache was unwilling to do my bidding. The problems have always been trivial to fix, but really difficult to identify as you’re given very little information to work with. This actually perfectly summarizes all my experiences with using mod_cache: the problems are difficult to identify but easy to fix once you know what the problem is.

For the record, I prefer Varnish HTTP Cache over mod_cache. Varnish gives you better documentation and better performance than mod_cache. Like mod_cache, Varnish doesn’t handle header normalization automatically but it has fewer quirks and handles stale-while-revalidate perfectly out-of-the-box. If you’re not dead set on using mod_cache, I suggest you seriously consider using Varnish instead.