The mod_cache module in the Apache HTTP Server (httpd) is a standard HTTP web cache that you can use to accelerate your website by caching dynamically generated pages (or pages passed through a proxy). This can bypass large parts of the processing chain and fulfill each request much faster than if your server had to regenerate the response each time.
I’ve used mod_cache for various projects over the years, and almost every time I run into the same set of problems. Here is a collection of the problems I’ve run into repeatedly, to help you know what to expect when you decide to use it.
Some familiarity with HTTP caching and control headers is assumed for this article.
The unlucky visitor
I’ve sung praises for asynchronous stale-while-revalidate (RFC 5861) on this blog before. Simply put, it’s a Cache-Control extension that instructs a caching server to serve a stale/expired version of a resource to visitors while it revalidates and updates the cached version in the background.
mod_cache kind-of-but-not-really supports a primitive version of this system that isn’t controlled by a standard HTTP header.
mod_cache will use one visitor request to revalidate the resource while everyone who comes after that first visitor gets served stale content until it can be updated. The one unlucky visitor will, of course, get a much slower and less satisfactory experience than everyone else.
This is known as the “thundering herd mitigation” and is specifically designed to prevent thousands of requests from flooding through to the origin server when a cached resource expires. It’s not designed to give everyone a quick response and a great user experience. This is a relatively small problem for extremely popular websites where every resource is requested frequently. It’s a bigger problem for a smaller website or blog where not every page is super popular. The one unlucky visitor doesn’t care whether you run a small or enormous website; they’ll still have a worse experience than everyone else.
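As a rough sketch of how this mitigation is switched on, the lock is controlled by the CacheLock family of directives in mod_cache. The lock path below is a placeholder assumption; choose a directory that suits your system.

```apache
# Sketch: enable mod_cache's thundering herd lock.
CacheLock on
# Directory for the lock files (placeholder path, adjust for your system).
CacheLockPath "/tmp/mod_cache-lock"
# Maximum age of a lock in seconds before it is considered stale and ignored.
CacheLockMaxAge 5
```

With the lock in place, the request that takes the lock still waits for the origin server; only the requests arriving behind it are answered from the stale copy.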
Too many resource variants
One of the most used HTTP cache control headers is Vary. Read my explainer article for a quick refresher. HTTP Server will include a Vary: Accept-Encoding response header by default when a compression module such as mod_deflate is enabled.
The default HTTP Server filter chain will store compressed variants of each resource in the cache. This saves processing cycles on the server, as cached resources are ready to be sent to the visitor immediately without further processing.
mod_cache doesn’t remove duplicate HTTP request header values, sort the values, normalize white space characters, or handle quality scores. It does compare strings in ASCII lower-case and merge duplicate request headers, however. You’ll see a lot of variation in the request headers that come from browsers, bots, and other clients, depending on the request header in question.
I’ll show a few example Accept-Encoding HTTP headers to demonstrate the problems this causes:
All of the above headers are perfectly valid in the HTTP standard, and all of them are equivalent and should have been normalized to just one variant:
From those two candidates, HTTP Server should pick the preferred compression algorithm based on which modules are enabled. You should only end up with one variant from the above Vary header. What you get instead is four different variants.
You can use the RequestHeader directive from mod_headers to manually process and normalize the Accept-Encoding request header and override it to your preferred content encoding. Below is an example of such a header normalization configuration (quality qualifier unaware) for normalizing known header values of the Accept-Encoding request header:
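A minimal sketch of such a normalization, assuming mod_setenvif and mod_headers are loaded and that only gzip and Brotli matter to you. The environment variable names (ae_gzip, ae_br) and the sample headers in the comments are made up for illustration, and quality values such as gzip;q=0 are deliberately ignored.

```apache
# Run mod_cache as a normal handler so the rewritten header is used for the lookup.
CacheQuickHandler off

# Flag the encodings the client mentions, case-insensitively.
# Note: \bgzip\b also matches the legacy "x-gzip" token.
SetEnvIfNoCase Accept-Encoding "\bgzip\b" ae_gzip
SetEnvIfNoCase Accept-Encoding "\bbr\b" ae_br

# Rebuild the header from scratch in one canonical order.
RequestHeader unset Accept-Encoding
RequestHeader set Accept-Encoding "br" env=ae_br
RequestHeader append Accept-Encoding "gzip" env=ae_gzip

# Hypothetical inputs that all collapse to "br, gzip":
#   Accept-Encoding: gzip, br
#   Accept-Encoding: br,gzip
#   Accept-Encoding: GZIP, deflate, br;q=1.0
```

Every client now falls into one of four buckets: "br, gzip", "br", "gzip", or no Accept-Encoding header at all, rather than one bucket per spelling.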
Similar normalization should be applied to all Vary-ing headers. The thing is, mod_cache should be doing this automatically to ensure good cache-hit ratios. I can’t think of any situation where request header normalization would not be the right thing to do. However, this does negatively impact performance, as you must disable the CacheQuickHandler directive to make HTTP Server process RequestHeader directives and not just immediately look for the request in the cache.
It doesn’t mix too well with mod_rewrite
Incredibly long story short: mod_rewrite rewrites variables and URLs. Even though it’s right there in the name, you don’t often see the effects of rewriting HTTP Server’s processing variables. Let’s look at a super-easy but common example where every URL is handed off to /index.php, which then gets to decide what content to serve for all URLs:
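A minimal sketch of such a front-controller rule, assuming a per-directory (.htaccess) context and modeled on the rules many content management systems ship. Treat it as an illustration of the pattern rather than a configuration to copy.

```apache
# Hypothetical front-controller rules for a per-directory (.htaccess) context.
RewriteEngine On
# Leave requests for existing files and directories alone.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# Hand everything else to the script.
RewriteRule ^ /index.php [L]
```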
The rewrite rule in effect rewrites REQUEST_URI for every request to /index.php, so every page is served the same cached page even though the dynamic script would have handled different URLs differently. Enabling mod_cache at this stage would mean some visitors would get the correct page and some visitors would get the wrong page, depending on how lucky they were with their timing and the state of the cache.
You’ll find rewrites like the above in the default configuration for content management systems like WordPress. You can end up with similar problems in a lot of ways, and the issue can be hard to detect unless you’re aware of it. This is expected behavior from HTTP Server’s perspective, but not from a server administrator’s.
You need to check each and every instance of mod_cache and reevaluate whether you have to use a RewriteRule directive. If you can’t live without it, then be absolutely sure it doesn’t modify any of the parameters that make up the cache lookup key. This includes every part of the request URL as well as any header and corresponding module listed in the Vary header.
When the ‘cache is unwilling’ to cache
The CacheDetailHeader directive can help you verify and test that your cache is behaving the way you expect it to by including additional details from mod_cache about the request. This can be useful when it indicates client-initiated invalidation through control headers.
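As a small sketch, both of mod_cache’s diagnostic response headers can be switched on like this. You will likely want them enabled only while testing.

```apache
# Add an X-Cache response header reporting HIT, REVALIDATE, or MISS.
CacheHeader on
# Add an X-Cache-Detail response header explaining the caching decision.
CacheDetailHeader on
```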
The cache detail header can also be infuriatingly vague:
The ‘details’ are somewhat sparse on why the cache didn’t want to store the response, and the HTTP Server documentation is no help here. This is a catch-all message for situations where a more specific error message isn’t written. You may see this message in any number of situations, including conflicting or invalid cache control directives, chunked transfer encoding, or, most likely, a missing Content-Length response header.
I mention this error specifically because I’ve wasted hours of my life trying to decipher exactly why, this time around, Apache was unwilling to do my bidding. The problems have always been trivial to fix, but difficult to identify as you’re given very little information to work with. This perfectly summarizes all my experiences with using mod_cache: the problems are difficult to identify but easy to fix once you know what the problem is.
For the record, I prefer Varnish HTTP Cache over mod_cache. Varnish gives you better documentation and better performance than mod_cache. Varnish doesn’t handle header normalization automatically either, but it has fewer quirks and handles stale-while-revalidate perfectly out of the box. If you’re not dead set on using mod_cache, I suggest you seriously consider using Varnish instead.