Getting yourself into trouble with NGINX caching reverse proxy and rewritten URLs

This story starts with two seemingly similar but distinctly different URLs: //document and /document. They’re semantically two distinct and unique URLs (RFC 3986) and programs can’t assume they can be normalized by removing the “extra” forward slash. Extra slashes can sneak into URLs unintentionally from all sorts of websites and external software, so it can be tempting to combine seemingly redundant slashes into a single slash.

You can only safely normalize and redirect duplicate slashes when you know that your own URL design never calls for duplicate slashes. However, caching reverse proxies and content delivery networks (CDNs) can’t arbitrarily decide to merge slashes without breaking expectations of the web platform.

But I’m getting ahead of myself. Before going into that in more detail, I first need to establish some internal workings of the NGINX web server.

Two variables

The first variable of note is $request_uri, which contains the original URL as requested by the client, including the path and query arguments. The second is $uri, which holds the path as NGINX uses it internally. $uri$is_args$args are three separate variables that in combination recreate the same information that makes up the $request_uri variable.

When an NGINX caching reverse proxy receives a request, it will look up the URL in the cache based on the $request_uri variable (part of the default cache key) and serve the client a cached copy of the resource if one exists. If the resource can’t be served from the cache, NGINX will pass on the $request_uri to the origin server, save the response in its cache, and serve it to the client. These behaviors can be changed by modifying the proxy_cache_key or proxy_pass options.
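
As a rough sketch of that default behavior, a minimal caching reverse proxy could look like the following. The cache path, the zone name example_cache, the port, the cache lifetime, and the origin address are all placeholders invented for illustration. Note that the full default cache key is $scheme$proxy_host$request_uri, so the scheme and the upstream host sit alongside $request_uri.

```nginx
# Goes inside the http { } context of nginx.conf.
proxy_cache_path /var/cache/nginx/example keys_zone=example_cache:10m;

upstream origin_backend {
    server 192.0.2.10:8080;  # placeholder origin address
}

server {
    listen 8080;

    location / {
        proxy_cache example_cache;

        # Shown for clarity; this is already the default value.
        proxy_cache_key $scheme$proxy_host$request_uri;

        # Assumption: keep successful responses and redirects for 10 minutes
        # when the origin sends no caching headers of its own.
        proxy_cache_valid 200 301 302 10m;

        # No URI part after the upstream name, so the request URI is
        # passed to the origin in the form the client sent it.
        proxy_pass http://origin_backend;
    }
}
```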

The NGINX web server has an option called merge_slashes (on by default) that compresses duplicated forward slashes into a single forward slash. In practice this rewrites the internal $uri variable so that e.g. /hello///world will become /hello/world. Notably, it doesn’t trigger a client redirect, which is the method I prefer (see the referenced article above). The original incoming URL is preserved in the $request_uri variable, which remains unchanged. There are multiple other ways to rewrite the $uri variable, but this one option will serve as a stand-in for all of them in this article.
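
A quick way to watch the two variables diverge is a throwaway endpoint that echoes them back. This is only a sketch: the port and the /debug-uri location are made up for the demonstration.

```nginx
server {
    listen 8080;
    merge_slashes on;  # the default, spelled out for clarity

    # Prefix location, matched against the normalized $uri.
    location /debug-uri {
        default_type text/plain;
        # Echo the untouched client URL next to the normalized one.
        return 200 "request_uri: $request_uri\nuri: $uri$is_args$args\n";
    }
}
```

Requesting //debug-uri///world?a=1 from this server should report the original //debug-uri///world?a=1 for $request_uri and the merged /debug-uri/world?a=1 for $uri$is_args$args.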

By default, even with the merge_slashes option enabled, NGINX will not modify the request URL used for caching, the URL sent through the proxy, or the response coming back from the origin server. So far, everything is humming along nicely.

Mixing up the variables

A common strategy for improving a caching proxy’s cache hit ratio (the share of requests that can be served from the cache) is to remove and essentially ignore the query parameters from the URL. Instead of looking up results in the cache based on the full URL, only the path is used to evaluate cache hits (the $uri variable sans the $is_args$args variables in NGINX parlance). This strategy is achieved by changing both the proxy_cache_key and the proxy_pass options to use $uri.
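
A minimal sketch of that strategy, with both options switched to $uri, might look like this. The cache path, zone name, port, and origin address are hypothetical placeholders.

```nginx
# Goes inside the http { } context of nginx.conf.
proxy_cache_path /var/cache/nginx/example keys_zone=example_cache:10m;

upstream origin_backend {
    server 192.0.2.10:8080;  # placeholder origin address
}

server {
    listen 8080;

    location / {
        proxy_cache example_cache;

        # Cache on the normalized path only; the query string is ignored.
        proxy_cache_key $scheme$proxy_host$uri;

        # Forward the same normalized path, so the cache and the origin
        # agree on which URL is being served.
        # Caveat: $uri is percent-decoded, so paths containing encoded
        # characters need extra care before being forwarded like this.
        proxy_pass http://origin_backend$uri;
    }
}
```

Because proxy_pass contains a variable here, the host name is looked up at request time; defining it as an upstream group avoids the need for a resolver.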

However, if you don’t change both options then the origin server and the cache will disagree on which URL is being served. Say we’ve enabled the merge_slashes option and only modified the proxy_cache_key option. If a user then requests //document, the proxy will look in its cache for /document and, if it hasn’t been cached already, ask the origin for //document. The origin will handle the request for //document and send its response back to the proxy server, which will store it as /document in its cache and serve it to the user. Any future requests for /document will get the same resource.
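
The mismatch described above can be sketched as the following configuration, where only the cache key has been changed. This is a deliberately broken example; the names, paths, and addresses are placeholders.

```nginx
# Goes inside the http { } context of nginx.conf.
proxy_cache_path /var/cache/nginx/example keys_zone=example_cache:10m;

upstream origin_backend {
    server 192.0.2.10:8080;  # placeholder origin address
}

server {
    listen 8080;
    merge_slashes on;  # the default: $uri becomes /document

    location / {
        proxy_cache example_cache;

        # The cache is keyed on the normalized path: /document
        proxy_cache_key $scheme$proxy_host$uri;

        # The origin still receives the original request URI: //document
        proxy_pass http://origin_backend;
    }
}
```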

So where does this go wrong? Well, if the origin server handles //document and /document differently (which is to be expected as they’re distinctly different URLs), then the proxy server will begin serving the wrong resource to any subsequent users that request /document. This may be okay in some situations but in general, it will cause issues. Let us look at a more concrete example.

BunnyCDN and the infinite redirect

The broken setup in the above description is essentially what you’ll find at BunnyCDN, the winner of my budget CDN comparison. They rewrote the URLs they use for their cache keys without informing the origin server that the URLs had been normalized in a non-standard way.

In my case, my own normalization would result in an infinite redirect loop. Given my own knowledge of my URL structure, I could work out where the visitors meant to go and send them to the right location, e.g. by redirecting //document to /document. However, BunnyCDN would cache the redirect on //document as /document, causing an infinite redirect loop as /document would then serve a redirect to itself forever. Every visitor coming to the correct URL would also be served the cached version and get stuck in a redirect loop. As discussed earlier, redirect loops with HTTP/2 server-push can get pretty bad for some clients.

BunnyCDN has been receptive to changing its ways but hesitant to implement the changes. So I’ve had to implement a workaround that makes all redirects from URLs containing // non-cacheable. This slows down page loads in cases where the cache would normally handle the redirects. However, this workaround should never have been necessary. There are well-established standards for how proxy servers should behave and BunnyCDN simply isn’t living up to those expectations in this instance.
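
As an illustration only, assuming the origin is itself an NGINX server and that the CDN honors Cache-Control: no-store, such a workaround could be sketched like this. The server name, port, and document root are placeholders, and this is not necessarily the configuration actually in use.

```nginx
server {
    listen 8080;
    server_name origin.example;  # placeholder

    location / {
        # $request_uri still holds the URL exactly as the client sent it.
        # Match a double slash in the path part only, not the query string.
        if ($request_uri ~ "^[^?]*//") {
            # Ask caches not to store this particular redirect.
            add_header Cache-Control "no-store";

            # merge_slashes (on by default) has already collapsed the
            # slashes in $uri. $uri is also percent-decoded, so this
            # simple form only suits paths without encoded characters.
            return 301 $uri$is_args$args;
        }

        root /var/www/html;  # placeholder document root
    }
}
```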

Takeaway lessons

  • Don’t remove duplicate slashes in URLs other than your own. Extra slashes are distinctly different addresses even though some servers might serve the same content in some situations. It’s a destructive and non-standard normalization operation.
  • When rewriting URLs, be sure to either do it consistently and apply the rewritten URLs everywhere or simply don’t do it at all.
  • Choose service providers and software that deliver on web standards. You’ll save yourself having to debug weird and unexpected problems.