Always care for old links after moving domain and reorganizing URLs

When I moved this blog to a new domain, I took the opportunity to restructure and reorganize all the existing URLs. That in turn gave me a reason to jot down some thoughts on things you can do to help make sure your old addresses keep working for a very long time.

I usually agree with the recommendations of the World Wide Web Consortium (W3C). However, even the W3C has chosen to change the URL of its Cool URIs don’t change recommendation document on URI style. (They moved it from HTTP to HTTPS, but redirected the old address to the new page.)

This blog is built with WordPress, but I’ve customized it heavily to structure the blog more like a website built with Textpattern. This involved a lot of custom code, and while moving domain I took the opportunity to simplify things and get rid of the added complexity of mimicking another content management system. As part of this work, I chose to rewrite a few hundred links. The exact details aren’t what’s important here, other than the fact that a lot of URLs changed and some content and navigational structures were removed.

Mapping old URLs and creating redirects

I started by carefully mapping out every URL I’d ever created, plus every URL on the old domain that was in active use (including addresses external websites had mislinked). I then went through all the URLs that had stuck around from the static CMS I used before WordPress.

With a complete map of every existing URL, I mapped each and every URL to its new home and set up HTTP 301 Permanent Redirects. I’d removed some navigational structures (all sections and most tags), so there was quite a lot of manual work involved. I didn’t have exact equivalents for every old page, especially the section archives, and marked these as removed with a proper HTTP 410 Gone response.

I took care to respond with a 301 redirect where the content had just moved to a new address, and an HTTP 303 See Other redirect where the content had been replaced by something that wasn’t an exact replica of the old content. Sending the correct redirect responses means clients (well, search engine crawlers) can have high confidence in my redirects.
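
To make this concrete, here’s a minimal sketch of how such a URL map might be represented in Python. The paths and domains are made up for illustration:

```python
# Hypothetical map from old paths to a (status, destination) pair.
# 301: content moved as-is; 303: replaced by something similar;
# 410: gone for good, so there is no destination.
REDIRECT_MAP = {
    "/topics/linux/": (301, "https://www.example.com/topic/linux/"),
    "/sections/reviews/": (303, "https://www.example.com/archive/"),
    "/tags/retired-tag/": (410, None),
}
```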

Every redirect leads directly to the new location without hopping through a chain of redirects. This speeds up page loading times and helps reinforce the new canonical address. I’ve often seen webmasters end up with long and intricate redirect chains as they move through CMSes and domains over the years. That’s undesirable for everyone involved, and it only took a few extra minutes to test that every old address required just one redirect to get where it was going.
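
That test is easy to automate. Here’s a rough sketch using the third-party requests library and the hypothetical REDIRECT_MAP from above, verifying that each old address answers with the expected status and that the Location header points straight at the final destination:

```python
import requests  # third-party: pip install requests

OLD_ORIGIN = "https://old.example.com"  # hypothetical old domain

for old_path, (expected_status, destination) in REDIRECT_MAP.items():
    response = requests.get(OLD_ORIGIN + old_path, allow_redirects=False)
    assert response.status_code == expected_status, old_path
    if destination is not None:
        # One hop only: the Location header must be the final address...
        assert response.headers["Location"] == destination, old_path
        # ...and the final address must not redirect any further.
        final = requests.get(destination, allow_redirects=False)
        assert final.status_code == 200, destination
```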

The redirection is handled by a system separate from my CMS that answers connections coming in to the old domains. I chose to handle the redirects outside of WordPress, and outside the server instance WordPress is hosted on, so as not to slow down Apache’s URL pre-processing on the new main domain.
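
The exact setup isn’t important, but to illustrate the idea: a standalone redirect service reusing the REDIRECT_MAP sketched earlier can be as small as this sketch built on Python’s bundled WSGI server (not production-ready):

```python
from wsgiref.simple_server import make_server

STATUS_LINES = {
    301: "301 Moved Permanently",
    303: "303 See Other",
    404: "404 Not Found",
    410: "410 Gone",
}

def app(environ, start_response):
    # Paths that were never part of the old site get a plain 404.
    status, destination = REDIRECT_MAP.get(environ["PATH_INFO"], (404, None))
    headers = []
    if destination is not None:
        headers.append(("Location", destination))
    start_response(STATUS_LINES[status], headers)
    return [b""]

if __name__ == "__main__":
    make_server("", 8080, app).serve_forever()
```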

It’s important to keep an eye on your server logs for the first couple of weeks after rewriting many addresses, and monitor for any unexpected HTTP 404 Not Found hits. After a week, it seemed like I’d only missed a single URL!
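
Something like this rough log-scanning sketch makes missed redirects stand out; the log path and the common/combined log format are assumptions:

```python
import re
from collections import Counter

# Matches the request path and status code in a common/combined log line.
LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3})')

misses = Counter()
with open("/var/log/apache2/access.log") as log:
    for line in log:
        match = LINE.search(line)
        if match and match.group("status") == "404":
            misses[match.group("path")] += 1

# The most frequently missed addresses are the ones worth fixing first.
for path, count in misses.most_common(20):
    print(count, path)
```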

Informing search engines

The vast majority of traffic to my blog undeniably originates from search engines. This, of course, made it a priority not to upset these Gods of the internet. I was quite concerned about changing both my URL structure and my domain at the same time.

Search engines can follow web-standard redirects just as well as any other client. However, they also provide dedicated tools for notifying them when you move origins (whether from HTTP to HTTPS or from one domain to another).

The trick here is, to use some disgusting SEO lingo, to preserve the domain authority and keep the link juice flowing through the redirects. This shouldn’t be a problem with regular redirects, but you risk losing out on traffic while the search engines work out that you’ve moved everything to a new home.

The major search engines all have dedicated tools for this:

  • Google Search Console has a Change of Address tool.
  • Bing Webmaster Tools has a Site Move tool.
  • Yandex.Webmaster lets you register a move to a new domain.

Maybe you don’t care about Bing and Yandex because Google has ≈90 % global market share. However, ignoring 1 in 10 potential visitors is a stance worth reconsidering. Both Bing and Yandex license their search results to … ehm, better-liked search engines such as the popular upstart DuckDuckGo.

After I began redirecting everything to the new URLs and notified the search engines, I noticed a dip in traffic from Google for five weeks. Bing stopped sending any traffic to the blog for about two weeks. Traffic from both Bing and Google recovered to the old levels pretty fast, and is now even trending upwards compared to the previous numbers.

Let the librarians know

The Internet Archive tries to preserve the entire web by keeping copies of as many webpages as they can possibly get their spiders on.

I seem to be somewhat worried about my own mortality, and try to help the Internet Archive out by submitting copies of everything I publish using my Internet Archive plugin for WordPress. This makes my articles, unimportant as they may be, about as immortal as they can be: they can still be retrieved from the Internet Archive if I keel over and no one is around to pay the hosting bills. (Morbid, I know.)

However, what about pages that move URLs? Like a proper netizen, the Internet Archive supports following redirects. That means that, armed with an old URL, you can still find the page as it appeared on the website where it was originally published. You can also track changes and updates to articles through the years, CMSes, and domains I’ve moved between. That is pretty cool.

To make sure this would work, I submitted every one of the old URLs I’d mapped out earlier to the Internet Archive for archival. This had the effect of not only preserving the redirects, but also submitting the articles under their new addresses at the same time.
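
Submitting them one by one can be scripted against the Wayback Machine’s public save endpoint. A minimal sketch, with the URL list standing in for the full map from earlier:

```python
import time
import requests  # third-party: pip install requests

old_urls = [
    "https://old.example.com/topics/linux/",  # hypothetical; use the real map
]

failed = []
for url in old_urls:
    # Requesting https://web.archive.org/save/<url> asks the Wayback
    # Machine to capture that address.
    response = requests.get("https://web.archive.org/save/" + url)
    if response.status_code != 200:
        failed.append(url)
    time.sleep(5)  # be polite; the service throttles rapid-fire requests

# Keep the failures around so they can be resubmitted later.
print("failed:", failed)
```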

Unfortunately, it would seem the Internet Archive has been having some availability issues lately. The save function in the Wayback Machine has been unreliable as of early January. About half my archival requests failed, so I’ll wait a bit and resubmit them all in a few weeks. Hopefully, they’ll have addressed whatever issues they’re having by then.

Syndication feeds

All feed readers follow redirects, so you never lose any subscribers. Some clients sensibly enforce a limit on how many redirects they’ll follow, but I’ve already taken care to ensure there’s never more than one redirect from an old URL to the new website.

I’d appreciate it, though, if more feed readers fully implemented the HTTP 301 Permanent Redirect status code and updated the subscription address instead of blindly following redirects on every refresh. Clients that encounter a 301 redirect are supposed to update their link references and start using the new address rather than the address that returned the 301. This speeds up feed refreshes and helps ensure the subscription keeps working, as old redirects have a tendency to stop working after a while.
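
In rough terms, here’s what I’d like clients to do. A sketch with a hypothetical in-memory subscription store; a real reader would persist this between refreshes:

```python
import requests  # third-party: pip install requests

subscriptions = {"my-blog": "https://old.example.com/feed/"}

def refresh(name):
    url = subscriptions[name]
    response = requests.get(url, allow_redirects=False)
    hops = 0
    while response.status_code in (301, 308) and hops < 5:
        # Permanent redirect: remember the new address for future
        # refreshes. (A temporary 302/307 would be followed without
        # updating the stored address.)
        url = response.headers["Location"]
        subscriptions[name] = url
        response = requests.get(url, allow_redirects=False)
        hops += 1
    return response
```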

This isn’t specific to feed readers, of course. As part of routine quality control for any website, you should check links for decay and redirects, and update the links accordingly.

Another aspect of syndication feeds that can cause unique URL challenges is the requirement that each feed entry has a universally unique identifier. These identifiers have to be valid URIs as per the Atom and RSS standards. By default, WordPress uses the blog address and the post’s ID number in the database to construct a URL it uses for these identifiers.

The problem here is that these identifiers should never change, even when the URL of the blog entry changes. Foreseeing that I’d want to change my blog’s URL without leaving behind identifiers referring to the old domain, I’ve been using actual UUIDs through the officially registered urn:uuid:<uuid> URI scheme.
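
As an illustration, Python’s standard uuid module can mint identifiers in exactly this form. The crucial part, whatever the platform, is to generate the value once when the entry is first published, store it, and never regenerate it:

```python
import uuid

# Generate once at publication time and store alongside the post;
# the identifier must never change, even if the post's URL does.
entry_id = uuid.uuid4().urn
print(entry_id)  # e.g. urn:uuid:0f81e5b6-0c44-4a75-9010-5ce1f8f9bd86
```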

Using URNs here can cause problems with some stupid clients that expose the UUID and assume it’s a URL instead of getting the URL from the correct element. Luckily, these clients are few and far between. On the other hand, using a URN removes any ambiguity about whether the ID can be treated as a URL: it can’t. Other stupid clients that extract a feed entry’s URL by finding the first match for http or https in the entry (yes, these do exist) now find the correct URLs, as there’s no such ambiguity with a UUID.

For anyone interested in this solution, check out the urn:uuid as the_guid plugin for WordPress.

To recap the main points of this article:

  • Moving domain no longer loses you any search traffic, assuming every old URL is mapped correctly. Go get a new domain if you want one!
  • Avoid redirect chains, and update links to eliminate redirects.
  • Pay attention to logs after redirecting many addresses.
  • All the major search engines have tools for letting them know about domain-changes.
  • Let the Internet Archive know so they get the latest version of your pages in their new home.
  • You don’t lose feed subscribers when you redirect the old feed address to the new address.
  • Don’t use site-specific URLs as IDs; go with a URN.

I hope these notes might come in handy for someone some day. I wrote this article as I was planning what I needed to do for my own website.