Racks of servers featuring the Internet Archive project’s logo. 🅭

How to archive a page or an entire website in the Internet Archive

The Internet Archive’s Wayback Machine is one of the web’s greatest free resources. The Internet Archive does its best to preserve our online history by archiving public webpages in their collection. You can find long-gone blog posts, old website designs from the late ’90s, retracted news stories, and all sorts of other treasures in their collection.

Archiving everything is an impossible task. There are simply too many webpages being published on too many websites for any archiving service to keep up with it all. You can help the Internet Archive’s archival effort by submitting webpages that are important to you, or tweets, images, and pages you’ve created and want to preserve.

You can archive a single page using the form on the Wayback Machine’s front page.
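Submitting a single page can also be scripted. The following is a minimal sketch, assuming the Wayback Machine’s Save Page Now feature accepts a plain GET request to https://web.archive.org/save/ followed by the page address. The function names and the User-Agent string are made up for illustration and are not part of any official client.

```python
import urllib.request

# Assumption: Save Page Now captures a page when this prefix plus the page
# address is requested with a plain GET.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def build_save_url(page_url: str) -> str:
    """Return the address that asks the Wayback Machine to capture page_url."""
    page_url = page_url.strip()
    if not page_url.startswith(("http://", "https://")):
        raise ValueError("expected an absolute http(s) URL")
    return SAVE_ENDPOINT + page_url


def submit(page_url: str, timeout: float = 60.0) -> int:
    """Request a capture and return the HTTP status code (needs network access)."""
    request = urllib.request.Request(
        build_save_url(page_url),
        headers={"User-Agent": "archive-example/0.1"},  # hypothetical identifier
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Only prints the request address; call submit() to actually send it.
    print(build_save_url("https://www.ctrl.blog/entry/how-to-internet-archive.html"))
```

Keeping the address construction separate from the network call makes it possible to check the script without sending any requests to the Internet Archive.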

Tools like the Internet Archive plugin for WordPress can automate the submissions of webpages that you create. There are similar tools available for other content management systems, and also plenty of web browser extensions available.

To archive an entire website, you can either submit each URL individually or use a tool that automates the submissions.

The best tool for automating larger submissions is Jacob Burenstam’s wayback_archiver. It’s a Ruby library that comes with an easy-to-use command-line interface. It can crawl a website on its own and submit URLs for archival as it works its way through the site, or it can parse an XML sitemap file, when one is available, to skip the need for crawling.
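To illustrate why a sitemap removes the need for crawling, here is a rough sketch of pulling page addresses out of a sitemap document. It is not how wayback_archiver does it internally; the function name and the example.com addresses are assumptions, and only the standard sitemap namespace is taken as given.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> value found in a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iter(SITEMAP_NS + "loc")
        if loc.text and loc.text.strip()
    ]


# Hypothetical two-page sitemap used only for demonstration.
EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/about.html</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in urls_from_sitemap(EXAMPLE):
        print(url)
```

For a sitemap index, the returned addresses point to further sitemap files rather than pages, so a complete tool would fetch and parse those in turn.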

Running wayback_archiver in --auto mode tells it to auto-discover XML sitemaps from the website’s robots.txt file, or fall back to the much slower process of crawling the website to discover pages. It can take several hours, days, or even weeks to complete, depending on the size and speed of the website and the number of pages being archived.
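Sitemap discovery relies on the Sitemap lines that many websites list in their robots.txt file. The sketch below shows the general idea, assuming the file’s contents have already been downloaded as text. The function name is hypothetical and this is not wayback_archiver’s actual code.

```python
def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs named on 'Sitemap:' lines, in order of appearance."""
    found = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and padding. This assumes the sitemap
        # addresses themselves contain no '#' character.
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /private/\nSitemap: https://www.example.com/sitemap.xml\n"
    print(sitemaps_from_robots(sample))
```

An empty result is the situation where a tool has to fall back to crawling.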

The following example command would submit every page on Ctrl blog to the Internet Archive:

wayback_archiver https://www.ctrl.blog/ --auto

You can also submit individual URLs using the --url option:

wayback_archiver https://www.ctrl.blog/entry/how-to-internet-archive.html --url

The Internet Archive offers no guarantees that the URLs you submit will be accepted into their collection. Aggressively submitting a lot of URLs from a single IP address in a very short amount of time may be flagged as suspicious.
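If you script your own submissions, one way to avoid bursts is to pause between requests. The following is only a sketch: the five-second default is an arbitrary assumption rather than a documented limit, and the submit argument stands in for whatever function you use to send a URL.

```python
import time
from typing import Callable, Iterable


def submit_slowly(
    urls: Iterable[str],
    submit: Callable[[str], object],
    delay_seconds: float = 5.0,  # arbitrary assumption, not a documented limit
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call submit(url) for each URL, pausing between calls; return the count."""
    count = 0
    for url in urls:
        if count:
            sleep(delay_seconds)  # pause before every submission except the first
        submit(url)
        count += 1
    return count


if __name__ == "__main__":
    # print stands in for a real submission function in this demonstration.
    submit_slowly(["https://www.example.com/"], print, delay_seconds=1.0)
```

The sleep function is passed in as a parameter so the pacing can be checked without actually waiting.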

You won’t be able to submit any pages from websites that have actively opted to block the Internet Archive from accessing them. The Internet Archive is one of the most frequently blocked bots on the web, as not everyone understands its purpose or sees its immense value to the web community.

Please consider making a donation to the Internet Archive if you’re submitting a large collection of pages or just find their service as invaluable as I do. Please consider giving a monthly donation if you intend to submit URLs on an ongoing basis.