Using Archived Web Content in Your Research

Overview

Researchers are increasingly aware that content published on the web can change or disappear at any time. This guide identifies the leading archives of web content and provides context about these projects, and about web archiving in general, so that researchers can make informed use of the available resources.

Wayback Machine

The Internet Archive's Wayback Machine is the largest archive of the World Wide Web, covering more than 279 billion pages, dating back as far as 1996. Pages in the Wayback Machine are captured repeatedly over time. Many pages are available in hundreds or thousands of versions, corresponding to the content of the page on different dates.

The Internet Archive is developing a search engine for the archive as part of a grant-funded project, and a beta version of that search is currently available. It is also possible to access archived sites directly if you know the site's URL.
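As a rough illustration of direct access: archived pages are generally addressed by prefixing the original URL with the Wayback Machine's own address and, optionally, a timestamp. The sketch below only builds such addresses and makes no network requests. The helper name `wayback_url` is hypothetical, and the URL pattern reflects how the service has commonly been addressed rather than a guaranteed interface.

```python
from datetime import date
from typing import Optional

# Assumption: the commonly used public address prefix for archived pages.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(page_url: str, on: Optional[date] = None) -> str:
    """Return a Wayback Machine address for page_url.

    Without a date, the address points to the overview of all captures.
    With a date, the service is expected to redirect to the capture
    closest to that date.
    """
    if on is None:
        return f"{WAYBACK_PREFIX}/*/{page_url}"
    # Timestamps are written as digits only: year, month, day.
    return f"{WAYBACK_PREFIX}/{on:%Y%m%d}/{page_url}"


if __name__ == "__main__":
    print(wayback_url("example.com"))
    print(wayback_url("example.com", date(2010, 1, 15)))
```

Pasting either printed address into a browser should show whether captures exist for the page and, if a date is given, the version nearest to it.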

Save Page Now

It's easy to miss, but the Wayback Machine allows users to archive a page on demand. Look for the Save Page Now box in the bottom right corner of the Wayback Machine homepage or, if you're on a device with a small screen, scroll to the bottom of the page. If you enter a URL, the page will be captured immediately, provided the site owner allows automated archiving.
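For researchers who want to request captures from a script rather than the homepage form, the same function can be reached through a web address. The following is a minimal sketch, assuming that a plain request to the `save` address is enough to trigger a capture; the helper names `save_page_now_url` and `request_capture` are hypothetical, and the service may rate-limit or refuse requests.

```python
import urllib.request

# Assumption: the public Save Page Now address, followed by the page to capture.
SAVE_PREFIX = "https://web.archive.org/save/"


def save_page_now_url(page_url: str) -> str:
    """Build the address that asks the Wayback Machine to capture page_url."""
    return SAVE_PREFIX + page_url


def request_capture(page_url: str, timeout: float = 60.0) -> int:
    """Send the capture request and return the HTTP status code.

    urlopen raises an exception if the service refuses the request,
    for example because the site owner does not allow archiving.
    """
    target = save_page_now_url(page_url)
    with urllib.request.urlopen(target, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Only prints the address; call request_capture() to contact the service.
    print(save_page_now_url("https://example.com/"))
```

Keeping the address-building step separate from the network call makes it possible to review a list of capture requests before sending any of them.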

Limitations

You may find that a page you are interested in is not available in the Wayback Machine, or that the content seems incomplete. There are a few reasons why this might be the case.

  • The Wayback Machine may not have been aware of the site's existence at the time you are interested in. The archive uses a web crawler, similar to those used by search engines, which follows the links on a page to identify other pages that can be archived. This approach can miss pages that are not linked from other pages. Using the Save Page Now function can help the crawler find pages that it might not be aware of otherwise.
  • The site's owner may forbid web crawlers in general, or the Wayback Machine's crawler in particular, from accessing the page.
  • There may be technical features of the page that the Wayback Machine cannot capture. These often include embedded video or interactive components of a page.
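The second reason above, a site owner forbidding crawlers, is typically expressed in a site's robots.txt file. The sketch below uses Python's standard `urllib.robotparser` module to check what a given set of rules says about a crawler. The rules shown are invented for illustration, and the user-agent name `ia_archiver` is an assumption (a name historically associated with Internet Archive crawling); how the Wayback Machine's crawler actually treats such rules is not covered here.

```python
from urllib.robotparser import RobotFileParser

# Invented example rules: block one named crawler everywhere,
# and block every other crawler from the /private/ directory only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def crawler_allowed(robots_txt: str, user_agent: str, page_url: str) -> bool:
    """Report whether the given robots.txt rules permit user_agent to fetch page_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)


if __name__ == "__main__":
    page = "https://example.org/page.html"
    # "ia_archiver" is an assumed crawler name, used here for illustration.
    print(crawler_allowed(EXAMPLE_ROBOTS_TXT, "ia_archiver", page))
    print(crawler_allowed(EXAMPLE_ROBOTS_TXT, "SomeOtherBot", page))
```

To apply the same check to a real site, the text of that site's robots.txt file (usually found at the root of the site) can be passed in place of the example rules.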