Researchers who have limited experience with the technical process of creating websites may find it surprising that websites can be so difficult to archive and preserve. Understanding some of these challenges can make using web archives less confusing and more transparent.
Unlike many of the digital objects that we use in our daily lives - PDFs, Word files, MP3 files, etc. - webpages are not contained in a single file or package that can easily be moved about from one place to another. Instead, when a web browser such as Chrome or Firefox displays a web page, the browser first retrieves a single file, but is then instructed by that file to retrieve any number of additional files, possibly from disparate locations, which are then interpreted by the browser in order to produce the page that you see.
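To make this concrete, the short Python sketch below downloads a single page and lists the additional files its HTML instructs a browser to fetch. It uses only the standard library; the URL example.com is a placeholder, and a real page will typically reference many more resources than this simple parser reports.

```python
# A minimal sketch of how a single HTML file points a browser to additional
# resources. It downloads one page and lists the other files a browser would
# be told to fetch (stylesheets, scripts, images). example.com is a placeholder.
from html.parser import HTMLParser
from urllib.request import urlopen


class ResourceLister(HTMLParser):
    """Collects the URLs of additional files referenced by a page."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and "href" in attrs:              # stylesheets, icons, etc.
            self.resources.append(attrs["href"])
        elif tag in ("script", "img") and "src" in attrs:  # scripts and images
            self.resources.append(attrs["src"])


html = urlopen("https://example.com/").read().decode("utf-8", errors="replace")
parser = ResourceLister()
parser.feed(html)
print(f"{len(parser.resources)} additional resources referenced:")
for url in parser.resources:
    print(url)
```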
There are many other complications, particularly for pages that respond to user input and interaction, but these stem from the same basic challenge: a webpage is not a single portable file that can easily be moved from one location (e.g. its original home) to another (e.g. a web archive). It is instead a collection of files, assembled at the request of a user and interpreted in a web browser or other software.
The Wayback Machine's software, which is also the basis for many other successful web archiving projects, addresses this issue by creating a portable file out of the webpage. Those files follow the Web ARChive (WARC) format, which has become the standard file format for web archiving. WARC files cannot be opened directly in a web browser; they require dedicated playback software. When you access a page in the Wayback Machine or Perma, you are viewing content that was captured into a WARC file from the original page and played back in the Internet Archive's WARC viewer. On the Wayback Machine, each timestamped snapshot of a page corresponds to the WARC records created at the moment of capture.
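For readers who want to look inside a WARC file, the following is a minimal sketch using the open-source warcio library (one of several tools that can read the format). The filename is a placeholder; the script simply lists the URL and capture date of each archived response in the file.

```python
# A minimal sketch of inspecting a WARC file with the open-source warcio
# library (pip install warcio). The filename is a placeholder; any WARC
# produced by a web archiving crawl should work.
from warcio.archiveiterator import ArchiveIterator

with open("example-capture.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # 'response' records hold the archived pages and their resources.
        if record.rec_type == "response":
            uri = record.rec_headers.get_header("WARC-Target-URI")
            date = record.rec_headers.get_header("WARC-Date")
            print(date, uri)
```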
Not all web archiving projects take the same approach. Some are able to recreate and preserve older web content without packaging it into WARC files, an approach that can preserve the functionality of unique and complex websites, as exemplified by the Rhizome project, which archives art that uses the Internet as its medium. It may, however, require closer collaboration between the content owner and the archiving organization.
Because web content may be subject to copyright, many web archiving efforts rely on the permission of the content owner to publish archived versions of pages and sites. For large, automated efforts like the Wayback Machine, this permission has historically been managed using a convention known as the robots exclusion standard, or robots.txt. However, in April 2017, the Internet Archive announced its decision to stop relying on the robots.txt standard.
The robots.txt standard allows website owners to communicate with automated agents such as web crawlers and harvesters, telling them which parts of a site they are welcome to visit. These robots are used for a variety of purposes, including indexing websites for search engines. If, for example, a site owner wishes to exclude all or part of a site from Google's search engine, he or she can indicate that using the robots.txt standard, and a compliant robot will honor the request.
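The sketch below shows how a compliant robot evaluates robots.txt rules, using the parser in Python's standard library. The rules and user-agent names are illustrative only and are not taken from any real site.

```python
# A minimal sketch of how a compliant robot interprets robots.txt rules,
# using Python's standard-library parser. The rules and user-agent names
# here are illustrative, not taken from any real site.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /drafts/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/private/report.html"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/articles/news.html"))   # True
print(rp.can_fetch("SomeOtherBot", "https://example.com/drafts/post.html"))  # False
```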
Prior to the Internet Archive's recent change in policy, a significant number of sites were excluded from the Wayback Machine because their owners had chosen to forbid all robots, or to specifically forbid the robot that the Internet Archive uses to build the Wayback Machine. This situation raised an important question - if publishers are able to opt out of archiving, how can their content be preserved? Should it be possible for publishers to retroactively un-publish content that was previously public, effectively removing it from the historical record?
The Internet Archive is not the first major web archiving project to choose not to follow robots.txt instructions. As an example, Archive Team ignores the convention, outlining their reasons in the document below.
One particularly controversial aspect of the Internet Archive's prior implementation of the robots.txt standard was the retroactive application of new robots.txt instructions to previously archived web content. For example, if the Internet Archive collected a copy of a page in 2015 and then returned to the site in 2016 to find that the robots.txt file had been changed to exclude its robot, the Internet Archive would remove public access to the 2015 copy, even though the site's robots.txt file permitted the capture at the time it was made. One criticism of this policy is that it allowed third parties to remove content from the Wayback Machine simply by acquiring the domain on which the content had been published and changing the robots.txt file to exclude the Internet Archive's robot.
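As a rough illustration of that retroactive policy (and not of the Wayback Machine's actual code, which is not reproduced here), the sketch below consults a site's current robots.txt before deciding whether to serve an old snapshot. The user-agent name and URLs are placeholders.

```python
# A simplified illustration (not the Wayback Machine's actual implementation)
# of the retroactive policy described above: before serving an old snapshot,
# the playback system consults the site's *current* robots.txt. The user-agent
# name "examplearchivebot" and the URLs are placeholders.
from urllib.robotparser import RobotFileParser

CRAWLER_AGENT = "examplearchivebot"
SNAPSHOT_URL = "https://example.com/page.html"   # captured in 2015

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")     # the live site's rules today
rp.read()

if rp.can_fetch(CRAWLER_AGENT, SNAPSHOT_URL):
    print("Serve the 2015 snapshot to the reader.")
else:
    # Under the old policy, a later exclusion hides the earlier capture.
    print("Withhold the 2015 snapshot, even though it was collected legitimately.")
```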