Using Archived Web Content in Your Research

Archive-It

Archive-It is a web archiving service created by the Internet Archive that helps institutions archive and provide access to cultural heritage on the web. Archive-It works with 400 partner organizations, including academic, state, and public libraries, museums, and historical societies. Collections often include archived news articles, blogs, social media, and other websites about topics of interest. This resource is most useful for researching the specific topics covered by the digital collections these institutions have created. You can browse the site or run a full-text search by collecting organization, collection, site, or page text. Examples of collections include the #blacklivesmatter Web Archive, a Boston Marathon Bombing collection, and a collection about the Supreme Court hearings on DOMA/Prop 8 in 2013.

Archive Team

Archive Team is a loosely organized group primarily focused on preserving user-generated web content. Many web services allow users to upload and create content; when services are discontinued, that content is at risk. Archive Team also works to identify and fill gaps in the Wayback Machine's archive.

Current and past projects include archives of GeoCities, LiveJournal, and AOL, along with projects focused on wikis, news articles, and government files available by FTP. Completed projects may contribute content to the Wayback Machine or other services, but many are also available to download as entire collections from the Internet Archive. Downloadable collections may be particularly relevant to researchers who study the web and how it is used, or to any research that benefits from direct access to a large number of original files.

ArtBase by Rhizome

Rhizome is an organization that focuses on commissioning, presenting, and preserving digital art. It maintains a collection of 2,000+ born-digital artworks in ArtBase, its publicly accessible online archive. This is both a valuable resource and an example of the labor-intensive, manual nature of accurately archiving complex material. Many of the artworks were created using obsolete technologies, and Rhizome archivists often use technically rigorous strategies like rewriting code and emulating out-of-date operating systems in order to display the works in modern browsers.

Library of Congress Twitter Archive

In 2010 the Library of Congress announced that it would partner with Twitter to create an archive of every public tweet posted since the social media company’s inception in 2006. While that archiving continues, the Library of Congress has struggled to make the content accessible to researchers. The project has proved more challenging than anticipated, partly because tweets themselves have become more complicated with the addition of embedded photos and videos, as well as the retweet feature. The volume of tweets has also grown considerably, from 55 million a day in 2010 to nearly 500 million a day in 2012. Although the service's popularity has plateaued in recent years, most sources still estimate that about 500 million tweets are created each day.

In a 2013 update on the project, the Library of Congress said it had not yet found a cost-effective plan for creating an interface that would allow researchers to access its massive collection of captured tweets, and that it was still working on a way to index the tweets and make them searchable.

This is an important project to watch, and an example of what might be possible in archiving user-provided web content.

Create Your Own Web Archive

Some researchers may find that their work requires creating a local web archive for their own use. For instance, a researcher might want to analyze files directly in a way that is not possible in public archives like the Wayback Machine. There are many ways to achieve this goal, but two free applications stand out as good starting points for many users.

Windows and Linux users

Mac and iOS users

More advanced users with specific requirements will also find that there are a number of command line tools available for web archiving.
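
For readers comfortable with a little scripting, the idea of a local archive can be illustrated in a few lines of Python. This is only a minimal sketch and not one of the tools referred to above: the function names `fetch` and `archive_page` and the file-naming scheme are assumptions made for this example. It saves a single page's raw HTML along with a small metadata record (URL, capture time, checksum) so that each capture can be identified later. It does not save images, stylesheets, or linked pages, and it does not produce the WARC files that dedicated archiving tools typically write.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from urllib.request import urlopen


def fetch(url):
    """Download the raw bytes of a page (requires network access)."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def archive_page(url, archive_dir, fetcher=fetch, now=None):
    """Save one capture of `url` and a metadata record describing it.

    `fetcher` and `now` can be overridden, which makes the function
    easy to try out without touching the network.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d%H%M%S")  # e.g. 20200102030405

    body = fetcher(url)
    digest = hashlib.sha256(body).hexdigest()  # checksum of the captured bytes

    folder = Path(archive_dir)
    folder.mkdir(parents=True, exist_ok=True)

    # Hypothetical naming scheme: capture time plus a short checksum prefix.
    name = f"{stamp}-{digest[:12]}"
    (folder / f"{name}.html").write_bytes(body)

    record = {
        "url": url,
        "captured": stamp,
        "sha256": digest,
        "bytes": len(body),
        "file": f"{name}.html",
    }
    (folder / f"{name}.json").write_text(json.dumps(record, indent=2))
    return record
```

Calling `archive_page("https://example.com/", "my_archive")` would write one `.html` file and one `.json` file into a `my_archive` folder. Running it again later adds a new pair of files instead of overwriting the earlier capture, which is the basic property that separates an archive from a simple download.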