Web Archiving

Why archive the web?

Web archiving is the practice of capturing and preserving web pages and other interactive or dynamic digital objects, whose content would otherwise change or disappear over time.

Limitations

robots.txt is a file that many websites publish at their root to tell crawlers which pages should not be crawled or captured
Crawls need limits, such as a maximum number of pages, a time limit, and how many levels from the seed URL the crawler follows
Website creators can determine what is publicly accessible or kept private
Intellectual property laws often apply to website content, so crawling and capturing it may infringe those laws
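
The first two limitations above can be illustrated with a minimal sketch of a bounded crawl. It is not a real archiving crawler: the link graph is an in-memory dictionary instead of live HTTP requests, and the function name bounded_crawl, the user agent "example-archiver", the example.org URLs, and the default limits are all assumptions made for illustration. Only urllib.robotparser from the Python standard library is used to interpret the robots.txt rules.

```python
import time
from collections import deque
from urllib.robotparser import RobotFileParser


def bounded_crawl(seed, links, robots_lines, max_pages=100, max_depth=2,
                  max_seconds=60.0, agent="example-archiver"):
    """Breadth-first crawl over an in-memory link graph (hypothetical helper).

    links maps each URL to the URLs it links to; robots_lines holds the
    lines of the site's robots.txt file.
    """
    rules = RobotFileParser()
    rules.parse(robots_lines)            # load the robots.txt rules

    deadline = time.monotonic() + max_seconds
    captured = []                        # pages we were allowed to capture
    seen = {seed}
    queue = deque([(seed, 0)])           # (url, levels away from the seed)

    while queue and len(captured) < max_pages and time.monotonic() < deadline:
        url, depth = queue.popleft()
        if not rules.can_fetch(agent, url):
            continue                     # robots.txt asks crawlers to skip this page
        captured.append(url)
        if depth == max_depth:
            continue                     # depth limit reached: do not follow links
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return captured


# Hypothetical site used only for this example.
ROBOTS = ["User-agent: *", "Disallow: /private/"]
LINKS = {
    "https://example.org/": ["https://example.org/a",
                             "https://example.org/private/x"],
    "https://example.org/a": ["https://example.org/b"],
}

print(bounded_crawl("https://example.org/", LINKS, ROBOTS, max_depth=1))
```

With max_depth=1 the sketch captures the seed and the page it links to directly, skips the page under /private/ because of the Disallow rule, and never reaches the page two levels away.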

Library of Congress on Web Archiving

Types of Web Archiving

In general, there are three types of web archiving:

Remote harvesting: using web crawlers to browse and capture webpages
Database harvesting: exporting data from a database into a standard schema like XML so that it can be queried later
Transactional harvesting: capturing webpages in response to events, such as a page being viewed, that trigger a capture
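
As a rough illustration of database harvesting, the sketch below exports one table from a SQLite database into a simple XML document using only the Python standard library. The function name export_table_to_xml, the element names table, row, and field, and the sample pages table are assumptions made for this example. Real archiving projects would use an agreed preservation schema instead of this ad hoc layout.

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_table_to_xml(conn, table):
    """Return the rows of one table as an XML string (hypothetical helper)."""
    # The table name is interpolated into the SQL text, so accept only
    # plain identifiers to avoid SQL injection.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")

    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]

    root = ET.Element("table", name=table)
    for row in cursor:
        record = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            field = ET.SubElement(record, "field", name=column)
            field.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical database used only for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER, title TEXT)")
conn.execute("INSERT INTO pages VALUES (1, 'Home')")
print(export_table_to_xml(conn, "pages"))
```

Because the output is ordinary XML, it can be queried later with standard tools such as XPath without access to the original database software.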