
Web Archiving

Why archive the web?

Web archiving is a method for capturing and preserving dynamic content, such as web pages and other interactive digital objects.

Limitations

  • A robots.txt file placed at a site's root can instruct crawlers that certain pages must not be crawled or captured
  • Limits must be applied to crawls, such as the number of pages, the time allowed, and how many link levels from the seed the crawler follows (see the sketch after this list)
  • Website creators decide what content is publicly accessible and what is kept private
  • Website content is often protected by intellectual property law, and crawling or reproducing it may infringe on those rights
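The sketch below is a minimal illustration of how such limits might be enforced, using only the Python standard library. The seed URL, page limit, and depth limit are illustrative assumptions, the crawl stays on the seed's host, and captured pages are only printed rather than written to an archive format such as WARC.

```python
# A minimal sketch (not a production crawler): respect robots.txt, cap the
# number of pages captured, and stop following links past a fixed depth
# from the seed. SEED, MAX_PAGES, and MAX_DEPTH are illustrative assumptions.
from collections import deque
from html.parser import HTMLParser
from urllib import request, robotparser
from urllib.parse import urljoin, urlparse

SEED = "https://example.org/"   # hypothetical seed URL
MAX_PAGES = 50                  # page-count limit (assumption)
MAX_DEPTH = 2                   # link levels from the seed (assumption)


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, max_pages, max_depth):
    robots = robotparser.RobotFileParser(urljoin(seed, "/robots.txt"))
    robots.read()

    seen, captured = {seed}, []
    queue = deque([(seed, 0)])              # (url, depth from seed)

    while queue and len(captured) < max_pages:
        url, depth = queue.popleft()
        if not robots.can_fetch("*", url):  # honor robots.txt exclusions
            continue
        try:
            with request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        captured.append(url)                # a real crawler would write a WARC record here

        if depth < max_depth:               # depth limit: do not expand past max_depth
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                link = urljoin(url, href)
                # stay on the seed's host and avoid revisiting pages
                if urlparse(link).netloc == urlparse(seed).netloc and link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return captured


if __name__ == "__main__":
    for page in crawl(SEED, MAX_PAGES, MAX_DEPTH):
        print(page)
```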

[Embedded resource: Library of Congress on Web Archiving]

Types of Web Crawling

In general, there are three types of web archiving:

  1. Remote harvesting: using web crawlers to browse and capture webpages
  2. Database harvesting: exporting data from a database into a standard schema such as XML so the records can be queried later (see the sketch after this list)
  3. Transactional harvesting: capturing webpages in response to events, such as a page being viewed, that trigger a capture
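As a rough illustration of database harvesting, the sketch below uses a hypothetical in-memory SQLite table named "pages" (with assumed columns url, title, and captured) and exports its rows into a simple XML document so the records can be queried independently of the live database.

```python
# A minimal sketch of database harvesting under assumed names: rows from a
# hypothetical "pages" table are exported into a simple XML structure.
import sqlite3
import xml.etree.ElementTree as ET

# Small in-memory database standing in for a site's backend (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, title TEXT, captured TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("https://example.org/", "Home", "2024-01-15"),
        ("https://example.org/about", "About", "2024-01-15"),
    ],
)

# Export every row as an XML <record> element under a single <archive> root.
root = ET.Element("archive")
for url, title, captured in conn.execute("SELECT url, title, captured FROM pages"):
    record = ET.SubElement(root, "record")
    ET.SubElement(record, "url").text = url
    ET.SubElement(record, "title").text = title
    ET.SubElement(record, "captured").text = captured

ET.indent(root)  # pretty-print (Python 3.9+)
print(ET.tostring(root, encoding="unicode"))
```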