Web Archiving: Home
Web archiving is the practice of collecting webpages in a preservable format with the goal of accessing multiple versions of webpages in the future. This LibGuide is meant to be an introductory resource for web archiving and web crawling, and to highlight the work being done at Virginia Tech University Libraries.
Why archive the web?
Web archiving offers a way to capture and preserve web pages and other interactive or dynamic digital objects, whose content can change or disappear over time. Archiving efforts must also account for several constraints:
- A robots.txt file, placed at the root of many websites, tells crawlers which pages should not be crawled or captured
- Limits need to be applied to crawls, such as number of pages, time limits, and how many levels from the seed the crawler captures
- Website creators can determine what is publicly accessible or kept private
- Websites are often protected by intellectual property laws, and crawling them may infringe on those laws
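The robots.txt and crawl-limit points above can be illustrated with a minimal Python sketch. It assumes a hypothetical robots.txt body, a hypothetical crawler name ("example-archiver"), and made-up limit values; a real crawler would fetch robots.txt from the site and take its limits from the crawl configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download this
# from the root of the site it is about to crawl.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

def within_limits(depth, pages_captured, elapsed_seconds,
                  max_depth=2, max_pages=100, max_seconds=3600):
    """Return True while the crawl is inside its configured limits.

    The three limits mirror the ones listed above: levels from the seed,
    number of pages, and time. The default values are illustrative only.
    """
    return (depth <= max_depth
            and pages_captured < max_pages
            and elapsed_seconds < max_seconds)

parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("example-archiver", "https://example.org/about")
blocked = parser.can_fetch("example-archiver", "https://example.org/private/report")
print(allowed, blocked)
```

A crawler would typically call both checks before each request and skip the URL if either one fails.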
Library of Congress on Web Archiving
Types of Web Archiving
In general, there are three types of web archiving:
- Remote harvesting: using web crawlers to browse and capture webpages
- Database harvesting: exporting data from a database into a standard schema like XML for the purpose of querying
- Transactional harvesting: capturing webpages in response to events, such as a user viewing a page, that trigger a capture
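As a rough illustration of database harvesting, the sketch below exports rows from a database table into XML using only the Python standard library. The in-memory SQLite database, the `articles` table, and the `record` element name are all hypothetical; an actual project would follow whatever schema its archive requires.

```python
import sqlite3
import xml.etree.ElementTree as ET

def export_table_to_xml(conn, table):
    """Export every row of `table` as XML, one <record> element per row.

    The table name is interpolated directly into the query, so this
    sketch assumes it comes from a trusted source.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    root = ET.Element(table)
    for row in cursor:
        record = ET.SubElement(root, "record")
        for name, value in zip(columns, row):
            ET.SubElement(record, name).text = str(value)
    return ET.tostring(root, encoding="unicode")

# Hypothetical source database with a single table and a single row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER, title TEXT)")
conn.execute("INSERT INTO articles VALUES (1, 'First post')")

xml_text = export_table_to_xml(conn, "articles")
print(xml_text)
```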
What is Web Archiving?
"Web archiving is the process of gathering up data that has been recorded on the World Wide Web, storing it, ensuring the data is preserved in an archive, and making the collected data available for future research."
-- Niu, J. (March/April 2012). "An overview of web archiving." D-Lib Magazine, Vol. 18, 3/4.
Web crawling is the process of visiting a set list of URLs, called seeds, identifying all of the hyperlinks in each seed, and copying and saving information as the crawler "crawls" the seed. This information is saved as snapshots in the WARC file format, so that when the snapshots are replayed with a WARC viewer, the pages appear as they were captured.
Figure: Diagram of a simple web crawler architecture. The diagram is based on an image from the PhD thesis of Carlos Castillo, released to the public domain by the original author.
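The crawling process described above can be sketched in a few lines of Python. This is a simplified illustration, not how any particular archiving tool works: the `PAGES` dictionary is a hypothetical stand-in for the live web, the `crawl` function and its `max_depth` parameter are invented for the example, and snapshots are kept in a dictionary rather than written to a WARC file.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical pages standing in for the web, so the sketch runs offline.
PAGES = {
    "https://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.org/a": '<a href="/">home</a> <a href="/c">C</a>',
    "https://example.org/b": "no links here",
    "https://example.org/c": '<a href="/a">A</a>',
}

def crawl(seeds, fetch, max_depth=1):
    """Visit each seed, save its content, and follow links breadth-first.

    `fetch` maps a URL to its HTML, or None if the page is unavailable.
    `max_depth` is how many levels from the seed the crawler will capture.
    """
    captured = {}
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        url, depth = queue.popleft()
        if url in captured:
            continue  # already saved a snapshot of this page
        html = fetch(url)
        if html is None:
            continue
        captured[url] = html
        if depth >= max_depth:
            continue  # capture the page but do not follow its links
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            queue.append((urljoin(url, href), depth + 1))
    return captured

snapshots = crawl(["https://example.org/"], PAGES.get, max_depth=1)
print(sorted(snapshots))
```

With `max_depth=1`, the seed and the two pages it links to are captured, while the page two levels away is left out.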
What is a WARC?
Web ARChive File Format (WARC)
"The WARC (Web ARChive) format specifies a method for combining multiple digital resources into an aggregate archival file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format [ARC_IA] format that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web.
A WARC format file is the concatenation of one or more WARC records. A WARC record consists of a record header followed by a record content block and two newlines; the header has mandatory named fields that document the date, type, and length of the record and support the convenient retrieval of each harvested resource (file). There are eight types of WARC record: 'warcinfo', 'response', 'resource', 'request', 'metadata', 'revisit', 'conversion', and 'continuation'. The content blocks in a WARC file may contain resources in any format; examples include the binary image or audiovisual files that may be embedded or linked to in HTML pages."
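To make the record layout described above more concrete, here is a minimal Python sketch that builds a single 'resource' record and reads it back. The function names `build_record` and `parse_record`, the example URL, and the payload are hypothetical, and the sketch covers only the basic header-plus-content-block layout; real archives should be written and read with a maintained WARC library.

```python
import uuid
from datetime import datetime, timezone

def build_record(record_type, target_uri, payload, content_type):
    """Assemble one WARC record: header, blank line, content block, two newlines."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        f"WARC-Type: {record_type}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {timestamp}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    header = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    return header + payload + b"\r\n\r\n"

def parse_record(data):
    """Split a single record into its version line, named fields, and content block."""
    header, _, rest = data.partition(b"\r\n\r\n")
    lines = header.decode("utf-8").split("\r\n")
    version = lines[0]
    fields = dict(line.split(": ", 1) for line in lines[1:])
    length = int(fields["Content-Length"])
    return version, fields, rest[:length]

record = build_record(
    "resource", "https://example.org/", b"<html>hello</html>", "text/html"
)
version, fields, block = parse_record(record)
print(version, fields["WARC-Type"], fields["Content-Length"])
```

Because a WARC file is a concatenation of such records, the Content-Length field is what lets a reader find where one content block ends and the next record begins.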