Web archiving is the practice of collecting web pages in a preservable format so that multiple versions of those pages can be accessed in the future. This LibGuide is an introductory resource for web archiving and web crawling, and it highlights the work being done at Virginia Tech University Libraries.
Web archiving offers a way to capture and preserve web pages and other interactive or dynamic digital objects.
In general, there are three types of web archiving:
-- Niu, J. (March/April 2012). "An overview of web archiving." D-Lib Magazine, Vol. 18, No. 3/4.
Figure: Diagram of a simple web crawler architecture. The diagram is based on an image from the PhD thesis of Carlos Castillo, released into the public domain by the original author.
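For readers who prefer code to diagrams, the following is a minimal sketch of how a simple crawler might be organized. It assumes the common arrangement of a queue of URLs to visit, a downloader, a link extractor, and a store for fetched pages; the guide itself does not describe the diagram's components, so the structure here is an assumption. The names `crawl`, `LinkExtractor`, and `FAKE_WEB` are hypothetical, and the "web" is an in-memory dictionary so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    `fetch` is any callable that returns the HTML for a URL, or None
    if the page cannot be retrieved (hypothetical interface).
    """
    queue = deque(seeds)   # URLs waiting to be downloaded
    seen = set(seeds)      # URLs already queued, to avoid repeats
    storage = {}           # fetched pages, keyed by URL

    while queue and len(storage) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        storage[url] = html

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return storage


# A stand-in for the live web (assumption, for illustration only).
FAKE_WEB = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "http://example.org/a": '<a href="/">home</a>',
    "http://example.org/b": "no links here",
}

pages = crawl(["http://example.org/"], FAKE_WEB.get)
```

A production crawler would also need politeness delays, robots.txt handling, and error recovery, none of which are shown here.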
"The WARC (Web ARChive) format specifies a method for combining multiple digital resources into an aggregate archival file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format [ARC_IA] format that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web.
A WARC format file is the concatenation of one or more WARC records. A WARC record consists of a record header followed by a record content block and two newlines; the header has mandatory named fields that document the date, type, and length of the record and support the convenient retrieval of each harvested resource (file). There are eight types of WARC record: 'warcinfo', 'response', 'resource', 'request', 'metadata', 'revisit', 'conversion', and 'continuation'. The content blocks in a WARC file may contain resources in any format; examples include the binary image or audiovisual files that may be embedded or linked to in HTML pages."
-- Sustainability of Digital Formats: Planning for Library of Congress Collections
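The record layout described in the quotation above can be illustrated with a short sketch. It writes and reads records using only the Python standard library, under these assumptions: the version line is `WARC/1.0`, lines end in CRLF, and the header carries the `WARC-Type`, `WARC-Record-ID`, `WARC-Date`, and `Content-Length` fields. The function names `build_record` and `read_records` are hypothetical, and real records usually carry additional fields (for example, a target URI) that are omitted here. This is not a substitute for a dedicated WARC library.

```python
import io
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_record(record_type, block, content_type="application/octet-stream"):
    """Serialize one record: header, blank line, content block, two CRLFs."""
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("Content-Type", content_type),
        ("Content-Length", str(len(block))),
    ]
    head = b"WARC/1.0" + CRLF
    for name, value in headers:
        head += f"{name}: {value}".encode("utf-8") + CRLF
    return head + CRLF + block + CRLF + CRLF


def read_records(stream):
    """Yield (headers, block) for each record in a concatenated byte stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of file
        if not version.startswith(b"WARC/"):
            raise ValueError(f"unexpected version line: {version!r}")

        headers = {}
        while True:
            line = stream.readline()
            if line in (CRLF, b"\n", b""):
                break  # blank line ends the header
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # Content-Length says exactly how many bytes the block holds,
        # so the block may contain any binary data, including CRLFs.
        block = stream.read(int(headers["Content-Length"]))
        stream.read(4)  # consume the two CRLFs that close the record
        yield headers, block


# A WARC file is simply records placed one after another.
warc_bytes = (
    build_record("warcinfo", b"software: demo\r\n", "application/warc-fields")
    + build_record("resource", b"\x89PNG\r\n\r\nfake image bytes", "image/png")
)
records = list(read_records(io.BytesIO(warc_bytes)))
```

Because each record states its own length, a reader can skip from record to record without inspecting the content blocks, which is what makes retrieval of individual harvested resources convenient.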