Web Archiving: Resources

Web archiving is quickly becoming a popular practice. This page provides a non-exhaustive list of resources on web archiving and web crawling.

Resources

Web Archiving

International Internet Preservation Consortium. "Web Archiving."
Niu, J. (March/April 2012). "An overview of web archiving." D-Lib Magazine, Vol. 18, No. 3/4.
Pennock, M. (March 2013). "Web-archiving." DPC Technology Watch Report 13-01. DOI: http://dx.doi.org/10.7207/twr13-01
Stanford Libraries. Web Archiving.
Web Archiving at Library of Congress

Web Archiving Metadata

Dooley, J. (5 April 2017). "Best Practices for Web Archiving Metadata: Watch This Space!" OCLC.
OCLC. (2018). Descriptive Metadata for Web Archiving
Praetzellis, M. (2018). "Add, edit, and manage your metadata." Archive-It.

Web Crawling

Castillo, C. (2004). Effective Web Crawling. Dept. of Computer Science, University of Chile.
Clay, B. (2015). "Robots Exclusion Protocol Guide." Bruce Clay Inc., Global Marketing Solutions.
Nemeslaki, A. & Pocsarovszky, K. (2011). Web crawler research methodology.
NT, B. (18 September 2018). "Top 50 open source web crawlers for data mining." Big Data Made Simple.
Olston, C. & Najork, M. (2010). "Web crawling." Foundations and Trends in Information Retrieval, Vol. 4, No. 3, pp. 175-246. DOI: 10.1561/1500000017

Web ARChive File Format

Bibliothèque nationale de France. The WARC File Format (ISO 28500) - Information, Maintenance, Drafts
International Internet Preservation Consortium. "The WARC Format 1.0"
Internet Archive. "warc: Python library to work with ARC and WARC files"
ISO (2006). Information and documentation — The WARC File Format
Sustainability of Digital Formats: Planning for Library of Congress Collections: Web ARChive File Format

Other Resources

Common Crawl: an organization with the goal of "democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable."
Library of Congress Guide to Creating Preservable Websites: Creating preservable websites increases how effectively and comprehensively those websites can be archived.
Perma.cc: a service that provides persistent short links for web pages used in citations. It relies on archived copies of those pages. Virginia Tech is a registrar; see the Perma.cc LibGuide for information on using this service.
WebCite: an archiving system for web references (web citations) that aims to keep cited references in scholarly works accessible.

Glossary

API (application programming interface): a set of routines, protocols, and tools for building software applications that specifies how software components should interact

Digital preservation: the series of managed activities necessary to ensure continued access to digital materials for as long as necessary

robots.txt: a.k.a. the Robots Exclusion Protocol; a standard that tells web crawlers which parts of a website should and should not be crawled

Seed: a starting URL from which a web crawler begins its crawl

Web archive: a collection of pages captured from the World Wide Web and preserved for later access

Web crawler (a.k.a. spider): software that automatically and systematically browses the internet and captures snapshots of web pages

WARC (Web ARChive file format): a file format, derived from the ARC (ARChival file format), that stores harvested web pages and allows more metadata to be captured than the ARC file format
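
Example: how seeds, robots.txt, and a web crawler fit together

The glossary terms above can be illustrated with a minimal sketch in Python. It is not a production crawler and does not describe how any particular archiving tool works. To stay runnable without network access, it uses a hypothetical in-memory "website" (PAGES), a hypothetical robots.txt (ROBOTS_TXT), and a made-up user agent name ("example-crawler"); only the standard-library modules are real.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the example site (an assumption for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical in-memory pages standing in for real HTTP responses.
PAGES = {
    "https://example.org/": '<a href="/about">About</a> <a href="/private/notes">Notes</a>',
    "https://example.org/about": '<a href="/">Home</a>',
    "https://example.org/private/notes": "not for crawlers",
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, user_agent="example-crawler"):
    """Breadth-first crawl from the seed URLs, honoring robots.txt rules."""
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT.splitlines())

    frontier = deque(seeds)  # URLs waiting to be visited
    seen = set(seeds)        # URLs already queued, to avoid repeats
    captured = []            # URLs whose content was "snapshotted"

    while frontier:
        url = frontier.popleft()
        if not robots.can_fetch(user_agent, url):
            continue  # excluded by robots.txt
        html = PAGES.get(url)
        if html is None:
            continue  # page does not exist in the example site
        captured.append(url)

        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return captured


print(crawl(["https://example.org/"]))
```

Starting from the single seed, the sketch captures the home page and the "about" page, and skips the page under /private/ because the example robots.txt disallows it. A real crawler would fetch pages over HTTP and typically write each capture to a WARC file.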