Web Archiving: Resources
Web archiving is quickly becoming a popular practice. This page provides a non-exhaustive list of resources on web archiving and web crawling.
Resources
Web Archiving
- International Internet Preservation Consortium. "Web Archiving."
- Niu, J. (March/April 2012). "An overview of web archiving." D-Lib Magazine, Vol. 18, No. 3/4.
- Pennock, M. (March 2013). "Web-archiving." DPC Technology Watch Report 13-01. DOI: http://dx.doi.org/10.7207/twr13-01
- Stanford Libraries. Web Archiving.
- Web Archiving at Library of Congress
Web Archiving Metadata
- Dooley, J. (5 April 2017). "Best Practices for Web Archiving Metadata: Watch This Space!" OCLC.
- OCLC. (2018). Descriptive Metadata for Web Archiving
- Praetzellis, Maria. (2018). "Add, edit, and manage your metadata." Archive-It.
Web Crawling
- Castillo, C. (2004). Effective Web Crawling. Dept. of Computer Science, University of Chile.
- Clay, B. (2015). "Robots Exclusion Protocol Guide." Bruce Clay Inc., Global Marketing Solutions.
- Nemeslaki, András & Pocsarovszky, Károly. (2011). Web crawler research methodology.
- NT, B. (18 September 2018). "Top 50 open source web crawlers for data mining." Big Data Made Simple.
- Olston, C. and Najork, M. (2010). "Web crawling." Foundations and Trends in Information Retrieval, Vol. 4, No. 3, pp. 175-246. DOI: 10.1561/1500000017
Web ARChive File Format
- Bibliothèque nationale de France. The WARC File Format (ISO 28500) - Information, Maintenance, Drafts
- International Internet Preservation Consortium. "The WARC Format 1.0"
- Internet Archive. "warc: Python library to work with ARC and WARC files"
- ISO (2006). Information and documentation — The WARC File Format
- Sustainability of Digital Formats: Planning for Library of Congress Collections: Web ARChive File Format
Other Resources
- Common Crawl: an organization with the goal of "democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable."
- Library of Congress Guide to Creating Preservable Websites: Creating preservable websites increases how effectively and comprehensively those websites can be archived.
- Perma.cc: a service that creates persistent short links to archived copies of web pages so that they can be cited reliably. Virginia Tech is a registrar. See the Perma.cc LibGuide for information on using this service.
- WebCite: an archiving system for web references (web citations) that ensures cited references in scholarly works remain accessible.
Glossary
API (application programming interface): a set of routines, protocols, and tools for building software applications that specifies how software components should interact
Digital preservation: the series of managed activities necessary to ensure continued access to digital materials for as long as necessary
robots.txt: a.k.a. the robots exclusion protocol; a standard that website owners use to tell web crawlers which parts of a website should and should not be crawled
Seed: a URL that a web crawler uses as the starting point of a crawl
Web archive: a collection of pages from the World Wide Web
Web crawler (a.k.a. spider): software that automatically and systematically browses the internet and snapshots webpages
WARC (Web ARChive file format): a file format, derived from the ARC (ARChival file format), that stores harvested web pages and allows more metadata to be captured than the ARC file format does
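
The glossary terms robots.txt, seed, and web crawler can be tied together with a short example. The following is a minimal sketch, not a full crawler: it uses Python's standard urllib.robotparser module to decide which seeds a crawler may visit. The robots.txt content, the seed URLs, the helper name allowed_seeds, and the user agent "example-archiver" are all hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would download this file
# from the root of each host before fetching any page on that host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_seeds(seeds, robots_txt, user_agent="example-archiver"):
    """Return the seeds that the robots.txt rules permit user_agent to crawl."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [url for url in seeds if parser.can_fetch(user_agent, url)]


# Hypothetical seeds: the starting URLs for a crawl.
seeds = [
    "https://example.org/",
    "https://example.org/private/report.html",
]

print(allowed_seeds(seeds, ROBOTS_TXT))
```

In this sketch only the first seed passes the check, because the second falls under the disallowed /private/ path. Archiving tools differ in how they treat robots.txt, so consult the documentation of the crawler you use.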