Web Archiving: Resources
Web archiving is quickly becoming a popular practice. This page provides a non-exhaustive list of resources on web archiving and web crawling.
Web Archiving
- International Internet Preservation Consortium. "Web Archiving."
- Niu, J. (March/April 2012). "An overview of web archiving." D-Lib Magazine, Vol. 18, No. 3/4.
- Pennock, M. (March 2013). "Web-archiving." DPC Technology Watch Report 13-01. DOI: http://dx.doi.org/10.7207/twr13-01
- Stanford Libraries. Web Archiving.
- Web Archiving at Library of Congress
Web Archiving Metadata
- Dooley, J. (5 April 2017). "Best Practices for Web Archiving Metadata: Watch This Space!" OCLC.
- OCLC. (2018). Descriptive Metadata for Web Archiving
- Praetzellis, M. (2018). "Add, edit, and manage your metadata." Archive-It.
Web Crawling
- Castillo, C. (2004). Effective Web Crawling. Dept. of Computer Science, University of Chile.
- Clay, B. (2015). "Robots Exclusion Protocol Guide." Bruce Clay Inc., Global Marketing Solutions.
- Nemeslaki, A. and Pocsarovszky, K. (2011). Web crawler research methodology.
- NT, B. (18 September 2018). "Top 50 open source web crawlers for data mining." Big Data Made Simple.
- Olston, C. and Najork, M. (2010). "Web crawling." Foundations and Trends in Information Retrieval, Vol. 4, No. 3, pp. 175-246. DOI: 10.1561/1500000017
Web ARChive File Format
- Bibliothèque nationale de France. The WARC File Format (ISO 28500) - Information, Maintenance, Drafts
- International Internet Preservation Consortium. "The WARC Format 1.0"
- Internet Archive. "warc: Python library to work with ARC and WARC files"
- ISO (2006). Information and documentation — The WARC File Format
- Sustainability of Digital Formats: Planning for Library of Congress Collections: Web ARChive File Format
Other Resources
- Common Crawl: an organization with the goal of "democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable."
- Library of Congress Guide to Creating Preservable Websites: Creating preservable websites increases how effectively and comprehensively those websites can be archived.
- Perma.cc: a service that provides persistent short links for websites cited in publications. Virginia Tech is a registrar. See the Perma.cc LibGuide for information on using this service, which relies on archived web pages.
- WebCite: an archiving system for web references (web citations) that ensures cited references in scholarly works remain accessible
Glossary
API (application programming interface): a set of routines, protocols, and tools for building software applications that specifies how software components should interact
Digital preservation: the series of managed activities necessary to ensure continued access to digital materials for as long as necessary
robots.txt: a.k.a. the robots exclusion protocol; a standard that tells web crawlers which parts of a website should and should not be crawled
Seed: a URL that a web crawler uses as the starting point of a crawl
Web archive: a collection of pages from the World Wide Web
Web crawler (a.k.a. spider): software that automatically and systematically browses the web and captures snapshots of web pages
WARC (Web ARChive file format): a file format derived from ARC (ARChival file format) that stores harvested web pages and captures more metadata than the ARC format allows
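To make the terms seed, web crawler, and robots.txt more concrete, the following is a minimal Python sketch of how they fit together. It is an illustration under stated assumptions, not how any particular archiving tool works: the site example.org, its robots.txt rules, the LINKS table standing in for real web pages, the crawl function, and the "example-archiver" user agent are all hypothetical. Nothing is fetched from the network; only the standard library's urllib.robotparser module is used to interpret the rules.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site (example.org).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real crawler would download each page and extract its links instead.
LINKS = {
    "https://example.org/": [
        "https://example.org/about",
        "https://example.org/private/notes",
    ],
    "https://example.org/about": [
        "https://example.org/",
        "https://example.org/contact",
    ],
    "https://example.org/contact": [],
    "https://example.org/private/notes": ["https://example.org/private/secret"],
}


def crawl(seeds, user_agent="example-archiver"):
    """Visit pages breadth-first from the seeds, skipping URLs robots.txt disallows."""
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT.splitlines())  # load the rules without any network access

    queue = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)      # URLs already queued, so each is visited at most once
    captured = []          # URLs the crawler was allowed to capture, in visit order

    while queue:
        url = queue.popleft()
        if not robots.can_fetch(user_agent, url):
            continue  # excluded by robots.txt: neither captured nor followed
        captured.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return captured


if __name__ == "__main__":
    for url in crawl(["https://example.org/"]):
        print(url)
```

Starting from the single seed https://example.org/, this sketch captures the home, about, and contact pages and skips everything under /private/. A production crawler would also write each captured response to a WARC file, which this sketch omits.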