
Web Archiving: Resources

Web archiving is quickly becoming a popular practice. This page provides a non-exhaustive list of resources on web archiving and web crawling.



Other Resources

  • Common Crawl: an organization with the goal of "democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable."
  • Library of Congress Guide to Creating Preservable Websites: building preservable websites improves how effectively and comprehensively those websites can be archived.

  • A registrar service that provides persistent shortlinks for websites used in citations; the service relies on web-archived pages. Virginia Tech is a registrar; see the LibGuide for information on using this service.
  • WebCite: an archiving system for web references (web citations) that ensures cited references in scholarly works remain accessible.


Glossary

API (application programming interface): a set of routines, protocols, and tools for building software applications that specifies how software components should interact

Digital preservation: the series of managed activities necessary to ensure continued access to digital materials for as long as necessary

robots.txt: a.k.a. the Robots Exclusion Protocol; a standard that tells crawlers which parts of a website should and should not be crawled
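
These exclusion rules can be checked programmatically; Python's standard library ships a parser for them. A minimal sketch, with made-up rules and URLs for illustration:

```python
from urllib.robotparser import RobotFileParser

# A minimal robots.txt: all crawlers are asked to skip /private/.
RULES = """\
User-agent: *
Disallow: /private/
"""

# Parse the rules, then ask whether specific URLs may be fetched.
parser = RobotFileParser()
parser.parse(RULES.splitlines())

print(parser.can_fetch("*", "https://example.com/private/page.html"))  # False
print(parser.can_fetch("*", "https://example.com/public/page.html"))   # True
```

Note that robots.txt is advisory: well-behaved crawlers consult it before fetching, but nothing enforces it.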

Seed: a starting URL from which a web crawler begins a crawl

Web archive: a collection of pages from the World Wide Web

Web crawler (a.k.a. spider): software that automatically and systematically browses the web and takes snapshots of webpages
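
At its core, a crawl is a breadth-first traversal starting from seed URLs. A toy sketch, with a hypothetical fetch_links function standing in for real page fetching and link extraction (a production crawler would also honor robots.txt, add politeness delays, and deduplicate by content):

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: start from seed URLs, follow discovered
    links, and return the URLs visited (each page fetched once)."""
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # URLs already queued or fetched
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):  # fetch the page, extract its links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Stand-in for a real fetcher: a tiny invented link graph.
GRAPH = {
    "https://a.example/": ["https://b.example/", "https://c.example/"],
    "https://b.example/": ["https://c.example/"],
    "https://c.example/": [],
}

print(crawl(["https://a.example/"], lambda u: GRAPH.get(u, [])))
```

The seed list determines the scope of the crawl; everything else is discovered by following links.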

WARC (Web ARChive file format): a file format for storing harvested web pages, derived from the ARC (ARChive) file format; WARC allows more metadata to be captured than ARC
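
A WARC record is plain-text headers followed by the captured payload. Below is a hand-rolled sketch of a single record for illustration only (the URI and payload are invented); in practice, crawlers and tools such as Heritrix or warcio write WARC files:

```python
import uuid
from datetime import datetime, timezone

def build_warc_record(target_uri, payload: bytes) -> bytes:
    """Assemble one minimal WARC/1.0 'resource' record: a version line,
    named header fields, a blank line, the payload, two trailing CRLFs."""
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: text/html",
        f"Content-Length: {len(payload)}",  # byte length of the payload block
    ]
    return ("\r\n".join(headers).encode("ascii")
            + b"\r\n\r\n" + payload + b"\r\n\r\n")

record = build_warc_record("https://example.com/", b"<html>hello</html>")
```

A WARC file is simply many such records concatenated, which is what makes the format easy to stream and append to during a crawl.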

Web crawling slides