Text and Data Mining: TDM: Tools & Technology
Policies, practices, tools, and current issues related to TDM.
Considerations for Mining
Contact your Subject Specialist librarian for data source options, or try the Finding Data library guide. Contact dataservices@vt.edu for additional recommendations on processes and tools for manipulating, analyzing, and visualizing text or data. Some general considerations for pulling large amounts of data:
- Use an official API or bulk download service if provided.
- Web scraping is an option, but it has drawbacks. Consider how much data you are pulling and how quickly, and make a good-faith effort to identify each source's terms of use.
- If accessing subscription or licensed content, be sure to check the policies or terms.
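The considerations above about volume, speed, and site policies can be sketched in Python using only the standard library. This is a minimal illustration, not a complete scraper: the user agent name `tdm-research-bot`, the helper names, and the one-second default delay are assumptions made for the example, and the robots.txt text is assumed to have been downloaded already. A robots.txt file does not replace a site's terms of use or a license agreement, so those still need to be read separately.

```python
import time
import urllib.robotparser

USER_AGENT = "tdm-research-bot"  # hypothetical name; identify your own project


def load_rules(robots_txt):
    """Parse the text of a robots.txt file that was already downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_requests(parser, urls, default_delay=1.0):
    """Return (allowed_urls, delay_seconds) for a batch of URLs.

    URLs disallowed for USER_AGENT are dropped. The delay is the site's
    Crawl-delay if it declares one, otherwise the assumed default.
    """
    allowed = [u for u in urls if parser.can_fetch(USER_AGENT, u)]
    delay = parser.crawl_delay(USER_AGENT) or default_delay
    return allowed, float(delay)


def fetch_all(urls, delay, fetch, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # spread requests out instead of sending them in a burst
        results.append(fetch(url))
    return results
```

In use, `fetch` would be a function that downloads a single URL, for example a thin wrapper around `urllib.request.urlopen`. Passing it in as an argument keeps the pacing logic separate from the download code.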
Tool Lists:
- General: Tutorials, Toolsets, and Resource Lists
- Data Processing: Getting, Scraping, Extracting, Transcribing, and Cleaning Data from Websites and Documents
- Analyzing and Visualizing Data
- Coding language applications for TDM
Tutorials, Toolsets, and Resource Lists
- ContentMine: Step-by-step tutorials for learning to use the ContentMine tools.
- Data Visualization and Analysis Tools: Computerworld chart - 30+ free tools for data visualization and analysis.
- DHbox: A digital humanities laboratory with essential applications through simple sign-in via a web browser.
- DiRT Directory: A registry of digital research tools for scholarly use.
- Introduction to Text Analysis: This workbook provides a brief introduction to digital text analysis through a series of three-part units.
- Programming Historian: Novice-friendly, peer-reviewed tutorials that help humanists learn a wide range of digital tools, techniques, and workflows.
- TAPoR: Text Analysis Portal for Research - Discover research tools for studying texts.
Data Processing: Getting, Scraping, Extracting, Transcribing, and Cleaning Data from Websites and Documents
- ChemDataExtractor: Automatically extract chemical information from scientific documents.
- ContentMine: getpapers: Get metadata, fulltexts or fulltext URLs of papers matching a search query.
- DHbox: A digital humanities laboratory with essential applications through simple sign-in via a web browser.
- DiRT Directory: A registry of digital research tools for scholarly use.
- From the Page: Open-source software for collaborative project support, from simple, plain-text transcription through bilingual digital editions. Indexing, discussion, security options, and version control features.
- Google Fusion Tables: An experimental data visualization web application to gather, visualize, and share data tables. Google retired the service in December 2019.
- OpenRefine: A powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.
- Scrapy: An open source and collaborative framework for extracting the data you need from websites.
- Scribe: A framework for crowdsourcing the transcription of text-based documents, particularly documents that are not well suited for Optical Character Recognition.
- Scripto: A free, open source tool enabling community transcriptions of document and multimedia files.
- Tabula: Extract data tables from PDFs.
- TAPoR: Text Analysis Portal for Research - Discover research tools for studying texts.
- TypeWright: A tool for correcting the text-version of a document made up of page images.
- Ubiqu+Ity: Generates statistics and web-based tagged text views for your text/s, using the DocuScope dictionary or your own rules.
- WordSeer: A text analysis environment for Humanities scholars that combines visualization, information retrieval, sensemaking and natural language processing to make the contents of text navigable, accessible, and useful.
Analyzing and Visualizing Data
- Bookworm: Bookworm is a simple and powerful way to visualize trends in repositories of digitized texts.
- Data Analysis and Visualization Tools: Computerworld chart - 30+ free tools for data visualization and analysis.
- DiRT Directory: A registry of digital research tools for scholarly use.
- Gephi: The leading visualization and exploration software for all kinds of graphs and networks. Gephi is open-source and free.
- Google Books NGram Viewer: Graphical displays of how words or phrases have occurred in the Google Books corpus of books.
- Google Charts: Interactive charts for browsers and mobile devices.
- Google Fusion Tables: An experimental data visualization web application to gather, visualize, and share data tables. Google retired the service in December 2019.
- Tableau Public: Visualize and share your data in minutes, for free.
- TAPoR: Text Analysis Portal for Research - Discover research tools for studying texts.
- VOSviewer: A software tool for constructing and visualizing bibliometric networks, for co-citation, bibliographic coupling, co-authorship relations, or co-occurrence networks of important terms.
- Voyant: A web-based reading and analysis environment for digital texts.
- WordSeer: A text analysis environment for Humanities scholars that combines visualization, information retrieval, sensemaking and natural language processing to make the contents of text navigable, accessible, and useful.
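Several of the tools above visualize networks, including co-occurrence networks of terms. As a rough sketch of how such a network could be prepared from raw text, the Python below counts how many documents each pair of terms shares and writes the result as an edge list. The function names and the simple letters-only tokenizer are assumptions made for this example. The `Source`, `Target`, `Weight` header follows the column naming commonly used for spreadsheet import in Gephi, so check the documentation for the tool and version you actually use.

```python
import csv
import io
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase a text and return its set of distinct word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def cooccurrence_counts(documents, min_count=1):
    """Count how many documents each unordered pair of terms shares."""
    counts = Counter()
    for doc in documents:
        # Sorting makes each pair appear in one consistent order.
        terms = sorted(tokenize(doc))
        counts.update(combinations(terms, 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}


def to_edge_csv(counts):
    """Serialize pair counts as CSV text with Source, Target, Weight columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), n in sorted(counts.items()):
        writer.writerow([a, b, n])
    return buffer.getvalue()
```

Raising `min_count` drops pairs that appear together in only a few documents, which keeps the resulting graph readable for larger collections. A real project would usually also remove stop words before counting.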
Python Resources
- Python: A programming language that lets you work quickly and integrate systems more effectively.
- Cleaning Data in Python: A tutorial on Cleaning Data in Python from the University of Toronto Map & Data Library. You may need to navigate to the tutorial link via a list on the left side of the landing page.
- Good Tables: Validate tabular data with Python.
- Pandas: Data analysis and modeling for Python.
- PDFMEF: A multi-entity extraction framework for academic documents.
- SciPy: A Python-based ecosystem of open-source software for mathematics, science, and engineering.
- SMaPP Toolkit: Python library for interacting with smapp collections.
- txtorg: A Python-based utility that leverages Whoosh and Apache Lucene to facilitate text preprocessing and management.
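To give a sense of what data cleaning in Python involves, here is a minimal sketch that uses only the standard library. It trims and collapses whitespace, skips rows missing a required field, and drops case-insensitive duplicates. The function name `clean_rows` and the `title` column are hypothetical choices for the example. Pandas offers comparable operations, such as `drop_duplicates` and `dropna`, that are better suited to large tables.

```python
import csv
import io


def clean_rows(csv_text, required=("title",)):
    """Return cleaned, de-duplicated rows from CSV text as a list of dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    cleaned = []
    for raw in reader:
        # Trim each field and collapse internal runs of whitespace.
        # Missing trailing fields arrive as None and become empty strings.
        row = {
            key.strip(): " ".join((value or "").split())
            for key, value in raw.items()
            if key is not None
        }
        # Skip rows that lack a value in any required column.
        if any(not row.get(col) for col in required):
            continue
        # Treat rows that differ only in letter case as duplicates.
        fingerprint = tuple(value.lower() for value in row.values())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned
```

The first occurrence of each duplicate is kept with its original capitalization, so the order of the input file determines which spelling survives.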
R Resources
- R: A free software environment for statistical computing and graphics.
- TidyText: Text Mining with R book companion website and resource.
- R and Data Mining Courses: Directory of free online courses related to R and Data Mining.
- tabulizer: Provides R bindings to the Tabula java library, which can be used to computationally extract tables from PDF documents.
- Quanteda: An R package for the Quantitative Analysis of Textual Data.
- Austin: Austin is an R package for doing things with words. Right now that means scaling them in the style of Wordscores and Wordfish.
- stm: Structural Topic Model: A general framework for topic modeling with document-level covariate information. The covariates can improve inference and qualitative interpretability and are allowed to affect topical prevalence, topical content or both.
- tesseract: Extract text from an image. Requires that you have training data for the language you are reading. Works best for images with high contrast, little noise and horizontal text.