Why Engage in Text Mining?
▪ Analyzes patterns and connections in text corpora too large for individual researchers to absorb (a prolific author’s complete works; a city’s newspapers for a given decade)
▪ Facilitates finding materials by retrieving documents from a corpus based on keywords and phrases, and extracting specific instances from those documents
▪ Allows researchers to group documents by classification (assigning documents to predefined groups, often learned from examples) or clustering (deriving groups from word frequencies)
▪ “Mining the Dispatch” uses topic modeling to explore trends and patterns in daily life in the capital of the Confederacy through the Richmond Daily Dispatch; Douglas Ernest Duhaime mines an early-modern text corpus to further understandings of the 18th-century shift in emphasis from textual borrowing to authorial originality
Scholars have analyzed text for centuries, but manual methods of closely reading and interpreting text limit the scale of how much information critics can analyze. However, the advent of machine-readable text paired with advances in computer processing has exponentially expanded the capacity for analyzing text. Text mining uses methods and tools from machine learning, statistics, library science, computational linguistics, and data mining to computationally analyze text. Text mining tools and methods have a wide range of functions and aims, but essentially all processes turn large sets of textual data into matrices for analysis. The most common goal of text mining is to uncover patterns and connections in large data sets that no individual reader could uncover and interpret in a reasonable amount of time.
Common text mining methods are information retrieval and extraction, document classification, clustering, topic modeling, and natural language processing methods like part-of-speech tagging and sentiment analysis. Information retrieval pulls specific documents from a corpus based on keyword and phrase specifications, while information extraction pulls specific instances from documents. Broadly, document classification, clustering, and topic modeling are related but distinct approaches toward the goal of grouping documents. Document classification groups documents based on predefined groups and labeled examples, while clustering determines groups based on a statistical analysis of word frequencies. Topic modeling takes this a step further and generates strings of words (“topics”) from the words in the corpus, assigning each document the topics that it has the highest probability of containing.
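The clustering idea described above can be sketched in a few lines of code. The following Python example is a minimal, illustrative sketch (not the algorithm of any specific tool mentioned in this guide): each document becomes a word-frequency vector, and a simple greedy pass groups documents whose vectors are similar enough. The documents, function names, and similarity threshold are all illustrative.

```python
from collections import Counter
import math

def vectorize(text):
    """Turn a document into a word-frequency vector (a bag of words)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.5):
    """Greedy clustering sketch: attach each document to the first
    cluster whose seed document is similar enough, else start a
    new cluster. Returns lists of document indices."""
    vectors = [vectorize(d) for d in docs]
    clusters = []
    for i, v in enumerate(vectors):
        for c in clusters:
            if cosine(vectors[c[0]], v) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

docs = [
    "the battle raged near the river at dawn",
    "soldiers crossed the river before the battle",
    "cotton prices rose at the market this week",
    "the market reported higher cotton prices",
]
print(cluster(docs))  # → [[0, 1], [2, 3]]
```

Real tools use far more sophisticated statistics (and topic modeling adds a probabilistic layer on top), but the underlying move is the same: text becomes a matrix of word counts, and the mathematics operates on that matrix.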
Text mining is closely related to natural language processing, which aims to help machines understand semantic parts of human speech. Part-of-speech tagging and sentiment analysis depend on provided dictionaries that specify parts of speech and, in the case of sentiment analysis, the emotion associated with each word. Text mining allows scholars to expand the scale of their analysis to a degree beyond the limits of an individual human’s physical and mental capacity. In reading large corpora of text with machines, we gain a high-level perspective on authors, series, genres, time periods, and languages. But while text mining can eliminate some individual biases, scholars programming machines to read text must remember that programming has the potential to introduce new biases; text mining should act as a complement to close reading and careful contextualization, not a replacement for them.
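The dictionary-based sentiment analysis described above can be sketched very simply. In this illustrative Python example, the tiny lexicon is a stand-in for the much larger sentiment dictionaries that real toolkits ship with; every name and value here is invented for demonstration.

```python
# Illustrative stand-in for a provided sentiment dictionary:
# each word maps to an emotion score (+1 positive, -1 negative).
SENTIMENT = {
    "victory": 1, "joy": 1, "prosperous": 1,
    "defeat": -1, "grief": -1, "ruin": -1,
}

def sentiment_score(text):
    """Sum the dictionary values of each word; words absent
    from the dictionary contribute nothing."""
    return sum(SENTIMENT.get(w, 0) for w in text.lower().split())

print(sentiment_score("joy and victory after grief"))  # → 1
```

This also makes the bias concern concrete: the scores a scholar gets back depend entirely on which words the chosen dictionary includes and how it values them, which is one way programming choices introduce new biases into a reading.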
For any questions or assistance, please contact us (firstname.lastname@example.org). You can also view the University Libraries resource guide to text mining.
Recommended Tools for Text Mining
Learn more about the following tools that can facilitate text mining.
A web-based program into which users upload a corpus to generate a word frequency cloud, graphs of word use over time, word and phrase correlations, and topic models. It can visualize similarities between documents in a corpus in multiple ways. Easy and user-friendly; no programming knowledge needed.
A virtual text mining environment for the creation and analysis of worksets from the HathiTrust digital library. Researchers have unrestricted access to over three million public domain works and limited, non-consumptive-use access to in-copyright works. Users may run provided algorithms such as word frequency, topic modeling, timeline visualization, and entity extraction. Users may also run their own programs in the center’s virtual environment. Data sets of extracted features (word frequency, part-of-speech tagging, metadata) are available for both public domain and in-copyright volumes. Users must create an account; no programming knowledge needed. Restricted to volumes in HathiTrust.
A Java-based package used mainly for its fast implementations of topic modeling algorithms. Generates topic models quickly, but some knowledge of Java is needed.
An environment for the R programming language, developed by statisticians. It offers a large library of text mining packages, the most user-friendly of which is tidytext. R has a strong online support community and excels at data visualization, but its learning curve is steep for beginning programmers.