University of Minnesota

Kathryn A. Martin Library

Digital Humanities at UMD

A listing of web sites, databases and resources that support digital scholarship and the humanities at the University of Minnesota Duluth.

Tips for Text Mining

Clean up your text before analysis:

  • Remove extraneous components such as forewords, dedications, copyright pages, and publisher notes.
  • Remove page numbers, captions, and chapter headings.
  • Change all capitalization to lower case.
  • Understand the limits of your text source. Text in many PDFs is produced by Optical Character Recognition (OCR), a machine process that identifies individual characters. While the OCR success rate is approaching 98%, accuracy varies with the quality of the original PDF. Be prepared to deal with errors and inconsistencies. Read more about OCR here
  • Copying and pasting from Microsoft Word or HTML doesn't transfer the text cleanly to the new document. A lot of "funky formatting" is pasted along with it, which will skew text mining results. Many text editors accept CTRL + SHIFT + V (CMD + SHIFT + V on a Mac) to paste plain text without formatting.
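A few of the cleanup steps above can be sketched in Python. This is a minimal illustration, not a complete cleaner: the function name `clean_text` is hypothetical, and the rules for spotting page numbers (a line containing only digits) and chapter headings (a line such as "Chapter 3" or "CHAPTER IV") are assumptions that will not fit every source.

```python
import re

def clean_text(raw: str) -> str:
    """Apply a few basic cleanup steps to plain text."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        # Assumption: a line holding only digits is a page number.
        if re.fullmatch(r"\d+", stripped):
            continue
        # Assumption: chapter headings look like "Chapter 3" or "CHAPTER IV".
        if re.fullmatch(r"chapter\s+(\d+|[ivxlcdm]+)", stripped, flags=re.IGNORECASE):
            continue
        if stripped:
            kept.append(stripped)
    text = " ".join(kept)
    # Collapse runs of whitespace left behind by pasted formatting.
    text = re.sub(r"\s+", " ", text)
    # Change all capitalization to lower case.
    return text.lower()

sample = "CHAPTER I\nIt was the   best of times,\n12\nit was the worst of times."
print(clean_text(sample))
```

Real sources usually need rules tuned by hand, so it is worth checking a sample of the cleaned text against the original before running any analysis.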

Additional notes on text cleanup theory and practice can be found at http://www.scientificcomputing.com/blogs/2014/01/text-mining-next-data-frontier

Common Text Mining Terms

Common Terms Used in Text Mining Literature and Conversation

Corpus - "corpus, plural corpora (n.) (1) A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting point of linguistic description or as a means of verifying hypotheses about a language (corpus linguistics). .... Corpora provide the basis for one kind of computational linguistics. A computer corpus is a large body of machine-readable texts. Increasingly large corpora (especially of English) have been compiled since the 1980s, and are used both in the development of natural language processing software and in such applications as lexicography, speech recognition, and machine translation." - A Dictionary of Linguistics and Phonetics, 6th Edition

Lemma and Lexeme - "A lemma is the word you find in the dictionary. A lexeme is a unit of meaning, and can be more than one word. A lexeme is the set of all forms that have the same meaning, while lemma refers to the particular form that is chosen by convention to represent the lexeme... In English, for example, run, runs, ran and running are forms of the same lexeme, but run is the lemma." - https://simple.wikipedia.org/wiki/Lemma_(linguistics)
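As a rough illustration of why lemmas matter for counting, the sketch below maps the forms from the example in the definition (run, runs, ran, running) to a single lemma before tallying. The lookup table `LEMMAS` and the function `lemmatize` are hypothetical and hand-built for this one lexeme; real text mining tools rely on much larger dictionaries or rule sets.

```python
from collections import Counter

# Hand-built lookup for one lexeme, taken from the example in the definition.
LEMMAS = {"run": "run", "runs": "run", "ran": "run", "running": "run"}

def lemmatize(tokens):
    """Replace each token with its lemma when known; otherwise keep it as is."""
    return [LEMMAS.get(token, token) for token in tokens]

tokens = "she runs daily and ran yesterday".split()
counts = Counter(lemmatize(tokens))
print(counts["run"])
```

Without this step, "runs" and "ran" would be counted as two unrelated words.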

Stoplist - "a set of words automatically omitted from a computer search, concordance, or index, usu. the most frequent words, which would slow down processing or make results less satisfactory; also written stoplist" - http://dictionary.reference.com/browse/stop-lis
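A stoplist can be applied with a simple filter, as in the sketch below. The set `STOPLIST` here is a deliberately tiny, made-up example, and the function name `content_word_counts` and the word pattern (runs of letters and apostrophes) are assumptions; published stoplists typically hold hundreds of words.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stoplist for illustration only.
STOPLIST = {"the", "of", "and", "a", "to", "in", "it", "was"}

def content_word_counts(text: str) -> Counter:
    """Count words in the text, skipping any word on the stoplist."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in STOPLIST)

counts = content_word_counts("It was the best of times, it was the worst of times.")
print(counts)
```

With the frequent function words removed, the remaining counts ("times", "best", "worst") say more about the content of the passage.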

Tokenization  - "2) In lexical study, a term used as part of a measure of lexical density. The type/token ratio is the ratio of the total number of different words (types) to the total number of words (tokens) in a sample of text." A Dictionary of Linguistics and Phonetics, 6th Edition  
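The type/token ratio described above can be computed in a few lines. This is a minimal sketch: the function name `type_token_ratio` is hypothetical, and it assumes a word is any run of letters or apostrophes after lower-casing, which is only one of many possible tokenization rules.

```python
import re

def type_token_ratio(text: str) -> float:
    """Return distinct words (types) divided by total words (tokens)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    types = set(tokens)
    return len(types) / len(tokens)

ratio = type_token_ratio("It was the best of times, it was the worst of times.")
print(round(ratio, 3))
```

The sample sentence has 12 tokens and 7 types, so the ratio is 7/12. Longer texts tend to have lower ratios because common words repeat, so comparisons work best between samples of similar length.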

Definition of Text Mining

Per Catherine Blake in Annual Review of Information Science and Technology, 2011, Vol. 45(1), pp. 121-155:

"a text mining system will allow a user to explore and interact with patterns of information from within a collection of potentially relevant documents. The goal of a text mining system is to help a user identify meaningful patterns and learn about the information space related to the information need."

Word cloud for UMD Duluth Webpage

Thanks to an ever-expanding corpus of materials available to researchers and the ubiquity of computing power, researchers can assess, quantify, and code the words of a specific author, entity, field, or nation and find new insights into history, sociology, and more. Whether exploring the content of the UMD Statesman, the lyrics of a specific artist, or the corpus of a specific author, we can now find insights that would otherwise be obscured by the sheer scale of the material.

Text Mining Tools

Example of Text Mining Analysis (Voyant)

Example of A Voyant Text Analysis

Text Mining Corpora (Text Locations)