Research Guides: Digital Humanities at UMD: Text Mining Resources

Resources on Text Mining and Linguistics

A sampling of Books from the UMD Library Collection on Text Mining

Text Mining by Sholom M. Weiss; Nitin Indurkhya; Tong Zhang; Fred J. Damerau
ISBN: 0387954333
Publication Date: 2004-10-25

Text Mining - Applications and Theory by Michael W. Berry (Editor); Jacob Kogan (Editor)
ISBN: 9780470689653
Publication Date: 2010-02-25

Natural Language Processing and Text Mining by Anne Kao (Editor); Stephen R. Poteet (Editor)
ISBN: 9781846287541
Publication Date: 2007-03-06

The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data by Ronen Feldman; James Sanger

Tips for Text Mining

Clean up your text before analysis:

Remove extraneous components such as forewords, dedications, copyright pages, and publisher's notes.
Remove page numbers, captions, and chapter headings.
Change all capitalization to lower case.
Understand the limits of your text source. The text in scanned PDFs is produced by Optical Character Recognition (OCR), a machine process that identifies individual characters. While the OCR success rate is approaching 98%, accuracy varies with the quality of the original PDF. Be prepared to deal with errors and inconsistencies.
Copying and pasting from Microsoft Word or HTML does not transfer the text cleanly to the new document. A lot of "funky formatting" is pasted along with it, which will skew text mining results. Many text editors accept CTRL + SHIFT + V (CMD + SHIFT + V on a Mac) to paste plain text.

Additional notes on text cleanup theory and practice can be found at http://www.scientificcomputing.com/blogs/2014/01/text-mining-next-data-frontier
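
The cleanup tips above can be sketched in a few lines of Python. This is a minimal illustration, not a complete cleaning pipeline: the function name clean_text and the sample text are hypothetical, and it assumes that any line made up only of digits is a page number. Removing forewords, captions, and chapter headings usually needs rules specific to each source and is not attempted here.

```python
import re

def clean_text(raw):
    """Lowercase text and drop lines that contain only a page number."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        # Assumption: a line consisting solely of digits is a page number.
        if re.fullmatch(r"\d+", stripped):
            continue
        kept.append(stripped)
    text = " ".join(kept)
    # Collapse runs of whitespace left behind by copying and pasting.
    text = re.sub(r"\s+", " ", text).strip()
    # Change all capitalization to lower case.
    return text.lower()

# Hypothetical sample with two stray page numbers and uneven spacing.
sample = "Chapter One\n12\nIt was  a Dark and\nStormy Night.\n13\n"
cleaned = clean_text(sample)
print(cleaned)
```

Running the function on a small sample before processing a whole corpus makes it easier to spot rules that remove too much or too little.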

Common Text Mining Terms

Common Terms Used in Text Mining Literature and Conversation

Corpus - "corpus, plural corpora (n.) (1) A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting point of linguistic description or as a means of verifying hypotheses about a language (corpus linguistics). ... Corpora provide the basis for one kind of computational linguistics. A computer corpus is a large body of machine-readable texts. Increasingly large corpora (especially of English) have been compiled since the 1980s, and are used both in the development of natural language processing software and in such applications as lexicography, speech recognition, and machine translation." - A Dictionary of Linguistics and Phonetics, 6th Edition

Lemma and Lexeme - "A lemma is the word you find in the dictionary. A lexeme is a unit of meaning, and can be more than one word. A lexeme is the set of all forms that have the same meaning, while lemma refers to the particular form that is chosen by convention to represent the lexeme... In English, for example, run, runs, ran and running are forms of the same lexeme, but run is the lemma." - https://simple.wikipedia.org/wiki/Lemma_(linguistics)

Stoplist - "a set of words automatically omitted from a computer search, concordance, or index, usu. the most frequent words, which would slow down processing or make results less satisfactory; also written stoplist" - http://dictionary.reference.com/browse/stop-list

Tokenization - "2) In lexical study, a term used as part of a measure of lexical density. The type/token ratio is the ratio of the total number of different words (types) to the total number of words (tokens) in a sample of text." - A Dictionary of Linguistics and Phonetics, 6th Edition
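
The terms above can be made concrete with a short Python sketch. It is illustrative only: the STOPLIST and LEMMAS tables are tiny, hypothetical stand-ins for the much larger resources used in real projects, and the tokenizer assumes English text where a word is a run of letters or apostrophes.

```python
import re

# Hypothetical, deliberately tiny stoplist; real stoplists are much longer.
STOPLIST = {"the", "a", "and", "of", "on"}

# Hypothetical lemma table built from the run/runs/ran/running example.
LEMMAS = {"runs": "run", "ran": "run", "running": "run"}

def tokenize(text):
    """Split lowercased text into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stopwords(tokens, stoplist=STOPLIST):
    """Drop every token that appears in the stoplist."""
    return [t for t in tokens if t not in stoplist]

def lemmatize(tokens, lemmas=LEMMAS):
    """Replace each known word form with its lemma; leave others unchanged."""
    return [lemmas.get(t, t) for t in tokens]

def type_token_ratio(tokens):
    """Number of different words (types) divided by total words (tokens)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

tokens = tokenize("The cat sat on the mat and the dog sat on the rug")
content = remove_stopwords(tokens)
print(len(tokens), len(set(tokens)))  # tokens and types before filtering
print(content)
print(type_token_ratio(tokens))
print(lemmatize(["run", "runs", "ran", "running"]))
```

Applying the stoplist before computing the ratio changes the result, so it helps to record which steps were applied and in what order.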

Definition of Text Mining

Per Catherine Blake in Annual Review of Information Science and Technology, 2011, Vol.45(1), pp.121-155

"a text mining system will allow a user to explore and interact with patterns of information from within a collection of potentially relevant documents. The goal of a text mining system is to help a user identify meaningful patterns and learn about the information space related to the information need."

Thanks to an ever-expanding corpus of materials available to researchers and the ubiquity of computing power, researchers can assess, quantify, and code the words of a specific author, entity, field, or nation and find new insights into history, sociology, and more. Whether exploring the content of the UMD Statesman, the lyrics of a specific artist, or the corpus of a specific author, we can now find insights that would otherwise be hidden by the sheer scale of the material.

Text Mining Tools

Voyant Reader
Voyant Tools is a web-based reading and analysis environment for digital texts. It creates word clouds, tokenizes text, counts word occurrences, and can show words in context.
TextCleanr
Useful tool to quickly format and homogenize characters in a corpus - "The quick, easy, web based way to fix and clean up text when copying and pasting between applications. Remove email indents, find and replace, clean up spacing, line breaks, word characters and more. Perfect for tablets or mobile devices."
Lexos
A robust text cleaner for HTML, TXT, SGML documents with built-in analysis and visualization tools.
Juxta
Juxta is a downloadable program that provides side by side comparisons of different versions of texts.
Google Refine or Open Refine
A fantastic tool for cleaning and editing CSV files. Available free through GitHub.
Hathi Trust Bookworm
HathiTrust Bookworm is a visualization tool for the HathiTrust collection of 4.6 million public domain texts.

Example of Text Mining Analysis (Voyant)

Text Mining Corpora (Text Locations)

Data For Research (JSTOR)
"Data for Research is a free service for researchers wishing to analyze content on JSTOR through a variety of lenses and perspectives. DfR enables researchers to find useful patterns, associations and unforeseen relationships in the body of research available in the journal and pamphlet archives on JSTOR. To this end we provide data sets of documents to researchers: OCR, metadata, Key Terms, N-grams and reference text."

After creating a free account, users can submit requests for mining and analyzing JSTOR content. Users can choose to receive the following results:

Citations Only (all requests come with citations by default)
Word Counts
Bigrams
Trigrams
Quadgrams
Key Terms
References
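
For readers unfamiliar with the bigram, trigram, and quadgram options listed above, the following Python sketch shows how such n-gram counts can be produced from a list of tokens. It illustrates the general idea only and does not describe how JSTOR builds its datasets; the function name ngrams and the sample sentence are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample; a real corpus would be cleaned and tokenized first.
tokens = "to be or not to be".split()

bigram_counts = Counter(ngrams(tokens, 2))    # pairs of adjacent words
trigram_counts = Counter(ngrams(tokens, 3))   # runs of three words
quadgram_counts = Counter(ngrams(tokens, 4))  # runs of four words

print(bigram_counts.most_common(1))
print(len(trigram_counts), len(quadgram_counts))
```

A text shorter than n tokens simply yields no n-grams, so the function needs no special case for short inputs.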
Hathi Trust Datasets
HathiTrust contains 13.8 million volumes of public and private items (622 terabytes, 39% of which is in the public domain).
Google N-Grams
"When you enter phrases into the Google Books Ngram Viewer, it displays a graph showing how those phrases have occurred in a corpus of books (e.g., "British English", "English Fiction", "French") over the selected years." It uses the accumulated data of the Google Books digitization project.
TCP - Text Creation Partnership
"EEBO-TCP is a partnership with ProQuest and with more than 150 libraries to generate highly accurate, fully-searchable, SGML/XML-encoded texts corresponding to books from the Early English Books Online Database... We transcribe and mark up the text from the millions of page images in ProQuest's Early English Books Online, Gale Cengage's Eighteenth Century Collections Online, and Readex's Evans Early American Imprints."
Project Gutenberg
"Project Gutenberg offers over 46,000 free ebooks: choose among free epub books, free kindle books, download them or read them online."
Directory of Online Books from UPenn
A directory leading to many more collections of free ebooks.

Directory of Open Access Books (DOAB)
The primary aim of DOAB is to increase discoverability of Open Access books. The directory is open to all publishers who publish academic, peer reviewed books in Open Access and should contain as many books as possible, provided that these publications are in Open Access and meet academic standards.

Many Books E-Book Collections
A collection of Project Gutenberg ebooks divided into handy categories.
Elsevier's Data Mining Policy

Early English Books Online (EEBO)
Access digital copies of works printed in the British Isles and North America, as well as works in English printed elsewhere from 1470-1700. From the first book printed in English through to the ages of Spenser, Shakespeare and of the English Civil War, EEBO's content draws on authoritative and respected short-title catalogues of the period. Subject coverage includes English literature, history, philosophy, linguistics, religion, politics, and government.
Gale Primary Sources
To download large datasets, a UMD librarian will have to request the data on your behalf from our Gale sales representative. It can take up to 3 weeks to process requests. Gale will send a hard drive containing the requested data to the libraries for you to use.
The following databases are searched at once via Gale Primary Sources:
- 17th and 18th Century Burney Collection
- British Library Newspapers
- Eighteenth Century Collections Online
- Financial Times Historical Archive
- International Herald Tribune Historical Archive 1887-2013
- Nineteenth Century Collections Online
- Nineteenth Century U.S. Newspapers
- Punch Historical Archive, 1841-1992
- The Economist Historical Archive
- The Illustrated London News Historical Archive, 1842-2003
- The Making of the Modern World
- The Times Digital Archive
SpringerLink
Online journals, e-books, conference proceedings, and reference titles from Springer Publishing. Covers many disciplines, with separate collections for architecture and design, behavioral science, biomedical and life sciences, business and economics, chemistry and materials sciences, computer science and information technology, environmental sciences, engineering, humanities, social sciences, law, mathematics and statistics, medicine, operational research, and physics and astronomy.

Open Source Shakespeare
Open Source Shakespeare attempts to be the best free Web site containing Shakespeare's complete works. It is intended for scholars, thespians, and Shakespeare lovers of every kind. OSS includes the 1864 Globe Edition of the complete works, which was the definitive single-volume Shakespeare edition for over a half-century.

Kathryn A. Martin Library
