A sampling of Books from the UMD Library Collection on Text Mining
Clean up your text before analysis:
Additional notes on text cleanup theory and practice can be found at http://www.scientificcomputing.com/blogs/2014/01/text-mining-next-data-frontier
Common Terms Used in Text Mining Literature and Conversation
Corpus - "corpus, plural corpora (n.) (1) A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language (corpus linguistics). ... Corpora provide the basis for one kind of computational linguistics. A computer corpus is a large body of machine-readable texts. Increasingly large corpora (especially of English) have been compiled since the 1980s, and are used both in the development of natural language processing software and in such applications as lexicography, speech recognition, and machine translation." - A Dictionary of Linguistics and Phonetics, 6th Edition
Lemma and Lexeme - "A lemma is the word you find in the dictionary. A lexeme is a unit of meaning, and can be more than one word. A lexeme is the set of all forms that have the same meaning, while lemma refers to the particular form that is chosen by convention to represent the lexeme... In English, for example, run, runs, ran and running are forms of the same lexeme, but run is the lemma." - https://simple.wikipedia.org/wiki/Lemma_(linguistics)
Stoplist - "a set of words automatically omitted from a computer search, concordance, or index, usu. the most frequent words, which would slow down processing or make results less satisfactory; also written stoplist" - http://dictionary.reference.com/browse/stop-list
Tokenization - "2) In lexical study, a term used as part of a measure of lexical density. The type/token ratio is the ratio of the total number of different words (types) to the total number of words (tokens) in a sample of text." - A Dictionary of Linguistics and Phonetics, 6th Edition
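To make the stoplist and type/token ratio definitions above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a recommended workflow: the tiny STOPLIST, the sample sentence, and the function names tokenize, remove_stopwords and type_token_ratio are all hypothetical, and the tokenizer is a simple regular expression rather than a full linguistic tokenizer.

```python
import re

# Hypothetical stoplist for illustration only; real stoplists are much longer.
STOPLIST = {"the", "a", "an", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text and return its word tokens.

    A token here is a run of letters, optionally followed by an
    apostrophe and more letters (so "don't" stays one token).
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def remove_stopwords(tokens, stoplist=STOPLIST):
    """Return the tokens that do not appear in the stoplist."""
    return [t for t in tokens if t not in stoplist]


def type_token_ratio(tokens):
    """Number of different words (types) divided by total words (tokens)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Made-up sample text, used only to exercise the functions.
sample = "The cat sat on the mat and the dog sat on the rug."
tokens = tokenize(sample)
content = remove_stopwords(tokens)

print(len(tokens), "tokens,", len(set(tokens)), "types")
print("type/token ratio:", round(type_token_ratio(tokens), 3))
print("after stoplist:", content)
```

On this sample, removing the stoplist words raises the type/token ratio, because the most frequently repeated word ("the") is no longer counted. The ratio also depends on the length of the text, so it is most meaningful when comparing samples of similar size.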
Per Catherine Blake in Annual Review of Information Science and Technology, 2011, Vol. 45(1), pp. 121-155:
"a text mining system will allow a user to explore and interact with patterns of information from within a collection of potentially relevant documents. The goal of a text mining system is to help a user identify meaningful patterns and learn about the information space related to the information need."
Thanks to an ever-expanding corpus of materials and the ubiquity of computing power, researchers can assess, quantify and code the words of a specific author, entity, field or nation and find new insights into history, sociology and more. Whether exploring the content of the UMD Statesman, the lyrics of a specific artist, or the corpus of a specific author, we can now find insights that would otherwise be obscured by the sheer scale of the material.