Text mining with n-gram variables
Abstract. Text mining is the process of turning free text into numerical variables and
then analyzing them with statistical techniques. We introduce the command
ngram, which implements the most common approach to text mining, the
"bag of words". An n-gram is a contiguous sequence of n words in a text.
Broadly speaking, ngram creates hundreds or thousands of variables, each
recording how often the corresponding n-gram occurs in a given text.
This is more useful than it sounds. We illustrate ngram with the
categorization of text answers from two open-ended questions.
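
The idea of one count variable per n-gram can be illustrated with a minimal Python sketch. This is not the ngram command's implementation or its syntax; the function names, the whitespace tokenization, and the two example texts are assumptions made purely for illustration, and real text mining usually adds steps such as punctuation removal, stemming, and dropping rare n-grams.

```python
from collections import Counter


def ngrams(text, n):
    """Return the contiguous n-word sequences in text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_counts(texts, max_n=2):
    """Build one count column per distinct n-gram (n = 1..max_n) found across all texts."""
    per_text = [
        Counter(g for n in range(1, max_n + 1) for g in ngrams(t, n))
        for t in texts
    ]
    # The column names are the sorted union of all n-grams seen in any text.
    vocabulary = sorted(set().union(*per_text))
    # Each row holds the counts for one text; absent n-grams count as 0.
    rows = [[counts[g] for g in vocabulary] for counts in per_text]
    return vocabulary, rows


# Hypothetical open-ended answers, made up for this sketch.
texts = ["the cat sat", "the cat saw the dog"]
vocabulary, rows = ngram_counts(texts)
for text, row in zip(texts, rows):
    print(text, dict(zip(vocabulary, row)))
```

Each row of counts can then serve as the predictor variables for a statistical learning model that categorizes the texts. Even these two short texts produce ten columns, which suggests how real data can yield hundreds or thousands of variables.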
Authors: Matthias Schonlau, Nick Guenther, Ilia Sucholutsky
Keywords: ngram, bag of words, sets of words, unigram, gram, statistical learning, machine learning