
The Stata Journal
Volume 17 Number 4: pp. 866-881




Text mining with n-gram variables

Matthias Schonlau
University of Waterloo
Waterloo, Canada
Nick Guenther
University of Waterloo
Waterloo, Canada
Ilia Sucholutsky
University of Waterloo
Waterloo, Canada
Abstract.  Text mining is the process of turning free text into numerical variables and then analyzing them with statistical techniques. We introduce the command ngram, which implements the most common approach to text mining, the "bag of words". An n-gram is a contiguous sequence of words in a text. Broadly speaking, ngram creates hundreds or thousands of variables, each recording how often the corresponding n-gram occurs in a given text. This is more useful than it sounds. We illustrate ngram with the categorization of text answers from two open-ended questions.
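To make the "bag of words" idea concrete, here is a minimal sketch in Python. It is not the ngram command itself, and it does not reproduce its options or output format. The function names ngram_counts and bag_of_ngrams, the lowercasing, and the whitespace tokenization are illustrative assumptions. The sketch only shows how each text becomes a row of counts, with one column per n-gram.

```python
from collections import Counter


def ngram_counts(text, n_max=2):
    """Count every 1- to n_max-gram (contiguous word sequence) in one text.

    Assumption: words are obtained by lowercasing and splitting on whitespace.
    """
    words = text.lower().split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


def bag_of_ngrams(texts, n_max=2):
    """Return (vocabulary, matrix): one row per text, one column per n-gram."""
    per_text = [ngram_counts(t, n_max) for t in texts]
    # The vocabulary is the sorted union of all n-grams seen in any text.
    vocabulary = sorted(set().union(*per_text))
    # A Counter returns 0 for n-grams that do not occur in a given text.
    matrix = [[c[g] for g in vocabulary] for c in per_text]
    return vocabulary, matrix


texts = ["the cat sat", "the cat ate the fish"]
vocabulary, matrix = bag_of_ngrams(texts)
for text, row in zip(texts, matrix):
    print(text, dict(zip(vocabulary, row)))
```

Even two short texts yield ten columns here, which suggests why a real corpus produces hundreds or thousands of variables.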


Keywords: ngram, bag of words, sets of words, unigram, gram, statistical learning, machine learning
