Text analysis: classification and topic modeling
MACS 30500
University of Chicago
November 22, 2017
Supervised learning
- Hand-code a small set of documents (\(N = 1000\))
- Train a statistical learning model on the hand-coded data
- Evaluate the effectiveness of the statistical learning model
- Apply the final model to the remaining set of documents (\(N = 1000000\))
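The workflow above can be sketched in base R. The data frames and the logistic model here are toy stand-ins for illustration, not the text pipeline used later in these slides.

```r
set.seed(1)

# Toy stand-ins: a small hand-coded set and a large uncoded corpus,
# each reduced to one numeric feature for illustration
coded   <- data.frame(x = rnorm(100), y = rbinom(100, 1, 0.5))
uncoded <- data.frame(x = rnorm(1000))

fit <- glm(y ~ x, data = coded, family = binomial)   # train on hand-coded data
accuracy <- mean(round(fitted(fit)) == coded$y)      # evaluate (in-sample here)
uncoded$pred <- predict(fit, uncoded, type = "response") > 0.5  # apply to the rest
```

In practice the evaluation step would use held-out data (or out-of-bag error, as below) rather than in-sample accuracy.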
USCongress
## Classes 'tbl_df', 'tbl' and 'data.frame': 4449 obs. of 6 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ cong : int 107 107 107 107 107 107 107 107 107 107 ...
## $ billnum : int 4499 4500 4501 4502 4503 4504 4505 4506 4507 4508 ...
## $ h_or_sen: Factor w/ 2 levels "HR","S": 1 1 1 1 1 1 1 1 1 1 ...
## $ major : int 18 18 18 18 5 21 15 18 18 18 ...
## $ text : chr "To suspend temporarily the duty on Fast Magenta 2 Stage." "To suspend temporarily the duty on Fast Black 286 Stage." "To suspend temporarily the duty on mixtures of Fluazinam." "To reduce temporarily the duty on Prodiamine Technical." ...
- Set of hand-coded bills from US Congress
- Text description
- Major policy topic
Create tidy text data frame
library(tidyverse)
library(tidytext)

(congress_tokens <- congress %>%
  unnest_tokens(output = word, input = text) %>%   # one row per token
  filter(!str_detect(word, "^[0-9]*$")) %>%        # drop purely numeric tokens
  anti_join(stop_words) %>%                        # remove stop words
  mutate(word = SnowballC::wordStem(word)))        # stem each word
## # A tibble: 58,820 x 6
## ID cong billnum h_or_sen major word
## <int> <int> <int> <fctr> <int> <chr>
## 1 1 107 4499 HR 18 suspend
## 2 1 107 4499 HR 18 temporarili
## 3 1 107 4499 HR 18 duti
## 4 1 107 4499 HR 18 fast
## 5 1 107 4499 HR 18 magenta
## 6 1 107 4499 HR 18 stage
## 7 2 107 4500 HR 18 suspend
## 8 2 107 4500 HR 18 temporarili
## 9 2 107 4500 HR 18 duti
## 10 2 107 4500 HR 18 fast
## # ... with 58,810 more rows
Create document-term matrix
(congress_dtm <- congress_tokens %>%
count(ID, word) %>%
cast_dtm(document = ID, term = word, value = n))
## <<DocumentTermMatrix (documents: 4449, terms: 4902)>>
## Non-/sparse entries: 55033/21753965
## Sparsity : 100%
## Maximal term length: 24
## Weighting : term frequency (tf)
Weighting
- Term frequency (tf)
- Term frequency-inverse document frequency (tf-idf)
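A minimal base-R sketch of the difference, using an invented two-document corpus: tf is a term's within-document share, while idf down-weights terms that appear in every document.

```r
# Invented term counts for two tiny documents
docs <- list(
  d1 = c(duty = 2, suspend = 1),
  d2 = c(duty = 1, kitten = 3)
)
n_docs <- length(docs)

# idf(w) = log(number of documents / number of documents containing w)
idf <- function(w) {
  log(n_docs / sum(vapply(docs, function(d) w %in% names(d), logical(1))))
}

tf_idf <- lapply(docs, function(counts) {
  tf <- counts / sum(counts)   # term frequency within the document
  tf * vapply(names(counts), idf, numeric(1))
})

tf_idf$d1["duty"]   # "duty" is in every document, so idf = log(2/2) = 0
```

With tidy data frames like `congress_tokens`, the same weights come from `tidytext::bind_tf_idf()` or, as on the next slide, from the `weighting` argument when casting to a document-term matrix.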
Weighting
congress_tokens %>%
count(ID, word) %>%
cast_dtm(document = ID, term = word, value = n,
weighting = tm::weightTfIdf)
## <<DocumentTermMatrix (documents: 4449, terms: 4902)>>
## Non-/sparse entries: 55033/21753965
## Sparsity : 100%
## Maximal term length: 24
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
Exploratory analysis
Estimate model
congress_rf <- train(x = as.matrix(congress_dtm),
y = factor(congress$major),
method = "rf",
ntree = 200,
trControl = trainControl(method = "oob"))
Evaluate model
congress_rf$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = 200, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 209
##
## OOB estimate of error rate: 34.1%
## Confusion matrix:
## 1 2 3 4 5 6 7 8 10 12 13 14 15 16 17 18 19 20 21 99
## 1 97 0 3 0 3 3 2 5 5 11 1 2 13 3 1 5 0 7 2 0
## 2 1 13 6 1 5 4 4 1 3 6 3 2 8 7 3 1 1 11 4 0
## 3 4 1 532 4 14 9 6 1 4 7 6 2 7 10 1 0 1 5 3 0
## 4 2 1 8 91 4 2 5 1 1 1 0 1 3 0 2 5 1 1 4 0
## 5 5 3 13 2 153 10 4 2 10 13 2 1 7 7 6 7 5 5 6 1
## 6 9 1 7 0 11 159 1 0 1 8 1 1 6 3 2 5 2 1 4 0
## 7 2 4 4 5 4 3 101 5 9 8 1 1 7 3 5 4 3 6 25 1
## 8 6 2 1 1 2 1 4 100 3 3 0 2 4 1 0 1 0 2 5 0
## 10 6 0 3 2 4 1 6 2 96 12 0 0 6 3 4 6 3 12 5 0
## 12 10 0 19 4 13 6 9 1 8 143 4 3 14 4 5 7 6 28 4 3
## 13 4 0 6 0 5 1 3 2 1 3 60 3 1 0 1 1 0 2 1 0
## 14 2 0 1 2 5 2 3 1 4 1 1 46 3 2 1 2 0 3 1 0
## 15 13 3 7 5 14 2 9 4 6 19 1 2 145 6 6 12 6 15 3 1
## 16 1 5 4 0 6 3 3 1 7 5 2 2 6 133 1 9 7 18 6 0
## 17 5 0 3 1 6 3 3 1 1 8 0 3 4 2 36 1 2 6 4 1
## 18 0 0 0 2 1 2 4 2 3 1 0 0 2 2 0 373 7 2 1 0
## 19 2 0 3 3 11 5 5 0 4 9 0 0 4 6 0 9 46 6 8 0
## 20 12 2 7 1 15 5 7 1 7 20 1 3 10 14 7 12 2 235 18 1
## 21 6 4 6 3 4 4 26 2 9 7 1 4 4 11 3 9 5 16 348 0
## 99 0 0 0 0 0 0 0 0 1 0 0 0 0 1 2 0 0 0 1 25
## class.error
## 1 0.4049080
## 2 0.8452381
## 3 0.1377634
## 4 0.3157895
## 5 0.4160305
## 6 0.2837838
## 7 0.4975124
## 8 0.2753623
## 10 0.4385965
## 12 0.5085911
## 13 0.3617021
## 14 0.4250000
## 15 0.4802867
## 16 0.3926941
## 17 0.6000000
## 18 0.0721393
## 19 0.6198347
## 20 0.3815789
## 21 0.2627119
## 99 0.1666667
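Each `class.error` value is the off-diagonal share of that class's row of the confusion matrix. Taking the row for class 18 from the output above:

```r
# Row for class 18 of the confusion matrix above; 373 is the diagonal entry
row18 <- c(0, 0, 0, 2, 1, 2, 4, 2, 3, 1, 0, 0, 2, 2, 0, 373, 7, 2, 1, 0)
class_error <- 1 - 373 / sum(row18)
round(class_error, 7)   # 0.0721393, matching the output above
```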
Evaluate model
Topic modeling
- Keywords
- Links
- Themes
- Probabilistic topic models
- Latent Dirichlet allocation
Food and animals
- I ate a banana and spinach smoothie for breakfast.
- I like to eat broccoli and bananas.
- Chinchillas and kittens are cute.
- My sister adopted a kitten yesterday.
- Look at this cute hamster munching on a piece of broccoli.
LDA document structure
- Decide on the number of words \(N\) the document will have
- Generate each word in the document:
- Pick a topic
- Generate the word
- LDA backtracks from this generative process to infer the topics and topic mixtures most likely to have produced the observed documents
Food and animals
- Decide that \(D\) will be 1/2 about food and 1/2 about cute animals.
- Pick 5 to be the number of words in \(D\).
- Pick the first word to come from the food topic
- Pick the second word to come from the cute animals topic
- Pick the third word to come from the cute animals topic
- Pick the fourth word to come from the food topic
- Pick the fifth word to come from the food topic
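The generative story above can be simulated directly in base R. The topic word lists and probabilities here are invented for illustration.

```r
set.seed(123)

# Invented word distributions for the two topics
topics <- list(
  food    = c(banana = 0.3, broccoli = 0.3, smoothie = 0.2, spinach = 0.2),
  animals = c(kitten = 0.4, chinchilla = 0.3, hamster = 0.3)
)
doc_mix <- c(food = 0.5, animals = 0.5)   # D is half food, half cute animals

n_words <- 5                              # the chosen document length
words <- replicate(n_words, {
  z <- sample(names(doc_mix), 1, prob = doc_mix)      # pick a topic
  sample(names(topics[[z]]), 1, prob = topics[[z]])   # pick a word from it
})
words
```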
LDA with a known topic structure
- Great Expectations by Charles Dickens
- The War of the Worlds by H.G. Wells
- Twenty Thousand Leagues Under the Sea by Jules Verne
- Pride and Prejudice by Jane Austen
topicmodels
## <<DocumentTermMatrix (documents: 193, terms: 18215)>>
## Non-/sparse entries: 104722/3410773
## Sparsity : 97%
## Maximal term length: 19
## Weighting : term frequency (tf)
Terms associated with each topic
Per-document classification
Consensus topic
## # A tibble: 4 x 2
## consensus topic
## <chr> <int>
## 1 Great Expectations 4
## 2 Pride and Prejudice 1
## 3 The War of the Worlds 3
## 4 Twenty Thousand Leagues under the Sea 2
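One way to derive such a consensus, sketched here with invented per-chapter assignments, is a majority vote over the topic each chapter of a book was assigned to:

```r
# Invented: the most-likely topic for each chapter of one book
chapter_topics <- c(4, 4, 4, 1, 4)
consensus <- as.integer(names(which.max(table(chapter_topics))))
consensus   # topic 4 wins the vote
```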
Mis-identification
chapter_classifications %>%
inner_join(book_topics, by = "topic") %>%
count(title, consensus) %>%
knitr::kable()
| title                                 | consensus                             |  n |
|:--------------------------------------|:--------------------------------------|---:|
| Great Expectations                    | Great Expectations                    | 57 |
| Great Expectations                    | Pride and Prejudice                   |  1 |
| Great Expectations                    | The War of the Worlds                 |  1 |
| Pride and Prejudice                   | Pride and Prejudice                   | 61 |
| The War of the Worlds                 | The War of the Worlds                 | 27 |
| Twenty Thousand Leagues under the Sea | Twenty Thousand Leagues under the Sea | 46 |
Incorrectly classified words
| title                                 | Great Expectations | Pride and Prejudice | The War of the Worlds | Twenty Thousand Leagues under the Sea |
|:--------------------------------------|-------------------:|--------------------:|----------------------:|--------------------------------------:|
| Great Expectations                    |              49656 |                3908 |                  1923 |                                    81 |
| Pride and Prejudice                   |                  1 |               37231 |                     6 |                                     4 |
| The War of the Worlds                 |                  0 |                   0 |                 22561 |                                     7 |
| Twenty Thousand Leagues under the Sea |                  0 |                   5 |                     0 |                                 39629 |
Most commonly mistaken words
## # A tibble: 3,551 x 4
## title consensus term n
## <chr> <chr> <chr> <dbl>
## 1 Great Expectations Pride and Prejudice love 44
## 2 Great Expectations Pride and Prejudice sergeant 37
## 3 Great Expectations Pride and Prejudice lady 32
## 4 Great Expectations Pride and Prejudice miss 26
## 5 Great Expectations The War of the Worlds boat 25
## 6 Great Expectations The War of the Worlds tide 20
## 7 Great Expectations The War of the Worlds water 20
## 8 Great Expectations Pride and Prejudice father 19
## 9 Great Expectations Pride and Prejudice baby 18
## 10 Great Expectations Pride and Prejudice flopson 18
## # ... with 3,541 more rows
Associated Press articles
## <<DocumentTermMatrix (documents: 2246, terms: 10134)>>
## Non-/sparse entries: 259208/22501756
## Sparsity : 99%
## Maximal term length: 18
## Weighting : term frequency (tf)
Perplexity
- A statistical measure of how well a probability model predicts a sample
- Compares the theoretical word distributions represented by the topics to the actual distribution of words in the documents
- Perplexity for LDA model with 12 topics
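Concretely, perplexity is the exponential of the average negative log-likelihood per word. This base-R sketch uses invented per-word probabilities; with a fitted model, `topicmodels` computes the same quantity from the model's likelihood.

```r
# Invented: probabilities the fitted model assigns to four observed words
word_probs <- c(0.20, 0.10, 0.05, 0.25)

# perplexity = exp(-(mean log-likelihood per word))
perplexity <- exp(-mean(log(word_probs)))
perplexity   # lower values mean the model predicts the words better
```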
Topics from \(k=100\)