Text analysis: classification and topic modeling

MACS 30500 University of Chicago

November 22, 2017

Supervised learning

  1. Hand-code a small set of documents (\(N = 1000\))
  2. Train a statistical learning model on the hand-coded data
  3. Evaluate the effectiveness of the statistical learning model
  4. Apply the final model to the remaining set of documents (\(N = 1000000\))
  • Text classification

USCongress

## Classes 'tbl_df', 'tbl' and 'data.frame':    4449 obs. of  6 variables:
##  $ ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ cong    : int  107 107 107 107 107 107 107 107 107 107 ...
##  $ billnum : int  4499 4500 4501 4502 4503 4504 4505 4506 4507 4508 ...
##  $ h_or_sen: Factor w/ 2 levels "HR","S": 1 1 1 1 1 1 1 1 1 1 ...
##  $ major   : int  18 18 18 18 5 21 15 18 18 18 ...
##  $ text    : chr  "To suspend temporarily the duty on Fast Magenta 2 Stage." "To suspend temporarily the duty on Fast Black 286 Stage." "To suspend temporarily the duty on mixtures of Fluazinam." "To reduce temporarily the duty on Prodiamine Technical." ...
  • Set of hand-coded bills from US Congress
  • Text description
  • Major policy topic

Create tidy text data frame

(congress_tokens <- congress %>%
   unnest_tokens(output = word, input = text) %>%    # tokenize the bill text
   filter(!str_detect(word, "^[0-9]*$")) %>%         # drop purely numeric tokens
   anti_join(stop_words) %>%                         # remove common stop words
   mutate(word = SnowballC::wordStem(word)))         # stem each word
## # A tibble: 58,820 x 6
##       ID  cong billnum h_or_sen major        word
##    <int> <int>   <int>   <fctr> <int>       <chr>
##  1     1   107    4499       HR    18     suspend
##  2     1   107    4499       HR    18 temporarili
##  3     1   107    4499       HR    18        duti
##  4     1   107    4499       HR    18        fast
##  5     1   107    4499       HR    18     magenta
##  6     1   107    4499       HR    18       stage
##  7     2   107    4500       HR    18     suspend
##  8     2   107    4500       HR    18 temporarili
##  9     2   107    4500       HR    18        duti
## 10     2   107    4500       HR    18        fast
## # ... with 58,810 more rows

Create document-term matrix

(congress_dtm <- congress_tokens %>%
   count(ID, word) %>%
   cast_dtm(document = ID, term = word, value = n))
## <<DocumentTermMatrix (documents: 4449, terms: 4902)>>
## Non-/sparse entries: 55033/21753965
## Sparsity           : 100%
## Maximal term length: 24
## Weighting          : term frequency (tf)

Weighting

  • Term frequency (tf)
  • Term frequency-inverse document frequency (tf-idf)
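
To make the two weighting schemes concrete, here is a hand-rolled tf-idf calculation on a toy corpus in base R. The documents and counts are invented for illustration; following tm::weightTfIdf, tf is normalized by document length and idf uses log base 2 (the base is a convention, not essential).

```r
# Toy corpus: three "documents" as named term-count vectors
docs <- list(
  d1 = c(duty = 2, suspend = 1, magenta = 1),
  d2 = c(duty = 2, suspend = 1, black = 1),
  d3 = c(kitten = 1, cute = 2)
)
vocab <- unique(unlist(lapply(docs, names)))

# Term-frequency matrix (documents x terms), zero-filled
tf <- t(sapply(docs, function(d) {
  counts <- setNames(numeric(length(vocab)), vocab)
  counts[names(d)] <- d
  counts
}))

# Normalized tf: each term's share of its document
tf_norm <- tf / rowSums(tf)

# idf: log2(number of documents / number of documents containing the term)
idf <- log2(nrow(tf) / colSums(tf > 0))

tf_idf <- sweep(tf_norm, 2, idf, `*`)

# "duty" appears in 2 of 3 documents (idf = log2(3/2)), while "magenta"
# appears in only 1 (idf = log2(3)), so magenta is weighted higher in d1
# even though duty occurs more often there
round(tf_idf["d1", c("duty", "magenta")], 3)
```

This is why tf-idf downweights terms like "duty" that appear across many bills while boosting terms distinctive to a single document.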

Weighting

congress_tokens %>%
  count(ID, word) %>%
  cast_dtm(document = ID, term = word, value = n,
           weighting = tm::weightTfIdf)
## <<DocumentTermMatrix (documents: 4449, terms: 4902)>>
## Non-/sparse entries: 55033/21753965
## Sparsity           : 100%
## Maximal term length: 24
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

Sparsity

removeSparseTerms(congress_dtm, sparse = .99)
## <<DocumentTermMatrix (documents: 4449, terms: 209)>>
## Non-/sparse entries: 33794/896047
## Sparsity           : 96%
## Maximal term length: 11
## Weighting          : term frequency (tf)
removeSparseTerms(congress_dtm, sparse = .95)
## <<DocumentTermMatrix (documents: 4449, terms: 28)>>
## Non-/sparse entries: 18447/106125
## Sparsity           : 85%
## Maximal term length: 11
## Weighting          : term frequency (tf)
removeSparseTerms(congress_dtm, sparse = .90)
## <<DocumentTermMatrix (documents: 4449, terms: 16)>>
## Non-/sparse entries: 14917/56267
## Sparsity           : 79%
## Maximal term length: 9
## Weighting          : term frequency (tf)
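
What removeSparseTerms is doing can be sketched by hand on a small count matrix: a term is dropped when it is absent from more than the `sparse` share of documents. The matrix below is made up for illustration, and the "at or below the threshold" keep rule is a simplification of tm's exact boundary behavior.

```r
m <- matrix(
  c(2, 1, 0, 0,
    0, 1, 0, 0,
    0, 0, 3, 1,
    1, 0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("duty", "suspend", "kitten", "cute"))
)

# Overall sparsity: share of zero entries in the matrix
sparsity <- mean(m == 0)

# Per-term document frequency, as a share of documents
doc_freq <- colMeans(m > 0)

# Keep terms whose zero-share (1 - doc_freq) is at or below the threshold
keep <- (1 - doc_freq) <= 0.5
m_dense <- m[, keep, drop = FALSE]
colnames(m_dense)
```

Lowering the threshold keeps only terms that appear in a larger share of documents, which is why the term count above falls from 4902 to 209 to 28 to 16.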

Exploratory analysis

Estimate model

congress_rf <- train(x = as.matrix(removeSparseTerms(congress_dtm, sparse = .99)),
                     y = factor(congress$major),
                     method = "rf",
                     ntree = 200,
                     trControl = trainControl(method = "oob"))

Evaluate model

congress_rf$finalModel
## 
## Call:
##  randomForest(x = x, y = y, ntree = 200, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 209
## 
##         OOB estimate of  error rate: 34.1%
## Confusion matrix:
##     1  2   3  4   5   6   7   8 10  12 13 14  15  16 17  18 19  20  21 99
## 1  97  0   3  0   3   3   2   5  5  11  1  2  13   3  1   5  0   7   2  0
## 2   1 13   6  1   5   4   4   1  3   6  3  2   8   7  3   1  1  11   4  0
## 3   4  1 532  4  14   9   6   1  4   7  6  2   7  10  1   0  1   5   3  0
## 4   2  1   8 91   4   2   5   1  1   1  0  1   3   0  2   5  1   1   4  0
## 5   5  3  13  2 153  10   4   2 10  13  2  1   7   7  6   7  5   5   6  1
## 6   9  1   7  0  11 159   1   0  1   8  1  1   6   3  2   5  2   1   4  0
## 7   2  4   4  5   4   3 101   5  9   8  1  1   7   3  5   4  3   6  25  1
## 8   6  2   1  1   2   1   4 100  3   3  0  2   4   1  0   1  0   2   5  0
## 10  6  0   3  2   4   1   6   2 96  12  0  0   6   3  4   6  3  12   5  0
## 12 10  0  19  4  13   6   9   1  8 143  4  3  14   4  5   7  6  28   4  3
## 13  4  0   6  0   5   1   3   2  1   3 60  3   1   0  1   1  0   2   1  0
## 14  2  0   1  2   5   2   3   1  4   1  1 46   3   2  1   2  0   3   1  0
## 15 13  3   7  5  14   2   9   4  6  19  1  2 145   6  6  12  6  15   3  1
## 16  1  5   4  0   6   3   3   1  7   5  2  2   6 133  1   9  7  18   6  0
## 17  5  0   3  1   6   3   3   1  1   8  0  3   4   2 36   1  2   6   4  1
## 18  0  0   0  2   1   2   4   2  3   1  0  0   2   2  0 373  7   2   1  0
## 19  2  0   3  3  11   5   5   0  4   9  0  0   4   6  0   9 46   6   8  0
## 20 12  2   7  1  15   5   7   1  7  20  1  3  10  14  7  12  2 235  18  1
## 21  6  4   6  3   4   4  26   2  9   7  1  4   4  11  3   9  5  16 348  0
## 99  0  0   0  0   0   0   0   0  1   0  0  0   0   1  2   0  0   0   1 25
##    class.error
## 1    0.4049080
## 2    0.8452381
## 3    0.1377634
## 4    0.3157895
## 5    0.4160305
## 6    0.2837838
## 7    0.4975124
## 8    0.2753623
## 10   0.4385965
## 12   0.5085911
## 13   0.3617021
## 14   0.4250000
## 15   0.4802867
## 16   0.3926941
## 17   0.6000000
## 18   0.0721393
## 19   0.6198347
## 20   0.3815789
## 21   0.2627119
## 99   0.1666667

Evaluate model

Topic modeling

  • Keywords
  • Links
  • Themes
  • Probabilistic topic models
    • Latent Dirichlet allocation

Food and animals

  1. I ate a banana and spinach smoothie for breakfast.
  2. I like to eat broccoli and bananas.
  3. Chinchillas and kittens are cute.
  4. My sister adopted a kitten yesterday.
  5. Look at this cute hamster munching on a piece of broccoli.

LDA document structure

  • Decide on the number of words \(N\) the document will have
  • Generate each word in the document:
    • Pick a topic
    • Generate the word
  • LDA backtracks from this generative process: given the observed documents, it infers the topics and topic mixtures most likely to have produced them
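
The generative story above can be sketched in a few lines of base R. The topic-word probabilities and the 50/50 topic mixture here are made up for illustration; a fitted LDA model would estimate these from data.

```r
set.seed(123)

# Two invented topics, each a probability distribution over words
topics <- list(
  food    = c(banana = 0.4, broccoli = 0.4, smoothie = 0.2),
  animals = c(kitten = 0.5, chinchilla = 0.3, hamster = 0.2)
)

# The document's topic mixture: half food, half cute animals
topic_mix <- c(food = 0.5, animals = 0.5)

n_words <- 5
# For each word: first pick a topic from the mixture,
# then pick a word from that topic's distribution
doc <- replicate(n_words, {
  z <- sample(names(topic_mix), 1, prob = topic_mix)
  sample(names(topics[[z]]), 1, prob = topics[[z]])
})
doc
```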

Food and animals

  • Decide that \(D\) will be 1/2 about food and 1/2 about cute animals.
  • Pick 5 to be the number of words in \(D\).
  • Pick the first word to come from the food topic
  • Pick the second word to come from the cute animals topic
  • Pick the third word to come from the cute animals topic
  • Pick the fourth word to come from the food topic
  • Pick the fifth word to come from the food topic

LDA with a known topic structure

  • Great Expectations by Charles Dickens
  • The War of the Worlds by H.G. Wells
  • Twenty Thousand Leagues Under the Sea by Jules Verne
  • Pride and Prejudice by Jane Austen

topicmodels

## <<DocumentTermMatrix (documents: 193, terms: 18215)>>
## Non-/sparse entries: 104722/3410773
## Sparsity           : 97%
## Maximal term length: 19
## Weighting          : term frequency (tf)

Terms associated with each topic

Per-document classification

Consensus topic

## # A tibble: 4 x 2
##                               consensus topic
##                                   <chr> <int>
## 1                    Great Expectations     4
## 2                   Pride and Prejudice     1
## 3                 The War of the Worlds     3
## 4 Twenty Thousand Leagues under the Sea     2

Mis-identification

chapter_classifications %>%
  inner_join(book_topics, by = "topic") %>%
  count(title, consensus) %>%
  knitr::kable()
| title                                 | consensus                             |  n |
|---------------------------------------|---------------------------------------|---:|
| Great Expectations                    | Great Expectations                    | 57 |
| Great Expectations                    | Pride and Prejudice                   |  1 |
| Great Expectations                    | The War of the Worlds                 |  1 |
| Pride and Prejudice                   | Pride and Prejudice                   | 61 |
| The War of the Worlds                 | The War of the Worlds                 | 27 |
| Twenty Thousand Leagues under the Sea | Twenty Thousand Leagues under the Sea | 46 |

Incorrectly classified words

| title                                 | Great Expectations | Pride and Prejudice | The War of the Worlds | Twenty Thousand Leagues under the Sea |
|---------------------------------------|-------------------:|--------------------:|----------------------:|--------------------------------------:|
| Great Expectations                    |              49656 |                3908 |                  1923 |                                    81 |
| Pride and Prejudice                   |                  1 |               37231 |                     6 |                                     4 |
| The War of the Worlds                 |                  0 |                   0 |                 22561 |                                     7 |
| Twenty Thousand Leagues under the Sea |                  0 |                   5 |                     0 |                                 39629 |

Most commonly mistaken words

## # A tibble: 3,551 x 4
##                 title             consensus     term     n
##                 <chr>                 <chr>    <chr> <dbl>
##  1 Great Expectations   Pride and Prejudice     love    44
##  2 Great Expectations   Pride and Prejudice sergeant    37
##  3 Great Expectations   Pride and Prejudice     lady    32
##  4 Great Expectations   Pride and Prejudice     miss    26
##  5 Great Expectations The War of the Worlds     boat    25
##  6 Great Expectations The War of the Worlds     tide    20
##  7 Great Expectations The War of the Worlds    water    20
##  8 Great Expectations   Pride and Prejudice   father    19
##  9 Great Expectations   Pride and Prejudice     baby    18
## 10 Great Expectations   Pride and Prejudice  flopson    18
## # ... with 3,541 more rows

Associated Press articles

## <<DocumentTermMatrix (documents: 2246, terms: 10134)>>
## Non-/sparse entries: 259208/22501756
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)

Perplexity

  • A statistical measure of how well a probability model predicts a sample
  • Compare the theoretical word distributions represented by the topics to the actual distribution of words in the documents
  • Perplexity for LDA model with 12 topics
    • 2277.8757156
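
Perplexity is the exponentiated average negative log-likelihood per word, so it equals the geometric mean of the inverse per-word probabilities; lower is better. A minimal hand computation in base R, with the per-word probabilities invented for illustration (topicmodels computes these from the fitted model):

```r
# Probability the model assigns to each observed word in a tiny "document"
word_probs <- c(0.1, 0.05, 0.2, 0.01, 0.1)

# exp of the mean negative log-likelihood per word
perplexity <- exp(-mean(log(word_probs)))
perplexity
```

A model that assigned every word probability 1 would have perplexity 1; the 2277.88 above reflects how spread out the AP vocabulary is across 12 topics.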

Topics from \(k=100\)