Text analysis: classification and topic modeling

MACS 30500 University of Chicago

November 22, 2017

Supervised learning

  1. Hand-code a small set of documents (\(N = 1000\))
  2. Train a statistical learning model on the hand-coded data
  3. Evaluate the effectiveness of the statistical learning model
  4. Apply the final model to the remaining set of documents (\(N = 1000000\))
  • Text classification

USCongress

## Classes 'tbl_df', 'tbl' and 'data.frame':    4449 obs. of  6 variables:
##  $ ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ cong    : int  107 107 107 107 107 107 107 107 107 107 ...
##  $ billnum : int  4499 4500 4501 4502 4503 4504 4505 4506 4507 4508 ...
##  $ h_or_sen: Factor w/ 2 levels "HR","S": 1 1 1 1 1 1 1 1 1 1 ...
##  $ major   : int  18 18 18 18 5 21 15 18 18 18 ...
##  $ text    : chr  "To suspend temporarily the duty on Fast Magenta 2 Stage." "To suspend temporarily the duty on Fast Black 286 Stage." "To suspend temporarily the duty on mixtures of Fluazinam." "To reduce temporarily the duty on Prodiamine Technical." ...
  • Set of hand-coded bills from US Congress
  • Text description
  • Major policy topic

Create tidy text data frame

(congress_tokens <- congress %>%
   unnest_tokens(output = word, input = text) %>%    # tokenize the bill text
   filter(!str_detect(word, "^[0-9]*$")) %>%         # drop purely numeric tokens
   anti_join(stop_words) %>%                         # remove common stop words
   mutate(word = SnowballC::wordStem(word)))         # stem each word
## # A tibble: 58,820 x 6
##       ID  cong billnum h_or_sen major        word
##    <int> <int>   <int>   <fctr> <int>       <chr>
##  1     1   107    4499       HR    18     suspend
##  2     1   107    4499       HR    18 temporarili
##  3     1   107    4499       HR    18        duti
##  4     1   107    4499       HR    18        fast
##  5     1   107    4499       HR    18     magenta
##  6     1   107    4499       HR    18       stage
##  7     2   107    4500       HR    18     suspend
##  8     2   107    4500       HR    18 temporarili
##  9     2   107    4500       HR    18        duti
## 10     2   107    4500       HR    18        fast
## # ... with 58,810 more rows

Create document-term matrix

(congress_dtm <- congress_tokens %>%
   count(ID, word) %>%
   cast_dtm(document = ID, term = word, value = n))
## <<DocumentTermMatrix (documents: 4449, terms: 4902)>>
## Non-/sparse entries: 55033/21753965
## Sparsity           : 100%
## Maximal term length: 24
## Weighting          : term frequency (tf)

Weighting

  • Term frequency (tf)
  • Term frequency-inverse document frequency (tf-idf)
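
To make the two weighting schemes concrete, here is a hand-rolled tf-idf calculation on a toy corpus in base R. The documents and counts are invented for illustration; following tm::weightTfIdf, tf is normalized by document length and idf uses log base 2 (the base is a convention, not essential).

```r
# Toy corpus: three "documents" as named term-count vectors
docs <- list(
  d1 = c(duty = 2, suspend = 1, magenta = 1),
  d2 = c(duty = 2, suspend = 1, black = 1),
  d3 = c(kitten = 1, cute = 2)
)
vocab <- unique(unlist(lapply(docs, names)))

# Term-frequency matrix (documents x terms), zero-filled
tf <- t(sapply(docs, function(d) {
  counts <- setNames(numeric(length(vocab)), vocab)
  counts[names(d)] <- d
  counts
}))

# Normalized tf: each term's share of its document
tf_norm <- tf / rowSums(tf)

# idf: log2(number of documents / number of documents containing the term)
idf <- log2(nrow(tf) / colSums(tf > 0))

tf_idf <- sweep(tf_norm, 2, idf, `*`)

# "duty" appears in 2 of 3 documents (idf = log2(3/2)), while "magenta"
# appears in only 1 (idf = log2(3)), so magenta is weighted higher in d1
# even though duty occurs more often there
round(tf_idf["d1", c("duty", "magenta")], 3)
```

This is why tf-idf downweights terms like "duty" that appear across many bills while boosting terms distinctive to a single document.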

Weighting

congress_tokens %>%
  count(ID, word) %>%
  cast_dtm(document = ID, term = word, value = n,
           weighting = tm::weightTfIdf)
## <<DocumentTermMatrix (documents: 4449, terms: 4902)>>
## Non-/sparse entries: 55033/21753965
## Sparsity           : 100%
## Maximal term length: 24
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

Sparsity

removeSparseTerms(congress_dtm, sparse = .99)
## <<DocumentTermMatrix (documents: 4449, terms: 209)>>
## Non-/sparse entries: 33794/896047
## Sparsity           : 96%
## Maximal term length: 11
## Weighting          : term frequency (tf)
removeSparseTerms(congress_dtm, sparse = .95)
## <<DocumentTermMatrix (documents: 4449, terms: 28)>>
## Non-/sparse entries: 18447/106125
## Sparsity           : 85%
## Maximal term length: 11
## Weighting          : term frequency (tf)
removeSparseTerms(congress_dtm, sparse = .90)
## <<DocumentTermMatrix (documents: 4449, terms: 16)>>
## Non-/sparse entries: 14917/56267
## Sparsity           : 79%
## Maximal term length: 9
## Weighting          : term frequency (tf)
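
What removeSparseTerms is doing can be sketched by hand on a small count matrix: a term is dropped when it is absent from more than the `sparse` share of documents. The matrix below is made up for illustration, and the "at or below the threshold" keep rule is a simplification of tm's exact boundary behavior.

```r
m <- matrix(
  c(2, 1, 0, 0,
    0, 1, 0, 0,
    0, 0, 3, 1,
    1, 0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("duty", "suspend", "kitten", "cute"))
)

# Overall sparsity: share of zero entries in the matrix
sparsity <- mean(m == 0)

# Per-term document frequency, as a share of documents
doc_freq <- colMeans(m > 0)

# Keep terms whose zero-share (1 - doc_freq) is at or below the threshold
keep <- (1 - doc_freq) <= 0.5
m_dense <- m[, keep, drop = FALSE]
colnames(m_dense)
```

Lowering the threshold keeps only terms that appear in a larger share of documents, which is why the term count above falls from 4902 to 209 to 28 to 16.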

Exploratory analysis

Estimate model

congress_rf <- train(x = as.matrix(removeSparseTerms(congress_dtm, sparse = .99)),
                     y = factor(congress$major),
                     method = "rf",
                     ntree = 200,
                     trControl = trainControl(method = "oob"))

Evaluate model

congress_rf$finalModel
## 
## Call:
##  randomForest(x = x, y = y, ntree = 200, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 209
## 
##         OOB estimate of  error rate: 34.1%
## Confusion matrix:
##     1  2   3  4   5   6   7   8 10  12 13 14  15  16 17  18 19  20  21 99
## 1  97  0   3  0   3   3   2   5  5  11  1  2  13   3  1   5  0   7   2  0
## 2   1 13   6  1   5   4   4   1  3   6  3  2   8   7  3   1  1  11   4  0
## 3   4  1 532  4  14   9   6   1  4   7  6  2   7  10  1   0  1   5   3  0
## 4   2  1   8 91   4   2   5   1  1   1  0  1   3   0  2   5  1   1   4  0
## 5   5  3  13  2 153  10   4   2 10  13  2  1   7   7  6   7  5   5   6  1
## 6   9  1   7  0  11 159   1   0  1   8  1  1   6   3  2   5  2   1   4  0
## 7   2  4   4  5   4   3 101   5  9   8  1  1   7   3  5   4  3   6  25  1
## 8   6  2   1  1   2   1   4 100  3   3  0  2   4   1  0   1  0   2   5  0
## 10  6  0   3  2   4   1   6   2 96  12  0  0   6   3  4   6  3  12   5  0
## 12 10  0  19  4  13   6   9   1  8 143  4  3  14   4  5   7  6  28   4  3
## 13  4  0   6  0   5   1   3   2  1   3 60  3   1   0  1   1  0   2   1  0
## 14  2  0   1  2   5   2   3   1  4   1  1 46   3   2  1   2  0   3   1  0
## 15 13  3   7  5  14   2   9   4  6  19  1  2 145   6  6  12  6  15   3  1
## 16  1  5   4  0   6   3   3   1  7   5  2  2   6 133  1   9  7  18   6  0
## 17  5  0   3  1   6   3   3   1  1   8  0  3   4   2 36   1  2   6   4  1
## 18  0  0   0  2   1   2   4   2  3   1  0  0   2   2  0 373  7   2   1  0
## 19  2  0   3  3  11   5   5   0  4   9  0  0   4   6  0   9 46   6   8  0
## 20 12  2   7  1  15   5   7   1  7  20  1  3  10  14  7  12  2 235  18  1
## 21  6  4   6  3   4   4  26   2  9   7  1  4   4  11  3   9  5  16 348  0
## 99  0  0   0  0   0   0   0   0  1   0  0  0   0   1  2   0  0   0   1 25
##    class.error
## 1    0.4049080
## 2    0.8452381
## 3    0.1377634
## 4    0.3157895
## 5    0.4160305
## 6    0.2837838
## 7    0.4975124
## 8    0.2753623
## 10   0.4385965
## 12   0.5085911
## 13   0.3617021
## 14   0.4250000
## 15   0.4802867
## 16   0.3926941
## 17   0.6000000
## 18   0.0721393
## 19   0.6198347
## 20   0.3815789
## 21   0.2627119
## 99   0.1666667

Evaluate model

Topic modeling

  • Keywords
  • Links
  • Themes
  • Probabilistic topic models
    • Latent Dirichlet allocation

Food and animals

  1. I ate a banana and spinach smoothie for breakfast.
  2. I like to eat broccoli and bananas.
  3. Chinchillas and kittens are cute.
  4. My sister adopted a kitten yesterday.
  5. Look at this cute hamster munching on a piece of broccoli.

LDA document structure

  • Decide on the number of words \(N\) the document will have
  • Generate each word in the document:
    • Pick a topic
    • Generate the word
  • LDA backtracks from this generative process: given the observed documents, it infers the topics and topic mixtures most likely to have produced them
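
The generative story above can be sketched in a few lines of base R. The topic-word probabilities and the 50/50 topic mixture here are made up for illustration; a fitted LDA model would estimate these from data.

```r
set.seed(123)

# Two invented topics, each a probability distribution over words
topics <- list(
  food    = c(banana = 0.4, broccoli = 0.4, smoothie = 0.2),
  animals = c(kitten = 0.5, chinchilla = 0.3, hamster = 0.2)
)

# The document's topic mixture: half food, half cute animals
topic_mix <- c(food = 0.5, animals = 0.5)

n_words <- 5
# For each word: first pick a topic from the mixture,
# then pick a word from that topic's distribution
doc <- replicate(n_words, {
  z <- sample(names(topic_mix), 1, prob = topic_mix)
  sample(names(topics[[z]]), 1, prob = topics[[z]])
})
doc
```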

Food and animals

  • Decide that \(D\) will be 1/2 about food and 1/2 about cute animals.
  • Pick 5 to be the number of words in \(D\).
  • Pick the first word to come from the food topic
  • Pick the second word to come from the cute animals topic
  • Pick the third word to come from the cute animals topic
  • Pick the fourth word to come from the food topic
  • Pick the fifth word to come from the food topic

LDA with a known topic structure

  • Great Expectations by Charles Dickens
  • The War of the Worlds by H.G. Wells
  • Twenty Thousand Leagues Under the Sea by Jules Verne
  • Pride and Prejudice by Jane Austen

topicmodels

## <<DocumentTermMatrix (documents: 193, terms: 18215)>>
## Non-/sparse entries: 104722/3410773
## Sparsity           : 97%
## Maximal term length: 19
## Weighting          : term frequency (tf)

Terms associated with each topic

Per-document classification

Consensus topic

## # A tibble: 4 x 2
##                               consensus topic
##                                   <chr> <int>
## 1                    Great Expectations     4
## 2                   Pride and Prejudice     1
## 3                 The War of the Worlds     3
## 4 Twenty Thousand Leagues under the Sea     2

Mis-identification

chapter_classifications %>%
  inner_join(book_topics, by = "topic") %>%
  count(title, consensus) %>%
  knitr::kable()
| title                                 | consensus                             |  n |
|---------------------------------------|---------------------------------------|---:|
| Great Expectations                    | Great Expectations                    | 57 |
| Great Expectations                    | Pride and Prejudice                   |  1 |
| Great Expectations                    | The War of the Worlds                 |  1 |
| Pride and Prejudice                   | Pride and Prejudice                   | 61 |
| The War of the Worlds                 | The War of the Worlds                 | 27 |
| Twenty Thousand Leagues under the Sea | Twenty Thousand Leagues under the Sea | 46 |

Incorrectly classified words

| title                                 | Great Expectations | Pride and Prejudice | The War of the Worlds | Twenty Thousand Leagues under the Sea |
|---------------------------------------|-------------------:|--------------------:|----------------------:|--------------------------------------:|
| Great Expectations                    |              49656 |                3908 |                  1923 |                                    81 |
| Pride and Prejudice                   |                  1 |               37231 |                     6 |                                     4 |
| The War of the Worlds                 |                  0 |                   0 |                 22561 |                                     7 |
| Twenty Thousand Leagues under the Sea |                  0 |                   5 |                     0 |                                 39629 |

Most commonly mistaken words

## # A tibble: 3,551 x 4
##                 title             consensus     term     n
##                 <chr>                 <chr>    <chr> <dbl>
##  1 Great Expectations   Pride and Prejudice     love    44
##  2 Great Expectations   Pride and Prejudice sergeant    37
##  3 Great Expectations   Pride and Prejudice     lady    32
##  4 Great Expectations   Pride and Prejudice     miss    26
##  5 Great Expectations The War of the Worlds     boat    25
##  6 Great Expectations The War of the Worlds     tide    20
##  7 Great Expectations The War of the Worlds    water    20
##  8 Great Expectations   Pride and Prejudice   father    19
##  9 Great Expectations   Pride and Prejudice     baby    18
## 10 Great Expectations   Pride and Prejudice  flopson    18
## # ... with 3,541 more rows

Associated Press articles

## <<DocumentTermMatrix (documents: 2246, terms: 10134)>>
## Non-/sparse entries: 259208/22501756
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)

Perplexity

  • A statistical measure of how well a probability model predicts a sample
  • Compare the theoretical word distributions represented by the topics to the actual distribution of words in the documents
  • Perplexity for LDA model with 12 topics
    • 2277.8757156
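
Perplexity is the exponentiated average negative log-likelihood per word, so it equals the geometric mean of the inverse per-word probabilities; lower is better. A minimal hand computation in base R, with the per-word probabilities invented for illustration (topicmodels computes these from the fitted model):

```r
# Probability the model assigns to each observed word in a tiny "document"
word_probs <- c(0.1, 0.05, 0.2, 0.01, 0.1)

# exp of the mean negative log-likelihood per word
perplexity <- exp(-mean(log(word_probs)))
perplexity
```

A model that assigned every word probability 1 would have perplexity 1; the 2277.88 above reflects how spread out the AP vocabulary is across 12 topics.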

Topics from \(k=100\)