Text analysis: fundamentals and sentiment analysis

MACS 30500 University of Chicago

November 20, 2017

Basic workflow for text analysis

  • Obtain your text sources
  • Extract documents and move into a corpus
  • Transformation
  • Extract features
  • Perform analysis

Obtain your text sources

  • Web sites
    • Twitter
  • Databases
  • PDF documents
  • Digital scans of printed materials

Extract documents and move into a corpus

  • Text corpus
  • Typically stores each text as a raw character string, along with metadata and other details

Transformation

  • Tag segments of text by part of speech (nouns, verbs, adjectives, etc.) or named entity (person, place, company, etc.)
  • Standard text processing (see the sketch after this list)
    • Convert to lower case
    • Remove punctuation
    • Remove numbers
    • Remove stopwords
    • Remove domain-specific stopwords
    • Stemming
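The standard processing steps above are not shown in code on the slides; below is a minimal sketch of one way to do them with tidytext, where the one-document docs data frame and the use of SnowballC::wordStem() for stemming are illustrative assumptions:

library(tidyverse)
library(stringr)
library(tidytext)
library(SnowballC)   # provides wordStem() for stemming

# toy one-document corpus (hypothetical example data)
docs <- tibble(id = 1, text = "The 3 dogs are running, quickly!")

docs %>%
  unnest_tokens(output = word, input = text) %>%   # lowercases and strips punctuation
  filter(!str_detect(word, "^[0-9]+$")) %>%        # remove numbers
  anti_join(stop_words, by = "word") %>%           # remove standard stopwords
  mutate(word = wordStem(word))                    # stem the remaining words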

Extract features

  • Convert the text strings into quantifiable measures
  • Bag-of-words model
    • Term-frequency vector
    • Term-document matrix (see the sketch below)
    • Ignores context and word order
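A tidy one-token-per-row data frame can be cast into a term-document (or document-term) matrix with tidytext; a minimal sketch with hypothetical toy data (cast_dtm() builds a DocumentTermMatrix from the tm package):

library(tidyverse)
library(tidytext)

# toy tidy data frame of tokens (hypothetical example data)
tokens <- tibble(doc  = c(1, 1, 2),
                 word = c("happy", "happy", "sad"))

tokens %>%
  count(doc, word) %>%                              # term frequency per document
  cast_dtm(document = doc, term = word, value = n)  # 2 documents x 2 terms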

Perform analysis

  • Basic
    • Word frequency
    • Collocation
    • Dictionary tagging
  • Advanced
    • Document classification
      • Supervised
      • Unsupervised
    • Corpora comparison
    • Topic modeling

tidytext

  • Tidy text format
  • Defined as one-token-per-row (illustrated below)
  • Differs from the document-term matrix
    • One-document-per-row and one-term-per-column
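A minimal illustration with a hypothetical one-sentence document; unnest_tokens(), covered in detail below, returns one row per token:

library(tidyverse)
library(tidytext)

tibble(doc = 1, text = "I am happy") %>%
  unnest_tokens(output = word, input = text)
## returns three rows, one token each: "i", "am", "happy"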

Get text corpora

library(tidyverse)
library(stringr)
library(tidytext)
library(janeaustenr)

# annotate each line with its line number and the chapter it belongs to
books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup()
books
## # A tibble: 73,422 x 4
##                     text                book linenumber chapter
##                    <chr>              <fctr>      <int>   <int>
##  1 SENSE AND SENSIBILITY Sense & Sensibility          1       0
##  2                       Sense & Sensibility          2       0
##  3        by Jane Austen Sense & Sensibility          3       0
##  4                       Sense & Sensibility          4       0
##  5                (1811) Sense & Sensibility          5       0
##  6                       Sense & Sensibility          6       0
##  7                       Sense & Sensibility          7       0
##  8                       Sense & Sensibility          8       0
##  9                       Sense & Sensibility          9       0
## 10             CHAPTER 1 Sense & Sensibility         10       1
## # ... with 73,412 more rows

Tokenize text

# tokenize: split each line into individual words, one per row
tidy_books <- books %>%
  unnest_tokens(output = word, input = text)
tidy_books
## # A tibble: 725,055 x 4
##                   book linenumber chapter        word
##                 <fctr>      <int>   <int>       <chr>
##  1 Sense & Sensibility          1       0       sense
##  2 Sense & Sensibility          1       0         and
##  3 Sense & Sensibility          1       0 sensibility
##  4 Sense & Sensibility          3       0          by
##  5 Sense & Sensibility          3       0        jane
##  6 Sense & Sensibility          3       0      austen
##  7 Sense & Sensibility          5       0        1811
##  8 Sense & Sensibility         10       1     chapter
##  9 Sense & Sensibility         10       1           1
## 10 Sense & Sensibility         13       1         the
## # ... with 725,045 more rows

Download population data for U.S. states

# retrieve state populations in 2014 from Census Bureau
pop_df <- read_csv("http://cfss.uchicago.edu/data/pop2014.csv")

# do these results make sense?
pop_df %>% 
  arrange(desc(pop2014)) %>%
  top_n(10)
## # A tibble: 10 x 2
##        state_name  pop2014
##             <chr>    <int>
##  1     california 38066920
##  2          texas 26092033
##  3       new york 19594330
##  4        florida 19361792
##  5       illinois 12868747
##  6   pennsylvania 12758729
##  7           ohio 11560380
##  8        georgia  9907756
##  9       michigan  9889024
## 10 north carolina  9750405

Retrieve song lyrics

song_lyrics <- read_csv("../data/billboard_lyrics_1964-2015.csv")
str(song_lyrics)
## Classes 'tbl_df', 'tbl' and 'data.frame':    5100 obs. of  6 variables:
##  $ Rank  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Song  : chr  "wooly bully" "i cant help myself sugar pie honey bunch" "i cant get no satisfaction" "you were on my mind" ...
##  $ Artist: chr  "sam the sham and the pharaohs" "four tops" "the rolling stones" "we five" ...
##  $ Year  : int  1965 1965 1965 1965 1965 1965 1965 1965 1965 1965 ...
##  $ Lyrics: chr  "sam the sham miscellaneous wooly bully wooly bully sam the sham  the pharaohs  domingo samudio uno dos one two "| __truncated__ "sugar pie honey bunch you know that i love you i cant help myself i love you and nobody elsein and out my life "| __truncated__ NA "when i woke up this morning you were on my mind and you were on my mind i got troubles whoaoh i got worries who"| __truncated__ ...
##  $ Source: int  3 1 1 1 1 1 3 5 1 3 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 6
##   .. ..$ Rank  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Song  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Artist: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Year  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Lyrics: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Source: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

Use tidytext to create a data frame with one row for each token in each song

# tokenize into both single words and bigrams, so that two-word
# state names like "new york" can be matched later
tidy_lyrics <- bind_rows(song_lyrics %>% 
                           unnest_tokens(output = state_name,
                                         input = Lyrics),
                         song_lyrics %>% 
                           unnest_tokens(output = state_name,
                                         input = Lyrics, 
                                         token = "ngrams", n = 2))
tidy_lyrics
## # A tibble: 3,201,196 x 6
##     Rank        Song                        Artist  Year Source
##    <int>       <chr>                         <chr> <int>  <int>
##  1     1 wooly bully sam the sham and the pharaohs  1965      3
##  2     1 wooly bully sam the sham and the pharaohs  1965      3
##  3     1 wooly bully sam the sham and the pharaohs  1965      3
##  4     1 wooly bully sam the sham and the pharaohs  1965      3
##  5     1 wooly bully sam the sham and the pharaohs  1965      3
##  6     1 wooly bully sam the sham and the pharaohs  1965      3
##  7     1 wooly bully sam the sham and the pharaohs  1965      3
##  8     1 wooly bully sam the sham and the pharaohs  1965      3
##  9     1 wooly bully sam the sham and the pharaohs  1965      3
## 10     1 wooly bully sam the sham and the pharaohs  1965      3
## # ... with 3,201,186 more rows, and 1 more variables: state_name <chr>

Find all the state names occurring in the song lyrics

# keep only tokens matching a state name, and count each state
# at most once per song
tidy_lyrics <- inner_join(tidy_lyrics, pop_df, by = "state_name") %>%
  distinct(Rank, Song, Artist, Year, state_name, .keep_all = TRUE)
tidy_lyrics
## # A tibble: 253 x 7
##     Rank                          Song                         Artist
##    <int>                         <chr>                          <chr>
##  1    12              king of the road                   roger miller
##  2    29            eve of destruction                  barry mcguire
##  3    49              california girls                 the beach boys
##  4    10            california dreamin           the mamas  the papas
##  5    77            message to michael                 dionne warwick
##  6    61             california nights                    lesley gore
##  7     4 sittin on the dock of the bay                   otis redding
##  8    10                    tighten up        archie bell  the drells
##  9    25                      get back the beatles with billy preston
## 10    25                      get back the beatles with billy preston
## # ... with 243 more rows, and 4 more variables: Year <int>, Source <int>,
## #   state_name <chr>, pop2014 <int>

Calculate the number of songs mentioning each state, then add a column adjusting that frequency by the state's population

# number of songs mentioning each state
(state_counts <- tidy_lyrics %>% 
   count(state_name) %>% 
   arrange(desc(n)))
## # A tibble: 33 x 2
##     state_name     n
##          <chr> <int>
##  1    new york    64
##  2  california    34
##  3     georgia    22
##  4   tennessee    14
##  5       texas    14
##  6     alabama    12
##  7 mississippi    10
##  8    kentucky     7
##  9      hawaii     6
## 10    illinois     6
## # ... with 23 more rows
pop_df <- pop_df %>% 
  left_join(state_counts, by = "state_name") %>% 
  # mentions per one million residents
  mutate(rate = n / pop2014 * 1e6)

pop_df %>%
  arrange(desc(rate)) %>%
  top_n(10)
## # A tibble: 10 x 4
##     state_name  pop2014     n     rate
##          <chr>    <int> <int>    <dbl>
##  1      hawaii  1392704     6 4.308166
##  2 mississippi  2984345    10 3.350819
##  3    new york 19594330    64 3.266251
##  4     alabama  4817678    12 2.490826
##  5       maine  1328535     3 2.258126
##  6     georgia  9907756    22 2.220483
##  7   tennessee  6451365    14 2.170083
##  8     montana  1006370     2 1.987341
##  9    nebraska  1855617     3 1.616713
## 10    kentucky  4383272     7 1.596981

Make a choropleth map for both the raw frequency counts and relative frequency counts

library(statebins)

# statebins matches on state names, so convert to title case
# (and restore the lowercase "of" in District of Columbia)
pop_df %>%
  mutate(state_name = stringr::str_to_title(state_name),
         state_name = if_else(state_name == "District Of Columbia",
                              "District of Columbia", state_name)) %>%
  statebins_continuous(state_col = "state_name", value_col = "n") +
  labs(title = "Frequency of states mentioned in song lyrics",
       subtitle = "Number of mentions") +
  theme(legend.position = "bottom")

pop_df %>%
  mutate(state_name = stringr::str_to_title(state_name),
         state_name = if_else(state_name == "District Of Columbia",
                              "District of Columbia", state_name)) %>%
  statebins_continuous(state_col = "state_name", value_col = "rate") +
  labs(title = "Frequency of states mentioned in song lyrics",
       subtitle = "Number of mentions per capita") +
  theme(legend.position = "bottom")

Sentiment analysis

Example sentence: "I am happy". A dictionary-based approach tags the word "happy" as positive.
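A minimal sketch of dictionary-based scoring for that sentence, using the Bing lexicon introduced on the next slide:

library(tidyverse)
library(tidytext)

tibble(text = "I am happy") %>%
  unnest_tokens(output = word, input = text) %>%
  # keep only the words that appear in the sentiment dictionary
  inner_join(get_sentiments("bing"), by = "word")
## returns one row: word = "happy", sentiment = "positive"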

Dictionaries: Bing

get_sentiments("bing")
## # A tibble: 6,788 x 2
##           word sentiment
##          <chr>     <chr>
##  1     2-faced  negative
##  2     2-faces  negative
##  3          a+  positive
##  4    abnormal  negative
##  5     abolish  negative
##  6  abominable  negative
##  7  abominably  negative
##  8   abominate  negative
##  9 abomination  negative
## 10       abort  negative
## # ... with 6,778 more rows

Dictionaries: AFINN

get_sentiments("afinn")
## # A tibble: 2,476 x 2
##          word score
##         <chr> <int>
##  1    abandon    -2
##  2  abandoned    -2
##  3   abandons    -2
##  4   abducted    -2
##  5  abduction    -2
##  6 abductions    -2
##  7      abhor    -3
##  8   abhorred    -3
##  9  abhorrent    -3
## 10     abhors    -3
## # ... with 2,466 more rows

Dictionaries: NRC

get_sentiments("nrc")
## # A tibble: 13,901 x 2
##           word sentiment
##          <chr>     <chr>
##  1      abacus     trust
##  2     abandon      fear
##  3     abandon  negative
##  4     abandon   sadness
##  5   abandoned     anger
##  6   abandoned      fear
##  7   abandoned  negative
##  8   abandoned   sadness
##  9 abandonment     anger
## 10 abandonment      fear
## # ... with 13,891 more rows

Dictionaries: NRC category counts

get_sentiments("nrc") %>%
  count(sentiment)
## # A tibble: 10 x 2
##       sentiment     n
##           <chr> <int>
##  1        anger  1247
##  2 anticipation   839
##  3      disgust  1058
##  4         fear  1476
##  5          joy   689
##  6     negative  3324
##  7     positive  2312
##  8      sadness  1191
##  9     surprise   534
## 10        trust  1231

Calculate sentiment

tidy_books %>%
  inner_join(get_sentiments("bing"))
## # A tibble: 52,287 x 5
##                   book linenumber chapter        word sentiment
##                 <fctr>      <int>   <int>       <chr>     <chr>
##  1 Sense & Sensibility         16       1 respectable  positive
##  2 Sense & Sensibility         16       1        good  positive
##  3 Sense & Sensibility         18       1    advanced  positive
##  4 Sense & Sensibility         20       1       death  negative
##  5 Sense & Sensibility         20       1       great  positive
##  6 Sense & Sensibility         21       1        loss  negative
##  7 Sense & Sensibility         25       1 comfortably  positive
##  8 Sense & Sensibility         28       1    goodness  positive
##  9 Sense & Sensibility         28       1       solid  positive
## 10 Sense & Sensibility         29       1     comfort  positive
## # ... with 52,277 more rows
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment)
## # A tibble: 1,840 x 4
##                   book index sentiment     n
##                 <fctr> <dbl>     <chr> <int>
##  1 Sense & Sensibility     0  negative    16
##  2 Sense & Sensibility     0  positive    32
##  3 Sense & Sensibility     1  negative    19
##  4 Sense & Sensibility     1  positive    53
##  5 Sense & Sensibility     2  negative    12
##  6 Sense & Sensibility     2  positive    31
##  7 Sense & Sensibility     3  negative    15
##  8 Sense & Sensibility     3  positive    31
##  9 Sense & Sensibility     4  negative    16
## 10 Sense & Sensibility     4  positive    34
## # ... with 1,830 more rows
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0)
## # A tibble: 920 x 4
##                   book index negative positive
##  *              <fctr> <dbl>    <dbl>    <dbl>
##  1 Sense & Sensibility     0       16       32
##  2 Sense & Sensibility     1       19       53
##  3 Sense & Sensibility     2       12       31
##  4 Sense & Sensibility     3       15       31
##  5 Sense & Sensibility     4       16       34
##  6 Sense & Sensibility     5       16       51
##  7 Sense & Sensibility     6       24       40
##  8 Sense & Sensibility     7       23       51
##  9 Sense & Sensibility     8       30       40
## 10 Sense & Sensibility     9       15       19
## # ... with 910 more rows
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## # A tibble: 920 x 5
##                   book index negative positive sentiment
##                 <fctr> <dbl>    <dbl>    <dbl>     <dbl>
##  1 Sense & Sensibility     0       16       32        16
##  2 Sense & Sensibility     1       19       53        34
##  3 Sense & Sensibility     2       12       31        19
##  4 Sense & Sensibility     3       15       31        16
##  5 Sense & Sensibility     4       16       34        18
##  6 Sense & Sensibility     5       16       51        35
##  7 Sense & Sensibility     6       24       40        16
##  8 Sense & Sensibility     7       23       51        28
##  9 Sense & Sensibility     8       30       40        10
## 10 Sense & Sensibility     9       15       19         4
## # ... with 910 more rows
janeaustensentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

# plot the sentiment over time in each book
ggplot(janeaustensentiment, aes(index, sentiment, fill = book)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ book, ncol = 2, scales = "free_x")

Load Harry Potter text
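The loading code is not shown on the slides; below is a sketch, assuming the harrypotter package (GitHub: bradleyboehmke/harrypotter), which stores each book as a character vector with one element per chapter:

library(tidyverse)
library(tidytext)
# devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)

hp_words <- list(
  philosophers_stone   = philosophers_stone,
  chamber_of_secrets   = chamber_of_secrets,
  prisoner_of_azkaban  = prisoner_of_azkaban,
  goblet_of_fire       = goblet_of_fire,
  order_of_the_phoenix = order_of_the_phoenix,
  half_blood_prince    = half_blood_prince,
  deathly_hallows      = deathly_hallows
) %>%
  # one row per chapter, carrying the book name along as an id
  map_df(~ tibble(chapter = seq_along(.x), text = .x), .id = "book") %>%
  mutate(book = factor(book, levels = unique(book))) %>%
  # tokenize into one word per row
  unnest_tokens(output = word, input = text) %>%
  group_by(book)

hp_words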

## # A tibble: 1,089,386 x 3
## # Groups:   book [7]
##                  book chapter    word
##                <fctr>   <int>   <chr>
##  1 philosophers_stone       1     the
##  2 philosophers_stone       1     boy
##  3 philosophers_stone       1     who
##  4 philosophers_stone       1   lived
##  5 philosophers_stone       1      mr
##  6 philosophers_stone       1     and
##  7 philosophers_stone       1     mrs
##  8 philosophers_stone       1 dursley
##  9 philosophers_stone       1      of
## 10 philosophers_stone       1  number
## # ... with 1,089,376 more rows

Most frequent words, by book
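The slide originally showed a plot; a sketch of how it could be produced, reusing hp_words from the previous slide (stopwords are dropped first; note that reorder() sorts words globally rather than within each facet):

library(tidyverse)
library(tidytext)

hp_words %>%
  anti_join(stop_words, by = "word") %>%   # drop common stopwords
  count(book, word, sort = TRUE) %>%
  group_by(book) %>%
  top_n(15, n) %>%                         # 15 most frequent words per book
  ungroup() %>%
  ggplot(aes(reorder(word, n), n, fill = book)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~ book, scales = "free") +
  labs(x = NULL, y = "Word count")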

Exercises