---
title: "Tidy text analysis, part 2: sentiment analysis"
output:
  html_document:
    toc: yes
  html_notebook:
    theme: united
    toc: yes
---

## Using tidytext to perform sentiment analysis

```{r}
library(tidytext)
library(textdata)
library(magrittr)
library(dplyr)
library(stringr)
library(glue)
library(tidyverse)
library(tidyr)
library(ggplot2)
library(gutenbergr)
```

Now we will create an `alice_words` data frame using the standard tidytext process: reading in a text file, creating a data frame of lines and line numbers, and tokenizing the text into words.

```{r}
alice_words <- tibble(file = paste0('corpus/c19-20_prose/carroll_alice.txt')) %>%
  mutate(text = map(file, read_lines)) %>%
  unnest(text) %>%
  group_by(file = str_sub(basename(file), 1, -5)) %>%
  mutate(line_number = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text)

alice_words
alice_words$word[5:25]
```

Next you invoke the `inner_join` function, which is essentially a way of merging (and comparing) one data set with another. This is much easier to accomplish when the data is in tidy format. Here we are joining the text data from Lewis Carroll's *Alice in Wonderland* with a dictionary of sentiment words that assigns a positive or negative value to each word.

```{r}
alice_words_sentiment <- inner_join(alice_words, get_sentiments("bing")) %>%
  count(file, index = round(line_number / max(line_number) * 100 / 5) * 5, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(net_sentiment = positive - negative)
```

Did you notice the `get_sentiments` function? What is happening is that the `inner_join` call includes an instruction to compare the tidy words tibble to a sentiment dictionary called "bing". Let's have a look at this.

```{r}
get_sentiments("bing")
```

We get a binary sentiment result (positive or negative). Another useful dictionary (probably better for more complex or historical texts) is AFINN.

```{r}
get_sentiments("afinn")
```

Here each word receives a graded sentiment score rather than a binary label. A third dictionary, `nrc`, assigns words to one or more categories of sentiment and emotion (such as positive, negative, anger, fear, joy, and trust).

```{r}
get_sentiments("nrc")
```

Silge and Robinson provide a useful flowchart of how this works.

![tidy-sentiment.png](tidy-sentiment.png)

Back to our results: using the ggplot2 library, we can visualise them.

```{r}
alice_words_sentiment %>%
  ggplot(aes(x = index, y = net_sentiment, fill = file)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ file) +
  scale_x_continuous("Location in the volume (by percentage)") +
  scale_y_continuous("Bing net sentiment")
```
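The choice of dictionary matters for a trajectory like this. As a point of comparison, here is a minimal sketch (not part of the original workflow, but using the same 5% chunking as above) that computes the trajectory with the AFINN dictionary, summing each word's score rather than counting positive and negative words.

```{r}
# A sketch: AFINN-based net sentiment per 5% chunk of the text.
# AFINN supplies a numeric "value" column, so we sum scores instead of
# subtracting negative counts from positive counts.
alice_afinn <- alice_words %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = round(line_number / max(line_number) * 100 / 5) * 5) %>%
  summarise(net_sentiment = sum(value))

alice_afinn %>%
  ggplot(aes(x = index, y = net_sentiment)) +
  geom_col(show.legend = FALSE) +
  scale_x_continuous("Location in the volume (by percentage)") +
  scale_y_continuous("AFINN net sentiment")
```

The overall shape should be broadly similar to the Bing plot, but the peaks and troughs can differ because AFINN weights strongly positive or negative words more heavily.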
Let's make this interesting by comparing these results to another children's author from our corpus.

```{r}
edgeworth_words <- tibble(file = paste0('corpus/c19-20_prose/edgeworth_parents.txt')) %>%
  mutate(text = map(file, read_lines)) %>%
  unnest(text) %>%
  group_by(file = str_sub(basename(file), 1, -5)) %>%
  mutate(line_number = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text)

edgeworth_words_sentiment <- inner_join(edgeworth_words, get_sentiments("bing")) %>%
  count(file, index = round(line_number / max(line_number) * 100 / 5) * 5, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(net_sentiment = positive - negative)

edgeworth_words_sentiment %>%
  ggplot(aes(x = index, y = net_sentiment, fill = file)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ file) +
  scale_x_continuous("Location in the volume (by percentage)") +
  scale_y_continuous("Bing net sentiment of Edgeworth's Parents...")
```

Clearly Lewis Carroll's *Alice* is more sinister at the single-word level. Let's break down the Carroll results into more readable graphs.

```{r}
alice.bing_word_counts <- alice_words %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

alice.bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(20) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Word frequency of sentiment words in Lewis Carroll's Alice",
       x = NULL) +
  coord_flip()
```

You may notice that the top negative word is likely wrong: "miss", which in this text mostly appears as a form of address for an unmarried woman rather than as a negative word. You could create a custom stop-word list to filter out words that you do not want.

```{r}
custom_stop_words <- bind_rows(tibble(word = c("miss", "mock", "like"),
                                      lexicon = c("custom")),
                               stop_words)

custom_stop_words
```

Let's try this with another sentiment dictionary, this time filtering with the custom stop-word list.

```{r}
alice.nrc_word_counts <- alice_words %>%
  anti_join(custom_stop_words) %>%
  inner_join(get_sentiments("nrc")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

alice.nrc_word_counts %>%
  group_by(sentiment) %>%
  top_n(20) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Word frequency of sentiment words in Lewis Carroll's Alice",
       x = NULL) +
  coord_flip()
```

Here we have different categories of sentiment words, and the results seem more accurate than the Bing calculations. We can also use the tidy `filter()` function to focus on a specific category of sentiment.

```{r}
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

alice.joy.word.counts <- alice_words %>%
  inner_join(nrc_joy) %>% # here we can simply inner_join with our new nrc_joy variable
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

alice.joy.word.counts
```
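The same filtering idea extends to any of the NRC categories. As a rough sketch (not in the original notebook), you could tally how many word tokens fall into each category before deciding which one to examine more closely.

```{r}
# Tally Alice word tokens by NRC category. Note that a word can belong to
# more than one category, so these counts overlap.
alice_words %>%
  inner_join(get_sentiments("nrc")) %>%
  count(sentiment, sort = TRUE)
```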
You can use the code below to combine the two sentiment dictionary results into an overall trajectory.

```{r}
bing_and_nrc <- bind_rows(alice_words %>%
                            inner_join(get_sentiments("bing")) %>%
                            mutate(method = "Bing"),
                          alice_words %>%
                            inner_join(get_sentiments("nrc") %>%
                                         filter(sentiment %in% c("positive", "negative"))) %>%
                            mutate(method = "NRC")) %>%
  count(method, index = line_number %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

bing_and_nrc %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
```

As you can see, the Bing dictionary generally scores the text more negatively than the NRC. Let's now compare this to Edgeworth.

```{r}
edg.nrc_word_counts <- edgeworth_words %>%
  inner_join(get_sentiments("nrc")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

edg.nrc_word_counts %>%
  group_by(sentiment) %>%
  top_n(20) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Word frequency of sentiment words in Edgeworth",
       x = NULL) +
  coord_flip()

summary(edg.nrc_word_counts)
```

Another way to re-orient the sentiment results is to create a word cloud. Sometimes these can be useful for assessing the total weight of positivity or negativity in a text.

```{r message=FALSE}
library(wordcloud)
library(reshape2)

# create a sentiment wordcloud of the Edgeworth results
edgeworth_words %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(max.words = 1000, scale = c(1, .25),
                   random.order = FALSE, colors = c("red", "blue"))
```

As you can see, the cloud displays the overall negativity that the line graph above suggested.
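If you want to check that impression numerically rather than visually, a quick tally of the Bing categories (a sketch, not part of the original notebook) would look like this:

```{r}
# Count how many positive and negative word tokens the Bing dictionary
# finds in the Edgeworth text.
edgeworth_words %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment)
```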
Let's see how that compares to *Alice*.

```{r message=FALSE}
library(wordcloud)
library(reshape2)

alice_words %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(max.words = 1000, scale = c(1, .25),
                   random.order = FALSE, colors = c("red", "blue"))
```

```{r}
# for a comparison, create a sentiment wordcloud of the Alice results using the afinn dictionary
get_sentiments("afinn")

alice_words %>%
  inner_join(get_sentiments("afinn")) %>%
  count(word, value, sort = TRUE) %>% # notice that here you count a different variable ("value") instead of "sentiment": that's due to the tibble structure of this dictionary
  acast(word ~ value, value.var = "n", fill = 0) %>%
  comparison.cloud(max.words = 1000, scale = c(1, .25),
                   random.order = FALSE, colors = c("red", "blue"))
```

### Accessing more context: bigram and n-gram frequencies

```{r}
# bigrams have to be built from the untokenized text, so we re-read the file
# into one row per line before calling unnest_tokens with token = "ngrams"
alice_text <- tibble(file = paste0('corpus/c19-20_prose/carroll_alice.txt')) %>%
  mutate(text = map(file, read_lines)) %>%
  unnest(text)

alice_bigrams <- alice_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

alice_bigrams %>%
  count(bigram, sort = TRUE)

# use the separate() function so we can filter out stopwords
bigrams_sep <- alice_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_sep %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts with substantive words
sub_bigram_counts <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)

sub_bigram_counts
```

```{r}
# calculate tri-grams from the same line-level text
alice_trigrams <- alice_text %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) %>%
  count(word1, word2, word3, sort = TRUE)

alice_trigrams
```

### Using bigrams to provide contextual sentiment

```{r}
bigrams_sep %>%
  filter(word1 == "not") %>%
  count(word1, word2, sort = TRUE)
```

```{r}
AFINN <- get_sentiments("afinn")

not_words <- bigrams_sep %>%
  filter(word1 == "not") %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word2, value, sort = TRUE) %>%
  ungroup()

not_words
```

```{r}
not_words %>%
  mutate(contribution = n * value) %>%
  arrange(desc(abs(contribution))) %>%
  head(20) %>%
  mutate(word2 = reorder(word2, contribution)) %>%
  ggplot(aes(word2, n * value, fill = n * value > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("Words preceded by \"not\"") +
  ylab("Sentiment score * number of occurrences") +
  coord_flip()

# run dev.off() if you get a graphics error
```

Now let's add to this by accounting for other negation words.

```{r}
negation_words <- c("not", "no", "never", "without")

negated_words <- bigrams_sep %>%
  filter(word1 %in% negation_words) %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word1, word2, value, sort = TRUE) %>%
  ungroup()
```

```{r}
negated_words %>%
  mutate(contribution = n * value) %>%
  arrange(desc(abs(contribution))) %>%
  head(25) %>%
  mutate(word2 = reorder(word2, contribution)) %>%
  ggplot(aes(word2, n * value, fill = n * value > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("Words preceded by a negation") +
  ylab("Sentiment score * number of occurrences") +
  coord_flip()
```

## Exercise 3

1. Load your own texts, either from our corpus or from a digital repository such as Project Gutenberg (see the sketch below for one way to do the latter with gutenbergr).
2. Posit a new question, or questions, about what you would like to investigate further.
3. Modify one or more code blocks from this R Notebook to answer your question.
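To get you started on step 1, here is a minimal sketch of downloading a text with the gutenbergr package loaded at the top of this notebook. The ID used here is assumed to be Project Gutenberg's *Alice's Adventures in Wonderland* (ebook #11); check the `gutenberg_metadata` table for the text you actually want, and adapt the rest in the empty chunk below.

```{r}
# A sketch, not a prescribed solution: download a Project Gutenberg text,
# tokenize it, and run a quick Bing sentiment tally on it.
my_text <- gutenberg_download(11)   # assumed ID; substitute your own

my_text_words <- my_text %>%
  mutate(line_number = row_number()) %>%
  unnest_tokens(word, text)

my_text_words %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment)
```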
```{r}

```

## Publish your results

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the *Preview* button or press *Cmd+Shift+K* to preview the HTML file).

If you are interested in learning more about R and corpus linguistics, in addition to Silge and Robinson and Jockers, you could also consult R. H. Baayen's *Analyzing Linguistic Data: A Practical Introduction to Statistics* (Cambridge UP, 2008) and Stefan Gries's *Quantitative Corpus Linguistics with R*, 2nd ed. (Routledge, 2017). Some good web resources include Jeff Rydberg-Cox's [Introduction to R](https://daedalus.umkc.edu/StatisticalMethods/index.html) and Julia Silge and David Robinson's [*Text Mining with R*](https://www.tidytextmining.com/). Also be sure to examine the [CRAN R Documentation site](https://cran.r-project.org/doc/manuals/R-intro.html).