---
title: "reddit-sentiment-analysis"
author: "Aleszu Bajak"
date: "2/8/2018"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Sentiment analysis of Reddit comments

In February 2018, Felippe Rodrigues and I published a story at Smithsonianmag.com about brain-boosting substances that are at once banned in the Olympics and popular in the tech world. To understand the size and tenor of the conversation surrounding these so-called "nootropics" (think the pill from the movie *Limitless*), we used R's ```tidytext``` package to analyze more than 150,000 Reddit comments scraped using Python.

Here's how we did it. 

First, load the following packages.

```{r}
# for data wrangling
library(tidyr)
library(stringr)
library(magrittr)
library(dplyr)
library(lubridate)
# for sentiment analysis
library(tidytext)
# for visualization
library(ggplot2)
library(ggridges)
```

## Load in the data

Next, load in the 164,000 comments we scraped from the subreddits *r/Nootropics* and *r/StackAdvice*. By using ```glimpse()``` we can take a peek at the tibble in RStudio's console. 


```{r}
All_comments <- read.csv("reddit/all_Noot_Stack_comments.csv", 
                         header=TRUE, stringsAsFactors=FALSE) 
All_comments %>% glimpse()
```

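Next, we filtered these Reddit comments by substance and exported each subset as a CSV for manual exploration in Excel using *write.csv()*. Note that ```str_detect()``` is case-sensitive by default, so these filters only match the lowercase spellings.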

```{r}
# Get caffeine comments
All_caffeine_comments <- All_comments %>%
  filter(str_detect(body, "caffeine")) %>% glimpse()
write.csv((All_caffeine_comments), "reddit/substances/all_caffeine_mentions.csv")

# Get theanine comments
All_theanine_comments <- All_comments %>%
  filter(str_detect(body, "theanine")) %>% glimpse()
write.csv((All_theanine_comments), "reddit/substances/all_theanine_mentions.csv")

# Get piracetam comments
All_piracetam_comments <- All_comments %>%
  filter(str_detect(body, "piracetam")) %>% glimpse()
write.csv((All_piracetam_comments), "reddit/substances/all_piracetam_mentions.csv")

# Get noopept comments
All_noopept_comments <- All_comments %>%
  filter(str_detect(body, "noopept")) %>% glimpse()
write.csv((All_noopept_comments), "reddit/substances/all_noopept_mentions.csv")

# Get modafinil comments
All_modafinil_comments <- All_comments %>%
  filter(str_detect(body, "modafinil")) %>% glimpse()
write.csv((All_modafinil_comments), "reddit/substances/all_modafinil_mentions.csv")

# Get phenibut comments
All_phenibut_comments <- All_comments %>%
  filter(str_detect(body, "phenibut")) %>% glimpse()
write.csv((All_phenibut_comments), "reddit/substances/all_phenibut_mentions.csv")

# Get bacopa comments
All_bacopa_comments <- All_comments %>%
  filter(str_detect(body, "bacopa")) %>% glimpse()
write.csv((All_bacopa_comments), "reddit/substances/all_bacopa_mentions.csv")
```


In Excel, we surveyed each of these substances, adding a corresponding ```substance``` column to each CSV. We then pasted all of these into one spreadsheet, which we named *all_substance_mentions.csv*.

This could just as easily have been done in R with the ```dplyr``` package.
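As a rough sketch of that ```dplyr``` route, the chunk below tags each comment with the substances it mentions and stacks the results into one table. The toy data and the names ```toy_comments```, ```toy_substances``` and ```toy_mentions``` are invented for illustration; this is not the workflow we actually used.

```{r}
library(dplyr)
library(stringr)

# Toy stand-in for All_comments (hypothetical data)
toy_comments <- tibble(
  body = c("caffeine and theanine work well together",
           "tried piracetam today",
           "nothing relevant here")
)

# Hypothetical shortlist of substances to look for
toy_substances <- c("caffeine", "theanine", "piracetam")

# Filter once per substance, label the matches, then stack everything
toy_mentions <- lapply(toy_substances, function(s) {
  toy_comments %>%
    filter(str_detect(body, fixed(s))) %>%
    mutate(substance = s)
}) %>%
  bind_rows()

toy_mentions
```

A comment that mentions two substances appears twice, once per substance. Like the filters above, ```fixed()``` matches case-sensitively; passing ```ignore_case = TRUE``` to it would also catch spellings such as "Caffeine".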

## Visualize substance mentions over time

First, we load in our CSV with all of the Reddit substance mentions.

```{r}
TotalMentions <- read.csv("reddit/all_substance_mentions.csv", 
                          header=TRUE, stringsAsFactors=FALSE) 
TotalMentions %>% glimpse()
```

We need to parse the date using ```lubridate``` and then break down the data by month. This will make a much more presentable plot.

```{r}
TotalMentions$date <- ymd(TotalMentions$date)
TotalMentions$Month <- as.Date(cut(TotalMentions$date,
                                   breaks = "month"))
TotalMentions %>% 
  arrange(date) %>% glimpse() 
```
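To see what the month step does, here is a small sketch on made-up dates. Each date is snapped to the first day of its month, which is what lets the plots group mentions by month. The ```toy_dates``` and ```toy_months``` names are hypothetical, and the comparison with ```floor_date()``` is only there to show an alternative that should give the same result.

```{r}
library(lubridate)

# Hypothetical dates in the same year-month-day format
toy_dates <- ymd(c("2017-01-15", "2017-01-31", "2017-02-01"))

# cut() labels each date with the start of its month; as.Date() turns the label back into a date
toy_months <- as.Date(cut(toy_dates, breaks = "month"))
toy_months

# lubridate's floor_date() is another way to round down to the month
all(toy_months == floor_date(toy_dates, unit = "month"))
```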

Finally, we'll use the ```ggplot2``` visualization package to plot mentions of each substance over time. Note: ```facet_grid``` breaks the plot out into small plots for each substance. 

```{r}
# geom_bar() counts rows per Month, the same result as geom_histogram(stat = "count")
ggplot(TotalMentions, aes(Month)) +
  geom_bar(aes(fill = factor(substance))) +
  facet_grid(~substance)
```
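The bars in a chart like this are per-month row counts for each substance. If you want to inspect those numbers directly, a minimal sketch with ```count()``` on toy data looks like this (the ```toy_monthly``` table is invented for illustration):

```{r}
library(dplyr)

# Hypothetical mentions, already rounded to the first of the month
toy_monthly <- tibble(
  substance = c("caffeine", "caffeine", "caffeine", "noopept"),
  Month = as.Date(c("2017-01-01", "2017-01-01", "2017-02-01", "2017-01-01"))
) %>%
  count(substance, Month)  # one row per substance and month, with the tally in n

toy_monthly
```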

We also wondered if we could stack these substance mentions over time using what some call the "Joy Division" plot. It comes with the ```ggridges``` package. 

```{r}
ggplot(TotalMentions, aes(x = Month, y = substance)) + geom_density_ridges()
```

We also tried visualizing using some other plots. 

```{r}
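# binwidth is a parameter, not an aesthetic, so it does not belong inside aes();
# there it was ignored and ggplot2 fell back to its default of 30 bins
ggplot(TotalMentions, aes(Month, color = substance)) +
  geom_histogram(bins = 30) +
  facet_grid(~substance)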
```

```{r}
# binwidth must be numeric and is only used by the default "bin" stat;
# on a Date axis it is measured in days
ggplot(TotalMentions, aes(Month)) +
  geom_histogram(aes(fill = factor(substance)), binwidth = 40) +
  facet_grid(~substance)
```

```{r}
# refer to columns by name inside aes(); binwidth inside aes() was ignored (default: 30 bins)
ggplot(TotalMentions, aes(Month, color = substance)) +
  geom_freqpoly(bins = 30)
```

```{r}
ggplot(TotalMentions, aes(time)) +
  geom_bar(aes(fill = factor(substance)))
```


# Sentiment analysis

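We ran sentiment analysis with the ```tidytext``` R package on our CSV file of mentions of all seven substances.

First, load the AFINN sentiment lexicon, which is available through ```tidytext```. AFINN assigns each word an integer score, with negative numbers for negative words and positive numbers for positive ones.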

```{r}
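# get_sentiments() fetches AFINN (through the textdata package on first use).
# Recent tidytext releases name the score column `value`; older ones called it `score`.
AFINN <- get_sentiments("afinn") %>%
  select(word, afinn_score = value)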
```

## Tokenize comments and merge with words ranked by sentiment

Next, tokenize the comments into one-word rows and remove stop words. Then join the remaining words with the AFINN-scored words; the inner join keeps only words that appear in the lexicon. The result is a 114,000-row tibble with X1, substance, word and sentiment score. (Uncomment the *write.csv()* line to create a CSV of this tibble.)

```{r}
TotalMentionsComments <- TotalMentions

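all_sentiment <- TotalMentionsComments %>%
  select(body, X1, time, substance, date) %>%
  unnest_tokens(word, body) %>%            # one lowercase word per row
  anti_join(stop_words, by = "word") %>%   # drop common stop words
  inner_join(AFINN, by = "word") %>%       # keep only AFINN-scored words
  group_by(X1, substance, word) %>%
  summarize(sentiment = mean(afinn_score)) # one row per comment, substance and word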

all_sentiment

# write.csv((all_sentiment), "reddit/all_sentiment.csv")
```
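If the pipeline above feels opaque, this sketch runs the same steps on two made-up comments. To avoid downloading anything, it uses a two-word ```toy_lexicon``` in place of AFINN; the scores in it are illustrative stand-ins, not values read from the real lexicon, and every ```toy_``` name is hypothetical.

```{r}
library(dplyr)
library(tidytext)

# Stand-in for the AFINN table (illustrative scores only)
toy_lexicon <- tibble(word = c("amazing", "horrible"),
                      afinn_score = c(4, -3))

# Two invented comments with the same columns the real pipeline relies on
toy_comments <- tibble(
  X1 = 1:2,
  substance = c("modafinil", "noopept"),
  body = c("Modafinil was amazing, truly amazing focus",
           "Noopept gave me a horrible headache")
)

toy_sentiment <- toy_comments %>%
  unnest_tokens(word, body) %>%              # lowercases and strips punctuation
  anti_join(stop_words, by = "word") %>%     # removes words such as "was" and "a"
  inner_join(toy_lexicon, by = "word") %>%   # drops every word without a score
  group_by(X1, substance, word) %>%
  summarize(sentiment = mean(afinn_score))   # repeated words in a comment collapse to one row

toy_sentiment
```

Note that "amazing" appears twice in the first comment but only once in the result, because the grouping collapses repeats of a word within a comment.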

## Visualize sentiment analysis 

Let's see what we've got. Using ```ggplot2``` we will plot words by sentiment and frequency, with dot size representing the frequency of words. Add a *geom_hline* to show the average sentiment. 

```{r}
ggplot(all_sentiment, aes(x = substance, y = sentiment)) +
  geom_point() +
  geom_count() +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1, hjust = 1) +
  geom_hline(yintercept = mean(all_sentiment$sentiment), color = "red", lty = 2)
```

To plot sentiment vs. word frequency, we first need to count word occurrences and then merge the counts back into the sentiment dataframe.

```{r}
all_sentiment_wordcount <- all_sentiment %>%
  select(X1, substance, word, sentiment) %>%
  group_by(word) %>%
  tally()

Bind_sent_and_word <- all_sentiment %>%
  full_join(all_sentiment_wordcount, by="word")
```
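Here is the same count-and-join step on a three-row toy table, to make clear what ```n``` ends up meaning. All names and values are invented for illustration.

```{r}
library(dplyr)

# Hypothetical scored words, shaped like all_sentiment
toy_scored <- tibble(
  X1 = c(1, 2, 3),
  substance = c("modafinil", "noopept", "noopept"),
  word = c("amazing", "amazing", "horrible"),
  sentiment = c(4, 4, -3)
)

# Count how many rows each word appears in, across all substances
toy_wordcount <- toy_scored %>%
  group_by(word) %>%
  tally()

# Attach the count to every row that contains the word
toy_joined <- toy_scored %>%
  full_join(toy_wordcount, by = "word")

toy_joined
```

Because the tally is grouped by word only, "amazing" gets ```n = 2``` on both its modafinil and noopept rows: the frequency is a total across substances, not a per-substance count. Counting by both columns, for example with ```count(substance, word)```, would be one way to get per-substance frequencies instead.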

OK, now we're ready to plot the sentiment of all substances vs. word frequency, using *facet_wrap* to split up the charts by substance.

```{r}
ggplot(Bind_sent_and_word, aes(y=n, x=sentiment, color=substance)) +
  geom_point() +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1, hjust = 1) +
  # sentiment is on the x-axis here, so the mean is drawn as a vertical line
  geom_vline(xintercept = mean(Bind_sent_and_word$sentiment), color = "red", lty = 2) +
  facet_wrap(~substance)
```

Since our Smithsonian story is only about modafinil, piracetam and noopept, let's filter by those three substances.

```{r}
Filtered_sent_vs_word <- Bind_sent_and_word %>%
  filter(substance %in% c("modafinil", "piracetam", "noopept")) %>%
  glimpse()
```

## Visualize our publication-ready chart

The graph we included in the story plots word frequency vs. sentiment and colors the points by substance and labels them by word. 

```{r}
ggplot(Filtered_sent_vs_word, aes(y=n, x=sentiment, color=substance)) +
  geom_point() +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.1, hjust = 1.1) 
```

We exported an SVG, brought it into Adobe Illustrator and designed it up. 
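One way to script an SVG export is ```ggsave()```, sketched below on a toy plot. This is an assumption about tooling rather than a record of how we exported ours: ```ggsave()``` relies on the ```svglite``` package for SVG output, so the chunk only saves when that package is installed, and the data, file name and dimensions are all hypothetical.

```{r}
library(ggplot2)

# Toy stand-in for Filtered_sent_vs_word (invented values)
toy_plot_data <- data.frame(
  sentiment = c(4, -3, 2),
  n = c(10, 5, 7),
  substance = c("modafinil", "noopept", "piracetam"),
  word = c("amazing", "horrible", "focus")
)

toy_plot <- ggplot(toy_plot_data, aes(y = n, x = sentiment, color = substance)) +
  geom_point() +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.1, hjust = 1.1)

# Write to a temporary folder so nothing in the project is overwritten
toy_svg_path <- file.path(tempdir(), "toy_sentiment_plot.svg")

# ggsave() picks the SVG device from the file extension; it needs svglite for that
if (requireNamespace("svglite", quietly = TRUE)) {
  ggsave(toy_svg_path, plot = toy_plot, width = 8, height = 6)
}
```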

That's it!