--- title: "Trump Twitter analysis using the `tidyverse`" author: "Adam Spannbauer and Jennifer Chunn" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: df_print: kable vignette: | %\VignetteIndexEntry{Trump Twitter tidyverse analysis} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- This vignette is based on data collected for the 538 story entitled "The World's Favorite Donald Trump Tweets" by Leah Libresco available [here](https://fivethirtyeight.com/features/the-worlds-favorite-donald-trump-tweets/). Load required packages to reproduce analysis. ```{r, message=FALSE, warning=FALSE} library(fivethirtyeight) library(ggplot2) library(dplyr) library(readr) library(tidytext) library(textdata) library(stringr) library(lubridate) library(knitr) library(hunspell) # Turn off scientific notation options(scipen = 99) ``` ## Check date range of tweets ```{r date_range} ## check out structure and date range ------------------------------------------------ (minDate <- min(date(trump_twitter$created_at))) (maxDate <- max(date(trump_twitter$created_at))) ``` # Create vectorised stemming function using hunspell ```{r hunspell} my_hunspell_stem <- function(token) { stem_token <- hunspell_stem(token)[[1]] if (length(stem_token) == 0) return(token) else return(stem_token[1]) } vec_hunspell_stem <- Vectorize(my_hunspell_stem, "token") ``` # Clean text by tokenizing & removing urls/stopwords We first remove URLs and stopwords as specified in the `tidytext` library. Stopwords are common words in English. We also do spellchecking using hunspell. ```{r tokens} trump_tokens <- trump_twitter %>% mutate(text = str_replace_all(text, pattern=regex("(www|https?[^\\s]+)"), replacement = "")) %>% #rm urls mutate(text = str_replace_all(text, pattern = "[[:digit:]]", replacement = "")) %>% unnest_tokens(tokens, text) %>% #tokenize mutate(tokens = vec_hunspell_stem(tokens)) %>% filter(!(tokens %in% stop_words$word)) #rm stopwords ``` # Sentiment analysis To measure the sentiment of tweets, we used the AFINN lexicon for each (non-stop) word in a tweet. The score runs between -5 and 5. We then sum the scores for each word across all words in one tweet to get a total tweet sentiment score. ```{r sentiment} afinn_sentiment <- system.file("extdata", "afinn.csv", package = "fivethirtyeight") %>% read_csv() trump_sentiment <- trump_tokens %>% inner_join(afinn_sentiment, by=c("tokens"="word")) trump_full_text_sent <- trump_sentiment %>% group_by(id) %>% summarise(score = sum(value, na.rm=TRUE)) %>% ungroup() %>% right_join(trump_twitter, by="id") %>% mutate(score_factor = ifelse(is.na(score), "Missing score", ifelse(score < 0, "-.Negative", ifelse(score == 0, "0", "+.Pos")))) ``` ## Distribution of sentiment scores ```{r} trump_full_text_sent %>% count(score_factor) %>% mutate(prop = prop.table(n)) ``` 46.4% of tweets did not have sentiment scores. 15.4% were net negative and 36.6% were net positive. 
```{r sentiment_hist, fig.width=7, warning=FALSE}
ggplot(data = trump_full_text_sent, aes(score)) +
  geom_histogram(bins = 10)
```

# Plot sentiment over time

```{r plot_time, fig.width=7}
sentOverTimeGraph <- ggplot(data = filter(trump_full_text_sent, !is.na(score)),
                            aes(x = created_at, y = score)) +
  geom_line() +
  geom_point() +
  xlab("Date") +
  ylab("Sentiment (afinn)") +
  ggtitle(paste0("Trump Tweet Sentiment (", minDate, " to ", maxDate, ")"))
sentOverTimeGraph
```

# Examine top 5 most positive tweets

```{r pos_tweets}
most_pos_trump <- trump_full_text_sent %>%
  arrange(desc(score)) %>%
  head(n = 5) %>%
  .[["text"]]
kable(most_pos_trump, format = "html")
```

# Examine top 5 most negative tweets

```{r neg_tweets}
most_neg_trump <- trump_full_text_sent %>%
  arrange(score) %>%
  head(n = 5) %>%
  .[["text"]]
kable(most_neg_trump, format = "html")
```

# When is Trump's favorite time to tweet?

Below we plot the total number of tweets and the average sentiment (when available) by hour of the day, day of the week, and month.

```{r tweet_time}
trump_tweet_times <- trump_full_text_sent %>%
  mutate(weekday = wday(created_at, label = TRUE),
         month = month(created_at, label = TRUE),
         hour = hour(created_at),
         month_over_time = round_date(created_at, "month"))

plotSentByTime <- function(trump_tweet_times, timeGroupVar) {
  timeVar <- substitute(timeGroupVar)
  timeVarLabel <- str_to_title(timeVar)

  # Aggregate mean sentiment and tweet count by the chosen time grouping
  trump_tweet_time_sent <- trump_tweet_times %>%
    rename(timeGroup = !!timeVar) %>%
    group_by(timeGroup) %>%
    summarise(score = mean(score, na.rm = TRUE),
              Count = n()) %>%
    ungroup()

  ggplot(trump_tweet_time_sent, aes(x = timeGroup, y = Count, fill = score)) +
    geom_bar(stat = "identity") +
    xlab(timeVarLabel) +
    ggtitle(paste("Trump Tweet Count & Sentiment by", timeVarLabel))
}
```

```{r plot_hour, fig.width=7, warning=FALSE}
plotSentByTime(trump_tweet_times, "hour")
```

* Trump tweeted the least between 4 and 10 am.
* His tweets were most positive during the 10 am hour.

```{r plot_weekday, fig.width=7, warning=FALSE}
plotSentByTime(trump_tweet_times, "weekday")
```

* Trump tweeted the most on Tuesday and Wednesday.
* He was most positive in the second half of the work week (Wednesday through Friday).

```{r plot_month, fig.width=7, warning=FALSE}
plotSentByTime(trump_tweet_times, "month_over_time")
```

* In this dataset, the number of tweets decreased after November 2015 and dropped off drastically after March 2016. It is unclear whether this reflects an actual decrease in tweeting frequency or an artifact of the data collection process; the tabulation sketched below shows the raw monthly counts behind the plot.
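As a quick sanity check on that drop-off, here is a minimal sketch that tabulates raw tweet counts by the `month_over_time` column created above (the chunk simply reuses the `trump_tweet_times` data frame; no new columns are assumed):

```{r monthly_counts}
# Raw tweet counts per month; a sharp drop at the end is consistent with
# either reduced tweeting or truncated data collection, but the counts
# alone cannot distinguish the two
trump_tweet_times %>%
  count(month_over_time) %>%
  arrange(month_over_time)
```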