# Analyzing YouTube Comments in R

In this notebook, we will interactively walk through the analysis of YouTube comments. To run a code chunk, simply click on it and hit the "Run" button from the menu above. 

If you want to know (more about) what Jupyter notebooks are and how to use them, see, e.g., https://medium.com/ibm-data-science-experience/back-to-basics-jupyter-notebooks-dfcdc19c54bc

For instructions on how to download and parse the data that we will be using here, please refer to the R script file in the GitHub repository associated with this notebook. The script and notebook are part of an ongoing research project (see https://www.researchgate.net/project/Methods-and-Tools-for-Automatic-Sampling-and-Analysis-of-YouTube-Comments) and will be subject to change. 

If you use substantive parts of this notebook or the accompanying script as part of your own research, please cite it in the following way:

Kohne, J., Breuer, J., & Mohseni, M. R. (2019). Methods and Tools for Automatic Sampling and Analysis of YouTube User Comments: https://doi.org/10.17605/OSF.IO/HQSXE

***

## Installing & loading packages

In this Notebook, we have already pre-installed all necessary packages that you need to run our code. These packages are:

 - devtools
 - tm
 - quanteda
 - tuber
 - qdapRegex
 - rlang
 - purrr
 - ggplot2
 - syuzhet
 - lexicon
 
In addition to the above CRAN packages, we also included two packages for handling emojis that are not on CRAN (yet) and have to be installed directly from GitHub:

 - emo
 - emoGG

If you want to replicate our analysis locally or want to adapt it, you will need to install those packages on your machine as well. To do so, you can download the file "install.R" in the folder called "binder" from this notebook and run it on your local machine before your analysis.

**NOTE:** The Binder server on which this notebook runs still uses an older version of R (3.4.4).
This is why we need to install an older version of the statnet.common package, which the quanteda package relies on. If you want to replicate this analysis locally and are using the newest version of R, you can skip the statnet.common installation and simply install the quanteda package.

In [None]:
# loading CRAN libraries
library(devtools)
library(tm)
library(quanteda)
library(qdapRegex)
library(rlang)
library(purrr)
library(ggplot2)
library(syuzhet)
library(lexicon)

# loading Github libraries
library(emo)
library(emoGG)

***

## Setting options

Because we are working with text data, we need to set the following option so that text is not interpreted as a categorical variable (factor) by R.

In [None]:
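# note the spelling: the option is called stringsAsFactors (a misspelled name is silently ignored)
options(stringsAsFactors = FALSE)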

***

## Importing data

In this notebook, we will work with a saved version of an already parsed dataframe. Please check the R script file
in the associated GitHub repository for a walkthrough of how to extract comments from YouTube and parse them. We will be using the comments of this video, extracted in February 2019:

https://www.youtube.com/watch?v=DcJFdCmN98s 

In [None]:
# Loading prepared dataset - the name of this dataframe is "FormattedComments"
load("./Data/ParsedCommentsUTF8.Rdata")

In [None]:
# sorting comments by date
FormattedComments <- FormattedComments[order(FormattedComments$Published),]

***

## Analyzing YouTube comments

### Overview

First, let's have a look at an excerpt of the data to see how it is structured. We will display the
first 10 rows of the dataframe.

In [None]:
# view first 10 rows of dataframe
head(FormattedComments,10)

In this dataframe, we already parsed the information extracted with the tuber package into formats that make the data easily usable. For example, we can use the DateTime column to see the number of new comments
over time without much formatting.

In [None]:
# Create helper dataframe
CommentsCounter <- rep(1,dim(FormattedComments)[1])
CounterFrame <- data.frame(CommentsCounter,unlist(FormattedComments[,8]))
colnames(CounterFrame) <- c("CommentCounter","DateTime")

In [None]:
# bin by week
CounterFrame$DateTime <- as.Date(cut(CounterFrame$DateTime, breaks = "week"))
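
As a small illustration of what the weekly binning does, the following base R sketch bins three made-up dates. The vectors `example_dates` and `example_weeks` are hypothetical and not part of our data. With `breaks = "week"`, `cut()` assigns each date to the Monday that starts its week.

```r
# hypothetical example dates (not taken from the comment data)
example_dates <- as.Date(c("2019-02-05", "2019-02-08", "2019-02-12"))

# assign each date to the start of its week (weeks start on Monday by default)
example_weeks <- as.Date(cut(example_dates, breaks = "week"))

# count how many dates fall into each week
table(example_weeks)
```

The first two dates end up in the week starting on 2019-02-04, the third in the week starting on 2019-02-11.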

In [None]:
# compute percentiles
PercTimes <- round(quantile(cumsum(CounterFrame$CommentCounter), probs = c(0.5, 0.75, 0.9, 0.99)))
CounterFrame$DateTime[PercTimes]
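
The percentile step can be hard to read at first, so here is a minimal sketch of the same idea on made-up data. The names `toy_dates` and `toy_idx` are hypothetical. Because the comments are sorted by date, the position of a comment in the vector tells us which share of all comments had been posted by then.

```r
# hypothetical, already sorted comment dates (one entry per comment)
toy_dates <- as.Date("2019-01-01") + c(0, 0, 1, 1, 1, 2, 3, 5, 9, 30, 90)

# position of the comment that marks 50% and 90% of all comments
toy_idx <- round(quantile(seq_along(toy_dates), probs = c(0.5, 0.9)))

# dates by which 50% and 90% of the comments had been posted
toy_dates[toy_idx]
```

In this toy example, half of the comments were posted within the first three days, while it took a month to reach 90%.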

In [None]:
# plot
ggplot(CounterFrame,aes(x=DateTime,y=CommentCounter)) +
 stat_summary(fun.y=sum,geom="bar") +
 scale_x_date() +
 labs(title = "Number of comments per week", subtitle = "Schmoyoho - OH MY DAYUM ft. Daym Drops \nhttps://www.youtube.com/watch?v=DcJFdCmN98s") +
 geom_vline(xintercept = CounterFrame$DateTime[PercTimes],linetype = "dashed", colour = "red")+
 geom_text(aes(x = as.Date(CounterFrame$DateTime[PercTimes][1]) , label = "50%", y = 3500), colour="red", angle=90, vjust = 1.2) +
 geom_text(aes(x = as.Date(CounterFrame$DateTime[PercTimes][2]) , label = "75%", y = 3500), colour="red", angle=90, vjust = 1.2) +
 geom_text(aes(x = as.Date(CounterFrame$DateTime[PercTimes][3]) , label = "90%", y = 3500), colour="red", angle=90, vjust = 1.2) +
 geom_text(aes(x = as.Date(CounterFrame$DateTime[PercTimes][4]) , label = "99%", y = 3500), colour="red", angle=90, vjust = 1.2)

***

### Basic frequency analysis for text

In this section, we give a brief outline of text analysis for YouTube Comments.
This part is largely based on this tutorial:

https://docs.quanteda.io/articles/pkgdown/examples/plotting.html

We use the dataframe column without the emojis for the textual analysis here.

First of all, we need to remove newline characters from the comment texts.

In [None]:
# Removing newline characters from comments
FormattedComments$TextEmojiDeleted <- gsub(FormattedComments$TextEmojiDeleted, pattern = "\\\n", replacement = " ")

Next, we need to tokenize the comments (i.e., split them up into individual words, separated by a space).
This step also simplifies the text by:
- removing all numbers
- removing all punctuation
- removing all non-character symbols
- removing all hyphens
- removing all URLs 

In [None]:
# Tokenize the comments
# for more information and options check:
# https://www.rdocumentation.org/packages/quanteda/versions/1.4.0/topics/tokens

toks <- tokens(char_tolower(FormattedComments$TextEmojiDeleted),
 remove_numbers = TRUE,
 remove_punct = TRUE,
 remove_separators = TRUE,
 remove_symbols = TRUE,
 remove_hyphens = TRUE,
 remove_url = TRUE)
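
To make the simplification steps more tangible, here is a rough base R sketch of what tokenization does conceptually. This is not how quanteda implements `tokens()`. The function `simple_tokenize()` is a hypothetical helper, and it assumes English text (everything except the letters a-z is dropped).

```r
# hypothetical helper: a very simplified tokenizer for English text
simple_tokenize <- function(text) {
  text <- tolower(text)
  # drop URLs
  text <- gsub("https?://\\S+", " ", text, perl = TRUE)
  # drop numbers, punctuation, hyphens, and other symbols
  text <- gsub("[^a-z\\s]", " ", text, perl = TRUE)
  # split on whitespace and discard empty strings
  tokens <- strsplit(trimws(text), "\\s+", perl = TRUE)[[1]]
  tokens[tokens != ""]
}

toy_tokens <- simple_tokenize("OH MY DAYUM!!! 10/10, see https://example.com :-)")
toy_tokens
```

Only the four words "oh", "my", "dayum", and "see" remain. For real analyses, the quanteda tokenizer used above is the better choice, as it handles many more edge cases.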

Next, we build a document-term matrix and remove stopwords. For more information see:

https://en.wikipedia.org/wiki/Document-term_matrix

https://en.wikipedia.org/wiki/Stop_words

Stopwords are very frequent words that appear in almost all texts (e.g. "a","but","it").
These words occur with about the same frequency in all kinds of texts and are, hence, not very informative.
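
Conceptually, removing stopwords just means filtering the tokens against a list of words. A minimal base R sketch, using a made-up mini stopword list (`toy_stopwords` is hypothetical and much shorter than the list that quanteda provides):

```r
# hypothetical tokens of a single comment
toy_tokens <- c("oh", "my", "dayum", "it", "is", "a", "burger")

# hypothetical mini stopword list
toy_stopwords <- c("a", "it", "is", "my")

# keep only the tokens that are not stopwords
toy_kept <- toy_tokens[!toy_tokens %in% toy_stopwords]
toy_kept
```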

In [None]:
# Create document-term frequency matrix
commentsDfm <- dfm(toks, remove = quanteda::stopwords("english"))

We now have a matrix where each column represents a token that occurs at least once in the collection of comments and each row represents a comment. Each cell contains the number of times the respective token occurs in the respective comment; if a token is not contained in a comment, the cell contains a 0.

We can use this document-term matrix to visualize the occurrence of tokens.

In [None]:
# Display the n most frequent tokens
TermFreq <- textstat_frequency(commentsDfm)
head(TermFreq, n = 20)

After inspecting the most frequent terms, we might want to exclude certain
terms that are not informative for us (e.g. the word "video") or are
artifacts of online communication (e.g. xd or d as leftovers of ASCII emojis).

In [None]:
# This is just an example, you can (and should) create your own list for each video
CustomStops <- c("video","oh","d","now","get","go","xd", "youtube", "lol")

In [None]:
# We can create another document-frequency matrix that excludes the custom stopwords that we just defined
commentsDfm <- dfm(toks, remove = c(quanteda::stopwords("english"),CustomStops))

In [None]:
# Recalculate and display the n most frequent tokens
TermFreq <- textstat_frequency(commentsDfm)
head(TermFreq, n = 20)

Next, we can visualize the frequency of tokens with some plots. First of all, let's check the overall frequency of terms across all comments.

In [None]:
# Sort by reverse frequency order (i.e., from most to least frequent)
TermFreq$feature <- with(TermFreq, reorder(feature, -frequency))

# Plot frequency of 50 most common words
ggplot(head(TermFreq, n = 50), aes(x = feature, y = frequency)) + # you can change n to choose how many words are plotted
 geom_point() +
 theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust= 0.5)) +
 labs(title = "Most frequent words in comments", subtitle = "Schmoyoho - OH MY DAYUM ft. Daym Drops \nhttps://www.youtube.com/watch?v=DcJFdCmN98s")

With the above method, we're only counting the overall occurrence across all comments. This might be biased
by some users spamming the same tokens many times in the same comment, while other comments might not contain
the term at all. To see whether this is a problem in our data, let's plot the number of comments in which each
token occurs at least once. This is typically called the document frequency.
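
The difference between overall frequency and document frequency can be seen in a tiny hand-made example. The sketch below builds a small document-term matrix in base R. All names (`toy_docs`, `toy_dtm`, and so on) are hypothetical, and the matrix is built by hand here rather than with quanteda.

```r
# three hypothetical tokenized comments; the first one repeats "good"
toy_docs <- list(c("good", "burger", "good"), c("burger", "fries"), c("dayum"))

# all distinct tokens become the columns of the matrix
toy_vocab <- sort(unique(unlist(toy_docs)))

# count how often each token occurs in each comment
toy_dtm <- t(vapply(toy_docs,
                    function(tokens) as.integer(table(factor(tokens, levels = toy_vocab))),
                    integer(length(toy_vocab))))
dimnames(toy_dtm) <- list(paste0("comment", seq_along(toy_docs)), toy_vocab)

# overall frequency: total number of occurrences per token
toy_termfreq <- colSums(toy_dtm)

# document frequency: number of comments that contain the token at least once
toy_docfreq <- colSums(toy_dtm > 0)

rbind(toy_termfreq, toy_docfreq)
```

Here "good" and "burger" have the same overall frequency, but "good" only appears in a single comment, whereas "burger" appears in two.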

In [None]:
# sort by reverse document frequency order (i.e., from most to least frequent)
TermFreq$feature <- with(TermFreq, reorder(feature, -docfreq))

# plot terms that appear in the highest number of comments
ggplot(head(TermFreq, n = 50), aes(x = feature, y = docfreq)) + # you can change n to choose how many words are plotted
 geom_point() +
 theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust= 0.5)) +
 labs(title = "Number of comments that each token is contained in", subtitle = "Schmoyoho - OH MY DAYUM ft. Daym Drops \nhttps://www.youtube.com/watch?v=DcJFdCmN98s")

By manual inspection, we do not see any extreme deviations, even though this is a completely subjective
assessment. Whether you want to rely on overall frequency or on document frequency for your analysis depends
on your research question, your data, and your personal assessment. For most examples in this notebook, we will continue to use the overall frequency.

We can also use our document-term frequency matrix to generate a wordcloud of the most common tokens.

In [None]:
# Setting a random seed to make the wordcloud reproducible (this can be any number)
set.seed(12345)

# Creating wordcloud
textplot_wordcloud(dfm_select(commentsDfm, min_nchar=1),
 random_order=FALSE,
 max_words=100)

***

### Sentiment Analysis for Text

We want to compute sentiment scores per comment. This is done by matching the text strings with a dictionary of word sentiments, and adding them up per document (in our case comments). Depending on the type of content you want to analyze, a different sentiment dictionary might be suitable. For our example, we decided to use the AFINN dictionary: 

http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010

For more options, check:

https://www.rdocumentation.org/packages/syuzhet/versions/1.0.4/topics/get_sentiment 
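
To illustrate the principle of dictionary-based scoring, here is a minimal base R sketch. The dictionary `toy_dict` and its scores are made up for this example (they are not taken from AFINN), and `score_comment()` is a hypothetical helper, not the method that syuzhet uses internally.

```r
# hypothetical mini sentiment dictionary (word = score)
toy_dict <- c(good = 3, great = 3, bad = -3, awful = -3)

# hypothetical helper: sum the scores of all dictionary words in a comment
score_comment <- function(comment, dict) {
  # lowercase the text and split it at everything that is not a letter
  words <- strsplit(tolower(comment), "[^a-z]+")[[1]]
  # words that are not in the dictionary yield NA and are ignored
  sum(dict[words], na.rm = TRUE)
}

toy_pos <- score_comment("Great video, good song!", toy_dict)
toy_neg <- score_comment("This is awful", toy_dict)
c(toy_pos, toy_neg)
```

The first comment contains two positive dictionary words and gets a score of 6, the second contains one negative word and gets a score of -3.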

In [None]:
# compute sentiment scores
CommentSentiment <- get_sentiment(FormattedComments$TextEmojiDeleted, method = "afinn")

First of all, let's get an overview of the sentiment scores per comment.

In [None]:
# summary statistics for sentiment scores per comment
summary(CommentSentiment)

In [None]:
# display comments with a sentiment score below x
x <- -15
as.list(FormattedComments$TextEmojiDeleted[CommentSentiment < x])

In [None]:
# display comments with a sentiment score above x
x <- 15
as.list(FormattedComments$TextEmojiDeleted[CommentSentiment > x])

In [None]:
# display most negative/positive comment
FormattedComments$TextEmojiDeleted[CommentSentiment == min(CommentSentiment)]
FormattedComments$TextEmojiDeleted[CommentSentiment == max(CommentSentiment)]

By manual inspection, our approach seems to have worked fine, with comments having a negative score being negative
and comments with a positive score being positive. However, just assigning sentiments to words and then summing
sentiments per comment (bag-of-words approach) can have some pitfalls. Consider the following cases, for example.

In [None]:
# display specific comment
as.list(FormattedComments$TextEmojiDeleted[CommentSentiment < -10])[5]

As humans, we can see that this comment is meant to be positive. However, the sentiment sum for the comment is negative, mostly due to the strong use of swearwords:

Fucking hilarious! And that guy could either do commercials or be an actor, I've never, in my entire life, heard anyone express themselves that strongly about a fucking hamburger. And now all I know is I have never eaten one of those but damned if I won't have it on my list of shit to do tomorrow! Hell of a job by schmoyoho as well, whoever said this should be a commercial hit it on the head.

By contrast, the following negative comment with very civil language is labelled with a positive sentiment.

In [None]:
# Display specific comment
as.list(FormattedComments$TextEmojiDeleted[CommentSentiment > 10])[2]

As humans, we can see that this comment is meant to be negative. However, the sentiment sum for the comment is positive, mostly due to the negated positive words:

Schmoyoho, we're not really entertained by you anymore. You're sort of like Dane Cook. At first we thought, "Wow! Get a load of this channel! It's funny!" But then we realized after far too long, "Wow, these guys are just a one trick pony! There is absolutely nothing I like about these people!" You've run your course. The shenanigans, the "songifies".. we get it. It's just not that funny man. We don't really like you. So please, for your own sake, go and actually try to make some real friends.

As a general note of caution, if you are analyzing text with sentiment dictionaries, you should, hence, always be aware of the issues outlined above, manually inspect your text and be careful when interpreting your results.
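
Both pitfalls can be reproduced with a toy example. In the sketch below, the dictionary `toy_dict` and its scores are made up (they are not the AFINN values), and `toy_score()` is a hypothetical helper that simply sums the scores of all dictionary words in a sentence.

```r
# hypothetical mini sentiment dictionary (word = score)
toy_dict <- c(hilarious = 2, funny = 4, like = 2, damned = -4, hell = -4)

# hypothetical helper: bag-of-words sentiment sum
toy_score <- function(x) {
  words <- strsplit(tolower(x), "[^a-z]+")[[1]]
  sum(toy_dict[words], na.rm = TRUE)
}

# praise expressed with swearwords receives a negative score
toy_praise <- toy_score("Damned, that was a hell of a job, hilarious")

# criticism expressed with negated positive words receives a positive score
toy_critique <- toy_score("It is just not that funny, nothing I like about it")

c(toy_praise, toy_critique)
```

The scoring ignores word order and negation entirely, which is why "not that funny" counts just as much as "funny".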

***

### Visualizing comment sentiments

Even though sentiment analysis using sums of word-dictionary mappings per comment is not perfect, it might be interesting to get an overview of the distribution of comment sentiments. Let's visualize it!

In [None]:
# build helper dataframe to distinguish between positive, negative and neutral comments
# the labels are derived from the numeric scores directly, which avoids comparing strings with numbers
Desc <- ifelse(CommentSentiment > 0, "positive",
 ifelse(CommentSentiment < 0, "negative", "neutral"))
df <- data.frame(FormattedComments$TextEmojiDeleted,CommentSentiment,Desc)
colnames(df) <- c("Comment","Sentiment","Valence")

# display amount of positive, negative, and neutral comments
ggplot(data=df, aes(x=Valence, fill = Valence)) +
 geom_bar(stat='count') +
 labs(title = "Sentiment Categories", subtitle = "Schmoyoho - OH MY DAYUM ft. Daym Drops \nhttps://www.youtube.com/watch?v=DcJFdCmN98s")

In [None]:
# distribution of comment sentiments
ggplot(df, aes(x=Sentiment)) +
 geom_histogram(binwidth = 1) +
 geom_vline(aes(xintercept=mean(Sentiment)),
 color="black", linetype="dashed", size=1) +
 scale_x_continuous(limits=c(-10,10))+
 labs(title = "Distribution of Comment Sentiment Scores", subtitle = "Schmoyoho - OH MY DAYUM ft. Daym Drops \nhttps://www.youtube.com/watch?v=DcJFdCmN98s")

We can see that most comments seem to be neutral and we have more comments with positive sentiment than with negative sentiment for this video.

***

### Basic frequency analysis for emojis

Online communication is different from more traditional forms of written communication in many ways. One of those differences is the use of emojis to express concepts and emotions. In many textual analyses, emojis are not used at all and simply discarded. In this notebook, we will have a look at the use of emojis in YouTube comments. First of all, we need to make the emojis usable for further analyses. This has largely been done in the parsing step (see the R script in the GitHub repo for scraping and parsing the data), so we only need to do some minor preparation here.

**NOTE:** There is a persistent issue with encoding problems for emojis in R on Windows. We tested the code in this notebook on several Windows machines and it should work there as well. If the code does not work for you offline, Windows encoding problems are a likely culprit. 

In [None]:
# First, we need to define missing values correctly
FormattedComments$Emoji[FormattedComments$Emoji == "NA"] <- NA

# next, we remove spaces at the end of the string
FormattedComments$Emoji <- substr(FormattedComments$Emoji, 1, nchar(FormattedComments$Emoji)-1)

# then we tokenize emoji descriptions (important for comments that contain more than one emoji)
EmojiToks <- tokens(FormattedComments$Emoji)

# afterwards, we create an emoji frequency matrix, excluding "NA" as a term
EmojiDfm <- dfm(EmojiToks, remove = "NA")
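
The preparation steps above can be illustrated on a toy vector in base R. The vector `toy_emoji_col` is made up and only mimics the format of the `Emoji` column (emoji descriptions separated by spaces, with a trailing space, and the string "NA" for comments without emojis).

```r
# hypothetical emoji column for three comments
toy_emoji_col <- c("emoji_hamburger emoji_fire emoji_fire ", "NA", "emoji_fire ")

# define missing values correctly
toy_emoji_col[toy_emoji_col == "NA"] <- NA

# remove the space at the end of each string
toy_emoji_col <- sub(" $", "", toy_emoji_col)

# split each comment into individual emoji descriptions
toy_emoji_tokens <- strsplit(toy_emoji_col, " ")

# count how often each emoji occurs overall (missing values are ignored by table())
toy_emoji_freq <- sort(table(unlist(toy_emoji_tokens)), decreasing = TRUE)
toy_emoji_freq
```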

We now have a "document-emoji frequency matrix" and can treat the emojis just like we treated the other tokens in our previous text analyses. Let's check out the most frequent emojis.

In [None]:
# list the most frequent emojis in the comments
EmojiFreq <- textstat_frequency(EmojiDfm)
head(EmojiFreq, n = 25)

Let's visualize the emoji frequencies.

In [None]:
# sort by reverse frequency order (i.e., from most to least frequent)
EmojiFreq$feature <- with(EmojiFreq, reorder(feature, -frequency))

# plot
ggplot(head(EmojiFreq, n = 50), aes(x = feature, y = frequency)) + # you can change n to choose how many Emojis are plotted 
 geom_point() + 
 theme(axis.text.x = element_text(angle = 90, hjust = 1,vjust=0.5)) +
 labs(title = "Most frequent Emojis", subtitle = "Schmoyoho - OH MY DAYUM ft. Daym Drops \nhttps://www.youtube.com/watch?v=DcJFdCmN98s")

Interestingly, just as words do, emojis seem to follow a Zipf-like distribution:

https://en.wikipedia.org/wiki/Zipf%27s_law
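
One informal way to check for a Zipf-like pattern is to regress log frequency on log rank: under an ideal Zipf distribution, the slope is close to -1. The sketch below uses idealised, made-up frequencies (`toy_rank` and `toy_freq` are hypothetical), so it only demonstrates the mechanics and says nothing about our data.

```r
# idealised Zipf frequencies: frequency is proportional to 1 / rank
toy_rank <- 1:50
toy_freq <- 1000 / toy_rank

# linear fit on the log-log scale
zipf_fit <- lm(log(toy_freq) ~ log(toy_rank))

# the slope of the fit (close to -1 for a Zipf-like distribution)
zipf_slope <- unname(coef(zipf_fit)[2])
zipf_slope
```

The same kind of fit could be tried on the `rank` and `frequency` columns of `EmojiFreq`, keeping in mind that real data will deviate from the ideal line.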

However, our plot still looks a bit bland. Let's make it look nicer by mapping emojis to the respective points in the plot.

In [None]:
# create mappings to display scatterplot points as emojis
mapping1 <- geom_emoji(data = EmojiFreq[EmojiFreq$feature == "emoji_facewithtearsofjoy",], aes(feature,frequency), emoji = "1f602")
mapping2 <- geom_emoji(data = EmojiFreq[EmojiFreq$feature == "emoji_hamburger",], aes(feature,frequency), emoji = "1f354")
mapping3 <- geom_emoji(data = EmojiFreq[EmojiFreq$feature == "emoji_frenchfries",], aes(feature,frequency), emoji = "1f35f")
mapping4 <- geom_emoji(data = EmojiFreq[EmojiFreq$feature == "emoji_smilingfacewithsunglasses",], aes(feature,frequency), emoji = "1f60e")
mapping5 <- geom_emoji(data = EmojiFreq[EmojiFreq$feature == "emoji_smilingface",], aes(feature,frequency), emoji = "263a")
mapping6 <- geom_emoji(data = EmojiFreq[EmojiFreq$feature == "emoji_fire",], aes(feature,frequency), emoji = "1f525")
mapping7 <- geom_emoji(data = EmojiFreq[EmojiFreq$feature == "emoji_loudlycryingface",], aes(feature,frequency), emoji = "1f62d")
mapping8 <- geom_emoji(data = EmojiFreq[EmojiFreq$feature == "emoji_smilingfacewithheart-eyes",], aes(feature,frequency), emoji = "1f60d")
mapping9 <- geom_emoji(data = EmojiFreq[EmojiFreq$feature == "emoji_rollingonthefloorlaughing",], aes(feature,frequency), emoji = "1f923")
mapping10 <- geom_emoji(data = EmojiFreq[EmojiFreq$feature == "emoji_redheart",], aes(feature,frequency), emoji = "2764")

# sort by reverse frequency order
EmojiFreq$feature <- with(EmojiFreq, reorder(feature, -frequency))

# plot 10 most common emojis using their graphical representation as points in the scatterplot
ggplot(head(EmojiFreq, n = 10), aes(x = feature, y = frequency)) + # head() selects the first 10 rows
 geom_point() + 
 theme(axis.text.x = element_text(angle = 90, hjust = 1,vjust=0.5)) +
 labs(title = "10 Most Frequent Emojis", subtitle = "Schmoyoho - OH MY DAYUM ft. Daym Drops \nhttps://www.youtube.com/watch?v=DcJFdCmN98s") +
 mapping1 +
 mapping2 +
 mapping3 +
 mapping4 +
 mapping5 +
 mapping6 +
 mapping7 +
 mapping8 +
 mapping9 +
 mapping10


Just like with the text tokens, it might be that some comments contain a specific emoji numerous times. For this reason, we will also check the number of comments that each emoji is contained in.

In [None]:
# sort by reverse document frequency order (i.e., from most to least frequent)
EmojiFreq$feature <- with(EmojiFreq, reorder(feature, -docfreq))

# plot
ggplot(head(EmojiFreq,n = 50), aes(x = feature, y = docfreq)) + # you can change n to choose how many Emojis are plotted 
 geom_point() + 
 theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
 labs(title = "Emojis contained in most Comments", subtitle = "Schmoyoho - OH MY DAYUM ft. Daym Drops \nhttps://www.youtube.com/watch?v=DcJFdCmN98s")

Again, we can use emojis as points to make this plot look cooler.

In [None]:
# create a new frame ordered by document frequency rather than overall frequency
NewOrder <- EmojiFreq[order(-EmojiFreq$docfreq),]

# create mappings to display scatterplot points as emojis
mapping1 <- geom_emoji(data = NewOrder[NewOrder$feature == "emoji_facewithtearsofjoy",], aes(feature,docfreq), emoji = "1f602")
mapping2 <- geom_emoji(data = NewOrder[NewOrder$feature == "emoji_hamburger",], aes(feature,docfreq), emoji = "1f354")
mapping3 <- geom_emoji(data = NewOrder[NewOrder$feature == "emoji_loudlycryingface",], aes(feature,docfreq), emoji = "1f62d")
mapping4 <- geom_emoji(data = NewOrder[NewOrder$feature == "emoji_fire",], aes(feature,docfreq), emoji = "1f525")
mapping5 <- geom_emoji(data = NewOrder[NewOrder$feature == "emoji_redheart",], aes(feature,docfreq), emoji = "2764")
mapping6 <- geom_emoji(data = NewOrder[NewOrder$feature == "emoji_heartsuit",], aes(feature,docfreq), emoji = "2665")
mapping7 <- geom_emoji(data = NewOrder[NewOrder$feature == "emoji_frenchfries",], aes(feature,docfreq), emoji = "1f35f")
mapping8 <- geom_emoji(data = NewOrder[NewOrder$feature == "emoji_rollingonthefloorlaughing",], aes(feature,docfreq), emoji = "1f923")
mapping9 <- geom_emoji(data = NewOrder[NewOrder$feature == "emoji_thumbsup",], aes(feature,docfreq), emoji = "1f44d")
mapping10 <- geom_emoji(data = NewOrder[NewOrder$feature == "emoji_smilingfacewithheart-eyes",], aes(feature,docfreq), emoji = "1f60d")

# plot 10 emojis that most comments mention at least once
ggplot(head(NewOrder, n = 10), aes(x = feature, y = docfreq)) + # head() selects the first 10 rows
 geom_point() + 
 theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5)) +
 labs(title = "10 Emojis contained in most Comments", subtitle = "Schmoyoho - OH MY DAYUM ft. Daym Drops \nhttps://www.youtube.com/watch?v=DcJFdCmN98s")+
 mapping1 +
 mapping2 +
 mapping3 +
 mapping4 +
 mapping5 +
 mapping6 +
 mapping7 +
 mapping8 +
 mapping9 +
 mapping10

***

### Emoji sentiment analysis

Emojis are often used to convey emotions, so they might be a valuable addition
when assessing the sentiment of a comment. To do this, we need a dictionary that maps sentiment scores to specific emojis.

In [None]:
# import emoji dictionary (from the lexicon package)
EmojiSentiments <- emojis_sentiment

Unfortunately, the dictionary only contains 734 different emojis. Those were the most frequently used ones when the study on which the dictionary is based was conducted.

You can view the emoji sentiment scores online here:

http://kt.ijs.si/data/Emoji_sentiment_ranking/index.html


We have to match the sentiment scores to our descriptions of the emojis, and create a quanteda dictionary object.

In [None]:
# create quanteda dictionary object
EmojiNames <- paste0("emoji_",gsub(" ","",EmojiSentiments$name))
EmojiSentiment <- cbind.data.frame(EmojiNames,EmojiSentiments$sentiment,EmojiSentiments$polarity)
names(EmojiSentiment) <- c("word","sentiment","valence")
EmojiSentDict <- as.dictionary(EmojiSentiment[,1:2])

In [None]:
# tokenize the emoji-only column in our formatted dataframe
EmojiToks <- tokens(tolower(FormattedComments$Emoji))

In [None]:
# replace the emojis that appear in the dictionary with the corresponding sentiment scores
EmojiToksSent <- tokens_lookup(x = EmojiToks, dictionary = EmojiSentDict)

We now have a vector of emoji sentiment scores for each comment that we can use to analyze affective valence. But let's first check how many emojis we could and how many we could not assign a sentiment score to.

In [None]:
# total number of emojis in the dataframe
AllEmoji <- unlist(EmojiToksSent)
names(AllEmoji) <- NULL
AllNonNAEmoji <- AllEmoji[AllEmoji!="NA"]
length(AllNonNAEmoji)

In [None]:
# number of emojis that could be assigned a sentiment score
length(grep("0.", AllNonNAEmoji, fixed = TRUE)) # fixed = TRUE matches a literal "0." instead of "0" followed by any character

In [None]:
# number of emojis that could not be assigned a sentiment score
length(grep("emoji_",AllNonNAEmoji))

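According to these counts, no emoji in our data is left without a sentiment score. Keep in mind, though, that `tokens_lookup()` by default (`exclusive = TRUE`) drops tokens that have no match in the dictionary, so unmatched emojis would only show up in this check if the lookup were run with `exclusive = FALSE`. Now we need to compute an overall sentiment metric for each comment based only on the emojis.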

In [None]:
# only keep the assigned sentiment scores for the emoji vector
AllEmojiSentiments <- tokens_select(EmojiToksSent,EmojiSentiment$sentiment,"keep")
AllEmojiSentiments <- as.list(AllEmojiSentiments)

# define custom function to add up sentiment scores of emojis per comment
AddEmojiSentiments <- function(x){
 
 x <- sum(as.numeric(as.character(x)))
 return(x)
 
}

# apply the function to every comment that contains emojis (only those emojis that have a sentiment score will be used)
AdditiveEmojiSentiment <- lapply(AllEmojiSentiments,AddEmojiSentiments)
AdditiveEmojiSentiment[AdditiveEmojiSentiment == 0] <- NA
AdditiveEmojiSentiment <- unlist(AdditiveEmojiSentiment)
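
The summing step can be sketched on toy data as follows. The emoji scores in `toy_scores` are made up for illustration (they are not the values from the lexicon package), and the comments are represented as a plain list of emoji names rather than as quanteda tokens.

```r
# hypothetical emoji sentiment scores
toy_scores <- c(emoji_fire = 0.25, emoji_redheart = 0.5, emoji_cryingface = -0.25)

# hypothetical comments: the second one contains no emojis
toy_comments <- list(
  c("emoji_fire", "emoji_fire", "emoji_redheart"),
  character(0),
  "emoji_cryingface"
)

# sum the emoji scores per comment; comments without emojis get NA
toy_sums <- vapply(toy_comments, function(emojis) {
  if (length(emojis) == 0) return(NA_real_)
  sum(toy_scores[emojis], na.rm = TRUE)
}, numeric(1))
toy_sums
```

In contrast to the code above, this sketch assigns `NA` based on the number of emojis and not based on a sum of zero, so a comment whose emoji scores happen to cancel out would keep its score of 0. Which behavior is preferable depends on the analysis.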

Now let's plot the distribution of summed emoji sentiment per comment.

In [None]:
# plot histogram to check distribution of emoji sentiment scores
AES_df <- data.frame(AdditiveEmojiSentiment)
ggplot(AES_df, aes(x = AES_df[,1])) +
 geom_histogram(binwidth = 1) +
 labs(title = "Distribution of Summed Emoji Sentiment Scores by Comment", subtitle = "Schmoyoho - OH MY DAYUM ft. Daym Drops \nhttps://www.youtube.com/watch?v=DcJFdCmN98s") +
 xlab("Emoji Sentiment summed per Comment")

We can see that most emoji sentiment sum scores are neutral or slightly positive. However, there also are some slightly negative scores and a few very positive outliers. Let's have a look at these comments.

In [None]:
# show comments with negative emoji sum scores
EmojiNegComments <- FormattedComments[AdditiveEmojiSentiment < 0,2]
as.list(EmojiNegComments[is.na(EmojiNegComments) == F])

In [None]:
# show comments with overly positive emoji sum scores
EmojiPosComments <- FormattedComments[AdditiveEmojiSentiment > 20,2]
as.list(EmojiPosComments[is.na(EmojiPosComments) == F])

As we can see, especially the positive outliers seem to occur because the same emoji gets spammed multiple times in the same comment.

***

### Relationship between text and emoji sentiment

If emojis are used to underline the affective valence of what a comment author wants to express, there should be a positive correlation between text sentiment and emoji sentiment. Let's check if that is the case for our example video.

In [None]:
# correlation between additive emoji sentiment score and text sentiment score
cor(CommentSentiment,AdditiveEmojiSentiment,use="complete.obs")
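
Many comments have no emoji sentiment score, so most pairs of values are incomplete. The following sketch with made-up numbers (`toy_text` and `toy_emoji` are hypothetical) shows that `use = "complete.obs"` computes the correlation only from the pairs in which both values are present.

```r
# hypothetical sentiment scores with missing values
toy_text  <- c(2, -1, 0, 3, NA)
toy_emoji <- c(1, NA, 0, 2, 1)

# correlation based on complete pairs only
toy_cor <- cor(toy_text, toy_emoji, use = "complete.obs")

# the same result, with the complete pairs selected by hand
toy_ok <- complete.cases(toy_text, toy_emoji)
toy_cor_manual <- cor(toy_text[toy_ok], toy_emoji[toy_ok])

c(n_complete = sum(toy_ok), correlation = toy_cor)
```

It is worth checking how many complete pairs remain, because a correlation based on only a few comments is not very informative.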

In [None]:
# plot the relationship
TextEmojiRel <- data.frame(CommentSentiment,AdditiveEmojiSentiment)
ggplot(TextEmojiRel, aes(x = CommentSentiment, y = AdditiveEmojiSentiment)) + geom_point(shape = 1) + 
 labs(title = "Scatterplot for Comment Sentiment and Emoji Sentiment", subtitle = "Schmoyoho - OH MY DAYUM ft. Daym Drops \nhttps://www.youtube.com/watch?v=DcJFdCmN98s") +
 scale_x_continuous(limits=c(-15,15))

As we can see, there seems to be no relationship between the sentiment scores of the text and the sentiment
of the used emojis. This can have multiple reasons:
 - Comments that score very high (positive) on emoji sentiment typically contain very little text.
 - Comments that score very low (negative) on emoji sentiment typically contain very little text.
 - Bag-of-Words/-Emojis sentiment analysis is limited - there is a lot of room for error in both metrics.
 - Most comment text sentiments and emoji sentiments are neutral.
 - Emojis are very much context dependent. However, we only consider a single sentiment score for each emoji.
 - High positive scores on the emoji sentiment are likely due to people spamming the same emoji a lot.

We can try to make our metrics less dependent on the amount of emojis or words in the comments by comparing average sentiment per used word and per used emoji for each comment.
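
A toy calculation shows why averaging helps against emoji spam. The score `toy_score` is a made-up sentiment value for a single emoji.

```r
# hypothetical sentiment score of one emoji
toy_score <- 0.5

# one comment repeats the emoji 40 times, another uses it once
spam_comment <- rep(toy_score, 40)
single_comment <- toy_score

# the sum grows with the number of repetitions, the mean does not
c(sum_spam = sum(spam_comment), sum_single = sum(single_comment),
  mean_spam = mean(spam_comment), mean_single = mean(single_comment))
```

With the summed metric, the spam comment looks 40 times as positive as the single-emoji comment. With the averaged metric, both comments receive the same score.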

In [None]:
# average sentiment score per word for each comment
WordsInComments <- sapply(FormattedComments$TextEmojiDeleted,function(x){A <- strsplit(x," ");return(length(A[[1]]))})
names(WordsInComments) <- NULL

# compute average sentiment score per word instead of using the overall sum
AverageSentimentPerWord <- CommentSentiment/WordsInComments

# save a copy of the full vector for later use
FullAverageSentimentPerWord <- AverageSentimentPerWord

# exclude comments that do not have any words in them
AverageSentimentPerWord <- AverageSentimentPerWord[is.nan(AverageSentimentPerWord) == FALSE]
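
The reason for the `is.nan()` filter is that comments without any words lead to a division of zero by zero. A minimal sketch with made-up values (`toy_comments` and `toy_sentiment` are hypothetical):

```r
# hypothetical comments (the second one is empty after emoji removal)
toy_comments <- c("great great", "", "bad day today")

# hypothetical summed sentiment scores for these comments
toy_sentiment <- c(6, 0, -3)

# number of words per comment (an empty string contains zero words)
toy_words <- lengths(strsplit(toy_comments, " "))

# average sentiment per word; 0 / 0 yields NaN for the empty comment
toy_avg <- toy_sentiment / toy_words
toy_avg

# exclude the comments without words
toy_avg[!is.nan(toy_avg)]
```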

Let's see if our assessment is different now that we averaged sentiment scores by number of words.

In [None]:
# build helper dataframe to distinguish between positive, negative and neutral comments
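# the labels are derived from the numeric scores directly, which avoids comparing strings with numbers
Desc <- ifelse(AverageSentimentPerWord > 0, "positive",
 ifelse(AverageSentimentPerWord < 0, "negative", "neutral"))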
df <- data.frame(FormattedComments$TextEmojiDeleted[is.nan(FullAverageSentimentPerWord) == FALSE],AverageSentimentPerWord,Desc)
colnames(df) <- c("Comment","Sentiment","Valence")

# display amount of positive, negative, and neutral comments
ggplot(data=df, aes(x=Valence, fill = Valence)) +
 geom_bar(stat='count') + 
 labs(title = "Average Word Sentiment per Comment", subtitle = "Schmoyoho - OH MY DAYUM ft. Daym Drops \nhttps://www.youtube.com/watch?v=DcJFdCmN98s")
 

In [None]:
# distribution of comment sentiments
ggplot(df, aes(x=Sentiment)) +
 geom_histogram(binwidth = 1) +
 geom_vline(aes(xintercept=mean(Sentiment)),
 color="black", linetype="dashed", size=1) +
 scale_x_continuous(limits=c(-5,5)) + 
 labs(title = "Average Word Sentiment per Comment", subtitle = "Schmoyoho - OH MY DAYUM ft. Daym Drops \nhttps://www.youtube.com/watch?v=DcJFdCmN98s")
 

In [None]:
# display most negative/positive comment(s) (by average sentiment score per word)
as.list(as.character(df$Comment[AverageSentimentPerWord == min(AverageSentimentPerWord)]))
as.list(as.character(df$Comment[AverageSentimentPerWord == max(AverageSentimentPerWord)]))

We now have very short comments with very few extreme words as the extreme ends of the spectrum. Let's have a look at the emojis as well.

In [None]:
## compute average emoji sentiment per comment

# define custom function to average sentiment scores of emojis per comment
AverageEmojiSentiments <- function(x){
 
 x <- mean(as.numeric(unlist(x)))
 return(x)
 
}

# Apply the function to every comment that contains emojis (only those emojis that have a sentiment score will be used)
AverageEmojiSentiment <- lapply(AllEmojiSentiments,AverageEmojiSentiments)

# save a copy of the vector for later use
FullAverageEmojiSentiment <- unlist(AverageEmojiSentiment)

AverageEmojiSentiment[AverageEmojiSentiment == 0] <- NA
AverageEmojiSentiment <- unlist(AverageEmojiSentiment)

# exclude comments that do not contain emojis
AverageEmojiSentiment <- AverageEmojiSentiment[is.nan(AverageEmojiSentiment) == FALSE]

In [None]:
# plot histogram to check distribution of emoji sentiment scores
AvES_df <- data.frame(AverageEmojiSentiment)
ggplot(AvES_df, aes(x = AvES_df[,1])) +
 geom_histogram(binwidth = 0.2) +
 labs(title = "Distribution of Averaged Emoji Sentiment Scores by Comment", subtitle = "Schmoyoho - OH MY DAYUM ft. Daym Drops \nhttps://www.youtube.com/watch?v=DcJFdCmN98s") +
 xlab("Emoji Sentiment averaged per Comment") 

Now that we have averaged both sentiment metrics, let's check whether this changed something in their bivariate distribution.

In [None]:
# correlation between averaged emoji sentiment score and averaged text sentiment score
cor(FullAverageSentimentPerWord,FullAverageEmojiSentiment,use="complete.obs")

In [None]:
# plot the relationship
TextEmojiRel <- data.frame(FullAverageSentimentPerWord,FullAverageEmojiSentiment)
ggplot(TextEmojiRel, aes(x = FullAverageSentimentPerWord, y = FullAverageEmojiSentiment)) + geom_point(shape = 1) +
 labs(title = "Averaged Sentiment Scores", subtitle = "Schmoyoho - OH MY DAYUM ft. Daym Drops \nhttps://www.youtube.com/watch?v=DcJFdCmN98s")

We do obtain a larger positive correlation with the averaged measures. However, visual inspection reveals that there is no meaningful linear relationship. The data are clustered around one vertical line and multiple horizontal lines. This is likely in large part due to:

 - skewed distribution of number of emojis per comment and types of emojis used (e.g., using the ROFL emoji exactly once is by far the most common case for this particular video)
 - most common average sentiment per word is zero

## Conclusion

 - including emojis in the analysis of text data from the internet/social media data can provide valuable additional insights
 - simple dictionary-based sentiment analysis of emojis has clear limitations
 - it might be interesting to also use emojis in topic models or network analyses of text