{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Analyzing Youtube Comments in R" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we will interactively walk trough the analysis of YouTube comments. To run a code chunk, simply click on it and hit the \"Run\" button from the menu above. \n", "\n", "If you want to know (more about) what Jupyter notebooks are and how to use them, see, e.g., https://medium.com/ibm-data-science-experience/back-to-basics-jupyter-notebooks-dfcdc19c54bc\n", "\n", "For instructions on how to download and parse the data that we will be using here, please refer to the R script file in the GitHub repository associated with this notebook. The script and notebook are part of an ongoing research project (see https://www.researchgate.net/project/Methods-and-Tools-for-Automatic-Sampling-and-Analysis-of-YouTube-Comments) and will be subject to change. \n", "\n", "If you use substantive parts of this notebook or the accompanying script as part of your own research, please cite it in the following way:\n", "\n", "Kohne, J., Breuer, J., & Mohseni, M. R. (2019). Methods and Tools for Automatic Sampling and Analysis of YouTube User Comments: https://doi.org/10.17605/OSF.IO/HQSXE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installing & loading packages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this Notebook, we have already pre-installed all necessary packages that you need to run our code. These packages are:\n", "\n", " - devtools\n", " - tm\n", " - quanteda\n", " - tuber\n", " - qdapRegex\n", " - rlang\n", " - purrr\n", " - ggplot2\n", " - syuzhet\n", " - lexicon\n", " \n", "in addition to the above CRAN packages, we also included two packages for handling emojis that are not on CRAN (yet) and have to be installed directly through GitHub:\n", "\n", " - emo\n", " - emoGG\n", "\n", "If you want to replicate our analysis locally or want to adapt it, you will need to install those packages on your machine as well. To do so, you can download the file \"install.R\" in the folder called \"binder\" from this notebook and run it on your local machine before your analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NOTE:** The binder server on which this notebook runs on is still using an older version of R (3.4.4),\n", "this is why we need to install an older version of the statnet.common package, which the quanteda package relies on. If you want to replicate this analysis locally and are using the newest version of R, you can skip the statnet.common installation and simply install the Quanteda package." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# loading CRAN libraries\n", "library(devtools)\n", "library(tm)\n", "library(quanteda)\n", "library(qdapRegex)\n", "library(rlang)\n", "library(purrr)\n", "library(ggplot2)\n", "library(syuzhet)\n", "library(lexicon)\n", "\n", "# loading Github libraries\n", "library(emo)\n", "library(emoGG)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting options" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because we are working with text data, we need to set the following option, so that text is not interpreted as categorical variables by R" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "options(stringAsFactors = FALSE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we will work with a saved version of an already parsed dataframe. Please check the R script file\n", "in the associated GitHub repository for a walktrough of how to extract comments from YouTube and parse them. We will be using the comments of this video, extracted in February 2019:\n", "\n", "https://www.youtube.com/watch?v=DcJFdCmN98s " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loading prepared dataset - the name of this dataframe is \"FormattedComments\"\n", "load(\"./Data/ParsedCommentsUTF8.Rdata\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sorting comments by date\n", "FormattedComments <- FormattedComments[order(FormattedComments$Published),]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analyzing YouTube comments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's have a look at an excerpt of the data to see how it is structured. We will display the\n", "first 10 rows of the dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# view first 10 rows of dataframe\n", "head(FormattedComments,10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this dataframe, we already parsed the information extracted with the tuber package into formats that make the data easily usable. For example, we can use the DateTime column to see the number of new comments\n", "over time without much formatting." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create helper dataframe\n", "CommentsCounter <- rep(1,dim(FormattedComments)[1])\n", "CounterFrame <- data.frame(CommentsCounter,unlist(FormattedComments[,8]))\n", "colnames(CounterFrame) <- c(\"CommentCounter\",\"DateTime\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# bin by week\n", "CounterFrame$DateTime <- as.Date(cut(CounterFrame$DateTime, breaks = \"week\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# compute percentiles\n", "PercTimes <- round(quantile(cumsum(CounterFrame$CommentCounter), probs = c(0.5, 0.75, 0.9, 0.99)))\n", "CounterFrame$DateTime[PercTimes]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plot\n", "ggplot(CounterFrame,aes(x=DateTime,y=CommentCounter)) +\n", " stat_summary(fun.y=sum,geom=\"bar\") +\n", " scale_x_date() +\n", " labs(title = \"Most frequent words in comments\", subtitle = \"Schmoyoho - OH MY DAYUM ft. Daym Drops \\nhttps://www.youtube.com/watch?v=DcJFdCmN98s\") +\n", " geom_vline(xintercept = CounterFrame$DateTime[PercTimes],linetype = \"dashed\", colour = \"red\")+\n", " geom_text(aes(x = as.Date(CounterFrame$DateTime[PercTimes][1]) , label = \"50%\", y = 3500), colour=\"red\", angle=90, vjust = 1.2) +\n", " geom_text(aes(x = as.Date(CounterFrame$DateTime[PercTimes][2]) , label = \"75%\", y = 3500), colour=\"red\", angle=90, vjust = 1.2) +\n", " geom_text(aes(x = as.Date(CounterFrame$DateTime[PercTimes][3]) , label = \"90%\", y = 3500), colour=\"red\", angle=90, vjust = 1.2) +\n", " geom_text(aes(x = as.Date(CounterFrame$DateTime[PercTimes][4]) , label = \"99%\", y = 3500), colour=\"red\", angle=90, vjust = 1.2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic frequency analysis for text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section, we give a brief outline of text analysis for YouTube Comments.\n", "This part is largely based on this tutorial:\n", "\n", "https://docs.quanteda.io/articles/pkgdown/examples/plotting.html\n", "\n", "We use the dataframe column without the emojis for the textual analysis here.\n", "\n", "First of all, we need to remove new line commands from comment texts." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Removing newline characters from comments\n", "FormattedComments$TextEmojiDeleted <- gsub(FormattedComments$TextEmojiDeleted, pattern = \"\\\\\\n\", replacement = \" \")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we need to tokenize the comments (i.e., split them up into individual words, seperated by a space).\n", "This step also simplifies the text by:\n", "- removing all numbers\n", "- removing all punctuation\n", "- removing all non-character symbols\n", "- removing all hyphens\n", "- removing all URLs " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Tokenize the comments\n", "# for more information and options check:\n", "# https://www.rdocumentation.org/packages/quanteda/versions/1.4.0/topics/tokens\n", "\n", "toks <- tokens(char_tolower(FormattedComments$TextEmojiDeleted),\n", " remove_numbers = TRUE,\n", " remove_punct = TRUE,\n", " remove_separators = TRUE,\n", " remove_symbols = TRUE,\n", " remove_hyphens = TRUE,\n", " remove_url = TRUE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we build a document-term matrix and remove stopwords. For more information see:\n", "\n", "https://en.wikipedia.org/wiki/Document-term_matrix\n", "\n", "https://en.wikipedia.org/wiki/Stop_words)\n", "\n", "Stopwords are very frequent words that appear in almost all texts (e.g. \"a\",\"but\",\"it\").\n", "These words occur with about the same frequency in all kinds of texts and are, hence, not very informative." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create document-term frequency matrix\n", "commentsDfm <- dfm(toks, remove = quanteda::stopwords(\"english\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have a matrix where each column represents a token that occurs at least once in the collcetion of comments and each row represents a comment. If a token is contained in a comment, the respective cell in the matrix contains a 1 and if a token is not contained in a comment the respective cell will contain a 0.\n", "\n", "We can use this document-term matrix to visualize the occurance of tokens" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Display the n most frequent tokens\n", "TermFreq <- textstat_frequency(commentsDfm)\n", "head(TermFreq, n = 20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After inspecting the most frequent terms, we might want to exclude certain\n", "terms that are not informative for us (e.g. the word \"video\") or are\n", "artifacts of online communication (e.g. xd or d as leftovers of ASCII emojis)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This is just an example, you can (and should) create your own list for each video\n", "CustomStops <- c(\"video\",\"oh\",\"d\",\"now\",\"get\",\"go\",\"xd\", \"youtube\", \"lol\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We can create another document-frequency matrix that excludes the custom stopwords that we just defined\n", "commentsDfm <- dfm(toks, remove = c(quanteda::stopwords(\"english\"),CustomStops))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Recalculate and display the n most frequent tokens\n", "TermFreq <- textstat_frequency(commentsDfm)\n", "head(TermFreq, n = 20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we can visualize the frequency of tokens with some plots. First of all, lets check the overall frequency of terms across all comments." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sort by reverse frequency order (i.e., from most to least frequent)\n", "TermFreq$feature <- with(TermFreq, reorder(feature, -frequency))\n", "\n", "# Plot frequency of 50 most common words\n", "ggplot(head(TermFreq, n = 50), aes(x = feature, y = frequency)) + # you can change n to choose how many words are plotted\n", " geom_point() +\n", " theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust= 0.5)) +\n", " labs(title = \"Most frequent words in comments\", subtitle = \"Schmoyoho - OH MY DAYUM ft. Daym Drops \\nhttps://www.youtube.com/watch?v=DcJFdCmN98s\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the above method, we're only counting the overall occurance across all comments. This might be biased\n", "by some users spamming the same tokens many times in the same comment, while other comments might not contain\n", "the term at all. To see whether this is a problem in our data, lets plot the number of comments in which each\n", "token occurs at least once. This is typically called the document frequency." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sort by reverse document frequency order (i.e., from most to least frequent)\n", "TermFreq$feature <- with(TermFreq, reorder(feature, -docfreq))\n", "\n", "# plot terms that appear in the highest number of comments\n", "ggplot(head(TermFreq, n = 50), aes(x = feature, y = docfreq)) + # you can change n to choose how many words are plotted\n", " geom_point() +\n", " theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust= 0.5)) +\n", " labs(title = \"Number of comments that each token is contained in\", subtitle = \"Schmoyoho - OH MY DAYUM ft. Daym Drops \\nhttps://www.youtube.com/watch?v=DcJFdCmN98s\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By manual inspection, we do not see any extreme deviations, even though this is a completely subjective\n", "assessment. Whether you want to rely on overall frequency or on document frequency for your analysis depends\n", "on your research question, your data, and your personal assessment. For most examples in this notebook, we will continue to use the overall frequency." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use our document-term frequency matrix to generate a wordcloud of the most common tokens." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Setting a random seed to make the wordcloud reproducible (this can be any number)\n", "set.seed(12345)\n", "\n", "# Creating wordcloud\n", "textplot_wordcloud(dfm_select(commentsDfm, min_nchar=1),\n", " random_order=FALSE,\n", " max_words=100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sentiment Analysis for Text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to compute sentiment scores per comment. This is done by matching the text strings with a dictionary of word sentiments, and adding them up per document (in our case comments). Depending on the type of content you want to analyze, a different sentiment dictionary might be suitable. For our example, we decided to use the AFINN dictionary: \n", "\n", "http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010\n", "\n", "For more options, check:\n", "\n", "https://www.rdocumentation.org/packages/syuzhet/versions/1.0.4/topics/get_sentiment " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# compute sentiment scores\n", "CommentSentiment <- get_sentiment(FormattedComments$TextEmojiDeleted, method = \"afinn\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First of all, let's get an overview of the sentiment scores per comment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# summary statistics for sentiment scores per comment\n", "summary(CommentSentiment)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# display comments with a sentiment score below x\n", "x <- -15\n", "as.list(FormattedComments$TextEmojiDeleted[CommentSentiment < x])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# disyplay comments with a sentiment score above x\n", "x <- 15\n", "as.list(FormattedComments$TextEmojiDeleted[CommentSentiment > x])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# display most negative/positive comment\n", "FormattedComments$TextEmojiDeleted[CommentSentiment == min(CommentSentiment)]\n", "FormattedComments$TextEmojiDeleted[CommentSentiment == max(CommentSentiment)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By manual inspection, our approach seems to have worked fine with comments having a negative score being negative\n", "and comments with a positive score being postive. However, just assigning sentiments to words and then summing\n", "sentiments per comment (bag-of-words approach) can have some pitfalls. Consider the following cases, for example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# display specific comment\n", "as.list(FormattedComments$TextEmojiDeleted[CommentSentiment < -10])[5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As humans, we can see that this comment is meant to be positive. However, the sentiment sum for the comment is negative, mostly due to the strong use of swearwords:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fucking hilarious! And that guy could either do commercials or be an actor, I\\'ve never, in my entire life, heard anyone express themselves that strongly about a fucking hamburger. 
And now all I know is I have never eaten one of those but damned if I won\'t have it on my list of shit to do tomorrow! Hell of a job by schmoyoho as well, whoever said this should be a commercial hit it on the head." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By contrast, the following negative comment with very civil language is labelled with a positive sentiment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Display specific comment\n", "as.list(FormattedComments$TextEmojiDeleted[CommentSentiment > 10])[2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As humans, we can see that this comment is meant to be negative. However, the sentiment sum for the comment is positive, mostly due to the negated positive words." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Schmoyoho, we\'re not really entertained by you anymore. You\'re sort of like Dane Cook. At first we thought, \"Wow! Get a load of this channel! It\'s funny!\" But then we realized after far too long, \"Wow, these guys are just a one trick pony! There is absolutely nothing I like about these people!\" You\'ve run your course. The shenanigans, the \"songifies\".. we get it. It\'s just not that funny man. We don\'t really like you. So please, for your own sake, go and actually try to make some real friends." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a general note of caution: if you are analyzing text with sentiment dictionaries, you should always be aware of the issues outlined above, manually inspect your texts, and be careful when interpreting your results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizing comment sentiments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even though sentiment analysis using sums of word-dictionary mappings per comment is not perfect, it might be interesting to get an overview of the distribution of comment sentiments. Let's visualize it!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# build helper dataframe to distinguish between positive, negative and neutral comments\n", "Desc <- CommentSentiment\n", "Desc[Desc > 0] <- \"positive\"\n", "Desc[Desc < 0] <- \"negative\"\n", "Desc[Desc == 0] <- \"neutral\"\n", "df <- data.frame(FormattedComments$TextEmojiDeleted,CommentSentiment,Desc)\n", "colnames(df) <- c(\"Comment\",\"Sentiment\",\"Valence\")\n", "\n", "# display number of positive, negative, and neutral comments\n", "ggplot(data=df, aes(x=Valence, fill = Valence)) +\n", " geom_bar(stat='count') +\n", " labs(title = \"Sentiment Categories\", subtitle = \"Schmoyoho - OH MY DAYUM ft. Daym Drops \\nhttps://www.youtube.com/watch?v=DcJFdCmN98s\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# distribution of comment sentiments\n", "ggplot(df, aes(x=Sentiment)) +\n", " geom_histogram(binwidth = 1) +\n", " geom_vline(aes(xintercept=mean(Sentiment)),\n", " color=\"black\", linetype=\"dashed\", size=1) +\n", " # note: these axis limits exclude a few comments with more extreme scores from the plot\n", " scale_x_continuous(limits=c(-10,10))+\n", " labs(title = \"Distribution of Comment Sentiment Scores\", subtitle = \"Schmoyoho - OH MY DAYUM ft. 
Daym Drops \\nhttps://www.youtube.com/watch?v=DcJFdCmN98s\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that most comments seem to be neutral and we have more comments with positive sentiment than with negative sentiment for this video." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic frequency analysis for emojis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Online communication is different from more traditional forms of written communication in many ways. One of those differences is the use of emojis to express concepts and emotions. In many textual analyses, emojis are not used at all and simply discarded. In this notebook, we will have a look at the use of emojis in YouTube comments. First of all, we need to make the emojis usable for further analyses. This has largely been done in the parsing step (see the R script in the GitHub repo for scraping and parsing the data), so we only need to do some minor preparation here.\n", "\n", "**NOTE:** There are persistent encoding issues for emojis in R on Windows. We tested the code in this notebook on several Windows machines and it should work there as well. If the code does not work for you offline, Windows encoding problems are a likely culprit. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# First, we need to define missing values correctly\n", "FormattedComments$Emoji[FormattedComments$Emoji == \"NA\"] <- NA\n", "\n", "# next, we remove the trailing space at the end of the string\n", "FormattedComments$Emoji <- substr(FormattedComments$Emoji, 1, nchar(FormattedComments$Emoji)-1)\n", "\n", "# then we tokenize emoji descriptions (important for comments that contain more than one emoji)\n", "EmojiToks <- tokens(FormattedComments$Emoji)\n", "\n", "# afterwards, we create an emoji frequency matrix, excluding \"NA\" as a term\n", "EmojiDfm <- dfm(EmojiToks, remove = \"NA\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have a \"document-emoji frequency matrix\" and can treat the emojis just like we treated the other tokens in our previous text analyses. Let's check out the most frequent emojis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# list the most frequent emojis in the comments\n", "EmojiFreq <- textstat_frequency(EmojiDfm)\n", "head(EmojiFreq, n = 25)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize the emoji frequencies." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sort by reverse frequency order (i.e., from most to least frequent)\n", "EmojiFreq$feature <- with(EmojiFreq, reorder(feature, -frequency))\n", "\n", "# plot\n", "ggplot(head(EmojiFreq, n = 50), aes(x = feature, y = frequency)) + # you can change n to choose how many Emojis are plotted \n", " geom_point() + \n", " theme(axis.text.x = element_text(angle = 90, hjust = 1,vjust=0.5)) +\n", " labs(title = \"Most frequent Emojis\", subtitle = \"Schmoyoho - OH MY DAYUM ft. Daym Drops \\nhttps://www.youtube.com/watch?v=DcJFdCmN98s\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interestingly, just as words do, emojis seem to follow a Zipf-like distribution:\n", "\n", "https://en.wikipedia.org/wiki/Zipf%27s_law\n", "\n", "However, our plot still looks a bit bland. 
Let's make it look nicer by mapping emojis to the respective points in the plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create mappings to display scatterplot points as emojis\n", "mapping1 <- geom_emoji(data = EmojiFreq[EmojiFreq$feature == \"emoji_facewithtearsofjoy\",], aes(feature,frequency), emoji = \"1f602\")\n", "mapping2 <- geom_emoji(data = EmojiFreq[EmojiFreq$feature == \"emoji_hamburger\",], aes(feature,frequency), emoji = \"1f354\")\n", "mapping3 <- geom_emoji(data = EmojiFreq[EmojiFreq$feature == \"emoji_frenchfries\",], aes(feature,frequency), emoji = \"1f35f\")\n", "mapping4 <- geom_emoji(data = EmojiFreq[EmojiFreq$feature == \"emoji_smilingfacewithsunglasses\",], aes(feature,frequency), emoji = \"1f60e\")\n", "mapping5 <- geom_emoji(data = EmojiFreq[EmojiFreq$feature == \"emoji_smilingface\",], aes(feature,frequency), emoji = \"263a\")\n", "mapping6 <- geom_emoji(data = EmojiFreq[EmojiFreq$feature == \"emoji_fire\",], aes(feature,frequency), emoji = \"1f525\")\n", "mapping7 <- geom_emoji(data = EmojiFreq[EmojiFreq$feature == \"emoji_loudlycryingface\",], aes(feature,frequency), emoji = \"1f62d\")\n", "mapping8 <- geom_emoji(data = EmojiFreq[EmojiFreq$feature == \"emoji_smilingfacewithheart-eyes\",], aes(feature,frequency), emoji = \"1f60d\")\n", "mapping9 <- geom_emoji(data = EmojiFreq[EmojiFreq$feature == \"emoji_rollingonthefloorlaughing\",], aes(feature,frequency), emoji = \"1f923\")\n", "mapping10 <- geom_emoji(data = EmojiFreq[EmojiFreq$feature == \"emoji_redheart\",], aes(feature,frequency), emoji = \"2764\")\n", "\n", "# sort by reverse frequency order\n", "EmojiFreq$feature <- with(EmojiFreq, reorder(feature, -frequency))\n", "\n", "# plot 10 most common emojis using their graphical representation as points in the scatterplot\n", "ggplot(head(EmojiFreq, n = 10), aes(x = feature, y = frequency)) +\n", " geom_point() + \n", " theme(axis.text.x = element_text(angle = 90, hjust = 1,vjust=0.5)) +\n", " labs(title = \"10 Most Frequent Emojis\", subtitle = \"Schmoyoho - OH MY DAYUM ft. Daym Drops \\nhttps://www.youtube.com/watch?v=DcJFdCmN98s\") +\n", " mapping1 +\n", " mapping2 +\n", " mapping3 +\n", " mapping4 +\n", " mapping5 +\n", " mapping6 +\n", " mapping7 +\n", " mapping8 +\n", " mapping9 +\n", " mapping10\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just like with the text tokens, it might be that some comments contain a specific emoji numerous times. For this reason, we will also check the number of comments that each emoji is contained in." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sort by reverse document frequency order (i.e., from most to least frequent)\n", "EmojiFreq$feature <- with(EmojiFreq, reorder(feature, -docfreq))\n", "\n", "# plot\n", "ggplot(head(EmojiFreq,n = 50), aes(x = feature, y = docfreq)) + # you can change n to choose how many Emojis are plotted \n", " geom_point() + \n", " theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +\n", " labs(title = \"Emojis contained in most Comments\", subtitle = \"Schmoyoho - OH MY DAYUM ft. Daym Drops \\nhttps://www.youtube.com/watch?v=DcJFdCmN98s\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, we can use emojis as points to make this plot look cooler."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create a new frame order by document occurance frequenc rather than overall frequency\n", "NewOrder <- EmojiFreq[order(-EmojiFreq$docfreq),]\n", "\n", "# create mappings to display scatterplot points as emojis\n", "mapping1 <- geom_emoji(data = NewOrder[NewOrder$feature == \"emoji_facewithtearsofjoy\",], aes(feature,docfreq), emoji = \"1f602\")\n", "mapping2 <- geom_emoji(data = NewOrder[NewOrder$feature == \"emoji_hamburger\",], aes(feature,docfreq), emoji = \"1f354\")\n", "mapping3 <- geom_emoji(data = NewOrder[NewOrder$feature == \"emoji_loudlycryingface\",], aes(feature,docfreq), emoji = \"1f62d\")\n", "mapping4 <- geom_emoji(data = NewOrder[NewOrder$feature == \"emoji_fire\",], aes(feature,docfreq), emoji = \"1f525\")\n", "mapping5 <- geom_emoji(data = NewOrder[NewOrder$feature == \"emoji_redheart\",], aes(feature,docfreq), emoji = \"2764\")\n", "mapping6 <- geom_emoji(data = NewOrder[NewOrder$feature == \"emoji_heartsuit\",], aes(feature,docfreq), emoji = \"2665\")\n", "mapping7 <- geom_emoji(data = NewOrder[NewOrder$feature == \"emoji_frenchfries\",], aes(feature,docfreq), emoji = \"1f35f\")\n", "mapping8 <- geom_emoji(data = NewOrder[NewOrder$feature == \"emoji_rollingonthefloorlaughing\",], aes(feature,docfreq), emoji = \"1f923\")\n", "mapping9 <- geom_emoji(data = NewOrder[NewOrder$feature == \"emoji_thumbsup\",], aes(feature,docfreq), emoji = \"1f44d\")\n", "mapping10 <- geom_emoji(data = NewOrder[NewOrder$feature == \"emoji_smilingfacewithheart-eyes\",], aes(feature,docfreq), emoji = \"1f60d\")\n", "\n", "# plot 10 emojis that most comments mention at least once\n", "ggplot(NewOrder[1:10], aes(x = feature, y = docfreq)) +\n", " geom_point() + \n", " theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5)) +\n", " labs(title = \"10 Emojis contained in most Comments\", subtitle = \"Schmoyoho - OH MY DAYUM ft. Daym Drops \\nhttps://www.youtube.com/watch?v=DcJFdCmN98s\")+\n", " mapping1 +\n", " mapping2 +\n", " mapping3 +\n", " mapping4 +\n", " mapping5 +\n", " mapping6 +\n", " mapping7 +\n", " mapping8 +\n", " mapping9 +\n", " mapping10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Emoji sentiment analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Emojis are often used to confer emotions (hence the name), so they might be valuable addition\n", "to assess the sentiment of a comment. To do this, we need a dictionary that maps sentiment scores to specific emojis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import emoji dictionary (from the lexicon package)\n", "EmojiSentiments <- emojis_sentiment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unfortunately, the dictionary only contains 734 different emojis. Those were the most frequently used ones when the study on which the dictionars is based was conducted.\n", "\n", "You can view the emoji sentiment scores online here:\n", "\n", "http://kt.ijs.si/data/Emoji_sentiment_ranking/index.html\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have to match the sentiment scores to our descriptions of the emojis, and create a quanteda dictionary object." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create quanteda dictionary object\n", "EmojiNames <- paste0(\"emoji_\",gsub(\" \",\"\",EmojiSentiments$name))\n", "EmojiSentiment <- cbind.data.frame(EmojiNames,EmojiSentiments$sentiment,EmojiSentiments$polarity)\n", "names(EmojiSentiment) <- c(\"word\",\"sentiment\",\"valence\")\n", "EmojiSentDict <- as.dictionary(EmojiSentiment[,1:2])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# tokenize the emoji-only column in our formatted dataframe\n", "EmojiToks <- tokens(tolower(FormattedComments$Emoji))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# replace the emojis that appear in the dictionary with the corresponding sentiment scores\n", "EmojiToksSent <- tokens_lookup(x = EmojiToks, dictionary = EmojiSentDict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have a vector of emoji sentiment scores for each column that we can use to analyze affective valence. But let's first check how many emojis we could and how many we could not assign a sentiment score to." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# total number of emojis in the dataframe\n", "AllEmoji <- unlist(EmojiToksSent)\n", "names(AllEmoji) <- NULL\n", "AllNonNAEmoji <- AllEmoji[AllEmoji!=\"NA\"]\n", "length(AllNonNAEmoji)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# number of emojis that could be assigned a sentiment score\n", "length(grep(\"0.\",AllNonNAEmoji))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# number of emojis that could not be assigned a sentiment score\n", "length(grep(\"emoji_\",AllNonNAEmoji))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could assign sentiment to all emojis in our data! Nice! Now we need to compute an overall metric for sentiment of each comment based only on the emojis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# only keep the assigned sentiment scores for the emoji vector\n", "AllEmojiSentiments <- tokens_select(EmojiToksSent,EmojiSentiment$sentiment,\"keep\")\n", "AllEmojiSentiments <- as.list(AllEmojiSentiments)\n", "\n", "# define custom function to add up sentiment scores of emojis per comment\n", "AddEmojiSentiments <- function(x){\n", " \n", " x <- sum(as.numeric(as.character(x)))\n", " return(x)\n", " \n", "}\n", "\n", "# apply the function to every comment that contains emojis (only those emojis that have a sentiment score will be used)\n", "AdditiveEmojiSentiment <- lapply(AllEmojiSentiments,AddEmojiSentiments)\n", "AdditiveEmojiSentiment[AdditiveEmojiSentiment == 0] <- NA\n", "AdditiveEmojiSentiment <- unlist(AdditiveEmojiSentiment)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's plot the distribution of summed emoji sentiment per comment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plot histogram to check distribution of emoji sentiment scores\n", "AES_df <- data.frame(AdditiveEmojiSentiment)\n", "ggplot(AES_df, aes(x = AES_df[,1])) +\n", " geom_histogram(binwidth = 1) +\n", " labs(title = \"Distribution of Summed Emoji Sentiment Scores by Comment\", subtitle = \"Schmoyoho - OH MY DAYUM ft. 
Daym Drops \\nhttps://www.youtube.com/watch?v=DcJFdCmN98s\") +\n", " xlab(\"Emoji Sentiment summed per Comment\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that most emoji sentiment sum scores are neutral or slightly positive. However, there are also some slightly negative scores and a few very positive outliers. Let's have a look at these comments." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# show comments with negative emoji sum scores\n", "EmojiNegComments <- FormattedComments[AdditiveEmojiSentiment < 0,2]\n", "as.list(EmojiNegComments[is.na(EmojiNegComments) == F])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# show comments with overly positive emoji sum scores\n", "EmojiPosComments <- FormattedComments[AdditiveEmojiSentiment > 20,2]\n", "as.list(EmojiPosComments[is.na(EmojiPosComments) == F])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, especially the positive outliers seem to occur because the same emoji gets spammed multiple times in the same comment." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Relationship between text and emoji sentiment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If emojis are used to underline the affective valence of what a comment author wants to express, there should be a positive correlation between text sentiment and emoji sentiment. Let's check if that is the case for our example video." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# correlation between additive emoji sentiment score and text sentiment score\n", "cor(CommentSentiment,AdditiveEmojiSentiment,use=\"complete.obs\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plot the relationship\n", "TextEmojiRel <- data.frame(CommentSentiment,AdditiveEmojiSentiment)\n", "ggplot(TextEmojiRel, aes(x = CommentSentiment, y = AdditiveEmojiSentiment)) + geom_point(shape = 1) + \n", " labs(title = \"Scatterplot for Comment Sentiment and Emoji Sentiment\", subtitle = \"Schmoyoho - OH MY DAYUM ft. Daym Drops \\nhttps://www.youtube.com/watch?v=DcJFdCmN98s\") +\n", " scale_x_continuous(limits=c(-15,15))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, there seems to be no relationship between the sentiment scores of the text and the sentiment\n", "of the used emojis. This can have multiple reasons:\n", " - Comments that score very high (positive) on emoji sentiment typically contain very little text.\n", " - Comments that score very low (negative) on emoji sentiment typically contain very little text.\n", " - Bag-of-Words/-Emojis sentiment analysis is limited - there is a lot of room for error in both metrics.\n", " - Most comment text sentiments and emoji sentiments are neutral.\n", " - Emojis are very much context dependent. However, we only consider a single sentiment score for each emoji.\n", " - High positive scores on the emoji sentiment are likely due to people spamming the same emoji a lot." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can try to make our metrics less dependent on the number of emojis or words in the comments by comparing average sentiment per used word and per used emoji for each comment."
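] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before doing that, we can quickly check the spamming explanation by counting how many emojis each comment contains, using the emoji document-term matrix created earlier (a minimal sketch)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# number of emojis per comment (comments without emojis get a count of 0)\n", "EmojiPerComment <- rowSums(EmojiDfm)\n", "summary(as.numeric(EmojiPerComment))\n", "\n", "# the most emoji-heavy comments\n", "head(sort(EmojiPerComment, decreasing = TRUE))"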
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# average sentiment score per word for each comment\n", "WordsInComments <- sapply(FormattedComments$TextEmojiDeleted,function(x){A <- strsplit(x,\" \");return(length(A[[1]]))})\n", "names(WordsInComments) <- NULL\n", "\n", "# compute average sentiment score per word instead of using the overall sum\n", "AverageSentimentPerWord <- CommentSentiment/WordsInComments\n", "\n", "# save a copy of the full vector for later use\n", "FullAverageSentimentPerWord <- AverageSentimentPerWord\n", "\n", "# exclude comments that do not have any words in them\n", "AverageSentimentPerWord <- AverageSentimentPerWord[is.nan(AverageSentimentPerWord) == FALSE]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see if our assessment is different now that we averaged sentiment scores by number of words." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# build helper dataframe to distinguish between positive, negative and neutral comments\n", "Desc <- AverageSentimentPerWord\n", "Desc[Desc > 0] <- \"positive\"\n", "Desc[Desc < 0] <- \"negative\"\n", "Desc[Desc == 0] <- \"neutral\"\n", "df <- data.frame(FormattedComments$TextEmojiDeleted[is.nan(FullAverageSentimentPerWord) == FALSE],AverageSentimentPerWord,Desc)\n", "colnames(df) <- c(\"Comment\",\"Sentiment\",\"Valence\")\n", "\n", "# display amount of positive, negative, and neutral comments\n", "ggplot(data=df, aes(x=Valence, fill = Valence)) +\n", " geom_bar(stat='count') + \n", " labs(title = \"Average Word Sentiment per Comment\", subtitle = \"Schmoyoho - OH MY DAYUM ft. Daym Drops \\nhttps://www.youtube.com/watch?v=DcJFdCmN98s\")\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# distribution of comment sentiments\n", "ggplot(df, aes(x=Sentiment)) +\n", " geom_histogram(binwidth = 1) +\n", " geom_vline(aes(xintercept=mean(Sentiment)),\n", " color=\"black\", linetype=\"dashed\", size=1) +\n", " scale_x_continuous(limits=c(-5,5)) + \n", " labs(title = \"Average Word Sentiment per Comment\", subtitle = \"Schmoyoho - OH MY DAYUM ft. Daym Drops \\nhttps://www.youtube.com/watch?v=DcJFdCmN98s\")\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# display most negative/positive comment(s) (by average sentiment score per word)\n", "as.list(as.character(df$Comment[AverageSentimentPerWord == min(AverageSentimentPerWord)]))\n", "as.list(as.character(df$Comment[AverageSentimentPerWord == max(AverageSentimentPerWord)]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have very short comments with very few extreme words as the extreme ends of the spectrum. Let's have a look at the emojis as well." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## compute average emoji sentiment per comment\n", "\n", "# define custom function to add up sentiment scores of emojis per comment\n", "AverageEmojiSentiments <- function(x){\n", " \n", " x <- mean(as.numeric(unlist(x)))\n", " return(x)\n", " \n", "}\n", "\n", "# Apply the function to every comment that contains emojis (only those emojis that have a sentiment score will be used)\n", "AverageEmojiSentiment <- lapply(AllEmojiSentiments,AverageEmojiSentiments)\n", "\n", "# save a copy of the vector for later use\n", "FullAverageEmojiSentiment <- unlist(AverageEmojiSentiment)\n", "\n", "AverageEmojiSentiment[AverageEmojiSentiment == 0] <- NA\n", "AverageEmojiSentiment <- unlist(AverageEmojiSentiment)\n", "\n", "# exclude comments that do not contain emojis\n", "AverageEmojiSentiment <- AverageEmojiSentiment[is.nan(AverageEmojiSentiment) == FALSE]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plot histogram to check distribution of emoji sentiment scores\n", "AvES_df <- data.frame(AverageEmojiSentiment)\n", "ggplot(AvES_df, aes(x = AvES_df[,1])) +\n", " geom_histogram(binwidth = 0.2) +\n", " labs(title = \"Distribution of Averaged Emoji Sentiment Scores by Comment\", subtitle = \"Schmoyoho - OH MY DAYUM ft. Daym Drops \\nhttps://www.youtube.com/watch?v=DcJFdCmN98s\") +\n", " xlab(\"Emoji Sentiment averaged per Comment\") " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have averaged both sentiment metrics, let's check whether this changed something in their bivariate distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# correlation between averaged emoji sentiment score and averaged text sentiment score\n", "cor(FullAverageSentimentPerWord,FullAverageEmojiSentiment,use=\"complete.obs\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plot the relationship\n", "TextEmojiRel <- data.frame(FullAverageSentimentPerWord,FullAverageEmojiSentiment)\n", "ggplot(TextEmojiRel, aes(x = FullAverageSentimentPerWord, y = FullAverageEmojiSentiment)) + geom_point(shape = 1) +\n", " ggtitle(\"Averaged Sentiment Scores\") +\n", " labs(title = \"Averagd Sentiment Scores\", subtitle = \"Schmoyoho - OH MY DAYUM ft. Daym Drops \\nhttps://www.youtube.com/watch?v=DcJFdCmN98s\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We do obtain a larger positive correlation with the averaged measures, however visual inspection reveals that there is no meaningful linear relationships. The data are clustered around one vertical line and multiple horizontal lines. 
This is likely in large part due to:\n", "\n", " - the skewed distribution of the number of emojis per comment and of the types of emojis used (e.g., using the ROFL emoji exactly once is by far the most common case for this particular video)\n", " - the most common average sentiment per word being zero" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " - including emojis in the analysis of text data from the internet/social media can provide valuable additional insights\n", " - simple dictionary-based sentiment analysis of emojis has clear limitations\n", " - it might be interesting to also use emojis in topic models or network analyses of text" ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 2 }