{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# DSI Summer Workshops Series\n", "\n", "## July 26, 2018\n", "\n", "Peggy Lindner
\n", "Center for Advanced Computing & Data Science (CACDS)
\n", "Data Science Institute (DSI)
\n", "University of Houston \n", "plindner@uh.edu \n", "\n", "Please make sure you have JupyterHub running with support for R and all the required packages installed. Data for this and other tutorials can be found in the GitHub repository for the Summer 2018 DSI Workshops: https://github.com/peggylind/Materials_Summer2018" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Data Mining Twitter Data\n", "\n", "Understand the basics of Twitter data mining using R" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Before we start, let's do some prep\n", "\n", "Tools, tools, tools" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "install.packages(\"twitteR\")\n", "#install.packages(\"tm\")\n", "#install.packages(\"SnowballC\")\n", "#install.packages(\"wordcloud\")\n", "#install.packages(\"topicmodels\")\n", "#install.packages(\"data.table\")\n", "#source(\"https://bioconductor.org/biocLite.R\")\n", "#biocLite(\"graph\")\n", "#biocLite(\"Rgraphviz\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Twitter\n", "\n", "![](https://raw.github.com/peggylind/DSI_Summer_Workshops/master/images/twitter-start.png)\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Techniques\n", "\n", "* Text Mining\n", "* Topic Modeling\n", "* Sentiment Analysis\n", "\n", "### Tools\n", "\n", "* Twitter API\n", "* R and specifically the following packages\n", "\n", "* [twitteR](https://cran.r-project.org/package=twitteR) Twitter data extraction\n", "* [tm](https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf) Text cleaning and mining\n", "* [topicmodels](https://www.tidytextmining.com/topicmodeling.html) Topic Modeling\n", "\n", "...\n", "\n", "Visualization\n", "\n", "* [ggplot2](http://ggplot2.tidyverse.org/) Modern R visualizations\n", "* [wordcloud](https://cran.r-project.org/package=wordcloud) Make some nice word 
clouds\n", "* [RColorBrewer](https://cran.r-project.org/package=RColorBrewer) Get color into your visualizations\n", "\n", "...\n", "\n", "\n", "### Process\n", "1. Extract tweets and followers from Twitter with R and the twitteR package\n", "2. With the tm package, clean text by removing punctuation, numbers, hyperlinks and stop words, followed by stemming and stem completion\n", "3. Build a term-document matrix\n", "4. Analyse topics with the topicmodels package\n", "5. Analyse sentiment with the sentiment140 package\n", "6. Analyse following/followed and retweeting relationships with the igraph package\n", "\n", "### Using existing Twitter data within this tutorial\n", "\n", "![](https://raw.github.com/peggylind/DSI_Summer_Workshops/master/images/twitter-account.png)\n", "\n", "`# you could download the Twitter data manually from the site\n", "#url <- \"http://www.rdatamining.com/data/RDataMining-Tweets-20160212.rds\"\n", "#download.file(url, destfile = \"RDataMining-Tweets-20160212.rds\") `\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "### Retrieve Tweets" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "R version 3.5.0 (2018-04-23)\n", "Platform: x86_64-apple-darwin15.6.0 (64-bit)\n", "Running under: macOS High Sierra 10.13.4\n", "\n", "Matrix products: default\n", "BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib\n", "LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib\n", "\n", "locale:\n", "[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8\n", "\n", "attached base packages:\n", "[1] grid parallel datasets stats graphics grDevices utils \n", "[8] methods base \n", "\n", "other attached packages:\n", " [1] sentiment_1.0 plyr_1.8.4 rjson_0.2.20 \n", " [4] RCurl_1.95-4.11 bitops_1.0-6 data.table_1.11.4 \n", " [7] topicmodels_0.2-7 Rgraphviz_2.24.0 graph_1.58.0 
\n", "[10] BiocGenerics_0.26.0 wordcloud_2.5 RColorBrewer_1.1-2 \n", "[13] tm_0.7-3 NLP_0.1-11 twitteR_1.1.9 \n", "[16] ggplot2_3.0.0 \n", "\n", "loaded via a namespace (and not attached):\n", " [1] pbdZMQ_0.3-3 modeltools_0.2-21 tidyselect_0.2.4 \n", " [4] slam_0.1-43 repr_0.15.0 purrr_0.2.5 \n", " [7] colorspace_1.3-2 htmltools_0.3.6 SnowballC_0.5.1 \n", "[10] stats4_3.5.0 base64enc_0.1-3 rlang_0.2.1 \n", "[13] pillar_1.3.0 glue_1.3.0 withr_2.1.2 \n", "[16] DBI_1.0.0 bit64_0.9-7 bindrcpp_0.2.2 \n", "[19] uuid_0.1-2 bindr_0.1.1 stringr_1.3.1 \n", "[22] munsell_0.5.0 gtable_0.2.0 evaluate_0.10.1 \n", "[25] labeling_0.3 curl_3.2 IRdisplay_0.5.0 \n", "[28] Rcpp_0.12.18 openssl_1.0.1 scales_0.5.0 \n", "[31] IRkernel_0.8.12.9000 jsonlite_1.5 bit_1.1-14 \n", "[34] digest_0.6.15 stringi_1.2.4 dplyr_0.7.6 \n", "[37] tools_3.5.0 magrittr_1.5 lazyeval_0.2.1 \n", "[40] tibble_1.4.2 crayon_1.3.4 pkgconfig_2.0.1 \n", "[43] xml2_1.2.0 assertthat_0.2.0 httr_1.3.1 \n", "[46] R6_2.2.2 compiler_3.5.0 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# load the twitteR library and show the session details\n", "library(twitteR)\n", "sessionInfo()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A) using the Twitter API\n", "\n", "The following code is merely an illustrative example. To use it, you will have to learn more about the Twitter API at: [Twitter Developer](https://developer.twitter.com/en.html)\n", "\n", "You will also need to prepare your Twitter account: https://towardsdatascience.com/setting-up-twitter-for-text-mining-in-r-bcfc5ba910f4\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This code will not run as is: the keys below are placeholders.\n", "# Change the next four lines based on your own consumer_key, consumer_secret, access_token, and access_secret. 
\n", "consumer_key <- \"dfgbfdbhe\"\n", "consumer_secret <- \"fdbdbh\"\n", "access_token <- \"dfbhdf\"\n", "access_secret <- \"fbhfd\"\n", "\n", "setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)\n", "tw = twitteR::searchTwitter('#realDonaldTrump + #HillaryClinton', n = 1e4, since = '2016-11-08', retryOnRateLimit = 1e3)\n", "d = twitteR::twListToDF(tw)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### B) load from file" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "\n", "#read the data from a file\n", "tweets <- readRDS(\"dataJuly26th/RDataMining-Tweets-20160212.rds\")\n", "\n", "#let's look what we got\n", "tweets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Let's explore" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# number of tweets in dataset\n", "(n.tweet <- length(tweets))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# convert to data frame\n", "tweets.df <- twListToDF(tweets)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# look at tweet #190\n", "tweets.df[190, c(\"id\", \"created\", \"screenName\", \"replyToSN\", \"favoriteCount\", \"retweetCount\", \"longitude\", \"latitude\", \"text\")]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Text Cleaning " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we will use the tm library\n", "library(tm) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# build a corpus, and specify the source to be character vectors\n", "myCorpus <- Corpus(VectorSource(tweets.df$text))\n", "\n", "#what did we just create?\n", "myCorpus" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# print 
tweet #3 and make text fit for slide width\n", "writeLines(strwrap(tweets.df$text[3], 60))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# print tweet #190 and make text fit for slide width\n", "writeLines(strwrap(tweets.df$text[190], 60))\n", "\n", "# convert to lower case\n", "myCorpus <- tm_map(myCorpus, content_transformer(tolower))\n", "\n", "writeLines(strwrap(myCorpus[[190]]$content, 60))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# remove URLs\n", "removeURL <- function(x) gsub(\"http[^[:space:]]*\", \"\", x)\n", "myCorpus <- tm_map(myCorpus, content_transformer(removeURL))\n", "\n", "writeLines(strwrap(myCorpus[[190]]$content, 60))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# remove anything other than English letters or space\n", "removePunct <- function(x) gsub(\"[^[:alpha:][:space:]]*\", \"\", x)\n", "myCorpus <- tm_map(myCorpus, content_transformer(removePunct))\n", "myCorpus <- tm_map(myCorpus, removeNumbers)\n", "\n", "writeLines(strwrap(myCorpus[[190]]$content, 60))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# remove stopwords\n", "myStopwords <- c(setdiff(stopwords('english'), c(\"r\", \"big\")),\n", " \"use\", \"see\", \"used\", \"via\", \"amp\")\n", "myCorpus <- tm_map(myCorpus, removeWords, myStopwords)\n", "\n", "writeLines(strwrap(myCorpus[[190]]$content, 60))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# remove extra whitespace\n", "myCorpus <- tm_map(myCorpus, stripWhitespace)\n", "\n", "writeLines(strwrap(myCorpus[[190]]$content, 60))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# keep a copy for stem completion later\n", "myCorpusCopy <- myCorpus" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, 
"outputs": [], "source": [ "myCorpusCopy <- tm_map(myCorpusCopy, stemDocument) # stem words\n", "writeLines(strwrap(myCorpusCopy[[190]]$content, 60))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stemCompletion2 <- function(x, dictionary) {\n", " x <- unlist(strsplit(as.character(x), \" \"))\n", " x <- x[x != \"\"]\n", " x <- stemCompletion(x, dictionary=dictionary)\n", " x <- paste(x, sep=\"\", collapse=\" \")\n", " PlainTextDocument(stripWhitespace(x))\n", "}\n", "myCorpusCopy <- lapply(myCorpusCopy, stemCompletion2, dictionary=myCorpusCopy)\n", "myCorpusCopy <- Corpus(VectorSource(myCorpusCopy))\n", "writeLines(strwrap(myCorpusCopy[[190]]$content, 60))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Issues in Stem completion" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's count some words to see what is going on with this stemming\n", "wordFreq <- function(corpus, word) {\n", " results <- lapply(corpus,\n", " function(x) { grep(as.character(x), pattern=paste0(\"\\\\<\",word)) }\n", " )\n", " sum(unlist(results))\n", "}\n", "n.miner <- wordFreq(myCorpusCopy, \"miner\")\n", "n.mining <- wordFreq(myCorpusCopy, \"mining\")\n", "cat(n.miner, n.mining)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution: replace oldword with newword (to fix stemming issue)\n", "replaceWord <- function(corpus, oldword, newword) {\n", " tm_map(corpus, content_transformer(gsub),\n", " pattern=oldword, replacement=newword)\n", "}\n", "myCorpus <- replaceWord(myCorpus, \"miner\", \"mining\")\n", "myCorpus <- replaceWord(myCorpus, \"universidad\", \"university\")\n", "myCorpus <- replaceWord(myCorpus, \"scienc\", \"science\")\n", "\n", "writeLines(strwrap(myCorpus[[190]]$content, 60))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Finally! 
Ready to build a term-document matrix" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tdm <- TermDocumentMatrix(myCorpus,\n", " control = list(wordLengths = c(1, Inf)))\n", "tdm\n", "\n", "# look at the term-document matrix\n", "idx <- which(dimnames(tdm)$Terms %in% c(\"r\", \"data\", \"mining\"))\n", "as.matrix(tdm[idx, 21:30])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Let's look at the most frequent terms" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(freq.terms <- findFreqTerms(tdm, lowfreq = 20))\n", "\n", "# sum up the term-document matrix by rows\n", "term.freq <- rowSums(as.matrix(tdm))\n", "term.freq <- subset(term.freq, term.freq >= 20)\n", "# prepare sums for plotting\n", "df <- data.frame(term = names(term.freq), freq = term.freq)\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### And visualize those results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create a bar chart of word frequencies\n", "library(ggplot2)\n", "ggplot(df, aes(x=term, y=freq)) + geom_bar(stat=\"identity\") +\n", " xlab(\"Terms\") + ylab(\"Count\") + coord_flip() +\n", " theme(axis.text=element_text(size=7))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Want something more colorful and playful?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#prep\n", "m <- as.matrix(tdm)\n", "# calculate the frequency of words and sort it by frequency\n", "word.freq <- sort(rowSums(m), decreasing = T)\n", "\n", "word.freq\n", "# colors\n", "library(RColorBrewer)\n", "pal <- brewer.pal(9, \"BuGn\")[-(1:4)]\n", "\n", "pal" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plot word cloud\n", "library(wordcloud)\n", "wordcloud(words = names(word.freq), freq = word.freq, min.freq = 3,\n", " random.order = F, colors = pal)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Word Associations\n", "\n", "Another way to think about word relationships is with the findAssocs() function in the tm package. For any given word, findAssocs() calculates its correlation with every other word in a TDM or DTM. Scores range from 0 to 1. A score of 1 means that two words always appear together in documents, while a score approaching 0 means the terms seldom appear in the same document.\n", "\n", "Keep in mind the calculation for findAssocs() is done at the document level. So for every document that contains the word in question, the other terms in those specific documents are associated. Documents without the search term are ignored." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# which words are associated with 'r'?\n", "findAssocs(tdm, \"r\", 0.2)\n", "\n", "findAssocs(tdm, \"data\", 0.2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Network of terms\n", "\n", "Once a few interesting term correlations have been identified, it can be useful to visually represent term correlations using the plot() function. 
By default, plot() picks a handful of randomly chosen terms and uses its default correlation threshold; here we pass the frequent terms and a threshold explicitly, e.g.:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# network of terms\n", "library(graph)\n", "library(Rgraphviz)\n", "plot(tdm, terms = freq.terms, corThreshold = 0.1, weighting = TRUE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot(tdm, terms = names(findAssocs(tdm, terms = \"data\", 0.2)[[\"data\"]]), corThreshold = 0.3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Topic Modeling" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dtm <- as.DocumentTermMatrix(tdm)\n", "library(topicmodels)\n", "lda <- LDA(dtm, k = 8) # find 8 topics\n", "term <- terms(lda, 7) # first 7 terms of every topic\n", "(term <- apply(term, MARGIN = 2, paste, collapse = \", \"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(data.table)\n", "topics <- topics(lda) # 1st topic identified for every document (tweet)\n", "topics <- data.frame(date=as.IDate(tweets.df$created), topic=topics)\n", "ggplot(topics, aes(date, fill = term[topic])) +\n", " geom_density(position = \"stack\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Top retweeted tweets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# select top retweeted tweets\n", "table(tweets.df$retweetCount)\n", "selected <- which(tweets.df$retweetCount >= 9)\n", "# plot them\n", "dates <- strptime(tweets.df$created, format=\"%Y-%m-%d\")\n", "plot(x=dates, y=tweets.df$retweetCount, type=\"l\", col=\"grey\",\n", " xlab=\"Date\", ylab=\"Times retweeted\")\n", "colors <- rainbow(10)[1:length(selected)]\n", "points(dates[selected], tweets.df$retweetCount[selected],\n", " pch=19, col=colors)\n", "text(dates[selected], 
tweets.df$retweetCount[selected],\n", " tweets.df$text[selected], col=colors, cex=.9)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### There are many more things one might want to explore with Twitter data\n", "\n", "e.g. retrieving user info and followers, or sentiment analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Sentiment Analysis\n", "\n", "This code will not work on the DSI Jupyter instance, but you might be able to run it locally." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# different way to install a package\n", "#require(devtools)\n", "#install_github(\"okugami79/sentiment140\")\n", "\n", "library(sentiment)\n", "\n", "sentiments <- sentiment(tweets.df$text)\n", "table(sentiments$polarity)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sentiment plot\n", "sentiments$score <- 0\n", "sentiments$score[sentiments$polarity == \"positive\"] <- 1\n", "sentiments$score[sentiments$polarity == \"negative\"] <- -1\n", "sentiments$date <- as.IDate(tweets.df$created)\n", "result <- aggregate(score ~ date, data = sentiments, sum)\n", "plot(result, type = \"l\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial is based on:\n", "Yanchang Zhao, http://www.rdatamining.com/docs/twitter-analysis-with-r" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.5.0" } }, "nbformat": 4, "nbformat_minor": 1 }