{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# DSI Summer Workshops Series\n", "\n", "## June 7, 2018\n", "\n", "Peggy Lindner  \n", "Center for Advanced Computing & Data Science (CACDS)  \n", "Data Science Institute (DSI)  \n", "University of Houston  \n", "plindner@uh.edu \n", "\n", "Please make sure you have a copy of R up and running, as well as a Python 3 installation (ideally from Anaconda)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Goals for today\n", "\n", "Understand the basics of text analysis using R" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "(well enough so that you can Google your problems, find the answer, and implement it.)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### More specifically\n", "\n", "1. Up and running with R & IPython\n", "2. Understand a basic exploratory data analysis workflow\n", "3. Basics of R and Topic Modeling " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Why R and not Python?\n", "\n", "It's good for data exploration! " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Part 1: Getting yourself ready\n", "\n", "### First: Install software on your computer\n", "\n", "* R: [CRAN](https://cran.r-project.org/)\n", "* Python: [Anaconda](https://www.anaconda.com/download/)\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "### Second: Prep your R environment\n", "On a Mac, open a terminal and start R\n", "\n", "![](https://raw.github.com/peggylind/DSI_Summer_Workshops/master/images/on-a-mac.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "On Windows: Open the Anaconda Command line and start R\n", "\n", "![](https://raw.github.com/peggylind/DSI_Summer_Workshops/master/images/anaconda-start.png)\n", "\n", "![](https://raw.github.com/peggylind/DSI_Summer_Workshops/master/images/windows.png)" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "Now let's install some packages ...\n", "\n", "```\n", "> install.packages(c('readr', 'stringr', 'SnowballC', 'wordcloud', 'RColorBrewer'))\n", "> install.packages(c('tm', 'ggplot2', 'topicmodels'))\n", "> install.packages(c('repr', 'IRdisplay', 'evaluate', 'crayon', 'pbdZMQ', 'devtools', 'uuid', 'digest'))\n", "> devtools::install_github('IRkernel/IRkernel')\n", "```\n", "\n", "When you see \"Please select a CRAN mirror\" , well select one.\n", "\n", "![](https://raw.github.com/peggylind/DSI_Summer_Workshops/master/images/cran-repo.png)\n", "\n", "... one last step - installing the Kernel\n", "\n", "```\n", "> IRkernel::installspec()\n", "```\n", "\n", "Now we can close the R environment (but leave your terminal and console open)\n", "\n", "\n", "```\n", "> quit()\n", "```\n", "\n", "Say \"N\" (no) when asked to save the workspace." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Jupyter Notebooks is what we will be going to use\n", "\n", "We are now ready to start up our Jupyter Environment from the terminal or the console:\n", "\n", "```\n", "$ jupyter notebook --notebook-dir C:/Users/[your username]\n", "\n", "or on a Mac\n", "\n", "$ jupyter notebook --notebook-dir /Users/[your username]\n", "```\n", "\n", "And your browser should open at the address: http://localhost:8888/tree\n", "\n", "![](https://raw.github.com/peggylind/DSI_Summer_Workshops/master/images/jupyter.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Open the downloaded notebook on your computer\n", "\n", "![](https://raw.github.com/peggylind/DSI_Summer_Workshops/master/images/second-screen.png)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Quick intro to Jupyter notebooks\n", "\n", "Cells can be Markdown (like this one) or code\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { 
"slide_type": "slide" } }, "source": [ "#### To start off with\n", "\n", "Make sure you hit `Shift-Enter` or `Ctrl-Enter` when you are done.\n", "You can also use the \"Run\" button." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "2 + 2" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Part 2: The Exploratory Analysis Workflow\n", "\n", "![](https://raw.github.com/peggylind/DSI_Summer_Workshops/master/images/data-science-workflow.png)\n", "Image source: Hadley Wickham, R for Data Science" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Our Example \n", "\n", "Media Analysis of a bunch of articles downloaded from a database called \"Factiva\"\n", "\n", "![](https://raw.github.com/peggylind/DSI_Summer_Workshops/master/images/factiva.png)\n", "\n", "\n", "Make sure you download the data source file: https://raw.githubusercontent.com/peggylind/Materials_Summer2018/master/dataJune7th/sample.txt and store it in a folder called dataJune7th next the Jupyter notebook directory (next to the *.ipynb file)." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### R packages frequently used with text data\n", "\n", "* [readr](https://cran.r-project.org/web/packages/readr/readr.pdf) Import data\n", "\n", "Data analysis of text-based material\n", "\n", "* [stringr](https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html) Clean up text\n", "* [SnowballC](https://cran.r-project.org/web/packages/SnowballC/SnowballC.pdf) Stemming of words\n", "* [tm](https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf) Text mining\n", "* [Quanteda](https://quanteda.io/) Versatile text analysis tool\n", "* [topicmodels](https://www.tidytextmining.com/topicmodeling.html) Topic Modeling\n", "\n", "Visualization\n", "\n", "* [ggplot2](http://ggplot2.tidyverse.org/) Modern R visualizations\n", "* [wordcloud](https://cran.r-project.org/package=wordcloud) Make some nice word clouds\n", "* [RColorBrewer](https://cran.r-project.org/package=RColorBrewer) Get color into your visualizations\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#load all required libraries\n", "library(readr)\n", "library(stringr)\n", "library(SnowballC)\n", "library(wordcloud)\n", "library(RColorBrewer)\n", "library(tm)\n", "library(ggplot2)\n", "library(topicmodels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Data Import" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# path to the input text file\n", "inputfile <- \"dataJune7th/sample.txt\"\n", "# read the whole file into a single string\n", "alldata <- read_file(inputfile)\n", "# look at the data\n", "# what type is our data?\n", "str(alldata)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Prepare data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# data wrangling - split the file into individual articles \n", "split.word <- \"Document 
AJAZEN(.*)\" \n", "\n", "# split up into individual documents\n", "list_alldata_splitted <- str_split(alldata, split.word)\n", "# convert to vector and remove last element (which is a leftover)\n", "alldata_splitted <- unlist(list_alldata_splitted)\n", "alldata_splitted <- alldata_splitted[-length(alldata_splitted)]\n", "str(alldata_splitted)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### create corpus \n", "article.corpus <- Corpus(VectorSource(alldata_splitted))\n", "\n", "article.corpus" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#inspect a particular document\n", "writeLines(as.character(article.corpus[[30]]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Check details (scan the whole corpus to find anomalies)\n", "inspect(article.corpus)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Data cleaning" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#create the toSpace content transformer \n", "toSpace <- content_transformer(function(x, pattern) { return (gsub(pattern, \" \", x))})\n", "#replace potentially problematic symbols (hyphens, colons, curly quotes) with a space\n", "article.corpus <- tm_map(article.corpus, toSpace, \"-\")\n", "article.corpus <- tm_map(article.corpus, toSpace, \":\")\n", "article.corpus <- tm_map(article.corpus, toSpace, \"’\")\n", "article.corpus <- tm_map(article.corpus, toSpace, \"‘\")\n", "article.corpus <- tm_map(article.corpus, toSpace, \" -\")\n", "\n", "#Good practice to check after each step.\n", "writeLines(as.character(article.corpus[[30]]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Remove punctuation (removePunctuation deletes the marks; it does not replace them with a space)\n", "article.corpus <- tm_map(article.corpus, removePunctuation)\n", "\n", "#Good practice to check after each step.\n", 
"writeLines(as.character(article.corpus[[30]]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Transform to lower case\n", "article.corpus <- tm_map(article.corpus,content_transformer(tolower))\n", "\n", "#Strip digits\n", "article.corpus <- tm_map(article.corpus, removeNumbers)\n", "\n", "#Remove stopwords from standard stopword list \n", "article.corpus <- tm_map(article.corpus, removeWords, stopwords(\"english\"))\n", "\n", "#inspect output\n", "writeLines(as.character(article.corpus[[30]]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Stopwords](https://github.com/arc12/Text-Mining-Weak-Signals/wiki/Standard-set-of-english-stopwords)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#define and eliminate all custom stopwords\n", "myStopwords <- c(\"monday\")\n", "article.corpus <- tm_map(article.corpus, removeWords, myStopwords)\n", "\n", "#Strip whitespace (cosmetic?)\n", "article.corpus <- tm_map(article.corpus, stripWhitespace)\n", "\n", "#inspect output\n", "writeLines(as.character(article.corpus[[30]]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Word Stemming](http://www.omegahat.net/Rstem/stemming.pdf)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Stem document\n", "article.corpus <- tm_map(article.corpus,stemDocument)\n", "\n", "#inspect output\n", "writeLines(as.character(article.corpus[[30]]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Prepare for Analysis - create word counts" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Create document-term matrix\n", "dtm <- DocumentTermMatrix(article.corpus)\n", "\n", "dtm" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#inspect segment of document term matrix\n", "inspect(dtm[15:16,100:105])" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#collapse matrix by summing over columns - this gets total counts (over all docs) for each term\n", "freq <- colSums(as.matrix(dtm))\n", "#length should be total number of terms\n", "length(freq)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#create sort order (descending)\n", "ord <- order(freq,decreasing=TRUE)\n", "#inspect most frequently occurring terms\n", "freq[head(ord)]\n", "#inspect least frequently occurring terms\n", "freq[tail(ord)]\n", "\n", "#List all terms in decreasing order of freq and write to disk\n", "write.csv(freq[ord],\"word_freq.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#alterantive: remove very frequent and very rare words\n", "dtmr <-DocumentTermMatrix(article.corpus, control=list(wordLengths=c(4, 20),\n", " bounds = list(global = c(3,27))))\n", "\n", "dtmr\n", "\n", "freqr <- colSums(as.matrix(dtmr))\n", "#length should be total number of terms\n", "length(freqr)\n", "\n", "#create sort order (desc)\n", "ordr <- order(freqr,decreasing=TRUE)\n", "#inspect most frequently occurring terms\n", "freqr[head(ordr)]\n", "#inspect least frequently occurring terms\n", "freqr[tail(ordr)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#list most frequent terms. Lower bound specified as second argument\n", "findFreqTerms(dtmr,lowfreq=60)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have the most frequently occurring terms in hand, we can check for correlations between some of these and other terms that occur in the corpus. In this context, correlation is a quantitative measure of the co-occurrence of words in multiple documents." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#correlations\n", "findAssocs(dtm,\"turkish\",0.5)\n", "findAssocs(dtm,\"children\",0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One needs to specify the DTM, the term of interest and the correlation limit. The latter is a number between 0 and 1 that serves as a lower bound for the strength of correlation between the search and result terms. For example, if the correlation limit is 1, findAssocs() will return only those words whose counts are perfectly correlated with those of the search term across documents. A correlation limit of 0.5 will return terms whose correlation with the search term is at least 0.5, and so on.\n", "\n", "#### Visualizations" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Basic graphics\n", "#bar chart of the most frequent terms\n", "wf <- data.frame(term=names(freq),occurrences=freq)\n", "\n", "p <- ggplot(subset(wf, occurrences>50), aes(term, occurrences))\n", "p <- p + geom_bar(stat=\"identity\")\n", "p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))\n", "p" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#wordcloud\n", "#setting the same seed each time ensures consistent look across clouds\n", "set.seed(42)\n", "#limit words by specifying min frequency\n", "wordcloud(names(freq),freq, min.freq=70)\n", "#...add color\n", "wordcloud(names(freq),freq,min.freq=70,colors=brewer.pal(6,\"Dark2\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "![](https://raw.github.com/peggylind/DSI_Summer_Workshops/master/images/topicmodeling.png)\n", "![](https://raw.github.com/peggylind/DSI_Summer_Workshops/master/images/lda.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Topic modeling\n", "#Set parameters for Gibbs sampling\n", "burnin <- 4000\n", "iter <- 2000\n", "thin <- 500\n", "seed <-list(2003,5,63,100001,765)\n", "nstart 
<- 5\n", "best <- TRUE\n", "\n", "\n", "#Number of topics\n", "k <- 5\n", "\n", "#Run LDA using Gibbs sampling\n", "ldaOut <-LDA(dtm,k, method=\"Gibbs\", control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#have a look at the model and some output\n", "ldaOut\n", "topics(ldaOut)\n", "as.matrix(terms(ldaOut,6))\n", "\n", "#write out results\n", "#docs to topics\n", "ldaOut.topics <- as.matrix(topics(ldaOut))\n", "write.csv(ldaOut.topics,file=paste(\"LDAGibbs\",k,\"DocsToTopics.csv\"))\n", "\n", "#top 6 terms in each topic\n", "ldaOut.terms <- as.matrix(terms(ldaOut,6))\n", "write.csv(ldaOut.terms,file=paste(\"LDAGibbs\",k,\"TopicsToTerms.csv\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.5.0" } }, "nbformat": 4, "nbformat_minor": 1 }