---
title: 'Lecture 8: Introduction to R, Part 2, 19 and 23 September 2019'
output:
  html_document:
    toc: yes
  html_notebook:
    theme: united
    toc: yes
---

### Putting regular expressions to use

#### Data wrangling

There are a few functions in R that use regular expressions: `regexpr`, `gregexpr`, `regmatches`, `sub`, `gsub`. We will briefly perform a basic data wrangling exercise. Allison Parrish created a data set that gathers all of the poems in Project Gutenberg into one json file, which can be found on [github](https://github.com/aparrish/gutenberg-poetry-corpus). But suppose we do not want to work with json, and we just want a plain text file of all of the poems in Project Gutenberg? That could be useful. We can use regular expressions to strip out the json markup and produce a plain text file.

```{r}
setwd("~/Desktop") # make sure your notebook file and all other files are saved on your Desktop
gutenberg.poetry.v <- scan(file="./gutenburg-poetry/gutenberg-poetry-v001-sample500k.ndjson", what="character", sep="\n", encoding = "UTF-8")
# you may want to use the smaller file "gutenberg-poetry-v001-sample10k.ndjson" with 10k lines to test
poetry.strip.s.v <- gsub('\\{"s": "', " ", gutenberg.poetry.v) # strip the opening json field
poetry.strip.s.v[1:10] # preview the first ten lines rather than printing the whole vector
gutenberg.poems.plain.v <- gsub(', "gid": "\\d+"\\}', " ", poetry.strip.s.v) # strip the trailing gid field
gutenberg.poems.plain.v[1:10] # show the first ten lines just to see if it worked
write.table(gutenberg.poems.plain.v, "gutenberg-poems.txt", row.names=F)
```

Now you have a plain text file with a numbered list of lines of poetry. You can upload this file into Voyant or run it through AntConc for basic text analysis results.

#### Cleaning up Dickens

If you have not already, download the text file of Dickens's [*Great Expectations*](https://www.dropbox.com/s/qji9ueb46ajait9/dickens_great-expectations.txt?dl=0), or copy the file from our github corpus, into your working directory and scan the text.

```{r}
dickens.v <- scan("dickens_great-expectations.txt", what="character", sep="\n", encoding = "UTF-8")
```

You have now loaded *Great Expectations* into a variable called `dickens.v`. With the text loaded, you can run quick statistical operations, such as counting lines and word frequencies.

```{r}
length(dickens.v) # this finds the number of lines in the book
dickens.lower.v <- tolower(dickens.v) # this lowercases the whole text; it is still a vector with one element per line
dickens.words <- strsplit(dickens.lower.v, "\\W") # strsplit is very important: it splits each line of the lowercased vector into words by finding non-word characters, i.e., word boundaries
# the result is a list with one item per line, and each item holds that line's words.
# In the simplest case, x is a single character string, and strsplit outputs a one-item list.
class(dickens.words) # the class function tells you the data structure of your variable
dickens.words.v <- unlist(dickens.words)
class(dickens.words.v)
dickens.words.v[1:20] # show the first 20 words in Great Expectations
```

Did you notice the "\\W" in the `strsplit` argument? What is that again? Regex! Notice that in R you need to use another backslash to indicate a character escape. Also, did you notice the blank result on the 10th word? This requires a little clean-up step.

```{r}
not.blanks.v <- which(dickens.words.v!="")
dickens.words.v <- dickens.words.v[not.blanks.v]
```

Extra white space often causes problems for text analysis.

```{r}
dickens.words.v[1:20]
```

Voila!
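With the blanks removed, it can be handy to peek at the most frequent words before going any further. This quick check is not part of the original exercise, just a sketch using the base `table` and `sort` functions (the same functions the chapter-parsing loop below relies on); `dickens.freqs.t` is a new throwaway variable.

```{r}
dickens.freqs.t <- table(dickens.words.v) # tabulate how often each word type occurs
sort(dickens.freqs.t, decreasing = TRUE)[1:10] # the ten most frequent words, likely function words such as "the" and "and"
```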
We might want to examine how many times the word "father" occurs (one of the first words we saw above, and probably an important word in this book).

```{r}
length(dickens.words.v[which(dickens.words.v=="father")])
```

Or produce a list of all unique words.

```{r}
unique(sort(dickens.words.v, decreasing = FALSE))[1:50]
```

Here we find another problem: our unique word list contains some odd non-words such as "0037m." We should strip those out.

### Exercise

Create a regular expression to remove those non-words from `dickens.words.v`. Remember that you use two backslashes (\\) for a character escape. For more information on using regex in R, RStudio has a helpful [cheat sheet](https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf).

```{r}

```

Now let's re-run that not.blanks vector to strip out the blanks you just introduced.

```{r}
not.blanks.v <- which(dickens.words.clean.v!="")
dickens.words.clean.v <- dickens.words.clean.v[not.blanks.v]
unique(sort(dickens.words.clean.v, decreasing = FALSE))[1:50]
```

Returning to basic functions, now that we have done some more clean-up: how many unique words are in the book?

```{r}
length(unique(dickens.words.clean.v))
```

Divide this by the total number of words in the whole book to calculate a vocabulary density ratio.

```{r}
unique.words <- length(unique(dickens.words.clean.v))
total.words <- length(dickens.words.clean.v)
unique.words/total.words
# you could do this more quickly this way:
# length(unique(dickens.words.v))/length(dickens.words.v)
# BUT it's good to get into the practice of storing results in variables
```

That's actually a fairly small density number, 5.7% (*Moby-Dick*, by comparison, is about 8%).

The other important data structures are tables and data frames. These are probably the most useful for sophisticated analyses, because they render the data in a table that is very similar to a spreadsheet. A common workflow is to enter your data in an Excel or Google Docs spreadsheet and then export it as a comma-separated value (.csv) or tab-separated value (.tsv) file. Many of the tidytext operations work with data frames, as we'll see later.

### Flow control: For-loops and conditionals in R

Flow control lets a program repeat operations and make decisions about which operations to run---two of the more important reasons why we use programming languages. The most common form of flow control is the for() loop. This is a command with the following syntax:

`for (name in seq) {commands}`

This sets a variable called `name` (anything you choose to assign) equal to each element of `seq` (any sequence of values, usually a vector) in turn, and runs the commands in the curly brackets once for each element.

```{r}
for (i in letters[1:10]){
  cat(i, ", which is followed by \n")
}
```

What this literally means is: create a variable called `i` as an index for the loop. The first value of `i` is `a` (the first value of `letters`, after the `in`), and R executes the commands within the loop (the instructions within the curly brackets). The code above just prints `i` and the text ", which is followed by" with a new line (signified by the escape sequence "\n"). When the closing bracket is reached, `i` moves on to the next value (the second letter). When the loop reaches the last value of the sequence (the tenth of the `letters`), it is completed.
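A pattern you will see again and again (including in the chapter-parsing loop below) is to create an empty "container" object first and then let the loop fill it in one position at a time. The example below is not from the lecture, just a small sketch with a made-up vector of words:

```{r}
some.words.v <- c("pip", "estella", "havisham", "joe") # a hypothetical vector for illustration
word.lengths.v <- numeric(length(some.words.v)) # pre-allocate an empty numeric vector of the same length
for (i in 1:length(some.words.v)) {
  word.lengths.v[i] <- nchar(some.words.v[i]) # store the number of characters of the ith word in the ith slot
}
word.lengths.v
```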
Another simple example is the Fibonacci sequence. A for() loop can automatically generate the first 20 Fibonacci numbers.

```{r}
Fibonacci <- numeric(20) # creates a numeric vector of length 20 called Fibonacci, initially filled with zeros
Fibonacci[1] <- Fibonacci[2] <- 1 # defines the first and second elements as a value of 1. This is important b/c the first two Fibonacci numbers are 1, and the next (3rd) number is gained by adding the first two
for (i in 3:20) Fibonacci[i] <- Fibonacci[i - 2] + Fibonacci[i - 1] # for each position i from the 3rd through the 20th, set that element to the sum of the two preceding elements (positions i - 2 and i - 1)
Fibonacci
```

There is another important component to flow control: the conditional. In programming this takes the form of `if()` statements.

**Syntax**

`if (condition) {commands when TRUE}`

`if (condition) {commands when TRUE} else {commands when FALSE}`

We will not have time to go into details regarding these operations, but it is important to recognize them when you are reading or modifying someone else's code.

Now, using what we know about regular expressions and flow control, let's have a look at a for() loop that Matthew Jockers uses in Chapter 4 of his *Text Analysis with R for Students of Literature*. It's a fairly complicated but useful way of breaking up a novel text into chapters for analysis. Let's use it to process the Dickens novel.

```{r}
text.v <- scan("dickens_great-expectations.txt", what="character", sep="\n", encoding = "UTF-8")
not.blanks.v <- which(text.v!="")
clean.text.v <- text.v[not.blanks.v]
start.v <- which(clean.text.v == "Chapter I")
end.v <- which(clean.text.v == "THE END")
novel.lines.v <- clean.text.v[start.v:end.v]
chap.positions.v <- grep("^Chapter \\w", novel.lines.v)
novel.lines.v[chap.positions.v]
chapter.raws.l <- list()
chapter.freqs.l <- list()
# the following for loop starts by iterating over each item in chap.positions.v
for(i in 1:length(chap.positions.v)){
  # in this if statement: if the value of i is not the last position in the vector, keep processing chapters
  if(i != length(chap.positions.v)){
    chapter.title <- novel.lines.v[chap.positions.v[i]]
    # this variable captures the chapter title in novel.lines.v at the position held in chap.positions.v.
    # If this is confusing, try this: in your console, set i to 1 by running i <- 1,
    # then run novel.lines.v[chap.positions.v[i]]
    start <- chap.positions.v[i]+1
    # i+1 gives the position of the first line of the chapter text (the first paragraph after the chapter title, in other words)
    end <- chap.positions.v[i+1]-1
    # run these lines in the console: i <- 1, then chap.positions.v[i+1].
    # Instead of adding 1 to the value stored in the ith position of chap.positions.v, this adds 1 to i as an index:
    # rather than extracting the value of the ith item in the vector, it looks up the item in the next position beyond i.
    # The value held in that spot is the position of the start of the next chapter.
    # Subtracting 1 from that value gives the line number in novel.lines.v that comes just before the next chapter heading,
    # so the words of the chapter heading itself are never counted among the chapter's words.
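    # Added note (not part of the loop as given above): because the if() skips the case where i equals
    # length(chap.positions.v), the text after the final chapter heading is never tabulated.
    # If you want the last chapter included, one possible approach is to append the position of the
    # final line ("THE END") to the positions vector before running the loop, e.g.:
    # chap.positions.v <- c(chap.positions.v, length(novel.lines.v))
    # so that end for the last real chapter falls on the line just before "THE END".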
    chapter.lines.v <- novel.lines.v[start:end] # having defined start and end points of each chapter, extract its lines
    chapter.words.v <- tolower(paste(chapter.lines.v, collapse=" ")) # paste the chapter lines into a single block of text and lowercase it
    chapter.words.l <- strsplit(chapter.words.v, "\\W") # split the chapter's text into words on non-word characters
    chapter.word.v <- unlist(chapter.words.l)
    chapter.word.v <- chapter.word.v[which(chapter.word.v!="")]
    chapter.freqs.t <- table(chapter.word.v) # tabulate the vector of words into a frequency count of each word type
    chapter.raws.l[[chapter.title]] <- chapter.freqs.t
    # here you dump the table of raw frequency counts into the empty list that was created before entering the loop.
    # The double brackets assign a label to the list item; here each item in the list is named with the chapter heading extracted a few lines above
    chapter.freqs.t.rel <- 100*(chapter.freqs.t/sum(chapter.freqs.t)) # converts the raw counts to relative frequencies based on the number of words in the chapter
    chapter.freqs.l[[chapter.title]] <- chapter.freqs.t.rel
  }
}
chapter.freqs.l[1]
length(chapter.freqs.l) # how many chapters were processed
```

Suppose I wanted to get all the relative frequencies of the word "father" in each chapter.

```{r}
father.freqs <- lapply(chapter.freqs.l, '[', 'father')
father.freqs
```

You could also use the related functions `which.max` and `which.min` to identify the chapters with the highest and lowest frequencies.

```{r}
which.max(father.freqs)
which.min(father.freqs)
```

### Exercise

Create a vector that confines your results to only the paragraphs with dialogue.

```{r}
dialogue.v <- grep('"(.*?)"', novel.lines.v) # grep is another regex function
novel.lines.v[dialogue.v][1:20] # check your work by looking at the first 20 dialogue lines in novel.lines.v
```
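The `grep` call above returns whole lines that contain dialogue. As an aside (not part of the exercise), two of the regex functions mentioned at the start of this lecture, `gregexpr` and `regmatches`, can be combined to pull out just the quoted speech itself. A minimal sketch, assuming the text uses straight double quotation marks as in the pattern above; the variable names are new throwaway names:

```{r}
dialogue.matches.l <- gregexpr('"(.*?)"', novel.lines.v) # find the positions of every quoted passage on each line
quoted.speech.v <- unlist(regmatches(novel.lines.v, dialogue.matches.l)) # extract the matching text and flatten it into a vector
quoted.speech.v[1:10] # the first ten quoted passages
```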
### Bonus Exercise

Modify the for loop in Jockers to find word frequencies only of content with dialogue.

```{r}
dialogue.chapter.raws.l <- list()
dialogue.chapter.freqs.l <- list()
for(i in 1:length(chap.positions.v)){
  if(i != length(chap.positions.v)){
    chapter.title <- novel.lines.v[chap.positions.v[i]]
    start <- chap.positions.v[i]+1
    end <- chap.positions.v[i+1]-1
    chapter.lines.v <- novel.lines.v[start:end]
    dialogue.lines.v <- grep('"(.*?)"', chapter.lines.v, value = TRUE) # here is the grep again, pruning the chapter.lines vector down to lines with dialogue
    chapter.words.v <- tolower(paste(dialogue.lines.v, collapse=" "))
    chapter.words.l <- strsplit(chapter.words.v, "\\W")
    chapter.word.v <- unlist(chapter.words.l)
    chapter.word.v <- chapter.word.v[which(chapter.word.v!="")]
    chapter.freqs.t <- table(chapter.word.v)
    dialogue.chapter.raws.l[[chapter.title]] <- chapter.freqs.t
    chapter.freqs.t.rel <- 100*(chapter.freqs.t/sum(chapter.freqs.t))
    dialogue.chapter.freqs.l[[chapter.title]] <- chapter.freqs.t.rel
  }
}
dialogue.chapter.freqs.l[1]
```

### Visualising the data with `plot`

```{r}
# compute the mean word frequency in each of the chapters at once
lapply(chapter.raws.l,mean)
# putting the results into a matrix object
mean.word.use.m <- do.call(rbind, lapply(chapter.raws.l,mean))
dim(mean.word.use.m) # this reports one row per chapter in a single column, but there's more info in the matrix
plot(mean.word.use.m, type = "h", main = "Mean word usage patterns in each chapter of Dickens's Great Expectations", ylab = "mean word use", xlab = "Each chapter")
```

```{r}
# the scale function has the effect of subtracting away the expected value (the overall mean)
# and then showing the deviations from the mean (by default it also divides by the standard deviation)
scale(mean.word.use.m)
plot(scale(mean.word.use.m), type = "h", main = "Scaled mean word usage patterns in each chapter of Dickens's Great Expectations", ylab = "mean word use", xlab = "Each chapter")
```

This gives us a general impression of vocabulary density on a chapter-by-chapter basis. Let's now return to the previous word search of "father." Suppose we wanted to visualise that frequency of "father" alongside a similar concept, "son." Here we lean on the `lapply` function, which we used briefly above. `lapply` is similar to a for loop in that it iterates over the elements of a data structure, but it is specifically designed for dealing with lists. It takes a list as its first argument and the name of a function to apply to each element as its second.

```{r}
chapter.freqs.l[[1]]["father"]
```

```{r}
chapter.freqs.l[[10]]["son"]
```

The above is just an example: the first line returns the relative frequency of "father" in the first chapter, and the second returns the relative frequency of "son" in the tenth chapter. Because we multiplied by 100 in the loop, these values are percentages: a value of 1 means the word occurs once for every 100 words in that chapter.

Now let's create lists that store the relative frequencies of each word for every chapter.

```{r}
fathers.l <- lapply(chapter.freqs.l, '[', 'father')
sons.l <- lapply(chapter.freqs.l, '[', 'son')
```

Instead of just printing out the values held in these new lists, you can capture the results in a single matrix using the `rbind` function. The `do.call` function binds the contents of each list item into rows; this effectively applies the `rbind` function across the list of "father" and "son" results, respectively.

```{r}
fathers.m <- do.call(rbind, fathers.l)
sons.m <- do.call(rbind, sons.l)
```

Let's look at one of these matrices.

```{r}
sons.m
```

Compare it to the other matrix of "father" results.
```{r}
fathers.m
```

Next we create vectors for each search term; the following extracts the father and son values into two new vectors, and uses the `cbind` function to combine these vectors into a new matrix consisting of 58 rows and 2 columns.

```{r}
fathers.v <- fathers.m[,1]
sons.v <- sons.m[,1]
fathers.sons.m <- cbind(fathers.v, sons.v)
dim(fathers.sons.m)
```

Now we can visualise these two word searches.

```{r}
colnames(fathers.sons.m) <- c("father", "son")
barplot(fathers.sons.m, beside=T, col="grey", ylab = "relative word frequency")
```
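One thing to watch for: any chapter in which "father" or "son" never occurs will produce an NA entry in `fathers.sons.m`, because indexing a frequency table by a name it does not contain returns NA. If you would rather treat absence as a frequency of zero before plotting, a minimal sketch (where `fathers.sons.clean.m` is just a new name for the copy):

```{r}
fathers.sons.clean.m <- fathers.sons.m # copy so the original matrix is untouched
fathers.sons.clean.m[is.na(fathers.sons.clean.m)] <- 0 # treat chapters where a word never occurs as zero
barplot(fathers.sons.clean.m, beside=T, col="grey", ylab = "relative word frequency")
```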