Putting regular expressions to use

Data wrangling

There are a few functions in R that use regular expressions: regexpr, gregexpr, regmatches, sub, gsub.
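To see how these functions differ, here is a minimal sketch on a made-up sentence. The string and the variable names are invented for this illustration and are not part of the workshop data.

```r
# Toy input, invented for illustration only
line.v <- "Pip met Joe in 1860 and Estella in 1861."

# sub() replaces only the first match; gsub() replaces every match
first.only.v <- sub("\\d+", "YEAR", line.v)
all.years.v <- gsub("\\d+", "YEAR", line.v)

# regexpr() locates the first match; gregexpr() locates all of them
first.pos <- regexpr("\\d+", line.v)
all.pos.l <- gregexpr("\\d+", line.v)

# regmatches() extracts the matched text at those positions
first.year.v <- regmatches(line.v, first.pos)
all.years.found.v <- regmatches(line.v, all.pos.l)[[1]]

first.only.v
all.years.v
first.year.v
all.years.found.v
```

In short, sub and gsub change text, regexpr and gregexpr report where matches are, and regmatches pulls the matching text out.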

As a brief first exercise in data wrangling, consider a data set created by Allison Parrish that gathers all of the poems in Project Gutenberg into one JSON file, which can be found on GitHub. Suppose we do not want to work with JSON and just want a plain text file of all of the poems. We can use regular expressions to strip out the JSON markup and write a plain text file.

setwd("~/Desktop") # make sure your notebook file and all other files are saved on your Desktop
gutenberg.poetry.v <- scan(file="./gutenburg-poetry/gutenberg-poetry-v001-sample500k.ndjson", what="character", sep="\n", encoding = "UTF-8") # you may want to use the smaller file "gutenberg-poetry-v001-sample10k.ndjson" with 10k lines to test
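poetry.strip.s.v <- gsub('\\{"s": "', " ", gutenberg.poetry.v) # strip the opening {"s": " wrapper from each line
poetry.strip.s.v[1:10] # show the first ten lines to check the result
gutenberg.poems.plain.v <- gsub('", "gid": "\\d+"\\}', " ", poetry.strip.s.v) # strip the closing quote and the "gid" field
gutenberg.poems.plain.v[1:10] # show the first ten lines just to see if it worked
write.table(gutenberg.poems.plain.v, "gutenberg-poems.txt", row.names=FALSE, col.names=FALSE, quote=FALSE) # write one line of poetry per line, with no header, row numbers, or quotation marks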

Now you have a plain text file of lines of poetry. You can upload this file into Voyant or run it through AntConc for basic text analysis results.
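The same stripping can be tried on a couple of lines before running it on the full file. The sketch below is only an illustration: the two sample lines are invented, and they assume each line has the shape the patterns above expect. It also swaps in a different technique, a single sub() with a capture group instead of two gsub() calls, and writes the result with writeLines().

```r
# Two invented lines in the shape the patterns above expect
sample.v <- c('{"s": "A first line of verse,", "gid": "19"}',
              '{"s": "and a second line.", "gid": "19"}')

# Capture everything between the "s" wrapper and the "gid" field,
# then keep only the captured group (\\1)
plain.v <- sub('^\\{"s": "(.*)", "gid": "\\d+"\\}$', "\\1", sample.v)
plain.v

# writeLines() writes one element per line, without quotes or a header
out.file <- tempfile(fileext = ".txt")
writeLines(plain.v, out.file)
readLines(out.file)
```

Testing a pattern on a few known lines first makes it much easier to spot a stray quotation mark or brace in the output.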

Cleaning up Dickens

If you have not already, download the text file of Dickens’s Great Expectations, or copy the file from our GitHub corpus, into your working directory and scan the text.

dickens.v <- scan("great-expectations.txt", what="character", sep="\n", encoding = "UTF-8")

You have now loaded Great Expectations into a variable called dickens.v.

With the text loaded, you can now run quick statistical operations, such as the number of lines and word frequencies.

length(dickens.v) # this finds the number of lines in the book

dickens.lower.v <- tolower(dickens.v) # this lowercases the whole text; the result is still a character vector with one element per line

dickens.words <- strsplit(dickens.lower.v, "\\W") # strsplit is very important: it takes each line in the lowercased vector and splits it into words wherever it finds a non-word character (anything other than a letter, digit, or underscore)
# the result is a list with one item per line of the book, and each item is a character vector of that line's words. In the simplest case, x is a single character string, and strsplit outputs a one-item list.

class(dickens.words) # the class function tells you the data structure of your variable

dickens.words.v <- unlist(dickens.words)

class(dickens.words.v)

dickens.words.v[1:20] # show the first 20 words in Great Expectations

Did you notice the “\W” in the strsplit argument? What is that again? Regex! Notice that in R you need to use another backslash to indicate a character escape.
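A small sketch makes the double backslash concrete. The sentence below is invented for the example. It also shows where blank elements come from: two non-word characters in a row (here a comma followed by a space) leave an empty string between them.

```r
# Invented sentence for illustration
sentence.v <- "my father's family name, being pirrip"

# "\\W" in R source code is the regex \W (any non-word character)
pieces.v <- unlist(strsplit(sentence.v, "\\W"))
pieces.v

# nchar() shows that the R string "\\W" holds just two characters: \ and W
nchar("\\W")
```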

Also, did you notice the blank result on the 10th word? This requires a little clean-up step.

not.blanks.v <- which(dickens.words.v!="")

dickens.words.v <- dickens.words.v[not.blanks.v]

Extra white spaces often cause problems for text analysis.

dickens.words.v[1:20]

Voila! We might want to examine how many times “father” occurs (it is the fourth word in the results, and one that will probably be an important word in this book).

length(dickens.words.v[which(dickens.words.v=="father")])
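There are other ways to get the same count. The sketch below uses an invented word vector: a logical comparison can be summed directly, and table() counts every word at once.

```r
# Invented word vector standing in for dickens.words.v
words.v <- c("my", "father", "s", "family", "name", "my", "father")

# TRUE counts as 1, so summing the comparison gives the number of matches
father.count <- sum(words.v == "father")
father.count

# table() counts every distinct word; sort() puts the most frequent first
word.freqs.t <- sort(table(words.v), decreasing = TRUE)
word.freqs.t["father"]
```

The table approach is the more useful one once you want counts for many words instead of one.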

Or produce a list of all unique words.

unique(sort(dickens.words.v, decreasing = FALSE))[1:50]

Here we find another problem: we find in our unique word list some odd non-words such as “0037m.” We should strip those out.

Exercise

Create a regular expression to remove those non-words from dickens.words.v, and store the result in dickens.words.clean.v. Remember that you use two backslashes (\\) for a character escape. For more information on using regex in R, RStudio has a helpful cheat sheet.
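If you get stuck, here is one possible approach, shown on invented tokens in place of the real vector. It assumes that the "non-words" are tokens containing at least one digit, which may not cover every odd token in your own results.

```r
# Invented tokens standing in for dickens.words.v
tokens.v <- c("chapter", "0037m", "father", "1867", "pip")

# Replace any token that contains a digit with an empty string
tokens.clean.v <- gsub("^.*\\d.*$", "", tokens.v)
tokens.clean.v
```

Because gsub replaces the unwanted tokens with empty strings and does not drop them, the vector keeps its length and now contains blanks.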

Now let’s re-run the not.blanks step to strip out the blanks you just introduced.

not.blanks.v <- which(dickens.words.clean.v!="")

dickens.words.clean.v <- dickens.words.clean.v[not.blanks.v]

unique(sort(dickens.words.clean.v, decreasing = FALSE))[1:50]

Returning to basic functions, now that we have done some more clean-up: how many unique words are in the book?

length(unique(dickens.words.clean.v))

Divide this by the total number of words in the book to calculate the vocabulary density ratio.

unique.words <- length(unique(dickens.words.clean.v))

total.words <- length(dickens.words.clean.v)

unique.words/total.words 
# you could do this quicker this way: 
# length(unique(dickens.words.clean.v))/length(dickens.words.clean.v) 
# BUT it's good to get into the practice of storing results in variables

That’s actually a fairly small density number, 5.7% (Moby-Dick by comparison is about 8%).

The other important data structures are tables and data frames. These are probably the most useful for sophisticated analyses, because they render the data in a form that is very similar to a spreadsheet. If you enter your data in an Excel or Google Sheets spreadsheet, export it as a comma-separated values (.csv) or tab-separated values (.tsv) file so that R can read it. Many of the tidytext operations work with data frames, as we’ll see later.
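As a small illustration of both structures, the sketch below counts an invented word vector into a table, converts the table into a data frame, and round-trips it through a .csv file. The column names word and count are choices made for this example.

```r
# Invented word vector for illustration
words.v <- c("father", "son", "father", "pip", "father", "son")

# table() counts each distinct word
freqs.t <- table(words.v)

# Convert the table to a data frame with one row per word
freqs.df <- as.data.frame(freqs.t, stringsAsFactors = FALSE)
names(freqs.df) <- c("word", "count")

# Sort the rows from most to least frequent
freqs.df <- freqs.df[order(-freqs.df$count), ]
freqs.df

# Write the data frame to a .csv file and read it back in
csv.file <- tempfile(fileext = ".csv")
write.csv(freqs.df, csv.file, row.names = FALSE)
freqs.back.df <- read.csv(csv.file)
freqs.back.df
```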

Flow control: For-loops and conditionals in R

Flow control covers repetitive operations and conditional decisions, two of the more important reasons why we use programming languages. The most common way to repeat an operation is the for() loop, which has the following syntax:

for (name in seq) {[enter commands]}

This sets a variable called name (any name you choose) equal to each element of the sequence in turn; the sequence is usually a vector. The commands inside the curly brackets run once for each element.

for (i in letters[1:10]){
  cat(i, ", which is followed by \n")
}

What this literally means is: create a variable called i as an index for the loop. The first value of i is "a" (the first value of letters[1:10], the sequence after the in), and R executes the commands within the curly brackets. The code above just prints i and the text “, which is followed by” with a new line (signified by the escape sequence "\n", which is a string escape and not a regex). When the closing bracket is reached, i moves on to the next value (the second letter). When the loop has run for the last value of the sequence (the tenth letter), it is complete.

Another simple example is the Fibonacci sequence. A for() loop can automatically generate the first 20 Fibonacci numbers.

Fibonacci <- numeric(20) # creates a numeric vector called Fibonacci of length 20, filled with zeros

Fibonacci[1] <- Fibonacci[2] <- 1 # sets the first and second elements to 1. This is important b/c the first two Fibonacci numbers are 1, and the next (3rd) number is obtained by adding the first two

for (i in 3:20) Fibonacci[i] <- Fibonacci[i - 2] + Fibonacci[i - 1] # for each position i from 3 to 20, add the element two positions back to the element one position back
Fibonacci

There is another important component to flow control: the conditional. In programming this takes the form of if() statements.

Syntax

if (condition) {commands when TRUE}

if (condition) {commands when TRUE} else {commands when FALSE}

We will not have time to go into details regarding these operations, but it is important to recognize them when you are reading or modifying someone else’s code.
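To make the syntax a little more concrete, here is a minimal sketch that combines a for() loop with an if/else statement. The word vector and the counter names are invented for the example.

```r
# Invented word vector for illustration
words.v <- c("father", "son", "pip", "father")

father.count <- 0
other.count <- 0

for (w in words.v) {
  if (w == "father") {
    # runs when the condition is TRUE
    father.count <- father.count + 1
  } else {
    # runs when the condition is FALSE
    other.count <- other.count + 1
  }
}

father.count
other.count
```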

Now, using what we know about regular expressions and flow control, let’s have a look at a for() loop that Matthew Jockers uses in Chapter 4 of his Text Analysis with R for Students of Literature. It’s a fairly complicated but useful way of breaking up the text of a novel into chapters for analysis. Let’s use it to process the Dickens novel.
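The sketch below shows the general shape of such a loop on a tiny invented text. It is an approximation written for this illustration, not Jockers’s exact code: the toy text, the heading pattern, the trailing "END" marker, and all the toy. variable names are assumptions.

```r
# Invented miniature "novel"; the final "END" line marks where the last chapter stops
toy.lines.v <- c("Chapter I", "My father was a blacksmith.", "My father said so.",
                 "Chapter II", "The son ran away.", "END")

# Line numbers of the chapter headings, plus the position of the end marker
toy.positions.v <- grep("^Chapter [IVXLC]+$", toy.lines.v)
toy.positions.v <- c(toy.positions.v, length(toy.lines.v))

toy.freqs.l <- list()
for (i in seq_along(toy.positions.v)) {
  # the last position is only an end marker, so it starts no chapter
  if (i != length(toy.positions.v)) {
    chapter.title <- toy.lines.v[toy.positions.v[i]]
    start <- toy.positions.v[i] + 1
    end <- toy.positions.v[i + 1] - 1
    chapter.words.v <- tolower(paste(toy.lines.v[start:end], collapse = " "))
    chapter.word.v <- unlist(strsplit(chapter.words.v, "\\W"))
    chapter.word.v <- chapter.word.v[chapter.word.v != ""]
    chapter.freqs.t <- table(chapter.word.v)
    # relative frequency as a percentage of the words in the chapter
    toy.freqs.l[[chapter.title]] <- 100 * (chapter.freqs.t / sum(chapter.freqs.t))
  }
}

names(toy.freqs.l)
toy.freqs.l[["Chapter I"]]["father"]
```

The result is a list with one frequency table per chapter, named by chapter heading, which is the kind of object the rest of this section works with.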

length(chapter.freqs.l)[1]
[1] 58

Suppose I wanted to get all relative frequencies of the word “father” in each chapter.

father.freqs <- lapply(chapter.freqs.l, '[', 'father')

father.freqs

You could also use variations of the which function to identify the chapters with the highest and lowest frequencies.

which.max(father.freqs)

which.min(father.freqs)

Exercise

Create a vector that confines your results to only the paragraphs with dialogue.

dialogue.v <- grep('"(.*?)"', novel.lines.v) # grep is another regex function

novel.lines.v[dialogue.v][1:20] 
# check your work by printing the first 20 dialogue lines in novel.lines.v

Bonus Exercise

Modify the for loop in Jockers to find word frequencies only of content with dialogue.

dialogue.chapter.raws.l <- list()
dialogue.chapter.freqs.l <- list()

for(i in 1:length(chap.positions.v)){
    if(i != length(chap.positions.v)){
        chapter.title <- novel.lines.v[chap.positions.v[i]]
        start <- chap.positions.v[i]+1
        end <- chap.positions.v[i+1]-1
        chapter.lines.v <- novel.lines.v[start:end]
        dialogue.lines.v <- grep('"(.*?)"', chapter.lines.v, value = TRUE) # here is the grep again, pruning the chapter.lines vector into lines with dialogue
        chapter.words.v <- tolower(paste(dialogue.lines.v, collapse=" "))
        chapter.words.l <- strsplit(chapter.words.v, "\\W")
        chapter.word.v <- unlist(chapter.words.l)
        chapter.word.v <- chapter.word.v[which(chapter.word.v!="")]
        chapter.freqs.t <- table(chapter.word.v)
        dialogue.chapter.raws.l[[chapter.title]] <- chapter.freqs.t
        chapter.freqs.t.rel <- 100*(chapter.freqs.t/sum(chapter.freqs.t))
        dialogue.chapter.freqs.l[[chapter.title]] <- chapter.freqs.t.rel
    }
}

dialogue.chapter.freqs.l[1]

Visualising the data with plot

This gives us a general impression of vocabulary density on a chapter-by-chapter basis. Let’s now return to the previous word search of “father.” Suppose we wanted to visualise that frequency of “father” alongside a similar concept, “son.”

We need to take a closer look at lapply, the function we used above to collect the “father” frequencies. The lapply function is similar to a for loop, in that it iterates over the elements in a data structure, but it is specifically designed for dealing with lists. It takes a list as its first argument and the name of some other function as its second; any further arguments are passed on to that function.
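A minimal sketch on an invented two-chapter list shows what lapply does with the extraction function `[`. The toy. names and the toy counts are made up for the example.

```r
# Invented list of per-chapter word tables
toy.freqs.l <- list(
  "Chapter I" = table(c("father", "father", "pip")),
  "Chapter II" = table(c("son", "pip"))
)

# Apply the extraction function `[` to every list item, asking for "father"
toy.father.l <- lapply(toy.freqs.l, "[", "father")
toy.father.l

# The same call written with an explicit anonymous function
toy.father2.l <- lapply(toy.freqs.l, function(chapter.t) chapter.t["father"])
toy.father2.l
```

A chapter that does not contain the word returns NA, which is why NA values appear in results of this kind.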

chapter.freqs.l[[1]]["father"]
   father 
0.2695418 
chapter.freqs.l[[10]]["son"]
       son 
0.03901678 

The above is just an example: the word “father” appears with a relative frequency of about 0.27% in the first chapter (that is, roughly 27 times for every 10,000 words in the chapter), and the word “son” appears with a relative frequency of about 0.04% in the 10th chapter. Now let’s create vectors that store the relative frequencies for each chapter.

Instead of just printing out the values held in these new lists, you can capture the results in a single matrix using the rbind function. The do.call function calls rbind with every item of the list as an argument, so each list item becomes one row of the matrix; we do this for the list of “father” results and the list of “son” results, respectively.
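As a rough sketch of that step, the following builds a one-column matrix from an invented three-item list. The toy. names and the values are made up; only the do.call and rbind calls matter here.

```r
# Invented per-chapter results for "son" (NA where the word does not occur)
toy.son.l <- list(
  "Chapter I" = c(son = NA_real_),
  "Chapter II" = c(son = 0.5),
  "Chapter III" = c(son = 0.25)
)

# do.call() passes every list item to rbind() as a separate argument,
# so each item becomes one row, named after its list item
toy.sons.m <- do.call(rbind, toy.son.l)
toy.sons.m
dim(toy.sons.m)
```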

Let’s look at one of these matrices.

sons.m
                      <NA>
Chapter I               NA
Chapter II              NA
Chapter III             NA
Chapter IV              NA
Chapter V               NA
Chapter VI              NA
Chapter VII             NA
Chapter VIII            NA
Chapter IX      0.03696858
Chapter X       0.03901678
Chapter XI              NA
Chapter XII             NA
Chapter XIII            NA
Chapter XIV             NA
Chapter XV              NA
Chapter XVI             NA
Chapter XVII            NA
Chapter XVIII   0.01953507
Chapter XIX             NA
Chapter XX              NA
Chapter XXI             NA
Chapter XXII    0.03972195
Chapter XXIII   0.03092146
Chapter XXIV            NA
Chapter XXV     0.06563833
Chapter XXVI            NA
Chapter XXVII           NA
Chapter XXVIII          NA
Chapter XXIX            NA
Chapter XXX     0.08818342
Chapter XXXI            NA
Chapter XXXII           NA
Chapter XXXIII          NA
Chapter XXXIV           NA
Chapter XXXV            NA
Chapter XXXVI           NA
Chapter XXXVII  0.27864855
Chapter XXXVIII         NA
Chapter XXXIX   0.04004806
Chapter XL              NA
Chapter XLI             NA
Chapter XLII            NA
Chapter XLIII           NA
Chapter XLIV    0.03423485
Chapter XLV             NA
Chapter XLVI    0.03289474
Chapter XLVII           NA
Chapter XLVIII          NA
Chapter XLIX            NA
Chapter L               NA
Chapter LI              NA
Chapter LII             NA
Chapter LIII            NA
Chapter LIV             NA
Chapter LV      0.03460208
Chapter LVI             NA
Chapter LVII            NA
Chapter LVIII           NA

Compare it to the other matrix of “father” results.

fathers.m
                    father
Chapter I       0.26954178
Chapter II              NA
Chapter III             NA
Chapter IV              NA
Chapter V               NA
Chapter VI              NA
Chapter VII     0.14702279
Chapter VIII            NA
Chapter IX      0.03696858
Chapter X               NA
Chapter XI              NA
Chapter XII             NA
Chapter XIII            NA
Chapter XIV             NA
Chapter XV              NA
Chapter XVI             NA
Chapter XVII            NA
Chapter XVIII           NA
Chapter XIX             NA
Chapter XX      0.06211180
Chapter XXI     0.11055832
Chapter XXII    0.37735849
Chapter XXIII   0.03092146
Chapter XXIV            NA
Chapter XXV             NA
Chapter XXVI            NA
Chapter XXVII   0.06510417
Chapter XXVIII          NA
Chapter XXIX    0.02010454
Chapter XXX     0.26455026
Chapter XXXI            NA
Chapter XXXII           NA
Chapter XXXIII          NA
Chapter XXXIV   0.04230118
Chapter XXXV            NA
Chapter XXXVI           NA
Chapter XXXVII  0.03483107
Chapter XXXVIII         NA
Chapter XXXIX   0.02002403
Chapter XL              NA
Chapter XLI             NA
Chapter XLII            NA
Chapter XLIII           NA
Chapter XLIV            NA
Chapter XLV             NA
Chapter XLVI    0.13157895
Chapter XLVII           NA
Chapter XLVIII          NA
Chapter XLIX            NA
Chapter L       0.13071895
Chapter LI      0.29791460
Chapter LII             NA
Chapter LIII    0.01871608
Chapter LIV             NA
Chapter LV      0.03460208
Chapter LVI             NA
Chapter LVII            NA
Chapter LVIII   0.03205128

Next we create a vector for each search term: we extract the father and son values into two new vectors and use the cbind function to combine them into a new matrix of 58 rows and 2 columns.
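A minimal sketch of that step, using two invented three-row matrices in place of the real 58-row ones (all toy. names and values are assumptions made for the example):

```r
# Invented one-column matrices standing in for the real results
chapters.v <- c("Chapter I", "Chapter II", "Chapter III")
toy.fathers.m <- matrix(c(0.27, NA, 0.04), ncol = 1,
                        dimnames = list(chapters.v, "father"))
toy.sons.m <- matrix(c(NA, 0.5, 0.04), ncol = 1,
                     dimnames = list(chapters.v, "son"))

# Pull the single column out of each matrix as a plain named vector
toy.fathers.v <- toy.fathers.m[, 1]
toy.sons.v <- toy.sons.m[, 1]

# Bind the two vectors side by side; the argument names become column names
toy.fathers.sons.m <- cbind(father = toy.fathers.v, son = toy.sons.v)
toy.fathers.sons.m
dim(toy.fathers.sons.m)
```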

dim(fathers.sons.m)
[1] 58  2

Now we can visualise these two word searches.
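One way to do this with base R graphics is a grouped bar plot. The sketch below is an assumption about how such a plot could be drawn, not the workshop’s own plotting code, and it uses a small invented matrix in which missing values have already been set to 0.

```r
# Invented two-column matrix standing in for fathers.sons.m
toy.fathers.sons.m <- cbind(father = c(0.27, 0, 0.04), son = c(0, 0.5, 0.04))
rownames(toy.fathers.sons.m) <- c("Chapter I", "Chapter II", "Chapter III")

# barplot() draws one group of bars per column, so transpose the matrix
# to get one group per chapter with a bar for each word
bar.mids <- barplot(t(toy.fathers.sons.m), beside = TRUE, legend.text = TRUE,
                    ylab = "Relative frequency (%)",
                    main = "father and son by chapter")
```

With the real 58-chapter matrix the chapter labels will be crowded, so you may want to shorten the row names or rotate the labels.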

