---
title: 'Lecture 8: Introduction to R, Part 2, 19 and 23 September 2019'
output:
  html_document:
    toc: yes
  html_notebook:
    theme: united
    toc: yes
---

### Putting regular expressions to use

#### Data wrangling

There are a few functions in R that use regular expressions: `regexpr`, `gregexpr`, `regmatches`, `sub`, `gsub`.
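
As a quick orientation before the exercise, here is a minimal sketch of what each of these functions does. The string `line.v` is a made-up example (an assumption for illustration, not part of the lecture data).

```{r}
# Toy example; line.v is a hypothetical variable, not part of the lecture data
line.v <- "Pip met Joe in 1861 and Estella in 1862."

# regexpr finds the position of the first match; gregexpr finds every match
first.match <- regexpr("\\d+", line.v)
all.matches <- gregexpr("\\d+", line.v)

# regmatches pulls out the matched text using those positions
regmatches(line.v, first.match)       # the first run of digits
regmatches(line.v, all.matches)[[1]]  # every run of digits

# sub replaces only the first match; gsub replaces all of them
sub("\\d+", "YEAR", line.v)
gsub("\\d+", "YEAR", line.v)
```

Roughly speaking, `regexpr` and `gregexpr` locate matches, `regmatches` extracts them, and `sub` and `gsub` replace them.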

We will briefly perform a basic data wrangling exercise. Allison Parrish created a data set that gathers all of the poems in Project Gutenberg into one JSON file, which can be found on [GitHub](https://github.com/aparrish/gutenberg-poetry-corpus). But suppose we do not want to work with JSON, and we just want a plain text file of all of the poems in Project Gutenberg. We can use regular expressions to strip out the JSON markup and produce a plain text file.

```{r} 
setwd("~/Desktop") # make sure your notebook file and all other files are saved on your Desktop
gutenberg.poetry.v <- scan(file="./gutenburg-poetry/gutenberg-poetry-v001-sample500k.ndjson", what="character", sep="\n", encoding = "UTF-8") # you may want to use the smaller file "gutenberg-poetry-v001-sample10k.ndjson" with 10k lines to test
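poetry.strip.s.v <- gsub('\\{"s": "', " ", gutenberg.poetry.v) # strip the opening {"s": " from each line
poetry.strip.s.v[1:10] # show the first ten lines rather than printing the whole vector
gutenberg.poems.plain.v <- gsub('", "gid": "\\d+"\\}', " ", poetry.strip.s.v) # strip the closing ", "gid": "..."} including the quotation mark that ends the line of poetry
gutenberg.poems.plain.v[1:10] # show the first ten lines just to see if it worked
write.table(gutenberg.poems.plain.v, "gutenberg-poems.txt", row.names=FALSE)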
```

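Now you have a plain text file with one line of poetry per row. The rows are not numbered, because `row.names` is switched off; note that `write.table` by default wraps each line in quotation marks and adds a header row. You can now upload this file into Voyant or run it through AntConc for basic text analysis results.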
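
The same stripping can also be sketched in a single pass with a capture group. The two lines below are made-up data in the same shape as the corpus file, and the names `sample.json.v`, `sample.plain.v` and `out.file` are hypothetical. This is a sketch under those assumptions; lines of poetry that contain escaped quotation marks would be handled more robustly by a JSON parser.

```{r}
# Two made-up lines in the same shape as the corpus file (hypothetical data)
sample.json.v <- c('{"s": "The Song of Hiawatha", "gid": "19"}',
                   '{"s": "By the shores of Gitche Gumee", "gid": "19"}')

# The parentheses capture the line of poetry; \\1 refers back to what they captured
sample.plain.v <- sub('^\\{"s": "(.*)", "gid": "\\d+"\\}$', "\\1", sample.json.v)
sample.plain.v

# writeLines writes one element per line, with no quotation marks or header row
out.file <- tempfile(fileext = ".txt")
writeLines(sample.plain.v, out.file)
readLines(out.file)
```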

#### Cleaning up Dickens

If you have not already, download the text file of Dickens's [*Great Expectations*](https://www.dropbox.com/s/qji9ueb46ajait9/dickens_great-expectations.txt?dl=0), or copy the file from our GitHub corpus, into your working directory and scan the text.

```{r}
dickens.v <- scan("dickens_great-expectations.txt", what="character", sep="\n", encoding = "UTF-8") # the file name must match the file you saved in your working directory
```
You have now loaded *Great Expectations* into a variable called `dickens.v`.

With the text loaded, you can now run quick statistical operations, such as the number of lines and word frequencies.

```{r}
length(dickens.v) # this finds the number of lines in the book

dickens.lower.v <- tolower(dickens.v) # this lowercases the whole text; the result is still a character vector with one element per line

dickens.words <- strsplit(dickens.lower.v, "\\W") # strsplit is very important: it splits each line of the lowercased vector wherever it finds a non-word character (anything other than a letter, digit, or underscore)
# the result is a list with one item per line of the book; each item is a character vector of that line's words. In the simplest case, x is a single character string, and strsplit outputs a one-item list.

class(dickens.words) # the class function tells you the data structure of your variable

dickens.words.v <- unlist(dickens.words)

class(dickens.words.v)

dickens.words.v[1:20] # show the first 20 words in Great Expectations
```

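Did you notice the `"\\W"` in the `strsplit` argument? What is that again? Regex! The pattern `\W` matches any non-word character. Notice that inside an R string you need to add another backslash, because the backslash is itself R's escape character, so the pattern is typed as `"\\W"`.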
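
To see the escaping at work, here is a small sketch on made-up strings (`toy.v` is a hypothetical variable, not part of the novel's data).

```{r}
# Hypothetical toy string, not taken from the text file
toy.v <- "my father's family name"

nchar("\\W")  # the R string holds two characters: one backslash followed by W
cat("\\W\n")  # prints the pattern as the regex engine receives it

# Splitting on every non-word character: here the apostrophe and the spaces
strsplit(toy.v, "\\W")[[1]]

# Two separators in a row (a comma followed by a space) leave an empty string behind
strsplit("name, being", "\\W")[[1]]
```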

Also, did you notice the blank result on the 10th word? This requires a little clean-up step.

```{r}
not.blanks.v <- which(dickens.words.v!="")

dickens.words.v <- dickens.words.v[not.blanks.v]
```

Extra white spaces often cause problems for text analysis.

```{r}
dickens.words.v[1:20]
```


Voila! We might want to examine how many times the word "father" occurs (it is the fourth word in the results above, and one that will probably be an important word in this book).

```{r}
length(dickens.words.v[which(dickens.words.v=="father")])
```

Or produce a list of all unique words.

```{r}
unique(sort(dickens.words.v, decreasing = FALSE))[1:50]
```

Here we find another problem: we find in our unique word list some odd non-words such as "0037m." We should strip those out.

#### Exercise

Create a regular expression to remove those non-words in `dickens.words.v`, and store the result in a new vector called `dickens.words.clean.v`. Remember that you use two backslashes (`\\`) for a character escape. For more information on using regex in R, RStudio has a helpful [cheat sheet](https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf).

```{r}

```
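
One possible approach, sketched here on a toy vector rather than on the novel itself: it assumes the non-words are tokens that contain a digit (like "0037m"), which may not cover every case in the real file. The names `toy.words.v` and `toy.clean.v` are hypothetical.

```{r}
# Hypothetical toy vector standing in for dickens.words.v
toy.words.v <- c("my", "father", "0037m", "s", "family", "0045m", "name")

# Replace any token that contains a digit with an empty string
toy.clean.v <- gsub("^.*\\d.*$", "", toy.words.v)
toy.clean.v
```

Applying the same pattern to `dickens.words.v` and storing the result in `dickens.words.clean.v` would leave blanks where the non-words used to be.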

Now let's re-run the `not.blanks.v` step to strip out the blanks you just added.

```{r}
not.blanks.v <- which(dickens.words.clean.v!="")

dickens.words.clean.v <- dickens.words.clean.v[not.blanks.v]

unique(sort(dickens.words.clean.v, decreasing = FALSE))[1:50]
```

Returning to basic functions, now that we have done some more clean-up: how many unique words are in the book?

```{r}
length(unique(dickens.words.clean.v))
```

Divide this by the number of words in the whole book to calculate the vocabulary density ratio.

```{r}
unique.words <- length(unique(dickens.words.clean.v))

total.words <- length(dickens.words.clean.v)

unique.words/total.words 
# you could do this more quickly this way:
# length(unique(dickens.words.clean.v))/length(dickens.words.clean.v)
# BUT it's good to get into the practice of storing results in variables
```
That's actually a fairly small density number, 5.7% (*Moby-Dick* by comparison is about 8%).

The other important data structures are tables and data frames. These are probably the most useful for sophisticated analyses, because they render the data in a table that is very similar to a spreadsheet. To get your own data into R, enter it in an Excel or Google Docs spreadsheet and then export it as a comma-separated value (.csv) or tab-separated value (.tsv) file. Many of the tidytext operations work with data frames, as we'll see later.
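
As a minimal sketch of these structures, the following turns a made-up vector of words into a table, then a data frame, and round-trips it through a .csv file. All of the names here (`toy.words.v`, `toy.freqs.df`, `csv.file`) are hypothetical.

```{r}
# Hypothetical toy vector of words
toy.words.v <- c("pip", "joe", "pip", "estella", "pip", "joe")

# table counts each word type; as.data.frame turns the counts into rows and columns
toy.freqs.df <- as.data.frame(table(toy.words.v), stringsAsFactors = FALSE)
colnames(toy.freqs.df) <- c("word", "count")

# Sort from most to least frequent
toy.freqs.df <- toy.freqs.df[order(toy.freqs.df$count, decreasing = TRUE), ]
toy.freqs.df

# Round trip through a .csv file, as you would with a spreadsheet export
csv.file <- tempfile(fileext = ".csv")
write.csv(toy.freqs.df, csv.file, row.names = FALSE)
read.csv(csv.file)
```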

### Flow control: For-loops and conditionals in R

Flow control governs which commands run and how many times. It makes possible repetitive operations and pattern recognition, two of the more important reasons why we use programming languages. The most common form of repetition (**iteration**) is the `for()` loop, which has the following syntax:

`for (name in seq) {commands}`

This sets a variable called `name` (any name you choose) equal to each element of `seq` in turn; `seq` can be any sequence of values and is usually a vector. The commands inside the curly brackets run once for each element.

```{r}
for (i in letters[1:10]){
  cat(i, ", which is followed by \n")
}
```


What this literally means is: create a variable called `i` as an index for the loop. The first value of `i` is `a` (the first value of `letters[1:10]`, the sequence after the `in`), and R executes the commands within the curly brackets. The code above just prints `i` and the text ", which is followed by" with a new line (signified by the escape sequence `\n`, which is a string escape rather than a regex). When the closing bracket is reached, `i` moves on to the next value (the second letter). When the loop has run for the last value of the sequence (the tenth of the `letters`), it is complete.
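
A loop is often used to fill a container with results rather than to print them. Here is a small sketch with made-up words; `toy.words.v` and `word.lengths.v` are hypothetical names.

```{r}
# Hypothetical toy vector
toy.words.v <- c("great", "expectations", "pip")
word.lengths.v <- numeric(length(toy.words.v)) # empty container for the results

# seq_along gives the positions 1, 2, 3, so i can be used as an index
for (i in seq_along(toy.words.v)) {
  word.lengths.v[i] <- nchar(toy.words.v[i]) # store the length of the ith word
}
word.lengths.v
```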

Another simple example is the Fibonacci sequence. A for() loop can automatically generate the first 20 Fibonacci numbers.

```{r}
Fibonacci <- numeric(20) # creates a numeric vector called Fibonacci of length 20 (every element starts as 0)

Fibonacci[1] <- Fibonacci[2] <- 1 # sets the first and second elements to 1. This is important b/c the first two Fibonacci numbers are 1, and the next (3rd) number is gained by adding the first two

for (i in 3:20) Fibonacci[i] <- Fibonacci[i - 2] + Fibonacci[i - 1] # for each position i from the 3rd through the 20th, add together the two preceding elements (positions i - 2 and i - 1)
Fibonacci
```

There is another important component to flow control: the conditional. In programming this takes the form of `if()` statements.

**Syntax**

`if (condition) {commands when TRUE}`

`if (condition) {commands when TRUE} else {commands when FALSE}`

We will not have time to go into details regarding these operations, but it is important to recognize them when you are reading or modifying someone else's code.
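
For recognition purposes, here is a minimal sketch of an `if ... else` statement inside a for loop, using a made-up vector (`toy.words.v` and `father.count` are hypothetical names).

```{r}
# Hypothetical toy vector
toy.words.v <- c("father", "son", "father", "sister")

father.count <- 0
for (word in toy.words.v) {
  if (word == "father") {
    father.count <- father.count + 1  # runs only when the condition is TRUE
  } else {
    cat(word, "is not a match\n")     # runs when the condition is FALSE
  }
}
father.count
```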

Now, using what we know about regular expressions and flow control, let's have a look at a for() loop that Matthew Jockers uses in Chapter 4 of his *Text Analysis with R for Students of Literature*. It's a fairly complicated but useful way of breaking up a novel text into chapters for analysis. Let's use it to process the Dickens novel.

```{r}
text.v <- scan("dickens_great-expectations.txt", what="character", sep="\n", encoding = "UTF-8")
not.blanks.v <- which(text.v!="")
clean.text.v <- text.v[not.blanks.v]

start.v <- which(clean.text.v == "Chapter I")
end.v <- which(clean.text.v == "THE END")
novel.lines.v <- clean.text.v[start.v:end.v]
chap.positions.v <- grep("^Chapter \\w", novel.lines.v)

novel.lines.v[chap.positions.v]

chapter.raws.l <- list()
chapter.freqs.l <- list()

# the following for loop iterates over each item in chap.positions.v

for(i in 1:length(chap.positions.v)){
  # only process position i if another chapter position follows it, since that next position marks where this chapter ends.
  # note: as written, the chapter that starts at the last position in chap.positions.v is therefore skipped
  if(i != length(chap.positions.v)){
    # capture the chapter title: the line of novel.lines.v at the position held in chap.positions.v[i].
    # If this is confusing, try this in your console: run i <- 1, then novel.lines.v[chap.positions.v[i]]
    chapter.title <- novel.lines.v[chap.positions.v[i]]
    # adding 1 to the title position gives the first line of the chapter text (the first paragraph after the chapter title)
    start <- chap.positions.v[i]+1
    # here 1 is added to the index i, not to the value stored at position i, so chap.positions.v[i+1] is the position where the next chapter begins.
    # Subtracting 1 from that value gives the last line before the next chapter heading. Try it in the console: i <- 1, then chap.positions.v[i+1]
    end <- chap.positions.v[i+1]-1
    chapter.lines.v <- novel.lines.v[start:end] # having defined the start and end points of the chapter, extract its lines
    chapter.words.v <- tolower(paste(chapter.lines.v, collapse=" ")) # paste the chapter lines into a single block of text and lowercase it
    chapter.words.l <- strsplit(chapter.words.v, "\\W") # split the block of text into words at every non-word character
    chapter.word.v <- unlist(chapter.words.l) # flatten the list into a vector of words
    chapter.word.v <- chapter.word.v[which(chapter.word.v!="")] # drop the blanks
    chapter.freqs.t <- table(chapter.word.v) # tabulate the vector of words into a frequency count of each word type
    # store the table of raw counts in the empty list created before the loop. The double brackets name the list item; here each item is named with the chapter heading captured above
    chapter.raws.l[[chapter.title]] <- chapter.freqs.t
    chapter.freqs.t.rel <- 100*(chapter.freqs.t/sum(chapter.freqs.t)) # convert the raw counts to percentages of the words in the chapter
    chapter.freqs.l[[chapter.title]] <- chapter.freqs.t.rel
  }
}

chapter.freqs.l[1]

length(chapter.freqs.l) # the number of chapters that were processed
```
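
The logic of the loop is easier to follow on a tiny example. The sketch below uses a made-up five-line "novel" (all names beginning with `toy` are hypothetical). It also shows one way to make sure the final chapter is processed: append the position of the last line to the vector of chapter positions, so that the last chapter has an end marker too.

```{r}
# Hypothetical five-line "novel" with two chapters
toy.lines.v <- c("Chapter I", "My father was kind.", "Chapter II",
                 "My sister was not.", "THE END")
toy.positions.v <- grep("^Chapter \\w", toy.lines.v)

# Append the position of the last line so the final chapter has an end marker
toy.positions.v <- c(toy.positions.v, length(toy.lines.v))

toy.chapters.l <- list()
for (i in 1:(length(toy.positions.v) - 1)) {
  title <- toy.lines.v[toy.positions.v[i]]  # the chapter heading
  start <- toy.positions.v[i] + 1           # first line after the heading
  end <- toy.positions.v[i + 1] - 1         # last line before the next marker
  chapter.text <- tolower(paste(toy.lines.v[start:end], collapse = " "))
  words.v <- unlist(strsplit(chapter.text, "\\W"))
  toy.chapters.l[[title]] <- table(words.v[words.v != ""])
}
names(toy.chapters.l)
toy.chapters.l[["Chapter II"]]
```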

Suppose we wanted the relative frequency of the word "father" in each chapter.

```{r}
father.freqs <- lapply(chapter.freqs.l, '[', 'father')

father.freqs
```

You could also use variations of the `which` function to identify the chapters with the highest and lowest frequencies.

```{r}
which.max(father.freqs)

which.min(father.freqs)
```

### Exercise

Create a vector that confines your results to only the paragraphs with dialogue.

```{r}
dialogue.v <- grep('"(.*?)"', novel.lines.v) # grep is another regex function

novel.lines.v[dialogue.v][1:20] 
# check your work by finding all the dialogue lines in novel.lines.v
```

### Bonus Exercise

Modify the for loop in Jockers to find word frequencies only of content with dialogue.

```{r}
dialogue.chapter.raws.l <- list()
dialogue.chapter.freqs.l <- list()

for(i in 1:length(chap.positions.v)){
  if(i != length(chap.positions.v)){
    chapter.title <- novel.lines.v[chap.positions.v[i]]
    start <- chap.positions.v[i]+1
    end <- chap.positions.v[i+1]-1
    chapter.lines.v <- novel.lines.v[start:end]
    dialogue.lines.v <- grep('"(.*?)"', chapter.lines.v, value = TRUE) # here is the grep again, pruning the chapter.lines vector into lines with dialogue
    chapter.words.v <- tolower(paste(dialogue.lines.v, collapse=" "))
    chapter.words.l <- strsplit(chapter.words.v, "\\W")
    chapter.word.v <- unlist(chapter.words.l)
    chapter.word.v <- chapter.word.v[which(chapter.word.v!="")]
    chapter.freqs.t <- table(chapter.word.v)
    dialogue.chapter.raws.l[[chapter.title]] <- chapter.freqs.t
    chapter.freqs.t.rel <- 100*(chapter.freqs.t/sum(chapter.freqs.t))
    dialogue.chapter.freqs.l[[chapter.title]] <- chapter.freqs.t.rel
  }
}

dialogue.chapter.freqs.l[1]
```

### Visualising the data with `plot`

```{r}
# to extract the frequency data from all of the chapters at once
lapply(chapter.raws.l,mean)
# putting the results into a matrix object
mean.word.use.m <- do.call(rbind, lapply(chapter.raws.l,mean))
dim(mean.word.use.m)
# this reports one row per chapter in 1 column; the row names of the matrix also hold the chapter titles

plot(mean.word.use.m, type = "h", main = "Mean word usage patterns in each chapter of Dickens's Great Expectations", ylab = "mean word use", xlab = "Each chapter")
```

```{r}
# the scale function subtracts away the expected value (the overall mean) and divides by the standard deviation,
# so the result shows only the deviations from the mean, in standardised units
scale(mean.word.use.m)
plot(scale(mean.word.use.m), type = "h", main = "Scaled mean word usage patterns in each chapter of Dickens's Great Expectations", ylab = "mean word use", xlab = "Each chapter") 
```
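
To make the scaling step concrete, here is a small sketch on a made-up one-column matrix (`toy.m`, `scaled.m` and `manual.v` are hypothetical names). With its default settings, `scale` gives the same values as subtracting the mean and dividing by the standard deviation by hand.

```{r}
# Hypothetical one-column matrix
toy.m <- matrix(c(2, 4, 6, 8), ncol = 1)

scaled.m <- scale(toy.m) # centre on the mean, then divide by the standard deviation

# The same calculation done by hand
manual.v <- (toy.m[, 1] - mean(toy.m[, 1])) / sd(toy.m[, 1])

all.equal(as.vector(scaled.m), manual.v)
```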

This gives us a general impression of vocabulary density on a chapter-by-chapter basis. Let's now return to the previous word search of "father." Suppose we wanted to visualise the frequency of "father" alongside that of a similar concept, "son."

We have already used the `lapply` function above; let's look at it more closely. `lapply` is similar to a for loop, in that it iterates over the elements in a data structure, but it is specifically designed for dealing with lists. It takes a list as its first argument and the name of some other function as its second; any further arguments are passed on to that function.
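
Here is a minimal sketch of `lapply` on a made-up list of two frequency tables (`toy.freqs.l` is a hypothetical name). The square bracket is itself a function in R, which is why it can be handed to `lapply` in quotation marks.

```{r}
# Hypothetical list of two named frequency tables
toy.freqs.l <- list("Chapter I"  = table(c("father", "father", "son")),
                    "Chapter II" = table(c("sister", "son")))

# "[" is applied to every list item; the extra argument "father" says which element to pull out
lapply(toy.freqs.l, "[", "father")

# lapply with an ordinary function: count the word types in each chapter
lapply(toy.freqs.l, length)
```

A chapter that does not contain the word returns `NA`, which is why `NA` values show up in the results further down.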

```{r}
chapter.freqs.l[[1]]["father"]
```

```{r}
chapter.freqs.l[[10]]["son"]
```

The above is just an example: the word "father" has a relative frequency of about 0.27% in the first chapter (that is, roughly 27 occurrences for every 10,000 words, since the frequencies were already multiplied by 100 in the loop), and the word "son" appears with a relative frequency of about 0.04% in the 10th chapter. Now let's create lists that store the relative frequencies for each chapter.

```{r}
fathers.l <- lapply(chapter.freqs.l, '[', 'father')
sons.l <- lapply(chapter.freqs.l, '[', 'son')
```

Instead of just printing out the values held in these new lists, you can capture the results in a single matrix using the `rbind` function. The `do.call` function calls `rbind` with every item of the list as an argument, so each list item becomes a row; we do this for the "father" and "son" results in turn.

```{r}
fathers.m <- do.call(rbind, fathers.l)
sons.m <- do.call(rbind, sons.l)
```
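
As a small sketch of what `do.call` is doing here, consider a made-up list of three values (`toy.l` and `toy.m` are hypothetical names).

```{r}
# Hypothetical list of per-chapter values; NA marks a chapter without the word
toy.l <- list("Chapter I" = 0.27, "Chapter II" = NA, "Chapter III" = 0.15)

# Equivalent to calling rbind with each list item as a separate argument;
# the list names become the row names of the matrix
toy.m <- do.call(rbind, toy.l)
toy.m
dim(toy.m)
```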

Let's look at one of these matrices.

```{r}
sons.m
```

Compare it to the other matrix of "father" results.

```{r}
fathers.m
```

Next we create vectors for each search term: the following extracts the "father" and "son" values into two new vectors, and uses the `cbind` function to combine these vectors into a new matrix of 58 rows and 2 columns.

```{r}
fathers.v <- fathers.m[,1]
sons.v <- sons.m[,1]

fathers.sons.m <- cbind(fathers.v, sons.v)

dim(fathers.sons.m)
```

Now we can visualise these two word searches.

```{r}
colnames(fathers.sons.m) <- c("father", "son")

barplot(fathers.sons.m, beside=T, col="grey", ylab = "relative word frequency")

```