# Worksheet B-2: Strings and Regular Expressions


In this tutorial, you'll practice how to:

- Manipulate a character vector in R using the stringr package.
- Write simple regular expressions (regex).
- Apply regular expressions to data manipulation.

Load the requirements for this worksheet:

In [None]:
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(gapminder))
suppressPackageStartupMessages(library(testthat))
suppressPackageStartupMessages(library(digest))

The following code chunk has been unlocked to give you the flexibility to start this document with some of your own code. Remember, it's bad manners to keep a call to `install.packages()` in your source code, so if you ever need to run one here, don't forget to delete it afterwards.

In [None]:
# An unlocked code chunk.

# Part 1: Warming up to the stringr functions


## Question 1

There's that famous sentence about the "quick fox" that contains all letters of the alphabet, although we don't quite remember the sentence. Obtain a vector of all sentences from the `stringr::sentences` dataset containing the word `"fox"`. Store the resulting vector in a variable named `answer1`.

```
answer1 <- str_subset(FILL_THIS_IN, FILL_THIS_IN)
```
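As a rough illustration of what `str_subset()` does, here is a minimal sketch on a made-up vector (`pets` is hypothetical and not part of the worksheet data). It keeps only the elements that contain the pattern, and `str_detect()` shows the logical test behind that filter.

```r
library(stringr)

# A made-up character vector, used only for illustration.
pets <- c("a lazy dog", "two cats", "hot dog stand", "goldfish")

# Keep only the elements that contain the pattern "dog".
dog_strings <- str_subset(pets, "dog")
print(dog_strings)   # "a lazy dog" "hot dog stand"

# str_detect() returns the TRUE/FALSE vector that str_subset() filters by.
has_dog <- str_detect(pets, "dog")
print(has_dog)       # TRUE FALSE TRUE FALSE
```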

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer1)

In [None]:
test_that("Question 1", {
 expect_identical(digest(answer1), "b54efc522343ff2628fee7e71bd17747")
})

## Question 2

Make an (atomic) vector of the individual words in the sentence stored in `answer1`. Store the result in a variable named `answer2`.

Hint: Use `str_split(string, pattern)`, and carefully note what the output of this function is.
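To see why the output deserves a careful look, here is a small sketch on a hypothetical input (`phrases` is made up for illustration). `str_split()` returns a list with one character vector per input string, even when the input has length 1.

```r
library(stringr)

# Hypothetical input: two strings of space-separated words.
phrases <- c("one two three", "four five")

pieces <- str_split(phrases, " ")

# The result is a list, with one character vector per input string.
print(class(pieces))   # "list"
print(pieces[[1]])     # "one" "two" "three"
print(pieces[[2]])     # "four" "five"
```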

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer2)

In [None]:
test_that("Question 2", {
 expect_true(digest(answer2) %in% 
 c("e9776d44cb7da14ddffdfa985e1d8908", 
 "ed8ec2bd7d42477ad678dd7f3e077f6e"))
})

## Question 3

With stringr, we can substitute parts of a string, too. Replace the word "fox" from `answer1` with "giraffe" using `str_replace()`, and store the result in a variable named `answer3`.

```
answer3 <- str_replace(answer1, pattern = FILL_THIS_IN, replacement = FILL_THIS_IN)
```
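One detail worth knowing: `str_replace()` only replaces the first match in each string, while `str_replace_all()` replaces every match. A minimal sketch on a made-up string (`chant` is hypothetical):

```r
library(stringr)

# A made-up string in which the pattern occurs three times.
chant <- "row row row your boat"

# str_replace() changes only the first match in each string.
first_only <- str_replace(chant, pattern = "row", replacement = "sail")
print(first_only)   # "sail row row your boat"

# str_replace_all() changes every match.
every_one <- str_replace_all(chant, pattern = "row", replacement = "sail")
print(every_one)    # "sail sail sail your boat"
```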

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer3)

In [None]:
test_that("Question 3", {
 expect_identical(digest(answer3), "8659b349bfc1e359cdbb08cd38f5537d")
})

## Question 4: pig latin

Convert `stringr::words` (available as `words` once stringr is loaded) to a simplistic version of pig latin:

1. Move the first letter to the end of the word.
2. Add "ay" to the end of the word.

Hint: subset by position using `str_sub(string, start, end)`.
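As a quick sketch of how positions work in `str_sub()`, the example below uses a made-up vector (`animals` is hypothetical). Negative positions count from the end of the string, and `end` defaults to the last character.

```r
library(stringr)

# A made-up vector, used only for illustration.
animals <- c("zebra", "otter")

# Characters 1 through 3 of each string.
first_three <- str_sub(animals, 1, 3)
print(first_three)   # "zeb" "ott"

# Negative positions count from the end: the last two characters.
last_two <- str_sub(animals, -2, -1)
print(last_two)      # "ra" "er"

# With no end given, the substring runs to the end of the string.
from_third <- str_sub(animals, 3)
print(from_third)    # "bra" "ter"
```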

Store the result in a variable named `answer4`.

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
head(answer4)

In [None]:
test_that("Question 4", {
 expect_identical(digest(answer4), "66f9cc0b279607492b6d015f979210d2")
})

Now let's practice working with character columns in a tibble. Consider the wedding dataset on the UBC-STAT/stat545.stat.ubc.ca GitHub repository:

In [None]:
wedding <- suppressMessages(read_csv("https://raw.githubusercontent.com/UBC-STAT/stat545.stat.ubc.ca/master/content/data/wedding/attend.csv"))
head(wedding)

Back in Worksheet A-4, we used `tidyr::separate()` to split the `name` column into two columns, named `first` and `last`, containing the first and last names (which are currently separated by a space):

In [None]:
wedding_fl <- wedding %>% 
 separate(name, into = c("first", "last"), sep = " ")
head(wedding_fl)

## Question 5

Make a new column named `greeting` in which each row's entry has the following format:
`"Hello there, [FIRST_NAME_HERE] from party [PARTY_NUMBER_HERE]!"` Store the resulting tibble in a variable named `answer5`. 

```
answer5 <- wedding_fl %>%
 mutate(greeting = str_c(FILL_THIS_IN))
```

*Hint 1*: The `str_c()` function can take in any number of character vectors of the same length, stack them up side by side, and glue them together. 

*Hint 2*: `str_c()` can recycle values. For example, if you pass in a character vector of length 1 (say `"Apple"`) and a character vector of length 3 (say `c("Pie", "Crisp", "Crumble")`), along with `sep = " "`, it returns `c("Apple Pie", "Apple Crisp", "Apple Crumble")`.
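Recycling in `str_c()` can be sketched as follows. Note that the default separator is the empty string, so a space only appears between the pieces if you ask for one with `sep` (or include it in the strings themselves).

```r
library(stringr)

desserts <- c("Pie", "Crisp", "Crumble")

# The length-1 vector "Apple" is recycled against all three desserts.
# By default nothing is placed between the pieces.
no_sep <- str_c("Apple", desserts)
print(no_sep)     # "ApplePie" "AppleCrisp" "AppleCrumble"

# Supplying sep = " " puts a space between the pieces.
with_sep <- str_c("Apple", desserts, sep = " ")
print(with_sep)   # "Apple Pie" "Apple Crisp" "Apple Crumble"
```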

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer5, width=Inf)

In [None]:
test_that("Question 5", {
 expect_identical(
 digest(unclass(select(answer5, party, first, greeting))), 
 "428d7ce6c81771e8ac3cd6a64e31a614"
)
})

## Question 6

Make a tibble with one row per party, with columns named `people` and `wedding_status`:

- `people`: contains the first names of everyone in the party, separated by commas (and a space: `", "`).
- `wedding_status`: should be `"CONFIRMED"` if all their wedding status entries are `"CONFIRMED"`, and `"PENDING"` otherwise. 

Store the resulting tibble in a variable named `answer6`.

Starter code:

```
answer6 <- wedding_fl %>% 
 group_by(party) %>% 
 summarise(
 people = str_flatten(FILL_THIS_IN),
 wedding_status = if_else(FILL_THIS_IN, "CONFIRMED", "PENDING")
 )
```

*Hint*: The `str_flatten()` function concatenates a character vector into a single string, with an optional `collapse` argument that lets you put something between the vector elements as they are joined.
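A minimal sketch of `str_flatten()` on a made-up vector (`colours` is hypothetical), with and without the optional `collapse` argument:

```r
library(stringr)

# A made-up vector, used only for illustration.
colours <- c("red", "green", "blue")

# With no collapse string, the elements are simply joined together.
joined <- str_flatten(colours)
print(joined)      # "redgreenblue"

# The collapse argument is inserted between neighbouring elements.
slashed <- str_flatten(colours, collapse = " / ")
print(slashed)     # "red / green / blue"
```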

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer6)

In [None]:
test_that("Question 6", {
 expect_identical(
 digest(unclass(select(answer6, people, wedding_status))), 
 "cf18af7e44c899c48e53b83614739b86"
)
})

*Good to know*: If you wanted to be more grammatically correct, you could have replaced `str_flatten()` with the `str_flatten_comma()` function to enforce English grammar rules for listing off items. For example, this function would allow you to separate two names with `" and "`.
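Here is a small sketch of `str_flatten_comma()`, assuming stringr 1.5.0 or later (where this function is available). The vectors are made up for illustration. The `last` argument sets the text placed before the final item, and the two-item case is handled specially.

```r
library(stringr)

# Three items: commas between items, and ", and " before the last one.
three_items <- str_flatten_comma(c("red", "green", "blue"), last = ", and ")
print(three_items)   # "red, green, and blue"

# Two items: the comma is dropped, leaving just " and ".
two_items <- str_flatten_comma(c("red", "green"), last = ", and ")
print(two_items)     # "red and green"
```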

# Part 2: Introduction to Regular Expressions

Regular expressions -- or "Regex" for short -- express a pattern in text that can be passed into `stringr` functions to do powerful things. Let's start by learning the basics of how to write these patterns with the helpful `str_view()` function:

In [None]:
str_view(fruit, "melon")

The `str_view()` function took in a character vector containing fruits, and highlighted the entries matching the regular expression pattern `"melon"`. This is the simplest type of regex pattern: it looks for that exact sequence of characters anywhere in the string.

Let's learn more about the language of regex patterns using the countries in the gapminder data set: 

In [None]:
countries <- levels(gapminder::gapminder$country)
head(countries)

## Question 7: "any" characters

The "." character when used in a regular expression means "any single character". 

Use the `str_subset()` function to find all countries in the gapminder data set with the following pattern: "i", followed by any single character, followed by "a". Store the result in a vector named `answer7`. 

Note that Italy will not be on the list, because regex is case-sensitive.

*Good to know*: You can specify "any single character in this list of characters" or "any single character except those in this list of characters" using square brackets. For example, `"[abc]"` for "a, b, or c" and `"[^abc]"` for "anything but a, b, or c". 
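The difference between `.` and square brackets can be sketched on a made-up vector (`sounds` is hypothetical and unrelated to the gapminder countries):

```r
library(stringr)

# A made-up vector, used only for illustration.
sounds <- c("bat", "bet", "bit", "bot", "but")

# "." matches any single character between "b" and "t".
any_middle <- str_subset(sounds, "b.t")
print(any_middle)      # all five strings

# "[ae]" matches exactly one character that is "a" or "e".
a_or_e <- str_subset(sounds, "b[ae]t")
print(a_or_e)          # "bat" "bet"

# "[^ae]" matches exactly one character that is neither "a" nor "e".
not_a_or_e <- str_subset(sounds, "b[^ae]t")
print(not_a_or_e)      # "bit" "bot" "but"
```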

In [None]:
# str_view(countries, pattern = "FILL_THIS_IN")
# answer7 <- str_subset(countries, pattern = "FILL_THIS_IN")

# your code here
fail() # No Answer - remove if you provide an answer
print(answer7)

In [None]:
test_that("Question 7", {
 expect_identical(digest(answer7), "fdf1c0b93db219fb32d927700cab3c4e")
})

## Question 8: the "escape" 

Uh oh! But what if I wanted to literally search for countries with a period in the name? I can't use the regex `"."`, since that'll match "any single character". I need to "escape the period" to indicate that I really mean to search for the character ".", and don't mean to use the character "." in its special regex meaning. We can escape the period by adding `\\` in front of it.

"Escape the period" to make a vector of all countries with at least one period in their name. Store the result in a vector named `answer8`.

*Good to know*: If you've used regex outside of R, you might be surprised to see that we need to add `\\` rather than `\`. This is because `\` is itself a special character in R strings, so it needs to be escaped with another `\`.
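To make the double backslash less mysterious, the sketch below prints the string that the regex engine actually receives and compares an unescaped and an escaped period on a made-up vector (`files` is hypothetical).

```r
library(stringr)

# A made-up vector, used only for illustration.
files <- c("notes.txt", "notes_txt", "readme")

# The R string "\\." contains two characters: a backslash and a period.
writeLines("\\.")   # prints \.

# An unescaped "." matches any single character, so every string matches.
any_char <- str_subset(files, ".")
print(any_char)     # "notes.txt" "notes_txt" "readme"

# An escaped period matches only a literal ".".
literal_dot <- str_subset(files, "\\.")
print(literal_dot)  # "notes.txt"
```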

In [None]:
# str_view(countries, pattern = "FILL_THIS_IN")
# answer8 <- str_subset(countries, pattern = "FILL_THIS_IN")

# your code here
fail() # No Answer - remove if you provide an answer
print(answer8)

In [None]:
test_that("Question 8", {
 expect_identical(digest(answer8), "4c500f226f5abbe540ef2506a4644375")
})

## Question 9: Position indicators

Use:

- `^` to correspond to the __beginning__ of a string.
- `$` to correspond to the __end__ of a string.
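The two position indicators can be sketched on a made-up vector (`snacks` is hypothetical):

```r
library(stringr)

# A made-up vector, used only for illustration.
snacks <- c("apple pie", "pineapple", "apple")

# "^apple" requires the string to start with "apple".
starts_with <- str_subset(snacks, "^apple")
print(starts_with)   # "apple pie" "apple"

# "apple$" requires the string to end with "apple".
ends_with <- str_subset(snacks, "apple$")
print(ends_with)     # "pineapple" "apple"

# Using both anchors requires the whole string to be exactly "apple".
exactly <- str_subset(snacks, "^apple$")
print(exactly)       # "apple"
```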

Find all countries that end in "land". Store the result in a vector named `answer9`.

In [None]:
# str_view(countries, "FILL_THIS_IN")
# answer9 <- str_subset(countries, "FILL_THIS_IN")

# your code here
fail() # No Answer - remove if you provide an answer
print(answer9)

In [None]:
test_that("Question 9", {
 expect_identical(digest(answer9), "692ee00b59194cea743c5ac3bf2302ae")
})

## Question 10: Quantifiers/Repetition

The handy ones are:

- `*` for 0 or more
- `+` for 1 or more
- `?` for 0 or 1
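Each quantifier applies to the character (or group) immediately before it. A minimal sketch on a made-up vector (`strings` is hypothetical):

```r
library(stringr)

# A made-up vector with zero, one, two, and three "a"s between "c" and "t".
strings <- c("ct", "cat", "caat", "caaat")

# "a*" allows zero or more "a"s.
zero_or_more <- str_subset(strings, "ca*t")
print(zero_or_more)   # "ct" "cat" "caat" "caaat"

# "a+" requires at least one "a".
one_or_more <- str_subset(strings, "ca+t")
print(one_or_more)    # "cat" "caat" "caaat"

# "a?" allows at most one "a".
zero_or_one <- str_subset(strings, "ca?t")
print(zero_or_one)    # "ct" "cat"
```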

Find all countries whose name contains an "r" immediately followed by one or more "o"s. Store the resulting vector in a variable named `answer10`.

In [None]:
# str_view(countries, "FILL_THIS_IN")
# answer10 <- str_subset(countries, "FILL_THIS_IN")

# your code here
fail() # No Answer - remove if you provide an answer
print(answer10)

In [None]:
test_that("Question 10", {
 expect_identical(digest(answer10), "fa31d9cfe634b9a841cabdf9e31c0eeb")
})

## Question 11: "Or" and Precedence

Use `|` to denote "or". Placing characters one after another (an implied "and then") binds more tightly than `|`, so `"bee|ar"` means "bee" or "ar". Use parentheses to be deliberate with precedence.

For example:

In [None]:
bbb <- c("bear", "beer", "bar")
cat("'bee' or 'ar':")
str_view(bbb, pattern = "bee|ar")
cat("'e' or 'a':")
str_view(bbb, pattern = "be(e|a)r") 

Now, find all countries that have either "o" twice in a row or "e" twice in a row ("oe" and "eo" are not allowed). Store the resulting vector in a variable named `answer11`.

In [None]:
# str_view(countries, "FILL_THIS_IN")
# answer11 <- str_subset(countries, "FILL_THIS_IN")

# your code here
fail() # No Answer - remove if you provide an answer
print(answer11)

In [None]:
test_that("Question 11", {
 expect_identical(digest(answer11), "558af24f1d19b86ffc6b74541aef9f9b")
})

## Question 12: Groups

You can use parentheses not only to specify precedence, but also to define groups that you can refer back to later by number: `\\1` for the first group, `\\2` for the second, and so on.

Example using a's and b's: matching all instances of a character sandwiched between the same two characters:

In [None]:
ab <- c("aaa", "aab", "aba", "baa", "abb", "bab", "bba", "bbb")
str_view(ab, pattern="(.)(.)\\1")

Example: matching all instances of a character followed by two identical characters:

In [None]:
str_view(ab, pattern="(.)(.)\\2")

Your task: Find all countries that have the same letter twice in a row (like "Greece", which has "ee"). Store the result in a vector named `answer12`.

In [None]:
# str_view(countries, "FILL_THIS_IN")
# answer12 <- str_subset(countries, "FILL_THIS_IN")

# your code here
fail() # No Answer - remove if you provide an answer
print(answer12)

In [None]:
test_that("Question 12", {
 expect_identical(digest(answer12), "3531a88f6935e86d4ff1054504182875")
})

# Part 3: Stringr with regular expressions

Now that you have your bearings with stringr and with regular expressions, let's practice putting them together in (semi)realistic scenarios. 

Useful links: 
- [Posit Strings cheatsheet](https://github.com/rstudio/cheatsheets/blob/main/strings.pdf) covers stringr on page 1 and regular expressions on page 2.
- [Regexlearn.com](https://regexlearn.com/) for another regular expressions tutorial (general, not specific to R). 
- [Regexr](https://regexr.com/) is very helpful, especially when constructing more complex regular expressions.

## Question 13

Select individuals in the wedding tibble whose first name starts between "A" and "Em" inclusive, and sort them in alphabetical order by first name. Store the resulting tibble in a variable named `answer13`.

Starter code:

```
answer13 <- wedding %>% 
 filter(FILL_THIS_IN(name, "FILL_THIS_IN")) %>% 
 arrange(name)
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer13)

In [None]:
test_that("Question 13", {
 expect_identical(digest(sort(answer13$name)), "6bbf440d3cca5b2e4b670b48f7bddc14")
})

## Question 14

Add a column called `prop_vowels` to the `wedding_fl` tibble that contains the proportion of vowels in each first name. For example, "Emaan" has 3 vowels and 5 letters, so the proportion of vowels is 3/5 = 60%. Store the resulting tibble in a variable named `answer14`.
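Two stringr functions that may be useful for counting are `str_count()`, which counts pattern matches in each string, and `str_length()`, which counts characters. The sketch below uses a made-up vector and a pattern unrelated to the task (`fruits` and the `"[an]"` pattern are hypothetical choices for illustration).

```r
library(stringr)

# A made-up vector, used only for illustration.
fruits <- c("banana", "kiwi")

# Number of matches of a single character in each string.
n_a <- str_count(fruits, "a")
print(n_a)        # 3 0

# A character class counts matches of any character in the set.
n_a_or_n <- str_count(fruits, "[an]")
print(n_a_or_n)   # 5 0

# Number of characters in each string.
n_chars <- str_length(fruits)
print(n_chars)    # 6 4
```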

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer14)

In [None]:
test_that("Question 14", {
 expect_identical(
 answer14 %>% 
 select(first, prop_vowels) %>% 
 arrange(first) %>%
 mutate(prop_vowels = round(prop_vowels, digits = 3)) %>% 
 digest(), 
 "f37d71c09ea60921ddd3959252db3f14"
 )
})

## Question 15

Task: what letters are used in the first sentence of the `stringr::sentences` dataset? Make a vector of all the unique letters in the sentence (in lowercase), and store it in a variable called `answer15`. Don't forget to remove non-letters, which are either a space or a period.

Hint:

```
answer15 <- sentences[1] %>% 
 str_remove_all("FILL_THIS_IN") %>% 
 FILL_THIS_IN() %>% 
 str_split(FILL_THIS_IN) %>% 
 .[[1]] %>% 
 unique()
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer15)

In [None]:
test_that("Question 15", {
 expect_identical(digest(sort(answer15)), "d586631001ba6d44947a09efecc4f960")
})

## Question 16

Here is a tibble with made-up names and telephone numbers: 

In [None]:
contact <- tibble(name = c("Kayden Lavoie", 
 "Ethan Fortin", 
 "Emma Davis", 
 "Aliyah Chan"), 
 phone = c("604-971-9949", 
 "6046182277", 
 "(778)881-5831", 
 "604-544-2554"))
print(contact)

Unfortunately, these four people entered their phone numbers in different formats. Let's fix that, in the spirit of routine data cleaning. Change the `phone` column so that all four phone numbers match Aliyah Chan's format. Store the resulting tibble in a variable called `answer16`.

*Hint*: `mutate()`, `str_remove_all()`, `separate()`, and `unite()`. 
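As a rough sketch of how these four functions can work together, the example below cleans a made-up column of product codes (the `parts` tibble, its `code` column, and the `prefix` and `number` names are all hypothetical). It strips unwanted characters, splits each string at a fixed position, and glues the pieces back together with a chosen separator. It is one possible pattern under those assumptions, not the only way to approach the task.

```r
library(dplyr)
library(stringr)
library(tidyr)
library(tibble)

# Made-up product codes written in inconsistent formats.
parts <- tibble(code = c("AB-1234", "cd 5678", "EF9012"))

tidy_parts <- parts %>%
  # Remove every hyphen and space, e.g. "AB-1234" becomes "AB1234".
  mutate(code = str_remove_all(code, "[- ]")) %>%
  # A numeric sep splits by position: here, after the 2nd character.
  separate(code, into = c("prefix", "number"), sep = 2) %>%
  # Paste the pieces back into one column with "/" in between.
  unite("code", prefix, number, sep = "/")

print(tidy_parts$code)   # "AB/1234" "cd/5678" "EF/9012"
```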

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer16)

In [None]:
test_that("Question 16", {
 expect_identical(answer16 %>% arrange(name) %>% digest(), 
 "83e461c86c45469beecbbd011c6b11e6")
})