# Worksheet B-3: Nesting, List Columns, and `purrr`


After working through this topic, students should be able to:

- Use the `map` family of functions from the purrr package to iteratively apply a function.
- Create and operate on list columns in a tibble using `nest()`, `unnest()`, and the `map` family of functions.
- Define functions on-the-fly within a `map` function using shortcuts.
- Apply list columns to cases in data analysis: columns of models, columns of nested lists (JSON-style data), and operating on entire groups within a tibble.

Load the worksheet requirements:

In [None]:
suppressPackageStartupMessages(library(testthat))
suppressPackageStartupMessages(library(digest))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(palmerpenguins))
suppressPackageStartupMessages(library(glue))
suppressPackageStartupMessages(library(gapminder))
suppressPackageStartupMessages(library(broom))
suppressPackageStartupMessages(library(repurrrsive))

The following code chunk has been unlocked, to give you the flexibility to start this document with some of your own code. Remember, it's bad manners to keep a call to `install.packages()` in your source code, so don't forget to delete these lines if you ever need to run them. 

In [None]:
# An unlocked code chunk.

# Part 1: Exploring `purrr` Fundamentals

The `purrr` package is part of the `tidyverse`, so it was already loaded above by `library(tidyverse)`.

Apply a function to each element in a list/vector with `map`.

General usage: `purrr::map(VECTOR_OR_LIST, YOUR_FUNCTION)`

Note:

- `map` always returns a list.
- `YOUR_FUNCTION` can return anything!

There are many variations of `map_*`, which you can find in this [cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/purrr.pdf).
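As a minimal sketch of how the variants differ, the example below applies `map`, `map_int`, and `map_chr` to a small made-up list (`scores` is a hypothetical input, not part of the worksheet data).

```r
library(purrr)

# Hypothetical toy input: a named list of integer vectors.
scores <- list(a = 1:3, b = 4:6, c = 7:10)

# `map()` always returns a list, one element per input element.
totals_list <- map(scores, sum)

# `map_int()` returns an integer vector instead, and errors if the
# function does not return a single integer for each element.
lengths_int <- map_int(scores, length)

# `map_chr()` returns a character vector.
types_chr <- map_chr(scores, typeof)
```

The suffix after `map_` names the type of the output vector, which makes the result easier to use in later steps than a list.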

For the next few tasks, you will be converting for-loop(s) to vectorized expressions that reproduce the output (numbers should be the same, the format can be different).

## QUESTION 1

Without using vectorization, take the square root of the following vector:

In [None]:
x <- 1:10

The result should be a list of the calculations. Store your answer in `answer1`.

```r
answer1 <- map(FILL_THIS_IN, FILL_THIS_IN)
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
answer1

In [None]:
test_that('Question 1', {
 expect_known_hash(mode(answer1), '086ebc4c59c08c43e75bae74f1e16897')
 expect_known_hash(round(unlist(answer1), 4), 'ad16817e39d61cdf2ce38234f61306de')
})

## QUESTION 2 

In Question 1, we used the generic `map` function, and got a list. Let's use a more specific `map_*` function this time.

Again without using vectorization, square each component of `x`. The result should be a numeric vector. Store your answer in `answer2`:

```r
answer2 <- map_dbl(FILL_THIS_IN, FILL_THIS_IN)
```

_Hint:_ The last `FILL_THIS_IN` corresponds to an anonymous function!
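For reference, here is a small sketch of three common ways to write an anonymous function inside a `map_*` call. The vector `v` and the "add one" operation are made up for illustration and are unrelated to the task above.

```r
library(purrr)

v <- c(1, 2, 3)

# 1. A full anonymous function.
plus_one_a <- map_dbl(v, function(elem) elem + 1)

# 2. The backslash shorthand (available in R 4.1 and later).
plus_one_b <- map_dbl(v, \(elem) elem + 1)

# 3. A purrr-style formula, where `.x` stands for the current element.
plus_one_c <- map_dbl(v, ~ .x + 1)
```

All three calls give the same numeric vector, so the choice between them is mostly a matter of readability.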

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer2)

In [None]:
test_that('Question 2', {
 expect_known_hash(mode(answer2), '46606ee201b428a3fa6c8a0d3d9e671c')
 expect_known_hash(round(unlist(answer2), 4), '84a2193460cb35ff884e4c3144abf122')
})

We have now used both `map` and the more specific `map_dbl`, so you can see how they differ and why one can be a better fit than the other for a given purpose. Now it's your turn to choose! Remember to use the [cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/purrr.pdf) if you need it!

## QUESTION 3

Below is sample code that computes the mean of every column in the `mtcars` dataset. Use the appropriate `purrr` function to vectorize this task.

In [None]:
mtcars_means <- numeric()
for (c in seq_along(mtcars)){
 mtcars_means[[c]] <- mean(mtcars[[c]])
}
mtcars_means

Store your answer in `answer3`; as above, your answer should be a vector. _Hint_: remember that a tibble / data frame is just a list, where each entry is a column (a vector).

```r
answer3 <- FILL_THIS_IN(datasets::mtcars, FILL_THIS_IN)
```
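To illustrate the hint that a data frame is a list of columns, here is a small sketch on a made-up data frame (`toy_df` is hypothetical). Mapping over it visits one column at a time.

```r
library(purrr)

# Hypothetical toy data frame with columns of different types.
toy_df <- data.frame(
  id    = 1:3,
  label = c("x", "y", "z"),
  score = c(0.5, 1.5, 2.5),
  stringsAsFactors = FALSE
)

# A data frame is a list whose elements are its columns.
df_is_list <- is.list(toy_df)

# Iterating over the data frame applies the function to each column,
# and the column names carry over to the result.
col_types <- map_chr(toy_df, class)
```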

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer3)

In [None]:
test_that('Question 3', {
 expect_known_hash(floor(unname(answer3)), '9a69e180a47954630685d24f403fe3af')
})

## QUESTION 4

Below is sample code that computes the number of unique values in each column of `mtcars` as a named vector, using for-loops. Use the appropriate `purrr` function to vectorize this task.

In [None]:
mtcars_unique <- numeric()
for (c in seq_along(datasets::mtcars)){
 mtcars_unique[[c]] <- length(unique(datasets::mtcars[[c]]))
}
names(mtcars_unique) <- names(datasets::mtcars)
mtcars_unique

Store your answer in `answer4`:

```r
answer4 <- datasets::mtcars %>% 
 FILL_THIS_IN(FILL_THIS_IN) %>% 
 FILL_THIS_IN(FILL_THIS_IN)
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer4)

In [None]:
test_that('Question 4', {
 expect_known_hash(mode(answer4), '46606ee201b428a3fa6c8a0d3d9e671c')
 expect_known_hash(as.integer(answer4), '1981b33e1151073e1c227fe95218c6f5')
})

## QUESTION 5

Below is sample code that divides the values in each column of the `mtcars` dataset by the maximum in that column. 

In [None]:
for (i in seq_along(mtcars)){
 mtcars[[i]] <- mtcars[[i]] / max(mtcars[[i]], na.rm = TRUE)
}
head(mtcars)

Find a way to do this without a loop. Store your answer in `answer5`:

```r
answer5 <- datasets::mtcars %>%
 mutate(FILL_THIS_IN(FILL_THIS_IN, FILL_THIS_IN))
```

_Hint_: The last `FILL_THIS_IN` corresponds to an anonymous function.

_Food for thought_: Here is some sample code that divides the values in each column of the `mtcars` dataset by the maximum in that column using `purrr` instead of a loop or `mutate`: `map(mtcars, function(x) x/max(x))`. How and why is the output of this code different from `answer5`?

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
head(answer5)

In [None]:
test_that('Question 5', {
 expect_known_hash(class(answer5), '555434c8748e07b094500256087cdcc5')
 expect_known_hash(dimnames(answer5), '3a51b37e4731153f63a1f5f9dc188269')
 expect_known_hash(round(answer5$mpg, 3), 'af82f570a0aa02d8abcbbd14386e98b0')
})

## QUESTION 6: `map2`

Let's use `purrr` to calculate some probabilities from negative binomial distributions. One parameterization of the negative binomial distribution has a mean parameter and a dispersion parameter - this parameterization is often used to model highly variable count-valued data. 

We would like to calculate the probability of observing a count of 0 and the probability of observing a count of 1 for a bunch of negative binomial distributions. Here's a function to do this for a single negative binomial distribution: 

In [None]:
prob_01_neg_bin <- function(mean, disp) { 
 dnbinom(x=c(0, 1), mu=mean, size=disp) 
}

prob_01_neg_bin(1, 1) # calculates probability of observing 0 count and probability of observing 1 count

The `prob_01_neg_bin()` function has two arguments corresponding to the two parameters of the negative binomial distribution, so to calculate probabilities for a bunch of negative binomial distributions, we'd need a `purrr` function to plug in the two parameters. Here are the parameters we would like to plug in: 

In [None]:
mean_params <- c(1, 5, 10)
disp_params <- c(0.1, 1, 5)
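As a small sketch of how `map2` pairs up its two inputs, consider the example below. The vectors `words` and `times` are made up and have nothing to do with the negative binomial parameters above.

```r
library(purrr)

# Hypothetical inputs, unrelated to the worksheet's parameters.
words <- c("ab", "cd", "ef")
times <- c(1, 2, 3)

# `map2()` calls the function with the 1st elements of both inputs,
# then the 2nd elements, and so on, and returns a list.
repeated <- map2(words, times, function(w, k) rep(w, k))

# `map2_chr()` does the same but simplifies to a character vector;
# in the formula shorthand, `.x` and `.y` are the paired elements.
pasted <- map2_chr(words, times, ~ paste0(.x, .y))
```

The two inputs are walked in parallel, so they should have the same length (or one of them length 1).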

Store your answer in `answer6`. It should be a list of probabilities.

```r
answer6 <- map2(FILL_THIS_IN, FILL_THIS_IN, FILL_THIS_IN)
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer6)

In [None]:
test_that('Question 6', {
 expect_known_hash(round(unlist(answer6), 4), 'b9facb41d852fe0d051fc0c943d0fc5a')
})

## Question 7

Introducing the Big Bang `!!!` and `rlang::exec()`.

Using the `palmerpenguins::penguins` tibble:

Let's use the `recode()` function from the `dplyr` package to rename the levels of the `island` factor variable. The straightforward way to do this would be to do:

In [None]:
penguins %>% mutate(island = 
 recode(island, Biscoe = "B", Dream = "D", Torgersen = "T"))

But what if you wanted to specify the levels to be renamed and their new names via a single object, such as a named vector?

In [None]:
island_level_key <- c(Biscoe = "B", Dream = "D", Torgersen = "T")

Passing the vector itself via `penguins %>% mutate(island = recode(island, island_level_key))` throws an error, because `dplyr::recode()` expects each replacement as its own named argument rather than bundled into one vector or list. What's the alternative?

Your task: use the big bang operator (`!!!`) in front of the vector argument to get the desired result. This effectively takes the elements of a vector or list and passes them as separate arguments to a function.

```r
answer7 <- penguins %>% mutate(island = recode(island, FILL_THIS_IN))
```

__FYI__: Conveniently, `dplyr::recode()` recognizes the big bang operator. If you have a function that doesn't recognize it (like `sum()`), use `rlang::exec(function, !!!list_of_arguments)` instead.
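A minimal sketch of the `rlang::exec()` route is shown below. The argument lists are made up for illustration.

```r
# Hypothetical arguments stored in a list.
num_args <- list(1, 2, 3)

# `exec()` splices the list elements in as separate arguments,
# so this is the same as calling sum(1, 2, 3).
total <- rlang::exec(sum, !!!num_args)

# Named list elements become named arguments,
# so this is the same as calling paste("a", "b", sep = "-").
paste_args <- list("a", "b", sep = "-")
joined <- rlang::exec(paste, !!!paste_args)
```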

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer7)

In [None]:
test_that('Question 7', {
 expect_known_hash(answer7 %>% pull(island), 
 "ed9cf5d471bb4656bbb8ad65ada2a605")
})

# Part 2: Nesting and List Columns

_One_ of the ways a list-column can be made is by using `nest()`.
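Here is a small sketch of what `nest()` does, using a made-up tibble (`toy`, `bundle`, and the column names are hypothetical).

```r
library(tibble)
library(tidyr)

toy <- tibble(
  group = c("a", "a", "b"),
  x     = 1:3,
  y     = c(10, 20, 30)
)

# Bundle `x` and `y` into a list-column called `bundle`.
# The columns left out of the bundle (here `group`) define the rows that remain.
toy_nested <- nest(toy, bundle = c(x, y))

# Each entry of the list-column is itself a tibble.
first_bundle <- toy_nested$bundle[[1]]
```

The name on the left of `=` becomes the name of the new list-column, and the selection on the right says which columns go inside it.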

## QUESTION 8

Create a tibble that bundles everything in `gapminder` except for `country` and `continent` into a list-column. Name your list column `other` (without using `rename()` or `mutate()`). Store your answer in `answer8`.

```r
answer8 <- gapminder %>%
 nest(FILL_THIS_IN = FILL_THIS_IN)
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
head(answer8, n = 3)

In [None]:
test_that('Question 8', {
 expect_known_hash(enc2utf8(sapply(answer8$other, colnames)), 'ceba7fd58def34a537b5b13430a7ec2a')
 expect_known_hash(sapply(answer8$other, dim), '388a8eae98b3cb184d3fe8ed8dd46916')
 expect_known_hash(sapply(answer8$other, `[[`, 'year'), '0370844f5c0d097891d284949811883e')
})

## QUESTION 9

_Reproducibly_ sample 5 countries at random from the `gapminder` tibble. Set the seed to 123, and store your answer in `answer9`.

```r
FILL_THIS_IN(123)
answer9 <- gapminder %>%
 nest(FILL_THIS_IN = FILL_THIS_IN) %>% 
 slice_sample(n=5) %>% 
 unnest(FILL_THIS_IN)
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
head(answer9)

In [None]:
test_that('Question 9', {
 expect_known_hash(sort(enc2utf8(as.character(answer9$country)), method = 'radix'), 'ca060e5983d51a09aeb24c2393462353')
 expect_known_hash(round(answer9$gdpPercap[order(enc2utf8(as.character(answer9$country)), method = 'radix')], 3), '75d94ea77a21140b54a374ed59f2253a')
})

## QUESTION 10

For each `gapminder` continent, fit a linear model of `lifeExp` from `log(gdpPercap)` and put this as a new column. Store your answer into `answer10`:

```r
answer10 <- gapminder %>% 
 select(continent, gdpPercap, lifeExp) %>% 
 nest(data = c(FILL_THIS_IN, FILL_THIS_IN)) %>% 
 mutate(model = FILL_THIS_IN(data, ~ lm(FILL_THIS_IN ~ FILL_THIS_IN, data = FILL_THIS_IN)))
```
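To show the general shape of a "column of models" on different data, here is a sketch using `datasets::mtcars`, grouped by `cyl`. The choice of `mpg ~ wt` and the column names `fit` and `slope` are illustrative assumptions, not part of the question.

```r
library(dplyr)
library(tidyr)
library(purrr)

cyl_models <- datasets::mtcars %>%
  select(cyl, mpg, wt) %>%
  nest(data = c(mpg, wt)) %>%
  mutate(
    # One linear model per row; `.x` is that row's nested data frame.
    fit = map(data, ~ lm(mpg ~ wt, data = .x)),
    # Pull a single number out of each model with a typed map.
    slope = map_dbl(fit, ~ coef(.x)[["wt"]])
  )
```

Keeping the models in a list-column means each model stays on the same row as the group it was fitted to, so later steps can map over the models without losing track of the grouping.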

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer10)

In [None]:
test_that('Question 10', {
 expect_known_hash(sapply(answer10$model, class), '2fe5bf6c6fb725f272c801e5f7560afe')
 expect_known_hash(round(unlist(lapply(answer10$model, coef)), 3), 'e536d3378586d3c54b920504b3238cde')
})

## QUESTION 11

Using your model from Question 10, make predictions using `augment()` from the `broom` package, and then `unnest`. Store your answer in `answer11`:

```r
answer11 <- answer10 %>% 
 mutate(continent, yhat = map(FILL_THIS_IN, FILL_THIS_IN), .keep = "none") %>% 
 unnest(FILL_THIS_IN)
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer11)

In [None]:
test_that('Question 11', {
 expect_known_hash(dimnames(answer11), 'f58a9b4d518ca8cb04492350384da646')
 expect_known_hash(round(with(answer11, .fitted[order(lifeExp)]), 3), 'f8db1a712fe6b69f7577c88882248dc5')
 expect_known_hash(round(with(answer11, .sigma[order(lifeExp)]), 3), '83e927e2f02fd18c293c3a1ddb7a0ed2')
})

## Question 12: `pmap`

Using the `palmerpenguins::penguins` tibble, calculate a 95% confidence interval based on the one-sample t-test for the mean body mass in grams of each penguin species. Here is code to do this for the Adelie penguins: 

In [None]:
get_adelie_stats <- penguins %>% 
 filter(species == "Adelie") %>% 
 summarise(mean = mean(body_mass_g, na.rm = TRUE), 
 sd = sd(body_mass_g, na.rm = TRUE), 
 n = n())

# Given a sample mean, sample standard deviation, and sample size, 
# calculate a 95% confidence interval for the population mean
make_95p_conf_int_t <- function(mean, sd, n) { 
 c(lower_95p_ci = mean - qt(0.975, df=n-1) * sd / sqrt(n), upper_95p_ci = mean + qt(0.975, df=n-1) * sd / sqrt(n)) 
}

make_95p_conf_int_t(get_adelie_stats$mean, get_adelie_stats$sd, get_adelie_stats$n)

To do this for all three species of penguins, we will need a `purrr` function to plug the three parameters of `make_95p_conf_int_t()` into. 
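As a sketch of how `pmap` matches inputs to arguments, here is a made-up example. The function `box_volume()` and the tibble `boxes` are hypothetical and unrelated to the penguin data.

```r
library(tibble)
library(purrr)

# Hypothetical helper with three arguments.
box_volume <- function(len, wid, hgt) len * wid * hgt

# Column names match the argument names of `box_volume()`.
boxes <- tibble(len = c(1, 2), wid = c(3, 4), hgt = c(5, 6))

# `pmap_dbl()` takes a list (or data frame) of inputs, matches its names
# to the function's argument names, and calls the function once per row.
volumes <- pmap_dbl(boxes, box_volume)
```

Because matching is by name, the order of the columns does not matter as long as every name corresponds to an argument of the function.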

Starter code:

```r
answer12 <- penguins %>% 
 group_by(species) %>% 
 summarise(mean = mean(body_mass_g, na.rm = TRUE),
 sd = sd(body_mass_g, na.rm = TRUE), 
 n = n()) %>% 
 mutate(ci = pmap(FILL_THIS_IN, FILL_THIS_IN))
```

Your answer should be a tibble with five columns named `species`, `mean`, `sd`, `n`, and `ci`. 

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer12)

In [None]:
test_that('Question 12', {
 answer12 %>% 
 pull(ci) %>% 
 unlist() %>% 
 round(3) %>%
 expect_known_hash('bcdb58de75d3144dcfc728e79860e14d')
 expect_true(all(c("species", "mean", "sd", "n", "ci") %in% names(answer12)))
})

## Question 13: `unnest()`

`unnest()` need not always be paired with `nest()`. Make a tibble `answer13` that is a longer version of `answer12` where the column `ci` is of double type. 
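For example, a list-column built by hand can be unnested directly. The sketch below uses a made-up tibble (`readings` is hypothetical).

```r
library(tibble)
library(tidyr)

# A list-column of numeric vectors, created without `nest()`.
readings <- tibble(
  sensor = c("s1", "s2"),
  value  = list(c(1.5, 2.5), c(3.5, 4.5, 5.5))
)

# Each vector element gets its own row; `sensor` is repeated as needed,
# and `value` becomes an ordinary double column.
readings_long <- unnest(readings, value)
```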

Starter code:

```r
answer13 <- answer12 %>% unnest(FILL_THIS_IN)
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer13)

In [None]:
test_that('Question 13', {
 answer13 %>% 
 pull(ci) %>% 
 round(4) %>% 
 expect_known_hash("790f2dedb437d642c8fdbca63117eac1")
 expect_true(all(c("species", "mean", "sd", "n", "ci") %in% names(answer13)))
})

## Question 14

Output a list of gapminder tibbles, one for each continent. Do not include the `continent` column in the divided tibbles -- the name of each list entry should be the continent name. 

_Hint_: Check out the `enframe()` and `deframe()` functions. 
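A quick sketch of what these two functions do, on a made-up two-column tibble (`lookup_tbl` is hypothetical):

```r
library(tibble)

# `deframe()` turns a two-column tibble into a named vector (or named list):
# the first column supplies the names, the second the values.
lookup_tbl <- tibble(code = c("x", "y"), value = c(10, 20))
lookup_vec <- deframe(lookup_tbl)

# `enframe()` goes the other way, from a named vector to a two-column tibble.
round_trip <- enframe(lookup_vec, name = "code", value = "value")
```

When the second column is a list-column, `deframe()` returns a named list instead of a named atomic vector.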

Starter code:

```r
answer14 <- gapminder %>% 
 nest(FILL_THIS_IN) %>% 
 FILL_THIS_IN()
```


In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer14)

In [None]:
test_that('Question 14', {
 expect_known_hash(names(answer14), '90da4aa25e5abc752edec3d524ea2677')
 map(answer14, pull, gdpPercap) %>% 
 unlist() %>% 
 unname() %>% 
 round(4) %>% 
 expect_known_hash("a621bfc9dba8da1f02e4dc19fa4083f6")
})

## Question 15 

Sometimes the vector/list we're iterating over has names, and it's useful to use those names. To access these names, use the `imap` family.
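A minimal sketch of the idea, using a made-up named vector (`prices` is hypothetical):

```r
library(purrr)

prices <- c(apple = 1.5, pear = 2)

# In the formula shorthand for `imap()`, `.x` is the value and `.y` is its name.
price_labels <- imap_chr(prices, ~ paste0(.y, ": ", .x))
```

If the input has no names, `imap` passes the position (1, 2, ...) as `.y` instead.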

For the list of tibbles made in the above question, save each one to a file using the appropriate `purrr` function, using the names as the file names.

Starter code:

```r
answer15 <- FILL_THIS_IN(answer14, ~ write_csv(FILL_THIS_IN, glue::glue(FILL_THIS_IN, ".csv")))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
dir()

In [None]:
test_that('Question 15', {
 expect_true("Africa.csv" %in% dir())
 expect_true("Americas.csv" %in% dir())
 expect_true("Asia.csv" %in% dir())
 expect_true("Europe.csv" %in% dir())
 expect_true("Oceania.csv" %in% dir())
})

# Part 3: Recursive Lists

We won't focus much on recursive lists in this course, but here is a little taste of it.

## Question 16: `unnest_wider()` and `unnest_longer()`

Explore the `repurrrsive::got_chars` nested list. It contains information about Game of Thrones characters.

In [None]:
str(got_chars, list.len = 4)

Put the list in a tibble:

In [None]:
got_chars_tbl <- tibble(character = repurrrsive::got_chars)
print(got_chars_tbl, n = 5)

Would widening the list column work best, or lengthening? Do it.
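To compare the two operations on something smaller, here is a sketch with made-up data (`people` and `posts` are hypothetical and unrelated to `got_chars`).

```r
library(tibble)
library(tidyr)

# Each element of `info` is a named list with the same fields.
people <- tibble(info = list(
  list(name = "Ada", age = 36),
  list(name = "Alan", age = 41)
))
# Widening: one new column per field (`name`, `age`), same number of rows.
people_wide <- unnest_wider(people, info)

# Each element of `tags` is a vector of varying length.
posts <- tibble(id = 1:2, tags = list(c("r", "purrr"), "tidyr"))
# Lengthening: one row per element, same number of columns.
posts_long <- unnest_longer(posts, tags)
```

A rough rule of thumb is that named records with a shared set of fields suit widening, while unnamed collections of varying length suit lengthening.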

```r
answer16 <- FILL_THIS_IN(got_chars_tbl, character)
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer16, n = 5)

In [None]:
test_that('Question 16', {
 answer16 %>% 
 pull(culture) %>% 
 expect_known_hash("239b2663946d88db14fb52f017d749da")
 answer16 %>% 
 pull(url) %>% 
 expect_known_hash("40d4d84edde6c1573c6eef61b2bd49c2")
 answer16 %>% 
 pull(name) %>% 
 expect_known_hash("9fa482de54b3e866524eff35d7e4dee9")
})

### Attributions

Thanks to Diana Lin for putting this worksheet together, Icíar Fernandez Boyano for reviewing, and David Kepplinger for assistance implementing these questions. Thanks to Firas Moosvi for providing a bunch of the questions on this worksheet. Thanks to Andy Tai for implementing the autograder for many of these questions.