# Worksheet A-4: Tidy Data & The Model-Fitting Paradigm in R

By the end of this worksheet, you will be able to:

(From tidyr)

- convert a dataset between the 'long' and 'wide' format, using `tidyr::pivot_longer()` and `tidyr::pivot_wider()`
- assess which format is best suited for each type of analysis
- deal with missing data in a tibble

(From modelling)

- make a model object in R, using `lm()` as an example
- write a formula in R
- predict on a model object with the `broom::augment()` and `predict()` functions
- extract information from a model object using `broom::tidy()`, `broom::glance()`, and traditional means

To get full marks for this worksheet, you must successfully answer at least 10 of the autograded questions. There are 15 questions in total.

## Getting Started

Load the requirements for this worksheet:

In [None]:
library(testthat)
library(digest)
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(broom))
suppressPackageStartupMessages(library(gapminder))
lotr <- suppressMessages(read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/lotr_tidy.csv"))
guest <- suppressMessages(read_csv("https://raw.githubusercontent.com/STAT545-UBC/Classroom/master/data/wedding/attend.csv"))

The following code chunk has been unlocked, to give you the flexibility to start this document with some of your own code. Remember, it's bad manners to keep a call to `install.packages()` in your source code, so if you ever need to run such a line here, don't forget to delete it afterwards.

In [None]:
# An unlocked code cell.

# Part 1: Tidy Data with Univariate Pivoting

Consider the Lord of the Rings data. Run the code cell below to see the first few lines of the tibble.

In [None]:
print(lotr, n = 5)

## Question 1.1
Widen the data so that we see the words spoken by each race, by putting each race in its own column. Store the answer in `answer1.1`.

```
(answer1.1 <- lotr %>%
 FILL_THIS_IN(names_from = FILL_THIS_IN,
 values_from = FILL_THIS_IN))
```

Your `answer1.1` should look something like this (full tibble not always shown):

![answer1.1](https://github.com/UBC-STAT/stat545.stat.ubc.ca/raw/master/content/data/worksheet_04a/answer1.1.png)

_Sidenote:_ Wrapping a variable assignment in parentheses not only assigns the value to the variable, but also prints it to the console. Normally, when you assign a variable, you do not get to see its value. This is a helpful tip so you can see what you are storing!
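As a small illustration of the sidenote, here is a sketch using made-up values (the names `x` and `y` are only for this example and are not part of the worksheet):

```r
# Plain assignment: the value is stored, but nothing is printed.
x <- c(2, 4, 6)

# Assignment wrapped in parentheses: the value is stored *and* printed.
(y <- x * 10)
```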

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer1.1)
arrange_all(answer1.1)

In [None]:
test_that("Question 1.1", {
 expect_true(all(c("Film", "Gender", "Elf", "Hobbit", "Man") %in% names(answer1.1)))
 expect_equal(nrow(answer1.1), 6L)
})

## Question 1.2
Re-lengthen the wide `lotr` data (i.e. `answer1.1`) from Question 1.1 above. Store your answer in `answer1.2`.

**Hint:** the resulting data frame should be _almost the same_ as the original! (No need to reorder the columns.)
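To see what such a round trip looks like, here is a minimal sketch on a made-up tibble (`sales_long` and its columns are hypothetical and unrelated to `lotr`). Widening and then re-lengthening recovers the same information, although row and column order may differ in general:

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data: one row per store-fruit combination.
sales_long <- tibble(
  store = c("A", "A", "B", "B"),
  fruit = c("apple", "pear", "apple", "pear"),
  units = c(10, 3, 7, 5)
)

# Widen: each fruit becomes its own column.
sales_wide <- sales_long %>%
  pivot_wider(names_from = fruit, values_from = units)

# Re-lengthen: gather every column except `store` back into two columns.
sales_back <- sales_wide %>%
  pivot_longer(cols = -store, names_to = "fruit", values_to = "units")

# Compare after sorting, because a round trip does not promise row order.
all.equal(arrange(sales_long, store, fruit), arrange(sales_back, store, fruit))
```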

```
(answer1.2 <- answer1.1 %>% 
 FILL_THIS_IN(cols = c(-FILL_THIS_IN, -FILL_THIS_IN), 
 names_to = FILL_THIS_IN, 
 values_to = FILL_THIS_IN))
```

Your `answer1.2` should look something like this (full tibble not always shown):

![answer1.2](https://github.com/UBC-STAT/stat545.stat.ubc.ca/raw/master/content/data/worksheet_04a/answer1.2.png)

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer1.2)

In [None]:
test_that("Question 1.2", {
 expect_true(all(c("Film", "Gender", "Race", "Words") %in% names(answer1.2)))
 expect_equal(nrow(answer1.2), 18L)
})

## Question 1.3

Using the `gapminder` dataset: what's the relationship between Canada's GDP per capita and the United Kingdom's? First, produce a tidy tibble from the `gapminder` tibble to address this question. Store your tibble in a variable named `answer1.3`. Do not rename any columns.

_Food for thought_: After tidying the data for this problem, we should be able to make a scatterplot of Canada's GDP per capita against the UK's. But, doing so for the original `gapminder` dataset would be difficult. 
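The reason is that a scatterplot needs each axis to be its own column. The sketch below shows the idea on made-up temperature data (the tibble `temps`, the city names, and the `rain` column are all hypothetical). It also uses `id_cols`, a `pivot_wider()` argument that names the columns identifying a row, so that other varying columns do not get in the way:

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data: yearly temperature and rainfall for two cities.
temps <- tibble(
  year = rep(2001:2004, times = 2),
  city = rep(c("Northtown", "Southport"), each = 4),
  temp = c(4.1, 4.5, 3.9, 4.8, 15.2, 15.9, 15.0, 16.1),
  rain = c(300, 320, 310, 290, 800, 790, 820, 805)
)

# One row per year, one temperature column per city.
# `rain` is neither an id column nor a value column, so it is dropped.
temps_wide <- temps %>%
  pivot_wider(id_cols = year, names_from = city, values_from = temp)

# With one column per city, comparing the two series is direct.
cor(temps_wide$Northtown, temps_wide$Southport)
```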

```
answer1.3 <- gapminder %>% 
 filter(FILL_THIS_IN) %>% 
 pivot_FILL_THIS_IN(FILL_THIS_IN)
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer1.3)

In [None]:
test_that("Question 1.3", {
 expect_true("Canada" %in% names(answer1.3))
 expect_true("United Kingdom" %in% names(answer1.3))
 expect_equal(nrow(answer1.3), 12L)
})

# Part 2: Tidy Data with Multivariate Pivoting

Congratulations, you’re getting married! In addition to the wedding, you’ve decided to hold two other events: a day-of brunch and a day-before round of golf. You’ve made a guest list recording attendance so far, along with each guest’s meal preference for the events that serve food (wedding and brunch).

Run the code cell below to see the first few rows of the `guest` data frame.

In [None]:
head(guest)

## Question 2.1
Put `meal` and `attendance` as their own columns, with the events living in a new column. Store your answer in `answer2.1`.

```
(answer2.1 <- guest %>% 
 FILL_THIS_IN(cols = c(-FILL_THIS_IN, -FILL_THIS_IN), 
 names_to = c(FILL_THIS_IN, FILL_THIS_IN),
 names_sep = FILL_THIS_IN))
``` 

**Hint**: Read the possible values for `names_to` in the corresponding documentation of the function you choose!
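For orientation, here is a sketch of multivariate pivoting on a made-up tibble (`marks` and its column names are hypothetical). The special `".value"` entry in `names_to` is documented in `?tidyr::pivot_longer`: it says that this piece of each column name should become the name of an output column, rather than a value in a names column.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data: two kinds of measurement (score, grade) for two terms.
marks <- tibble(
  student      = c("Ana", "Ben"),
  score_fall   = c(81, 67),
  grade_fall   = c("A", "C"),
  score_spring = c(90, 72),
  grade_spring = c("A", "B")
)

# Split each column name at "_": the first piece becomes an output column
# (via ".value"), the second piece is stored in a new `term` column.
marks_long <- marks %>%
  pivot_longer(
    cols = -student,
    names_to = c(".value", "term"),
    names_sep = "_"
  )

marks_long
```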

Your `answer2.1` should look something like this (full tibble not always shown):

![answer2.1](https://github.com/UBC-STAT/stat545.stat.ubc.ca/raw/master/content/data/worksheet_04a/answer2.1.png)

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer2.1)

In [None]:
test_that("Question 2.1", {
 expect_true(all(c("party", "name", "event", "meal", "attendance") %in% names(answer2.1)))
 expect_equal(nrow(answer2.1), 90L)
})

## Question 2.2
Use `tidyr::separate()` to split the `name` in `answer2.1` into two columns: `first_name` and `last_name`. Store your answer in `answer2.2`.

```
(answer2.2 <- answer2.1 %>% 
 FILL_THIS_IN(FILL_THIS_IN, into = c(FILL_THIS_IN, FILL_THIS_IN), sep=FILL_THIS_IN))
``` 

Your `answer2.2` should look something like this (full tibble not always shown):

![answer2.2](https://github.com/UBC-STAT/stat545.stat.ubc.ca/raw/master/content/data/worksheet_04a/answer2.2.png)

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer2.2)

In [None]:
test_that("Question 2.2", {
 expect_true(all(c("party", "first_name", "last_name", "event", "meal", "attendance") %in% names(answer2.2)))
 expect_equal(nrow(answer2.2), 90L)
})

## Question 2.3
Re-unite `first_name` and `last_name` in `answer2.2` back into `name` using `tidyr::unite()`. Store your answer in `answer2.3`.
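`separate()` and `unite()` are inverses of each other when the same separator is used. A minimal sketch on made-up names (the tibble `people` and its columns are hypothetical):

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data: full names stored in a single column.
people <- tibble(full = c("Ada Lovelace", "Alan Turing"))

# Split one column into two at the space.
people_split <- people %>%
  separate(full, into = c("given", "family"), sep = " ")

# Paste the two columns back together with the same separator.
people_joined <- people_split %>%
  unite(col = "full", c(given, family), sep = " ")

identical(people$full, people_joined$full)
```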

```
(answer2.3 <- answer2.2 %>%
 FILL_THIS_IN(col = FILL_THIS_IN, c(FILL_THIS_IN, FILL_THIS_IN), sep = FILL_THIS_IN))
``` 

Your `answer2.3` should look something like this (full tibble not always shown):

![answer2.3](https://github.com/UBC-STAT/stat545.stat.ubc.ca/raw/master/content/data/worksheet_04a/answer2.3.png)

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer2.3)

In [None]:
test_that("Question 2.3", {
 expect_true(all(c("party", "name", "event", "meal", "attendance") %in% names(answer2.3)))
 expect_equal(nrow(answer2.3), 90L)
})

## Question 2.4

Which parties still have a "PENDING" attendance status for all of their members and all of the events? Your answer should be a vector of party IDs (not a tibble). Store your answer in `answer2.4`.

**Hint**: use `answer2.1` as a starting point. Use `pull()` to access a column as a vector.
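The general pattern (a group-wise `all()` followed by `pull()`) can be sketched on made-up data as follows. The tibble `orders` and its columns are hypothetical:

```r
library(dplyr)

# Hypothetical toy data: order status per customer.
orders <- tibble(
  customer = c(1, 1, 2, 2, 3),
  status   = c("SHIPPED", "OPEN", "OPEN", "OPEN", "SHIPPED")
)

all_open <- orders %>%
  group_by(customer) %>%
  # TRUE only if every row in the group has status "OPEN".
  summarize(every_open = all(status == "OPEN")) %>%
  filter(every_open) %>%
  # pull() returns the column as a plain vector instead of a tibble.
  pull(customer)

all_open
```

Only customer 2 has every order open, so `all_open` is the single value `2`.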

```
answer2.4 <- answer2.1 %>% 
 group_by(FILL_THIS_IN) %>% 
 summarize(all_pending = all(FILL_THIS_IN == "PENDING")) %>%
 filter(all_pending) %>%
 FILL_THIS_IN(party)
``` 

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer2.4)

In [None]:
test_that("Question 2.4", {
 expect_equal(digest(unclass(answer2.4)),"f13a65bc5c8793a2cad1415aad7dff93")
})

## Question 2.5
Which parties still have a "PENDING" attendance status for all of their members, for the wedding event only? Your answer should be a vector of party IDs (not a tibble). Store your answer in `answer2.5`.

**Hint**: Use `pull()` to access a column as a vector.

```
answer2.5 <- guest %>% 
 group_by(FILL_THIS_IN) %>% 
 summarize(pending_wedding = all(FILL_THIS_IN == "PENDING")) %>%
 filter(FILL_THIS_IN) %>%
 FILL_THIS_IN(party)
``` 

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer2.5)

In [None]:
test_that("Question 2.5", {
 expect_equal(digest(unclass(answer2.5)), "f13a65bc5c8793a2cad1415aad7dff93")
})

# Part 3: The Model-Fitting Paradigm in R

**Overview**

1. Fit a linear regression model to life expectancy ("Y") from year ("X") by filling in the formula. Notice what appears as the output.
2. Use the `unclass` function to uncover the object's true nature: a list.
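These two steps can be previewed on R's built-in `cars` dataset, which is used here only as a stand-in for the `gapminder` subset you will build below:

```r
# Step 1: fit a model. Printing `fit` shows only the call and coefficients.
fit <- lm(dist ~ speed, data = cars)
class(fit)   # "lm"

# Step 2: strip the class attribute to reveal the underlying list.
raw <- unclass(fit)
is.list(raw) # TRUE
names(raw)   # the components stored in the model object

# Components can be extracted like any other list element.
raw$coefficients
```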

## Question 3.1
First, create a subset of the `gapminder` dataset containing only the country of `France`. Store your answer in `answer3.1`.

```
(answer3.1 <- gapminder %>%
 FILL_THIS_IN(FILL_THIS_IN == FILL_THIS_IN))
``` 

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer3.1)

In [None]:
test_that("Question 3.1", {
 expect_equal(digest(unclass(answer3.1)), "a6125bcbb25047b7a8c932acbb1f2250")
})

## Question 3.2

> Fit a linear regression model to life expectancy ("Y") from year ("X") by filling in the formula

Now, using the `lm()` function we will create the linear model. Store your answer in `answer3.2`.
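A formula in R is an object in its own right, written as `response ~ predictor`. As a sketch on the built-in `cars` dataset (a stand-in, not the worksheet data):

```r
# A formula is not evaluated when written; it only records variable names.
f <- dist ~ speed
class(f)     # "formula"
all.vars(f)  # "dist" "speed"

# lm() looks the variables up in the data frame passed as the second argument.
fit <- lm(f, data = cars)
coef(fit)
```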

```
(answer3.2 <- FILL_THIS_IN(FILL_THIS_IN ~ FILL_THIS_IN, answer3.1))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer3.2)

In [None]:
test_that("Question 3.2", {
 expect_known_hash(round(coef(answer3.2), 4), "e7375b4c2683882feb9d7215f6929f69")
 expect_known_hash(answer3.2$terms, "9c71cfd9974bfbc8e160b1a31936d137")
})
print("Success!")

## Question 3.3

We are interested in the modelling results over the period covered by the data, which starts in 1952. To get a meaningful, interpretable intercept, we can use the `I()` function.

Use `I()` to shift the year so that the "beginning" of our dataset (1952) corresponds to '0' in the model. This makes all the years in the data set relative to the first year, 1952, so the intercept becomes the fitted life expectancy in 1952.

Store your answer in `answer3.3`.
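`I()` is needed because `-` has a special meaning inside a formula (it removes a term); wrapping the expression in `I()` makes R treat it as ordinary arithmetic. The sketch below uses the built-in `cars` dataset as a stand-in, shifting `speed` by its smallest observed value, 4:

```r
fit_raw     <- lm(dist ~ speed, data = cars)
fit_shifted <- lm(dist ~ I(speed - 4), data = cars)

# Shifting the predictor leaves the slope unchanged...
coef(fit_raw)[2]
coef(fit_shifted)[2]

# ...but the intercept is now the fitted value at speed = 4,
# rather than at the (never observed) speed = 0.
pred_at_4 <- predict(fit_raw, newdata = data.frame(speed = 4))
all.equal(unname(coef(fit_shifted)[1]), unname(pred_at_4))
```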

```
answer3.3 <- FILL_THIS_IN(FILL_THIS_IN ~ I(FILL_THIS_IN-1952), answer3.1)
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer3.3)

In [None]:
test_that("Question 3.3", {
 expect_known_hash(round(coef(answer3.3), 4), "6a83f591b39b440f2a699dbee2c23468")
 expect_known_hash(answer3.3$terms, "f7d8f19ef010f5932ba9b321f3f88282")
})

## Question 3.4
Use the `unclass()` function to take a look at what the `lm()` object actually looks like. Store your answer in `answer3.4`.

```
answer3.4 <- FILL_THIS_IN(answer3.3)
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer3.4)

In [None]:
test_that("Question 3.4", {
 expect_known_hash(round(answer3.4$coefficients, 4), "6a83f591b39b440f2a699dbee2c23468")
 expect_known_hash(class(answer3.4), "086ebc4c59c08c43e75bae74f1e16897")
 expect_known_hash(answer3.4$terms, "f7d8f19ef010f5932ba9b321f3f88282")
 
})

# Part 4: Producing Tidy Tibbles with broom

## Question 4.1

Apply `broom::tidy()` to `answer3.3`. Store your answer in `answer4.1`.
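For comparison with the traditional approach, here is a sketch on the built-in `cars` dataset (a stand-in, not the worksheet model). `coef(summary(fit))` returns a matrix with terms as row names, whereas `tidy()` returns the same numbers as a tibble with one row per term:

```r
library(broom)

fit <- lm(dist ~ speed, data = cars)

# Traditional: a numeric matrix; the term names live in the row names.
coef(summary(fit))

# broom: a tibble with columns term, estimate, std.error, statistic, p.value.
coef_table <- tidy(fit)
coef_table
```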

```
answer4.1 <- FILL_THIS_IN(answer3.3)
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer4.1)

In [None]:
test_that("Question 4.1", {
 expect_known_hash(dimnames(answer4.1), "b7e5db66048ee1c33cb090078dc59103")
 expect_known_hash(answer4.1[[1]], "6736fedebcaa557ef7a78f8db206000f")
 expect_known_hash(round(answer4.1[[3]], 4), "004727aa166a650b3f55c2c11f6be257")
})

## Question 4.2
Apply `broom::augment()` to `answer3.3`. Store your answer in `answer4.2`.
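`augment()` and `predict()` give the same fitted values; the difference is that `augment()` returns them as columns alongside the data. A sketch on the built-in `cars` dataset (a stand-in; `new_speeds` is a made-up data frame):

```r
library(broom)

fit <- lm(dist ~ speed, data = cars)

# One row per observation, with .fitted, .resid, and other columns added.
aug <- augment(fit)

# The fitted values agree with predict().
all.equal(unname(predict(fit)), unname(aug$.fitted))

# Both functions accept `newdata` for predicting at new predictor values.
new_speeds <- data.frame(speed = c(10, 20))
predict(fit, newdata = new_speeds)
augment(fit, newdata = new_speeds)
```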

```
answer4.2 <- FILL_THIS_IN(answer3.3)
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer4.2)

In [None]:
test_that("Question 4.2", {
 expect_known_hash(round(answer4.2$.fitted, 4), "3a2e6323312173544d30c4dc75d5b604")
 expect_known_hash(round(answer4.2$.std.resid, 4), "b2796c0f920703cb5066fcda716be2e7")
})

## Question 4.3
Apply `broom::glance()` to `answer3.3`. Store your answer in `answer4.3`.
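`glance()` collects model-level summaries into a one-row tibble. As a sketch on the built-in `cars` dataset (a stand-in for the worksheet model), the R-squared it reports matches the one from `summary()`:

```r
library(broom)

fit <- lm(dist ~ speed, data = cars)

# One row per model, with columns such as r.squared, sigma, AIC, and nobs.
gl <- glance(fit)
gl

# Traditional equivalent for a single quantity.
all.equal(gl$r.squared, summary(fit)$r.squared)
```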

```
answer4.3 <- FILL_THIS_IN(answer3.3)
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer4.3)

In [None]:
test_that("Question 4.3", {
 expect_known_hash(dimnames(answer4.3), "78044903eb403fb9220d796ac127297c")
 expect_known_hash(round(answer4.3[[2]], 4), "a3ec6ee89f16b0783571e8f9e26c9ef5")
 expect_known_hash(round(answer4.3[[4]], 4), "f95d29661fc511c6a038ae4e06b9ea02")
})

### Attribution

Thanks to Diana Lin, Icíar Fernández Boyano, David Kepplinger, and Vincenzo Coia for putting this worksheet together.