---
title: 'Exercise: Motivating reproducibility'
output:
  html_document:
    fig_height: 3
    fig_width: 6
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(include = TRUE, error = TRUE)
```

This RMarkdown file contains R code you can use to complete the 
Motivating Reproducibility exercise. To see the output, simply click 
on *Knit HTML* above. This might prompt you to install and load some required 
packages, specifically `knitr`. Just click yes, and the rendered document, 
including this narrative, the R code, and the figures, should pop up.

## Packages:

We will use packages from **tidyverse**, a collection of R packages that 
share common philosophies and are designed to work together. Specifically, 
we'll use functions from the following three packages:

- **readr** for loading data: The goal of readr is to provide a fast and 
friendly way to read rectangular data (like csv, tsv, and fwf). It is designed 
to flexibly parse many types of data found in the wild, while still cleanly 
failing when data unexpectedly changes. Find out more at http://readr.tidyverse.org/.
- **ggplot2** for plotting: This is not the only way to make this plot, but 
this package has some aesthetic defaults that make it attractive. Find out 
more at http://ggplot2.tidyverse.org/.
- **dplyr** for data wrangling: This is not the only way to manipulate data in 
R, but this package uses piping and a simple set of verbs for the most common 
data wrangling tasks, which makes it attractive. Find out more at 
http://dplyr.tidyverse.org/.

Additionally, we will use the **testthat** package for testing data 
integrity expectations.

If you have not yet done so, install tidyverse via `install.packages("tidyverse")` 
and testthat via `install.packages("testthat")`.

Now we need to explicitly load the packages:

```{r package, message=FALSE, cache=FALSE}
library("tidyverse")
library("testthat")
```

## Part 1: Analyze + document

#### 1. Load the data: 

Note that `read_csv` never converts strings to factors, so it behaves like 
base R's `read.csv` with `stringsAsFactors = FALSE`.

```{r}
gap_5060 <- read_csv("data/gapminder-5060.csv")
```
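
As a small, self-contained illustration of how `read_csv` treats text columns, 
the sketch below parses a made-up two-row CSV from a literal string instead of 
the exercise file. It assumes readr 2.0 or later (for `I()` and 
`show_col_types`); the `toy_read` data and its values are invented for 
illustration only.

```{r}
library(readr)

# Toy CSV text (illustrative values, not the real gapminder data)
toy_csv <- "country,year,lifeExp
Canada,1952,68.8
Mexico,1952,50.8
"

# I() tells readr to treat the string as literal data rather than a file path
toy_read <- read_csv(I(toy_csv), show_col_types = FALSE)

# Text columns stay character; numeric columns are parsed as doubles
class(toy_read$country)
class(toy_read$lifeExp)
```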

#### 2. Visualize life expectancy over time for Canada in the 1950s and 1960s using a line plot.
  
```{r}
# Filter the data for Canada only
gap_5060_CA <- gap_5060 %>%
  filter(country == "Canada")
# Visualize
ggplot(data = gap_5060_CA, aes(x = year, y = lifeExp)) +
  geom_line()
```

Something is clearly wrong with this plot!

#### 3. Test data integrity expectations, and make appropriate corrections.

Life expectancy shouldn't exceed even the most extreme age observed 
for humans. The following is one way to test for this:

```{r}
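# Stop if any life expectancy is implausibly high; na.rm = TRUE keeps
# missing values from turning the condition into NA
if (any(gap_5060$lifeExp > 150, na.rm = TRUE)) {
  stop("Improbably high life expectancy.")
}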
```

Another approach is using the testthat package, which allows us to 
make the test a little more readable:

```{r include=TRUE, eval=FALSE}
expect_false(any(gap_5060$lifeExp > 150),
            "One or more life expectancies are improbably high.")
```

Note: If you run this in your console, you should get an error along the 
lines of `Error: any(gap_5060$lifeExp > 150) isn't false. One or more 
life expectancies are improbably high.` The exact wording depends on your 
version of testthat.

We can also check for 0 or negative population:

```{r, eval=FALSE}
expect_false(any(gap_5060$pop <= 0),
            "One or more population sizes are zero or negative.")
```
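
When a check like these fails, the next step is usually to find the offending 
rows. The sketch below shows one way to do that with `filter()`, using a small 
made-up tibble (`toy_gap`, a hypothetical stand-in for `gap_5060`) so that it 
runs on its own.

```{r}
library(dplyr)

# Hypothetical stand-in for gap_5060; the values are illustrative
toy_gap <- tibble(
  country = c("Canada", "Canada", "Canada"),
  year    = c(1952, 1957, 1962),
  lifeExp = c(68.8, 999999, 71.3)
)

# Keep only the rows that violate the expectation
implausible <- toy_gap %>%
  filter(lifeExp > 150)

implausible
```

Applying the same `filter()` call to `gap_5060` would list the rows that need 
a closer look.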

It turns out there's an error in the data file: life expectancy for 
Canada in the year 1957 is coded as `999999`, but it should actually be 
`69.96`. We now make this correction:

```{r}
gap_5060 <- gap_5060 %>%
  mutate(lifeExp = replace(lifeExp, (country == "Canada" & year == 1957), 69.96))
```
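
To make the mechanics of `replace()` explicit, here is a minimal sketch on a 
made-up tibble (`toy_fix`, hypothetical and unrelated to the real file apart 
from the corrected value). The logical condition picks the positions to 
overwrite, and every other value is left untouched. Repeating the 
plausibility check afterwards confirms that the fix took effect.

```{r}
library(dplyr)

# Hypothetical data with one implausible entry
toy_fix <- tibble(
  country = c("Canada", "Canada", "Mexico"),
  year    = c(1952, 1957, 1957),
  lifeExp = c(68.8, 999999, 55.2)
)

# Overwrite only the value where both conditions hold
toy_fix <- toy_fix %>%
  mutate(lifeExp = replace(lifeExp, country == "Canada" & year == 1957, 69.96))

# No value should exceed the plausibility threshold any more
stopifnot(all(toy_fix$lifeExp <= 150))
toy_fix$lifeExp
```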

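Also, documentation is already done! Because the analysis lives in this 
RMarkdown file, the narrative and the code together record what was 
changed and why.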

#### 4. Visualize life expectancy over time for Canada again, with the corrected data.

This is the exact same code as before, but note that the contents of 
`gap_5060` are different because we corrected them in the previous task.

```{r}
gap_5060_CA <- gap_5060 %>%
  filter(country == "Canada")

ggplot(data = gap_5060_CA, aes(x = year, y = lifeExp)) +
  geom_line()
```

#### 4 - Stretch goal. Add lines for Mexico and United States.

- `%in%` is a logical operator that tests whether a country's name is in 
the list provided.
- The visualization code is the same as before, apart from the input dataset 
and the `color = country` aesthetic, which draws one line per country.

```{r}
gap_5060_NA <- gap_5060 %>%
  filter(country %in% c("Canada", "Mexico", "United States"))

ggplot(data = gap_5060_NA, aes(x = year, y = lifeExp, color = country)) +
  geom_line()
```
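
To see what `%in%` does on its own, the short sketch below applies it to a 
made-up vector of names (`candidates`, hypothetical). It returns one logical 
value per element on the left-hand side, which is the kind of vector 
`filter()` expects.

```{r}
# Hypothetical vector of country names to test
candidates <- c("Canada", "Brazil", "Mexico")

# TRUE where the name appears in the set on the right, FALSE otherwise
is_selected <- candidates %in% c("Canada", "Mexico", "United States")
is_selected
```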