---
title: 'Exercise: Motivating reproducibility'
output:
  html_document:
    fig_height: 3
    fig_width: 6
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(include = TRUE, error = TRUE)
```

This RMarkdown file contains R code you can use to complete the Motivating Reproducibility exercise. To see the output, simply click on *Knit HTML* above. This might prompt you to install and load some required packages, specifically `knitr`. Just click yes, and the document, including this narrative, the R code, and the figures, should pop up.

## Packages:

We will use packages from **tidyverse**, a collection of R packages that share common philosophies and are designed to work together. Specifically, we'll use functions from the following three packages:

- **readr** for loading data: The goal of readr is to provide a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes. Find out more at http://readr.tidyverse.org/.
- **ggplot2** for plotting: This is not the only way to make plots in R, but this package has aesthetic defaults that make it attractive. Find out more at http://ggplot2.tidyverse.org/.
- **dplyr** for data wrangling: This is not the only way to manipulate data in R, but this package uses piping and a simple set of verbs for the most common data wrangling tasks, which makes it attractive. Find out more at http://dplyr.tidyverse.org/.

Additionally, we will use the **testthat** package for testing data integrity expectations.

If you have not yet done so, install tidyverse via `install.packages("tidyverse")` and testthat via `install.packages("testthat")`.

Now we need to explicitly load the packages:

```{r package, message=FALSE, cache=FALSE}
library("tidyverse")
library("testthat")
```

## Part 1: Analyze + document

#### 1. Load the data:

Note that `read_csv` keeps strings as character vectors rather than converting them to factors, as if `stringsAsFactors = FALSE` were set.
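To illustrate what that setting controls, here is a minimal sketch using base R's `read.csv` on a small inline CSV string. The `csv_text` variable and its contents are made up for illustration and are not part of the exercise data. `read_csv` from readr behaves like the second call: string columns stay character vectors.

```{r}
# Made-up inline CSV, for illustration only
csv_text <- "country,year\nCanada,1952\nMexico,1952"

# Base R: strings converted to factors
with_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Base R: strings kept as character vectors (what read_csv does)
with_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(with_factors$country)  # "factor"
class(with_strings$country)  # "character"
```

With that in mind, the exercise data is loaded with `read_csv`: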
```{r}
gap_5060 <- read_csv("data/gapminder-5060.csv")
```

#### 2. Visualize life expectancy over time for Canada in the 1950s and 1960s using a line plot.

```{r}
# Filter the data for Canada only
gap_5060_CA <- gap_5060 %>%
  filter(country == "Canada")

# Visualize
ggplot(data = gap_5060_CA, aes(x = year, y = lifeExp)) +
  geom_line()
```

Something is clearly wrong with this plot!

#### 3. Test data integrity expectations, and make appropriate corrections.

Life expectancy shouldn't exceed even the most extreme age observed for humans. The following is one way to test for this:

```{r}
if (any(gap_5060$lifeExp > 150)) {
  stop("Improbably high life expectancy.")
}
```

Another approach is to use the testthat package, which makes the test a little more readable:

```{r include=TRUE, eval=FALSE}
expect_false(any(gap_5060$lifeExp > 150),
             "One or more life expectancies are improbably high.")
```

Note: If you run this in your console, you should get an error similar to the following (the exact wording depends on your testthat version): `Error: any(gap_5060$lifeExp > 150) isn't false. One or more life expectancies are improbably high. Execution halted`.

We can also check for zero or negative population:

```{r, eval=FALSE}
expect_false(any(gap_5060$pop <= 0),
             "One or more population sizes are zero or negative.")
```

It turns out there is an error in the data file: life expectancy for Canada in the year 1957 is coded as `999999`, but it should be `69.96`. We now make this correction:

```{r}
# Replace the erroneous value for Canada in 1957 only
gap_5060 <- gap_5060 %>%
  mutate(lifeExp = replace(lifeExp, (country == "Canada" & year == 1957), 69.96))
```

Also, documentation is already done!

#### 4. Visualize life expectancy over time for Canada again, with the corrected data.

This is exactly the same code as before, but the contents of `gap_5060` differ because the data frame was updated in the previous step.

```{r}
gap_5060_CA <- gap_5060 %>%
  filter(country == "Canada")

ggplot(data = gap_5060_CA, aes(x = year, y = lifeExp)) +
  geom_line()
```

#### 4 - Stretch goal.
Add lines for Mexico and the United States.

- `%in%` is an operator that tests whether a country's name is in the vector provided.
- The visualization code is the same as before, apart from the input dataset and the `color = country` aesthetic, which draws one line per country.

```{r}
# Keep the three North American countries
gap_5060_NA <- gap_5060 %>%
  filter(country %in% c("Canada", "Mexico", "United States"))

ggplot(data = gap_5060_NA, aes(x = year, y = lifeExp, color = country)) +
  geom_line()
```
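
The integrity checks from step 3 can also be collected into a small helper, so the same expectations are re-run whenever the data changes. The following is a minimal sketch in base R and not part of the original exercise. The function name `check_gapminder` and the `toy` data frame are hypothetical. Only the `999999` and `69.96` values come from the text above, and the 150-year threshold is the one used in step 3.

```{r}
# Hypothetical helper: returns a character vector describing failed checks
# (an empty vector means every check passed).
check_gapminder <- function(df, max_life_exp = 150) {
  problems <- character(0)
  if (any(df$lifeExp > max_life_exp, na.rm = TRUE)) {
    problems <- c(problems, "One or more life expectancies are improbably high.")
  }
  if (any(df$pop <= 0, na.rm = TRUE)) {
    problems <- c(problems, "One or more population sizes are zero or negative.")
  }
  problems
}

# Toy data for illustration only (not read from the data file)
toy <- data.frame(
  country = "Canada",
  year    = c(1952, 1957, 1962),
  lifeExp = c(68, 999999, 71),
  pop     = c(15e6, 17e6, 19e6)
)

check_gapminder(toy)  # reports the life expectancy problem

# Apply the same correction as in step 3, then check again
toy$lifeExp[toy$country == "Canada" & toy$year == 1957] <- 69.96
check_gapminder(toy)  # character(0): all checks pass
```

Returning the messages instead of calling `stop()` directly is a design choice in this sketch: it lets a single run report every failed expectation at once, and the caller can still halt with `stop()` if the returned vector is non-empty.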