---
title: "Advanced Data Manipulation Homework"
---

(_Refer back to the [Advanced Data Manipulation lesson](r-dplyr-yeast.html)._)

```{r inithw, echo=FALSE}
knitr::opts_chunk$set(echo=FALSE, message = FALSE, warning = FALSE)
```

### Key Concepts

> 
- **dplyr** verbs
- the pipe `%>%`
- the `tbl_df`
- variable creation
- multiple conditions
- properties of grouped data
- aggregation
- summary functions
- window functions

### Getting Started

We're going to work with a different dataset for the homework here. It's a [cleaned-up excerpt](https://github.com/jennybc/gapminder) from the [Gapminder data](http://www.gapminder.org/data/). Download the [**gapminder.csv** data by clicking here](data/gapminder.csv) or from the excerpt link above, and save it in a `data/` subfolder of the project directory where you can access it easily from R.


Load the **dplyr** and **readr** packages, and read the gapminder data into R using the `read_csv()` function (n.b. `read_csv()` is _not_ the same as `read.csv()`). Assign the data to an object called `gm`.
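The difference between the two readers is easy to check on a throwaway file. The sketch below is only illustrative: the temporary file and its two toy rows are made up for the example and are not part of the gapminder data.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file (not the gapminder data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("country,pop", "A,100", "B,250"), tmp)

base_df <- read.csv(tmp)                    # base R reader
tidy_df <- suppressMessages(read_csv(tmp))  # readr reader

# read.csv() returns a plain data.frame; read_csv() returns a tibble
class(base_df)
inherits(tidy_df, "tbl_df")
```

A tibble prints only its first rows along with the column types, which is why displaying `gm` by typing its name stays readable even for a long table.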

In your submitted homework assignment, I would prefer you use the `read_csv()` function to read the data directly from the web (see below). This way I can run your R code without worrying about whether I have the `data/` directory or not.

```{r loaddata, echo=TRUE, eval=FALSE}
library(dplyr)
library(readr)

# Preferably: read data from web
gm <- read_csv("http://bioconnector.org/workshops/data/gapminder.csv")

# Alternatively read from file:
# gm <- read_csv("data/gapminder.csv")

# Display the data
gm
```


```{r loaddatatrue, eval=TRUE, include=FALSE}
library(dplyr)
library(readr)
gm <- read_csv("data/gapminder.csv")
gm
```


### Problem set

Use **dplyr** functions to address the following questions:

1) How many unique countries are represented per continent?

```{r problem1}
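# Alternative: keep one row per country, then count rows per continent
# gm %>%
#     distinct(country, .keep_all=TRUE) %>%
#     group_by(continent) %>%
#     summarise(n = n())

# Count the distinct countries within each continent
gm %>% 
  group_by(continent) %>% 
  summarize(n = n_distinct(country))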
```


2) Which European nation had the lowest GDP per capita in 1997? 

```{r problem2}
gm %>%
    filter(continent == "Europe" & year == 1997) %>%
    arrange(gdpPercap) %>%
    head(1)
```


3) According to the data available, what was the average life expectancy across each continent in the 1980s?

```{r problem3}
gm %>%
    # The data are recorded every five years, so the 1980s are 1982 and 1987
    filter(year %in% c(1982, 1987)) %>%
    group_by(continent) %>%
    summarize(mean.lifeExp = mean(lifeExp))
```


4) What 5 countries have the highest total GDP over all years combined?

```{r problem4}
gm %>%
    mutate(gdp = gdpPercap*pop) %>%
    group_by(country) %>%
    summarise(Total.GDP = sum(gdp)) %>%
    arrange(desc(Total.GDP)) %>%
    head(5)
```


5) What countries and years had life expectancies of _at least_ 80 years? _N.b. only output the columns of interest: country, life expectancy and year (in that order)._

```{r problem5}
gm %>%
    filter(lifeExp >= 80) %>%
    select(country, lifeExp, year)
```


6) What 10 countries have the strongest correlation (in either direction) between life expectancy and per capita GDP?

```{r problem6}
gm %>%
    group_by(country) %>%
    summarise(r = abs(cor(lifeExp, gdpPercap))) %>%
    arrange(desc(r)) %>%
    head(10)
```


7) Which combinations of continent (besides Asia) and year have the highest average population across all countries? _N.b. your output should include all results sorted by highest average population_. With what you already know, this one may stump you. See [this Q&A](http://stackoverflow.com/q/27207963/654296) for how to `ungroup` before `arrange`ing. Note that `arrange()` also [behaves differently in more recent versions of dplyr](https://github.com/hadley/dplyr/releases/tag/v0.5.0), where it ignores grouping by default.

```{r problem7}
gm %>%
    filter(continent != "Asia") %>%
    group_by(continent, year) %>%
    summarise(mean.pop = mean(pop)) %>%
    ungroup() %>%
    arrange(desc(mean.pop)) 
```
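The `ungroup()` hint in problem 7 can be seen on a toy table. This is a minimal sketch rather than a solution: the `toy` data and its column names are invented for illustration, and the explicit `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- tibble(
  g1 = c("a", "a", "b", "b"),
  g2 = c(1, 2, 1, 2),
  x  = c(5, 1, 9, 3)
)

# summarise() peels off only the last grouping variable,
# so the result is still grouped by g1
res <- toy %>%
  group_by(g1, g2) %>%
  summarise(mean_x = mean(x), .groups = "drop_last")

group_vars(res)           # "g1"
group_vars(ungroup(res))  # character(0)

# After ungroup(), arrange() sorts across the whole table
sorted <- res %>%
  ungroup() %>%
  arrange(desc(mean_x))
sorted
```

Calling `group_vars()` on an intermediate result is a quick way to check which grouping, if any, is still attached before sorting or filtering.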


8) Which three countries have had the most consistent population estimates (i.e. lowest standard deviation) across the years of available data? 

```{r problem8}
gm %>%
    group_by(country) %>%
    summarize(sd.pop = sd(pop)) %>%
    arrange(sd.pop) %>%
    head(3)
```


9) Subset **gm** to only include observations from 1992 and store the results as **gm1992**. What kind of object is this?

```{r problem9}
gm1992 <-
    gm %>%
    filter(year == 1992)

gm1992 %>% 
    class()
```


10) **_Bonus!_** Which observations indicate that the population of a country has *decreased* from the previous recorded year **and** the life expectancy has *increased* from the previous recorded year? _N.b. the data are recorded every five years, so "previous" means the preceding observation for the same country._ See [the vignette on window functions](https://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html).

```{r problem10}
gm %>% 
  arrange(country, year) %>% 
  group_by(country) %>% 
  filter(pop < lag(pop) & lifeExp > lag(lifeExp))
```
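The window function behind problem 10 is `lag()`, which shifts a column down by one row within each group. The sketch below shows it on a toy table; the `toy` data and the column names `id`, `step`, `x`, `prev_x` and `dropped` are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- tibble(
  id   = c("a", "a", "a", "b", "b", "b"),
  step = c(1, 2, 3, 1, 2, 3),
  x    = c(10, 12, 11, 7, 6, 8)
)

lagged <- toy %>%
  arrange(id, step) %>%          # lag() depends on row order
  group_by(id) %>%               # so values never leak between groups
  mutate(prev_x  = lag(x),       # NA for the first row of each group
         dropped = x < prev_x) %>%
  ungroup()
lagged
```

Because `filter()` drops rows where the condition evaluates to `NA`, the first observation of each group can never pass a comparison against `lag()`.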

----

Source: <https://raw.githubusercontent.com/4va/biodatasci/master/r-dplyr-homework.Rmd>