---
title: "Single numbers and parts of a whole"
date: "2017-09-12"
---

# Task 1: Reflection memo

Write a 500-word memo about [the assigned readings](/reading/02-reading/) for this week. You can use some of the prompt questions there if you want. As you write the memo, also consider these central questions:

- How do these readings connect to our main goal of discovering truth?
- How does what I just read apply to me?
- How can this be useful to me?

**[E-mail me](mailto:andrew_heiss@byu.edu) a PDF of the memo.**


# Task 2: Playing with R

This example uses data from the [Gapminder project](https://www.gapminder.org/).[^gapminder] You'll need to install the `gapminder` R package first. Install it either with the "Packages" panel in RStudio or by typing `install.packages("gapminder")` in the R console.

[^gapminder]: {-}
  You may have seen Hans Rosling's [delightful TED talk](https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen) showing how global health and wealth have been increasing. If you haven't, you should watch it. Sadly, Hans died in February 2017.

For this first R-based assignment, you won't do any actual coding. [Download this file](https://raw.githubusercontent.com/andrewheiss/dataviz.andrewheiss.com/master/content/assignment/02-assignment.Rmd),[^download-note] open it in RStudio, and walk through the examples on your computer. If you place your cursor on a line of R code and press "⌘ + enter" (for macOS users) or "ctrl + enter" (for Windows users), RStudio will send that line to the console and run it.

[^download-note]: {-}
  Your browser might show the file instead of downloading it. If that's the case, you can copy/paste the code from the browser to RStudio. In RStudio, go to "File" > "New" > "New R Markdown…", click "OK" with the default options, delete all the placeholder code/text in the new file, and paste the example code in the now-blank file.

There are a few questions that you'll need to answer, but that's all.

## Life expectancy in 2007

```{r load-packages, warning=FALSE, message=FALSE}
# This loads ggplot2, dplyr, and other tidyverse packages you'll need
library(tidyverse)
library(gapminder)
```

Let's first look at the first few rows of data:

```{r view-data}
head(gapminder)
```

Right now, the `gapminder` data frame contains rows for all years for all countries. We only want to look at 2007, so we create a new data frame that keeps only the rows for 2007.[^pipe]

[^pipe]: {-}
  Note how there's a weird sequence of characters: `%>%`. This is called a *pipe* and lets you chain functions together. We could have also written this as `gapminder_2007 <- filter(gapminder, year == 2007)`.

```{r filter-2007}
gapminder_2007 <- gapminder %>%
  filter(year == 2007)

head(gapminder_2007)
```
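
If you want to convince yourself that the piped and non-piped versions really do the same thing, here is a small sketch that builds both and compares them. The names `piped_2007` and `nested_2007` are made up for this illustration and aren't part of the assignment file.

```{r check-filter-2007, warning=FALSE, message=FALSE}
library(dplyr)
library(gapminder)

# Piped version, written the same way as the filter-2007 chunk
piped_2007 <- gapminder %>%
  filter(year == 2007)

# Non-piped version from the side note about the pipe
nested_2007 <- filter(gapminder, year == 2007)

# Both approaches should return exactly the same data frame
identical(piped_2007, nested_2007)

# Only one year should be left after filtering
unique(piped_2007$year)
```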

Now we can plot a histogram of 2007 life expectancies with the default settings:

```{r plot-2007-1}
ggplot(gapminder_2007, aes(x = lifeExp)) +
  geom_histogram()
```

ggplot will use 30 histogram bins by default, but that's not always appropriate, so R will yell at you with a message in the console suggesting that you pick a better value. **Adjust the number of bins to 2, then 40, then 100.** **What's a good number for this data? Why?**

```{r plot-2007-2}
ggplot(gapminder_2007, aes(x = lifeExp)) +
  geom_histogram(bins = 2)
```
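
Instead of choosing the number of bins, you can also choose how wide each bin is with `binwidth`. The sketch below assumes a width of 5 years, which is only an illustrative value and not the "right" answer to the question above. The `boundary = 0` and `color = "white"` settings are also optional choices that make the bin edges land on round numbers and easier to see.

```{r plot-2007-binwidth, warning=FALSE, message=FALSE}
library(ggplot2)
library(dplyr)
library(gapminder)

# Recreate the 2007 data so this chunk runs on its own
gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# binwidth = 5 means each bar covers 5 years of life expectancy
# (5 is an assumption for illustration; try other widths too)
p_binwidth <- ggplot(gapminder_2007, aes(x = lifeExp)) +
  geom_histogram(binwidth = 5, boundary = 0, color = "white")

p_binwidth
```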

## Average life expectancy in 2007 by continent

We're also interested in the differences in life expectancy across continents. First, we can group all rows by continent and calculate the mean:[^pipe2]

[^pipe2]: {-}
  This is where the `%>%` operator is actually super useful. Remember that it lets you chain functions together, which means we can read these commands as a set of instructions: take the `gapminder` data frame, filter it, group it by continent, and summarize each group by calculating the mean. Without using the `%>%`, we could write this same chain like this: `summarize(group_by(filter(gapminder, year == 2007), continent), avg_life_exp = mean(lifeExp))`. But that's *awful* and impossible to read and full of parentheses that can easily be mismatched.

```{r calc-mean}
gapminder_cont_2007 <- gapminder %>%
  filter(year == 2007) %>% 
  group_by(continent) %>%
  summarize(avg_life_exp = mean(lifeExp))

head(gapminder_cont_2007)
```
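
A third way to write the same chain is to save each step as its own object. This sketch is only meant to show what each step in the pipeline hands to the next one. The object names (`only_2007`, `by_continent`, `stepwise_means`, `piped_means`) are invented for this example.

```{r calc-mean-steps, warning=FALSE, message=FALSE}
library(dplyr)
library(gapminder)

# Step 1: keep only the rows for 2007
only_2007 <- filter(gapminder, year == 2007)

# Step 2: mark the rows as belonging to continent groups
by_continent <- group_by(only_2007, continent)

# Step 3: collapse each group into a single average
stepwise_means <- summarize(by_continent, avg_life_exp = mean(lifeExp))

# The same three steps written as one pipeline
piped_means <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(avg_life_exp = mean(lifeExp))

# Both versions should give the same table of averages
all.equal(stepwise_means, piped_means)
```

The piped version does the same work without leaving extra objects lying around in your environment.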

Let's plot these averages as a bar chart:

```{r plot-2007-bar}
ggplot(gapminder_cont_2007, aes(x = continent, y = avg_life_exp, fill = continent)) + 
  geom_col()
```

Then, let's plot them as density distributions. We don't need to use the summarized data frame for this, just the original filtered `gapminder_2007` data frame:

```{r plot-2007-density}
ggplot(gapminder_2007, aes(x = lifeExp, fill = continent)) + 
  geom_density()
```
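
With the default settings, the filled density curves sit on top of each other and hide whatever is behind them. One possible tweak, sketched here, is to make the fills partly transparent with `alpha`. The value 0.5 is an arbitrary choice for illustration.

```{r plot-2007-density-alpha, warning=FALSE, message=FALSE}
library(ggplot2)
library(dplyr)
library(gapminder)

# Recreate the 2007 data so this chunk runs on its own
gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# alpha = 0.5 makes each fill half transparent so overlapping
# curves stay visible (0.5 is an assumption; try other values)
p_density <- ggplot(gapminder_2007, aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.5)

p_density
```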

Now let's plot life expectancies as violin charts. These are the density distributions turned sideways and mirrored:

```{r plot-2007-violin}
ggplot(gapminder_2007, aes(x = continent, y = lifeExp, fill = continent)) + 
  geom_violin()
```

Finally, we can add actual points of data for each country to the violin chart:

```{r plot-2007-violin-points}
ggplot(gapminder_2007, aes(x = continent, y = lifeExp, fill = continent)) + 
  geom_violin() +
  geom_point()
```
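
Because `geom_point()` stacks every country in a continent on the same vertical line, points with similar life expectancies can cover each other up. A possible variation, sketched here, is to swap in `geom_jitter()`, which nudges the points sideways by a small random amount. The `width` and `alpha` values are arbitrary choices for illustration, and `height = 0` keeps the life expectancy values themselves from being moved.

```{r plot-2007-violin-jitter, warning=FALSE, message=FALSE}
library(ggplot2)
library(dplyr)
library(gapminder)

# Recreate the 2007 data so this chunk runs on its own
gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# Jitter only left/right; height = 0 leaves the y values untouched
# (width = 0.1 and alpha = 0.6 are assumptions; adjust to taste)
p_jitter <- ggplot(gapminder_2007, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_violin() +
  geom_jitter(width = 0.1, height = 0, alpha = 0.6)

p_jitter
```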

The bar chart, density plot, violin plot, and violin plot + points each show a different way of looking at a single number: the average life expectancy in each continent. **Answer these questions:**

- Which plot is most helpful?
- Which ones show variability?
- What's going on with Oceania?

[E-mail me](mailto:andrew_heiss@byu.edu) the answers to the questions posed in this example.