---
title: "Single numbers and parts of a whole"
date: "2017-09-12"
---

# Task 1: Reflection memo

Write a 500-word memo about [the assigned readings](/reading/02-reading/) for this week. You can use some of the prompt questions there if you want. As you write the memo, also consider these central questions:

- How do these readings connect to our main goal of discovering truth?
- How does what I just read apply to me?
- How can this be useful to me?

**[E-mail me](mailto:andrew_heiss@byu.edu) a PDF of the memo.**

# Task 2: Playing with R

This example uses data from the [Gapminder project](https://www.gapminder.org/).[^gapminder] You'll need to install the `gapminder` R package first. Install it either with the "Packages" panel in RStudio or by typing `install.packages("gapminder")` in the R console.

[^gapminder]: {-} You may have seen Hans Rosling's [delightful TED talk](https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen) showing how global health and wealth have been increasing. If you haven't, you should watch it. Sadly, Hans died in February 2017.

For this first R-based assignment, you won't do any actual coding. [Download this file](https://raw.githubusercontent.com/andrewheiss/dataviz.andrewheiss.com/master/content/assignment/02-assignment.Rmd),[^download-note] open it in RStudio, and walk through the examples in RStudio on your computer. If you place your cursor on some R code and press "⌘ + enter" (for macOS users) or "ctrl + enter" (for Windows users), RStudio will send that line to the console and run it.

[^download-note]: {-} Your browser might show the file instead of downloading it. If that's the case, you can copy/paste the code from the browser to RStudio. In RStudio, go to "File" > "New" > "New R Markdown…", click "OK" with the default options, delete all the placeholder code/text in the new file, and paste the example code in the now-blank file.

There are a few questions that you'll need to answer, but that's all.

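
Before walking through the examples, it can help to confirm that the packages the file loads (`tidyverse` and `gapminder`) are actually installed. The snippet below is only a sketch of one way to check from the R console; `missing_packages()` is a hypothetical helper written for this illustration, not part of any package, and the install line is left commented out so nothing is installed unless you choose to run it.

```r
# Hypothetical helper (not from any package): return the names in `pkgs`
# that are not installed on this computer
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# The two packages loaded at the top of the example file
needed <- c("tidyverse", "gapminder")
to_install <- missing_packages(needed)

if (length(to_install) > 0) {
  message("Still missing: ", paste(to_install, collapse = ", "))
  # Uncomment to install them from the console:
  # install.packages(to_install)
} else {
  message("All set: both packages are installed.")
}
```

If the message lists a package, install it with the "Packages" panel or with `install.packages()` as described above, then run the check again.
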
## Life expectancy in 2007

```{r load-packages, warning=FALSE, message=FALSE}
# This loads ggplot, dplyr, and other packages you'll need
library(tidyverse)
library(gapminder)
```

Let's first look at the first few rows of data:

```{r view-data}
head(gapminder)
```

Right now, the `gapminder` data frame contains rows for all years for all countries. We want to only look at 2007, so we create a new data frame that keeps only the rows for 2007.[^pipe]

[^pipe]: {-} Note how there's a weird sequence of characters: `%>%`. This is called a *pipe* and lets you chain functions together. We could have also written this as `gapminder_2007 <- filter(gapminder, year == 2007)`.

```{r filter-2007}
gapminder_2007 <- gapminder %>%
  filter(year == 2007)

head(gapminder_2007)
```

Now we can plot a histogram of 2007 life expectancies with the default settings:

```{r plot-2007-1}
ggplot(gapminder_2007, aes(x = lifeExp)) +
  geom_histogram()
```

R will use 30 histogram bins by default, but that's not always appropriate, and it will yell at you (with a message in the console) for relying on the default. **Adjust the number of bins to 2, then 40, then 100.** **What's a good number for this data? Why?**

```{r plot-2007-2}
ggplot(gapminder_2007, aes(x = lifeExp)) +
  geom_histogram(bins = 2)
```

## Average life expectancy in 2007 by continent

We're also interested in the differences in life expectancy across continents. First, we can group all rows by continent and calculate the mean:[^pipe2]

[^pipe2]: {-} This is where the `%>%` operator is actually super useful. Remember that it lets you chain functions together, which means we can read these commands as a set of instructions: take the `gapminder` data frame, filter it, group it by continent, and summarize each group by calculating the mean. Without using the `%>%`, we could write this same chain like this: `summarize(group_by(filter(gapminder, year == 2007), continent), avg_life_exp = mean(lifeExp))`.
But that's *awful* and impossible to read and full of parentheses that can easily be mismatched.

```{r calc-mean}
gapminder_cont_2007 <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(avg_life_exp = mean(lifeExp))

head(gapminder_cont_2007)
```

Let's plot these averages as a bar chart:

```{r plot-2007-bar}
ggplot(gapminder_cont_2007, aes(x = continent, y = avg_life_exp, fill = continent)) +
  geom_col()
```

Then, let's plot them as density distributions. We don't need to use the summarized data frame for this, just the original filtered `gapminder_2007` data frame:

```{r plot-2007-density}
ggplot(gapminder_2007, aes(x = lifeExp, fill = continent)) +
  geom_density()
```

Now let's plot life expectancies as violin charts. These are the density distributions turned sideways:

```{r plot-2007-violin}
ggplot(gapminder_2007, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_violin()
```

Finally, we can add actual points of data for each country to the violin chart:

```{r plot-2007-violin-points}
ggplot(gapminder_2007, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_violin() +
  geom_point()
```

The bar chart, density plot, violin plot, and violin plot + points each show a different way of looking at a single number: the average life expectancy in each continent. **Answer these questions:**

- Which plot is most helpful?
- Which ones show variability?
- What's going on with Oceania?

[E-mail me](mailto:andrew_heiss@byu.edu) the answers to the questions posed in this example.