---
title: "Introduction to ggplot2"
description: |
  A discussion of ggplot2 terminology, and an example of iteratively refining a
  simple scatterplot.
author:
  - name: Kris Sankaran
    affiliation: UW Madison
date: 01-26-2021
output:
  distill::distill_article:
    self_contained: false
---

```{r, echo = FALSE}
library("knitr")
knitr::opts_chunk$set(cache = FALSE, message = FALSE, warning = FALSE, echo = TRUE)
```

_[Reading](https://rafalab.github.io/dsbook/ggplot2.html), [Recording](https://mediaspace.wisc.edu/media/1_w7aw5yln), [Rmarkdown](https://raw.githubusercontent.com/krisrs1128/stat479/master/_posts/2021-01-12-week1-2/week1-2.Rmd)_

[ggplot2](https://ggplot2.tidyverse.org/) is an R implementation of the _Grammar of Graphics_. The idea is to define the basic "words" from which visualizations are built, and then let users compose them in original ways. This is in contrast to systems with prespecified chart types, where the user is forced to pick from a limited dropdown menu of plots. Just like in ordinary language, the creative combination of simple building blocks can support a very wide range of expression.

These are libraries we'll use in this lecture.

```{r}
library("dplyr")
library("dslabs")
library("ggplot2")
library("ggrepel")
library("scales")
```

### Components of a Graph

We're going to create this plot in these notes.

```{r, echo = FALSE}
data(murders)
r <- murders %>%
  summarize(rate = sum(total) / sum(population)) %>%
  pull(rate)

ggplot(murders, aes(x = population, y = total)) +
  geom_abline(intercept = log10(r), size = 0.4, col = "#b3b3b3") +
  # placed before geom_point so that the points appear on top of the link lines
  geom_text_repel(aes(label = abb), segment.size = 0.2) +
  geom_point(aes(col = region)) +
  # unit_format converts scientific notation to readable labels
  scale_x_log10(labels = unit_format(unit = "million", scale = 1e-6)) +
  scale_y_log10() +
  scale_color_brewer(palette = "Set2") +
  labs(
    x = "Population (log scale)",
    y = "Total number of murders (log scale)",
    color = "Region",
    title = "US Gun Murders in 2010"
  ) +
  theme_bw() +
  theme(
    legend.position = "top",
    panel.grid.minor = element_blank()
  )
```

Every ggplot2 plot is made from three components:

* Data: This is the `data.frame` that we want to visualize.
* Geometry: These are the types of visual marks that appear on the plot.
* Aesthetic Mapping: This links the data with the visual marks.

### The Data

Let’s load up the data. Each row is an observation, and each column is an attribute that describes the observation. This is important because each mark that you see on a ggplot -- a line, a point, a tile, ... -- had to start out as a row within an R `data.frame`. The visual properties of the mark (e.g., color) are determined by the values along columns. Data organized this way are often referred to as tidy data, and we'll spend a full week discussing this topic.

Here's an example of the data above in tidy format:

```{r}
data(murders)
head(murders)
```

This is one example of how the same information might be stored in a non-tidy way, making visualization much harder.

```{r}
non_tidy <- data.frame(t(murders))
colnames(non_tidy) <- non_tidy[1, ]
non_tidy <- non_tidy[-1, ]
non_tidy[, 1:6]
```

Often, one of the hardest parts in making a ggplot2 plot is not coming up with the right ggplot2 commands, but reshaping the data so that it’s in a tidy format.

### Geometry

The words in the grammar of graphics are the geometry layers. We can associate each row of a data frame with points, lines, tiles, etc., just by referring to the appropriate geom in ggplot2. A typical plot will compose a chain of layers on top of a dataset,

> ggplot(data) + [layer 1] + [layer 2] + ...

For example, by deconstructing the plot above, we would expect to have point and text layers. For now, let's just tell the plot to put all the `geom`'s at the origin.

```{r}
ggplot(murders) +
  geom_point(x = 0, y = 0) +
  geom_text(x = 0, y = 0, label = "test")
```

You can see all the types of `geoms` in the [cheat sheet](https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf). We'll be experimenting with a few of these in a [later lecture](../2021-01-18-week1-5/index.html).

### Aesthetic mappings

Aesthetic mappings make the connection between the data and the geometry. It's the piece that translates abstract data fields into visual properties. Analyzing the original graph, we recognize these specific mappings,

* State Population → $x$-axis coordinate
* Number of murders → $y$-axis coordinate
* Geographical region → color

To establish these mappings, we need to use the `aes` function. Notice that column names don't have to be quoted -- ggplot2 knows to refer back to the `murders` data frame in `ggplot(murders)`.

```{r}
ggplot(murders) +
  geom_point(aes(x = population, y = total, col = region))
```

The original plot used a log-scale. To transform the x and y axes, we can use [scales](https://ggplot2.tidyverse.org/reference/index.html#section-scales).

```{r}
ggplot(murders) +
  geom_point(aes(x = population, y = total, col = region)) +
  scale_x_log10() +
  scale_y_log10()
```

One nuance is that scales aren't limited to $x$ and $y$ transformations. They can be applied to modify any relationship between a data field and its appearance on the page. For example, this changes the mapping between the region field and circle color.

```{r}
ggplot(murders) +
  geom_point(aes(x = population, y = total, col = region)) +
  scale_x_log10() +
  scale_y_log10() +
  # exercise: find better colors using https://imagecolorpicker.com/
  scale_color_manual(values = c("#6a4078", "#aa1518", "#9ecaf8", "#50838c"))
```

A problem with this graph is that it doesn't tell us which state each point corresponds to. For that, we'll need text labels. We can encode the coordinates for these marks again using `aes`, but this time within a `geom_text` layer.

```{r}
ggplot(murders) +
  geom_point(aes(x = population, y = total, col = region)) +
  geom_text(
    aes(x = population, y = total, label = abb),
    nudge_x = 0.08 # what would happen if I remove this?
  ) +
  scale_x_log10() +
  scale_y_log10()
```

Note that each type of layer uses different visual properties to encode the data -- the argument `label` is only available for the `geom_text` layer. You can see which aesthetic mappings are required for each type of `geom` by checking that `geom`'s documentation page, under the Aesthetics heading.

It's usually a good thing to make your code as concise as possible. For ggplot2, we can achieve this by sharing elements across `aes` calls (e.g., not having to type `population` and `total` twice). This can be done by defining a "global" aesthetic, putting it inside the initial `ggplot` call.

```{r}
ggplot(murders, aes(x = population, y = total)) +
  geom_point(aes(col = region)) +
  geom_text(aes(label = abb), nudge_x = 0.08) +
  scale_x_log10() +
  scale_y_log10()
```

### Finishing touches

How can we improve the readability of this plot? You might already have ideas:

1. Prevent labels from overlapping. It's impossible to read some of the state names.
2. Add a line showing the national rate. This serves as a point of reference, allowing us to see whether an individual state is above or below the national murder rate.
3. Give meaningful axis / legend labels and a title.
4. Move the legend to the top of the figure. Right now, we're wasting a lot of visual real estate on the right-hand side, just to let people know what each color means.
5. Use a better color theme.

For 1., the `ggrepel` package finds better state name positions, drawing links when necessary.

```{r}
ggplot(murders, aes(x = population, y = total)) +
  # placed before geom_point so that the points appear on top of the link lines
  geom_text_repel(aes(label = abb), segment.size = 0.2) +
  geom_point(aes(col = region)) +
  scale_x_log10() +
  scale_y_log10()
```

For 2., let's first compute the national murder rate,

```{r}
r <- murders %>%
  summarize(rate = sum(total) / sum(population)) %>%
  pull(rate)
r
```

Now, we can use this rate to draw a reference line with a `geom_abline` layer, which encodes a slope and intercept as a line on a graph. States with exactly the national rate satisfy `total = r * population`. On log-log axes, this becomes a line with slope 1 (the default when only an intercept is given) and intercept `log10(r)`.

```{r}
ggplot(murders, aes(x = population, y = total)) +
  geom_abline(intercept = log10(r), size = 0.4, col = "#b3b3b3") +
  geom_text_repel(aes(label = abb), segment.size = 0.2) +
  geom_point(aes(col = region)) +
  scale_x_log10() +
  scale_y_log10()
```

For 3. and 4., we can add a `labs` layer to write labels and a `theme` to reposition the legend. I used `unit_format` from the `scales` package to change the scientific notation in the $x$-axis labels to something more readable.

```{r}
ggplot(murders, aes(x = population, y = total)) +
  geom_abline(intercept = log10(r), size = 0.4, col = "#b3b3b3") +
  geom_text_repel(aes(label = abb), segment.size = 0.2) +
  geom_point(aes(col = region)) +
  # unit_format converts scientific notation to readable labels
  scale_x_log10(labels = unit_format(unit = "million", scale = 1e-6)) +
  scale_y_log10() +
  labs(
    x = "Population (log scale)",
    y = "Total number of murders (log scale)",
    color = "Region",
    title = "US Gun Murders in 2010"
  ) +
  theme(legend.position = "top")
```

For 5., I find the gray background with reference lines a bit distracting. We can simplify the appearance using `theme_bw`. I also like the colorbrewer palette, which can be used by calling a different color scale.

```{r}
ggplot(murders, aes(x = population, y = total)) +
  geom_abline(intercept = log10(r), size = 0.4, col = "#b3b3b3") +
  geom_text_repel(aes(label = abb), segment.size = 0.2) +
  geom_point(aes(col = region)) +
  scale_x_log10(labels = unit_format(unit = "million", scale = 1e-6)) +
  scale_y_log10() +
  scale_color_brewer(palette = "Set2") +
  labs(
    x = "Population (log scale)",
    y = "Total number of murders (log scale)",
    color = "Region",
    title = "US Gun Murders in 2010"
  ) +
  theme_bw() +
  theme(
    legend.position = "top",
    panel.grid.minor = element_blank()
  )
```

Some bonus exercises, which will train you to look at your graphics more carefully, as well as build your familiarity with ggplot2.

* Try reducing the size of the text labels. Hint: Use the `size` argument in `geom_text_repel`.
* Increase the size of the circles in the legend. Hint: Use `override.aes` within a `guide`.
* Reorder the regions in the legend. Hint: Reset the factor levels in the `region` field of the `murders` data.frame.
* Only show labels for a subset of states that are far from the national rate. Hint: Filter the `murders` data.frame, and use a `data` field specific to the `geom_text_repel` layer.
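
If you get stuck, here is a minimal sketch of one possible starting point for the first, second, and last hints. It is not the only way to approach them. The name `far_from_rate`, the helper column `ratio`, and the factor-of-2 cutoff are illustrative assumptions, not part of the original notes. The chunk reloads the libraries and data so that it can be run on its own.

```{r}
library("dplyr")
library("dslabs")
library("ggplot2")
library("ggrepel")

data(murders)

# National rate, computed the same way as earlier in these notes
r <- murders %>%
  summarize(rate = sum(total) / sum(population)) %>%
  pull(rate)

# Assumed cutoff: keep states whose rate is more than twice, or less than
# half, the national rate (the factor of 2 is an arbitrary choice)
far_from_rate <- murders %>%
  mutate(ratio = (total / population) / r) %>%
  filter(ratio > 2 | ratio < 0.5)

ggplot(murders, aes(x = population, y = total)) +
  geom_abline(intercept = log10(r), color = "#b3b3b3") +
  # layer-specific data: only the filtered states are labeled,
  # and size = 3 makes the label text smaller than the default
  geom_text_repel(data = far_from_rate, aes(label = abb), size = 3) +
  geom_point(aes(color = region)) +
  scale_x_log10() +
  scale_y_log10() +
  # enlarge the legend keys without changing the points in the panel
  guides(color = guide_legend(override.aes = list(size = 4)))
```

Because `x` and `y` are set in the global `aes`, the text layer only needs its own `data` and `label`. Try changing the cutoff and see how many labels remain.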