---
title: "Data Visualization With ggplot2"
author: "Dr. Hua Zhou"
date: "Jan 23, 2018"
output:
ioslides_presentation: default
subtitle: Biostat M280
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
fig.width = 5, fig.height = 3.5, fig.align = 'center',
cache = TRUE)
```
## Outline
We will spend next couple lectures studying R. I'll closely follow a few great books by Hadley Wickham.
* Data wrangling (import, visualization, transformation, tidy).
[R for Data Science](http://r4ds.had.co.nz) by Garrett Grolemund and Hadley Wickham.
* R programming, Rcpp.
[Advanced R](http://adv-r.had.co.nz) by Hadley Wickham.
* R package development.
[R Packages](http://r-pkgs.had.co.nz) by Hadley Wickham.
* Web applications.
* Interface with SQL and Apache Spark.
##
A typical data science project:
## Tidyverse
- `tidyverse` is a collection of R packages that make data wrangling easy.
- Install `tidyverse` from RStudio menu `Tools -> Install Packages...` or
```{r, eval = FALSE}
install.packages("tidyverse")
```
- After installation, load `tidyverse` by
```{r}
library("tidyverse")
```
## `mpg` data {.smaller}
- `mpg` data is available from the `ggplot2` package:
```{r}
mpg
```
- `displ`: engine size, in litres.
`hwy`: highway fuel efficiency, in mile per gallen (mpg).
# Aesthetic mappings | r4ds chapter 3.3
## Scatter plot {.smaller}
- `hwy` vs `displ`
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
```
- Check available aesthetics for a geometric object by `?geom_point`.
## Color of points {.smaller}
- Color points according to `class`:
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
```
## Size of points {.smaller}
- Assign different sizes to points according to `class`:
```{r, warning = FALSE}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
```
## Transparency of points {.smaller}
- Assign different transparency levels to points according to `class`:
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
```
## Shape of points {.smaller}
- Assign different shapes to points according to `class`:
```{r, warning = FALSE}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
```
- Maximum of 6 shapes at a time. By default, additional groups will go unplotted.
## Manual setting of an aesthetic {.smaller}
- Set the color of all points to be blue:
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
```
# Facets | r4ds chapter 3.5
## Facets {.smaller}
- Facets divide a plot into subplots based on the values of one or more discrete variables.
- A subplot for each car type:
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
```
----
- A subplot for each car type and drive:
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ class)
```
# Geometric objects | r4ds chapter 3.6
## `geom_smooth()`: smooth line
- `hwy` vs `displ` line:
```{r, fig.width = 4.5, fig.height = 3, message = FALSE}
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
```
## Different line types
- Different line types according to `drv`:
```{r, fig.width = 4.5, fig.height = 3, , message = FALSE}
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
```
## Different line colors
- Different line colors according to `drv`:
```{r, fig.width = 4.5, fig.height = 3, message = FALSE}
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, color = drv))
```
## Points and lines
- Lines overlaid over scatter plot:
```{r, fig.width = 4.5, fig.height = 3, message = FALSE}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
```
----
- Same as
```{r, fig.width = 4.5, fig.height = 3, message = FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() + geom_smooth()
```
## Aesthetics for each geometric object
- Different aesthetics in different layers:
```{r, fig.width = 4.5, fig.height = 3, message = FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
```
# Bar charts | r4ds chapter 3.7
## `diamonds` data {.smaller}
- `diamonds` data:
```{r}
diamonds
```
## Bar chart
- `geom_bar()` creates bar chart:
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
```
----
- Bar charts, like histograms, frequency polygons, smoothers, and boxplots, plot some computed variables instead of raw data.
- Check available computed variables for a geometric object via help:
```{r, eval = FALSE}
?geom_bar
```
----
- Use `stat_count()` directly:
```{r}
ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut))
```
- `stat_count()` has a default geom `geom_bar()`.
----
- Display frequency instead of counts:
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
```
----
- Color bar:
```{r, results = 'hold'}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, colour = cut))
```
----
- Fill color:
```{r, results = 'hold'}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
```
----
- Fill color according to another variable:
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
```
# Positional arguments | r4ds chapter 3.8
----
- `position_gitter()` add random noise to X and Y position of each
element to avoid overplotting:
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
```
----
- `geom_jitter()` is similar:
```{r}
ggplot(data = mpg) +
geom_jitter(mapping = aes(x = displ, y = hwy))
```
----
- `position_fill()` stack elements on top of one another,
normalize height:
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
```
----
- `position_dodge()` arrange elements side by side:
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
```
----
- `position_stack()` stack elements on top of each other:
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack")
```
# Coordinate systems | r4ds chapter 3.9
----
- A boxplot:
```{r}
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
```
----
- `coord_cartesian()` is the default cartesian coordinate system:
```{r}
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_cartesian(xlim = c(0, 5))
```
----
- `coord_fixed()` specifies aspect ratio:
```{r}
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_fixed(ratio = 1/2)
```
----
- `coord_flip()` flips x- and y- axis:
```{r}
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_flip()
```
----
- A map:
```{r}
library("maps")
nz <- map_data("nz")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black")
```
----
- `coord_quickmap()` puts maps in scale:
```{r}
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
coord_quickmap()
```
# Graphics for communications | r4ds chapter 28
## Title {.smaller}
- Figure title should be descriptive:
```{r, fig.width = 4.5, fig.height = 3, message = FALSE}
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
labs(title = "Fuel efficiency generally decreases with engine size")
```
## Subtitle and caption {.smaller}
-
```{r, fig.width = 4.5, fig.height = 3, message = FALSE}
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
labs(
title = "Fuel efficiency generally decreases with engine size",
subtitle = "Two seaters (sports cars) are an exception because of their light weight",
caption = "Data from fueleconomy.gov"
)
```
## Axis labels {.smaller}
-
```{r, fig.width = 4.5, fig.height = 3, message = FALSE}
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_smooth(se = FALSE) +
labs(
x = "Engine displacement (L)",
y = "Highway fuel economy (mpg)"
)
```
## Math equations {.smaller}
-
```{r, fig.width = 4.5, fig.height = 3}
df <- tibble(x = runif(10), y = runif(10))
ggplot(df, aes(x, y)) + geom_point() +
labs(
x = quote(sum(x[i] ^ 2, i == 1, n)),
y = quote(alpha + beta + frac(delta, theta))
)
```
- `?plotmath`
## Annotations {.smaller}
- Create labels
```{r}
best_in_class <- mpg %>%
group_by(class) %>%
filter(row_number(desc(hwy)) == 1)
best_in_class
```
---
- Annotate points
```{r}
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(colour = class)) +
geom_text(aes(label = model), data = best_in_class)
```
----
- `ggrepel` package automatically adjust labels so that they don’t overlap:
```{r, fig.width = 4.5, fig.height = 3}
library("ggrepel")
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_point(size = 3, shape = 1, data = best_in_class) +
ggrepel::geom_label_repel(aes(label = model), data = best_in_class)
```
## Scales
-
```{r, eval = FALSE}
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class))
```
automatically adds scales
```{r, eval = FALSE}
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
scale_x_continuous() +
scale_y_continuous() +
scale_colour_discrete()
```
----
- `breaks`
```{r}
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_continuous(breaks = seq(15, 40, by = 5))
```
----
- `labels`
```{r}
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous(labels = NULL) +
scale_y_continuous(labels = NULL)
```
----
- Plot y-axis at log scale:
```{r}
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
scale_y_log10()
```
----
- Plot x-axis in reverse order:
```{r}
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
scale_x_reverse()
```
## Legends
- Set legend position: `"left"`, `"right"`, `"top"`, `"bottom"`, `none`:
```{r, collapse = TRUE}
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
theme(legend.position = "left")
```
----
- See following link for more details on how to change title, labels, ... of a legend.
## Zooming
- Without clipping (removes unseen data points)
```{r, message = FALSE}
ggplot(mpg, mapping = aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))
```
----
- With clipping (removes unseen data points)
```{r, message = FALSE, warning = FALSE}
ggplot(mpg, mapping = aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
xlim(5, 7) + ylim(10, 30)
```
----
-
```{r, message = FALSE, warning = FALSE}
ggplot(mpg, mapping = aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
scale_x_continuous(limits = c(5, 7)) +
scale_y_continuous(limits = c(10, 30))
```
----
-
```{r, message = FALSE}
mpg %>%
filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) %>%
ggplot(aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth()
```
## Themes
-
```{r, message = FALSE}
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
theme_bw()
```
----
## Saving plots
```{r, collapse = TRUE}
ggplot(mpg, aes(displ, hwy)) + geom_point()
ggsave("my-plot.pdf")
```