This document is intended to help you quickly learn how to perform basic data visualization using ggplot2.
An accompanying YouTube playlist that walks through this document is available by following the link here. I have also included direct, relevant video links throughout the file (e.g., immediately after the relevant heading). An html version of this document can be downloaded to your current working directory (run `getwd()` in the Console to see where this is) by running the following command:
download.file("https://raw.githubusercontent.com/jfrench/DataWrangleViz/master/03-basic-data-viz-with-ggplot2.nb.html", "03-basic-data-viz-with-ggplot2.nb.html")
The raw R Markdown code used to generate the html file can be obtained by toggling the “Code” box in the upper-right corner of the html file and selecting “Download Rmd”.
Graphics ecosystems in R (Video: YouTube, Panopto)
There are three main graphics ecosystems in R:
- base
- lattice
- ggplot2
base graphics are traditional S-like graphics.
- Run `?graphics-package` in the Console for more details.
- These are the graphics you get by default when you use the `plot` function, the `hist` function, the `boxplot` function, etc.
lattice graphics are an implementation of Trellis graphics (Becker, Cleveland, and Shyu 1996) for R.
- Run `?lattice` in the Console for additional information about the lattice package.
- The lattice package focuses on elegantly plotting multivariate data and makes it easy to distinguish different levels of `factor` data.
- lattice and base graphics do not interact with each other (changing aspects of one graphics system has no impact on the other).
ggplot2 is a layered graphical system based on implementing the Grammar of Graphics (Wilkinson 2005).
- It has gained widespread popularity because of its friendliness for visual exploration of data by data scientists.
- It provides an elegant approach for constructing complex plots in a systematic way.
- Run `?ggplot2-package` in the Console for more information.
We demonstrate use of the three plotting ecosystems using the `penguins` data set from the palmerpenguins package (Horst, Hill, and Gorman 2020), which will be discussed in more detail later. The code below creates grouped scatter plots comparing bill length (mm) and body mass (g) of different penguin species using each of the ecosystems.
# load data
data(penguins, package = "palmerpenguins")
# base graphics
plot(bill_length_mm ~ body_mass_g, data = penguins, col = penguins$species)
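# legend colors reuse the palette indices (1, 2, 3) that plot() assigns to the factor levels
legend(x = "topleft", legend = levels(penguins$species), col = seq_along(levels(penguins$species)),
pch = 1)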

# lattice graphics
lattice::xyplot(bill_length_mm ~ body_mass_g, data = penguins, groups = species, auto.key = TRUE)

# ggplot2 graphics
ggplot2::ggplot(data = penguins) + ggplot2::geom_point(mapping = ggplot2::aes(x = body_mass_g,
y = bill_length_mm, color = species, shape = species))
Warning: Removed 2 rows containing missing values (geom_point).

Simple examples
The `penguins` data (Video: YouTube, Panopto)
We will use the `penguins` data set in the palmerpenguins package (Horst, Hill, and Gorman 2020) to illustrate some of the ways that ggplot2 can be used.
The `penguins` data set provides data related to various penguin species measured in the Palmer Archipelago (Antarctica), originally provided by Gorman, Williams, and Fraser (2014). We start by loading the data into memory.
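data(penguins, package = "palmerpenguins")
# attach ggplot2 so that ggplot(), aes(), and the geom_* functions used below can be called without the ggplot2:: prefix
library(ggplot2)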
The data set includes 344 observations of 8 variables. The variables are:
- `species`: a `factor` indicating the penguin species
- `island`: a `factor` indicating the island on which the penguin was observed
- `bill_length_mm`: a `numeric` variable indicating the bill length in millimeters
- `bill_depth_mm`: a `numeric` variable indicating the bill depth in millimeters
- `flipper_length_mm`: an `integer` variable indicating the flipper length in millimeters
- `body_mass_g`: an `integer` variable indicating the body mass in grams
- `sex`: a `factor` indicating the penguin sex (`female`, `male`)
- `year`: an `integer` denoting the study year the penguin was observed (`2007`, `2008`, or `2009`)
We construct a bar chart of `species`. A bar chart uses bars to indicate the number of observations at each level of a `factor`. Alternatively, the bars can indicate the relative frequency (i.e., proportion) of observations at each level of the `factor`.
ggplot(penguins) + geom_bar(aes(x = species))

Relatively speaking, we can see that Adelie penguins were most frequently observed, followed by Gentoo penguins, and then Chinstrap penguins.
We can color the bars by `species` for visual clarity by specifying the `fill` aesthetic.
ggplot(penguins) + geom_bar(aes(x = species, fill = species))

This doesn’t add any new information to the graphic but does make it more visually appealing. We could remove the legend to simplify the graphic, if we wanted.
ggplot(penguins) + geom_bar(aes(x = species, fill = species)) + theme(legend.position = "none")

A histogram is used to display the distribution of a continuous `numeric` variable. The range of the variable is partitioned into classes. The number of observations falling in each class is counted. A histogram draws a bar for each class with height corresponding to the number of observations in that class.
Let’s construct a histogram of bill length.
ggplot(penguins) + geom_histogram(aes(x = bill_length_mm))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite values (stat_bin).

The histogram is multimodal (i.e., has multiple prominent peaks), with 2 or 3 prominent peaks. There may be sub-populations we should distinguish. It isn’t easy to identify sub-populations with a histogram unless the sub-populations are split into separate histograms (which we will learn to do later with “facets”) or the bars for the different sub-populations are made semi-transparent.
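The `stat_bin()` message above suggests choosing the class width yourself. The following is a minimal sketch of doing so with the `binwidth` argument. It uses simulated stand-in values (the `toy` data frame is hypothetical, not the penguins data) so that it runs on its own, and it uses `layer_data()` to look at the classes and counts that the histogram draws.

```r
library(ggplot2)

# Hypothetical stand-in for bill lengths (mm): two simulated sub-populations,
# not the palmerpenguins measurements.
set.seed(1)
toy <- data.frame(
  bill_length_mm = c(rnorm(100, mean = 39, sd = 2.5),
                     rnorm(100, mean = 48, sd = 3))
)

# binwidth = 1 makes every class 1 mm wide instead of relying on the 30-bin default.
p_hist <- ggplot(toy) + geom_histogram(aes(x = bill_length_mm), binwidth = 1)
p_hist

# layer_data() returns the class boundaries and counts computed for the layer.
bins <- layer_data(p_hist)
head(bins[, c("xmin", "xmax", "count")])
```

A smaller `binwidth` shows more detail but also more noise, so it is usually worth trying a few values.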
A density plot is often preferred to a histogram because it’s more flexible, though it often provides similar information. A density plot is essentially a smoothed version of a histogram.
A density plot is appropriate for displaying the distribution of a continuous `numeric` variable and indicates the values for which the data is more “dense”. More specifically, density plots indicate the values of data you are most likely to observe. A higher density region means you are more likely to observe data with values in that region. There is a statistical definition we will not go into.
A simple density plot is shown below. We can’t learn a lot because of the sub-populations previously mentioned.
ggplot(penguins) + geom_density(aes(x = bill_length_mm))
Warning: Removed 2 rows containing non-finite values (stat_density).
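To make the idea of a “smoothed histogram” slightly more concrete, here is a rough sketch using the base R function `density()`, which computes a kernel density estimate (to my understanding, `geom_density` relies on the same kind of estimate by default). The values are simulated stand-ins, not the penguins data. The sketch checks one defining property of a density curve: the area under it is approximately 1.

```r
# Hypothetical stand-in values with two sub-populations (not the penguins data).
set.seed(1)
x <- c(rnorm(100, mean = 39, sd = 2.5), rnorm(100, mean = 48, sd = 3))

# Kernel density estimate evaluated on a grid of x values.
d <- density(x)

# Approximate the area under the curve with the trapezoid rule.
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area

# Location of the highest point of the estimated density (the main mode).
peak <- d$x[which.max(d$y)]
peak
```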

There are at least two modes in the `bill_length_mm` density plot.
Let’s create separate densities for the different penguin species by specifying the `color` aesthetic to scale different colors to the different levels of `species`.
ggplot(penguins) + geom_density(aes(x = bill_length_mm, color = species))
Warning: Removed 2 rows containing non-finite values (stat_density).

There are clear differences in the bill length of Adelie, Chinstrap, and Gentoo penguins. The bill lengths of the Adelie penguins are centered near 39 mm, while those of the Gentoo penguins are centered closer to 47 mm. The Chinstrap penguins have the greatest bill length, in general. It appears the Chinstrap penguins may have a further sub-population.
A weakness of the previous graphic is that the legend distinguishing the different `species` is a bit subtle. The colored line around the border may not be easy to distinguish for everyone. We could use a different `linetype` for each species or we could use different `color` and `linetype` for each species. We consider the different results below.
ggplot(penguins) + geom_density(aes(x = bill_length_mm, linetype = species))
Warning: Removed 2 rows containing non-finite values (stat_density).

ggplot(penguins) + geom_density(aes(x = bill_length_mm, col = species, linetype = species))
Warning: Removed 2 rows containing non-finite values (stat_density).

An even better option is to `fill` the densities for each species with different colors. However, the density curves will overlap, masking some of the information.
ggplot(penguins) + geom_density(aes(x = bill_length_mm, fill = species))
Warning: Removed 2 rows containing non-finite values (stat_density).

To address the masking issue, we can control the transparency of the colors using the `alpha` aesthetic. This value, between 0 and 1, controls how transparent the objects are: `0` means completely transparent and `1` means completely opaque. We will set `alpha = 0.3`. Notice that we specify this OUTSIDE the aesthetic mapping since we are setting the aesthetic manually instead of having ggplot2 scale a provided variable.
ggplot(penguins) + geom_density(aes(x = bill_length_mm, fill = species), alpha = 0.3)
Warning: Removed 2 rows containing non-finite values (stat_density).

Scaling
When a variable is mapped to an aesthetic, ggplot2 will assign a unique value to each unique level of the variable. This process is known as scaling.
Scaling is often not very exciting, but it can be important when customizing the look of a `ggplot`.
When a variable is mapped to an aesthetic other than `x` or `y`, ggplot2 will automatically add a legend to indicate how the variable was scaled.
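The difference between mapping a variable to an aesthetic (which triggers scaling) and setting an aesthetic to a fixed value can be checked directly. The sketch below uses a small hypothetical data frame (`toy` and its columns are made up for illustration) and `layer_data()` to inspect the colors that each plot actually uses.

```r
library(ggplot2)

# Hypothetical data with a two-level factor.
toy <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(2, 1, 4, 3),
  g = factor(c("a", "a", "b", "b"))
)

# Mapped: g is scaled, so each level gets its own color and a legend is drawn.
p_map <- ggplot(toy) + geom_point(aes(x = x, y = y, color = g))

# Set: the color is fixed outside aes(), so no scaling and no legend.
p_set <- ggplot(toy) + geom_point(aes(x = x, y = y), color = "blue")

mapped_colors <- unique(layer_data(p_map)$colour)
set_colors <- unique(layer_data(p_set)$colour)
mapped_colors
set_colors
```

The mapped plot ends up with one color per level of `g`, while the plot with the manually set color uses the single value that was supplied.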
A boxplot is a robust display of information for a continuous `numeric` variable. Boxplots are robust because they rely on robust statistics not easily affected by outliers.
In general, a boxplot:
- Draws a box extending from Q1 (the 0.25 quantile) to Q3 (the 0.75 quantile) of the data.
- A line between Q1 and Q3 indicates the median (the 0.5 quantile) of the data.
- A “whisker” extends from Q1 to the smallest observation that is not an outlier.
- A “whisker” extends from Q3 to the largest observation that is not an outlier.
- An outlier is typically defined as an observation smaller than Q1 - 1.5 (Q3 - Q1) or larger than Q3 + 1.5 (Q3 - Q1).
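The quantities in the list above can be computed by hand. The following sketch does so in base R for a small made-up vector (the values are hypothetical). Note that different software packages compute quartiles in slightly different ways, so the numbers here assume the default method of `quantile()`.

```r
# Hypothetical measurements containing one unusually large value.
x <- c(32, 36, 37, 38, 39, 40, 41, 42, 44, 60)

q1 <- unname(quantile(x, 0.25))  # lower edge of the box
q3 <- unname(quantile(x, 0.75))  # upper edge of the box
iqr <- q3 - q1                   # Q3 - Q1

# Observations beyond these fences are flagged as outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

# The whiskers end at the most extreme observations that are not outliers.
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

c(q1 = q1, median = median(x), q3 = q3)
outliers
whiskers
```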
Parallel boxplots are an effective tool for comparing the distribution of a variable across the levels of a `factor` variable.
Consider parallel boxplots of bill length distinguished by `species`.
ggplot(penguins) + geom_boxplot(aes(y = bill_length_mm, x = species))
Warning: Removed 2 rows containing non-finite values (stat_boxplot).

We can see that the typical responses for the Adelie penguins tend to be lower than for the Chinstrap and Gentoo penguins. The Chinstrap penguins tend to have slightly longer beaks than the Gentoo penguins. All three penguin species seem to have similar variability (i.e., the spread of the data is similar).
A weakness of boxplots is that they throw away a lot of information. For example, we don’t see the two sub-populations in the Chinstrap penguins that we noticed earlier when looking at density plots.
Adding a `fill` color to the boxplots (while removing the legend) might be a more visually appealing graphic.
ggplot(penguins) + geom_boxplot(aes(y = bill_length_mm, x = species, fill = species)) +
theme(legend.position = "none")
Warning: Removed 2 rows containing non-finite values (stat_boxplot).

A violin plot is like a cross between a boxplot and a density plot. A violin plot unites a density with its mirror image and then displays the unified object like a boxplot.
We construct violin plots of bill length for each penguin `species`.
ggplot(penguins) + geom_violin(aes(x = species, y = bill_length_mm))
Warning: Removed 2 rows containing non-finite values (stat_ydensity).

While the Adelie bill lengths are approximately symmetric and unimodal, both the Chinstrap and Gentoo bill lengths are bimodal and asymmetric.
A scatter plot draws the locations of (x, y) positions based on two `numeric` variables. In general, we use scatter plots to learn information about the association between the two variables.
- A positive association is present when both variables tend to increase together.
- A negative association is present when one variable tends to decrease when the other increases.
- A linear association is present when the relationship between the two variables is a straight line.
- A nonlinear association is present when the relationship between the two variables is a curve.
- It is possible that no clear association is present in the data.
Let’s consider a scatter plot of `bill_length_mm` versus `body_mass_g`.
ggplot(penguins) + geom_point(aes(x = body_mass_g, y = bill_length_mm))
Warning: Removed 2 rows containing missing values (geom_point).

We see a positive association in the graphic above. As `body_mass_g` increases, `bill_length_mm` also tends to increase. However, it appears that there may be some observations in the upper left part of the graph that do not match the overall linear trend.
This could mean:
- The data are noisy, which simply means the patterns aren’t as strong and clear as we would like.
- There is a third variable that is not accounted for in our plot but that has a relationship with the other two variables.
  - When we are not accounting for this variable, it is called a lurking variable.
  - When we account for a lurking variable in our analysis, it becomes a confounding variable.
  - We are defining lurking and confounding variables in a general sense and not with statistical rigor.
Let’s consider an analysis of `body_mass_g` and `bill_length_mm` that accounts for `species`. Specifically, we will use the `color` aesthetic so that each penguin species is shown in a different color.
ggplot(penguins) + geom_point(aes(x = body_mass_g, y = bill_length_mm, color = species))
Warning: Removed 2 rows containing missing values (geom_point).

Accounting for `species` in the analysis shows clear patterns for the three penguin species. Chinstrap penguins tend to have smaller body mass but longer bill lengths. Adelie penguins have smaller body mass and bill lengths. Gentoo penguins tend to have larger body mass and bill lengths than the other two species.
As previously mentioned, we want our plots to be as clear as possible. Some individuals have difficulty interpreting colors in graphics. We can add another distinguishing characteristic for the different species. In this case, we will change the `shape` of the plotted points for each species. This provides another aid for correctly interpreting the graphic.
ggplot(penguins) + geom_point(aes(x = body_mass_g, y = bill_length_mm, color = species,
shape = species))
Warning: Removed 2 rows containing missing values (geom_point).

A smooth in ggplot2 is a model that attempts to estimate the average relationship between a response variable (typically the `y` variable) and one or more predictor variables (typically the `x` and possibly other variables).
Adding a “smooth” to a plot can make relationships between variables even clearer. There are many different smoothing methods.
A smoothing layer can be added to a plot using `geom_smooth`. Many methods can be used for the smooth. ggplot2 takes the methods `lm` (linear model), `glm` (generalized linear model), `gam` (generalized additive model), `loess` (local fitting), or a function (e.g., `MASS::rlm` for a robust linear model). If you don’t know what these are, do not add them to your plot because they won’t help you to explain the data.
A simple linear regression model fits the straight line that minimizes the sum of the squared deviations between the y values and the estimated line. We add a simple linear regression line to the scatter plot by adding a `geom_smooth` layer.
ggplot(penguins) + geom_point(aes(x = body_mass_g, y = bill_length_mm, color = species,
shape = species)) + geom_smooth(aes(x = body_mass_g, y = bill_length_mm), method = "lm")
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 2 rows containing non-finite values (stat_smooth).
Warning: Removed 2 rows containing missing values (geom_point).
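For readers curious about what `method = "lm"` fits, the sketch below fits a simple linear regression to a small hypothetical data set with `lm()` and compares the result with the standard closed-form least squares slope and intercept. The data values are made up for illustration.

```r
# Hypothetical (x, y) pairs with a roughly linear relationship.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least squares fit of the straight line y = intercept + slope * x.
fit <- lm(y ~ x)
coef(fit)

# The same line from the closed-form least squares solution.
slope_manual <- cov(x, y) / var(x)
intercept_manual <- mean(y) - slope_manual * mean(x)
c(intercept = intercept_manual, slope = slope_manual)
```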

This only adds a single line across all groups, which clearly doesn’t explain the relationship when accounting for `species`. To add a separate line for each `species`, we need to include that in the aesthetic mapping for `geom_smooth`. If we scale the `species` variable using the `color` aesthetic, then a simple linear regression line will be estimated for each species in the color that matches the scatter plot.
ggplot(penguins) + geom_point(aes(x = body_mass_g, y = bill_length_mm, color = species,
shape = species)) + geom_smooth(aes(x = body_mass_g, y = bill_length_mm, color = species),
method = "lm")
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 2 rows containing non-finite values (stat_smooth).
Warning: Removed 2 rows containing missing values (geom_point).

Using common aesthetics across geometries (Video: YouTube, Panopto)
Notice that we had to specify the aesthetic mappings in each geom. This seems redundant.
If you specify an aesthetic mapping in a geom, then the aesthetic mapping is local and only applies to that particular geom. If you want to specify a global aesthetic mapping that applies across all geoms, then you can specify the `mapping` in the `ggplot` function.
Here is a simpler version of the previous plot.
ggplot(data = penguins, mapping = aes(x = body_mass_g, y = bill_length_mm, color = species)) +
geom_point(aes(shape = species)) + geom_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 2 rows containing non-finite values (stat_smooth).
Warning: Removed 2 rows containing missing values (geom_point).
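The effect of global versus local mappings can be verified on a small example. In the sketch below (the `toy` data frame is hypothetical), a global `color` mapping is inherited by `geom_smooth`, which then fits one line per group, while a `color` mapping that is local to `geom_point` is not inherited, so only one line is fitted.

```r
library(ggplot2)

# Hypothetical data: two groups of four points each.
toy <- data.frame(
  x = c(1, 2, 3, 4, 1, 2, 3, 4),
  y = c(1.0, 2.1, 2.9, 4.2, 3.1, 3.4, 4.1, 4.3),
  g = factor(rep(c("a", "b"), each = 4))
)

# Global mapping: color = g applies to both geom_point and geom_smooth.
p_global <- ggplot(toy, aes(x = x, y = y, color = g)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Local mapping: color = g applies only to geom_point.
p_local <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(color = g)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Count the fitted lines by counting groups in the smooth layer (layer 2).
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_local <- length(unique(layer_data(p_local, 2)$group))
c(global = n_lines_global, local = n_lines_local)
```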

Customizing plots (Video: YouTube, Panopto)
Unsurprisingly, the graphics produced by ggplot2 can be greatly customized. We consider some of the basic customizations below.
Assigning a plot
One of the strengths of ggplot2 is that you can easily build a plot layer by layer. In this case, it is useful to assign the current plot a name that can then be added to later.
Note: When you assign a `ggplot` a name, it will NOT be displayed, as is typical when assigning objects in R. You have to then print the object using the `print` function or by simply running the object name in the Console. Let’s assign the name `ggp` to the basic penguin scatter plot displaying bill length versus body mass, using colors and shapes to distinguish between different species.
ggp <- ggplot(data = penguins, mapping = aes(x = body_mass_g, y = bill_length_mm,
color = species)) + geom_point(aes(shape = species))
ggp
Warning: Removed 2 rows containing missing values (geom_point).

We’ll customize this plot below.
Axis labels and titles
The x-axis labels, y-axis labels, and title of a plot are controlled using the `xlab`, `ylab`, and `ggtitle` functions, respectively. Let’s customize the penguin scatter plot previously discussed. We’ll update the `ggp` object to include our improvements then print the result.
ggp <- ggp + xlab("body mass (g)") + ylab("bill length (mm)") + ggtitle("Penguin body characteristics")
ggp
Warning: Removed 2 rows containing missing values (geom_point).

Axis limits
The x-axis and y-axis limits are controlled using the `xlim` and `ylim` functions. Each function takes the lower and upper limit you want to set, e.g., `xlim(0, 100)`. Let’s change the axes for the penguin scatter plot previously discussed. There’s nothing wrong with the previous limits, but we’ll extend them a bit for demonstration.
ggp + xlim(2000, 7000) + ylim(0, 80)
Warning: Removed 2 rows containing missing values (geom_point).

Manual aesthetics
Aesthetics can be set manually in a geom by specifying the aesthetic directly in the geom but outside the `aes` function. Let’s change the shape of the points for a scatter plot, increase the size of the points, and manually change the color of the points.
ggplot(data = penguins) + geom_point(aes(x = body_mass_g, y = bill_length_mm), shape = 15,
size = 5, color = "blue")
Warning: Removed 2 rows containing missing values (geom_point).

Scale customization
Scaling is the name of the process used by ggplot2 to map a variable to a unique value. This is particularly important for `factor` variables. For example, color scaling is the process by which the `color` aesthetic is mapped to particular colors. These can be customized using the various `scale_*` functions, where the “*” is a placeholder for the rest of the function name.
Let’s manually set the color scaling using the `scale_color_manual` function. The colors provided below come from the Dark2 color palette of the qualitative type provided by the ColorBrewer website (https://colorbrewer2.org/), which provides color advice. It’s particularly great for identifying colorblind-friendly color palettes.
ggp + scale_color_manual(values = c("#1b9e77", "#d95f02", "#7570b3"))
Warning: Removed 2 rows containing missing values (geom_point).

Alternatively, we could have used the `scale_color_brewer` function instead, which automatically includes the color palettes designed by ColorBrewer. Run `?scale_color_brewer` for more details.
ggp + scale_color_brewer(type = "qual", palette = "Dark2")
Warning: Removed 2 rows containing missing values (geom_point).
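As a quick check that the manual and ColorBrewer approaches agree, the sketch below retrieves the first three Dark2 colors with `RColorBrewer::brewer.pal()` and compares them with the hex codes passed to `scale_color_manual` earlier. It assumes the RColorBrewer package is installed (it is typically installed alongside ggplot2).

```r
# First three colors of the qualitative Dark2 palette.
dark2 <- RColorBrewer::brewer.pal(n = 3, name = "Dark2")
dark2

# Hex codes used in the scale_color_manual example.
manual <- c("#1b9e77", "#d95f02", "#7570b3")

# Compare while ignoring letter case in the hex codes.
same_colors <- identical(tolower(dark2), manual)
same_colors
```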

Scalings can be customized in many ways, but it quickly becomes complicated, so we don’t discuss them further.
Legend customization
The legend can be customized in many ways. To move the location of the legend, you can specify the `legend.position` argument of the `theme` function.
ggp + theme(legend.position = "bottom")
Warning: Removed 2 rows containing missing values (geom_point).

You can customize legend size, background, names, colors, etc. For specific implementation details, it is probably best to do a web search.
Themes
A theme is a customized style that changes the overall appearance of your graphic. ggplot2 provides a number of built-in themes. Some of the built-in themes include:
- `theme_gray`: the default theme. Gray background and white grid lines.
- `theme_bw`: white background and gray grid lines.
- `theme_classic`: white background and no grid lines.
- `theme_minimal`: no background annotations, light grid lines, no axis lines.
- `theme_light`: white background, gray grid lines, and a box around the plotting region.
- `theme_dark`: essentially a “dark” opposite to `theme_light`.
We will use the patchwork package (Pedersen 2022) to easily combine several plots into a single graphic. The patchwork package uses `()` to group a set of plots, `|` to separate plots horizontally, and `/` to stack sets of plots vertically.
ggp_gray <- ggp + theme_gray() + ggtitle("theme_gray")
ggp_bw <- ggp + theme_bw() + ggtitle("theme_bw")
ggp_classic <- ggp + theme_classic() + ggtitle("theme_classic")
ggp_minimal <- ggp + theme_minimal() + ggtitle("theme_minimal")
ggp_light <- ggp + theme_light() + ggtitle("theme_light")
ggp_dark <- ggp + theme_dark() + ggtitle("theme_dark")
library(patchwork)
(ggp_gray | ggp_bw)/(ggp_classic | ggp_minimal)/(ggp_light | ggp_dark)
Warning: Removed 2 rows containing missing values (geom_point).
Removed 2 rows containing missing values (geom_point).
Removed 2 rows containing missing values (geom_point).
Removed 2 rows containing missing values (geom_point).
Removed 2 rows containing missing values (geom_point).
Removed 2 rows containing missing values (geom_point).

These themes can be customized further by changing the appropriate aspects of the plots. We leave the user to use web searches for the desired customizations.
Other components of a ggplot
Facetting a data set creates separate plotting panels for each level of one or more discrete variables. Facetting is useful for examining patterns for combinations of levels for one or more discrete variables. Technically, facetting can be performed for `numeric` variables with a relatively small number of unique values, but for practical reasons, facetting is most appropriate for `factor` variables.
There are two functions for creating facetted graphics in ggplot2:
- `facet_wrap` facets the plots based on a single `factor` variable. The panels are wrapped around the plotting window.
- `facet_grid` forms a matrix of panels based on row and column facetting variables.
In the plot below, we facet scatter plots of `bill_length_mm` versus `body_mass_g` by `species`.
ggplot(penguins) + geom_point(aes(x = body_mass_g, y = bill_length_mm)) + facet_wrap(~species)
Warning: Removed 2 rows containing missing values (geom_point).

By default, the same x and y axes are used in all panels. This is standard because it facilitates direct comparisons across panels. This can be customized using the `scales` argument of the `facet_*` functions. The allowable `scales` argument values are:
- `fixed`: same x and y axes for all panels. The default.
- `free`: individual x and y axes for each panel
- `free_x`: individual x axes for each panel, common y axis
- `free_y`: individual y axes for each panel, common x axis
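The following sketch shows the `scales` argument in use. It relies on a small hypothetical data frame whose groups have very different y ranges, which is the situation where `scales = "free_y"` is most noticeable. The names `toy` and `p_free` are made up for illustration.

```r
library(ggplot2)

# Hypothetical data: three groups whose y values differ by orders of magnitude.
toy <- data.frame(
  x = rep(1:4, times = 3),
  y = c(1, 2, 3, 4, 10, 20, 30, 40, 100, 200, 300, 400),
  g = factor(rep(c("a", "b", "c"), each = 4))
)

# Each panel gets its own y axis; the x axis is shared across panels.
p_free <- ggplot(toy) +
  geom_point(aes(x = x, y = y)) +
  facet_wrap(~g, scales = "free_y")
p_free

# One panel is created for each level of g.
n_panels <- length(unique(layer_data(p_free)$PANEL))
n_panels
```

Free scales make the pattern within each panel easier to see, but they make comparisons across panels harder, so the fixed default is usually the safer choice.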
We now use the `facet_grid` function to facet by two `factor` variables, `species` and `sex`. Prior to plotting, we remove the observations with `NA` for the `sex` variable.
ggplot(subset(penguins, !is.na(sex))) + geom_point(aes(x = body_mass_g, y = bill_length_mm)) +
facet_grid(species ~ sex)

There appear to be differences in the sizes of the male and female penguins, but it is difficult to compare the patterns across panels. It may be more useful to facet by `species` but distinguish by `sex` in a single panel.
ggplot(subset(penguins, !is.na(sex))) + geom_point(aes(x = body_mass_g, y = bill_length_mm,
color = sex, shape = sex)) + facet_wrap(~species)

It is clear now that across species, the males tend to have greater body mass and bill length in comparison to the females.
Position
The `position` argument of a geom describes the positions of the objects produced. Most of the time this is pretty straightforward, but there can be times when it is helpful (or unhelpful) to customize `position`.
Sometimes, data values/information overlap, which makes interpreting the plot more difficult. In that case, it can be useful to “jitter” the data so that the values/information are not overlapping. The data are not technically as accurate, but the large-scale distribution of the data is easier to interpret.
The `Galton` data set in the HistData package (Friendly 2021) includes 928 observations of 2 variables. The variables are:
- `parent`: a `numeric` variable indicating the average height of a child’s mother and father (in).
- `child`: a `numeric` variable indicating the height of each child (in).
The heights are recorded in class intervals that are 1 inch wide, so each variable takes only a small number of distinct values. Additional information can be found by running `?HistData::Galton` in the Console.
Unfortunately, rounding the data values leads to overplotting. Below, we load the `Galton` data and create a scatter plot of child height versus parents’ average height.
data(Galton, package = "HistData")
ggplot(Galton) + geom_point(aes(x = parent, y = child))

There is a lot of overplotting in this graphic. We use the `position` argument to `jitter` the data to make the overall distribution clearer. There’s not a lot of information gain from jittering the data in this particular example, but we can see that there are more observations in the central part of the plot.
ggplot(Galton) + geom_point(aes(x = parent, y = child), position = "jitter")
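If you want control over how far the points are moved, `position_jitter()` can be used in place of the string `"jitter"`. The sketch below is an illustration on a small hypothetical data frame with repeated values (not the `Galton` data). The `width` and `height` arguments bound the random shift in each direction, and `seed` makes the jitter reproducible.

```r
library(ggplot2)

# Hypothetical rounded heights with many repeated (parent, child) pairs.
toy <- data.frame(
  parent = rep(c(66.5, 67.5, 68.5), each = 5),
  child = rep(c(66.2, 67.2, 68.2), times = 5)
)

# Shift each point by at most 0.2 horizontally and 0.2 vertically.
p_jit <- ggplot(toy) +
  geom_point(aes(x = parent, y = child),
             position = position_jitter(width = 0.2, height = 0.2, seed = 1))
p_jit

# Compare the plotted positions with the recorded values.
jittered <- layer_data(p_jit)
max_shift <- max(abs(jittered$x - toy$parent), abs(jittered$y - toy$child))
max_shift
```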

Stacking, filling, and dodging (Video: YouTube, Panopto)
I was tempted to title this section, “plots that make me emotional” because several of the graphics below cause me to go through a range of mostly unpleasant emotions when I view them.
In this section, we’ll see how `stack`, `fill`, and `dodge` can be used in the `position` argument. While these examples demonstrate their use, some of the results are an absolute mess. I will conclude with alternative recommendations.
When creating a bar chart, a `factor` variable is mapped to the `x` aesthetic. If a different `factor` variable is mapped to the `fill` aesthetic, then ggplot2 will `stack` the bars for the `fill` variable inside the bars for the `x` variable. Consider the bar chart below in which we “stack” penguin `sex` within the bars for the different species.
ggplot(penguins) + geom_bar(aes(x = species, fill = sex))

If you compare this bar chart with the original one that doesn’t fill by `sex`, the bar height is identical. However, within each species, we `fill` the bars to show the number of penguins with each level of `sex`.
This plot is difficult to interpret. It takes a lot of mental energy to compute approximately how many male, female, and `NA` observations there are within each species. While the updated bar chart compactly provides a lot of information, it doesn’t facilitate easy interpretation.
If we wanted to compare the proportion of male, female, and `NA` penguins in each species, we can change the `position` argument to `fill`. In that case, the bars within each level of `species` will be scaled to one, directly facilitating comparisons between the colors within each bar.
ggplot(penguins) + geom_bar(aes(x = species, fill = sex), position = "fill")

I still consider this a poor display of the data, but we can see that the proportion of males is close to 50% for all species.
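The proportions that a “fill” bar chart displays can also be computed directly as a table, which is sometimes easier to read than the chart. The sketch below uses base R on a small hypothetical data frame (the species labels and counts are made up). `prop.table()` with `margin = 2` rescales each column so that it sums to one, mirroring what `position = "fill"` does within each bar.

```r
# Hypothetical observations of species and sex.
toy <- data.frame(
  species = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
  sex = c("female", "male", "male", "female", "female",
          "male", "male", "female", "female", "male")
)

# Counts of each sex (rows) within each species (columns).
counts <- table(toy$sex, toy$species)
counts

# Proportions within each species: every column sums to 1.
props <- prop.table(counts, margin = 2)
round(props, 2)
```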
Lastly, if we don’t want to `stack` or `fill` the bars, we can have them `dodge` each other. In that case, the overlapping bars will “dodge” each other and sit next to each other in the plot.
ggplot(penguins) + geom_bar(aes(x = species, fill = sex), position = "dodge")

If you want to compare the number of penguins with each `sex` within each `species`, this graphic isn’t too bad. However, the bars representing male penguins (or alternatively, female penguins) are spread out and more difficult to compare directly because of the other bars between them.
While the above charts provide a fair amount of information compactly, to facilitate ease of interpretation we should construct our graphics to highlight the characteristic of importance.
For example, if we want to compare the counts of each penguin `species` within each `sex`, it would be better to facet the data by `sex` and display the count of each species.
ggplot(penguins) + geom_bar(aes(x = species, fill = species)) + facet_wrap(~sex)

Alternatively, if we want to highlight the composition of `sex` within each `species`, then I recommend creating a bar chart of `sex` while facetting by `species`. This plot is quite similar to the “dodge” plot above, but provides further distinction between the charts for each species.
ggplot(penguins) + geom_bar(aes(x = sex, fill = sex)) + facet_wrap(~species)

The examples above are shown for factor variables with only three levels of each factor, and you may think I’m overreacting (and perhaps I am). However, the issues mentioned above are magnified when our `factor` variables have many levels. To illustrate this point, I present “stack” and “dodge” bar charts of the `diamonds` data in the ggplot2 package, in which we compare the `cut` and `clarity` of diamonds. Each variable has at least 5 levels. The large number of levels for the `cut` and `clarity` variables makes it difficult to quickly interpret the relationship between the two variables. The patchwork package is used to combine the two charts into a single graphic. (We also rotate the x-axis labels 90 degrees so they don’t overlap each other.) How easy is it for you to interpret the information below?
ggp_stack <- ggplot(diamonds) + geom_bar(aes(x = cut, fill = clarity)) + ggtitle("stack") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggp_dodge <- ggplot(diamonds) + geom_bar(aes(x = cut, fill = clarity), position = "dodge") +
ggtitle("dodge") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggp_stack + ggp_dodge

Coordinate Systems (Video: YouTube, Panopto)
A coordinate system is used to define the position of coordinates relative to one another. By default, ggplot2 uses the Cartesian coordinate system (https://en.wikipedia.org/wiki/Cartesian_coordinate_system), which treats coordinates as points on a flat plane, which is how humans tend to think about coordinates in relationship to one another.
Other coordinate systems are commonly used for maps because points are actually on an ellipsoidal object (the earth!), which complicates their relationship.
We do not discuss coordinate systems in detail here. The only coordinate system I want to mention is the “flipped” coordinate system (`coord_flip`), which flips the x and y axes. This is useful for rotating objects in a plot. Consider the boxplots below.
ggplot(penguins) + geom_boxplot(aes(x = species, y = body_mass_g, fill = species))
Warning: Removed 2 rows containing non-finite values (stat_boxplot).

ggplot(penguins) + geom_boxplot(aes(x = species, y = body_mass_g, fill = species)) +
coord_flip()
Warning: Removed 2 rows containing non-finite values (stat_boxplot).

---
title: "Basic data visualization with ggplot2"
author: "Joshua French"
date: "`r format(Sys.time(), '%Y-%m-%d')`"
output:
  bookdown::html_notebook2:
    number_sections: FALSE
bibliography:
- dwv.bib
- packages_basic_data_viz.bib
---

This document is intended to help you quickly learn how to perform basic data visualization using **ggplot2**.

An accompanying YouTube playlist that walks through this document is available by following the link [here](https://youtube.com/playlist?list=PLkrJrLs7xfbVbI1WHs8FfkOoFvtiI6EXd). I have also included direct, relevant video links throughout the file (e.g., immediately after the relevant heading). An html version of this document can be downloaded to your current working directory (run `getwd()` in the Console to see where this is) by running the following command:
```{r, eval=FALSE}
download.file("https://raw.githubusercontent.com/jfrench/DataWrangleViz/master/03-basic-data-viz-with-ggplot2.nb.html",
              "03-basic-data-viz-with-ggplot2.nb.html")
```
The raw R Markdown code used to generate the html file can be obtained by toggling the "Code" box in the upper-right corner of the html file and selecting "Download Rmd".

```{r, include=FALSE}
knitr::opts_chunk$set(
  tidy = TRUE
)
```
```{r, include=FALSE}
# automatically create a bib database for R packages
knitr::write_bib(c(
  .packages(), 'ggplot2', 'patchwork', 'palmerpenguins',
  'HistData', 'viridis'
), 'packages_basic_data_viz.bib')
```

## Introduction to basic data visualization with **ggplot2** (Video: [YouTube](https://youtu.be/lplgV5KT62s), [Panopto](https://ucdenver.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=f44a8a84-50de-41c6-8e9e-af0f01454961))

In this module we discuss how to perform basic data visualization.

```{r, echo = FALSE}
knitr::include_graphics("./figures/ggplot2_logo.png")
```

We will use **ggplot2** [@ggplot22016; @R-ggplot2] to construct the graphics in this module.

**ggplot2** has become increasingly popular for producing flexible graphics within an elegant framework.

We start by loading the **ggplot2** package.

```{r}
library(ggplot2)
```

```{r, include = FALSE}
library(palmerpenguins)
library(HistData)
```

## Tips for good graphics (Video: [YouTube](https://youtu.be/cPftNY-iKCE), [Panopto](https://ucdenver.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=f2130223-3a6c-4671-aadd-af0f014c11b6))

Data graphics should be as *informative* as possible.

- Graphics do *not* need to contain as much *information* as possible.
- Make it as easy as possible to understand the information you want to reveal about the data.
- Simpler graphics are easier to understand and should be preferred.
- Adding complexity to a graphic can sometimes reveal far more about the data than a simple graphic.
- Make sure labels and text are large enough to easily read.
- Make sure colors are distinguishable and colorblind friendly (e.g., using the **viridis** [@R-viridis] package).

You will have to decide what is needed in a graphic to make it as informative
as possible.

Complicated graphics may present far more information than we are able to process, such as in the example below.

```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(palmerpenguins)
library(ggplot2)
library(plotly)
heavy_plot <- 
  ggplot(penguins) +
  geom_point(aes(x = body_mass_g,
                 y = bill_length_mm,
                 col = species,
                 shape = island,
                 size = flipper_length_mm)) +
  scale_size_continuous(range = c(0.5, 2.5)) +
  geom_smooth(data = penguins,
              mapping = aes(x = body_mass_g,
                  y = bill_length_mm,
                  col = species),
              formula = y ~ x,
              method = "loess") + 
  geom_smooth(data = penguins,
              mapping = aes(x = body_mass_g,
                  y = bill_length_mm,
                  linetype = sex),
              formula = y ~ x,
              method = "lm")
ggplotly(heavy_plot)
```

## Graphics ecosystems in R (Video: [YouTube](https://youtu.be/R12YAjSze0I), [Panopto](https://ucdenver.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=e0da966e-9758-4f86-8c16-af0f014e1e10))

There are three main graphics ecosystems in R:

1. **base**
2. **lattice**
3. **ggplot2**

**base** graphics are traditional S-like graphics.

* Run `?graphics-package` in the Console for more details.
* These are the graphics you get by default when you use the `plot` function, the `hist` function, the `boxplot` function, etc.

**lattice** graphics are an implementation of Trellis graphics [@BeckerClevelandShyu1996] for R.

* Run `?lattice` in the Console for additional information about the **lattice** package.
* The **lattice** package focuses on elegantly plotting multivariate data and makes it easy to distinguish different levels of `factor` data.
* **lattice** and **base** graphics do not interact with each other (changing aspects of one graphics system has no impact on the other).

**ggplot2** is a layered graphical system based on implementing the Grammar of Graphics [@Wilkinson2005].

* It has gained widespread popularity because of its friendliness for visual exploration of data by data scientists.
* It provides an elegant approach for constructing complex plots in a systematic way.
* Run `?ggplot2-package` in the Console for more information.

We demonstrate use of the three plotting ecosystems using the `penguins` data set from the **palmerpenguins** package [@R-palmerpenguins], which will be discussed in more detail later. The code below creates grouped scatter plots comparing bill length (mm) and body mass (g) of different penguin species using each of the ecosystems.

```{r}
# load data
data(penguins, package = "palmerpenguins")
# base graphics
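plot(bill_length_mm ~ body_mass_g, data = penguins, col = penguins$species)
# the factor levels are plotted with palette colors 1, 2, 3, so the legend
# uses the same color indices to stay in sync with the points
legend(x = "topleft", legend = levels(penguins$species),
       col = seq_along(levels(penguins$species)), pch = 1)
# lattice graphics
lattice::xyplot(bill_length_mm ~ body_mass_g, data = penguins, groups = species,
                auto.key = TRUE)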
# ggplot2 graphics
ggplot2::ggplot(data = penguins) +
  ggplot2::geom_point(mapping = aes(x = body_mass_g, y = bill_length_mm,
                                    color = species, shape = species))
```

## Basic ingredients (Video: [YouTube](https://youtu.be/k6a3vMoBXSw), [Panopto](https://ucdenver.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=574df5a9-3294-4d6f-8f9e-af0f014e8926))

There are 4 main components needed to produce a graphic using **ggplot2**.

1. A data frame containing your data.
    * Each column should be a variable and each row should be an observation of data.
2. A `ggplot` object.
    * This is initialized using the `ggplot` function.
3. A geometric object.
    * These are called "geoms" for short.
    * geoms indicate the geometric object used to visualize the data, e.g., points, lines, or polygons. More generally, geoms indicate the type of plot that is desired, e.g., a histogram, density plot, or boxplot, which aren't exactly simple geometric objects.
4. An aesthetic.
    * An aesthetic mapping indicates what role a variable plays in the plot.
    * e.g., which variable will serve as the "x" variable in the plot, which will serve as the "y" variable, which will control the "color" of the observations, etc.
  
In *R for Data Science*, @r4ds2017 provide the following template for creating graphics using **ggplot2**:
```{r, eval=FALSE, tidy=FALSE}
ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
```

`<DATA>`, `<GEOM_FUNCTION>`, and `<MAPPINGS>` are placeholders that you replace with the data frame, geometric object, and aesthetic mappings you want to use in your specific plot.

An explanation of the template above:

1. Every `ggplot` starts with a call to the `ggplot` function.
    * You can create a blank plot by running `ggplot()` in the Console.
2. Generally, you pass your data frame to the `ggplot` function through the `data` argument.
    * You can also pass `data` inside the `<GEOM_FUNCTION>` when you intend to use multiple geometric objects with different data sources.
3. The `<GEOM_FUNCTION>` indicates the geometric object you want to use in the plot.
4. The `<MAPPINGS>` describe the aesthetic mappings you want to use for this particular geometry.
    * One of the reasons **ggplot2** is so powerful is that you can use different mappings for different geometric objects.
5. The `+` symbol is used to add a layer to your graphic.
    * If your code spans multiple lines, then you need to make sure all lines but the last end with the `+` operator to stack the layers properly.
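
To make the template concrete, here is a minimal sketch that fills in the placeholders using the `penguins` data (described in more detail later in this document). The object names `p_template` and `heavy_penguins`, and the 5000 g cutoff, are arbitrary choices made only for illustration.

```{r}
library(ggplot2)
data(penguins, package = "palmerpenguins")
# <DATA> = penguins, <GEOM_FUNCTION> = geom_point,
# <MAPPINGS> = x = body_mass_g, y = bill_length_mm
p_template <- ggplot(data = penguins) +
  geom_point(mapping = aes(x = body_mass_g, y = bill_length_mm))
p_template

# a second layer can use its own data source through its `data` argument
# (the 5000 g cutoff is arbitrary and only for illustration)
heavy_penguins <- subset(penguins, body_mass_g > 5000)
p_template +
  geom_point(data = heavy_penguins,
             mapping = aes(x = body_mass_g, y = bill_length_mm),
             color = "red")
```

Note that every line except the last one in each plot ends with `+`, as described in item 5.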

### Some geometric objects

There are many geometric objects available in **ggplot2**. A complete list may be (currently) found at [https://ggplot2.tidyverse.org/reference/](https://ggplot2.tidyverse.org/reference/). A partial list of geometric objects that I frequently use is found below in Table \@ref(tab:geometry-table).

```{r, include=FALSE}
# create geometry table
data_dimensionality <- rep(c("1d", "2d", "3d", "NA"),
                            times = c(6, 7, 2, 1))
geometry <- c("`geom_bar`", "`geom_density`",
              "`geom_histogram`", "`geom_boxplot`",
              "`geom_violin`", "`geom_qq`", "`geom_point`",
              "`geom_path`, `geom_line`",
              "`geom_segment`",
              "`geom_curve`", "`geom_smooth`",
              "`geom_density2d`", "`geom_density2d_filled`",
              "`geom_contour`", "`geom_contour_filled`",
              "`geom_abline`, `geom_hline`, `geom_vline`")
purpose <- c("Draws a bar chart.", "Draws a density plot.",
             "Draws a histogram.", "Draws a boxplot.", "Draws a violin plot.",
             "Draws a quantile-quantile plot.",
             "Draws points. Used for scatter plots.",
             "Connects observations. Used for line plots.",
             "Draws straight lines between points.",
             "Draws curved lines between points.",
             "Draws a 'smooth' fitted model of the data.",
             "Draws 2d contours of density estimate for two variables.",
             "Draws 2d contours of density estimate for two variables with colors.",
             "Draws 2d contours of 3d data.",
             "Draws 2d contours of 3d data (colored).",
             "Draws diagonal, horizontal, and vertical lines.")
geometry_df <- data.frame(data_dimensionality, geometry, purpose)
```
```{r geometry-table, echo=FALSE}
knitr::kable(geometry_df, caption = "Common geometries provided by **ggplot2**.")
```

### Some aesthetic mappings

Aesthetic mappings are unique to each geometric object. Some of the most common ones that show up in many geoms are provided below in Table \@ref(tab:aesthetic-table).

```{r, include=FALSE}
# create aesthetic table
aesthetic <- c("`x`", "`y`", "`alpha`", "`color`, `colour`",
               "`fill`", "`group`", "`linetype`", "`size`",
               "`shape`")
purpose <- c("Controls the x-variable in the plot.",
             "Controls the y-variable in the plot.",
             "Controls the transparency of the object.",
             "Controls the colors of the object.",
             "Controls the color of the interior of an object.",
             "Controls how the data are grouped.",
             "Controls the type of line used to draw the object.",
             "Controls the size of the drawn object.",
             "Controls the shape of the object.")
aesthetic_df <- data.frame(aesthetic, purpose)
```
```{r aesthetic-table, echo = FALSE}
knitr::kable(aesthetic_df, caption = "Common aesthetics provided by **ggplot2**.")
```

## Simple examples

### The `penguins` data (Video: [YouTube](https://youtu.be/XpsV65um6Uc), [Panopto](https://ucdenver.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=8a1b8d91-8e90-482b-8512-af0f014e88e9))
We will use the `penguins` data set in the **palmerpenguins** package [@R-palmerpenguins] to illustrate some of the ways that **ggplot2** can be used. 

The `penguins` data set provides data related to various penguin species measured in the Palmer Archipelago (Antarctica), originally provided by @GormanEtAl2014. We start by loading the data into memory.

```{r}
data(penguins, package = "palmerpenguins")
```

The data set includes `r nrow(penguins)` observations of `r ncol(penguins)` variables. The variables are:

* `species`: a `factor` indicating the penguin species
* `island`: a `factor` indicating the island on which the penguin was observed
* `bill_length_mm`: a `numeric` variable indicating the bill length in millimeters
* `bill_depth_mm`: a `numeric` variable indicating the bill depth in millimeters
* `flipper_length_mm`: an `integer` variable indicating the flipper length in millimeters
* `body_mass_g`: an `integer` variable indicating the body mass in grams
* `sex`: a `factor` indicating the penguin sex (`female`, `male`)
* `year`: an integer denoting the study year the penguin was observed (`2007`, `2008`, or `2009`)

### Bar chart (Video: [YouTube](https://youtu.be/cBwC1SKz_gA), [Panopto](https://ucdenver.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=ef64b539-5b2c-4621-99aa-af0f014e8872))
We construct a bar chart of the `species`. A bar chart uses bars to indicate the number of observations having each level of a `factor`. Alternatively, the bars can indicate the relative frequency (i.e., proportion) of observations having each level of the `factor`.

```{r}
ggplot(penguins) + geom_bar(aes(x = species))
```

Relatively speaking, we can see that Adelie penguins were most frequently observed, followed by Gentoo penguins, and then Chinstrap penguins.
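
The relative frequency version of the bar chart mentioned above can be sketched as follows. This assumes a version of **ggplot2** that provides `after_stat` (3.3.0 or later); the object name `p_prop` is only for illustration.

```{r}
library(ggplot2)
data(penguins, package = "palmerpenguins")
# after_stat(prop) is the proportion computed by the counting statistic;
# group = 1 makes the proportions relative to all observations
# instead of being computed separately within each bar
p_prop <- ggplot(penguins) +
  geom_bar(aes(x = species, y = after_stat(prop), group = 1)) +
  ylab("proportion")
p_prop
```

The ordering of the bars is unchanged; only the y-axis switches from counts to proportions.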

We can color the bars by `species` for visual clarity by specifying the `fill` aesthetic.

```{r}
ggplot(penguins) + geom_bar(aes(x = species, fill = species))
```

This doesn't add any new information to the graphic but does make it more visually appealing. We could remove the legend to simplify the graphic, if we wanted.

```{r}
ggplot(penguins) + geom_bar(aes(x = species, fill = species)) + theme(legend.position = "none")
```

### Histogram (Video: [YouTube](https://youtu.be/gFZpBO38HUg), [Panopto](https://ucdenver.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=5f323dbc-3d6c-42bd-ac62-af0f014e88aa))

A *histogram* is used to display the distribution of a continuous `numeric` variable. The range of the variable is partitioned into classes. The number of observations falling in each class is counted. A histogram draws a bar for each class with height corresponding to the number of observations in that class.

Let's construct a histogram of bill length.

```{r}
ggplot(penguins) + geom_histogram(aes(x = bill_length_mm))
```

The histogram is multimodal (i.e., has multiple prominent peaks), with 2 or 3 prominent peaks. There may be sub-populations we should distinguish. It isn't easy to identify sub-populations with histograms unless the sub-populations are split into separate histograms (which we will learn to do later with "facets") or the bars for the different sub-populations are made semi-transparent.
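
As a rough sketch of the semi-transparent approach, the histogram below overlays one set of bars per `species`. The choices `bins = 30` and `alpha = 0.4` are arbitrary, and the `fill` and `alpha` aesthetics used here are explained in the density plot section that follows.

```{r}
library(ggplot2)
data(penguins, package = "palmerpenguins")
# position = "identity" overlays the bars instead of stacking them;
# alpha is set outside aes() so every bar gets the same transparency
p_hist <- ggplot(penguins) +
  geom_histogram(aes(x = bill_length_mm, fill = species),
                 position = "identity", alpha = 0.4, bins = 30)
p_hist
```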

### Density plot (Video: [YouTube](https://youtu.be/UGeJ73k6BGs), [Panopto](https://ucdenver.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=1b782342-8f79-4493-a60d-af0f014e9438))

A density plot is often preferred to a histogram because it's more flexible, though it often provides similar information. A density plot is essentially a smoothed version of a histogram. 

A density plot is appropriate for displaying the distribution of a continuous `numeric` variable and indicates the values for which the data is more "dense". More specifically, density plots indicate the values of data you are most likely to observe. A higher density region means you are more likely to observe data with values in that region. There is a statistical definition we will not go into.

A simple density plot is shown below. We can't learn a lot because of the sub-populations previously mentioned.

```{r}
ggplot(penguins) + geom_density(aes(x = bill_length_mm))
```

There are at least two modes in the `bill_length_mm` density plot.

Let's create separate densities for the different penguin species by specifying the `color` aesthetic to *scale* different colors to the different levels of `species`.

```{r}
ggplot(penguins) + geom_density(aes(x = bill_length_mm, color = species))
```

There are clear differences in the bill length of Adelie, Chinstrap, and Gentoo penguins. The Adelie bill lengths are centered a little below 40 mm, while the Gentoo bill lengths are centered closer to 47 mm. The Chinstrap penguins have the greatest bill length, in general. It appears the Chinstrap penguins may have a further sub-population.

A weakness of the previous graphic is that the legend distinguishing the different `species` is a bit subtle. The colored line around the border may not be easy to distinguish for everyone. We could use a different `linetype` for each species or we could use different `color` and `linetype` for each species. We consider the different results below.

```{r}
ggplot(penguins) + geom_density(aes(x = bill_length_mm, linetype = species))
ggplot(penguins) + geom_density(aes(x = bill_length_mm, col = species, linetype = species))
```

An even better option is to `fill` the densities for each species with different colors. However, the density curves will overlap, masking some of the information.

```{r}
ggplot(penguins) + geom_density(aes(x = bill_length_mm, fill = species))
```

To address the masking issue, we can control the transparency of the colors using the `alpha` aesthetic. This value, between 0 and 1, controls how transparent the objects are. `0` means completely transparent, and `1` means completely opaque. We will set `alpha = 0.3`. Notice that we specify this OUTSIDE the aesthetic mapping since we are setting the aesthetic manually instead of having **ggplot2** *scale* a provided variable.

```{r}
ggplot(penguins) + geom_density(aes(x = bill_length_mm, fill = species), alpha = 0.3)
```

### Scaling

When a variable is mapped to an aesthetic, **ggplot2** will assign a unique value to each unique level of the variable. This process is known as *scaling*.

Scaling is often not very exciting, but it can be important when customizing the look of a `ggplot`.

When a variable is mapped to an aesthetic other than `x` or `y`, **ggplot2** will automatically add a legend to indicate how the variable was scaled.

### Boxplots (Video: [YouTube](https://youtu.be/EZjkg_hsh-w), [Panopto](https://ucdenver.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=3f6365cb-94eb-4ef9-8271-af0f014e96a3))

A boxplot is a robust display of information for a continuous `numeric` variable. Boxplots are robust because they rely on robust statistics not easily affected by outliers. 

In general, a boxplot:

* Draws a box extending from Q1 (the 0.25 quantile) to Q3 (the 0.75 quantile) of the data.
* A line between Q1 and Q3 indicates the median (the 0.5 quantile) of the data.
* A "whisker" extends from Q1 to the smallest observation that is not an outlier.
* A "whisker" extends from Q3 to the largest observation that is not an outlier.
* An **outlier** is typically defined as an observation smaller than Q1 - 1.5 (Q3 - Q1) or larger than Q3 + 1.5 (Q3 - Q1).
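
The quantities in the list above can be computed by hand. The sketch below does so for the Adelie bill lengths using base R's `quantile` function with its default settings; the values drawn by a particular boxplot function may differ slightly depending on how it computes quantiles. All object names are illustrative.

```{r}
data(penguins, package = "palmerpenguins")
# bill lengths for one species, dropping missing values
adelie_bill <- penguins$bill_length_mm[penguins$species == "Adelie"]
adelie_bill <- adelie_bill[!is.na(adelie_bill)]
# Q1, median, and Q3
q <- quantile(adelie_bill, probs = c(0.25, 0.5, 0.75))
iqr <- unname(q[3] - q[1])  # Q3 - Q1
# cutoffs beyond which an observation is considered an outlier
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[3]) + 1.5 * iqr
is_outlier <- adelie_bill < lower_fence | adelie_bill > upper_fence
# the whiskers end at the most extreme observations that are not outliers
whiskers <- range(adelie_bill[!is_outlier])
q
c(lower_fence = lower_fence, upper_fence = upper_fence)
whiskers
adelie_bill[is_outlier]
```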

Parallel boxplots are an effective tool for comparing the distribution of a variable across multiple levels of a `factor` variable.

Consider parallel boxplots of bill length distinguished by `species`.

```{r}
ggplot(penguins) + geom_boxplot(aes(y = bill_length_mm, x = species))
```

We can see that the typical responses for the Adelie penguins tend to be lower than for the Chinstrap and Gentoo penguins. The Chinstrap penguins tend to have slightly longer beaks than the Gentoo penguins. All three penguin species seem to have similar variability (i.e., the spread of the data is similar).

A weakness of boxplots is that they throw away a lot of information. For example, we don't see the two sub-populations in the Chinstrap penguins that we noticed earlier when looking at density plots.

Adding a `fill` color to the boxplots (while removing the legend) might be a more visually appealing graphic.
```{r}
ggplot(penguins) + geom_boxplot(aes(y = bill_length_mm, x = species, fill = species)) + theme(legend.position = "none")
```

### Violin plots (Video: [YouTube](https://youtu.be/QbVWcjxYgHY), [Panopto](https://ucdenver.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=27b50ff6-2aaa-48ef-85ac-af0f014ea88d))

A violin plot is like a cross between a boxplot and a density plot. A violin plot unites a density with its mirror image and then displays the unified object like a boxplot. 

We construct violin plots of bill length for each penguin `species`.

```{r}
ggplot(penguins) + geom_violin(aes(x = species, y = bill_length_mm))
```

While the Adelie bill lengths are approximately symmetric and unimodal, both the Chinstrap and Gentoo bill lengths are bimodal and asymmetric.
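
Because a violin plot and a boxplot share the same axes, the two can be layered to show both the shape of the distribution and its quartiles. The sketch below is one way to do this; the boxplot `width` of 0.1 is an arbitrary choice that keeps the boxes narrow enough to fit inside the violins.

```{r}
library(ggplot2)
data(penguins, package = "palmerpenguins")
# draw the violins first, then narrow boxplots on top of them
p_violin <- ggplot(penguins) +
  geom_violin(aes(x = species, y = bill_length_mm)) +
  geom_boxplot(aes(x = species, y = bill_length_mm), width = 0.1)
p_violin
```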

### Scatter plots (Video: [YouTube](https://youtu.be/DjORoUDMLD4), [Panopto](https://ucdenver.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=8b15f3a4-aaa3-4a73-aebd-af0f014eab0a))

A scatter plot draws the locations of (x, y) positions based on two `numeric` variables. In general, we use scatter plots to learn information about the *association* between the two variables.

* A positive association is present when both variables tend to increase together.
* A negative association is present when one variable tends to decrease when the other increases.
* A linear association is present when the relationship between the two variables is a straight line.
* A nonlinear association is present when the relationship between the two variables is a curve.
* It is possible that no clear association is present in the data.

Let's consider a scatter plot of `bill_length_mm` versus `body_mass_g`.

```{r}
ggplot(penguins) + geom_point(aes(x = body_mass_g, y = bill_length_mm))
```

We see a positive association in the graphic above. As `body_mass_g` increases, `bill_length_mm` also tends to increase. However, it appears that there may be some observations in the upper left part of the graph that do not match the overall linear trend.

This could mean:

1. The data are *noisy*, which simply means the patterns aren't as strong and clear as we would like.
2. There is a third variable that is not accounted for in our plot but that has a relationship with the other two variables.
    * When we are not accounting for this variable, it is called a *lurking variable*.
    * When we account for a lurking variable in our analysis, it becomes a *confounding variable*.
    * We are defining lurking and confounding variables in a general sense and not with statistical rigor.

Let's consider an analysis of `body_mass_g` and `bill_length_mm` that accounts for `species`. Specifically, we will use the `color` aesthetic so that each penguin species is shown in a different color.

```{r}
ggplot(penguins) + geom_point(aes(x = body_mass_g, y = bill_length_mm, color = species))
```

Accounting for `species` in the analysis shows clear patterns for the three penguin species. Chinstrap penguins tend to have smaller body mass but longer bill lengths. Adelie penguins have smaller body mass and bill lengths. Gentoo penguins tend to have larger body mass and bill lengths than the other two species.

As previously mentioned, we want our plots to be as clear as possible. Some individuals have difficulty interpreting colors in graphics. We can add another distinguishing characteristic for the different species. In this case, we will change the `shape` of the plotted points for each species. This provides another aid for correctly interpreting the graphic.

```{r}
ggplot(penguins) + geom_point(aes(x = body_mass_g, y = bill_length_mm, color = species, shape = species))
```

### Adding smooths (Video: [YouTube](https://youtu.be/P2_qYBV1Byc), [Panopto](https://ucdenver.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=dfeca0cc-4591-4306-9e85-af0f014eb95b))

A smooth in **ggplot2** is a model that attempts to estimate the average relationship between a response variable (typically the `y` variable) and one or more predictor variables (typically the `x` variable and possibly other variables).

Adding a "smooth" to a plot can make relationships between variables even clearer. There are many different smoothing methods.

A smoothing layer can be added to a plot using `geom_smooth`. Many methods can be used for the smooth. **ggplot2** takes the methods `lm` (linear model), `glm` (generalized linear model), `gam` (generalized additive model), `loess` (local fitting), or a function (e.g., `MASS::rlm` for a robust linear model). If you don't know what these are, do not add them to your plot because they won't help you to explain the data.

A simple linear regression model fits the straight line that minimizes the sum of the squared deviations between the y values and the estimated line. We add a simple linear regression line to the scatter plot by adding a `geom_smooth` layer.

```{r}
ggplot(penguins) + geom_point(aes(x = body_mass_g, y = bill_length_mm, color = species, shape = species)) + geom_smooth(aes(x = body_mass_g, y = bill_length_mm), method = "lm")
```

This only adds a single line across all groups, which clearly doesn't explain the relationship when accounting for `species`. To add a separate line for each `species`, we need to include that in the aesthetic mapping for `geom_smooth`. If we scale the `species` variable using the `color` aesthetic, then a simple linear regression line will be estimated for each species in the color that matches the scatter plot.

```{r}
ggplot(penguins) + geom_point(aes(x = body_mass_g, y = bill_length_mm, color = species, shape = species)) + geom_smooth(aes(x = body_mass_g, y = bill_length_mm, color = species), method = "lm")
```
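
The other smoothing methods listed earlier are requested the same way. As a sketch, the plot below swaps the straight lines for `loess` smooths, which can bend to follow local patterns in the data. Setting `se = FALSE` hides the shaded uncertainty bands and is purely a stylistic choice made here to reduce clutter.

```{r}
library(ggplot2)
data(penguins, package = "palmerpenguins")
# same scatter plot as before, with one loess smooth per species
p_loess <- ggplot(penguins) +
  geom_point(aes(x = body_mass_g, y = bill_length_mm,
                 color = species, shape = species)) +
  geom_smooth(aes(x = body_mass_g, y = bill_length_mm, color = species),
              method = "loess", formula = y ~ x, se = FALSE)
p_loess
```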

### Using common aesthetics across geometries (Video: [YouTube](https://youtu.be/oFWHATa7AQw), [Panopto](https://ucdenver.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=4065f746-5cf2-45ef-a5d8-af0f014ebcc7))

Notice that we had to specify the aesthetic mappings in each geom. This seems redundant.

If you specify an aesthetic mapping in a geom, then the aesthetic mapping is *local* and only applies to that particular geom. If you want to specify a *global* aesthetic mapping that applies across all geoms, then you can specify the `mapping` in the `ggplot` function.

Here is a simpler version of the previous plot.

```{r}
ggplot(data  = penguins, mapping = aes(x = body_mass_g, y = bill_length_mm, color = species)) + geom_point(aes(shape = species)) + geom_smooth(method = "lm")
```
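
Mixing global and local mappings also controls which layers are split by group. In the sketch below, only `x` and `y` are global, while `color` and `shape` are local to `geom_point`. Since `geom_smooth` never sees the `species` mapping, it fits a single line to all of the data. The object name `p_local` is illustrative.

```{r}
library(ggplot2)
data(penguins, package = "palmerpenguins")
# x and y are global; color and shape are local to geom_point,
# so geom_smooth does not split by species and draws one line
p_local <- ggplot(data = penguins,
                  mapping = aes(x = body_mass_g, y = bill_length_mm)) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth(method = "lm", formula = y ~ x)
p_local
```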

## Customizing plots (Video: [YouTube](https://youtu.be/Pe5r881J4p8), [Panopto](https://ucdenver.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=0d6a421f-6f27-43bd-bdca-af0f014ebe3d))

Unsurprisingly, the graphics produced by **ggplot2** can be greatly customized. We consider some of the basic customizations below.

### Assigning a plot

One of the strengths of **ggplot2** is that you can easily build a plot layer by layer. In this case, it is useful to assign the current plot a name that can then be added to later. 

Note: When you assign a `ggplot` a name, it will NOT be displayed, as is typical when assigning objects in R. You have to then print the object using the `print` function or by simply running the object name in the Console. Let's assign the name `ggp` to the basic penguin scatter plot displaying bill length versus body mass, using colors and shapes to distinguish between different species.

```{r}
ggp <- ggplot(data  = penguins, mapping = aes(x = body_mass_g, y = bill_length_mm, color = species)) + geom_point(aes(shape = species))
ggp
```

We'll customize this plot below.

### Axis labels and titles
The x-axis labels, y-axis labels, and title of a plot are controlled using the `xlab`, `ylab`, and `ggtitle` functions, respectively. Let's customize the penguin scatter plot previously discussed. We'll update the `ggp` object to include our improvements then print the result.

```{r}
ggp <- ggp + xlab("body mass (g)") + ylab("bill length (mm)") + ggtitle("Penguin body characteristics")
ggp
```

### Axis limits
The x-axis and y-axis limits are controlled using the `xlim` and `ylim` functions. Each function takes the lower and upper limit you want to set, e.g., `xlim(0, 100)`. Let's change the axes for the penguin scatter plot previously discussed. There's nothing wrong with the previous limits, but we'll extend them a bit for demonstration.

```{r}
ggp + xlim(2000, 7000) + ylim(0, 80)
```
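
Extending the limits is harmless, but narrowing them with `xlim` or `ylim` removes the observations that fall outside the new limits (with a warning), which also affects anything computed from the data, such as a smooth. To zoom in while keeping all observations, `coord_cartesian` can be used instead. The sketch below contrasts the two; the limits of 3000 and 5000 g and the object names are arbitrary.

```{r}
library(ggplot2)
data(penguins, package = "palmerpenguins")
ggp_limits <- ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
# xlim() discards observations outside the limits before fitting the line
p_dropped <- ggp_limits + xlim(3000, 5000)
# coord_cartesian() keeps all observations and only changes the visible window
p_zoomed <- ggp_limits + coord_cartesian(xlim = c(3000, 5000))
p_dropped
p_zoomed
```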

### Manual aesthetics
Aesthetics can be set manually in a geom by specifying the aesthetic directly in the geom but outside the `aes` function. Let's change the shape of the points for a scatter plot, increase the size of the points, and manually change the color of the points.

```{r}
ggplot(data  = penguins) + geom_point(aes(x = body_mass_g, y = bill_length_mm), shape = 15, size = 5, color = "blue")
```
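
A common mistake is to place the manual value inside `aes`. In that case **ggplot2** treats `"blue"` as a variable with a single level and scales it like any other variable, so the points are drawn in the default scale color and a legend appears. The sketch below contrasts the two placements; the object names are illustrative.

```{r}
library(ggplot2)
data(penguins, package = "palmerpenguins")
# inside aes(): "blue" is scaled like a one-level variable,
# so the points are NOT blue and a legend is added
p_mapped <- ggplot(data = penguins) +
  geom_point(aes(x = body_mass_g, y = bill_length_mm, color = "blue"))
# outside aes(): the color is set manually and the points are blue
p_set <- ggplot(data = penguins) +
  geom_point(aes(x = body_mass_g, y = bill_length_mm), color = "blue")
p_mapped
p_set
```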

### Scale customization

*Scaling* is the name of the process used by **ggplot2** to map a variable to a unique value. This is particularly important for `factor` variables. For example, color scaling is the process by which the `color` aesthetic is mapped to particular colors. These can be customized using the various `scale_*` functions, where the "\*" is a placeholder for the rest of the function name.

Let's manually set the color scaling using the `scale_color_manual` function. The colors provided below come from the *Dark2* color palette of the *qualitative* type provided by the ColorBrewer website ([https://colorbrewer2.org/](https://colorbrewer2.org/)), which provides color advice. It's particularly great for identifying colorblind-friendly color palettes.

```{r}
ggp + scale_color_manual(values = c("#1b9e77", "#d95f02", "#7570b3"))
```
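
When colors are supplied as an unnamed vector, they are matched to the levels of the variable in order. A slightly safer variation, sketched below, is to name each color after the level it belongs to so that the pairing is explicit. The name `species_colors` is illustrative, and the plot is rebuilt here so that the example stands on its own.

```{r}
library(ggplot2)
data(penguins, package = "palmerpenguins")
# each color is tied to a specific level of species by name
species_colors <- c(Adelie = "#1b9e77",
                    Chinstrap = "#d95f02",
                    Gentoo = "#7570b3")
p_named <- ggplot(data = penguins,
                  mapping = aes(x = body_mass_g, y = bill_length_mm,
                                color = species)) +
  geom_point(aes(shape = species)) +
  scale_color_manual(values = species_colors)
p_named
```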

Alternatively, we could have used the `scale_color_brewer` function instead, which automatically includes the color palettes designed by ColorBrewer. Run `?scale_color_brewer` for more details.

```{r}
# Dark2 is a qualitative palette, so we request type = "qual"
ggp + scale_color_brewer(type = "qual", palette = "Dark2")
```

Scalings can be customized in many ways, but it quickly becomes complicated, so we don't discuss them further.

### Legend customization

The legend can be customized in many ways. To move the location of the legend, you can specify the `legend.position` argument of the `theme` function.

```{r}
ggp + theme(legend.position = "bottom")
```
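
As one example of customizing legend names, the legend title can be changed with the `labs` function by naming the aesthetic that the legend describes. In the sketch below, both `color` and `shape` are given the same title so that **ggplot2** keeps them in a single combined legend. The title text and object names are arbitrary.

```{r}
library(ggplot2)
data(penguins, package = "palmerpenguins")
p_base <- ggplot(data = penguins,
                 mapping = aes(x = body_mass_g, y = bill_length_mm,
                               color = species)) +
  geom_point(aes(shape = species))
# give the color and shape legends the same title so they stay merged
p_legend <- p_base +
  labs(color = "Penguin species", shape = "Penguin species") +
  theme(legend.position = "bottom")
p_legend
```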

You can customize legend size, background, names, colors, etc. For specific implementation details, it is probably best to do a web search.

### Themes

A theme is a customized style that changes the overall appearance of your graphic. **ggplot2** provides a number of built-in themes. Some of the built-in themes include:

* `theme_gray`: the default theme. Gray background and white grid lines.
* `theme_bw`: white background and gray grid lines.
* `theme_classic`: white background and no grid lines.
* `theme_minimal`: no background annotations. Light grid lines, but no axis lines or panel border.
* `theme_light`: white background, gray grid lines, and a box around the plotting region.
* `theme_dark`: essentially a "dark" opposite to `theme_light`.

We will use the **patchwork** package [@R-patchwork] to easily combine several plots into a single graphic. The **patchwork** package uses `()` to group a set of plots, `|` to separate plots horizontally, and `/` to stack sets of plots vertically.

```{r}
ggp_gray <- ggp + theme_gray() + ggtitle("theme_gray")
ggp_bw <- ggp + theme_bw() + ggtitle("theme_bw")
ggp_classic <- ggp + theme_classic() + ggtitle("theme_classic")
ggp_minimal <- ggp + theme_minimal() + ggtitle("theme_minimal")
ggp_light <- ggp + theme_light() + ggtitle("theme_light")
ggp_dark <- ggp + theme_dark() + ggtitle("theme_dark")
library(patchwork)
(ggp_gray | ggp_bw) / (ggp_classic | ggp_minimal) / (ggp_light | ggp_dark)

```

These themes can be customized further by changing the appropriate aspects of the plots. We leave the user to use web searches for the desired customizations.
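As one small illustration of such a customization, the sketch below starts from `theme_bw`, enlarges the base font size, removes the minor grid lines, and makes the title bold. The plot name `ggp_theme` and the specific settings are assumptions chosen only for demonstration.

```{r}
library(ggplot2)
data(penguins, package = "palmerpenguins")
# hypothetical stand-in for ggp
ggp_theme <- ggplot(penguins) +
  geom_point(aes(x = body_mass_g, y = bill_length_mm, color = species))
# apply the complete theme first, then adjust individual elements
ggp_custom <- ggp_theme +
  ggtitle("customized theme_bw") +
  theme_bw(base_size = 14) +
  theme(panel.grid.minor = element_blank(),
        plot.title = element_text(face = "bold"))
ggp_custom
```

The call to `theme` comes after `theme_bw` because adding a complete theme replaces any theme settings made before it.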

## Other components of a `ggplot` 

### Facetting (Video: [YouTube](https://youtu.be/6NGrEJ7phTA), [Panopto](https://ucdenver.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=8b06b243-0bde-400d-b8d2-af0f0151c6a9))

*Facetting* a data set creates separate plotting panels for each level of one or more discrete variables. Facetting is useful for examining patterns for combinations of levels for one or more discrete variables. Technically, facetting can be performed for `numeric` variables with a relatively small number of unique values, but for practical reasons, facetting is most appropriate for `factor` variables. 

There are two functions for creating facetted graphics in **ggplot2**:

* `facet_wrap` facets the plots based on (typically) a single `factor` variable. The panels are wrapped around the plotting window.
* `facet_grid` forms a matrix of panels based on row and column facetting variables.

In the plot below, we facet scatter plots of `bill_length_mm` versus `body_mass_g` by `species`.

```{r}
ggplot(penguins) + geom_point(aes(x = body_mass_g, y = bill_length_mm)) + facet_wrap(~species)
```

By default, the same x and y axes are used in all panels. This is standard because it facilitates direct comparisons across panels. This can be customized using the `scales` argument of the `facet_*` functions. The allowable `scales` argument values are:

* `fixed`: same x and y axes for all panels. The default.
* `free`: individual x and y axes for each panel
* `free_x`: individual x axes for each panel, common y axis
* `free_y`: individual y axes for each panel, common x axis
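As a quick sketch of the `scales` argument, the chunk below repeats the plot faceted by `species` but lets each panel have its own y axis. The object name `ggp_free` is hypothetical.

```{r}
library(ggplot2)
data(penguins, package = "palmerpenguins")
# each panel gets its own y axis; the x axis stays common to all panels
ggp_free <- ggplot(penguins) +
  geom_point(aes(x = body_mass_g, y = bill_length_mm)) +
  facet_wrap(~ species, scales = "free_y")
ggp_free
```

Freeing an axis makes the pattern within each panel easier to see, at the cost of making comparisons across panels harder.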

We now use the `facet_grid` function to facet by two `factor` variables, `species` and `sex`. Prior to plotting, we remove the observations with `NA` for the `sex` variable.

```{r}
ggplot(subset(penguins, !is.na(sex))) + geom_point(aes(x = body_mass_g, y = bill_length_mm)) + facet_grid(species ~ sex)
```

There appear to be differences in the sizes of the male and female penguins, but it is difficult to compare the patterns across panels. It may be more useful to facet by `species` but distinguish by `sex` in a single panel.

```{r}
ggplot(subset(penguins, !is.na(sex))) + geom_point(aes(x = body_mass_g, y = bill_length_mm, color = sex, shape = sex)) + facet_wrap(~ species)
```

It is clear now that across species, the males tend to have greater body mass and bill length in comparison to the females.

### Statistical transformations (Video: [YouTube](https://youtu.be/ia7CZEBgoKc), [Panopto](https://ucdenver.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=d8423414-cb23-4bf4-9875-af0f0151c66f))

A *statistical transformation* (or stat) is the process that **ggplot2** uses to summarize the data before plotting. Every geom has a default `stat` argument that determines how the data are summarized. Perhaps surprisingly, every `stat_*` function also has a default `geom` argument that specifies how the summarized data should be plotted.

In general, you can simply learn to use the geom that summarizes and plots the data to your liking. Depending on the data you have, sometimes you may want to specify a different `stat` argument. In general, custom `stat` arguments are beyond the scope of this tutorial.

However, we do consider a simple example involving bar charts.

By default, `geom_bar` uses `stat_count` to count the number of values having each level of a `factor` variable and then plots the height of each bar above the names of the `factor` levels.

```{r}
ggplot(penguins) + geom_bar(aes(x = sex))
```
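To see the pairing between geoms and stats, we can build the same chart starting from the stat instead of the geom. This is only a sketch; the object names `ggp_geom` and `ggp_stat` are hypothetical, and `layer_data` is used to extract the values computed for each layer.

```{r}
library(ggplot2)
data(penguins, package = "palmerpenguins")
# geom_bar uses stat = "count" by default ...
ggp_geom <- ggplot(penguins) + geom_bar(aes(x = sex))
# ... and stat_count uses geom = "bar" by default
ggp_stat <- ggplot(penguins) + stat_count(aes(x = sex))
# both layers compute the same counts
layer_data(ggp_geom)$count
layer_data(ggp_stat)$count
```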

However, let's suppose our data frame already has counts for the number of values in each level. In that case, we must change the `stat` argument for `geom_bar` to `"identity"` (for `stat_identity`). We do this in the example below, where we also have to clearly indicate the `x` variable and the `y` variable in the plot.

```{r}
# count values with each level
penguins_count <- as.data.frame(table(penguins$sex, useNA = "ifany"))
# rename variables
names(penguins_count) <- c("sex", "count")
# print data frame
penguins_count
# create plot
ggplot(penguins_count) + geom_bar(aes(x = sex, y = count), stat = "identity")
```
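**ggplot2** also provides `geom_col`, which uses `stat_identity` by default, so it produces the same chart without setting the `stat` argument. The sketch below rebuilds `penguins_count` so that the chunk runs on its own; the name `ggp_col` is hypothetical.

```{r}
library(ggplot2)
data(penguins, package = "palmerpenguins")
# rebuild the table of counts
penguins_count <- as.data.frame(table(penguins$sex, useNA = "ifany"))
names(penguins_count) <- c("sex", "count")
# geom_col uses the values in the data as the bar heights
ggp_col <- ggplot(penguins_count) + geom_col(aes(x = sex, y = count))
ggp_col
```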

### Position 

The `position` argument of a geom describes the positions of the objects produced. Most of the time this is pretty straightforward, but there can be times when it is helpful (or unhelpful) to customize `position`.

#### Jittering data (Video: [YouTube](https://youtu.be/Rv3oNQQIKIw), [Panopto](https://ucdenver.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=fb9f1eb4-a86a-46b0-b8c5-af0f0151c63c))

Sometimes, data values/information overlap, which makes interpreting the plot more difficult. In that case, it can be useful to "jitter" the data so that the values/information are not overlapping. The data are not technically as accurate, but the large-scale distribution of the data is easier to interpret.

The `Galton` data set in the **HistData** package [@R-HistData] includes 928 observations of 2 variables. The variables are:

* `parent`: a `numeric` variable indicating the average height of a child's mother and father (in).
* `child`: a `numeric` variable indicating the height of each child (in).

The heights are recorded in class intervals of width 1 inch, so each variable takes only a small number of distinct values. Additional information can be found by running `?HistData::Galton` in the Console. 

Unfortunately, this coarse recording of the data values leads to overplotting. Below, we load the `Galton` data and create a scatter plot of child height versus parents' average height.

```{r}
data(Galton, package = "HistData")
ggplot(Galton) + geom_point(aes(x = parent, y = child))
```

There is a lot of overplotting in this graphic. We use the `position` argument to `jitter` the data to make the overall distribution clearer. There's not a lot of information gained from jittering the data in this particular example, but we can see that there are more observations in the central part of the plot.

```{r}
ggplot(Galton) + geom_point(aes(x = parent, y = child), position = "jitter")
```
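The amount of jitter can be controlled by passing `position_jitter` to the `position` argument instead of the string `"jitter"`. The sketch below is one possible choice: the `width`, `height`, `seed`, and `alpha` values are assumptions picked only for illustration, and the name `ggp_jitter` is hypothetical.

```{r}
library(ggplot2)
data(Galton, package = "HistData")
# move each point by at most 0.3 in horizontally and vertically,
# fix the seed so the plot is reproducible, and make points semi-transparent
ggp_jitter <- ggplot(Galton) +
  geom_point(aes(x = parent, y = child),
             position = position_jitter(width = 0.3, height = 0.3, seed = 1),
             alpha = 0.3)
ggp_jitter
```

Combining jitter with transparency makes regions with many observations appear darker.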

#### Stacking, filling, and dodging (Video: [YouTube](https://youtu.be/q1tTRuReq7w), [Panopto](https://ucdenver.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=6238c99f-525d-4589-853d-af0f0151c6dc))

I was tempted to title this section, "plots that make me emotional" because several of the graphics below cause me to go through a range of mostly unpleasant emotions when I view them.

In this section, we'll see how `stack`, `fill`, and `dodge` can be used in the `position` argument. While these examples demonstrate their use, some of the results are an absolute mess. I will conclude with alternative recommendations.

When creating a bar chart, a `factor` variable is mapped to the `x` aesthetic. If a different `factor` variable is mapped to the `fill` aesthetic, then **ggplot2** will `stack` the bars for the `fill` variable inside the bars for the `x` variable. Consider the bar chart below in which we "stack" penguin `sex` within the bars for the different species.

```{r}
ggplot(penguins) +  geom_bar(aes(x = species, fill = sex))
```

If you compare this bar chart with the original one that doesn't fill by `sex`, the bar height is identical. However, within each species, we `fill` the bars to show the number of penguins with each level of `sex`.

This plot is difficult to interpret. It takes a lot of mental energy to compute approximately how many male, female, and `NA` observations there are within each species. While the updated bar chart compactly provides a lot of information, it doesn't facilitate easy interpretation.

If we want to compare the proportion of male, female, and `NA` penguins in each species, we can change the `position` argument to `"fill"`. In that case, the bars within each level of `species` will be scaled to one, directly facilitating comparisons between the colors within each bar.

```{r}
ggplot(penguins) +  geom_bar(aes(x = species, fill = sex), position = "fill")
```

I still consider this a poor display of the data, but we can see that the proportion of males is close to 50% for all species.
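When the exact proportions matter, a table may serve better than the chart. The sketch below computes the within-species proportions that the "fill" chart displays, using base R's `table` and `prop.table`; the name `sex_by_species` is hypothetical.

```{r}
data(penguins, package = "palmerpenguins")
# cross-tabulate species and sex, keeping NA as its own category
sex_by_species <- table(penguins$species, penguins$sex, useNA = "ifany")
# divide each row by its total so the proportions within a species sum to 1
sex_prop <- prop.table(sex_by_species, margin = 1)
round(sex_prop, 2)
```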

Lastly, if we don't want to `stack` or `fill` the bars, we can have them `dodge` each other. In that case, the bars that would otherwise overlap sit next to each other in the plot.

```{r}
ggplot(penguins) +  geom_bar(aes(x = species, fill = sex), position = "dodge")
```

If you want to compare the number of penguins with each `sex` within each `species`, this graphic isn't too bad. However, the bars representing male penguins (or alternatively, female penguins) are spread out and more difficult to compare directly because of the other bars between them.

While the above charts provide a fair amount of information compactly, to facilitate ease of interpretation we should construct our graphics to highlight the characteristic of importance.

For example, if we want to compare the counts of each penguin `species` within each `sex`, it would be better to facet the data by `sex` and display the count of each species.

```{r}
ggplot(penguins) +  geom_bar(aes(x = species, fill = species)) + facet_wrap(~ sex)
```

Alternatively, if we want to highlight the composition of `sex` within each `species`, then I recommend creating a bar chart of `sex` while facetting by `species`. This plot is quite similar to the "dodge" plot above, but provides further distinction between the charts for each species.

```{r}
ggplot(penguins) +  geom_bar(aes(x = sex, fill = sex)) + facet_wrap(~ species)
```

The examples above are shown for factor variables with only three levels of each factor, and you may think I'm overreacting (and perhaps I am). However, the issues mentioned above are magnified when our `factor` variables have many levels. To illustrate this point, I present "stack" and "dodge" bar charts of the `diamonds` data in the **ggplot2** package, in which we compare the `cut` and `clarity` of diamonds. Each variable has at least 5 levels. The large number of levels for the `cut` and `clarity` variables makes it difficult to quickly interpret the relationship between the two variables. The **patchwork** package is used to combine the two charts into a single graphic. (We also rotate the x-axis labels 90 degrees so they don't overlap each other.) How easy is it for you to interpret the information below?

```{r}
ggp_stack <- ggplot(diamonds) + geom_bar(aes(x = cut, fill = clarity)) + ggtitle("stack") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggp_dodge <- ggplot(diamonds) + geom_bar(aes(x = cut, fill = clarity), position = "dodge") + ggtitle("dodge") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggp_stack + ggp_dodge
```
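Following the faceting recommendation made for the penguin data, one possible alternative for the `diamonds` data is sketched below: a bar chart of `clarity` faceted by `cut`. The name `ggp_facet` is hypothetical, and whether this display is preferable depends on which comparison you want to highlight.

```{r}
library(ggplot2)
# one panel per cut; the bars show the count of each clarity level
ggp_facet <- ggplot(diamonds) +
  geom_bar(aes(x = clarity)) +
  facet_wrap(~ cut) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggp_facet
```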

### Coordinate Systems (Video: [YouTube](https://youtu.be/naQPwZxkb2M), [Panopto](https://ucdenver.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=6e799963-500c-4fe6-9fa1-af0f0151d1ee))

A coordinate system defines the position of coordinates relative to one another. By default, **ggplot2** uses the Cartesian coordinate system ([https://en.wikipedia.org/wiki/Cartesian_coordinate_system](https://en.wikipedia.org/wiki/Cartesian_coordinate_system)), which treats coordinates as points on a flat plane. This is how humans tend to think about coordinates in relation to one another.

Other coordinate systems are commonly used for maps because points are actually on an ellipsoidal object (the earth!), which complicates their relationship.

We do not discuss coordinate systems in detail here. The only coordinate system I want to mention is the "flipped" coordinate system (`coord_flip`), which flips the x and y axes. This is useful for rotating objects in a plot. Consider the boxplots below.


```{r}
ggplot(penguins) + geom_boxplot(aes(x = species, y = body_mass_g, fill = species))
ggplot(penguins) + geom_boxplot(aes(x = species, y = body_mass_g, fill = species)) + coord_flip()
```
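Horizontal boxplots can also be obtained without `coord_flip` by mapping the `factor` variable to `y` and the `numeric` variable to `x`. The sketch below assumes **ggplot2** version 3.3.0 or later, where geoms such as `geom_boxplot` infer their orientation from the aesthetic mapping; the name `ggp_horiz` is hypothetical.

```{r}
library(ggplot2)
data(penguins, package = "palmerpenguins")
# mapping the factor to y yields horizontal boxplots directly
ggp_horiz <- ggplot(penguins) +
  geom_boxplot(aes(x = body_mass_g, y = species, fill = species))
ggp_horiz
```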

## References
  
  
  