---
title: "Lesson 3: Exploring geom_ functions and customizing ggplots"
date: 2023-11-16
format:
    html:
        standalone: true
        embed-resources: true
        number-sections: true
        toc: true
editor: visual
---

# Prepare the R environment for this lesson

For this lesson we'll need four packages:

-   *medicaldata* (new package)
-   *palmerpenguins*
-   *ggplot2*
-   *dplyr*

First, we need to install the *medicaldata* package, using the
`install.packages()`{.r} function.

```{r}
#| eval: false

```

Now we use the `library()`{.r} function to load all of these packages.

```{r}

```


# The *medicaldata* Package and the *covid_testing* Dataset

In this lesson we'll be using a dataset from the *medicaldata* package. The
medicaldata package contains several real medical datasets, intended for use in
teaching R. We can browse the list of available datasets on the [medicaldata
package website](https://higgi13425.github.io/medicaldata/). We'll be using the
`covid_testing` dataset for this lesson. To get a feel for this dataset, take a
look at the "covid_testing" documentation pages on the medicaldata package
website.

We can also access a quick description of the covid_testing data frame through
R's help docs.

```{r}
#| eval: false
help("covid_testing")
```


# Data Visualization with the *ggplot2* Package

*ggplot2* is one of the core tidyverse packages. It is a design system allowing
us to assemble complex figures from simple components that we "layer" on top of
each other. The documentation for this package is excellent. We can access it
either through the RStudio interface, or through the [ggplot2 website](https://ggplot2.tidyverse.org/)

## Refresher

In the first lesson, we used ggplot2 to generate a scatterplot from the
penguin dataset. With this code block, we tell ggplot2 to use values from the
'bill_length_mm' column as x-axis coordinates, and values from the
'flipper_length_mm' column as y-axis coordinates. Remember, the `aes()`{.r} or
"aesthetic" function tells ggplot2 how to map the columns in our data to
visual components of the graph.

```{r}
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = flipper_length_mm)) +
    geom_point()
```

We can streamline this code by using the R pipe (`|>`{.r}) and by specifying
`data` and `mapping` as *positional arguments*. *Refer to the `ggplot()`{.r}
documentation for the order of its arguments.*

```{r}

```

Note, when we're constructing figures with ggplot2, we connect the ggplot2
functions using the '+' sign, instead of the R pipe. This is because ggplot2 was
written before there were any pipes in R. If we accidentally use an R pipe
instead of a '+' sign, we'll get an error

```{r}
#| error: true
penguins |> 
    ggplot(aes(x = bill_length_mm,
               y = flipper_length_mm)) |> 
    geom_point()
```

## Aesthetic Mapping

Let's modify the code to map the 'species' column to the 'color' aesthetic.

```{r}
penguins |> 
    ggplot(aes(x = bill_length_mm,
               y = flipper_length_mm)) +
    geom_point()
```

What happens if we map the 'body_mass_g' column to color instead?

```{r}
penguins |> 
    ggplot(aes(x = bill_length_mm,
               y = flipper_length_mm)) +
    geom_point()
```

Notice that we see a color gradient. When we used species to color the points,
we ended up with three discrete colors. This is because the values in the
body_mass_g and species columns have different **data types**, which affect how
R interacts with them.

::: callout-note
### Data Types

A value's data type determines how it's stored in memory and what type of
operations we can perform with it. R supports several basic variable types:

1.  **double** - A number or *numeric* type that has decimals. This is the
    default type for any number in R. We can use doubles with all arithmetic
    operators.
2.  **integer** - A number without any decimal places. If we want an integer, we
    need to create one explicitly using a function like `as.integer()`{.r}.
    Integers also work with all arithmetic operators.
3.  **character** - Text data, enclose by quotation marks, that can contain
    letters, numbers, and special characters. These are called "strings" in
    other programming languages. Characters are the most flexible type in terms
    of what they can store, but the don't work with arithmetic operators.
4.  **factor** - R's representation of categorical data. These are similar to
    characters in that they can contain letters, numbers, and special
    characters. This also means it's generally easy to make factors out of
    character data. When we create a factor variable or column , we define all
    possible categories, or "levels", that variable/column can assume. There's
    additional complexity to factors that we'll get into later.
5.  **logical** - A binary value that can only be `TRUE` or `FALSE`. We've seen
    logical values when we used the relational operators (e.g. ==, \>, !=). We
    can tell these apart from characters, because `TRUE` and `FALSE` aren't
    surrounded by any quotation marks. Lastly, we can use these with arithmetic
    operators. `TRUE` is treated like 1, and `FALSE` is treated like 0

We can check the type of any object in R using the `class()`{.r} function.
:::

Species is a factor type while body mass is a numeric type. Factors are discrete
variables, so ggplot2 maps them to discrete aesthetics. Numeric types are
continuous variables, so ggplot2 cannot map them to discrete values.

We can use also use expressions as aesthetic mappings.

```{r}
penguins |> 
    ggplot(aes(x = bill_length_mm,
               y = flipper_length_mm,
               color = bill_length_mm > 45)) +
    geom_point()
```


## Mapping vs Setting Aesthetics

Above we used the `aes()`{.r} function to map columns in our penguin data to the
aesthetics in our plots. Let's see what happens when we move the aesthetic
arguments outside of the `aes()`{.r} function.

```{r}
penguins |> 
    ggplot(aes(x = bill_length_mm,
               y = flipper_length_mm)) +
    geom_point(color = "darkorange")
```

This is called **setting** an aesthetic, because we're not mapping any of our
data to the aesthetic. When we set aesthetic arguments, we provide them with
scalar values. The specific values will depend upon the aesthetic (i.e. some are
numbers, some require known values).

Look at the `geom_point()`{.r} documentation, and scroll down to the Aesthetics
section. This lists all the different aesthetics this function supports. Try
adding some of these aesthetics to the above plot, both in and out of the
`aes()`{.r} function. *Hint: we can print a list of R color constants with the
`colors()`{.r} function*.

```{r}
penguins |> 
    ggplot(aes(x = bill_length_mm,
               y = flipper_length_mm)) +
    geom_point()
```

This **vignette** from the ggplot package describes all of the aesthetics and
the various values they can accept. We can find this vignette on the [ggplot2
CRAN page](https://cran.r-project.org/web/packages/ggplot2/), or by using the
`vignette()`{.r} function.
```{r}
#| eval: false
vignette("ggplot2-specs")
```

:::{.callout-note}
### Vignettes
Tutorials demonstrating the functionality of a package are called **vignettes**.
They're another way to familiarize ourselves with a new package. These are
downloaded to your system when you install a package and are accessible through
the package website (on CRAN or Bioconductor). Not every package comes with a
vignette, since they are not a required by CRAN or Bioconductor.
:::


# Data Exploration with `geom_` Functions

So far we've seen two examples of `geom_`{.r} functions: `geom_point()`{.r} and
`geom_smooth()`{.r}. These functions control how our data are represented, or
painted on the ggplot2 canvas. The ggplot2 package includes many different
`geom_`{.r} functions that draw different shapes and extract different types of
summary information from our data. Here are a few of the functions you'll use
most commonly for exploratory data analyses.


## Bar chart

The `geom_bar()`{.r} function generates bar charts, which are useful for
plotting the distributions of categorical data. Let's use this to get a breakdown
of the number of COVID-19 tests by their result (positive, negative, invalid).

```{r}
covid_testing |> 
    ggplot(aes(x = result)) +
    geom_bar()
```

When we use the `geom_bar()`{.r} function, we only provide it with an aesthetic
mapping for one of the axes (x or y). It automatically calculates the values for
the other axis, based on the distribution of our data across the axis we
specified.


## Histogram & Freqpoly

The `geom_histogram()`{.r} function calculates discrete density bins, given just
an x-coordinate mapping. Its useful for visualizing the distributions of
continuous numerical variables. Let's use the `geom_histogram()`{.r} function to
plot the distribution of COVID-19 tests over the first 100 days of the
`covid_testing` dataset.

```{r}
covid_testing |> 
    ggplot(aes(x = pan_day)) +
    geom_histogram()
```

Note, ggplot2 printed a status message with the histogram, pointing us toward
two arguments for the `geom_histogram()`{.r} function: `bins`, and `binwidth`.

The `bins` argument controls the number of bins it groups our data into. The
above plot sets the `bins` argument to a value of 30 (the default). Try
adjusting it below. What do you think will happen to the graph if we increase
the `bins` value? What if we decrease it?

```{r}
covid_testing |> 
    ggplot(aes(x = pan_day)) +
    geom_histogram(bins = 30)
```

Alternatively, we can use the `binwidth` argument to set the size of the
interval on the x-axis that defines a bin. Whatever value we enter for the
`binwidth` uses the same units as the x-axis (days, in our case). What do you
think will happen to the histogram as we increase the `binwidth`? And if we
decrease it?

```{r}
covid_testing |> 
    ggplot(aes(x = pan_day)) +
    geom_histogram(binwidth = 10)
```

The `geom_freqpoly()`{.r} function generates the same discrete distribution
graph as `geom_histogram()`{.r}, except it represents the data as lines, instead
of bars.

```{r}
covid_testing |> 
    ggplot(aes(x = pan_day)) +
    geom_freqpoly()
```

You can confirm these two functions work the same way by plotting them together.
Try that here:

```{r}
covid_testing |> 
    ggplot(aes(x = pan_day)) +
    
    
```

The line representation produced by `geom_freqpoly()`{.r} can make it easier to
plot multiple distributions together.

```{r}
covid_testing |> 
    ggplot(aes(x = pan_day, color = result)) +
    geom_freqpoly()
```

For comparison, here's the same data plotted as a histogram:

```{r}
covid_testing |> 
    ggplot(aes(x = pan_day, fill = result)) +
    geom_histogram()
```


## Density

The `geom_density()`{.r} function generates a smoothed density estimate for the
distribution of a continuous variable. This is similar to the histogram and
freqpoly geoms, except those geoms generate binned representations of the
distribution, while `geom_density()`{.r} generates a continuous representation
of the distribution.

```{r}
covid_testing |> 
    ggplot(aes(x = pan_day)) +
    geom_density()
```

We can think of this like a probability distribution estimate for all the
possible values our x-axis variable can assume (the area under the curve should
sum approximately to 1).

:::{.callout-note}
### The `after_stat()`{.r} function

We can use the `after_stat()`{.r} function to access the derived stats that the
`geom_density()`{.r} function is calculating behind the scenes. Here we use
this function to change the y-value to the number of COVID-19 tests, instead of
the density value.

```{r}
covid_testing |> 
    ggplot(aes(x = pan_day)) +
    geom_density(aes(y = after_stat(count)))
```

We can refer to the the "Computed variables" section of a geom function's help
documentation to find the list of summary variables we can access with
`after_stat()`{.r}.
:::

Here we see that the `geom_density()`{.r} curve is effectively a smoothed
version of the `geom_freqpoly()`{.r} line.

```{r}
covid_testing |> 
    ggplot(aes(x = pan_day)) +
    geom_freqpoly(binwidth = 1) +
    geom_density(aes(y = after_stat(count)))
```

Why do you think these curves diverge when we increase the 'binwidth' for
`geom_freqpoly()`{.r}?

```{r}
covid_testing |> 
    ggplot(aes(x = pan_day)) +
    geom_freqpoly(binwidth = 3) +
    geom_density(aes(y = after_stat(count)))
```


## Boxplot & Violin plot

The `geom_boxplot()`{.r} function offers another way to summarize the
distributions of continuous data. It calculates and plots the median and
quartiles of a continuous variable mapped to the x or y aesthetic.

```{r}
covid_testing |> 
    ggplot(aes(y = age)) +
    geom_boxplot()
```

However, what makes boxplots truly useful is their ability to represent the
relationship numerical and categorical variables. Here we break up the age
distribution we plotted above, by COVID-19 test result.

```{r}
covid_testing |> 
    ggplot(aes(x = result, y = age)) +
    geom_boxplot()
```

Looking at these boxplots, we see a subtle indication that patients with
positive COVID-19 tests tend to skew a little older than those with negative
tests (we'll need to actually apply a statistical test if we want to say
anything more conclusive).

Violin plots are a combination of the `geom_boxplot()`{.r} and
`geom_density()`{.r} functions. The `geom_violing()`{.r} function plots a
sideways, mirrored representation of a numeric variable's distribution.

```{r}
covid_testing |> 
    ggplot(aes(x = result, y = age)) +
    geom_violin()
```

Violin plots can reveal more subtle differences in relationships between
numerical and categorical variables than we can see with a boxplot, like
biomodality in the underlying distributions.

Here we'll graph the `geom_violin()`{.r} and `geom_boxplot()`{.r} plots
together. How are the boxplots and violin plots similar? How are they different?

```{r}
covid_testing |> 
    ggplot(aes(x = result, y = age)) +
    geom_violin() +
    geom_boxplot(alpha = 0.5)
```

Here we used the `alpha` aesthetic to control the transparency of the boxplot
geom. Try changing the `alpha` value to see how it affects the graph.


## Horizontal and Vertical Lines

The `geom_hline()`{.r} and `geom_vline()`{.r} functions plot horizontal and
vertical lines, respectively. They're useful for marking cutoffs and regions
of interest in our plots.

Here we use `geom_hline()`{.r} to add a line to our age distribution boxplots
that marks the division between patients under and over 5 years of age.

```{r}
covid_testing |> 
    ggplot(aes(x = result, y = age)) +
    geom_violin() +
    geom_hline(yintercept = 5)
```

In this case, we *set* the value for the `yintercept` aesthetic, rather than
mapping it to a column in our data.


## Coordinate Transformations

The `coord_cartesian()`{.r} function allows us to manually set the x- and y-axis
ranges of our graphs. This allows us to focus on particular regions of interest.

```{r}
covid_testing |> 
    ggplot(aes(x = pan_day)) +
    geom_histogram(binwidth = 1) +
    coord_cartesian(xlim = c(0, 20))
```

The `coord_flip()`{.r} function allows us to swap the x- and y-axes.

```{r}
covid_testing |> 
    ggplot(aes(x = result, y = age)) +
    geom_boxplot() +
    coord_flip()
```

There are many more `coord_` functions that give us fine control coordinate
systems we use to create our ggplots.


## Setting Color Scales

Thus far we've just used the default color schemes ggplot2 provides. They're
fine for prototyping our analyses, but we'll probably want to change these to
something that looks better if we want to publish these figures. Now that you've
gained some experience with the look and feel of the default settings for
ggplot2 figures, you'll start to notice them when you read papers.

To specify which colors get mapped to our data, we need to prepare a *vector*
containing all of the colors we want to use.

```{r}
colors_for_covid_result <- c("grey","dodgerblue","orangered")
```

::: callout-note
### Data Structures

In our work so far, we've largely worked with tabular data stored in a data
frame. A data frame is an example of a *data structure*. These are constructs in
a programming language that are specially designed to store and organize data.
We'll interact with several types of data structures over the course of these
lessons:

1.  **Vectors**: Store ordered sequences of values, all of which must have the
    same type. We create vectors using the `c()`{.r} function ("combine"). The
    individual values in a vector are called "vector elements".
2.  **Data frame**: Rectangular data structure where columns are vectors (i.e.
    all values in a column need to have the same type) that all have the same
    length.
3.  **Matrix**: Rectangular like a data frame, except every single values in a
    matrix must have the same type. We often store metadata (e.g. gene IDs,
    sample labels) as the row/column names, using the `rownames()`{.r} and
    `colnames()`{.r} functions. Matrices are used for performing matrix math and
    are often much faster for computations than data frames.
:::

With our color vector in hand, we use the `scale_fill_manual()`{.r} function to
match these colors to the three COVID-19 test results in our data. Try changing
the result mapping from fill to color and see what happens.

```{r}
covid_testing |> 
    filter(age < 100) |> 
    ggplot(aes(x = result, y = age, fill = result)) +
    geom_boxplot() +
    scale_fill_manual(values = colors_for_covid_result)
```

The colors are assigned to the results in alphabetical order.

Instead of manually specifying the colors, we can also use existing collections
of colors ("palettes") that come from various packages. On of the palettes
packaged with ggplot2 is Color Brewer. We need to enter the name for one of
Color Brewer's palettes. We can view the options with the
`RColorBrewer::display.brewer.all()`{.r} function.

```{r}
covid_testing |> 
    filter(age < 100) |> 
    ggplot(aes(x = result, y = age, fill = result)) +
    geom_boxplot() +
    scale_fill_brewer(palette = "Dark2")
```

Another way to view the Color Brewer palettes is through the Color Brewer
[website](https://colorbrewer2.org/). There are many options on this site that
can help you find a good palette, including a toggle to limit your options to
colorblind safe options.


## Facets

When we're plotting complex data, we often run into cases where it would be
helpful to break up a figure into smaller sub-figures, each displaying a portion
of the input data. We do this with the `facet_grid()`{.r} and `facet_wrap()`{.r}
functions. In this example, we use facets to compare the distribution of ages
by COVID-19 test results between the tests collected in the drive-thru or in the
clinic.

```{r}
covid_testing |> 
    filter(age < 100) |> 
    ggplot(aes(x = result, y = age, fill = result)) +
    geom_boxplot() +
    scale_fill_manual(values = colors_for_covid_result) +
    facet_wrap(facets = vars(drive_thru_ind))
```

The `vars()`{.r} function is like a version of the `aes()`{.r} function that's
specific to `facet_wrap()`{.r} and `facet_grid()`{.r}. They tell ggplot2 to map
a variable from our dataset to the facets. You can think of the facet functions
like graphical versions of the `group_by()`{.r} function we saw previously. They
group data and aesthetics according to another variable in our data, and
generate separate graph panels for the data in each group.

Facets give use the ability to represent complex datasets with multiple
variables in a more digestible manner. Here, we use `facet_grid()`{.r} to look
for differences in age distributions by COVID-19 test results, across patient
groups, and by whether or not the test samples as collected at a drive-thru.

```{r}
covid_testing |> 
    filter(age < 100) |> 
    ggplot(aes(x = result, y = age, fill = result)) +
    geom_boxplot() +
    scale_fill_manual(values = colors_for_covid_result) +
    facet_grid(rows = vars(demo_group),
               cols = vars(drive_thru_ind))
```


## Saving ggplots

So far we've only created figures inside these markdown (Quarto) documents. We
can render this file into an HTML or PDF output and then extract the images from
the output file, but that's a lot of extra work if we just want one figure. For
ggplot2 figures, we can use the `ggsave()`{.r} function to save figures to files
on disk. The help doc for `ggsave()`{.r} contains information about how we can
control the dimensions, scale, resolution, and format of the saved figure. Here,
we'll save one of our faceted boxplot figures to the `IMAGES/` directory as a
PNG.

```{r}
#| eval: false
covid_testing |> 
    filter(age < 100) |> 
    # The "0" / "1" labels for the drive-thru status are not particularly clear.
    # Here we create a new column with informative labels for drive-thru status.
    # We'll use this new column for faceting, so the facet labels are clearer
    # for potential readers.
    mutate(
        drive_thru_status =
            case_match(drive_thru_ind,
                       0 ~ "In clinic",
                       1 ~ "Drive-thru")
    ) |> 
    ggplot(aes(x = result, y = age, fill = result)) +
    geom_boxplot() +
    scale_fill_manual(values = colors_for_covid_result) +
    facet_grid(rows = vars(demo_group),
               cols = vars(drive_thru_status),
               scales = "free_y")
ggsave(filename = "IMAGES/COVID-test-result-age-distribution_By-patient-group-and-drive-thru-status_Boxplots.png",
       units = "in",
       width = 6,
       height = 6)
```

`ggsave()`{.r} automatically saves the last plot we generated. Alternatively, we
can save our ggplot graph to a variable, and pass that variable to the `plot`
argument of the `ggsave()`{.r} function.


# *ggplot2* cheatsheet

These are great reference materials for each of the main Tidyverse packages,
including ggplot2. For day-to-day programming, it's helpful to print some of
these out and keep them at your workstation. These offer a great way to quickly
look up the info you need to run a function.


# *ggplot2* extensions

There are many packages that extend the functionality of ggplot2, either by
adding additional functionality to ggplot2, or implementing new graphing
functions based on ggplot2's conventions and grammar. You can browse a gallery
of these extensions [here](https://exts.ggplot2.tidyverse.org/).


# R session information

Here we report the version number for R and the package versions we used to
perform the analyses in this document.

```{r}

```