--- title: "Lesson 1: Visualizing data with *ggplot2*" date: 2023-10-19 format: html: standalone: true embed-resources: true number-sections: true toc: true editor: visual --- # Prepare the R environment for this lesson When we start R, we only have access to the **base R** packages. In order to use any additional packages, we need to load them into memory with the `library()`{.r} function. For this lesson we'll need two packages: *ggplot2* and *palmerpenguins*. Run the following code chunk to load these two packages. Note, the second library command is empty. Modify the command so it loads the palmerpenguins package. ```{r} library(ggplot2) library(palmerpenguins) ``` It's good practice to start R code, stored in R scripts or quarto/Rmarkdown documents, by loading the packages we'll be using in the body of the code. It gives readers, including future versions of ourselves, a consistent place to check all of the packages required to run our code. # A note about functions Functions are how we do everything in R. This includes simple procedures, like calculating the sum of three numbers, and more complex operations, like performing a linear regression and plotting the resulting regression line against the input data. Above, we saw an example of the general convention for writing functions in R: `function_name(arguments)`{.r}. Generally speaking, a function's **arguments** provide it with input and can control how it runs. # R Documentation When we download and install a package from CRAN, we're also downloading documentation pages that describe how to run each function in the package. In RStudio, we can access the documentation through the *Help* tab, located in the lower right pane of the RStudio window. Let's use the `sum()`{.r} function as an example. To access the documentation pages for the `sum()`{.r} function: 1. Click the *Help* tab in the lower right pane of the RStudio window. 2. Click the search box in the upper right corner of the *Help* tab. 3. Type "sum" in the search bar. Note, we don't include the parentheses after the function name when we're searching for documentation. 4. Press "Enter" The documentation page for the `sum()`{.r} function should appear in the *Help* pane. It contains several sections describing what the function does, the names of its arguments and how they work, and the output of the function. The very end of the documentation page also contains example code that we can run to get a better feel for how the function works (these are very useful). We can also access these same help pages directly through the R console by using the `help()`{.r} function, or the "?" operator. ```{r} #| eval: false help("sum") ``` ```{r} #| eval: false ?sum ``` When learning about a new function, it's often most helpful to pay special attention to the usage, arguments, and examples sections of the function's documentation page. ::: callout-tip ## Keep Using the *Help* Tab As we proceed through these materials, don't hesitate to use the help feature whenever we encounter a new function. ::: # Plotting penguins ## Look at the Palmer Penguins data table After loading the *palmerpenguins* package, we now have access to table of penguin data. We can view the first 10 lines of this table directly in the R console: ```{r paged.print=FALSE} penguins ``` Each row in this table contains data from a single penguin measurement during the study period. Looking at the columns, we see all of the different information the researchers recorded about each penguin. We can also see that there are observations with missing information, represented by `NA`{.r} values. In R terminology, this is called a *data frame*. It is a natural way of representing rectangular, spreadsheet-style data. The penguin data are technically stored as a *tibble*, which is a special type of data frame used by the tidyverse. The *palmerpenguins* package also contains documentation for the penguin data frame. Look up "penguins" using the *Help* pane, or one of the other help functions. The documentation describes the contents of each column in the data frame. In addition to viewing the data frame in the R console, we can use the `View()`{.r} function to quickly examine its contents in a scrollable, spreadsheet-style window. ```{r} #| eval: false View(penguins) ``` Inside the view window, we can search the data frame for specific values, or re-order it by values in each of the columns. While the `View()`{.r} function is useful for smaller datasets, it is not generally suitable for data frames with more than a few thousand rows. ::: {.callout-note icon="false"} ### The "V" in the `View()`{.r} function is capitalized If we try to use the `View()`{.r} function with a lower case "v", we'll generally get an error: ```{r} #| error: true view(penguins) ``` ::: Lastly, we can use the `summary()`{.r} function to get summary statistics about the data in each column of our data frame. ```{r} summary(penguins) ``` These functions help us building intuition about new datasets and spot potential problems for downstream analyses. ## Visualize the penguin data with *ggplot2* {#visualize-w-ggplot2} The `penguins` data frame contains measurements of each penguin's body mass and flipper length. Here we will use the *ggplot2* package to visualize these data and examine the relationship between these two anatomical measurements. Below we'll review the code we need to create this graph: ![](../IMAGES/Lesson-01_Flipper-length-v-body-mass_Scatterplot-w-trend.png){fig-alt="Scatterplot comparing flipper length and body mass across three penguin species" fig-align="center"} ### How to create a basic scatterplot We start with the `ggplot()`{.r} function, which creates our ggplot and specifies the data we want to use. Think of this like setting up a blank canvas before we start painting. Note, we're giving the `ggplot()`{.r} function access to the penguin data frame with the `data` argument. ```{r} ggplot(data = penguins) ``` Next, we use the `mapping` argument to tell the `ggplot()`{.r} function *how* we want it to use the penguin data. In *ggplot2*, we always use the `aes()`{.r} function to define how to map the variables (columns) in our data to the visual properties (aesthetics) of the shapes we want to paint. In this example we want to plot body mass with the x-axis and flipper length with the y-axis. ```{r} ggplot(data = penguins, mapping = aes(x = body_mass_g, y = flipper_length_mm)) ``` Note how mapping the x/y aesthetics has affected the graph. The axes are now labeled and the `ggplot()`{.r} function has automatically set their ranges based on the range of body masses and flipper lengths in the penguin data frame. Now that we've prepared our canvas, we can start adding layers of paint. We paint shapes in *ggplot2* using **geom** functions. Since we want to create a scatterplot, we'll use the `geom_point()`{.r} function to add a layer of points to our plot. There is a whole family of different `geom_` functions that we can use to plot different types of graphs (e.g. scatter plots, line graphs, bar graphs). ```{r} ggplot(data = penguins, mapping = aes(x = body_mass_g, y = flipper_length_mm)) + geom_point() ``` Note that we combine the `geom_point()`{.r} function with the `ggplot()`{.r} function using the `+` sign. We can treat this like we're "adding" a layer of points on top of the canvas we created with `ggplot()`{.r}. Also, take note of the warning we get from the *ggplot2* code. Recall that we saw some columns with `NA`{.r} values when we were looking at the contents of the penguins data frame. The warning is telling us that two of the entries in the penguins data frame have `NA`{.r} values in either the `body_mass_g` or `flipper_length_mm` columns. The `geom_point()`{.r} function has no way of placing a point with a missing x- or y-coordinate, so *ggplot2* automatically excludes rows with missing values before plotting. This filter only applies to columns (variables) in our data frame that we're mapping to aesthetics. *ggplot2* will still use rows in the penguins data frame that have `NA`{.r} values in the `bill_length_mm` and `sex` columns, since we're not currently using those to plot anything. So the general procedure we follow for plotting data with *ggplot2* is to prepare our canvas with the `ggplot()`{.r} function, use the `aes()`{.r} function to tell *ggplot2* how we want to use our data to paint the canvas, and apply paint to our canvas with a `geom_`{.r} function (geom_point in the example above). ### Decorating our plots with more data Now that we have a basic scatterplot in hand, we can try to gain deeper insights into our data by incorporating more information into our plot. Our current scatterplot shows a positive relationship between flipper length and body mass. We know the penguin data frame contains measurements from three different penguin species. Let's map species to the color aesthetic to see how this trend looks across all three species. ```{r} ggplot(data = penguins, mapping = aes(x = body_mass_g, y = flipper_length_mm, color = species)) + geom_point() ``` Not only did *ggplot2* automatically assign a different color to each penguin species, it also added a legend to the graph. That means this same block of code can work for a penguin data frame that contains data from one, two, or ten species of penguin. To get a clearer picture of the trend between flipper length and body mass, we're no going to add a regression line to the plot. We'll do this with the `geom_smooth()`{.r} function. We'll use the `method = "lm"`{.r} argument to specify we want to plot the line we get from fitting our data with a **l**inear **m**odel. ```{r} ggplot(data = penguins, mapping = aes(x = body_mass_g, y = flipper_length_mm, color = species)) + geom_point() + geom_smooth(method = "lm") ``` It looks like `geom_smooth()`{.r} fit separate linear models for each penguin species and plotted the lines using the same colors as the points. While this is a useful feature (we'll make use of this below), we want to fit all of the data with a single linear model. When we define aesthetic mappings in the `ggplot()`{.r} function, they apply to *all* `geom_`{.r} functions in that plot. By mapping species to color, we told `geom_smooth()` we want it to color the smoothed line by species, which means it needs to fit a separate line for each species. To solve this problem we can specify an aesthetic mapping within a specific `geom_`{.r} function. Let's try moving the species to color mapping to `geom_point()`{.r} and `geom_smooth()`{.r}. Which option still colors the points by species, but fits all of the data with a single linear model? ```{r} ggplot(data = penguins, mapping = aes(x = body_mass_g, y = flipper_length_mm)) + geom_point() + geom_smooth(method = "lm") ``` Now that we've sorted out the trend line, let's return to how we're plotting the points. The colors improve the readability of this graph, but could pose a problem if this figure is rendered in black and white, or if a reader has certain types of colorblindness. One solution is to map the points from different penguin species to different shapes. Try adding an aesthetic mapping of `shape` to species. ```{r} ggplot(data = penguins, mapping = aes(x = body_mass_g, y = flipper_length_mm)) + geom_point(mapping = aes(color = species)) + geom_smooth(method = "lm") ``` Note that the legend automatically updates to reflect the new shape mapping. The are many more aesthetics beyond shape and color. The documentation for `geom_`{.r} functions contains a special section listing the aesthetic mappings supported by that geom. Look up the help page for "geom_point" and find the list of supported aesthetics. We're almost done recreating the figure we saw at the [beginning of this section](#visualize-w-ggplot2). The axis labels our version of the figure are not as clean, and we're still missing the plot title and subtitle. Collectively, these attributes are known as the graph's "labels", and we can modify them using the `labs()`{.r} function. ```{r} ggplot(data = penguins, mapping = aes(x = body_mass_g, y = flipper_length_mm)) + geom_point(mapping = aes(color = species, shape = species)) + geom_smooth(method = "lm") + labs(title = "Relationship between flipper length and body mass", subtitle = "Across three species of penguins studied at the Palmer Research Station", x = "Body mass (g)", y = "Flipper length (mm)") ``` Lastly, we'll change the color scheme to a different palette from the default. The *ggplot2* package includes a few additional color palettes we can apply to our plot using `scale_color_`{.r} functions. For now, we'll select a palette from the ColorBrewer collection that is good for representing qualitative data and is more colorblind safe than the default palette. ```{r} ggplot(data = penguins, mapping = aes(x = body_mass_g, y = flipper_length_mm)) + geom_point(mapping = aes(color = species, shape = species)) + geom_smooth(method = "lm") + labs(title = "Relationship between flipper length and body mass", subtitle = "Across three species of penguins studied at the Palmer Research Station", x = "Body mass (g)", y = "Flipper length (mm)") + scale_color_brewer(palette = "Dark2") ``` For now, we're not going to worry about the specifics of the ColorBrewer collection of palettes (though feel free to examine the documentation for the `scale_color_brewer()`{.r} function). But we can see that we can further adjust how *ggplot2* displays our data using the `scale_`{.r} family of functions. With that last tweak, we've successfully recreated the figure we saw [above](#visualize-w-ggplot2). ### Creating our own scatterplot ![Artwork by @allison_horst](https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/man/figures/culmen_depth.png){fig-alt="Diagram of penguin bill indicating which dimensions correspond to depth and length" fig-align="center" width="4in"} Now we're going to put everything we just learned into practice. Using the `penguins` data and the *ggplot2* functions, we're going to create a scatterplot of bill depth vs bill length. We can view the contents of `penguins` data frame directly, or consult the documentation to find the names of the columns that contain this information. Again, we'll add a regression line that we fit using all of the data. ```{r} ``` Looking at the graph we created, what does it suggest about a relationship between bill depth and bill length? How might species affect this? How does the graph change if we fit separate regression lines within each species? ```{r} ``` The difference between these two graphs is an example of [Simpson's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox). Briefly, this is a phenomenon where a trend between two variables at the population level disappears, emerges, or reverses when we divide the population into groups. We were able to observe these changes by experimenting with the way we visualized the penguin data. We’re only just scratching the surface of what we can do with *ggplot2*, but we’ve used it to rapidly visualize the penguin data and generate some interesting observations. In the next lesson, we will explore some of these observations further by working directly with the data to calculate summary statistics. # R session information Just as it's good practice for us to list all of the packages we load at the top of our code, it's equally important to report the version number for R and the package versions we used at the end of our code. This will make it much easier to reproduce our results in the future. By running the `sessionInfo()`{.r} function at the end of our code, we capture any changes to the environment that happened in the body of our code (e.g some function silently load additional packages when you run them). ```{r} sessionInfo() ```