---
title: "STAT 302 Statistical Computing"
subtitle: "Lecture 4: Data Manipulation and Visualization"
author: "Yikun Zhang (_Winter 2024_)"
date: ""
output:
  xaringan::moon_reader:
    css: ["uw.css", "fonts.css"]
    lib_dir: libs
    nature:
      highlightStyle: tomorrow-night-bright
      highlightLines: true
      countIncrementalSlides: false
      titleSlideClass: ["center","top"]
---

```{r setup, include=FALSE, purl=FALSE}
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(comment = "##")
library(kableExtra)
```

# Outline

1. Using Packages in R

2. Data Manipulation via `tidyverse`

3. Basic Graphics in R

4. Data Visualization via `ggplot2`

* Acknowledgement: Parts of the slides are modified from the course materials by Prof. Ryan Tibshirani, Prof. Yen-Chi Chen, Prof. Deborah Nolan, Bryan Martin, and Andrea Boskovic.

---
class: inverse

# Part 1: Using Packages in R

---

# What is an R package?

R packages contain code, data, and documentation in a standardized collection format that can be installed and utilized by users of R.

--

- There are 19,961+ official R packages available on [Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/web/packages/available_packages_by_name.html). Apart from that, some unofficial R packages are also posted on [GitHub](https://github.com/).

--

- These packages implement miscellaneous statistical methods using functions in R, which makes our programming and data analysis easier.
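
---

# Inspecting Installed Packages

As a small illustration of the "code, data, and documentation" idea, the sketch below uses only base R functions to look at what is already installed. The choice of `stats`, `datasets`, and `ggplot2` as examples is ours, and whether `ggplot2` is present depends on the machine.

```{r eval=FALSE}
# Names of all packages installed in the current library paths
pkgs = rownames(installed.packages())
"stats" %in% pkgs  # `stats` ships with R

# Check whether a package is available without attaching it (TRUE or FALSE)
has_ggplot2 = requireNamespace("ggplot2", quietly = TRUE)

# Code and data bundled in packages
head(ls("package:stats"))   # a few functions exported by `stats`
data(package = "datasets")  # data sets shipped with `datasets`
```

Note: `requireNamespace()` only checks availability; we still need `library()` to attach a package.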

---

# How Can We Install R Packages?

If a package is officially available on [CRAN](https://cran.r-project.org/web/packages/available_packages_by_name.html), like most packages we will use for this course, we can install it using

```{r eval=FALSE}
install.packages("PACKAGE_NAME_IN_QUOTES")
```

Or, we can use the "_Packages_" tab in the lower right panel and click the "_Install_" button to install an official package in RStudio.

- After a package is installed, it is saved on our computer until we update R, and we don't need to re-install it.

- There is no need to include a call to `install.packages()` in any `.R` or `.Rmd` file!

--

Occasionally, we may want to install an R package from a `.tar.gz` file downloaded from CRAN or elsewhere:

```{r eval=FALSE}
install.packages("pkgname.tar.gz", repos = NULL, type = "source")
```

---

# How Can We Use R Packages?

After a package is installed, we can load it into our current R session using `library()` (or `require()` when we load it inside our own function):

```{r eval=FALSE}
library(PACKAGE_NAME)
# or
library("PACKAGE_NAME")
```

- Unlike `install.packages()`, it is not necessary to include the package name in quotes.

--

- Loading a package must be done with each new R session, so we should put calls to `library()` in our `.R` and `.Rmd` files whenever we use some R packages in our code.

- In `.Rmd` files, we can load all the required packages in the opening chunk and set the parameter `include = FALSE` in that chunk to hide the messages and code.

`r '' ````{r, include = FALSE}
```

---

# Install R Packages From GitHub

The `devtools` package provides an `install_github()` function to install R packages hosted on GitHub, though it requires the developer's name.
```{r eval=FALSE}
library(devtools)
install_github("DeveloperName/PackageName")
```

Here is an example where we don't have to load the `devtools` package:

```{r eval=FALSE}
devtools::install_github("zhangyk8/Debias-Infer", subdir = "R_Package")
```

--

The `githubinstall` package provides a function `githubinstall()`, which does not need the developer's name.

```{r eval=FALSE}
library(githubinstall)
githubinstall("PackageName")
```

---
class: inverse

# Part 2: Data Manipulation via `tidyverse`

---

# What is `tidyverse`?

The `tidyverse` is a coherent collection of packages in R for data science (and `tidyverse` itself is also a package that loads all its constituent packages). Packages include:

- Data reading and saving: `readr`.

- Data manipulation: `dplyr`, `tidyr`.

- Iteration: `purrr`.

- Visualization: `ggplot2`.

We can install all of them using

```{r eval=FALSE}
install.packages("tidyverse")
```

Note: We only need to do this once!

---

# Why Do We Need `tidyverse`?

- These packages have a very consistent API as well as an active developer and user community.

  - [Ranking CRAN R Packages by Number of Downloads](https://www.datasciencemeta.com/rpackages).

--

- Function names and commands follow a focused grammar.

- The functions are powerful and fast when working with data frames and lists (matrices, not so much, yet!).

- Pipes (the `%>%` operator) allow us to fluidly glue functionality together.

- At its best, `tidyverse` code can be read like a story using the pipe operator!

---

# Load `tidyverse` into R

We can load all the `tidyverse` packages into our current R session using the `library()` function.

```{r}
library(tidyverse)
```

---

# Conflicts in Using R Packages

Recall that R packages encapsulate functions written by different R developers.

- Occasionally, some of these functions in different packages may share the same name, which introduces a conflict.
--

- Whichever package we load most recently using `library()` will mask the earlier function of the same name, meaning that R will default to the most recently loaded version.

--

- In general, this is fine, especially with `tidyverse`. The conflict message is to make sure that we are aware of conflicts.

---

# Data Manipulation in a Tidy Way

- The packages `dplyr` and `tidyr` are going to be our main workhorses for data manipulation.

- The main data type used by these packages is the data frame (or tibble, but we won't go there).

--

Why do we need to learn data manipulation through `tidyverse`?

- Learning pipes `%>%` will facilitate our learning of the `dplyr` and `tidyr` verbs (or functions).

- The functions in `dplyr` are analogous to SQL counterparts, so by learning `dplyr` we get some SQL syntax for free!

---

# Learning Pipes `%>%`

Piping at its most basic level:

- _It uses the `%>%` operator to take the output from a previous function call and "pipe" it through to the next function, in order to form a flow of results._

--

This can really help with the readability of code when we use multiple nested functions!

- **Shortcut for typing `%>%`:** use `ctrl + shift + m` in RStudio (`cmd + shift + m` on macOS).

Note: Unix-like shells (e.g., on Linux) also have pipes, as in:

```{bash eval=FALSE}
ls -l | grep tidy | wc -l
```

---

# The Logic of Pipes with Single Arguments

Passing a single argument through pipes, we interpret the following code as $h(g(f(x)))$.

```{r eval=FALSE}
x %>% f %>% g %>% h
```

Note: In our mind, when we see the `%>%` operator, we should read this as "and then".

--

We can write `exp(1)` with pipes as `1 %>% exp`, and `log(exp(1))` as `1 %>% exp %>% log`.

```{r}
1 %>% exp
1 %>% exp %>% log
```

---

# The Logic of Pipes with Multiple Arguments

For multi-argument functions, we interpret the following code as $f(x,y)$.

```{r eval=FALSE}
x %>% f(y)
```

--

We can extract the first row of the `mtcars` data frame using the following pipe syntax.
```{r}
# Syntax in base R
head(mtcars, 1)
# Pipes syntax
mtcars %>% head(1)
```

---

# The Logic of Pipes with Multiple Arguments

The command `x %>% f(y)` can be equivalently written in **dot notation** as:

```{r eval=FALSE}
x %>% f(., y)
```

--

What is the advantage of using dots?

- Sometimes we may want to pass in a variable as the second or third (say, not first) argument to a function, with a pipe. As in:

```{r eval=FALSE}
x %>% f(y, .)
```

which is equivalent to $f(y,x)$.

---

# Some Examples with Pipes

Let's interpret the following code without executing it first.

```{r eval=FALSE}
state_df = data.frame(state.x77)
state.region %>%
  tolower %>%
  tapply(state_df$Income, ., summary)
```

--

```{r echo=FALSE}
state_df = data.frame(state.x77)
state.region %>%
  tolower %>%
  tapply(state_df$Income, ., summary)
```

---

# Some Examples with Pipes

Let's interpret the following code without executing it first.

```{r eval=FALSE}
x = "Data Manipulation with Pipes"
x %>%
  strsplit(split = " ") %>%
  .[[1]] %>% # indexing
  nchar %>%
  max
```

--

```{r echo=FALSE}
x = "Data Manipulation with Pipes"
x %>%
  strsplit(split = " ") %>%
  .[[1]] %>% # indexing
  nchar %>%
  max
```

---

# `dplyr` Functions

Some of the most important `dplyr` verbs (functions):

- `filter()`: subset rows based on a condition.

- `group_by()`: define groups of rows according to a column or specific condition.

- `summarize()`: apply computations across groups of rows.

- `arrange()`: order rows by value of a column.

- `select()`: pick out given columns.

- `mutate()`: create new columns.

- `mutate_at()`: apply a function to given columns (superseded by `across()` in recent `dplyr` versions, but still available).

---

# `filter()` Function

The `filter()` function subsets rows based on a condition.
```{r}
# Built-in data frame of cars data, 32 cars x 11 variables
mtcars %>% head(2)
```

--

```{r}
mtcars %>% filter((mpg >= 20 & disp >= 200) | (drat <= 3))
```

---

# `filter()` Function

An alternative approach using the `subset()` function in base R:

```{r}
subset(mtcars, (mpg >= 20 & disp >= 200) | (drat <= 3))
```

---

# `filter()` Function

An alternative approach using base R indexing syntax:

```{r}
mtcars[(mtcars$mpg >= 20 & mtcars$disp >= 200) | (mtcars$drat <= 3), ]
```

---

# `group_by()` Function

- The `group_by()` function defines groups of rows according to a column or specific condition.

```{r}
# Grouped by number of cylinders
mtcars %>% group_by(cyl) %>% head(2)
```

Note: The `group_by()` function doesn't actually change anything about the way that the data frame looks. The only difference is that, when it prints, the grouping is displayed.

---

# `summarize()` Function

The `summarize()` function applies computations across groups of rows.

```{r}
# Ungrouped
summarize(mtcars, mpg_avg = mean(mpg), hp_avg = mean(hp))
```

--

```{r}
# Grouped by number of cylinders
summarize(group_by(mtcars, cyl), mpg_avg = mean(mpg), hp_avg = mean(hp))
```

Can we rewrite the above code using pipes?

---

# `summarize()` Function

The `summarize()` function applies computations across groups of rows.

```{r}
mtcars %>%
  group_by(cyl) %>%
  summarize(mpg_avg = mean(mpg), hp_avg = mean(hp))
```

Note: Using the `group_by()` function makes the difference here.

---

# `arrange()` Function

The `arrange()` function orders rows by the value of a column.

```{r}
mtcars %>% arrange(mpg) %>% head(3)
```

--

```{r}
# Base R syntax
mpg_inds = order(mtcars$mpg)
head(mtcars[mpg_inds, ], 3)
```

---

# `arrange()` Function

We can also sort in descending order.

```{r}
mtcars %>% arrange(desc(mpg)) %>% head(3)
```

--

```{r}
# Base R syntax
mpg_inds_decr = order(mtcars$mpg, decreasing = TRUE)
head(mtcars[mpg_inds_decr, ], 3)
```

---

# `arrange()` Function

We can order by multiple columns as well.
```{r}
mtcars %>% arrange(desc(gear), desc(hp)) %>% head(7)
```

---

# `select()` Function

The `select()` function picks out given columns.

```{r}
mtcars %>% select(cyl, disp, hp) %>% head(3)
```

--

```{r}
# Base R syntax
head(mtcars[, c("cyl", "disp", "hp")], 3)
```

---

# Some Handy `select()` Helpers

```{r}
mtcars %>% select(starts_with("d")) %>% head(3)
```

```{r}
# Base R syntax
d_colnames = grep(x = colnames(mtcars), pattern = "^d")
head(mtcars[, d_colnames], 3)
```

Note: With base R syntax, we need to use a [regular expression](https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html).

---

# Some Handy `select()` Helpers

```{r}
mtcars %>% select(ends_with('t')) %>% head(3)
mtcars %>% select(contains('ar')) %>% head(3)
```

More details about these `select()` helper functions can be found in [this web page](https://dplyr.tidyverse.org/reference/select.html#useful-functions).

---

# `mutate()` Function

The `mutate()` function creates new columns.

```{r}
mtcars = mtcars %>% mutate(hp_wt = hp/wt, mpg_wt = mpg/wt)
# Base R
mtcars$hp_wt = mtcars$hp/mtcars$wt
mtcars$mpg_wt = mtcars$mpg/mtcars$wt
```

--

The newly created variables can be used immediately.

```{r}
mtcars = mtcars %>% mutate(hp_wt_again = hp/wt, hp_wt_cyl = hp_wt_again/cyl)
# Base R
mtcars$hp_wt_again = mtcars$hp/mtcars$wt
mtcars$hp_wt_cyl = mtcars$hp_wt_again/mtcars$cyl
```

---

# `mutate_at()` Function

The `mutate_at()` function applies a function to one or several columns.

```{r}
mtcars = mtcars %>% mutate_at(c("hp_wt", "mpg_wt"), log)
# Base R equivalent (note: running both versions applies `log` twice)
mtcars$hp_wt = log(mtcars$hp_wt)
mtcars$mpg_wt = log(mtcars$mpg_wt)
```

Note:

- Calling `dplyr` functions always outputs a new data frame, and it does not alter the existing data frame.

- To keep the changes, we have to reassign the data frame to be the output of the pipe! (See the example above).

---

# Linking `dplyr` to SQL

Learning `dplyr` also facilitates our understanding of SQL syntax.
- For example, `select()` is SELECT, `filter()` is WHERE, `arrange()` is ORDER BY, `group_by()` is GROUP BY, etc.

- This will make it easier for tasks that require using both R and SQL to manage data and build statistical models.

--

- Another major link to SQL is through merging or joining data frames, via the `left_join()` and `inner_join()` functions.

- More details can be found in [this web page](https://dplyr.tidyverse.org/reference/mutate-joins.html) and [Chapter 13 of the book "R for Data Science"](https://r4ds.had.co.nz/relational-data.html).

---

# `tidyr` Functions

Recall the tidy data principle for data (or a data frame/table) that we discussed in [Lecture 2](https://zhangyk8.github.io/teaching/file_stat302/Lectures/Lecture2_Data_Structures.html#79):

1. Each variable must have its own column.

2. Each observation must have its own row.

3. Each value must have its own cell.

--

Two of the most important `tidyr` verbs (functions) that help us achieve the tidy data principle are:

- `pivot_longer()`: make "wide" data longer.

- `pivot_wider()`: make "long" data wider.

There are many other verbs, such as `nest()`, `unnest()`, and the older `spread()` and `gather()` (superseded by `pivot_wider()` and `pivot_longer()`, respectively). More details can be found in [this web page](https://tidyr.tidyverse.org/reference/index.html).

---

# `pivot_longer()` Function

```{r message=FALSE}
# devtools::install_github("rstudio/EDAWR")
library(EDAWR) # Load some nice data sets
EDAWR::cases
```

--

```{r}
EDAWR::cases %>%
  pivot_longer(names_to = "year", values_to = "n", cols = 2:4)
```

---

# `pivot_longer()` Function

Here, we transposed columns 2:4 into a "year" column and put the corresponding count values into a column called "n".

- The `pivot_longer()` function did all the heavy lifting of the transposing work, and we just had to specify the output.
```{r}
# A different approach that does the same thing:
# select every column except `country`
EDAWR::cases %>%
  pivot_longer(names_to = "year", values_to = "n", cols = -country)
```

---

# `pivot_wider()` Function

Here, we transposed to a wide format by "size" and tabulated the corresponding "amount" for each "size".

- Note that `pivot_wider()` and `pivot_longer()` are inverses.

```{r}
EDAWR::pollution
EDAWR::pollution %>%
  pivot_wider(names_from = "size", values_from = "amount")
```

---
class: inverse

# Part 3: Basic Graphics in R

---

# Overview of Base R Plotting Functions

Base R has a set of powerful plotting tools:

- `plot()`: generic plotting function.

- `points()`: add points to an existing plot.

- `lines()`, `abline()`: add lines to an existing plot.

- `text()`, `legend()`: add text to an existing plot.

- `rect()`, `polygon()`: add shapes to an existing plot.

- `hist()`, `image()`: histogram and heatmap.

- `heat.colors()`, `topo.colors()`, etc: create a color vector.

- `density()`: estimate density, which can be plotted.

- `contour()`: draw contours, or add to existing plot.

- `curve()`: draw a curve, or add to existing plot.

---

# Scatter Plots

To make a scatter plot of one variable versus another, we use `plot()`.

```{r fig.width=6.5, fig.align='center', fig.height=5.5}
set.seed(123)
x = sort(runif(50, min=-2, max=2))
y = x^3 + rnorm(50)
plot(x, y)
```

---

# Plot Types

The `type` argument controls the plot type. Default is "p" for points; set it to "l" for lines. If we want both points and lines, set it to "b".

```{r fig.width=7, fig.align='center', fig.height=5}
plot(x, y, type="b")
```

More details can be found via `?plot`.

---

# Plot Labels

The `main` argument controls the title; `xlab` and `ylab` are the x and y labels.

```{r fig.width=6.5, fig.align='center', fig.height=5.5}
plot(x, y, main="A noisy cubic", xlab="My x variable", ylab="My y variable")
```

---

# Point Types

We use the `pch` argument to control point type.
```{r fig.width=7, fig.align='center', fig.height=6}
plot(x, y, pch = 19) # Filled circles
```

---

# Line Types

We use the `lty` argument to control the line type, and `lwd` to control the line width.

```{r fig.width=6.5, fig.align='center', fig.height=5.5}
plot(x, y, type="l", lty=2, lwd=3) # Dashed line, 3 times as thick
```

---

# Colors

We use the `col` argument to control the color. It can be:

- An integer between 1 and 8 for basic colors.

- A string for any of the 657 available named colors.

The function `colors()` returns a string vector of the available colors.

```{r fig.width=6.5, fig.align='center', fig.height=5.5}
plot(x, y, pch=19, col="red")
```

---

# Multiple Plots

To set up a plotting grid of arbitrary dimension, we use the `par()` function with the argument `mfrow`.

```{r fig.align='center'}
par(mfrow=c(2,2)) # Grid elements are filled by row
plot(x, y, main="Red cubic", pch=20, col="red")
plot(x, y, main="Blue cubic", pch=20, col="blue")
plot(rev(x), y, main="Flipped green", pch=20, col="green")
plot(rev(x), y, main="Flipped purple", pch=20, col="purple")
```

---

# Margins of the Plots

Default margins in R are large (and ugly); to change them, we use the `par()` function with the argument `mar`.

```{r fig.align='center'}
par(mfrow = c(2,2), mar = c(4,4,2,0.5))
plot(x, y, main="Red cubic", pch=20, col="red")
plot(x, y, main="Blue cubic", pch=20, col="blue")
plot(rev(x), y, main="Flipped green", pch=20, col="green")
plot(rev(x), y, main="Flipped purple", pch=20, col="purple")
```

---

# Saving Plots

We use the `pdf()` function to save a pdf file of our plot in the current R working directory.
```{r}
getwd() # This is where the pdf will be saved
pdf(file="noisy_cubics.pdf", height=7, width=7) # Height, width are in inches
par(mfrow=c(2,2), mar=c(4,4,2,0.5))
plot(x, y, main="Red cubic", pch=20, col="red")
plot(x, y, main="Blue cubic", pch=20, col="blue")
plot(rev(x), y, main="Flipped green", pch=20, col="green")
plot(rev(x), y, main="Flipped purple", pch=20, col="purple")
graphics.off() # Close all open graphics devices so that the file is written
```

Similarly, we can use the `jpeg()` and `png()` functions to save JPEG and PNG files.

---

# Adding to Plots

The main tools for this are:

- `points()`: add points to an existing plot.

- `lines()`, `abline()`: add lines to an existing plot.

- `text()`, `legend()`: add text to an existing plot.

- `rect()`, `polygon()`: add shapes to an existing plot.

Note: We should pay attention to **layers**: they work just like painting a picture by hand, so later additions are drawn on top of earlier ones.

---

# Plotting a Histogram

Recall that we can plot a histogram of a numeric vector using `hist()`.

```{r fig.width=7, fig.align='center', fig.height=6}
king_lines = readLines("https://github.com/zhangyk8/zhangyk8.github.io/raw/master/_teaching/file_stat302/Data/king.txt")
king_words = strsplit(paste(king_lines, collapse=" "),
                      split="[[:space:]]|[[:punct:]]")[[1]]
king_words = tolower(king_words[king_words != ""])
king_wlens = nchar(king_words)
hist(king_wlens)
```

---

# Adding a Histogram to the Existing Plot

To add a histogram to an existing plot (say, another histogram), we use `hist()` with `add=TRUE`.

```{r fig.width=7, fig.align='center', fig.height=6}
hist(king_wlens, col="pink", freq=FALSE, breaks=0:20,
     xlab="Word length", main="King word lengths")
hist(king_wlens + 5, col=rgb(0,0.5,0.5,0.5),
     freq=FALSE, breaks=0:20, add=TRUE)
```

---

# Adding a Density Curve to a Histogram

To estimate a density from a numeric vector, we use the `density()` function; see [this note](http://faculty.washington.edu/yenchic/19A_stat535/Lec2_density.pdf) and [this tutorial](https://arxiv.org/pdf/1704.03924.pdf) for more details.
```{r}
density_est = density(king_wlens, adjust=1.5) # 1.5 times the default bandwidth
class(density_est)
names(density_est)
```

---

# Adding a Density Curve to a Histogram

The `density()` function returns a list that has components `x` and `y`, so we can call `lines()` directly on the returned object.

```{r fig.width=7, fig.align='center', fig.height=6}
hist(king_wlens, col="pink", freq=FALSE, breaks=0:20,
     xlab="Word length", main="King word lengths")
lines(density_est, lwd=3)
```

---

# Plotting a Heatmap

To plot a heatmap of a numeric matrix, we use the `image()` function.

```{r}
# Here, %o% gives the outer product
(mat = 1:5 %o% 6:10)
image(mat) # Red means high, white means low
```

---

# Orientation of `image()`

The orientation of `image()` is to plot the heatmap according to the following order, in terms of the matrix elements:

$$\begin{array}{cccc}
(1,\text{ncol}) & (2, \text{ncol}) & \ldots & (\text{nrow},\text{ncol}) \\
\vdots & & & \\
(1,2) & (2,2) & \ldots & (\text{nrow},2) \\
(1,1) & (2,1) & \ldots & (\text{nrow},1)
\end{array}$$

This is a *90 degrees counterclockwise* rotation of the "usual" printed order for a matrix:

$$\begin{array}{cccc}
(1,1) & (1,2) & \ldots & (1,\text{ncol}) \\
(2,1) & (2,2) & \ldots & (2,\text{ncol}) \\
\vdots & & & \\
(\text{nrow},1) & (\text{nrow},2) & \ldots & (\text{nrow},\text{ncol})
\end{array}$$

---

# Orientation of `image()`

Therefore, if we want the displayed heatmap to follow the usual order, we must rotate the matrix **$90^{\circ}$ clockwise** before passing it in to `image()` (equivalently, reverse the row order and take the transpose).

```{r}
clockwise90 = function(a) { t(a[nrow(a):1,]) } # Handy rotate function
image(clockwise90(mat))
```

---

# Color Scale

The default is to use a red-to-white color scale in `image()`, but the `col` argument can take any vector of colors.
Built-in functions `gray.colors()`, `rainbow()`, `heat.colors()`, `topo.colors()`, `terrain.colors()`, `cm.colors()` all return contiguous color vectors of given lengths.

```{r}
phi = dnorm(seq(-2,2,length=50))
normal.mat = phi %o% phi
image(normal.mat, col=terrain.colors(20)) # Terrain colors
```

---

# Drawing Contour Lines

To draw contour lines from a numeric matrix, we use the `contour()` function; to add contours to an existing plot (say, a heatmap), we use `contour()` with `add=TRUE`.

```{r echo=TRUE}
image(normal.mat, col=terrain.colors(20))
contour(normal.mat, add=TRUE)
```

---
class: inverse

# Part 4: Data Visualization via `ggplot2`

---

# What is `ggplot2`?

`ggplot2` is an R package for "declaratively" creating graphics.

- We provide the data and tell `ggplot2` how to map variables to aesthetics and what graphical primitives to use. Then, it takes care of the details.

- Plots in `ggplot2` are built sequentially using layers.

- When using `ggplot2`, it is essential that our data are tidy!

Let's work through how to build a plot layer by layer.

---

# Step-by-step Practice with `ggplot2`

First, let's initialize a plot. We use the `data` parameter to tell `ggplot` what data frame to use.

* It should be tidy data, in either a `data.frame` or `tibble`!

.pull-left[
```{r, eval = FALSE}
library(gapminder)
ggplot(data = gapminder) #<<
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6}
library(gapminder)
ggplot(data = gapminder)
```
]

---

# Step-by-step Practice with `ggplot2`

Add an aesthetic using `aes()` within the initial `ggplot()` call.

* It controls our axis variables as well as graphical parameters such as color, size, and shape.
.pull-left[
```{r, eval = FALSE}
ggplot(data = gapminder,
       mapping = aes(x = year, y = lifeExp)) #<<
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6}
ggplot(data = gapminder,
       mapping = aes(x = year, y = lifeExp)) #<<
```
]

---

# Step-by-step Practice with `ggplot2`

Now `ggplot` knows what to plot, but it doesn't know how to plot it yet. Let's add some points with `geom_point()`.

* This is a new layer! We always add layers using the `+` operator.

.pull-left[
```{r, eval = FALSE}
ggplot(data = gapminder,
       mapping = aes(x = year, y = lifeExp)) +
  geom_point() #<<
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6}
ggplot(data = gapminder,
       mapping = aes(x = year, y = lifeExp)) +
  geom_point() #<<
```
]

---

# Step-by-step Practice with `ggplot2`

Let's make our points smaller and red.

.pull-left[
```{r, eval = FALSE}
ggplot(data = gapminder,
       mapping = aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 0.75) #<<
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6}
ggplot(data = gapminder,
       mapping = aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 0.75) #<<
```
]

---

# Step-by-step Practice with `ggplot2`

Let's try switching them to lines.

.pull-left[
```{r, eval = FALSE}
ggplot(data = gapminder,
       mapping = aes(x = year, y = lifeExp)) +
  geom_line(color = "red", linewidth = 0.75) #<<
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6}
ggplot(data = gapminder,
       mapping = aes(x = year, y = lifeExp)) +
  geom_line(color = "red", linewidth = 0.75) #<<
```
]

---

# Step-by-step Practice with `ggplot2`

We want lines connected by country, not just in the order that they appear in the data.
.pull-left[
```{r, eval = FALSE}
ggplot(data = gapminder,
       mapping = aes(x = year, y = lifeExp,
                     group = country)) + #<<
  geom_line(color = "red", linewidth = 0.5)
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6}
ggplot(data = gapminder,
       mapping = aes(x = year, y = lifeExp,
                     group = country)) + #<<
  geom_line(color = "red", linewidth = 0.5)
```
]

---

# Step-by-step Practice with `ggplot2`

We can color by continent to explore differences across continents.

* We use `aes()` because we want to color by something in our data.

* Putting a color within `aes()` will automatically add a legend.

* We have to remove the color within `geom_line()`, or it will override the `aes()`.

.pull-left[
```{r, eval = FALSE}
ggplot(data = gapminder,
       aes(x = year, y = lifeExp,
           group = country, color = continent)) + #<<
  geom_line(linewidth = 0.5) #<<
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6}
ggplot(data = gapminder,
       aes(x = year, y = lifeExp,
           group = country, color = continent)) + #<<
  geom_line(linewidth = 0.5) #<<
```
]

---

# Step-by-step Practice with `ggplot2`

Let's add another layer for the trend lines by continent!

* We use a new `aes()` to group them differently than our lines (by continent).

* We will make them stick out by having them thicker and darker.

* We don't want confidence bands around the trend lines, so we set `se = FALSE`.
.pull-left[
```{r, eval = FALSE, message = FALSE}
ggplot(data = gapminder,
       aes(x = year, y = lifeExp,
           group = country, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5, #<<
              color = "black", #<<
              method = "loess") #<<
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6, message = FALSE}
ggplot(data = gapminder,
       aes(x = year, y = lifeExp,
           group = country, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5, #<<
              color = "black", #<<
              method = "loess") #<<
```
]

---

# Step-by-step Practice with `ggplot2`

The plot is cluttered and hard to read. Let's try separating by continents using **facets**!

* We use `facet_wrap()`, which takes in a **formula** object and uses a tilde `~` with the variable name.

.pull-left[
```{r, eval = FALSE, message = FALSE}
ggplot(data = gapminder,
       aes(x = year, y = lifeExp,
           group = country, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5,
              color = "black",
              method = "loess") +
  facet_wrap(~ continent) #<<
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6, message = FALSE}
ggplot(data = gapminder,
       aes(x = year, y = lifeExp,
           group = country, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5,
              color = "black",
              method = "loess") +
  facet_wrap(~ continent) #<<
```
]

---

# Step-by-step Practice with `ggplot2`

Now, we formalize the labels on our plot using `labs()`.

* We can also edit labels one at a time using `xlab()`, `ylab()`, `ggtitle()`, etc.

* We should do this in every graph that we present! It is unlikely that the variable names in our data frame are styled the way we want them displayed, and changing the labels improves human readability!
```{r, eval=FALSE, message = FALSE}
ggplot(data = gapminder,
       aes(x = year, y = lifeExp,
           group = country, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5,
              color = "black",
              method = "loess") +
  facet_wrap(~ continent) +
  labs(title = "Life expectancy over time by continent", #<<
       x = "Year", y = "Life Expectancy", color = "Continent") #<<
```

---

# Step-by-step Practice with `ggplot2`

```{r, fig.height=5, fig.width=7, fig.align='center', message = FALSE}
ggplot(data = gapminder,
       aes(x = year, y = lifeExp,
           group = country, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5,
              color = "black",
              method = "loess") +
  facet_wrap(~ continent) +
  labs(title = "Life expectancy over time by continent", #<<
       x = "Year", y = "Life Expectancy", color = "Continent") #<<
```

---

# Step-by-step Practice with `ggplot2`

Let's center our title by adjusting `theme()`.

* `element_text()` tells `ggplot()` how to display the text.

* `hjust` is the horizontal justification; we set it to 0.5 to center the title.

```{r, eval = FALSE, message = FALSE}
ggplot(data = gapminder,
       aes(x = year, y = lifeExp,
           group = country, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5,
              color = "black",
              method = "loess") +
  facet_wrap(~ continent) +
  labs(title = "Life expectancy over time by continent",
       x = "Year", y = "Life Expectancy", color = "Continent") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", #<<
                                  size = 14)) #<<
```

---

# Step-by-step Practice with `ggplot2`

Since each facet is already labeled by continent, the legend is redundant. Let's remove it.
.middler[
```{r, fig.height=5, fig.width=7, fig.align='center', message = FALSE}
ggplot(data = gapminder,
       aes(x = year, y = lifeExp,
           group = country, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5,
              color = "black",
              method = "loess") +
  facet_wrap(~ continent) +
  labs(title = "Life expectancy over time by continent",
       x = "Year", y = "Life Expectancy", color = "Continent") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
        legend.position = "none") #<<
```
]

---

# Step-by-step Practice with `ggplot2`

If we don't like the default gray background, we can always replace it using `theme_bw()`.

* There are several other theme options! (Use `?theme_bw` to look them up.)

* A complete theme such as `theme_bw()` replaces all earlier `theme()` settings, so we add it before our own `theme()` adjustments.

```{r, eval = FALSE, message = FALSE}
ggplot(data = gapminder,
       aes(x = year, y = lifeExp,
           group = country, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5,
              color = "black",
              method = "loess") +
  facet_wrap(~ continent) +
  labs(title = "Life expectancy over time by continent",
       x = "Year", y = "Life Expectancy") +
  theme_bw() + #<<
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
        legend.position = "none")
```

---

# Step-by-step Practice with `ggplot2`

```{r fig.height=5, fig.width=7, fig.align='center', message = FALSE}
ggplot(data = gapminder,
       aes(x = year, y = lifeExp,
           group = country, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5,
              color = "black",
              method = "loess") +
  facet_wrap(~ continent) +
  labs(title = "Life expectancy over time by continent",
       x = "Year", y = "Life Expectancy") +
  theme_bw() + #<<
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
        legend.position = "none")
```

---

# Step-by-step Practice with `ggplot2`

We can increase all of our text proportionally using `base_size` within `theme_bw()` to increase readability.
* We could also do this by adjusting `text` within `theme()`.
* We don't need to manually adjust our title size. This will scale everything automatically.

```{r, eval = FALSE, message = FALSE}
ggplot(data = gapminder,
       aes(x = year, y = lifeExp,
           group = country, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5,
              color = "black", method = "loess") +
  facet_wrap(~ continent) +
  labs(title = "Life expectancy over time by continent",
       x = "Year", y = "Life Expectancy") +
  theme_bw(base_size = 16) + #<<
  theme(plot.title = element_text(hjust = 0.5, face = "bold"), #<<
        legend.position = "none")
```

---

# Step-by-step Practice with `ggplot2`

```{r, fig.height=5, fig.width=7, fig.align='center', message = FALSE}
ggplot(data = gapminder,
       aes(x = year, y = lifeExp,
           group = country, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5,
              color = "black", method = "loess") +
  facet_wrap(~ continent) +
  labs(title = "Life expectancy over time by continent",
       x = "Year", y = "Life Expectancy") +
  theme_bw(base_size = 16) + #<<
  theme(plot.title = element_text(hjust = 0.5, face = "bold"), #<<
        legend.position = "none")
```

---

# Step-by-step Practice with `ggplot2`

Now our text is a good size, but the x-axis labels overlap. Let's rotate them.
```{r, eval = FALSE, message = FALSE}
ggplot(data = gapminder,
       aes(x = year, y = lifeExp,
           group = country, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5,
              color = "black", method = "loess") +
  facet_wrap(~ continent) +
  labs(title = "Life expectancy over time by continent",
       x = "Year", y = "Life Expectancy") +
  theme_bw(base_size = 16) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) #<<
```

---

# Step-by-step Practice with `ggplot2`

```{r, fig.height=5, fig.width=7, fig.align='center', message = FALSE}
ggplot(data = gapminder,
       aes(x = year, y = lifeExp,
           group = country, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5,
              color = "black", method = "loess") +
  facet_wrap(~ continent) +
  labs(title = "Life expectancy over time by continent",
       x = "Year", y = "Life Expectancy") +
  theme_bw(base_size = 16) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) #<<
```

---

# Step-by-step Practice with `ggplot2`

Lastly, let's space out our panels by adjusting `panel.spacing.x`.
```{r, eval = FALSE, message = FALSE}
ggplot(data = gapminder,
       aes(x = year, y = lifeExp,
           group = country, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5,
              color = "black", method = "loess") +
  facet_wrap(~ continent) +
  labs(title = "Life expectancy over time by continent",
       x = "Year", y = "Life Expectancy") +
  theme_bw(base_size = 16) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
        panel.spacing.x = unit(0.75, "cm")) #<<
```

---

# Step-by-step Practice with `ggplot2`

```{r, fig.height=5, fig.width=7, fig.align='center', message = FALSE}
ggplot(data = gapminder,
       aes(x = year, y = lifeExp,
           group = country, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5,
              color = "black", method = "loess") +
  facet_wrap(~ continent) +
  labs(title = "Life expectancy over time by continent",
       x = "Year", y = "Life Expectancy") +
  theme_bw(base_size = 16) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
        panel.spacing.x = unit(0.75, "cm")) #<<
```

---

# Step-by-step Practice with `ggplot2`

When the entire plot is ready, we can also store it as an object.
```{r}
lifeExp_plot <- ggplot(data = gapminder,
                       aes(x = year, y = lifeExp,
                           group = country, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5,
              color = "black", method = "loess") +
  facet_wrap(~ continent) +
  labs(title = "Life expectancy over time by continent",
       x = "Year", y = "Life Expectancy") +
  theme_bw(base_size = 16) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
        panel.spacing.x = unit(0.75, "cm"))
```

---

# Step-by-step Practice with `ggplot2`

Then, we can plot it by just calling our object.

```{r, fig.height=5, fig.width=7, fig.align='center', message = FALSE}
lifeExp_plot
```

---

# Step-by-step Practice with `ggplot2`

We can also save it in our `figures` subfolder using `ggsave()`.

* Set the `height` and `width` parameters to control the size of the saved image.

```{r, eval = FALSE}
ggsave(filename = "figures/lifeExp_plot.pdf", plot = lifeExp_plot,
       height = 5, width = 7)
```

Note: **Never** save figures from our analysis using screenshots or point-and-click! Doing so leads to lower-quality, non-reproducible figures!

---

# Some Comments on `ggplot2`

* What we just made was a *very* complicated and fine-tuned plot!

* It is very common to have to Google how to adjust certain things.

--

* So does the creator of `ggplot2`:

---

# A Simpler Example: Histogram

.pull-left[
```{r, eval = FALSE}
ggplot(data = gapminder, #<<
       aes(x = lifeExp)) #<<
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6, message = FALSE}
ggplot(data = gapminder, #<<
       aes(x = lifeExp)) #<<
```
]

---

# A Simpler Example: Histogram

.pull-left[
```{r, eval = FALSE}
ggplot(data = gapminder,
       aes(x = lifeExp)) +
  geom_histogram() #<<
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6, message = FALSE}
ggplot(data = gapminder,
       aes(x = lifeExp)) +
  geom_histogram() #<<
```
]

---

# A Simpler Example: Histogram

.pull-left[
```{r, eval = FALSE}
ggplot(data = gapminder,
       aes(x = lifeExp)) +
  geom_histogram(binwidth = 1) #<<
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6, message = FALSE}
ggplot(data = gapminder,
       aes(x = lifeExp)) +
  geom_histogram(binwidth = 1) #<<
```
]

---

# A Simpler Example: Histogram

.pull-left[
```{r, eval = FALSE}
ggplot(data = gapminder,
       aes(x = lifeExp)) +
  geom_histogram(binwidth = 1,
                 color = "black", #<<
                 fill = "lightblue") #<<
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6, message = FALSE}
ggplot(data = gapminder,
       aes(x = lifeExp)) +
  geom_histogram(binwidth = 1,
                 color = "black", #<<
                 fill = "lightblue") #<<
```
]

---

# A Simpler Example: Histogram

.pull-left[
```{r, eval = FALSE}
ggplot(data = gapminder,
       aes(x = lifeExp)) +
  geom_histogram(binwidth = 1,
                 color = "black",
                 fill = "lightblue") +
  theme_bw(base_size = 20) #<<
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6, message = FALSE}
ggplot(data = gapminder,
       aes(x = lifeExp)) +
  geom_histogram(binwidth = 1,
                 color = "black",
                 fill = "lightblue") +
  theme_bw(base_size = 20) #<<
```
]

---

# A Simpler Example: Histogram

.pull-left[
```{r, eval = FALSE}
ggplot(data = gapminder,
       aes(x = lifeExp)) +
  geom_histogram(binwidth = 1,
                 color = "black",
                 fill = "lightblue") +
  theme_bw(base_size = 20) +
  labs(x = "Life Expectancy", #<<
       y = "Count") #<<
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6, message = FALSE}
ggplot(data = gapminder,
       aes(x = lifeExp)) +
  geom_histogram(binwidth = 1,
                 color = "black",
                 fill = "lightblue") +
  theme_bw(base_size = 20) +
  labs(x = "Life Expectancy", #<<
       y = "Count") #<<
```
]

---

# A Simpler Example: Boxplots

.pull-left[
```{r, eval = FALSE}
ggplot(data = gapminder, #<<
       aes(x = continent, y = lifeExp)) #<<
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6, message = FALSE}
ggplot(data = gapminder, #<<
       aes(x = continent, y = lifeExp)) #<<
```
]

---

# A Simpler Example: Boxplots

.pull-left[
```{r, eval = FALSE}
ggplot(data = gapminder,
       aes(x = continent, y = lifeExp)) +
  geom_boxplot() #<<
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6, message = FALSE}
ggplot(data = gapminder,
       aes(x = continent, y = lifeExp)) +
  geom_boxplot() #<<
```
]

---

# A Simpler Example: Boxplots

.pull-left[
```{r, eval = FALSE}
ggplot(data = gapminder,
       aes(x = continent, y = lifeExp)) +
  geom_boxplot(fill = "lightblue") #<<
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6, message = FALSE}
ggplot(data = gapminder,
       aes(x = continent, y = lifeExp)) +
  geom_boxplot(fill = "lightblue") #<<
```
]

---

# A Simpler Example: Boxplots

.pull-left[
```{r, eval = FALSE}
ggplot(data = gapminder,
       aes(x = continent, y = lifeExp)) +
  geom_boxplot(fill = "lightblue") +
  theme_bw(base_size = 20) #<<
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6, message = FALSE}
ggplot(data = gapminder,
       aes(x = continent, y = lifeExp)) +
  geom_boxplot(fill = "lightblue") +
  theme_bw(base_size = 20) #<<
```
]

---

# A Simpler Example: Boxplots

.pull-left[
```{r, eval = FALSE}
ggplot(data = gapminder,
       aes(x = continent, y = lifeExp)) +
  geom_boxplot(fill = "lightblue") +
  theme_bw(base_size = 20) +
  labs(title = "Life expectancy by Continent", #<<
       x = "", #<<
       y = "") #<<
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6, message = FALSE}
ggplot(data = gapminder,
       aes(x = continent, y = lifeExp)) +
  geom_boxplot(fill = "lightblue") +
  theme_bw(base_size = 20) +
  labs(title = "Life expectancy by Continent", #<<
       x = "", #<<
       y = "") #<<
```
]

---

# A Simpler Example: Boxplots

.pull-left[
```{r, eval = FALSE}
ggplot(data = gapminder,
       aes(x = continent, y = lifeExp)) +
  geom_boxplot(fill = "lightblue") +
  theme_bw(base_size = 20) +
  labs(title = "Life expectancy by Continent",
       x = "",
       y = "") +
  theme(plot.title = #<<
          element_text(hjust = 0.5)) #<<
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6, message = FALSE}
ggplot(data = gapminder,
       aes(x = continent, y = lifeExp)) +
  geom_boxplot(fill = "lightblue") +
  theme_bw(base_size = 20) +
  labs(title = "Life expectancy by Continent",
       x = "",
       y = "") +
  theme(plot.title = #<<
          element_text(hjust = 0.5)) #<<
```
]

---

# A Simpler Example: Boxplots

.pull-left[
```{r, eval = FALSE}
ggplot(data = gapminder,
       aes(x = continent, y = lifeExp)) +
  geom_boxplot(fill = "lightblue") +
  theme_bw(base_size = 20) +
  labs(title = "Life expectancy by Continent",
       x = "",
       y = "") +
  theme(plot.title =
          element_text(hjust = 0.5)) +
  ylim(c(0, 85)) #<<
```
]

.pull-right[
```{r, echo = FALSE, fig.height=4.5, fig.width=6, message = FALSE}
ggplot(data = gapminder,
       aes(x = continent, y = lifeExp)) +
  geom_boxplot(fill = "lightblue") +
  theme_bw(base_size = 20) +
  labs(title = "Life expectancy by Continent",
       x = "",
       y = "") +
  theme(plot.title =
          element_text(hjust = 0.5)) +
  ylim(c(0, 85)) #<<
```
]

---

# `ggplot2` Summary

* Axes: `xlim()`, `ylim()`.

* Legends: created by the mappings in `aes()`; edit them within `theme()` or `guides()`.

* Geoms: `geom_point()`, `geom_line()`, `geom_histogram()`, `geom_bar()`, `geom_boxplot()`, `geom_text()`, etc.

* `facet_grid()`, `facet_wrap()` for faceting.

* `labs()` for labels.

* `theme_bw()` to make things look nicer.

* Graphical parameters: `color` for color, `alpha` for opacity, `linewidth` for line thickness, `size` for point size, `shape` for shape, `fill` for interior color, etc.
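
To make the graphical-parameter bullet above concrete, here is a minimal sketch (an illustration, not part of the step-by-step example). A parameter set *outside* `aes()` is a constant applied to every point, while one mapped *inside* `aes()` varies with the data and produces a legend that `theme()` can then adjust. The object names `fixed_plot` and `mapped_plot` and the particular color, shape, and legend position are assumptions chosen for this sketch.

```{r, eval = FALSE}
library(ggplot2)
library(gapminder)

# Constant appearance: values are set outside aes(), so no legend is drawn
fixed_plot <- ggplot(gapminder, aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(color = "steelblue", alpha = 0.5, size = 2, shape = 17)

# Data-driven appearance: color is mapped inside aes(), which creates a legend
mapped_plot <- ggplot(gapminder,
                      aes(x = log(gdpPercap), y = lifeExp,
                          color = continent)) +
  geom_point(alpha = 0.5, size = 2) +
  theme_bw(base_size = 16) +
  theme(legend.position = "bottom")  # edit the legend within theme()
```

Printing `fixed_plot` or `mapped_plot` draws the corresponding figure.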
.pushdown[.center[[Here is a `ggplot2` cheat sheet!](https://rstudio.github.io/cheatsheets/html/data-visualization.html)]]

---

# Some Guidelines For Data Visualization

.pull-left[
## Don'ts

* Deceptive axes.

* Excessive/bad coloring.

* Bad variable/axis names.

* Unreadable labels.

* Overloaded with information.

* Pie charts (usually).
]

.pull-right[
## Do's

* Simple, clean graphics.

* Neat, human-readable text.

* Appropriate data range (bar charts should *always* start from 0!).

* Consistent intervals.

* Roughly 6 colors or fewer.

* Size figures appropriately.
]

---

# Which Plot Should We Use?

Consider the following questions when choosing a plot:

* What if we have one variable? Two variables?

* What if we have numeric data?

* How can we deal with categorical or nominal variables?

Let's see some examples!

---

# One Numeric Variable: Histogram

### `geom_histogram()`

```{r echo = FALSE, message=FALSE, fig.height = 7, fig.align='center'}
ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram() +
  labs(x = "Life Expectancy", y = "Count",
       title = "Distribution of Life Expectancy") +
  theme_bw(base_size = 16) +
  theme(plot.title = element_text(hjust = 0.5))
```

---

# One Numeric Variable: Boxplot

### `geom_boxplot()`

```{r echo = FALSE, message=FALSE, fig.height = 7, fig.align='center'}
ggplot(gapminder, aes(y = lifeExp)) +
  geom_boxplot() +
  labs(x = "", y = "Life Expectancy",
       title = "Distribution of Life Expectancy") +
  theme_bw(base_size = 16) +
  theme(plot.title = element_text(hjust = 0.5),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank()) +
  ylim(c(0,90))
```

Note: We can also use the more sophisticated [letter-valued plot](https://vita.had.co.nz/papers/letter-value-plot.pdf) implemented in the package `lvplot`.
---

# One Categorical Variable: Bar Chart

### `geom_bar()`

```{r echo = FALSE, message=FALSE, fig.height = 7, fig.align='center'}
ggplot(gapminder, aes(x = continent)) +
  geom_bar() +
  labs(x = "Continent", y = "Count",
       title = "Distribution of Continent") +
  theme_bw(base_size = 16) +
  theme(plot.title = element_text(hjust = 0.5))
```

---

# One Numeric and One Categorical Variable

### `geom_boxplot()`

Here, we have multiple observations for each category.

```{r echo = FALSE, message=FALSE, fig.height = 6, fig.align='center'}
ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  labs(x = "Continent", y = "Life Expectancy",
       title = "Life Expectancy by Continent") +
  theme_bw(base_size = 16) +
  theme(plot.title = element_text(hjust = 0.5))
```

---

# One Numeric and One Categorical Variable

### `geom_bar()` (with argument `stat = "identity"`)

Here, we have only one observation per category.

```{r echo = FALSE, message=FALSE, fig.height = 5.5, fig.align='center'}
mean_lifeExp <- gapminder %>%
  group_by(continent) %>%
  summarise(mean_lifeExp = mean(lifeExp))

ggplot(mean_lifeExp, aes(x = continent, y = mean_lifeExp)) +
  geom_bar(stat = "identity") +
  labs(x = "Continent", y = "Mean Life Expectancy",
       title = "Mean Life Expectancy by Continent") +
  theme_bw(base_size = 16) +
  theme(plot.title = element_text(hjust = 0.5))
```

---

# Two Numeric Variables: Scatterplot

### `geom_point()`

```{r echo = FALSE, message=FALSE, fig.height = 7, fig.align='center'}
ggplot(gapminder, aes(y = lifeExp, x = log(gdpPercap))) +
  geom_point() +
  labs(y = "Life expectancy", x = "Log GDP per capita",
       title = "Life Expectancy by Log GDP per capita") +
  ylim(c(0,90)) +
  theme_bw(base_size = 16) +
  theme(plot.title = element_text(hjust = 0.5))
```

---

# Two Numeric Variables, One Time-based

### `geom_line()`

Note: When making a line plot, we should use both `geom_point()` and `geom_line()`!
```{r echo = FALSE, message=FALSE, fig.height = 6, fig.align='center'}
time_maxlifeExp <- gapminder %>%
  group_by(year) %>%
  summarise(maxlifeExp = max(lifeExp))

ggplot(time_maxlifeExp, aes(x = year, y = maxlifeExp)) +
  geom_line() +
  geom_point() +
  labs(y = "Maximum life expectancy", x = "Year",
       title = "Maximum Life Expectancy by Year") +
  ylim(c(0,85)) +
  theme_bw(base_size = 16) +
  theme(plot.title = element_text(hjust = 0.5))
```

---

# Two Categorical Variables

### `geom_bar()` setting `x` and `fill` within `aes()`

This is an example of a bad visualization!

```{r echo = FALSE, message=FALSE, fig.height = 5.5, fig.align='center'}
# Flag observations with life expectancy above 60
gapminder <- gapminder %>%
  mutate(lifeExpCat = lifeExp > 60)

ggplot(gapminder, aes(x = continent, fill = lifeExpCat)) +
  geom_bar() +
  xlab("Continent") +
  ylab("Count") +
  labs(fill = "Life Exp. > 60") +
  ggtitle("Distribution of Continent and Life Expectancy") +
  theme_bw(base_size = 16) +
  theme(plot.title = element_text(hjust = 0.5))
```

---

# Two Categorical Variables

### `geom_bar()` setting `x` and `fill` within `aes()`

This one looks better after specifying `position = position_dodge()` in `geom_bar()`.

```{r echo = FALSE, message=FALSE, fig.height = 5.5, fig.align='center'}
# Flag observations with life expectancy above 60
gapminder <- gapminder %>%
  mutate(lifeExpCat = lifeExp > 60)

ggplot(gapminder, aes(x = continent, fill = lifeExpCat)) +
  geom_bar(position = position_dodge()) + # Make side-by-side bar plots
  xlab("Continent") +
  ylab("Count") +
  labs(fill = "Life Exp. > 60") +
  ggtitle("Distribution of Continent and Life Expectancy") +
  theme_bw(base_size = 16) +
  theme(plot.title = element_text(hjust = 0.5))
```

Note: Never stack the bars unless it is necessary.

---

# Three Variables

* What if we have two numeric variables and one categorical variable?

--

* Scatterplot or line plot colored by category.

* Scatterplot or line plot faceted by category.

Note: Recall our example in the step-by-step practice with `ggplot2`.
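
As a rough sketch of the two options above (separate from the step-by-step example), the code below plots life expectancy against log GDP per capita, once colored by continent and once faceted by continent. Restricting to the year 2007 and the object names `gap_2007`, `colored_plot`, and `faceted_plot` are assumptions made only for illustration.

```{r, eval = FALSE}
library(ggplot2)
library(dplyr)
library(gapminder)

# Assumption: keep a single year so that each country appears exactly once
gap_2007 <- gapminder %>%
  filter(year == 2007)

# Option 1: scatterplot colored by the categorical variable
colored_plot <- ggplot(gap_2007,
                       aes(x = log(gdpPercap), y = lifeExp,
                           color = continent)) +
  geom_point() +
  labs(x = "Log GDP per capita", y = "Life Expectancy",
       color = "Continent") +
  theme_bw(base_size = 16)

# Option 2: scatterplot faceted by the categorical variable
faceted_plot <- ggplot(gap_2007,
                       aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point() +
  facet_wrap(~ continent) +
  labs(x = "Log GDP per capita", y = "Life Expectancy") +
  theme_bw(base_size = 16)
```

Coloring keeps all groups on a common set of axes, which makes direct comparison easier; faceting gives each group its own panel, which helps when the points overlap heavily.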
- More details about the choice of plots and other data visualization concepts can be found in [these notes](https://github.com/zhangyk8/zhangyk8.github.io/raw/master/_teaching/file_stat302/Chap3.pdf). Please spend some time reading them!

---

# Summary

- R packages provide us with numerous handy functions that have been written by other R developers.

- The `tidyverse` is a collection of packages for common data science tasks.

- Pipes `%>%` allow us to string together commands to get a flow of results.

- `dplyr` is a package for data wrangling with several key verbs (or functions).

- `tidyr` is a package for manipulating data frames in R.

- Base R has a set of powerful plotting tools that help us quickly visualize our data.

- `ggplot2` is a package for creating more sophisticated plots.

Submit Lab 4 on Gradescope by the end of Tuesday (February 13)!! Start early!!