--- title: "Class 3: Data wrangling with the tidyverse" author: - name: "Kent Riemondy" url: https://github.com/kriemo affiliation: "RNA Bioscience Initiative" affiliation_url: https://medschool.cuanschutz.edu/rbi orcid_id: 0000-0003-0750-1273 output: distill::distill_article: self_contained: false toc: true --- ```{r setup, include = FALSE} knitr::opts_chunk$set( R.options = list(width = 80) ) ``` *The Rmarkdown for this class is [on github]( https://raw.githubusercontent.com/rnabioco/bmsc-7810-pbda/main/_posts/2023-11-30-class-3-data-wrangling-with-the-tidyverse/class-3.Rmd)* ## Introduction to the tidyverse The [tidyverse](https://www.tidyverse.org/) is a collection of packages that share similar design philosophy, syntax, and data structures. The packages are largely developed by the same team that builds Rstudio. Some key packages that we will touch on in this course: `readr`: functions for data import and export `ggplot2`: plotting based on the "grammar of graphics" `dplyr`: functions to manipulate tabular data `tidyr`: functions to help reshape data into a tidy format `stringr`: functions for working with strings `tibble`: a redesigned data.frame ## loading R packages To use an R package in an analysis we need to load the package using the `library()` function. This needs to be done once in each R session and it is a good idea to do this at the beginning of your Rmarkdown. For teaching purposes I will however sometimes load a package when I introduce a function from a package. ```{r} library(readr) library(dplyr) library(tibble) ``` ## tibble versus data.frame A `tibble` is a re-imagining of the base R `data.frame`. It has a few differences from the `data.frame`.The biggest differences are that it doesn't have `row.names` and it has an enhanced `print` method. If interested in learning more, see the tibble [vignette](https://tibble.tidyverse.org/articles/tibble.html). Compare `data_df` to `data_tbl`. ```{r, eval = FALSE} data_df <- data.frame(a = 1:3, b = letters[1:3], c = c(TRUE, FALSE, TRUE), row.names = c("ob_1", "ob_2", "ob_3")) data_df data_tbl <- as_tibble(data_df) data_tbl ``` When you work with tidyverse functions it is a good practice to convert data.frames to tibbles. In practice many functions will work interchangeably with either base data.frames or tibble, provided that they don't use row names. ## Converting a base R data.frame to a tibble If a data.frame has row names, you can preserve these by moving them into a column before converting to a tibble using the `rownames_to_column()` from `tibble`. ```{r} head(mtcars) ``` ```{r} mtcars_tbl <- rownames_to_column(mtcars, "vehicle") mtcars_tbl <- as_tibble(mtcars_tbl) mtcars_tbl ``` If you don't need the rownames, then you can use the `as_tibble()` function directly. ```{r} mtcars_tbl <- as_tibble(mtcars) ``` ## Data import So far we have only worked with built in or hand generated datasets, now we will discuss how to read data files into R. The [`readr`](https://readr.tidyverse.org/) package provides a series of functions for importing or writing data in common text formats. `read_csv()`: comma-separated values (CSV) files `read_tsv()`: tab-separated values (TSV) files `read_delim()`: delimited files (CSV and TSV are important special cases) `read_fwf()`: fixed-width files `read_table()`: whitespace-separated files These functions are quicker and have better defaults than the base R equivalents (e.g. `read.table` or `read.csv`). These functions also directly output tibbles rather than base R data.drames The [readr checksheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-import.pdf) provides a concise overview of the functionality in the package. To illustrate how to use readr we will load a `.csv` file containing information about airline flights from 2014. First we will download the data files. You can download this data manually from [github](https://github.com/arunsrinivasan/flights). However we will use R to download the dataset using the `download.file()` base R function. ```{r} # test if file exists, if it doesn't then download the file. if(!file.exists("flights14.csv")) { file_url <- "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv" download.file(file_url, "flights14.csv") } ``` You should now have a file called "flights14.csv" in your working directory (the same directory as the Rmarkdown). To read this data into R, we can use the `read_csv()` function. The defaults for this function often work for many datasets. ```{r} flights <- read_csv("flights14.csv") flights ``` There are a few commonly used arguments: `col_names`: if the data doesn't have column names, you can provide them (or skip them). `col_types`: set this if the data type of a column is incorrectly inferred by readr `comment`: if there are comment lines in the file, such as a header line prefixed with `#`, you want to skip, set this to `#`. `skip`: # of lines to skip before reading in the data. `n_max`: maximum number of lines to read, useful for testing reading in large datasets. The readr functions will also automatically uncompress gzipped or zipped datasets, and additionally can read data directly from a URL. ```r read_csv("https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv") ``` There are equivalent functions for writing data.frames from R to files: `write_csv`, `write_tsv`, `write_delim`. ## Data import/export for excel files The `readxl` package can read data from excel files and is included in the tidyverse. The `read_excel()` function is the main function for reading data. The `openxlsx` package, which is not part of tidyverse but is on [CRAN](https://ycphs.github.io/openxlsx/index.html), can write excel files. The `write.xlsx()` function is the main function for writing data to excel spreadsheets. ## Data import/export of R objects Often it is useful to store R objects as files on disk so that the R objects can be reloaded into R. These could be large processed datasets, intermediate results, or complex data structures that are not easily stored in rectangular text formats such as csv files. R provides the `saveRDS()` and `readRDS()` functions for storing and retrieving data in binary formats. ```{r} saveRDS(flights, "flights.rds") # save single object into a file df <- readRDS("flights.rds") # read object back into R df ``` If you want to save/load multiple objects you can use `save()` and `load()`. ```{r} save(flights, df, file = "robjs.rda") # save flight_df and df ``` `load()` will load the data into the environment with the same objects names used when saving the objects. ```{r} rm(flights, df) load("robjs.rda") ``` ## Exploring data `View()` can be used to open an excel like view of a data.frame. This is a good way to quickly look at the data. `glimpse()` or `str()` give an additional view of the data. ```r View(flights) str(flights) glimpse(flights) ``` Additional R functions to help with exploring data.frames (and tibbles): ```{r, eval = FALSE} dim(flights) # of rows and columns nrow(flights) ncol(flights) head(flights) # first 6 lines tail(flights) # last 6 lines colnames(flights) # column names rownames(flights) # row names (not present in tibble) ``` Useful base R functions for exploring values ```{r, eval = FALSE} summary(flights$distance) # get summary stats on column unique(flights$carrier) # find unique values in column cyl table(flights$carrier) # get frequency of each value in column cyl table(flights$origin, flights$dest) # get frequency of each combination of values ``` ## dplyr, a grammar for data manipulation ### Base R versus dplyr In the first two lectures we introduced how to subset vectors, data.frames, and matrices using base R functions. These approaches are flexible, succinct, and stable, meaning that these approaches will be supported and work in R in the future. Some criticisms of using base R are that the syntax is hard to read, it tends to be verbose, and it is difficult to learn. dplyr, and other tidyverse packages, offer alternative approaches which many find easier to use. Some key differences between base R and the approaches in dplyr (and tidyverse) * Use of the tibble version of data.frame * dplyr functions operate on data.frame/tibbles rather than individual vectors * dplyr allows you to specify column names without quotes * dplyr uses different functions (verbs) to accomplish the various tasks performed by the bracket `[` base R syntax * dplyr and related functions recognized "grouped" operations on data.frames, enabling operations on different groups of rows in a data.frame ### dplyr function overview `dplyr` provides a suite of functions for manipulating data in tibbles. Operations on Rows: - `filter()` chooses rows based on column values - `arrange()` changes the order of the rows - `distinct()` selects distinct/unique rows - `slice()` chooses rows based on location Operations on Columns: - `select()` changes whether or not a column is included - `rename()` changes the name of columns - `mutate()` changes the values of columns and creates new columns Operations on groups of rows: - `summarise()` collapses a group into a single row ### Filter rows Returning to our `flights` data. Let's use `filter()` to select certain rows. `filter(tibble, , ...)` ```{r} filter(flights, dest == "LAX") # select rows where the `dest` column is equal to `LAX ``` ```{r, eval = FALSE} filter(flights, arr_delay > 200) # flights with arr_delay > 200 filter(flights, distance < 100) # flights less than 100 miles filter(flights, year != 2014) # if no rows satisfy condition, then an empty tibble ``` Multiple conditions can be used to select rows. For example we can select rows where the `dest` column is equal to `LAX` and the `origin` is equal to `EWR`. You can either use the `&` operator, or supply multiple arguments. ```{r, eval = FALSE} filter(flights, dest == "LAX", origin == "EWR") filter(flights, dest == "LAX" & origin == "EWR") ``` We can select rows where the `dest` column is equal to `LAX` **or** the `origin` is equal to `EWR` using the `|` operator. ```{r, eval = FALSE} filter(flights, dest == "LAX" | origin == "EWR") ``` The `%in%` operator is useful for identifying rows with entries matching those in a vector of possibilities. ```{r, eval = FALSE} filter(flights, dest %in% c("LAX", "SLC", "SFO")) filter(flights, !dest %in% c("LAX", "SLC", "SFO")) # ! will negate ``` Try it out: - Use filter to find flights to DEN with a delayed departure (`dep_delay`). ```{r, eval = FALSE} ... ``` ### arrange rows `arrange()` can be used to sort the data based on values in a single column or multiple columns `arrange(tibble, )` For example, let's find the flight with the shortest amount of air time by arranging the table based on the `air_time` (flight time in minutes). ```{r} ``` ```{r, eval = FALSE} arrange(flights, air_time, distance) # sort first on air_time, then on distance # to sort in decreasing order, wrap the column name in `desc()`. arrange(flights, desc(air_time), distance) ``` Try it out: - Use arrange to determine which flight has the shortest distance? ```{r} ``` ## Column operations ### select columns `select()` is a simple function that subsets the tibble to keep certain columns. `select(tibble, )` ```{r} select(flights, origin, dest) ``` the `:` operator can select a range of columns, such as the columns from `air_time` to `hour`. The `!` operator selects columns not listed. ```{r, eval = FALSE} select(flights, air_time:hour) select(flights, !(air_time:hour)) ``` There is a suite of utilities in the tidyverse to help with select columns with names that: `matches()`, `starts_with()`, `ends_with()`, `contains()`, `any_of()`, and `all_of()`. `everything()` is also useful as a placeholder for all columns not explicitly listed. See help ?select ```{r, eval = FALSE} # keep columns that have "delay" in the name select(flights, contains("delay")) # select all columns except carrier select(flights, -carrier) # reorder columns so that distance and hour are first columns select(flights, starts_with("di"), ends_with("ay")) ``` ## When to quote or not quote? In general, when working with the tidyverse, you don't need to quote the names of columns. In the example above, we needed quotes because "delay" is not a column name in the flights tibble. ## Adding new columns with mutate `mutate()` allows you to add new columns to the tibble. `mutate(tibble, new_column_name = expression, ...)` ```{r} mutate(flights, total_delay = dep_delay + arr_delay) ``` We can't see the new column, so we add a select command to examine the columns of interest. ```{r} mutate(flights, total_delay = dep_delay + arr_delay) |> select(dep_delay, arr_delay, total_delay) ``` Multiple new columns can be made, and you can refer to columns made in preceding statements. ```{r, eval = FALSE} mutate(flights, delay = dep_delay + arr_delay, delay_in_hours = delay / 60) |> select(delay, delay_in_hours) ``` Try it out: - Calculate the flight time (`air_time`) in hours rather than in minutes, add as a new column. ```{r} mutate(flights, flight_time = air_time / 60) ``` ## Summarizing columns `summarize()` is a function that will collapse the data from a column into a summary value based on a function that takes a vector and returns a single value (e.g. mean(), sum(), median()). It is not very useful yet, but will be very powerful when we discuss grouped operations. ```{r} summarize(flights, avg_arr_delay = mean(arr_delay), med_air_time = median(air_time)) ``` ## Grouped operations All of the functionality described above can be easily expressed in base R syntax (see examples [here](https://dplyr.tidyverse.org/articles/base.html)). However, where dplyr really shines is the ability to apply the functions above to groups of data within each data frame. We can establish groups within the data using `group_by()`. The functions `mutate()`, `summarize()`, and optionally `arrange()` will instead operate on each group independently rather than all of the rows. Common approaches: group_by -> summarize: calculate summaries per group group_by -> mutate: calculate summaries per group and add as new column to original tibble `group_by(tibble, )` ```{r, eval = FALSE} group_by(flights, carrier) # notice the new "Groups:" metadata. # calculate average dep_delay per carrier group_by(flights, carrier) |> summarize(avg_dep_delay = mean(dep_delay)) # calculate average arr_delay per carrier at each airport group_by(flights, carrier, origin) |> summarize(avg_dep_delay = mean(dep_delay)) # calculate # of flights between each origin and destination city, per carrier, and average air time. # n() is a special function that returns the # of rows per group group_by(flights, carrier, origin, dest) |> summarize(n_flights = n(), mean_air_time = mean(air_time)) ``` Here are some questions that we can answer using grouped operations in a few lines of dplyr code. - What is the average flight `air_time` between each origin airport and destination airport? ```{r} group_by(flights, origin, dest) |> summarize(avg_air_time = mean(air_time)) ``` - Which cites take the longest (`air_time`) to fly between between on average? the shortest? ```{r} group_by(flights, origin, dest) |> summarize(avg_air_time = mean(air_time)) |> arrange(desc(avg_air_time)) |> head(1) group_by(flights, origin, dest) |> summarize(avg_air_time = mean(air_time)) |> arrange(avg_air_time) |> head(1) ``` Try it out: - Which carrier has the fastest flight (`air_time`) on average from JFK to LAX? ```{r, echo = FALSE} ``` - Which month has the longest departure delays on average when flying from JFK to HNL? ```{r, echo = FALSE} ``` ## String manipulation `stringr` is a package for working with strings (i.e. character vectors). It provides a consistent syntax for string manipulation and can perform many routine tasks: `str_c`: concatenate strings (similar to `paste()` in base R) `str_count`: count occurrence of a substring in a string `str_subset`: keep strings with a substring `str_replace`: replace a string with another string `str_split`: split a string into multiple pieces based on a string ```{r, eval = FALSE} library(stringr) some_words <- c("a sentence", "with a ", "needle in a", "haystack") str_detect(some_words, "needle") # use with dplyr::filter str_subset(some_words, "needle") str_replace(some_words, "needle", "pumpkin") str_replace_all(some_words, "a", "A") str_c(some_words, collapse = " ") str_c(some_words, " words words words", " anisfhlsdihg") str_count(some_words, "a") str_split(some_words, " ") ``` stringr uses [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) to pattern match strings. This means that you can perform complex matching to the strings of interest. Additionally this means that there are special characters with behaviors that may be surprising if you are unaware of regular expressions. A useful resource when using regular expressions is https://regex101.com ```{r, eval = FALSE} complex_strings <- c("10101-howdy", "34-world", "howdy-1010", "world-.") # keep words with a series of #s followed by a dash, + indicates one or more occurrences. str_subset(complex_strings, "[0-9]+-") # keep words with a dash followed by a series of #s str_subset(complex_strings, "-[0-9]+") str_subset(complex_strings, "^howdy") # keep words starting with howdy str_subset(complex_strings, "howdy$") # keep words ending with howdy str_subset(complex_strings, ".") # . signifies any character str_subset(complex_strings, "\\.") # need to use backticks to match literal special character ``` Let's use dplyr and stringr together. Which destinations contain an "LL" in their 3 letter code? ```{r} library(stringr) filter(flights, str_detect(dest, "LL")) |> select(dest) |> unique() ``` Which 3-letter destination codes start with H? ```{r} filter(flights, str_detect(dest, "^H")) |> select(dest) |> unique() ``` Let's make a new column that combines the `origin` and `dest` columns. ```{r} mutate(flights, new_col = str_c(origin, ":", dest)) |> select(new_col, everything()) ```
Show session info ```{r code} sessionInfo() ```
## Acknowledgements and additional references {.appendix} The content of this class borrows heavily from previous tutorials: R code style guide: http://adv-r.had.co.nz/Style.html Tutorial organization: https://github.com/sjaganna/molb7910-2019 Other R tutorials: https://github.com/matloff/fasteR https://r4ds.had.co.nz/index.html https://bookdown.org/rdpeng/rprogdatascience/