--- title: "Data Transformation With dplyr" author: "Dr. Hua Zhou" date: "Jan 30, 2018" subtitle: Biostat M280 output: ioslides_presentation: smaller: true --- ```{r setup, include=FALSE} knitr::opts_chunk$set(fig.width = 5, fig.height = 3.5, fig.align = 'center', cache = TRUE) ``` ## A typical data science project:

## nycflights13 data - Available from the nycflights13 package. - 336,776 flights that departed from New York City in 2013: ```{r} library("tidyverse") ``` ```{r} library("nycflights13") flights ``` ## dplyr basics * Pick observations by their values: `filter()`. * Reorder the rows: `arrange()`. * Pick variables by their names: `select()`. * Create new variables with functions of existing variables: `mutate()`. * Collapse many values down to a single summary: `summarise()`. # Manipulate rows (cases) ## Filter rows with `filter()` - Flights on Jan 1st: ```{r} # same as filter(flights, month == 1 & day == 1) filter(flights, month == 1, day == 1) ``` ---- - Flights in Nov or Dec: ```{r} filter(flights, month == 11 | month == 12) ``` ## Remove rows with duplicate values - One row from each month: ```{r} distinct(flights, month, .keep_all = TRUE) ``` ## Sample rows - Randomly select `n` rows: ```{r} sample_n(flights, 10, replace = TRUE) ``` ---- - Randomly select fraction of rows: ```{r} sample_frac(flights, 0.1, replace = TRUE) ``` ## Select rows by position - Select rows by position: ```{r} slice(flights, 1:5) ``` ---- - Top `n` rows: ```{r} top_n(flights, 5) ``` ## Arrange rows with `arrange()` - Sort in ascending order: ```{r} arrange(flights, year, month, day) ``` ---- - Sort in descending order: ```{r} arrange(flights, desc(arr_delay)) ``` # Manipulate columns (variables) ## Select columns with `select()` - Select columns by variable name: ```{r} select(flights, year, month, day) ``` ---- - Select columns between two variables: ```{r} select(flights, year:day) ``` ---- - Select columns _except_ those between two variables: ```{r, } select(flights, -(year:day)) ``` ---- - Select columns by positions: ```{r} select(flights, seq(1, 10, by = 2)) ``` ---- - Move variables to the start of data frame: ```{r, } select(flights, time_hour, air_time, everything()) ``` ---- - Helper functions. * `starts_with("abc")`: matches names that begin with “abc”. * `ends_with("xyz")`: matches names that end with “xyz”. * `contains("ijk")`: matches names that contain “ijk”. * `matches("(.)\\1")`: selects variables that match a regular expression. * `num_range("x", 1:3)`: matches x1, x2 and x3. * `one_of()` ## Add new variables with `mutate()` - Add variables `gain` and `speed`: ```{r, results = 'hide'} flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time) ``` ```{r} mutate(flights_sml, gain = arr_delay - dep_delay, speed = distance / air_time * 60 ) ``` ---- - Refer to columns that you’ve just created: ```{r} mutate(flights_sml, gain = arr_delay - dep_delay, hours = air_time / 60, gain_per_hour = gain / hours ) ``` ---- - Only keep the new variables by `transmute()`: ```{r} transmute(flights, gain = arr_delay - dep_delay, hours = air_time / 60, gain_per_hour = gain / hours ) ``` ---- - `mutate_all()`: apply funs to all columns. ```{r, eval = FALSE} mutate_all(data, funs(log(.), log2(.))) ``` - `mutate_at()`: apply funs to specifc columns. ```{r, eval = FALSE} mutate_at(data, vars(-Species), funs(log(.))) ``` - `mutate_if()`: apply funs of one type ```{r, eval = FALSE} mutate_if(data, is.numeric, funs(log(.))) ``` # Summaries ## Summaries with `summarise()` - Mean of a variable: ```{r} summarise(flights, delay = mean(dep_delay, na.rm = TRUE)) ``` ---- - Convert a tibble into a grouped tibble: ```{r} (by_day <- group_by(flights, year, month, day)) ``` ---- - Grouped summaries: ```{r} summarise(by_day, delay = mean(dep_delay, na.rm = TRUE)) ``` ## Pipe - Consider following analysis: ```{r, message = FALSE} by_dest <- group_by(flights, dest) delay <- summarise(by_dest, count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) (delay <- filter(delay, count > 20, dest != "HNL")) ``` ---- - Cleaner code using pipe `%>%`: ```{r} (delays <- flights %>% group_by(dest) %>% summarise( count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) %>% filter(count > 20, dest != "HNL")) ``` ---- - Unfortunately ggplot2 does **not** use pipe. ```{r} ggplot(data = delays, mapping = aes(x = dist, y = delay)) + geom_point(aes(size = count), alpha = 1/3) + geom_smooth(se = FALSE) ``` ## Other summary functions - Location: `mean(x)`, `median(x)`. ```{r, result = 'hold'} not_cancelled <- flights %>% filter(!is.na(dep_delay), !is.na(arr_delay)) ``` ```{r} not_cancelled %>% group_by(year, month, day) %>% summarise( avg_delay1 = mean(arr_delay), avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay ) ``` ---- - Spread: `sd(x)`, `IQR(x)`, `mad(x)`. ```{r} not_cancelled %>% group_by(dest) %>% summarise(distance_sd = sd(distance)) %>% arrange(desc(distance_sd)) ``` ---- - Rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`. ```{r} # When do the first and last flights leave each day? not_cancelled %>% group_by(year, month, day) %>% summarise( first = min(dep_time), last = max(dep_time) ) ``` ---- - Position: `first(x)`, `nth(x, 2)`, `last(x)`. ```{r} not_cancelled %>% group_by(year, month, day) %>% summarise( first_dep = first(dep_time), last_dep = last(dep_time) ) ``` ---- - Count: `n(x)`, `sum(!is.na(x))`, `n_distinct(x)`. ```{r} # Which destinations have the most carriers? not_cancelled %>% group_by(dest) %>% summarise(carriers = n_distinct(carrier)) %>% arrange(desc(carriers)) ``` ---- - ```{r} not_cancelled %>% count(dest) ``` ---- - ```{r} not_cancelled %>% count(tailnum, wt = distance) ``` ---- - ```{r} # How many flights left before 5am? (these usually indicate delayed # flights from the previous day) not_cancelled %>% group_by(year, month, day) %>% summarise(n_early = sum(dep_time < 500)) ``` ---- - ```{r} # What proportion of flights are delayed by more than an hour? not_cancelled %>% group_by(year, month, day) %>% summarise(hour_perc = mean(arr_delay > 60)) ``` ## Grouped mutates (and filters) - Find the worst members of each group: ```{r} flights_sml %>% group_by(year, month, day) %>% filter(rank(desc(arr_delay)) < 10) ``` ---- - Find all groups bigger than a threshold: ```{r} (popular_dests <- flights %>% group_by(dest) %>% filter(n() > 365)) ``` ---- - Standardise to compute per group metrics: ```{r} popular_dests %>% filter(arr_delay > 0) %>% mutate(prop_delay = arr_delay / sum(arr_delay)) %>% select(year:day, dest, arr_delay, prop_delay) ``` # Combine tables ## nycflights13 has >1 tables: - We already know a lot about flights: ```{r} flights ``` ---- - airlines: ```{r} airlines ``` ---- - airports: ```{r} airports ``` ---- - planes: ```{r} planes ``` ---- - Weather: ```{r} weather ``` ## Relational data

## Keys - A **primary key** uniquely identifies an observation in its own table. - A **foreign key** uniquely identifies an observation in another table. # Combine variables (columns) ## Demo tables - ```{r} x <- tribble( ~key, ~val_x, 1, "x1", 2, "x2", 3, "x3" ) y <- tribble( ~key, ~val_y, 1, "y1", 2, "y2", 4, "y3" ) ``` ## Inner join - An **inner join** matches pairs of observations whenever their keys are equal:

- ```{r} inner_join(x, y, by = "key") ``` Same as ```{r, eval = FALSE} x %>% inner_join(y, by = "key") ``` ## Outer join - An **outer join** keeps observations that appear in at least one of the tables. - Three types of outer joins: - A **left join** keeps all observations in `x`. - A **right join** keeps all observations in `y`. - A **full join** keeps all observations in `x` or `y`.

## Duplicate keys - One table has duplicate keys.

---- - ```{r} x <- tribble( ~key, ~val_x, 1, "x1", 2, "x2", 2, "x3", 1, "x4" ) y <- tribble( ~key, ~val_y, 1, "y1", 2, "y2" ) left_join(x, y, by = "key") ``` ---- - Both tables have duplicate keys. You get all possible combinations, the Cartesian product:

---- - ```{r} x <- tribble( ~key, ~val_x, 1, "x1", 2, "x2", 2, "x3", 3, "x4" ) y <- tribble( ~key, ~val_y, 1, "y1", 2, "y2", 2, "y3", 3, "y4" ) left_join(x, y, by = "key") ``` ---- - Let's create a narrower table from the flights data: ```{r} flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier) flights2 ``` ## Defining the key columns - `by = NULL` (default): use all variables that appear in both tables: ```{r} # same as: flights2 %>% left_join(weather) left_join(flights2, weather) ``` ---- - `by = "x"`: use the common variable `x`: ```{r} # same as: flights2 %>% left_join(weather) left_join(flights2, planes, by = "tailnum") ``` ---- - `by = c("a" = "b")`: match variable `a` in table `x` to the variable `b` in table `y`. ```{r} # same as: flights2 %>% left_join(weather) left_join(flights2, airports, by = c("dest" = "faa")) ``` # Combine cases (rows) ---- - Top 10 most popular destinations: ```{r} top_dest <- flights %>% count(dest, sort = TRUE) %>% head(10) top_dest ``` - How to filter the cases that fly to these destinations? ## Semi-join - `semi_join(x, y)` keesp the rows in `x` that have a match in `y`.

- Useful to see what will be joined. ---- - ```{r} semi_join(flights, top_dest) ``` ## Anti-join - `anti_join(x, y)` keeps the rows that don’t have a match.

- Useful to see what will not be joined. ---- - ```{r} flights %>% anti_join(planes, by = "tailnum") %>% count(tailnum, sort = TRUE) ``` ## Set operations - Generate two tables: ```{r} df1 <- tribble( ~x, ~y, 1, 1, 2, 1 ) df2 <- tribble( ~x, ~y, 1, 1, 1, 2 ) ``` ---- - `bind_rows(x, y)` stacks table `x` one on top of `y`. ```{r} bind_rows(df1, df2) ``` - `intersect(x, y)` returns rows that appear in both `x` and `y`. ```{r} intersect(df1, df2) ``` ---- - `union(x, y)` returns unique observations in `x` and `y`. ```{r} union(df1, df2) ``` ---- - `setdiff(x, y)` returns rows that appear in `x` but not in `y`. ```{r} setdiff(df1, df2) ``` ```{r} setdiff(df2, df1) ```