---
title: "Data Transformation With dplyr"
author: "Dr. Hua Zhou @ UCLA"
date: "Jan 25, 2022"
subtitle: Biostat 203B
output:
# ioslides_presentation: default
html_document:
toc: true
toc_depth: 4
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.align = 'center', cache = TRUE)
```
A typical data science project:
## nycflights13 data
- Available from the nycflights13 package.
- 336,776 flights that departed from New York City in 2013:
```{r}
library("tidyverse")
```
```{r}
library("nycflights13")
flights
```
- To display more rows:
```{r}
flights %>% print(n = 20)
```
Note `%>%` is the pipe in tidyverse. Above command is equivalent to `print(flights, n = 20)`.
- To display all rows:
```{r, eval = FALSE}
flights %>% print(n = nrow(.))
```
Do **not** try this on the `flights` data, which has too many rows.
- To display more columns (variables):
```{r}
flights %>% print(width = Inf)
```
The `width` argument controls the screen width.
## dplyr basics
* Pick observations (rows) by their values: `filter()`.
* Reorder the rows: `arrange()`.
* Pick variables (columns) by their names: `select()`.
* Create new variables with functions of existing variables: `mutate()`.
* Collapse many values down to a single summary: `summarise()`.
# Manipulate rows (cases)
## Filter rows with `filter()`
- Flights on Jan 1st:
```{r}
# same as filter(flights, month == 1 & day == 1)
filter(flights, month == 1, day == 1)
```
----
- Flights in Nov or Dec:
```{r}
filter(flights, month == 11 | month == 12)
```
## Remove rows with duplicate values
- One row from each month:
```{r}
distinct(flights, month, .keep_all = TRUE)
```
- With `.keep_all = FALSE`, only distinct values of the variable are selected:
```{r}
distinct(flights, month)
```
## Sample rows
- Randomly select `n` rows:
```{r}
sample_n(flights, 10, replace = TRUE)
```
----
- Randomly select fraction of rows:
```{r}
sample_frac(flights, 0.1, replace = TRUE)
```
## Select rows by position
- Select rows by position:
```{r}
slice(flights, 1:5)
```
- First rows:
```{r}
slice_head(flights, n = 5)
```
- Last rows:
```{r}
slice_tail(flights, n = 5)
```
----
- Top `n` rows with the highest values:
```{r}
# deprecated: top_n(flights, 5, wt = time_hour)
# This function is quick
slice_max(flights, n = 5, order_by = time_hour)
```
- Bottom `n` rows with lowest values:
```{r, eval=F}
# deprecated: top_n(flights, -5, wt = time_hour)
# Why it takes REALLY long???
# slice_max(flights, n = 5, order_by = desc(time_hour)) # is fast
slice_min(flights, n = 5, order_by = time_hour)
```
----
- `slice_*` verbs apply to groups for grouped tibbles.
## Arrange rows with `arrange()`
- Sort in ascending order:
```{r}
arrange(flights, year, month, day)
```
----
- Sort in descending order:
```{r}
arrange(flights, desc(arr_delay)) %>%
print(width = Inf)
```
----
- By default, `arrange` ignores grouping in grouped tibbles. Set `.by_group = TRUE` to arrange within each group.
# Manipulate columns (variables)
## Select columns with `select()`
- Select columns by variable names:
```{r}
select(flights, year, month, day)
```
- Pull values of _one_ column as a vector:
```{r, eval = FALSE}
pull(flights, year)
```
----
- Select columns between two variables:
```{r}
select(flights, year:day)
```
----
- Select columns _except_ those between two variables:
```{r, }
select(flights, -(year:day))
```
----
- Select columns by positions:
```{r}
select(flights, seq(1, 10, by = 2))
```
----
- Move variables to the start of data frame:
```{r}
select(flights, time_hour, air_time, everything())
```
----
- Helper functions.
* `starts_with("abc")`: matches names that begin with “abc”.
* `ends_with("xyz")`: matches names that end with “xyz”.
* `contains("ijk")`: matches names that contain “ijk”.
* `matches("(.)\\1")`: selects variables that match a regular expression.
* `num_range("x", 1:3)`: matches x1, x2 and x3.
* `one_of()`
## Add new variables with `mutate()`
- Add variables `gain` and `speed`:
```{r}
flights_sml <-
select(flights, year:day, ends_with("delay"), distance, air_time)
flights_sml
```
```{r}
mutate(flights_sml,
gain = arr_delay - dep_delay,
speed = distance / air_time * 60
)
```
----
- Refer to columns that you’ve just created:
```{r}
mutate(flights_sml,
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
```
----
- Only keep the new variables by `transmute()`:
```{r}
transmute(flights,
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
```
----
- `mutate_all()`: apply funs to all columns.
```{r, eval = FALSE}
mutate_all(data, funs(log(.), log2(.)))
```
- `mutate_at()`: apply funs to specific columns.
```{r, eval = FALSE}
mutate_at(data, vars(-Species), funs(log(.)))
```
- `mutate_if()`: apply funs of one type
```{r, eval = FALSE}
mutate_if(data, is.numeric, funs(log(.)))
```
# Summaries
## Summaries with `summarise()`
- Mean of a variable:
```{r}
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
```
----
- Convert a tibble into a grouped tibble:
```{r}
(by_day <- group_by(flights, year, month, day))
```
----
- Grouped summaries:
```{r}
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
```
## Pipe
- Consider following analysis (find destinations excluding `HNL` that have >20 flights, and calculate the average distances and arrival delay):
```{r, message = FALSE}
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest, count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")
delay
```
----
- Cleaner code using pipe `%>%`:
```{r}
delays <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL")
delays
```
----
- ggplot2 accepts pipe too.
```{r}
delays %>%
ggplot(mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE) +
labs(x = "Distance from NYC (miles)",
y = "Arrival delay (mins)")
```
## Other summary functions
- Location: `mean(x)`, `median(x)`.
```{r}
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled
```
```{r}
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
)
```
----
- Spread: `sd(x)`, `IQR(x)`, `mad(x)`.
```{r}
# destinations with largest variation in distance
not_cancelled %>%
group_by(dest) %>%
summarise(distance_sd = sd(distance)) %>%
arrange(desc(distance_sd))
```
----
- Rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`.
```{r}
# Earliest and latest flights on each day?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first = min(dep_time),
last = max(dep_time)
)
```
----
- Position: `first(x)`, `nth(x, 2)`, `last(x)`.
```{r}
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first_dep = first(dep_time),
last_dep = last(dep_time)
)
```
----
- Count: `n(x)`, `sum(!is.na(x))`, `n_distinct(x)`.
```{r}
# Which destinations have the most carriers?
not_cancelled %>%
group_by(dest) %>%
summarise(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))
```
----
-
```{r}
# which destination has most flights from NYC?
not_cancelled %>%
count(dest) %>%
arrange(desc(n))
```
----
-
```{r}
# which aircraft flew most in 2013?
not_cancelled %>%
count(tailnum, wt = distance) %>%
arrange(desc(n))
```
----
-
```{r}
# How many flights left before 5am? (these usually indicate delayed
# flights from the previous day)
not_cancelled %>%
group_by(year, month, day) %>%
summarise(n_early = sum(dep_time < 500)) %>%
arrange(desc(n_early))
```
----
-
```{r}
# What proportion of flights are delayed by more than an hour?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(hour_perc = mean(arr_delay > 60)) %>%
arrange(desc(hour_perc))
```
## Grouped mutates (and filters)
- Recall the `flights_sml` tibble created earlier:
```{r}
flights_sml
```
- Find the worst members of each group:
```{r}
flights_sml %>%
group_by(year, month, day) %>%
filter(rank(desc(arr_delay)) < 10)
```
----
- Find all groups bigger than a threshold:
```{r}
(popular_dests <- flights %>%
group_by(dest) %>%
filter(n() > 365))
```
----
- Standardise to compute per group metrics:
```{r}
popular_dests %>%
filter(arr_delay > 0) %>%
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
select(year:day, dest, arr_delay, prop_delay) %>%
print(width = Inf)
```
# Combine tables
nycflights13 package has >1 tables:
- We already know a lot about flights:
```{r}
flights %>% print(width = Inf)
```
----
- airlines:
```{r}
airlines
```
----
- airports:
```{r}
airports
```
----
- planes:
```{r}
planes
```
----
- Weather:
```{r}
weather %>%
print(width = Inf)
```
## Relational data
For the MIMIC-III data, the relation structure can be explored at .
## Keys
- A **primary key** uniquely identifies an observation in its own table.
- A **foreign key** uniquely identifies an observation in another table.
# Combine variables (columns)
## Demo tables
-
```{r}
(x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
3, "x3"
))
```
```{r}
(y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
4, "y3"
))
```
## Inner join
- An **inner join** matches pairs of observations whenever their keys are equal:
-
```{r}
inner_join(x, y, by = "key")
```
Same as
```{r, eval = FALSE}
x %>% inner_join(y, by = "key")
```
## Outer join
- An **outer join** keeps observations that appear in at least one of the tables.
- Three types of outer joins:
- A **left join** keeps all observations in `x`.
```{r}
left_join(x, y, by = "key")
```
- A **right join** keeps all observations in `y`.
```{r}
right_join(x, y, by = "key")
```
- A **full join** keeps all observations in `x` or `y`.
```{r}
full_join(x, y, by = "key")
```
## Duplicate keys
- One table has duplicate keys.
----
-
```{r}
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
1, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2"
)
left_join(x, y, by = "key")
```
----
- Both tables have duplicate keys. You get all possible combinations, the Cartesian product:
----
-
```{r}
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
3, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
2, "y3",
3, "y4"
)
left_join(x, y, by = "key")
```
----
- Let's create a narrower table from the flights data:
```{r}
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
flights2
```
- We want to merge with the `weather` table:
```{r}
weather
```
## Defining the key columns
- `by = NULL` (default): use all variables that appear in both tables:
```{r}
# same as: flights2 %>% left_join(weather)
left_join(flights2, weather)
```
----
- `by = "x"`: use the common variable `x`:
```{r}
# same as: flights2 %>% left_join(weather)
left_join(flights2, planes, by = "tailnum")
```
----
- `by = c("a" = "b")`: match variable `a` in table `x` to the variable `b` in table `y`.
```{r}
# same as: flights2 %>% left_join(weather)
left_join(flights2, airports, by = c("dest" = "faa"))
```
# Combine cases (rows)
----
- Top 10 most popular destinations:
```{r}
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
top_dest
```
- How to filter the cases that fly to these destinations?
## Semi-join
- `semi_join(x, y)` keesp the rows in `x` that have a match in `y`.
----
-
```{r}
semi_join(flights, top_dest)
```
## Anti-join
- `anti_join(x, y)` keeps the rows that don’t have a match.
- Useful to see what will not be joined.
----
-
```{r}
flights %>%
anti_join(planes, by = "tailnum") %>%
count(tailnum, sort = TRUE)
```
## Set operations
- Generate two tables:
```{r}
(df1 <- tribble(
~x, ~y,
1, 1,
2, 1
))
```
```{r}
(df2 <- tribble(
~x, ~y,
1, 1,
1, 2
))
```
----
- `bind_rows(x, y)` stacks table `x` one on top of `y`.
```{r}
bind_rows(df1, df2)
```
- `intersect(x, y)` returns rows that appear in both `x` and `y`.
```{r}
intersect(df1, df2)
```
----
- `union(x, y)` returns unique observations in `x` and `y`.
```{r}
union(df1, df2)
```
----
- `setdiff(x, y)` returns rows that appear in `x` but not in `y`.
```{r}
setdiff(df1, df2)
```
```{r}
setdiff(df2, df1)
```
# Cheat sheet
[RStudio cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-transformation.pdf) is extremely helpful.