---
title: "Data Transformation With dplyr"
author: "Dr. Hua Zhou"
date: "Jan 30, 2018"
subtitle: Biostat M280
output:
ioslides_presentation:
smaller: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.width = 5, fig.height = 3.5, fig.align = 'center',
cache = TRUE)
```
##
A typical data science project:
## nycflights13 data
- Available from the nycflights13 package.
- 336,776 flights that departed from New York City in 2013:
```{r}
library("tidyverse")
```
```{r}
library("nycflights13")
flights
```
## dplyr basics
* Pick observations by their values: `filter()`.
* Reorder the rows: `arrange()`.
* Pick variables by their names: `select()`.
* Create new variables with functions of existing variables: `mutate()`.
* Collapse many values down to a single summary: `summarise()`.
# Manipulate rows (cases)
## Filter rows with `filter()`
- Flights on Jan 1st:
```{r}
# same as filter(flights, month == 1 & day == 1)
filter(flights, month == 1, day == 1)
```
----
- Flights in Nov or Dec:
```{r}
filter(flights, month == 11 | month == 12)
```
## Remove rows with duplicate values
- One row from each month:
```{r}
distinct(flights, month, .keep_all = TRUE)
```
## Sample rows
- Randomly select `n` rows:
```{r}
sample_n(flights, 10, replace = TRUE)
```
----
- Randomly select fraction of rows:
```{r}
sample_frac(flights, 0.1, replace = TRUE)
```
## Select rows by position
- Select rows by position:
```{r}
slice(flights, 1:5)
```
----
- Top `n` rows:
```{r}
top_n(flights, 5)
```
## Arrange rows with `arrange()`
- Sort in ascending order:
```{r}
arrange(flights, year, month, day)
```
----
- Sort in descending order:
```{r}
arrange(flights, desc(arr_delay))
```
# Manipulate columns (variables)
## Select columns with `select()`
- Select columns by variable name:
```{r}
select(flights, year, month, day)
```
----
- Select columns between two variables:
```{r}
select(flights, year:day)
```
----
- Select columns _except_ those between two variables:
```{r, }
select(flights, -(year:day))
```
----
- Select columns by positions:
```{r}
select(flights, seq(1, 10, by = 2))
```
----
- Move variables to the start of data frame:
```{r, }
select(flights, time_hour, air_time, everything())
```
----
- Helper functions.
* `starts_with("abc")`: matches names that begin with “abc”.
* `ends_with("xyz")`: matches names that end with “xyz”.
* `contains("ijk")`: matches names that contain “ijk”.
* `matches("(.)\\1")`: selects variables that match a regular expression.
* `num_range("x", 1:3)`: matches x1, x2 and x3.
* `one_of()`
## Add new variables with `mutate()`
- Add variables `gain` and `speed`:
```{r, results = 'hide'}
flights_sml <-
select(flights, year:day, ends_with("delay"), distance, air_time)
```
```{r}
mutate(flights_sml,
gain = arr_delay - dep_delay,
speed = distance / air_time * 60
)
```
----
- Refer to columns that you’ve just created:
```{r}
mutate(flights_sml,
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
```
----
- Only keep the new variables by `transmute()`:
```{r}
transmute(flights,
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
```
----
- `mutate_all()`: apply funs to all columns.
```{r, eval = FALSE}
mutate_all(data, funs(log(.), log2(.)))
```
- `mutate_at()`: apply funs to specifc columns.
```{r, eval = FALSE}
mutate_at(data, vars(-Species), funs(log(.)))
```
- `mutate_if()`: apply funs of one type
```{r, eval = FALSE}
mutate_if(data, is.numeric, funs(log(.)))
```
# Summaries
## Summaries with `summarise()`
- Mean of a variable:
```{r}
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
```
----
- Convert a tibble into a grouped tibble:
```{r}
(by_day <- group_by(flights, year, month, day))
```
----
- Grouped summaries:
```{r}
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
```
## Pipe
- Consider following analysis:
```{r, message = FALSE}
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest, count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
(delay <- filter(delay, count > 20, dest != "HNL"))
```
----
- Cleaner code using pipe `%>%`:
```{r}
(delays <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL"))
```
----
- Unfortunately ggplot2 does **not** use pipe.
```{r}
ggplot(data = delays, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) + geom_smooth(se = FALSE)
```
## Other summary functions
- Location: `mean(x)`, `median(x)`.
```{r, result = 'hold'}
not_cancelled <- flights %>% filter(!is.na(dep_delay), !is.na(arr_delay))
```
```{r}
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
)
```
----
- Spread: `sd(x)`, `IQR(x)`, `mad(x)`.
```{r}
not_cancelled %>%
group_by(dest) %>%
summarise(distance_sd = sd(distance)) %>%
arrange(desc(distance_sd))
```
----
- Rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`.
```{r}
# When do the first and last flights leave each day?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first = min(dep_time),
last = max(dep_time)
)
```
----
- Position: `first(x)`, `nth(x, 2)`, `last(x)`.
```{r}
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first_dep = first(dep_time),
last_dep = last(dep_time)
)
```
----
- Count: `n(x)`, `sum(!is.na(x))`, `n_distinct(x)`.
```{r}
# Which destinations have the most carriers?
not_cancelled %>%
group_by(dest) %>%
summarise(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))
```
----
-
```{r}
not_cancelled %>%
count(dest)
```
----
-
```{r}
not_cancelled %>%
count(tailnum, wt = distance)
```
----
-
```{r}
# How many flights left before 5am? (these usually indicate delayed
# flights from the previous day)
not_cancelled %>%
group_by(year, month, day) %>%
summarise(n_early = sum(dep_time < 500))
```
----
-
```{r}
# What proportion of flights are delayed by more than an hour?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(hour_perc = mean(arr_delay > 60))
```
## Grouped mutates (and filters)
- Find the worst members of each group:
```{r}
flights_sml %>%
group_by(year, month, day) %>%
filter(rank(desc(arr_delay)) < 10)
```
----
- Find all groups bigger than a threshold:
```{r}
(popular_dests <- flights %>%
group_by(dest) %>%
filter(n() > 365))
```
----
- Standardise to compute per group metrics:
```{r}
popular_dests %>%
filter(arr_delay > 0) %>%
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
select(year:day, dest, arr_delay, prop_delay)
```
# Combine tables
##
nycflights13 has >1 tables:
- We already know a lot about flights:
```{r}
flights
```
----
- airlines:
```{r}
airlines
```
----
- airports:
```{r}
airports
```
----
- planes:
```{r}
planes
```
----
- Weather:
```{r}
weather
```
## Relational data
## Keys
- A **primary key** uniquely identifies an observation in its own table.
- A **foreign key** uniquely identifies an observation in another table.
# Combine variables (columns)
## Demo tables
-
```{r}
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
3, "x3"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
4, "y3"
)
```
## Inner join
- An **inner join** matches pairs of observations whenever their keys are equal:
-
```{r}
inner_join(x, y, by = "key")
```
Same as
```{r, eval = FALSE}
x %>% inner_join(y, by = "key")
```
## Outer join
- An **outer join** keeps observations that appear in at least one of the tables.
- Three types of outer joins:
- A **left join** keeps all observations in `x`.
- A **right join** keeps all observations in `y`.
- A **full join** keeps all observations in `x` or `y`.
## Duplicate keys
- One table has duplicate keys.
----
-
```{r}
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
1, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2"
)
left_join(x, y, by = "key")
```
----
- Both tables have duplicate keys. You get all possible combinations, the Cartesian product:
----
-
```{r}
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
3, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
2, "y3",
3, "y4"
)
left_join(x, y, by = "key")
```
----
- Let's create a narrower table from the flights data:
```{r}
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
flights2
```
## Defining the key columns
- `by = NULL` (default): use all variables that appear in both tables:
```{r}
# same as: flights2 %>% left_join(weather)
left_join(flights2, weather)
```
----
- `by = "x"`: use the common variable `x`:
```{r}
# same as: flights2 %>% left_join(weather)
left_join(flights2, planes, by = "tailnum")
```
----
- `by = c("a" = "b")`: match variable `a` in table `x` to the variable `b` in table `y`.
```{r}
# same as: flights2 %>% left_join(weather)
left_join(flights2, airports, by = c("dest" = "faa"))
```
# Combine cases (rows)
----
- Top 10 most popular destinations:
```{r}
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
top_dest
```
- How to filter the cases that fly to these destinations?
## Semi-join
- `semi_join(x, y)` keesp the rows in `x` that have a match in `y`.
- Useful to see what will be joined.
----
-
```{r}
semi_join(flights, top_dest)
```
## Anti-join
- `anti_join(x, y)` keeps the rows that don’t have a match.
- Useful to see what will not be joined.
----
-
```{r}
flights %>%
anti_join(planes, by = "tailnum") %>%
count(tailnum, sort = TRUE)
```
## Set operations
- Generate two tables:
```{r}
df1 <- tribble(
~x, ~y,
1, 1,
2, 1
)
df2 <- tribble(
~x, ~y,
1, 1,
1, 2
)
```
----
- `bind_rows(x, y)` stacks table `x` one on top of `y`.
```{r}
bind_rows(df1, df2)
```
- `intersect(x, y)` returns rows that appear in both `x` and `y`.
```{r}
intersect(df1, df2)
```
----
- `union(x, y)` returns unique observations in `x` and `y`.
```{r}
union(df1, df2)
```
----
- `setdiff(x, y)` returns rows that appear in `x` but not in `y`.
```{r}
setdiff(df1, df2)
```
```{r}
setdiff(df2, df1)
```