dplyr
in brieflibrary(tidyverse)
Rarely will your data arrive in exactly the form you require in order to analyze it appropriately. As part of the data science workflow you will need to transform your data in order to analyze it. Just as we established a syntax for generating graphics (the layered grammar of graphics), so too will we have a syntax for data transformation.
From the same author of ggplot2
, I give you dplyr
! This package contains useful functions for transforming and manipulating data frames, the bread-and-butter format for data in R. These functions can be thought of as verbs. The noun is the data, and the verb is acting on the noun. All of the dplyr
verbs (and in fact all the verbs in the wider tidyverse
) work similarly:
dplyr
function() |
Action performed |
---|---|
filter() |
Subsets observations based on their values |
arrange() |
Changes the order of observations based on their values |
select() |
Selects a subset of columns from the data frame |
rename() |
Changes the name of columns in the data frame |
mutate() |
Creates new columns (or variables) |
group_by() |
Changes the unit of analysis from the complete dataset to individual groups |
summarize() |
Collapses the data frame to a smaller number of rows which summarize the larger data |
These are the basic verbs you will use to transform your data. By combining them together, you can perform powerful data manipulation tasks.
Hadley Wickham is from New Zealand. As such he (and base R) favours British spellings:
The holy grail: “For consistency, aim to use British (rather than American) spelling.” #rstats http://t.co/7qQSWIowcl. Colour is right!
— Hadley Wickham (@hadleywickham) November 27, 2013
While British spelling is perhaps the norm, let us recall a past statement by the president:
We have to make America great again!
— Donald J. Trump (@realDonaldTrump) November 7, 2012
Fortunately many R functions can be written using American or British variants:
summarize()
= summarise()
color()
= colour()
Therefore in this class I will generally stick to American spelling.
dplyr
never overwrites existing data. If you want a copy of the transformed data for later use in the program, you need to explicitly save it. You can do this by using the assignment operator <-
:
filter(diamonds, cut == "Ideal") # printed, but not saved
## # A tibble: 21,551 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.23 Ideal J VS1 62.8 56 340 3.93 3.90 2.46
## 3 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
## 4 0.30 Ideal I SI2 62.0 54 348 4.31 4.34 2.68
## 5 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78
## 6 0.33 Ideal I SI2 61.2 56 403 4.49 4.50 2.75
## 7 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76
## 8 0.23 Ideal G VS1 61.9 54 404 3.93 3.95 2.44
## 9 0.32 Ideal I SI1 60.9 55 404 4.45 4.48 2.72
## 10 0.30 Ideal I SI2 61.0 59 405 4.30 4.33 2.63
## # ... with 21,541 more rows
diamonds_ideal <- filter(diamonds, cut == "Ideal") # saved, but not printed
(diamonds_ideal <- filter(diamonds, cut == "Ideal")) # saved and printed
## # A tibble: 21,551 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.23 Ideal J VS1 62.8 56 340 3.93 3.90 2.46
## 3 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
## 4 0.30 Ideal I SI2 62.0 54 348 4.31 4.34 2.68
## 5 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78
## 6 0.33 Ideal I SI2 61.2 56 403 4.49 4.50 2.75
## 7 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76
## 8 0.23 Ideal G VS1 61.9 54 404 3.93 3.95 2.44
## 9 0.32 Ideal I SI1 60.9 55 404 4.45 4.48 2.72
## 10 0.30 Ideal I SI2 61.0 59 405 4.30 4.33 2.63
## # ... with 21,541 more rows
Do not use
=
to assign objects. Read this for more information on the difference between<-
and=
.
NA
represents an unknown value. Missing values are contagious, in that their properties will transfer to any operation performed on it.
NA > 5
## [1] NA
10 == NA
## [1] NA
NA + 10
## [1] NA
To determine if a value is missing, use the is.na()
function.
When filtering, you must explicitly call for missing values to be returned.
df <- tibble(x = c(1, NA, 3))
df
## # A tibble: 3 x 1
## x
## <dbl>
## 1 1
## 2 NA
## 3 3
filter(df, x > 1)
## # A tibble: 1 x 1
## x
## <dbl>
## 1 3
filter(df, is.na(x) | x > 1)
## # A tibble: 2 x 1
## x
## <dbl>
## 1 NA
## 2 3
Or when calculating summary statistics, you need to explicitly ignore missing values.
df <- tibble(
x = c(1, 2, 3, 5, NA)
)
df
## # A tibble: 5 x 1
## x
## <dbl>
## 1 1
## 2 2
## 3 3
## 4 5
## 5 NA
summarize(df, meanx = mean(x))
## # A tibble: 1 x 1
## meanx
## <dbl>
## 1 NA
summarize(df, meanx = mean(x, na.rm = TRUE))
## # A tibble: 1 x 1
## meanx
## <dbl>
## 1 2.75
As we discussed, frequently you need to perform a series of intermediate steps to transform data for analysis. If we write each step as a discrete command and store their contents as new objects, your code can become convoluted.
Drawing on this example from R for Data Science, let’s explore the relationship between the distance and average delay for each location. At this point, we would write it something like this:
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
Decomposing the problem, there are three basic steps:
flights
by destination.The code as written is inefficient because we have to name and store each intermediate data frame, even though we don’t care about them. It also provides more opportunities for typos and errors.
Because all dplyr
verbs follow the same syntax (data first, then options for the function), we can use the pipe operator %>%
to chain a series of functions together in one command:
delays <- flights %>%
group_by(dest) %>%
summarize(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL")
Now, we don’t have to name each intermediate step and store them as data frames. We only store a single data frame (delays
) which contains the final version of the transformed data frame. We could read this code as use the flights
data, then group by destination, then summarize for each destination the number of flights, the average disance, and the average delay, then subset only the destinations with at least 20 flights and exclude Honolulu.
Remember that with pipes, we don’t have to save all of our intermediate steps. We only use one assignment, like this:
delays <- flights %>%
group_by(dest) %>%
summarize(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL")
Do not do this:
delays <- flights %>%
by_dest <- group_by(dest) %>%
delay <- summarize(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
delay <- filter(count > 20, dest != "HNL")
Error: bad assignment:
summarize(count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay,
na.rm = TRUE)) %>% delay <- filter(count > 20, dest != "HNL")
Or this:
delays <- flights %>%
group_by(flights, dest) %>%
summarize(flights,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(flights, count > 20, dest != "HNL")
## Error in grouped_df_impl(data, unname(vars), drop): Column `flights` is unknown
If you use pipes, you don’t have to reference the data frame with each function - just the first time at the beginning of the pipe sequence.
devtools::session_info()
## Session info -------------------------------------------------------------
## setting value
## version R version 3.4.3 (2017-11-30)
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2018-03-13
## Packages -----------------------------------------------------------------
## package * version date source
## assertthat 0.2.0 2017-04-11 CRAN (R 3.4.0)
## backports 1.1.2 2017-12-13 CRAN (R 3.4.3)
## base * 3.4.3 2017-12-07 local
## bindr 0.1 2016-11-13 CRAN (R 3.4.0)
## bindrcpp 0.2 2017-06-17 CRAN (R 3.4.0)
## broom 0.4.3 2017-11-20 CRAN (R 3.4.1)
## cellranger 1.1.0 2016-07-27 CRAN (R 3.4.0)
## cli 1.0.0 2017-11-05 CRAN (R 3.4.2)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
## compiler 3.4.3 2017-12-07 local
## crayon 1.3.4 2017-10-03 Github (gaborcsardi/crayon@b5221ab)
## datasets * 3.4.3 2017-12-07 local
## devtools 1.13.5 2018-02-18 CRAN (R 3.4.3)
## digest 0.6.15 2018-01-28 CRAN (R 3.4.3)
## dplyr * 0.7.4.9000 2017-10-03 Github (tidyverse/dplyr@1a0730a)
## evaluate 0.10.1 2017-06-24 CRAN (R 3.4.1)
## forcats * 0.3.0 2018-02-19 CRAN (R 3.4.3)
## foreign 0.8-69 2017-06-22 CRAN (R 3.4.3)
## ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.4.0)
## glue 1.2.0 2017-10-29 CRAN (R 3.4.2)
## graphics * 3.4.3 2017-12-07 local
## grDevices * 3.4.3 2017-12-07 local
## grid 3.4.3 2017-12-07 local
## gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
## haven 1.1.1 2018-01-18 CRAN (R 3.4.3)
## hms 0.4.1 2018-01-24 CRAN (R 3.4.3)
## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
## httr 1.3.1 2017-08-20 CRAN (R 3.4.1)
## jsonlite 1.5 2017-06-01 CRAN (R 3.4.0)
## knitr 1.20 2018-02-20 CRAN (R 3.4.3)
## lattice 0.20-35 2017-03-25 CRAN (R 3.4.3)
## lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.2)
## lubridate 1.7.2 2018-02-06 CRAN (R 3.4.3)
## magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
## methods * 3.4.3 2017-12-07 local
## mnormt 1.5-5 2016-10-15 CRAN (R 3.4.0)
## modelr 0.1.1 2017-08-10 local
## munsell 0.4.3 2016-02-13 CRAN (R 3.4.0)
## nlme 3.1-131.1 2018-02-16 CRAN (R 3.4.3)
## parallel 3.4.3 2017-12-07 local
## pillar 1.1.0 2018-01-14 CRAN (R 3.4.3)
## pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.0)
## plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
## psych 1.7.8 2017-09-09 CRAN (R 3.4.1)
## purrr * 0.2.4 2017-10-18 CRAN (R 3.4.2)
## R6 2.2.2 2017-06-17 CRAN (R 3.4.0)
## Rcpp 0.12.15 2018-01-20 CRAN (R 3.4.3)
## readr * 1.1.1 2017-05-16 CRAN (R 3.4.0)
## readxl 1.0.0 2017-04-18 CRAN (R 3.4.0)
## reshape2 1.4.3 2017-12-11 CRAN (R 3.4.3)
## rlang 0.2.0 2018-02-20 cran (@0.2.0)
## rmarkdown 1.8 2017-11-17 CRAN (R 3.4.2)
## rprojroot 1.3-2 2018-01-03 CRAN (R 3.4.3)
## rstudioapi 0.7 2017-09-07 CRAN (R 3.4.1)
## rvest 0.3.2 2016-06-17 CRAN (R 3.4.0)
## scales 0.5.0 2017-08-24 cran (@0.5.0)
## stats * 3.4.3 2017-12-07 local
## stringi 1.1.6 2017-11-17 CRAN (R 3.4.2)
## stringr * 1.3.0 2018-02-19 CRAN (R 3.4.3)
## tibble * 1.4.2 2018-01-22 CRAN (R 3.4.3)
## tidyr * 0.8.0 2018-01-29 CRAN (R 3.4.3)
## tidyverse * 1.2.1 2017-11-14 CRAN (R 3.4.2)
## tools 3.4.3 2017-12-07 local
## utils * 3.4.3 2017-12-07 local
## withr 2.1.1 2017-12-19 CRAN (R 3.4.3)
## xml2 1.2.0 2018-01-24 CRAN (R 3.4.3)
## yaml 2.1.16 2017-12-12 CRAN (R 3.4.3)
This work is licensed under the CC BY-NC 4.0 Creative Commons License.