--- title: 'Going deeper with dplyr: New features in 0.3 and 0.4' output: html_document --- ## Introduction In August 2014, I created a [40-minute video tutorial](https://www.youtube.com/watch?v=jWjqLW-u3hc) introducing the key functionality of the dplyr package in R, using dplyr version 0.2. Since then, there have been two significant updates to dplyr (0.3 and 0.4), introducing a ton of new features. This document (created in March 2015) covers the most useful new features in 0.3 and 0.4, as well as other functionality that I didn't cover last time (though it is not necessarily new). My [new video tutorial](https://www.youtube.com/watch?v=2mh1PqfsXVI) walks through the code below in detail. **If you have not watched the [previous tutorial](https://www.youtube.com/watch?v=jWjqLW-u3hc)**, I recommend you do so first since it covers some dplyr basics that will not be covered in this tutorial. ## Loading dplyr and the nycflights13 dataset Although my last tutorial used data from the hflights package, Hadley Wickham has rewritten the [dplyr vignettes](http://cran.r-project.org/web/packages/dplyr/index.html) to use the nycflights13 package instead, and so I'm also using nycflights13 for the sake of consistency. ```{r eval=FALSE} # remove flights data if you just finished my previous tutorial rm(flights) ``` ```{r} # load packages suppressMessages(library(dplyr)) library(nycflights13) # print the flights dataset from nycflights13 flights ``` ## Choosing columns: select, rename ```{r} # besides just using select() to pick columns... flights %>% select(carrier, flight) # ...you can use the minus sign to hide columns flights %>% select(-month, -day) ``` ```{r results='hide'} # hide a range of columns flights %>% select(-(dep_time:arr_delay)) # hide any column with a matching name flights %>% select(-contains("time")) ``` ```{r} # pick columns using a character vector of column names cols <- c("carrier", "flight", "tailnum") flights %>% select(one_of(cols)) ``` ```{r} # select() can be used to rename columns, though all columns not mentioned are dropped flights %>% select(tail = tailnum) # rename() does the same thing, except all columns not mentioned are kept flights %>% rename(tail = tailnum) ``` ## Choosing rows: filter, between, slice, sample_n, top_n, distinct ```{r} # filter() supports the use of multiple conditions flights %>% filter(dep_time >= 600, dep_time <= 605) ``` ```{r results='hide'} # between() is a concise alternative for determing if numeric values fall in a range flights %>% filter(between(dep_time, 600, 605)) # side note: is.na() can also be useful when filtering flights %>% filter(!is.na(dep_time)) ``` ```{r} # slice() filters rows by position flights %>% slice(1000:1005) # keep the first three rows within each group flights %>% group_by(month, day) %>% slice(1:3) # sample three rows from each group flights %>% group_by(month, day) %>% sample_n(3) # keep three rows from each group with the top dep_delay flights %>% group_by(month, day) %>% top_n(3, dep_delay) # also sort by dep_delay within each group flights %>% group_by(month, day) %>% top_n(3, dep_delay) %>% arrange(desc(dep_delay)) ``` ```{r} # unique rows can be identified using unique() from base R flights %>% select(origin, dest) %>% unique() ``` ```{r results='hide'} # dplyr provides an alternative that is more "efficient" flights %>% select(origin, dest) %>% distinct() # side note: when chaining, you don't have to include the parentheses if there are no arguments flights %>% select(origin, dest) %>% distinct ``` ## Adding new variables: mutate, transmute, add_rownames ```{r} # mutate() creates a new variable (and keeps all existing variables) flights %>% mutate(speed = distance/air_time*60) # transmute() only keeps the new variables flights %>% transmute(speed = distance/air_time*60) ``` ```{r} # example data frame with row names mtcars %>% head() # add_rownames() turns row names into an explicit variable mtcars %>% add_rownames("model") %>% head() # side note: dplyr no longer prints row names (ever) for local data frames mtcars %>% tbl_df() ``` ## Grouping and counting: summarise, tally, count, group_size, n_groups, ungroup ```{r} # summarise() can be used to count the number of rows in each group flights %>% group_by(month) %>% summarise(cnt = n()) ``` ```{r results='hide'} # tally() and count() can do this more concisely flights %>% group_by(month) %>% tally() flights %>% count(month) ``` ```{r} # you can sort by the count flights %>% group_by(month) %>% summarise(cnt = n()) %>% arrange(desc(cnt)) ``` ```{r results='hide'} # tally() and count() have a sort parameter for this purpose flights %>% group_by(month) %>% tally(sort=TRUE) flights %>% count(month, sort=TRUE) ``` ```{r} # you can sum over a specific variable instead of simply counting rows flights %>% group_by(month) %>% summarise(dist = sum(distance)) ``` ```{r results='hide'} # tally() and count() have a wt parameter for this purpose flights %>% group_by(month) %>% tally(wt = distance) flights %>% count(month, wt = distance) ``` ```{r} # group_size() returns the counts as a vector flights %>% group_by(month) %>% group_size() # n_groups() simply reports the number of groups flights %>% group_by(month) %>% n_groups() ``` ```{r} # group by two variables, summarise, arrange (output is possibly confusing) flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% arrange(desc(cnt)) %>% print(n = 40) # ungroup() before arranging to arrange across all groups flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% ungroup() %>% arrange(desc(cnt)) ``` ## Creating data frames: data_frame `data_frame()` is a better way than `data.frame()` for creating data frames. Benefits of `data_frame()`: * You can use previously defined columns to compute new columns. * It never coerces column types. * It never munges column names. * It never adds row names. * It only recycles length 1 input. * It returns a local data frame (a tbl_df). ```{r} # data_frame() example data_frame(a = 1:6, b = a*2, c = 'string', 'd+e' = 1) %>% glimpse() # data.frame() example data.frame(a = 1:6, c = 'string', 'd+e' = 1) %>% glimpse() ``` ## Joining (merging) tables: left_join, right_join, inner_join, full_join, semi_join, anti_join ```{r} # create two simple data frames (a <- data_frame(color = c("green","yellow","red"), num = 1:3)) (b <- data_frame(color = c("green","yellow","pink"), size = c("S","M","L"))) # only include observations found in both "a" and "b" (automatically joins on variables that appear in both tables) inner_join(a, b) # include observations found in either "a" or "b" full_join(a, b) # include all observations found in "a" left_join(a, b) # include all observations found in "b" right_join(a, b) # right_join(a, b) is identical to left_join(b, a) except for column ordering left_join(b, a) # filter "a" to only show observations that match "b" semi_join(a, b) # filter "a" to only show observations that don't match "b" anti_join(a, b) ``` ```{r} # sometimes matching variables don't have identical names b <- b %>% rename(col = color) # specify that the join should occur by matching "color" in "a" with "col" in "b" inner_join(a, b, by=c("color" = "col")) ``` ## Viewing more output: print, View ```{r} # specify that you want to see more rows flights %>% print(n = 15) ``` ```{r eval=FALSE} # specify that you want to see ALL rows (don't run this!) flights %>% print(n = Inf) ``` ```{r} # specify that you want to see all columns flights %>% print(width = Inf) ``` ```{r eval=FALSE} # show up to 1000 rows and all columns flights %>% View() # set option to see all columns and fewer rows options(dplyr.width = Inf, dplyr.print_min = 6) # reset options (or just close R) options(dplyr.width = NULL, dplyr.print_min = 10) ``` ## Resources * Release announcements for [version 0.3](http://blog.rstudio.org/2014/10/13/dplyr-0-3-2/) and [version 0.4](http://blog.rstudio.org/2015/01/09/dplyr-0-4-0/) * [dplyr reference manual and vignettes](http://cran.r-project.org/web/packages/dplyr/) * [Two-table vignette](http://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html) covering joins and set operations * [RStudio's Data Wrangling Cheat Sheet](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) for dplyr and tidyr * [dplyr GitHub repo](https://github.com/hadley/dplyr) and [list of releases](https://github.com/hadley/dplyr/releases) ## Data School * [Blog](http://www.dataschool.io/) * [Email newsletter](http://www.dataschool.io/subscribe/) * [YouTube channel](http://youtube.com/dataschool) < END OF DOCUMENT >