10.4 dplyr
The dplyr package is a relatively new R package that lets you run many kinds of analyses quickly and easily. It is especially useful for creating tables of summary statistics across specific groups of data. This section gives a very brief overview of how to use dplyr for grouped aggregation. Just to be clear: you can use dplyr to do everything the aggregate() function does and much more! Because this overview is so brief, I strongly recommend you look at the dplyr help pages for additional descriptions and examples.
To use the dplyr package, you first need to install it with install.packages("dplyr") and then load it with library(dplyr).
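As a minimal sketch of that setup step, the code below installs dplyr only when it is missing and then loads it. The requireNamespace() guard and the dplyr.loaded check are additions for illustration, not part of the original text, and installing assumes you have an internet connection and a CRAN mirror configured.

```r
# Install dplyr only if it is not already available on this machine
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Load the package so its verbs and the pipe are available
library(dplyr)

# Illustrative check: is dplyr now on the search path?
dplyr.loaded <- "package:dplyr" %in% search()
dplyr.loaded
```

You only need to install a package once per machine, but you need to load it with library() in every new R session.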
Programming with dplyr looks a lot different than programming in standard R. dplyr works by combining objects (dataframes and columns in dataframes), functions (mean, median, etc.), and verbs (special commands in dplyr). In between these commands is a new operator called the pipe, which looks like this: %>%. The pipe simply tells R that you want to continue executing some functions or verbs on the object you are working on. You can think about this pipe as meaning ‘and then…’
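To make the ‘and then…’ reading concrete, here is a small sketch that computes the same value twice: once with nested function calls and once with the pipe. The vector x and the object names nested and piped are made up for illustration.

```r
library(dplyr)  # loading dplyr makes the %>% pipe available

x <- c(4, 9, 16)

# Standard R: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# With the pipe: take x, and then sqrt, and then mean, and then round
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

# Both approaches give the same result
identical(nested, piped)
```

The pipe passes whatever is on its left as the first argument of the function on its right, so x %>% round(1) is the same as round(x, 1).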
To aggregate data with dplyr, your code will look something like the template below. In this example, assume that the dataframe you want to summarize is called my.df, the rows you want to keep are those where iv3 is greater than 30, the variables you want to group the data by are iv1 and iv2, and the columns you want to aggregate are called col.a, col.b and col.c:
# Template for using dplyr
my.df %>%                   # Specify original dataframe
  filter(iv3 > 30) %>%      # Filter condition
  group_by(iv1, iv2) %>%    # Grouping variable(s)
  summarise(
    a = mean(col.a),        # calculate mean of column col.a in my.df
    b = sd(col.b),          # calculate sd of column col.b in my.df
    c = max(col.c))         # calculate max on column col.c in my.df, ...
When you use dplyr, you write code that sounds like: “The original dataframe is XXX, now filter the dataframe to only include rows that satisfy the conditions YYY, now group the data at each level of the variable(s) ZZZ, now summarize the data by calculating the summary functions AAA…”
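The template cannot be run as it stands because my.df does not exist. As a hedged sketch, the code below builds a tiny made-up my.df (all values are invented for illustration) and applies the same filter, group and summarise steps. The .groups = "drop" argument is an addition that assumes dplyr 1.0.0 or later; it returns an ungrouped result and silences the grouping message.

```r
library(dplyr)

# Hypothetical data: two grouping variables, one filter variable,
# and three columns to aggregate
my.df <- data.frame(
  iv1   = c("x", "x", "x", "y", "y", "y"),
  iv2   = c("m", "m", "n", "m", "m", "n"),
  iv3   = c(40, 50, 20, 35, 45, 60),
  col.a = c(1, 3, 5, 2, 4, 6),
  col.b = c(10, 14, 9, 8, 12, 7),
  col.c = c(5, 2, 9, 1, 8, 3)
)

my.agg <- my.df %>%            # Start with my.df, and then...
  filter(iv3 > 30) %>%         # keep rows where iv3 > 30, and then...
  group_by(iv1, iv2) %>%       # group by iv1 and iv2, and then...
  summarise(
    a = mean(col.a),           # mean of col.a within each group
    b = sd(col.b),             # sd of col.b within each group
    c = max(col.c),            # max of col.c within each group
    .groups = "drop"           # assumption: dplyr >= 1.0.0
  )

my.agg
```

Note that a group with a single row gets NA for its standard deviation, because sd() needs at least two values.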
Let’s start with an example: let’s create a dataframe of aggregated data from the pirates dataset. I’ll filter the data to only include pirates who wear a headband. I’ll group the data according to the columns sex and college. I’ll then create several columns of different summary statistics across each grouping. To create this aggregated dataframe, I will use the new function group_by and the verb summarise. I will assign the result to a new dataframe called pirates.agg:
pirates.agg <- pirates %>%         # Start with the pirates dataframe
  filter(headband == "yes") %>%    # Only pirates that wear hb
  group_by(sex, college) %>%       # Group by these variables
  summarise(
    age.mean = mean(age),          # Define first summary...
    tat.med = median(tattoos),     # you get the idea...
    n = n()                        # How many are in each group?
  )                                # End

# Print the result
pirates.agg
## # A tibble: 6 × 5
## # Groups:   sex [3]
##   sex    college age.mean tat.med     n
##   <chr>  <chr>      <dbl>   <dbl> <int>
## 1 female CCCC        26.0      10   206
## 2 female JSSFP       33.8      10   203
## 3 male   CCCC        23.4      10   358
## 4 male   JSSFP       31.9      10    85
## 5 other  CCCC        24.8      10    24
## 6 other  JSSFP         32      12    11
As you can see from the output above, our final object pirates.agg is the aggregated dataframe we want: it contains each summary statistic we asked for, for every combination of sex and college.
One key new function here is n(). This function is specific to dplyr and, when used in a summary command, returns the number of rows (observations) in each group.
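As a small sketch of what n() does, the code below counts rows per group in R's built-in mtcars dataset (used here only because it ships with R; it is not part of the original examples). It also shows count(), a dplyr shortcut that should give the same group sizes in one step.

```r
library(dplyr)

# Number of cars for each number of cylinders, using n() inside summarise()
by.n <- mtcars %>%
  group_by(cyl) %>%
  summarise(n = n())

# The same group sizes using the count() shortcut
by.count <- mtcars %>%
  count(cyl)

by.n
```

Because n() takes no arguments, it always counts rows in the current group, including rows that have missing values in other columns.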
Let’s do a more complex example where we combine multiple verbs into one chunk of code. We’ll aggregate data from the movies dataframe.
movies %>%                                       # From the movies dataframe...
  filter(genre != "Horror" & time > 50) %>%      # Select only these rows
  group_by(rating, sequel) %>%                   # Group by rating and sequel
  summarise(
    frequency = n(),                             # How many movies in each group?
    budget.mean = mean(budget, na.rm = TRUE),    # Mean budget?
    revenue.mean = mean(revenue.all),            # Mean revenue?
    billion.p = mean(revenue.all > 1000))        # Proportion of movies with revenue > 1000?
## # A tibble: 14 × 6
## # Groups:   rating [7]
##    rating    sequel frequency budget.mean revenue.mean billion.p
##    <chr>      <int>     <int>       <dbl>        <dbl>     <dbl>
##  1 G              0        59        41.2         234.         0
##  2 G              1        12        92.9         357.    0.0833
##  3 NC-17          0         2        3.75         18.5         0
##  4 Not Rated      0        84        1.74         55.5         0
##  5 Not Rated      1        12       0.667         66.1         0
##  6 PG             0       312        51.8         191.   0.00962
##  7 PG             1        62        77.2         372.    0.0161
##  8 PG-13          0       645        52.1         168.   0.00620
##  9 PG-13          1       120        124.         524.     0.117
## 10 R              0       623        31.4         109.         0
## 11 R              1        42        58.2         226.         0
## 12 <NA>           0        86        1.65         33.7         0
## 13 <NA>           1        15        5.51         48.1         0
## 14 <NA>          NA        11           0         34.1         0
As you can see, our result is a dataframe with 14 rows and 6 columns. The data are summarized from the movies dataframe, include only movies whose genre is not Horror and whose length is longer than 50 minutes, are grouped by rating and sequel, and show several summary statistics.
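The billion.p column relies on a handy trick: taking the mean of a logical test gives the proportion of TRUE values, because R treats TRUE as 1 and FALSE as 0. Here is a minimal sketch with a made-up dataframe called toy.movies (the values are invented, and .groups = "drop" assumes dplyr 1.0.0 or later).

```r
library(dplyr)

# Hypothetical data: six movies with a rating and a revenue
toy.movies <- data.frame(
  rating      = c("PG", "PG", "PG", "PG", "R", "R"),
  revenue.all = c(1200, 300, 1500, 80, 50, 700)
)

toy.agg <- toy.movies %>%
  group_by(rating) %>%
  summarise(
    # revenue.all > 1000 is TRUE/FALSE for each movie;
    # the mean of those values is the proportion that are TRUE
    billion.p = mean(revenue.all > 1000),
    .groups = "drop"
  )

toy.agg
```

If the column contains missing values, the proportion will be NA unless you add na.rm = TRUE to mean(), just as with the budget column in the example above.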
10.4.1 Additional dplyr help
We’ve only scratched the surface of what you can do with dplyr. In fact, you can perform almost all of your R tasks, from loading, to managing, to saving data, in the dplyr framework. For more tips on using dplyr, check out the dplyr vignette at https://cran.r-project.org/web/packages/dplyr/vignettes/introduction.html. You can also open the vignettes that come with your installed copy of dplyr from within RStudio by using the vignette() function.
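As a hedged sketch of finding the vignette from inside R, the code below lists the vignettes that ship with your installed version of dplyr and, in an interactive session, opens an index of them in your browser. It assumes dplyr is installed; vignette names have changed between dplyr releases, so listing them first is safer than guessing a name. The object names vigs and vig.names are made up for illustration.

```r
# List the vignettes bundled with the installed dplyr package
vigs <- vignette(package = "dplyr")

# The "Item" column holds the names you can pass to vignette()
vig.names <- vigs$results[, "Item"]
print(vig.names)

# In an interactive session, open a browsable index of dplyr vignettes
if (interactive()) {
  browseVignettes(package = "dplyr")
}
```

Once you know a name from the list, vignette("name", package = "dplyr") opens that specific vignette.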
There is also a very nice YouTube video covering dplyr at https://goo.gl/UY2AE1. Finally, consider also reading R for Data Science, written by Garrett Grolemund and Hadley Wickham, which teaches R from the ground up using the dplyr framework.