10.3 aggregate(): Grouped aggregation
Argument | Description |
---|---|
formula | A formula in the form y ~ x1 + x2 + ... where y is the dependent variable, and x1, x2, ... are the independent variables. For example, salary ~ sex + age will aggregate a salary column at every unique combination of sex and age. In R 4.2.0 and later this argument is named x; in earlier versions it is named formula. |
FUN | A function that you want to apply to y at every level of the independent variables, e.g., mean or max. |
data | The dataframe containing the variables in formula. |
subset | A subset of data to analyze. For example, subset = sex == "f" & age > 20 would restrict the analysis to females older than 20. You can ignore this argument to use all data. |
The first aggregation function we’ll cover is aggregate(). Aggregate allows you to easily answer questions in the form: “What is the value of the function FUN applied to a dependent variable dv at each level of one (or more) independent variable(s) iv?”
# General structure of aggregate()
aggregate(x = dv ~ iv, # dv is the data, iv is the group
FUN = fun, # The function you want to apply
data = df) # The dataframe object containing dv and iv
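To make the general structure concrete, here is a minimal sketch using the salary ~ sex + age formula from the argument table. The employees dataframe and all of its values are invented for illustration; they are not from any real dataset. The formula is passed unnamed as the first argument, which sidesteps the question of what that argument is called in your version of R.

```r
# Hypothetical data: six employees (all values invented for illustration)
employees <- data.frame(
  salary = c(50, 60, 70, 80, 55, 65),
  sex    = c("f", "f", "m", "m", "f", "m"),
  age    = c(25, 25, 30, 30, 30, 25)
)

# Mean salary at every observed combination of sex and age.
# The formula is passed unnamed as the first argument.
salary_means <- aggregate(salary ~ sex + age,
                          data = employees,
                          FUN = mean)

# One row per combination of sex and age that occurs in the data
salary_means
```

The result should contain one row for each combination of sex and age that actually appears in employees, with the salary column holding the group mean.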
Let’s give aggregate() a whirl. No…not a whirl…we’ll give it a spin. Definitely a spin. We’ll use aggregate() on the ChickWeight dataset to answer the question “What is the mean weight for each diet?”
If we wanted to answer this question using basic R functions, we’d have to write a separate command for each diet like this:
# The WRONG way to do grouped aggregation.
# We should be using aggregate() instead!
mean(ChickWeight$weight[ChickWeight$Diet == 1])
## [1] 103
If you are ever writing code like this, there is almost always a simpler way to do it. Let’s replace this code with a much more elegant solution using aggregate(). For this question, we’ll set the value of the dependent variable y to weight, x1 to Diet, and FUN to mean:
# Calculate the mean weight for each value of Diet
aggregate(x = weight ~ Diet, # DV is weight, IV is Diet
FUN = mean, # Calculate the mean of each group
data = ChickWeight) # dataframe is ChickWeight
## Diet weight
## 1 1 103
## 2 2 123
## 3 3 143
## 4 4 135
As you can see, the aggregate() function has returned a dataframe with a column for the independent variable Diet, and a column for the results of the function mean applied to each level of the independent variable. The result of this function is the same thing we’d get from manually indexing each level of Diet individually, but of course, this code is much simpler and more elegant!
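If you want to convince yourself that the two approaches agree, one way is to compute the manual version for every diet and compare it to the aggregate() result. The sketch below does this with sapply(); the object names diet_means and manual_means are just illustrative.

```r
# Grouped means from aggregate(); the formula is passed unnamed
diet_means <- aggregate(weight ~ Diet, data = ChickWeight, FUN = mean)

# The same means computed by manually indexing each level of Diet
manual_means <- sapply(levels(ChickWeight$Diet), function(d) {
  mean(ChickWeight$weight[ChickWeight$Diet == d])
})

# Compare the two sets of results (dropping the names that sapply() adds)
all.equal(diet_means$weight, unname(manual_means))
```

The comparison should return TRUE: aggregate() is doing the same indexing for you, once per level of Diet.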
You can also include a subset argument within an aggregate() function to apply the function to subsets of the original data. For example, if I wanted to calculate the mean chicken weights for each diet, but only when the chicks are less than 10 days old (Time in ChickWeight is measured in days since birth), I would do the following:
# Calculate the mean weight for each value of Diet,
# But only when chicks are less than 10 days old
aggregate(x = weight ~ Diet, # DV is weight, IV is Diet
FUN = mean, # Calculate the mean of each group
subset = Time < 10, # Only when chicks are less than 10 days old
data = ChickWeight) # dataframe is ChickWeight
## Diet weight
## 1 1 58
## 2 2 63
## 3 3 66
## 4 4 69
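It can help to see that the subset argument is a shortcut for filtering the dataframe before aggregating. The sketch below compares the two routes for the Time < 10 restriction; the object names (with_subset, young_chicks, pre_filtered) are illustrative only.

```r
# Route 1: let aggregate() do the filtering via its subset argument
with_subset <- aggregate(weight ~ Diet,
                         data = ChickWeight,
                         FUN = mean,
                         subset = Time < 10)

# Route 2: filter the rows first, then aggregate the smaller dataframe
young_chicks <- ChickWeight[ChickWeight$Time < 10, ]
pre_filtered <- aggregate(weight ~ Diet, data = young_chicks, FUN = mean)

# Both routes should give the same group means
all.equal(with_subset$weight, pre_filtered$weight)
```

Using subset keeps everything in a single call and avoids creating an intermediate dataframe, which is mostly a matter of convenience.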
You can also include multiple independent variables in the formula argument to aggregate(). For example, let’s use aggregate() to now get the mean weight of the chicks for all combinations of both Diet and Time, but now only for days 0, 2, and 4:
# Calculate the mean weight for each value of Diet and Time,
# But only when chicks are 0, 2 or 4 days old
aggregate(x = weight ~ Diet + Time, # DV is weight, IVs are Diet and Time
FUN = mean, # Calculate the mean of each group
subset = Time %in% c(0, 2, 4), # Only when chicks are 0, 2, or 4 days old
data = ChickWeight) # dataframe is ChickWeight
## Diet Time weight
## 1 1 0 41
## 2 2 0 41
## 3 3 0 41
## 4 4 0 41
## 5 1 2 47
## 6 2 2 49
## 7 3 2 50
## 8 4 2 52
## 9 1 4 56
## 10 2 4 60
## 11 3 4 62
## 12 4 4 64
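With two independent variables, each row of the result corresponds to one Diet-by-Time combination. As a rough check, the sketch below stores the result, counts its rows, and compares a single cell against the manual indexing approach. The object names are illustrative, and the choice of Diet 3 at Time 4 is arbitrary.

```r
# Store the grouped means; the formula is passed unnamed
diet_time_means <- aggregate(weight ~ Diet + Time,
                             data = ChickWeight,
                             FUN = mean,
                             subset = Time %in% c(0, 2, 4))

# One row per observed combination: 4 diets x 3 time points
nrow(diet_time_means)

# Spot-check one cell by manual indexing: Diet 3 at Time 4
manual_cell <- mean(
  ChickWeight$weight[ChickWeight$Diet == 3 & ChickWeight$Time == 4]
)
aggregate_cell <- diet_time_means$weight[
  diet_time_means$Diet == 3 & diet_time_means$Time == 4
]
all.equal(manual_cell, aggregate_cell)
```

Because the result is an ordinary dataframe, you can index it with logical conditions just like any other dataframe.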