17.4 Loops over multiple indices with a design matrix
So far we’ve covered simple loops with a single index, but how can you loop over multiple indices? You could create multiple nested loops, but these are ugly and cumbersome. Instead, I recommend that you use a design matrix to reduce a loop with multiple indices into a single loop with just one index. Here’s how you do it:
Let’s say you want to calculate the mean, median, and standard deviation of some quantitative variable for all combinations of two factors. For a concrete example, we’ll calculate these summary statistics on the age of pirates for all combinations of college and sex.
To do this, we’ll start by creating a design matrix. This matrix will have all combinations of our two factors. To create it, we’ll use the expand.grid() function. This function takes several vectors as arguments and returns a dataframe with all combinations of the values of those vectors. For our two factors, college and sex, we’ll enter all the factor values we want. Additionally, we’ll add NA columns for the three summary statistics we want to calculate:
design.matrix <- expand.grid("college" = c("JSSFP", "CCCC"), # college factor
                             "sex" = c("male", "female"),    # sex factor
                             "median.age" = NA, # NA columns for our future calculations
                             "mean.age" = NA,   #...
                             "sd.age" = NA,     #...
                             stringsAsFactors = FALSE)
Here’s how the design matrix looks:
design.matrix
## college sex median.age mean.age sd.age
## 1 JSSFP male NA NA NA
## 2 CCCC male NA NA NA
## 3 JSSFP female NA NA NA
## 4 CCCC female NA NA NA
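A quick aside on how expand.grid() builds its rows, since the row order above may look arbitrary. The sketch below uses made-up factors (size and color are hypothetical names, not part of the pirates data) to show that the number of rows is the product of the vector lengths and that the first argument cycles fastest:

```r
# Hypothetical factor values, used only for illustration
grid <- expand.grid(size = c("small", "large"),
                    color = c("red", "green", "blue"),
                    stringsAsFactors = FALSE)

# One row per combination: 2 sizes x 3 colors = 6 rows
nrow(grid)

# The first argument varies fastest, so size alternates on every row
# while each color is held for two rows in a row
grid$size
grid$color
```

This is why the design matrix above lists both colleges for male before moving on to female.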
As the printout shows, the design matrix contains all combinations of our factors, plus three NA columns for our future statistics. Now that we have the matrix, we can use a single loop whose index is the row number of design.matrix, so the index values run from 1 to the number of rows. For each index value (that is, for each row), we’ll get the value of each factor (college and sex) by indexing the current row of the design matrix. We’ll then subset the pirates dataframe with those factor values, calculate our summary statistics, and assign them to the current row of the design matrix:
for(row.i in 1:nrow(design.matrix)) {

  # Get factor values for current row
  college.i <- design.matrix$college[row.i]
  sex.i <- design.matrix$sex[row.i]

  # Subset pirates with current factor values
  data.temp <- subset(pirates,
                      college == college.i & sex == sex.i)

  # Calculate statistics
  median.i <- median(data.temp$age)
  mean.i <- mean(data.temp$age)
  sd.i <- sd(data.temp$age)

  # Assign statistics to row.i of design.matrix
  design.matrix$median.age[row.i] <- median.i
  design.matrix$mean.age[row.i] <- mean.i
  design.matrix$sd.age[row.i] <- sd.i

}
Let’s look at the result to see if it worked!
design.matrix
## college sex median.age mean.age sd.age
## 1 JSSFP male 31 32 2.6
## 2 CCCC male 24 23 4.3
## 3 JSSFP female 33 34 3.5
## 4 CCCC female 26 26 3.4
Sweet! Our loop filled in the NA values with the statistics we wanted.
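If you want to try the pattern without the pirates data, here is a minimal, self-contained sketch. The toy dataframe and its ages are invented for illustration (they are not real pirates values), and only the mean is calculated to keep things short. As a sanity check, the sketch compares the loop’s results with base R’s aggregate(), which is one alternative way to get group means:

```r
# Hypothetical toy data standing in for the pirates dataframe
toy <- data.frame(
  college = c("JSSFP", "JSSFP", "CCCC", "CCCC", "JSSFP", "CCCC"),
  sex     = c("male", "female", "male", "female", "male", "female"),
  age     = c(30, 34, 24, 27, 32, 25),
  stringsAsFactors = FALSE
)

# Design matrix: one row per college x sex combination
design <- expand.grid(college = c("JSSFP", "CCCC"),
                      sex = c("male", "female"),
                      mean.age = NA,
                      stringsAsFactors = FALSE)

# Single loop over the rows of the design matrix
for (row.i in seq_len(nrow(design))) {

  # Logical vector marking the toy rows that match the current combination
  in.group <- toy$college == design$college[row.i] &
              toy$sex == design$sex[row.i]

  design$mean.age[row.i] <- mean(toy$age[in.group])
}

design$mean.age  # 31 24 34 26

# Cross-check against aggregate(), matching rows by college and sex
check <- aggregate(age ~ college + sex, data = toy, FUN = mean)
merged <- merge(design, check, by = c("college", "sex"))
all.equal(merged$mean.age, merged$age)
```

The loop is more verbose than aggregate(), but it generalizes easily: anything you can compute on a subset can be stored in a column of the design matrix.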