8.5 Slicing dataframes

Once you have a dataset stored as a matrix or dataframe in R, you’ll want to start accessing specific parts of the data based on some criteria. For example, if your dataset contains the result of an experiment comparing different experimental groups, you’ll want to calculate statistics for each experimental group separately. The process of selecting specific rows and columns of data based on some criteria is commonly known as slicing.

Slicing and dicing data. The turnip represents your data, and the knife represents indexing with brackets, or subsetting functions like subset(). The red-eyed clown holding the knife is just off camera.

Figure 8.4: Slicing and dicing data. The turnip represents your data, and the knife represents indexing with brackets, or subsetting functions like subset(). The red-eyed clown holding the knife is just off camera.

8.5.1 Slicing with `[, ]`

Just like vectors, you can access specific data in dataframes using brackets. But now, instead of just using one indexing vector, we use two indexing vectors: one for the rows and one for the columns. To do this, use the notation data[rows, columns], where rows and columns are vectors of integers.

# Return row 1
df[1, ]

# Return column 5
df[, 5]

# Rows 1:5 and column 2
df[1:5, 2]

Ah the ToothGrowth dataframe. Yes, one of the dataframes stored in R contains data from an experiment testing the effectiveness of different doses of Vitamin C supplements on the growth of guinea pig teeth. The images I found by Googling ``guinea pig teeth'' were all pretty horrifying, so let's just go with this one.

Figure 8.5: Ah the ToothGrowth dataframe. Yes, one of the dataframes stored in R contains data from an experiment testing the effectiveness of different doses of Vitamin C supplements on the growth of guinea pig teeth. The images I found by Googling ``guinea pig teeth’’ were all pretty horrifying, so let’s just go with this one.

Table 8.4: First few rows of the ToothGrowth dataframe.
len	supp	dose
4.2	VC	0.5
11.5	VC	0.5
7.3	VC	0.5
5.8	VC	0.5
6.4	VC	0.5
10.0	VC	0.5

Let’s try indexing the ToothGrowth dataframe. Again, the ToothGrowth dataframe represents the results of a study testing the effectiveness of different types of supplements on the length of guinea pig’s teeth. First, let’s look at the entries in rows 1 through 5, and column 1:

# Give me the rows 1-6 and column 1 of ToothGrowth
ToothGrowth[1:6, 1]
## [1]  4.2 11.5  7.3  5.8  6.4 10.0

Because the first column is len, the primary dependent measure, this means that the tooth lengths in the first 6 observations are 4.2, 11.5, 7.3, 5.8, 6.4, 10.

Of course, you can index matrices and dataframes with longer vectors to get more data. Now, let’s look at the first 3 rows of columns 1 and 3:

# Give me rows 1-3 and columns 1 and 3 of ToothGrowth
ToothGrowth[1:3, c(1,3)]
##    len dose
## 1  4.2  0.5
## 2 11.5  0.5
## 3  7.3  0.5

If you want to look at an entire row or an entire column of a matrix or dataframe, you can leave the corresponding index blank. For example, to see the entire 1st row of the ToothGrowth dataframe, we can set the row index to 1, and leave the column index blank:

# Give me the 1st row (and all columns) of ToothGrowth
ToothGrowth[1, ]
##   len supp dose
## 1 4.2   VC  0.5

Similarly, to get the entire 2nd column, set the column index to 2 and leave the row index blank:

# Give me the 2nd column (and all rows) of ToothGrowth
ToothGrowth[, 2]
##  [1] VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
## [43] OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
## Levels: OJ VC

Many, if not all, of the analyses you will be doing will be on subsets of data, rather than entire datasets. For example, if you have data from an experiment, you may wish to calculate the mean of participants in one group separately from another. To do this, we’ll use subsetting – selecting subsets of data based on some criteria. To do this, we can use one of two methods: indexing with logical vectors, or the subset() function. We’ll start with logical indexing first.

8.5.2 Slicing with logical vectors

Indexing dataframes with logical vectors is almost identical to indexing single vectors. First, we create a logical vector containing only TRUE and FALSE values. Next, we index a dataframe (typically the rows) using the logical vector to return only values for which the logical vector is TRUE.

For example, to create a new dataframe called ToothGrowth.VC containing only data from the guinea pigs who were given the VC supplement, we’d run the following code:

# Create a new df with only the rows of ToothGrowth
#  where supp equals VC
ToothGrowth.VC <- ToothGrowth[ToothGrowth$supp == "VC", ]

Of course, just like we did with vectors, we can make logical vectors based on multiple criteria – and then index a dataframe based on those criteria. For example, let’s create a dataframe called ToothGrowth.OJ.a that contains data from the guinea pigs who were given an OJ supplement with a dose less than 1.0:

# Create a new df with only the rows of ToothGrowth
#  where supp equals OJ and dose < 1

ToothGrowth.OJ.a <- ToothGrowth[ToothGrowth$supp == "OJ" &
                                ToothGrowth$dose < 1, ]

Indexing with brackets is the standard way to slice and dice dataframes. However, the code can get a bit messy. A more elegant method is to use the subset() function.

8.5.3 Slicing with `subset()`

Figure 8.6: The subset() function is like a lightsaber. An elegant function from a more civilized age.

The subset() function is one of the most useful data management functions in R. It allows you to slice and dice datasets just like you would with brackets, but the code is much easier to write: Table 8.5 shows the main arguments to the subset() function:

Table 8.5: Main arguments for the `subset()` function.
Argument	Description
`x`	A dataframe you want to subset
`subset`	A logical vector indicating the rows to keep
`select`	The columns you want to keep

Let’s use the subset() function to create a new, subsetted dataset from the ToothGrowth dataframe containing data from guinea pigs who had a tooth length less than 20cm (len < 20), given the OJ supplement (supp == "OJ"), and with a dose greater than or equal to 1 (dose >= 1):

# Get rows of ToothGrowth where len < 20 AND supp == "OJ" AND dose >= 1
subset(x = ToothGrowth,
      subset = len < 20 &
               supp == "OJ" &
               dose >= 1)
##    len supp dose
## 41  20   OJ    1
## 49  14   OJ    1

As you can see, there were only two cases that satisfied all 3 of our selection criteria.

In the example above, I didn’t specify an input to the select argument because I wanted all columns. However, if you just want certain columns, you can just name the columns you want in the select argument:

# Get rows of ToothGrowth where len > 30 AND supp == "VC", but only return the len and dose columns
subset(x = ToothGrowth,
    subset = len > 30 & supp == "VC",
    select = c(len, dose))
##    len dose
## 23  34    2
## 26  32    2