This chapter looks at how to most easily access your data once they are in R objects.

Note that this chapter uses base R. You may prefer the dplyr package when working with data.frames, since it does similar things in an (arguably) easier-to-understand syntax.

Still, it’s worth knowing how to do this without relying on a package.

Subsetting using [ ]

Most R objects let you reach their individual elements by numeric position, using square brackets [ ]:

a_vector <- letters
a_vector[1]
## [1] "a"
a_vector[5]
## [1] "e"
a_vector[1:5]
## [1] "a" "b" "c" "d" "e"
a_vector[c(1,5)]
## [1] "a" "e"

A data.frame has a second dimension, so you also have to say which columns you want as well as which rows.

In this case, the [ ] notation is extended to include a comma: [ , ].

  • The position before the , indicates which row
  • The position after the , indicates which column

Note: This is sort of like R1C1 notation in Excel…except with a comma!

my_data <- mtcars

## the second row
my_data[2, ]
##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4
## the first column
my_data[ , 1]
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
## the second row, first column
my_data[2, 1]
## [1] 21
## the second and third rows, and the first and fourth columns
my_data[2:3, c(1,4)]
##                mpg  hp
## Mazda RX4 Wag 21.0 110
## Datsun 710    22.8  93

Character subsetting

Subsetting by number assumes the rows and columns always stay in the same order, which can be dangerous. It is safer to use the name, if you know it:

## First 5 rows of the "mpg" column
my_data[1:5, "mpg"]
## [1] 21.0 21.0 22.8 21.4 18.7

You can also select multiple columns:

my_data[1:5, c("mpg","gear")]
##                    mpg gear
## Mazda RX4         21.0    4
## Mazda RX4 Wag     21.0    4
## Datsun 710        22.8    4
## Hornet 4 Drive    21.4    3
## Hornet Sportabout 18.7    3

You can also reorder or repeat columns (repeated columns are renamed via make.unique() to avoid name clashes):

my_data[1:5, c("gear","mpg", "mpg")]
##                   gear  mpg mpg.1
## Mazda RX4            4 21.0  21.0
## Mazda RX4 Wag        4 21.0  21.0
## Datsun 710           4 22.8  22.8
## Hornet 4 Drive       3 21.4  21.4
## Hornet Sportabout    3 18.7  18.7

Note that if you subset a list or data.frame with a single index inside [ ], such as my_data["mpg"], it returns a list or data.frame back, even when you ask for just one column.

If you want the column vector itself, use [[ ]], which returns what’s inside the list/data.frame column.

This is a confusing topic. It’s right up there with stringsAsFactors = FALSE. This is where the console comes in handy when you’re trying to make sure you have your syntax correct.

my_data[["mpg"]]
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
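
Since the difference is easy to mix up, here is a small sketch you can paste into the console to compare what each form returns. The drop = FALSE argument in the last line is a standard base R option that the text above does not cover; it is included only to show how to keep a data.frame when selecting a single column with [ , ].

```r
my_data <- mtcars

# single brackets with a column name keep the data.frame wrapper
single <- my_data["mpg"]
class(single)
## [1] "data.frame"

# double brackets pull out the column itself
double <- my_data[["mpg"]]
class(double)
## [1] "numeric"

# the [rows, columns] form with one column also drops to a vector...
class(my_data[ , "mpg"])
## [1] "numeric"

# ...unless you ask it not to (drop = FALSE is not covered above)
class(my_data[ , "mpg", drop = FALSE])
## [1] "data.frame"
```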

Lookup Tables

Character subsetting is also how lookup tables work. The only difference is that, this time, it’s not a data.frame but a vector whose elements are named using the names() function.

## example with some user Ids
lookup <- c("Bill","Ben","Sue","Linda","Gerry")
names(lookup) <- c("1231","2323","5353","3434","9999")
lookup
##    1231    2323    5353    3434    9999 
##  "Bill"   "Ben"   "Sue" "Linda" "Gerry"
## this is a big vector of Ids you want to look up
big_list_of_ids <- c("2323","2323","3434","9999","9999","1231","5353","9999","2323","1231","9999")

## subset lookup by your vector of Ids; repeated Ids return repeated values
lookup[big_list_of_ids]
##    2323    2323    3434    9999    9999    1231    5353    9999    2323 
##   "Ben"   "Ben" "Linda" "Gerry" "Gerry"  "Bill"   "Sue" "Gerry"   "Ben" 
##    1231    9999 
##  "Bill" "Gerry"
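
One thing worth checking when you use a lookup like this is what happens to an Id that is not in the table. The sketch below assumes a made-up Id, "0000", that has no entry. Base R returns NA for it rather than raising an error. The unname() call is optional and is only there to strip the Ids from the result.

```r
lookup <- c("Bill","Ben","Sue","Linda","Gerry")
names(lookup) <- c("1231","2323","5353","3434","9999")

## "0000" is a hypothetical Id that is not in the lookup
ids <- c("2323", "0000", "9999")

found <- lookup[ids]

## drop the names if you only want the values
unname(found)
## [1] "Ben"   NA      "Gerry"

## which Ids failed to match?
ids[is.na(found)]
## [1] "0000"
```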

$ operator

You can also find columns via the $ operator on lists and data.frames:

my_data$mpg
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4

The $ operator is a shortcut for subsetting by name with [[ ]]:

my_data[["mpg"]]
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4

You can then apply subsetting to the result to get specific elements:

## first 5 elements of mpg column
my_data$mpg[1:5]
## [1] 21.0 21.0 22.8 21.4 18.7

Subsetting using logicals

You can also subset using TRUE and FALSE. This is a good way to select rows.

For instance, say we want to select all rows of mtcars where the mpg column is over 24.

We first construct the logical vector:

## make a vector that is TRUE for every mpg element over 24 and FALSE otherwise
over_24 <- mtcars$mpg > 24
over_24
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE
## [23] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE

We can then pass this into the row selector for mtcars:

## all rows over 24 and all columns
mtcars[over_24, ]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2

This is often shortened to one line:

mtcars[mtcars$mpg > 24, ]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2

And if you only want certain columns, you can add that to the line too:

mtcars[mtcars$mpg > 24, c("mpg","wt")]
##                 mpg    wt
## Merc 240D      24.4 3.190
## Fiat 128       32.4 2.200
## Honda Civic    30.4 1.615
## Toyota Corolla 33.9 1.835
## Fiat X1-9      27.3 1.935
## Porsche 914-2  26.0 2.140
## Lotus Europa   30.4 1.513

These can start to look pretty confusing, but, once you get comfortable with the basic syntax, you will see how things break down. And, it can be useful to build up the final syntax iteratively, much as was done in the example above.

A couple of additional notes on the conditional selections (the use of > above):

  • To test for “equal to,” use a double equals sign: ==
  • To test for “not equal to,” it is not “<>” as you might expect: it’s !=.
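
As a quick illustration of both operators, the sketch below selects cars by their cyl (number of cylinders) column. The variable names six_cyl and not_six are just for this example.

```r
## "equal to": cars with exactly 6 cylinders
six_cyl <- mtcars[mtcars$cyl == 6, c("mpg", "cyl")]
six_cyl
##                 mpg cyl
## Mazda RX4      21.0   6
## Mazda RX4 Wag  21.0   6
## Hornet 4 Drive 21.4   6
## Valiant        18.1   6
## Merc 280       19.2   6
## Merc 280C      17.8   6
## Ferrari Dino   19.7   6

## "not equal to": every car that does not have 6 cylinders
not_six <- mtcars[mtcars$cyl != 6, ]
nrow(not_six)
## [1] 25
```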

Other methods

There is also the which() function, which you may see around. It converts a logical vector into the numeric positions of its TRUE values. In general I would recommend not using it, since it relies on numeric subsetting and can be difficult to debug.
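
In case you do run into which() in someone else’s code, here is a minimal sketch of what it does, reusing the over_24 example from earlier. The variable name positions is only for illustration.

```r
over_24 <- mtcars$mpg > 24

## which() turns the logical vector into the positions of its TRUE values
positions <- which(over_24)
positions
## [1]  8 18 19 20 26 27 28

## subsetting by those positions selects the same rows as the logical vector
identical(rownames(mtcars[positions, ]), rownames(mtcars[over_24, ]))
## [1] TRUE
```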

And, if you are a regular expression junkie, you can use grepl() in your row or column selections (typically, it’s in your row selection). There is also a grep() function, but by default it returns the positions of the matches (or the matched values themselves if you set value = TRUE). If you’re doing a selection, you want a logical vector (TRUEs and FALSEs) saying which rows match your condition, and that is what grepl() returns.
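
As a rough sketch of a grepl() row selection, the example below picks out the mtcars rows whose row name starts with “Merc”. The pattern and the variable name is_merc are just choices for this illustration.

```r
## TRUE for every row name that starts with "Merc"
is_merc <- grepl("^Merc", rownames(mtcars))

## use the logical vector as the row selector
mtcars[is_merc, c("mpg", "cyl")]
##              mpg cyl
## Merc 240D   24.4   4
## Merc 230    22.8   4
## Merc 280    19.2   6
## Merc 280C   17.8   6
## Merc 450SE  16.4   8
## Merc 450SL  17.3   8
## Merc 450SLC 15.2   8
```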

If you have loaded dplyr, then it makes sense to use its select() for columns and filter() for rows instead.
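
For comparison, here is a hedged sketch of the earlier “mpg over 24” selection written with dplyr. It assumes the dplyr package is installed, and the requireNamespace() guard simply skips the example if it is not. The names over_24_rows, dplyr_result and base_result are only for illustration.

```r
## assumes the dplyr package is installed; skipped otherwise
if (requireNamespace("dplyr", quietly = TRUE)) {
  ## filter() picks rows, select() picks columns
  over_24_rows <- dplyr::filter(mtcars, mpg > 24)
  dplyr_result <- dplyr::select(over_24_rows, mpg, wt)

  ## the base R version from earlier in the chapter
  base_result <- mtcars[mtcars$mpg > 24, c("mpg", "wt")]

  ## both approaches should give the same values
  all.equal(dplyr_result$mpg, base_result$mpg)
}
```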

Munging data

So, now that you can subset at will, how does this apply to data munging?

Well, in many cases your data will arrive with values that need changing. Subsetting lets you filter down to just those elements, and you can then reassign them to the values you prefer.

A few other functions are useful to know for these cases:

## Will return TRUE if a value is NA (e.g. imported incorrectly)
is.na(NA)
## [1] TRUE
a_vector <- c(1,2,3,NA,4)
is.na(a_vector)
## [1] FALSE FALSE FALSE  TRUE FALSE
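
Putting is.na() together with logical subsetting, a minimal sketch of replacing missing values might look like this. The replacement value 0 and the name cleaned are arbitrary choices for the example; what you replace NA with depends on your data.

```r
a_vector <- c(1,2,3,NA,4)

## work on a copy so the original stays untouched
cleaned <- a_vector

## is.na() picks out the missing element, and <- assigns its replacement
cleaned[is.na(cleaned)] <- 0
cleaned
## [1] 1 2 3 0 4
```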

Munging Example

Let’s take the previous mtcars data and say we want to cap the mpg values, setting any that are over 24 to 24.

In this case we can filter to the elements we need like before, but this time we modify the data in place using the <- assignment operator:

my_new_data <- mtcars
my_new_data[my_new_data$mpg > 24, "mpg"] <- 24
max(my_new_data[, "mpg"])
## [1] 24