This chapter looks at how to most easily access your data once they are in R objects.
Note that this chapter uses “Base R”, but you may prefer the dplyr package when working with data.frames, which does similar things in an (arguably) easier-to-understand syntax.
But, it’s worth knowing how to do this without relying on a package if possible.
Most R objects can have their individual elements reached via their numeric position, using square brackets [ ]:
a_vector <- letters
a_vector[1]
## [1] "a"
a_vector[5]
## [1] "e"
a_vector[1:5]
## [1] "a" "b" "c" "d" "e"
a_vector[c(1,5)]
## [1] "a" "e"
But in a data.frame you also have to worry about the second dimension.
In this case, the [ ] notation is extended to include a comma: [ , ].
The value before the comma indicates which row; the value after it indicates which column.
Note: This is sort of like R1C1 notation in Excel…except with a comma!
my_data <- mtcars
## the second row
my_data[2, ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
## the first column
my_data[ , 1]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
## the second row, first column
my_data[2, 1]
## [1] 21
## the second and third rows, and the first and fourth columns
my_data[2:3, c(1,4)]
## mpg hp
## Mazda RX4 Wag 21.0 110
## Datsun 710 22.8 93
Subsetting by numbers assumes the rows and columns are always in the same order, which can be dangerous. It is safer to use the name, if you know it:
## First 5 rows of the "mpg" column
my_data[1:5, "mpg"]
## [1] 21.0 21.0 22.8 21.4 18.7
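To see why positions are fragile, here is a small sketch. The object name shuffled is made up for this illustration; it is just mtcars with the column order reversed. Position 1 now points at a different column, while the name "mpg" still finds the same data.

```r
## hypothetical copy of mtcars with the columns in reverse order
shuffled <- mtcars[ , rev(names(mtcars))]

## position 1 no longer points at mpg
names(shuffled)[1]
## [1] "carb"
identical(shuffled[ , 1], mtcars[ , 1])
## [1] FALSE

## the name still finds the same values
identical(shuffled[ , "mpg"], mtcars[ , "mpg"])
## [1] TRUE
```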
You can also use multiple columns:
my_data[1:5, c("mpg","gear")]
## mpg gear
## Mazda RX4 21.0 4
## Mazda RX4 Wag 21.0 4
## Datsun 710 22.8 4
## Hornet 4 Drive 21.4 3
## Hornet Sportabout 18.7 3
You can also reorder or repeat columns (repeated columns are renamed via make.unique() to avoid clashes):
my_data[1:5, c("gear","mpg", "mpg")]
## gear mpg mpg.1
## Mazda RX4 4 21.0 21.0
## Mazda RX4 Wag 4 21.0 21.0
## Datsun 710 4 22.8 22.8
## Hornet 4 Drive 3 21.4 21.4
## Hornet Sportabout 3 18.7 18.7
Note that if you subset a list or data.frame with a single index inside [ ] (for example my_data["mpg"]), it will return a list or data.frame back.
If you instead want the column vector itself, use [[ ]], which returns what's in the list/data.frame column.
This is a confusing topic. It’s right up there with stringsAsFactors = FALSE. This is where the console comes in handy when you’re trying to make sure you have your syntax correct.
my_data[["mpg"]]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
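One way to keep the two bracket forms straight is to check the class of what comes back. The following is a minimal sketch using the same my_data copy of mtcars; it also shows the drop = FALSE argument, which stops the [row, column] form from simplifying a single column to a vector.

```r
my_data <- mtcars

## single brackets with one index keep the data.frame wrapper
class(my_data["mpg"])
## [1] "data.frame"

## double brackets return the column itself
class(my_data[["mpg"]])
## [1] "numeric"

## the [row, column] form simplifies a single column to a vector...
class(my_data[ , "mpg"])
## [1] "numeric"

## ...unless you ask it not to with drop = FALSE
class(my_data[ , "mpg", drop = FALSE])
## [1] "data.frame"
```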
Character subsetting is how the lookup examples work. The only difference is that, this time, it’s not a data.frame but a vector that has been given names using the names() function.
## example with some user Ids
lookup <- c("Bill","Ben","Sue","Linda","Gerry")
names(lookup) <- c("1231","2323","5353","3434","9999")
lookup
## 1231 2323 5353 3434 9999
## "Bill" "Ben" "Sue" "Linda" "Gerry"
## this is a big vector of Ids you want to look up
big_list_of_ids <- c("2323","2323","3434","9999","9999","1231","5353","9999","2323","1231","9999")
## subset lookup by your vector of ids; repeated ids give repeated elements
lookup[big_list_of_ids]
## 2323 2323 3434 9999 9999 1231 5353 9999 2323
## "Ben" "Ben" "Linda" "Gerry" "Gerry" "Bill" "Sue" "Gerry" "Ben"
## 1231 9999
## "Bill" "Gerry"
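It is worth knowing what happens when an Id is not in the lookup. In the sketch below, "0000" is a made-up Id and some_ids is a hypothetical vector used only for illustration; the unmatched Id comes back as NA rather than raising an error, so it can be checked with is.na().

```r
lookup <- c("Bill","Ben","Sue","Linda","Gerry")
names(lookup) <- c("1231","2323","5353","3434","9999")

## "0000" is a made-up Id that is not in the lookup
some_ids <- c("2323", "0000", "1231")

## unname() drops the names so only the looked-up values remain
found <- unname(lookup[some_ids])

## TRUE marks the Id that could not be matched
is.na(found)
## [1] FALSE  TRUE FALSE
```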
You can also find columns via the $ operator on lists and data.frames:
my_data$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
This $ is a shortcut to subsetting via a character name:
my_data[["mpg"]]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
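A practical difference between the two forms shows up when the column name is stored in a variable. This is a small sketch; col_name is a hypothetical variable introduced for the example.

```r
my_data <- mtcars

## both forms return the same vector
identical(my_data$mpg, my_data[["mpg"]])
## [1] TRUE

## col_name is a hypothetical variable holding the column name
col_name <- "mpg"

## [[ ]] uses the value stored in col_name
my_data[[col_name]][1:5]
## [1] 21.0 21.0 22.8 21.4 18.7

## $ looks for a column literally called "col_name", which does not exist
my_data$col_name
## NULL
```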
You can then apply subsetting to the result to get specific values:
## first 5 elements of mpg column
my_data$mpg[1:5]
## [1] 21.0 21.0 22.8 21.4 18.7
You can also subset using TRUE and FALSE. This is a good way to select rows.
For instance, suppose we want to select all rows with a value over 24 in the mpg column of mtcars.
We first construct the logical vector:
## we first make a vector that is TRUE for every mpg element over 24, and FALSE otherwise
over_24 <- mtcars$mpg > 24
over_24
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
## [23] FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
We can then pass this into the row selector for mtcars:
## all rows with mpg over 24, and all columns
mtcars[over_24, ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
This is often shortened to one line:
mtcars[mtcars$mpg > 24, ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
And if you only wanted certain columns, you can add that to the line too:
mtcars[mtcars$mpg > 24, c("mpg","wt")]
## mpg wt
## Merc 240D 24.4 3.190
## Fiat 128 32.4 2.200
## Honda Civic 30.4 1.615
## Toyota Corolla 33.9 1.835
## Fiat X1-9 27.3 1.935
## Porsche 914-2 26.0 2.140
## Lotus Europa 30.4 1.513
These can start to look pretty confusing but, once you get comfortable with the basic syntax, you will see how things break down. It can also be useful to build up the final syntax iteratively, much as was done in the example above.
A couple of additional notes on the conditional selections (the use of > above):
Use == to test for equality and != to test for inequality.
There is also the function which() that you may see around, but in general I would recommend not using it, since it relies on numeric subsetting and can be difficult to debug.
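As a brief sketch of these operators on mtcars (the object names four_cyl and not_four_cyl are made up for the example), the following selects rows by equality and inequality, and shows what which() gives back for the earlier mpg condition:

```r
## rows where cyl is exactly 4
four_cyl <- mtcars[mtcars$cyl == 4, ]
nrow(four_cyl)
## [1] 11

## rows where cyl is anything other than 4
not_four_cyl <- mtcars[mtcars$cyl != 4, ]
nrow(not_four_cyl)
## [1] 21

## which() turns the logical vector into numeric row positions
which(mtcars$mpg > 24)
## [1]  8 18 19 20 26 27 28
```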
And, if you are a regular expression junkie, you can use grepl() in your row or column selections (typically, it’s in your row selection). There is a grep() function that returns the positions of the matches (or the matched values themselves with value = TRUE), but, if you’re doing a selection, you actually want a logical vector (TRUEs and FALSEs) saying which rows match your condition…and that is what grepl() does.
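For example, the row names of mtcars hold the car models, so a pattern can pick out one manufacturer. This is a minimal sketch; is_merc is a name chosen for the illustration.

```r
## logical vector: TRUE where the row name starts with "Merc"
is_merc <- grepl("^Merc", rownames(mtcars))

## use it as the row selector, keeping two columns
merc_rows <- mtcars[is_merc, c("mpg", "cyl")]
nrow(merc_rows)
## [1] 7

## grep() gives the positions of the same matches instead
grep("^Merc", rownames(mtcars))
## [1]  8  9 10 11 12 13 14
```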
If you have loaded dplyr then it makes sense to use its select() for columns and filter() for rows instead.
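For comparison, here is a rough dplyr version of the earlier mtcars[mtcars$mpg > 24, c("mpg","wt")] selection. It assumes the dplyr package is installed and is skipped otherwise; the object names over_24_rows and mpg_wt are invented for the sketch.

```r
## assumes the dplyr package is installed; the block is skipped otherwise
if (requireNamespace("dplyr", quietly = TRUE)) {
  ## filter() picks the rows
  over_24_rows <- dplyr::filter(mtcars, mpg > 24)

  ## select() picks the columns
  mpg_wt <- dplyr::select(over_24_rows, mpg, wt)
  print(mpg_wt)
}
```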
So, now that you can subset at will, how does this apply to data munging?
Well, in many cases your data will come with elements that you need to change, and you first need to filter down to them. You can then reassign those values to what you prefer.
A few other functions are useful to know for these cases:
## Will return TRUE if a value is NA (e.g. imported incorrectly)
is.na(NA)
## [1] TRUE
a_vector <- c(1,2,3,NA,4)
is.na(a_vector)
## [1] FALSE FALSE FALSE TRUE FALSE
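Because is.na() returns a logical vector, it can go straight into the square brackets. The sketch below first drops the NA and then overwrites it; the replacement value 0 is an arbitrary choice for illustration.

```r
a_vector <- c(1, 2, 3, NA, 4)

## keep only the elements that are not NA
a_vector[!is.na(a_vector)]
## [1] 1 2 3 4

## or overwrite the NA elements (0 is an arbitrary choice here)
a_vector[is.na(a_vector)] <- 0
a_vector
## [1] 1 2 3 0 4
```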
Let’s take the previous mtcars data and say we want to set all the mpg values to 24 if they are over 24.
In this case we can filter to the elements we need like before, but this time we modify the data in place using the <- assignment operator:
my_new_data <- mtcars
my_new_data[my_new_data$mpg > 24, "mpg"] <- 24
max(my_new_data[, "mpg"])
## [1] 24
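The same replacement can also be written against the column vector with $, which some people find easier to read. This is a sketch of an equivalent form, using a hypothetical copy called capped; it also confirms that the original mtcars is left untouched because the change was made on a copy.

```r
## capped is a hypothetical copy used for this illustration
capped <- mtcars

## subset the mpg vector itself, then assign into the selected elements
capped$mpg[capped$mpg > 24] <- 24
max(capped$mpg)
## [1] 24

## the original data still has its old maximum
max(mtcars$mpg)
## [1] 33.9
```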