8.4 Dataframe column names
One of the nice things about dataframes is that each column has a name. You can use these names to access specific columns without having to know their column numbers.
To access the names of a dataframe, use the function names(). This will return a character vector with the names of the dataframe’s columns. Let’s use names() to get the names of the ToothGrowth dataframe:
# What are the names of columns in the ToothGrowth dataframe?
names(ToothGrowth)
## [1] "len" "supp" "dose"
To access a specific column in a dataframe by name, you use the $ operator in the form df$name, where df is the name of the dataframe and name is the name of the column you are interested in. This operation will then return the column you want as a vector.
Let’s use the $ operator to get a vector of just the length column (called len) from the ToothGrowth dataframe:
# Return the len column of ToothGrowth
ToothGrowth$len
## [1] 4.2 11.5 7.3 5.8 6.4 10.0 11.2 11.2 5.2 7.0 16.5 16.5 15.2 17.3 22.5 17.3 13.6 14.5 18.8 15.5 23.6 18.5 33.9 25.5 26.4
## [26] 32.5 26.7 21.5 23.3 29.5 15.2 21.5 17.6 9.7 14.5 10.0 8.2 9.4 16.5 9.7 19.7 23.3 23.6 26.4 20.0 25.2 25.8 21.2 14.5 27.3
## [51] 25.5 26.4 22.4 24.5 24.8 30.9 26.4 27.3 29.4 23.0
Because the $ operator returns a vector, you can easily calculate descriptive statistics on columns of a dataframe by applying your favorite vector function (like mean() or table()) to a column using $. Let’s calculate the mean tooth length with mean(), and the frequency of each supplement with table():
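A minimal sketch of those two calls, assuming the built-in ToothGrowth dataset from R's datasets package. The variable names mean_len and supp_counts are only for illustration:

```r
# Mean tooth length, computed on the len column (a numeric vector)
mean_len <- mean(ToothGrowth$len)
mean_len
## [1] 18.81333

# Frequency of each supplement type in the supp column
supp_counts <- table(ToothGrowth$supp)
supp_counts
# OJ and VC each appear 30 times
```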
If you want to access several columns by name, you can forgo the $ operator, and put a character vector of column names in brackets:
# Give me the len AND supp columns of ToothGrowth
head(ToothGrowth[c("len", "supp")])
##    len supp
## 1  4.2   VC
## 2 11.5   VC
## 3  7.3   VC
## 4  5.8   VC
## 5  6.4   VC
## 6 10.0   VC
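The two ways of accessing columns return different kinds of objects, which a short check can make visible. This is only a sketch using the built-in ToothGrowth dataset; len_vector and len_supp are illustrative names:

```r
# $ returns the column itself, as a plain vector
len_vector <- ToothGrowth$len
is.numeric(len_vector)      # TRUE
is.data.frame(len_vector)   # FALSE

# Brackets with a character vector of names return a (smaller) dataframe
len_supp <- ToothGrowth[c("len", "supp")]
is.data.frame(len_supp)     # TRUE
names(len_supp)             # "len" "supp"
```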
8.4.1 Adding new columns
You can add new columns to a dataframe using the $ and assignment <- operators. To do this, just use the df$name notation and assign a new vector of data to it.
For example, let’s create a dataframe called survey with two columns, index and age:
# Create a new dataframe called survey
survey <- data.frame("index" = c(1, 2, 3, 4, 5),
                     "age" = c(24, 25, 42, 56, 22))
survey
##   index age
## 1     1  24
## 2     2  25
## 3     3  42
## 4     4  56
## 5     5  22
Now, let’s add a new column called sex with a vector of sex data:
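A minimal sketch of that assignment. The survey dataframe is recreated first so the snippet runs on its own, and the values in the new vector are taken from the result printed below:

```r
# Recreate survey as defined above
survey <- data.frame("index" = c(1, 2, 3, 4, 5),
                     "age" = c(24, 25, 42, 56, 22))

# Add a new column called sex by assigning a vector to survey$sex
# (the vector needs one value per row of survey)
survey$sex <- c("m", "m", "f", "f", "m")
```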
Here’s the result:
# survey with new sex column
survey
##   index age sex
## 1     1  24   m
## 2     2  25   m
## 3     3  42   f
## 4     4  56   f
## 5     5  22   m
As you can see, survey has a new column with the name sex with the values we specified earlier.
8.4.2 Changing column names
To change the name of a column in a dataframe, just use a combination of the names() function, indexing, and reassignment.
# Change name of 1st column of df to "a"
names(df)[1] <- "a"
# Change name of 2nd column of df to "b"
names(df)[2] <- "b"
For example, let’s change the name of the first column of survey from index to participant.number:
# Change the name of the first column of survey to "participant.number"
names(survey)[1] <- "participant.number"
survey
##   participant.number age sex
## 1                  1  24   m
## 2                  2  25   m
## 3                  3  42   f
## 4                  4  56   f
## 5                  5  22   m
Warning!!!: Change column names with logical indexing to avoid errors!
Now, there is one major potential problem with my method above – I had to manually enter the value of 1. But what if the column I want to change isn’t actually the first column (either because I typed the wrong number or because the order of the columns changed)? I would rename the wrong column without getting any error, which could lead to serious problems later on.
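To see how this can go wrong, here is a small sketch using a hypothetical dataframe, survey2, whose columns are in a different order than expected. It is an illustration only, not part of the running survey example:

```r
# Hypothetical dataframe: same kind of data as survey, but age comes first
survey2 <- data.frame("age" = c(24, 25, 42),
                      "index" = c(1, 2, 3))

# Positional renaming assumes index is column 1,
# so it silently renames the age column instead
names(survey2)[1] <- "participant.number"
names(survey2)   # "participant.number" "index"
```

R raises no error or warning here, so the mislabeled column would only be noticed later, if at all.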
To avoid these issues, it’s better to change column names with a logical vector, using the format names(df)[names(df) == "old.name"] <- "new.name". Here’s how to read this: “Change the names of df, but only where the original name was "old.name", to "new.name".”
Let’s use logical indexing to change the name of the column survey$age to survey$years:
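A minimal sketch of that renaming, following the format given above. The survey dataframe is rebuilt here in its current state (after the earlier steps) so the snippet runs on its own:

```r
# survey as it stands after the earlier steps in this section
survey <- data.frame("participant.number" = c(1, 2, 3, 4, 5),
                     "age" = c(24, 25, 42, 56, 22),
                     "sex" = c("m", "m", "f", "f", "m"))

# Rename only the column whose current name is "age",
# regardless of its position in the dataframe
names(survey)[names(survey) == "age"] <- "years"
names(survey)   # "participant.number" "years" "sex"
```

Because the column is matched by name rather than by position, the same line keeps working if the columns are later reordered. If no column is called "age", nothing is renamed.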