---
title: Factors
author: "Eric C. Anderson"
output:
  html_document:
    toc: yes
  bookdown::html_chapter:
    toc: no
layout: default_with_disqus
---
# Factors {#factor-lecture}
* Goals of this lecture:
1. Go into moderate detail on _factors_ (A tricky little data structure that
probably causes more problems than anything else in R.)
a. What they are / what they look like.
b. Why we talk about them with _data frames_
c. How they behave.
d. Ways that they are useful.
2. In the process we will look at the `table` function and have
some examples from the world of _genetic assignment_ of birds.
## Factor basics {#factor-basics}
Let's reiterate some points/examples from the previous session.
### Factors are vectors that record discrete _categories_
* Anything measured on a discrete scale can be said to fall into
one of a set of categories.
* The discrete scale could be a summary of a continuous scale
+ For example, the categories of _Small_, _Medium_, and _Large_ are (likely) summaries of
a continuous variable like weight or height.
* If you have measured fish and put them into _Small_, _Medium_, and _Large_ categories,
you might have them in a data frame like this:
```{r}
set.seed(17)
sml <- data.frame(ID = paste("Fish", 1:15, sep="_"),
                  SizeCategory = sample(c("Small", "Medium", "Large"), size = 15, replace = TRUE),
                  stringsAsFactors = TRUE  # needed in R >= 4.0.0, where the default is FALSE
                  )
# when you print it out it looks pretty normal
sml
```
### Underlying structure of a _factor_
* The "SizeCategory" column looks like a vector of strings (a character vector),
but it isn't.
* A factor is a class that contains:
1. A _levels_ attribute that maps $N$ categories to the integers $1,\ldots,N$
+ (This sounds more complex than it is. It is just a character vector that gives
an ordered collection of category names)
2. An integer vector of values between 1 and $N$ used to describe the occurrence of the
categories.
* What? If that's not clear, continuing with the `sml` example from above should help clarify things.
### _sml_ data frame's SizeCategory
* We can access the _levels_ attribute of `sml$SizeCategory` like this:
```{r}
levels(sml$SizeCategory)
```
* The order these are in the _levels_ tells us that:
+ 1 = "Large"
+ 2 = "Medium"
+ 3 = "Small"
* And the integer vector part of `sml$SizeCategory` can be visualized by attaching it
on the right side of the `sml` data frame like this:
```{r}
cbind(sml, underlying_integer_vector = unclass(sml$SizeCategory))
```
* (Note that, by default, if categories are named by characters, R sorts them
alphabetically to give them an order in the _levels_ of the factor.)
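To see the default alphabetical ordering next to an explicitly chosen one, here is a small sketch. The vector `size_strings` and the two factor names are made up for illustration and are not part of the fish data; the `levels` argument of `factor()` is standard base R.

```{r}
# hypothetical example vector, separate from the sml data frame
size_strings <- c("Small", "Large", "Medium", "Small")

# default: the levels are the sorted unique values
sizes_default <- factor(size_strings)
levels(sizes_default)       # "Large" "Medium" "Small"
as.integer(sizes_default)   # 3 1 2 3

# explicit: the levels come in the order we ask for
sizes_explicit <- factor(size_strings, levels = c("Small", "Medium", "Large"))
levels(sizes_explicit)      # "Small" "Medium" "Large"
as.integer(sizes_explicit)  # 1 3 2 1
```

The strings are the same in both cases; only the mapping from category to integer differs.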
### How R prints factors
* R prints factors by showing the values as the __strings__ that they represent.
* And, at the bottom, it prints out the _levels_.
+ Or, if there are lots of levels (i.e. categories), it prints only a few of them.
* It looks like this:
```{r}
sml$SizeCategory
```
* So, when you print something and it says `Levels:` on the last line,
you know you are dealing with a factor.
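If you would rather not rely on reading the printed output, you can also ask R directly. This is a minimal sketch using the base functions `is.factor()` and `class()`; the two vectors are hypothetical.

```{r}
# hypothetical vectors for illustration
maybe_factor <- factor(c("a", "b", "a"))
plain_chars  <- c("a", "b", "a")

is.factor(maybe_factor)  # TRUE
is.factor(plain_chars)   # FALSE
class(maybe_factor)      # "factor"
class(plain_chars)       # "character"
```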
## A different example {#same-on-factors}
### Another example Data Frame
We can make some bogus data
```{r}
set.seed(1)
bogus <- data.frame(
  students = rep(c("Devon", "Martha", "Hilary"), 3),
  tests = rep(c("Sep","Oct", "Nov"), each = 3),
  scores = as.integer(runif(9, min = 55, max = 98)),
  stringsAsFactors = TRUE  # needed in R >= 4.0.0, where the default is FALSE
)
bogus # look at it
str(bogus) # see what the types are. Hey there are factors!
```
### Important Note
* Before R 4.0.0, the default behavior of R was to convert character vectors to factors when putting them into a data frame. Since R 4.0.0 you have to ask for this with `stringsAsFactors = TRUE`.
* The column you get in `bogus$students` is the same as is returned by
```{r}
factor(rep(c("Devon", "Martha", "Hilary"), 3))
```
* So, the function `factor()` takes a vector and makes a factor out of it.
### What a factor consists of in R
* Somewhat more tersely and technically than before:
* A factor is a vector with class attribute of `factor` and with another attribute called `levels`
* For a factor `f`:
```{r, eval=FALSE}
levels(f)           # returns the levels of f
levels(f) <- value  # can be used to set/modify the levels attribute of f
```
* `levels(f)` is a _character vector_ that is sorted by default.
* The values of the factor variable itself are integers.
+ The i-th element of the factor vector tells us which level (or category) the i-th observation falls into.
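One way to convince yourself of this description is to assemble a factor from its parts. The sketch below uses base R's `structure()` to attach a `levels` and a `class` attribute to an integer vector; `hand_made` is a hypothetical name, and in real code you would simply call `factor()`.

```{r}
# build a factor "by hand": integer codes + levels attribute + class attribute
hand_made <- structure(c(3L, 1L, 2L, 3L),
                       levels = c("Large", "Medium", "Small"),
                       class = "factor")
hand_made               # prints like any other factor
attributes(hand_made)   # $levels and $class
typeof(hand_made)       # "integer"

# it is the same object that factor() would have built
identical(hand_made, factor(c("Small", "Large", "Medium", "Small")))  # TRUE
```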
### What a Factor Looks like Under the Hood
* One can use the `unclass` function to see the actual parts of an R object
without having them printed in a way that is specific to the object's `class` attribute.
```{r}
bogus$students # printed as a factor
unclass(bogus$students) # printed generically
bogus$tests # printed as a factor
unclass(bogus$tests) # printed generically
```
## Issues and such with factors {#factor-issues}
### You can make R _not_ create factors of character data in data frames
* The `data.frame` function, as well as the `read.table` family of functions,
accept a `stringsAsFactors` parameter. Setting `stringsAsFactors = FALSE` keeps
character columns as character vectors (this is the default since R 4.0.0).
* This can be a reasonable thing to do, since you can always explicitly
make certain columns factors if you want to, using the `factor` function
later.
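As a small illustration of both settings, the sketch below builds the same made-up two-column data frame twice and then converts one column to a factor afterwards. The data frame names and contents are hypothetical.

```{r}
# hypothetical two-column data, built both ways
as_fac <- data.frame(name = c("x", "y"), n = 1:2, stringsAsFactors = TRUE)
as_chr <- data.frame(name = c("x", "y"), n = 1:2, stringsAsFactors = FALSE)

sapply(as_fac, class)  # name is "factor",    n is "integer"
sapply(as_chr, class)  # name is "character", n is "integer"

# convert one column to a factor later, only where you want it
as_chr$name <- factor(as_chr$name)
class(as_chr$name)     # "factor"
```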
### Why does R use factors?
* The idea of factors is central to the fitting of various statistical models.
* However, older versions of R went overboard by squashing any character vector into a factor in a data frame by default.
+ Some of this relates to the fact that prior to a fairly late version of R, coding character vectors
as factors was more space efficient.
* There are numerous hassles and headaches involved in dealing with factors, but factors are here to
stay in R, so we had better get comfortable with them.
* There are also many good things about factors (see later).
### Factors, once made, restrict allowable levels
Example:
```{r}
studentsf <- bogus$students # this is a factor variable
studentsf # print it and see its values and levels
studentsf[c(1,4,7)] # return all the Devon values.
# note that the levels are still all three names
# Now, what if we want to change the name "Devon" to "The Dude"?
studentsf[c(1,4,7)] <- "The Dude" # R gets upset: it warns about an invalid factor level and stores NA
```
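The restriction can be seen in isolation with a tiny made-up factor. This sketch assumes nothing beyond base R; `pets`, `pets_bad`, and `pets_ok` are hypothetical names. It also shows one way around the restriction, which is to extend the levels before assigning.

```{r}
# hypothetical factor for illustration
pets <- factor(c("cat", "dog", "cat"))

# assigning a value that is not a level gives NA plus a warning
pets_bad <- pets
pets_bad[1] <- "fish"
pets_bad              # first element is now <NA>

# extending the levels first makes the assignment legal
pets_ok <- pets
levels(pets_ok) <- c(levels(pets_ok), "fish")
pets_ok[1] <- "fish"
pets_ok               # fish dog cat, with levels cat dog fish
```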
### How can you change values of factors?
* Two main ways:
1. Modify the levels. In this example we will change "Devon" to "The Dude"
```{r}
# Look at bogus$students
bogus$students
# Confirm that Devon is the first element of the levels
levels(bogus$students)
# Change that to "The Dude" using assignment-form indexing
levels(bogus$students)[1] <- "The Dude"
# Now look at the factor
bogus$students
```
2. Coerce the factor to a character vector, modify, and re-`factor()` it
```{r}
# let's change "Martha" to "Martha A"
# what happens when we coerce to character?
as.character(bogus$students)
# OK, so make a variable of that and then modify it
tmp <- as.character(bogus$students)
tmp[tmp == "Martha"] <- "Martha A" # change every occurrence of "Martha" to "Martha A"
# When we turn tmp back into a factor, what does it look like?
factor(tmp)
# OK, cool, we can assign that to bogus$students
bogus$students <- factor(tmp)
# Look at the result:
bogus
```
### Catenating two factors
* What if we have this scenario:
```{r}
# imagine you have two factors
boys_f <- factor(c("Joe", "Ted", "Fred", "Joe"))
girls_f <- factor(c("Anne", "Louise", "Louise", "Lucy", "Louise"))
```
and, further, imagine you want to bung them together into a factor of `kids_f`.
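* In versions of R before 4.1.0, this fails spectacularly:
```{r}
kids_f <- c(boys_f, girls_f)
kids_f
```
Yikes! In those versions, `c()` just catenates the underlying integer vectors!
(Since R 4.1.0, `c()` on factors returns a factor whose levels are the union of the input levels.)
* To get what you want in any version of R: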
1. coerce each to character
2. catenate
3. re-`factor` it
i.e.:
```{r}
kids_f <- factor(c(as.character(boys_f), as.character(girls_f)))
kids_f
# check out the levels:
levels(kids_f)
```
### What about adding rows to data frames?
* Fortunately, if you want to add rows to a data frame,
you can do that with `rbind()` and it will update the factor columns appropriately:
```{r}
extra <- rbind(bogus,
data.frame(students = c("Hilary", "Eve"),
tests = c("Jan", "Sep"),
scores = c(88, 97)
)
)
# what was the result?
extra
# what do the levels look like:
levels(extra$students)
levels(extra$tests)
```
### Factor levels stick around
* Even if you delete all occurrences of a level in a factor vector,
the levels do not _automatically_ change:
```{r}
no.dude <- bogus[ bogus$students != "The Dude", ] # drop Devon (The Dude) and his dudeliness
no.dude # print it out...no "The Dude"
no.dude$students # print that column of students
# whoa-ho! The Dude is still a level...The Dude abides!
# check again
levels(no.dude$students)
```
* If you have subsetted a data frame and you want to get rid of
the extra levels of all the factors, you can do so with `droplevels()`:
```{r}
no.dude2 <- droplevels(no.dude)
no.dude2 # print it
# check the levels
levels(no.dude2$students) # no The Dude!
```
* In many contexts you _will_ want the factor levels to stick around. In others you don't.
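The same behavior holds for a single factor vector outside of a data frame. Here is a minimal sketch with a made-up factor `colors_f`; it shows the lingering level, its zero count in `table()`, and two base R ways of dropping it.

```{r}
# hypothetical factor with three levels: "blue" "green" "red"
colors_f <- factor(c("red", "green", "blue", "red"))
kept <- colors_f[colors_f != "blue"]

levels(kept)              # still includes "blue"
table(kept)               # blue is counted as 0

levels(droplevels(kept))  # "green" "red"
levels(factor(kept))      # same result: re-factoring drops unused levels
```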
### Numeric/Character/Factor Disasters
The most common disaster that can happen with factors occurs when
you think you can get back to a numeric vector by calling `as.numeric()` directly on a factor:
```{r}
# here are some integers
my.nums <- c(1,4,8,10,1,8,8,8,10)
# make them a factor
numf <- factor(my.nums)
# try to recover the original integers
as.numeric(numf) # disaster
# 2 "correct" ways of doing it
as.numeric(as.character(numf)) # coerce to character first, then to numeric
as.numeric(levels(numf)[numf]) # slurp out the levels by numf and coerce
```
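If you do this conversion often, it may help to wrap it in a function so the safe idiom is the only one you type. The helper below, `factor_to_numeric()`, is a hypothetical name and not part of base R; it is just a sketch that converts the levels once and then indexes them by the factor.

```{r}
# factor_to_numeric() is a hypothetical helper, not a base R function
factor_to_numeric <- function(f) {
  stopifnot(is.factor(f))
  # convert the (few) levels once, then index by the integer codes
  as.numeric(levels(f))[f]
}

# made-up measurements that were read in as a factor
temps_f <- factor(c("10.5", "3", "10.5", "-2"))
as.numeric(temps_f)         # integer codes, not the measurements
factor_to_numeric(temps_f)  # 10.5 3.0 10.5 -2.0
```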
## Why factors are super useful! {#factor-utility}
* I am going to go through just one example that involves counting up occurrences
of different categories.
* When counting categories you usually will want to:
1. Record a zero for known categories that had no observations
2. List the categories in a particular order
* Both of these desires can be accommodated by judicious use of _factors_!
1. Because _levels_ "stick around", categories will be counted (as 0) even
if there are no observations of them
2. The _levels_ of a factor can be put in any order desired, and that order
will be used in reporting from many different functions.
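Both points can be seen in a toy example before moving on to real data. The survey answers below are invented for illustration; the only functions used are `factor()` and `table()` from base R.

```{r}
# hypothetical survey answers; nobody answered "never"
answer_levels <- c("never", "sometimes", "often")
answers <- factor(c("often", "sometimes", "often", "often"),
                  levels = answer_levels)

# counts come out in level order, with a 0 for the unobserved level
table(answers)

# the same data as a plain character vector loses both properties
table(as.character(answers))
```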
### The _table()_ function
* `table(x)` gives the number of occurrences of each unique category in `x`.
```{r}
set.seed(2)
x <- sample(letters, size = 100, replace = TRUE)
x # print it
# count the number of occurrences of each letter
table(x)
```
* It also can count the number of occurrences of pairs of categories:
```{r}
set.seed(20)
x <- sample(letters[1:3], size = 10, replace = TRUE)
y <- sample(LETTERS[1:3], size = 10, replace = TRUE)
cbind(x,y) # think of lining up x and y together
# how often do you see the combination a,A or a,B or c,B etc.
table(list(x, y))
```
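As a variation on the two-way count, `table()` also accepts the vectors directly rather than wrapped in a list, and named arguments label the dimensions of the result. The sketch below uses made-up observations (`habitat`, `seen`); `addmargins()` comes from the `stats` package, which is attached by default.

```{r}
# hypothetical paired observations
habitat <- c("forest", "forest", "marsh", "marsh", "forest")
seen    <- c("yes", "no", "yes", "yes", "yes")

# named arguments become the dimension labels of the table
counts <- table(habitat = habitat, seen = seen)
counts
counts["marsh", "yes"]   # 2
addmargins(counts)       # adds row and column totals
```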
### Some sample data from birds
* Example from [Mapping migration in a songbird ...](http://onlinelibrary.wiley.com/doi/10.1111/mec.12977/abstract)
* 393 birds from various _locations_ in the breeding range of Wilson's warbler
![wilson's warbler](http://www.allaboutbirds.org/guide/PHOTO/LARGE/WilsonsWarbler-Vyn_090606_5182.jpg)
* These were genotyped, and _locations_ were lumped into _regions_
* Then we asked how well we could use the genetic data to assign individual
birds from each _location_ to the correct _region_
* Here is what the output looks like (a data frame)
```{r}
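# stringsAsFactors = TRUE is needed in R >= 4.0.0 so that the text columns are read as factors
wiwa <- read.csv("data/bird-self-assignments.csv", row.names=1, stringsAsFactors = TRUE)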
head(wiwa)
# here are the different locations
levels(wiwa$PopulationOfOrigin)
# here are the different regions
levels(wiwa$MaxColumn)
```
### Counting up self-assignments
* We can count how many birds from each location were assigned to which regions using `table()`
```{r}
table(list(wiwa$PopulationOfOrigin, wiwa$MaxColumn))
```
* That is all right, but the locations and regions are not
ordered very sensibly.
+ They are ordered alphabetically.
+ It would be better to order them geographically.
* We can do this by resetting the levels in the order we want:
1. First, get vectors that have all the categories you want in the
order you want them in
```{r}
# a vector of regions in a geographically sensible order
regions_ordered <- c("AK.EastBC.AB", "Wa.To.NorCalCoast", "CentCalCoast", "CalSierra", "Basin.Rockies", "Eastern")
# get a vector of locations in a good order
locations_ordered <- c("wAKDE", "wAKYA", "wAKUG", "wAKJU", "wABCA", "wBCMH", "wWADA",
"wORHA", "wORMB", "wCAEU", "wCAHM", "wCABS", "wCASL", "wCATE", "wCACL", "wCAHU",
"wMTHM", "wOREL", "wCOPP", "wCOGM", "eQCCM", "eONHI", "eNBFR"
)
```
2. Then, this is the magical step: reset the levels to be the ordered vectors of categories you want.
You do this by passing in the ordered vector to the `levels` argument of the `factor()` function:
```{r}
# order the levels of the regions nicely
wiwa$MaxColumn <- factor(wiwa$MaxColumn, levels = regions_ordered)
# order the levels of the locations nicely
wiwa$PopulationOfOrigin <- factor(wiwa$PopulationOfOrigin, levels = locations_ordered)
```
+ __WARNING__ DO NOT DO THIS!
```{r}
levels(wiwa$MaxColumn) <- regions_ordered
levels(wiwa$PopulationOfOrigin) <- locations_ordered
```
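Assigning to `levels()` only renames the existing levels position by position; it does not reorder them.
You have to reconstitute it as a factor with `factor(..., levels = ...)` instead. Otherwise you can
get totally wrong values.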
3. Then use table again, and note the ordering of the categories in the output:
```{r}
table(list(wiwa$PopulationOfOrigin, wiwa$MaxColumn))
```
* Many, many functions use the order of the levels of a factor to determine what order to
output things in (like drawing legends on plots, etc.). So knowing how to
set the order of the levels with `factor(my.factor, levels = my.ord)` is very useful.
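To make the earlier warning about assigning to `levels()` concrete, here is a small sketch on a made-up factor, kept separate from the `wiwa` data. The names `risk`, `risk_wrong`, and `risk_right` are hypothetical. It contrasts renaming the levels in place with rebuilding the factor in the desired order.

```{r}
# hypothetical factor: the default levels are "high" "low" "mid"
risk <- factor(c("low", "high", "mid", "low"))

# WRONG: this renames the levels position by position
risk_wrong <- risk
levels(risk_wrong) <- c("low", "mid", "high")
as.character(risk_wrong)   # "mid" "low" "high" "mid" (the data changed!)

# RIGHT: this reorders the levels and leaves the values alone
risk_right <- factor(risk, levels = c("low", "mid", "high"))
as.character(risk_right)   # "low" "high" "mid" "low"
table(risk_right)          # low 2, mid 1, high 1, in that order
```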