---
title: Factors
author: "Eric C. Anderson"
output:
  html_document:
    toc: yes
  bookdown::html_chapter:
    toc: no
layout: default_with_disqus
---

# Factors {#factor-lecture}

* Goals of this lecture:
    1. Go into moderate detail on _factors_ (a tricky little data structure that probably causes more problems than anything else in R).
        a. What they are / what they look like.
        b. Why we talk about them with _data frames_.
        c. How they behave.
        d. Ways that they are useful.
    2. In the process we will look at the `table` function and have some examples from the world of _genetic assignment_ of birds.

## Factor basics {#factor-basics}

Let's reiterate some points/examples from the previous session.

### Factors are vectors that record discrete _categories_

* Anything measured on a discrete scale can be said to fall into one of a set of categories.
* The discrete scale could be a summary of a continuous scale
    + For example, the categories of _Small_, _Medium_, and _Large_ are (likely) summaries of a continuous variable like weight or height.
* If you have measured fish and put them into _Small_, _Medium_, and _Large_ categories, you might have them in a data frame like this:
    ```{r}
    set.seed(17)
    sml <- data.frame(ID = paste("Fish", 1:15, sep="_"),
                      SizeCategory = sample(c("Small", "Medium", "Large"), size = 15, replace = TRUE),
                      stringsAsFactors = TRUE  # explicit, because the default is FALSE from R 4.0.0 on
                      )
    # when you print it out it looks pretty normal
    sml
    ```

### Underlying structure of a _factor_

* The "SizeCategory" column looks like a vector of strings (a character vector), but it isn't.
* A factor is a class that contains:
    1. A _levels_ attribute that maps $N$ categories to the integers $1,\ldots,N$
        + (This sounds more complex than it is. It is just a character vector that gives an ordered collection of category names.)
    2. An integer vector of values between 1 and $N$ used to describe the occurrence of the categories.
* What?
If that's not clear, continuing with the `sml` example from above should help clarify things.

### _sml_ data frame's SizeCategory

* We can access the _levels_ attribute of `sml$SizeCategory` like this:
    ```{r}
    levels(sml$SizeCategory)
    ```
* The order these are in the _levels_ tells us that:
    + 1 = "Large"
    + 2 = "Medium"
    + 3 = "Small"
* And the integer vector part of `sml$SizeCategory` can be visualized by attaching it on the right side of the `sml` data frame like this:
    ```{r}
    cbind(sml, underlying_integer_vector = unclass(sml$SizeCategory))
    ```
* (Note that, by default, if categories are named by characters, R sorts them alphabetically to give them an order in the _levels_ of the factor.)

### How R prints factors

* R prints factors by showing each value as the __string__ (the level name) it stands for, not as the underlying integer.
* And, at the bottom, it prints out the _levels_
    + Or, if there are lots of levels (i.e., categories), it prints just a few of them
* It looks like this:
    ```{r}
    sml$SizeCategory
    ```
* So, when you print something and it says `Levels:` on the last line, you know you are dealing with a factor.

## A different example {#same-on-factors}

### Another example Data Frame

We can make some bogus data

```{r}
set.seed(1)
bogus <- data.frame(
  students = rep(c("Devon", "Martha", "Hilary"), 3),
  tests = rep(c("Sep","Oct", "Nov"), each = 3),
  scores = as.integer(runif(9, min = 55, max = 98)),
  stringsAsFactors = TRUE  # explicit, because the default is FALSE from R 4.0.0 on
  )
bogus  # look at it
str(bogus)  # see what the types are.  Hey there are factors!
```

### Important Note

* Before R 4.0.0, the default behavior of R was to convert character vectors to factors when putting them into a data frame. From R 4.0.0 on, the default is `stringsAsFactors = FALSE`, so you have to pass `stringsAsFactors = TRUE` to get this behavior.
* The column you get in `bogus$students` is the same as is returned by
    ```{r}
    factor(rep(c("Devon", "Martha", "Hilary"), 3))
    ```
* So, the function `factor()` takes a vector and makes a factor vector out of it

### What a factor consists of in R

* Somewhat more tersely and technically than before:
* A factor is a vector with class attribute of `factor` and with another attribute called `levels`
* For a factor `f`:
    ```{r, eval=FALSE}
    levels(f)  # returns the levels of f
    levels(f) <- new_levels  # sets/modifies the levels attribute of f (new_levels is a placeholder character vector)
    ```
* `levels(f)` is a _character vector_ that will be sorted by default.
* The values of the factor variable itself are integers.
    + The i-th element of the factor vector tells us which level (or category) the i-th observation falls into.

### What a Factor Looks like Under the Hood

* One can use the `unclass` function to see the actual parts of an R object without having them printed in a way that is specific to the object's `class` attribute.
    ```{r}
    bogus$students  # printed as a factor
    unclass(bogus$students)  # printed generically

    bogus$tests  # printed as a factor
    unclass(bogus$tests)  # printed generically
    ```

## Issues and such with factors {#factor-issues}

### You can make R _not_ create factors of character data in data frames

* The `data.frame` function, as well as the `read.table` family of functions, accepts a `stringsAsFactors` parameter.
* Setting `stringsAsFactors = FALSE` keeps character columns as character vectors. This can be a reasonable thing to do, since you can always explicitly make certain columns factors later, using the `factor` function.

### Why does R use factors?

* The idea of factors is central to the fitting of various statistical models.
* However, R seems to go overboard by wanting to squash any character vector into a factor in a data frame.
    + Some of this relates to the fact that prior to a fairly late version of R, coding character vectors as factors was more space efficient.
* There are numerous hassles and headaches involved in dealing with factors, but factors are here to stay in R, so we had better get comfortable with them.
* There are also many good things about factors (see later).

### Factors, once made, restrict allowable levels

Example:

```{r}
studentsf <- bogus$students  # this is a factor variable
studentsf  # print it and see its values and levels

studentsf[c(1,4,7)]  # return all the Devon values.
# note that the levels are still all three names

# Now, what if we want to change the name "Devon" to "The Dude"?
studentsf[c(1,4,7)] <- "The Dude"  # R gets upset when you do this: it warns about an invalid factor level and inserts NAs
```

### How can you change values of factors?

* Two main ways:
    1. Modify the levels. In this example we will change "Devon" to "The Dude"
        ```{r}
        # Look at bogus$students
        bogus$students

        # Confirm that Devon is the first element of the levels
        levels(bogus$students)

        # Change that to "The Dude" using assignment-form indexing
        levels(bogus$students)[1] <- "The Dude"

        # Now look at the factor
        bogus$students
        ```
    2. Coerce the factor to a character vector, modify, and re-`factor()` it
        ```{r}
        # let's change "Martha" to "Martha A"
        # what happens when we coerce to character?
        as.character(bogus$students)

        # OK, so make a variable of that and then modify it
        tmp <- as.character(bogus$students)
        tmp[tmp == "Martha"] <- "Martha A"  # change every occurrence of "Martha" to "Martha A"

        # When we turn tmp back into a factor, what does it look like?
        factor(tmp)

        # OK, cool, we can assign that to bogus$students
        bogus$students <- factor(tmp)

        # Look at the result:
        bogus
        ```

### Catenating two factors

* What if we have this scenario:
    ```{r}
    # imagine you have two factors
    boys_f <- factor(c("Joe", "Ted", "Fred", "Joe"))
    girls_f <- factor(c("Anne", "Louise", "Louise", "Lucy", "Louise"))
    ```
    and, further, imagine you want to bung them together into a factor of `kids_f`.
* This fails spectacularly in R versions before 4.1.0 (from R 4.1.0 on, `c()` combines factors into one factor whose levels are the union of the inputs' levels):
    ```{r}
    kids_f <- c(boys_f, girls_f)
    kids_f
    ```
    Yikes!
It has just catenated the underlying integer vectors!
* To get what you want (this works in any version of R):
    1. coerce each to character
    2. catenate
    3. re-`factor` it

    i.e.:
    ```{r}
    kids_f <- factor(c(as.character(boys_f), as.character(girls_f)))
    kids_f
    # check out the levels:
    levels(kids_f)
    ```

### What about adding rows to data frames?

* Fortunately, if you want to add rows to a data frame, you can do that with `rbind()` and it will update the factor columns appropriately:
    ```{r}
    extra <- rbind(bogus,
                   data.frame(students = c("Hilary", "Eve"),
                              tests = c("Jan", "Sep"),
                              scores = c(88, 97)
                              )
                   )
    # what was the result?
    extra

    # what do the levels look like:
    levels(extra$students)
    levels(extra$tests)
    ```

### Factor levels stick around

* Even if you delete all occurrences of a level in a factor vector, the levels do not _automatically_ change:
    ```{r}
    no.dude <- bogus[ bogus$students != "The Dude", ]  # drop Devon (The Dude) and his dudeliness
    no.dude  # print it out...no "The Dude"
    no.dude$students  # print that column of students
    # whoa-ho! The Dude is still a level...The Dude abides!
    # check again
    levels(no.dude$students)
    ```
* If you have subsetted a data frame and you want to get rid of the extra levels of all the factors, you can do so with `droplevels()`:
    ```{r}
    no.dude2 <- droplevels(no.dude)
    no.dude2  # print it
    # check the levels
    levels(no.dude2$students)  # no The Dude!
    ```
* In many contexts you _will_ want the factor levels to stick around. In others you don't.
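
The same behavior can be seen on a bare factor, without any data frame around it. This is only a minimal sketch: the names `f`, `kept`, and `dropped` are made up for the illustration and are not part of the lecture data.

```{r}
# a toy factor (hypothetical, for illustration only)
f <- factor(c("a", "b", "c", "a"))

# subsetting removes every "c" value, but "c" remains a level
kept <- f[f != "c"]
levels(kept)
table(kept)  # "c" is still reported, with a count of zero

# droplevels() also works on a single factor and discards unused levels
dropped <- droplevels(kept)
levels(dropped)
```

Whether you want `kept` or `dropped` depends on whether a zero count for the missing category is meaningful in your analysis.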

### Numeric/Character/Factor Disasters

The most common disaster that can happen with factors occurs when you think you can get back to a numeric vector by coercing a factor with `as.numeric()`:

```{r}
# here are some integers
my.nums <- c(1,4,8,10,1,8,8,8,10)

# make them a factor
numf <- factor(my.nums)

# try to recover the original integers
as.numeric(numf)  # disaster: you get the underlying integer codes, not the original values

# 2 "correct" ways of doing it
as.numeric(as.character(numf))  # coerce to character first, then to numeric
as.numeric(levels(numf)[numf])  # slurp out the levels by numf and coerce
```

## Why factors are super useful! {#factor-utility}

* I am going to go through just one example that involves counting up occurrences of different categories.
* When counting categories you usually will want to:
    1. Record a zero for known categories that had no observations
    2. List the categories in a particular order
* Both of these desires can be accommodated by judicious use of _factors_!
    1. Because _levels_ "stick around", categories will be counted (as 0) even if there are no observations of them
    2. The _levels_ of a factor can be put in any order desired, and that order will be used in reporting from many different functions.

### The _table()_ function

* `table(x)` gives the number of occurrences of each unique category in `x`.
    ```{r}
    set.seed(2)
    x <- sample(letters, size = 100, replace = TRUE)
    x  # print it
    # count the number of each occurrence
    table(x)
    ```
* It also can count the number of occurrences of pairs of categories:
    ```{r}
    set.seed(20)
    x <- sample(letters[1:3], size = 10, replace = TRUE)
    y <- sample(LETTERS[1:3], size = 10, replace = TRUE)
    cbind(x,y)  # think of lining up x and y together
    # how often do you see the combination a,A or a,B or c,B etc.
    table(list(x, y))
    ```

### Some sample data from birds

* Example from [Mapping migration in a songbird ...](http://onlinelibrary.wiley.com/doi/10.1111/mec.12977/abstract)
* 393 birds from various _locations_ in the breeding range of Wilson's warbler
    ![wilson's warbler](http://www.allaboutbirds.org/guide/PHOTO/LARGE/WilsonsWarbler-Vyn_090606_5182.jpg)
* These were genotyped, and _locations_ were lumped into _regions_
* Then we asked how well we could use the genetic data to assign individual birds from each _location_ to the correct _region_
* Here is what the output looks like (a data frame)
    ```{r}
    # stringsAsFactors = TRUE is explicit because the default is FALSE from R 4.0.0 on
    wiwa <- read.csv("data/bird-self-assignments.csv", row.names=1, stringsAsFactors = TRUE)
    head(wiwa)

    # here are the different locations
    levels(wiwa$PopulationOfOrigin)

    # here are the different regions
    levels(wiwa$MaxColumn)
    ```

### Counting up self-assignments

* We can count how many birds from each location were assigned to which regions using `table()`
    ```{r}
    table(list(wiwa$PopulationOfOrigin, wiwa$MaxColumn))
    ```
* That is all right, but the locations and regions are not ordered very sensibly.
    + They are ordered alphabetically.
    + It would be better to order them geographically.
* We can do this by resetting the levels in the order we want:
    1. First, get vectors that have all the categories you want in the order you want them in
        ```{r}
        # a vector of regions in a geographically sensible order
        regions_ordered <- c("AK.EastBC.AB", "Wa.To.NorCalCoast", "CentCalCoast",
                             "CalSierra", "Basin.Rockies", "Eastern")

        # get a vector of locations in a good order
        locations_ordered <- c("wAKDE", "wAKYA", "wAKUG", "wAKJU", "wABCA", "wBCMH",
                               "wWADA", "wORHA", "wORMB", "wCAEU", "wCAHM", "wCABS",
                               "wCASL", "wCATE", "wCACL", "wCAHU", "wMTHM", "wOREL",
                               "wCOPP", "wCOGM", "eQCCM", "eONHI", "eNBFR"
                               )
        ```
    2. Then, this is the magical step: reset the levels to be the ordered vectors of categories you want.
You do this by passing in the ordered vector to the `levels` argument of the `factor()` function:
        ```{r}
        # order the levels of the regions nicely
        wiwa$MaxColumn <- factor(wiwa$MaxColumn, levels = regions_ordered)

        # order the levels of the locations nicely
        wiwa$PopulationOfOrigin <- factor(wiwa$PopulationOfOrigin, levels = locations_ordered)
        ```
        + __WARNING__ DO NOT DO THIS!
        ```{r}
        levels(wiwa$MaxColumn) <- regions_ordered
        levels(wiwa$PopulationOfOrigin) <- locations_ordered
        ```
        You have to reconstitute it as a factor, with `factor()`, to change the order of the levels. Assigning to `levels()` only renames the existing levels position by position, so the observations get relabeled and you can get totally wrong values.
    3. Then use table again, and note the ordering of the categories in the output:
        ```{r}
        table(list(wiwa$PopulationOfOrigin, wiwa$MaxColumn))
        ```
* Many, many functions use the order of the levels of a factor to determine what order to output things in (like drawing legends on plots, etc.). So knowing how to set the order of the levels with `factor(my.factor, levels = my.ord)` is very useful.
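
To see why the warning above matters, here is a minimal sketch on a toy factor. The names `sizes`, `wanted`, `reordered`, and `relabeled` are invented for this illustration and have nothing to do with the warbler data. `factor(x, levels = ...)` reorders the levels while keeping each observation in its category, whereas `levels(x) <- ...` overwrites the level names position by position.

```{r}
# toy factor; by default the levels are sorted alphabetically
sizes <- factor(c("Small", "Large", "Medium", "Small"))
levels(sizes)

# the order we would like the levels to be in
wanted <- c("Small", "Medium", "Large")

# the right way: rebuild the factor with the desired level order
reordered <- factor(sizes, levels = wanted)
levels(reordered)
as.character(reordered)  # the observations are unchanged

# the wrong way: this renames level 1 to "Small", level 2 to "Medium", level 3 to "Large"
relabeled <- sizes
levels(relabeled) <- wanted
as.character(relabeled)  # "Small" and "Large" observations have swapped labels
```

Comparing `as.character(reordered)` with `as.character(relabeled)` is a quick check you can run on your own data before trusting a re-leveled factor.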