---
title: Data frames
author: "Eric C. Anderson"
output:
  html_document:
    toc: yes
  bookdown::html_chapter:
    toc: no
layout: default_with_disqus
---

```{r setup, echo=FALSE, eval=TRUE}
# PLEASE DO NOT EDIT THIS CODE BLOCK
library(knitr)
library(rrhw)


# tell knitr where to find the inserted file in case
# jekyll is building this in the top directory of the repo
opts_knit$set(child.path = paste(prj_dir_containing("rep-res-course.Rproj"), "extras/knitr_children/", sep=""))
opts_knit$set(collapse = TRUE)
opts_knit$set(comment = "#>")


init_homework("Data Frames and Factors Lecture")
rr_github_name <- NA
rr_pull_request_time <- NA
rr_question_chunk_name <- "NotSet"
rr_branch_name <- "ex-test"
rr_hw_file_name <- "exercises/not-important-here.rmd"
```


# Data frames and factors {#data-frame-lecture} 


* Goals of this lecture:
    1. Introduce _data frames_! (Possibly the most widely-used and useful data structure in R)
        a. What is a data frame?
        b. Making data frames
        e. Viewing data frames in RStudio
        b. Indexing data frames
        c. Reading in data frames
        d. Writing data frames
    2. Introduce, very briefly, _factors_ (A tricky little data structure that probably causes more problems than 
    anything else in R.)
        a. What they are / what they look like.
        b. Why we talk about them with _data frames_
        c. How they behave.
        d. Ways that they are useful.


## Data frames basics {#data-frame-basics}

### What's a data frame?

* A _data frame_ is a __list__ that:
    + has the _class_ `data.frame`
    + has _components_ that are all atomic vectors __of the same length__.
* Think of them as a _table_ of data.  Where:
    + The rows are _records_ and
    + The columns are the atomic vectors that contain values of variables.
* Probably 90% of the time (or more), what someone might call a _data set_ is something
that can be represented in R as a _data frame_.
* Example:
    ```{r}
    d <- data.frame(
      age = c(4, 6, 3, 4), 
      sex = c("MALE", "FEMALE", "FEMALE", "MALE"), 
      height.inches = c(40, 49, 38, 42), 
      favorite.sport.or.activity = c("soccer", "soccer", "martial_arts", "ballet")
    )
    # now, print it to the screen
    d
    ```
* This thing is shaped like a _matrix_ and can be indexed in special ways (below), but at its core it is a //list//.

### The _data.frame()_ function

* Syntactically, this is like the `list()` function, taking "key=value" pairs. 
    + For example, the first component has the "key", `age` and the "value" `c(4, 6, 3, 4)`. 
    + The _keys_ become the _names_ attribute of the data frame.
* But, returns a `data.frame`:
    ```{r}
    class(d)
    ```


### The _names_ / _colnames_ of a _data frame_

* The _names_ attribute of a data frame holds the "column headers"
    ```{r}
    names(d)
    ```
These can also be accessed as the `colnames` (column names):
    ```{r}
    colnames(d)
    ```
Which begs the question, are there _row_names of a data frame? Let's try:
    ```{r}
    rownames(d)
    ```

### The _rownames_ of a _data frame_

* You can assign _names_ to the rows of a _data frame_.
* Use the `rownames()` function. For example:
    ```{r}
    rownames(d) <- c("Jon", "Scarlett", "Nancy", "Terry")
    # then print it out again:
    d
    ```
* _rownames_ have to be unique!
    ```{r, error=TRUE}
    rownames(d) <- c("Jon", "Scarlett", "Nancy", "Jon")
    ```
* ...and the right length, too:
    ```{r, error=TRUE}
    rownames(d) <- c("Jon", "Scarlett")
    ```
* If you don't provide them, they will be integers `1:nrow(df)`

### Dimensions of a data frame

* A useful summary of the extent of a data frame is `dim`.  Likewise `ncol` and `nrow`
    ```{r}
    dim(d)
    nrow(d)
    ncol(d)
    ```

## Data frame indexing {#data-frame-indexing}

* _data frames_ can be indexed like _lists_ or like _matrices_

### Data frame indexing like a list

* Single-chome extractor `[  ]` with a _single vector_ and _no commas_ picks out the columns, and returns it as another data.frame:           
    ```{r}
    # index with integers
    d[c(1,3)]

    # index with colnames
    d[c("age", "sex")]
    ``` 
Note that the _rownames_ get carried along with the result.
* Two-chomp extractor `$` returns the vector itself. (Naked, not as part of a data frame)
    ```{r}
    d$age
    
    d$height.inches
    ```
*Two-chomp extractor `[[ ]]` does the same as the `$` but doesn't do prefix-matching
    ```{r}
    d[["age"]]
    ```
The _rownames_ don't come along with the result.

### Matrix-like indexing of _data frames_ 

* This is new thing! Subset with _two vectors_ separated by a _comma_!
* i.e., `[row, col]` where:
    + `rows` is an indexing vector for the rows of row indices or _rownames_ or _logical_ values
    + `cols` is an indexing vector for the columns indices, or _colnames_ or logical values
    + And...(big note!) the absence of `rows` or `cols` means "give me all of them" d[1:2,]
* `rows` and `cols` can be:
    + positive integer vectors,
    + negative interger vectors,
    + character vectors of names, 
    + logical vectors 
    + (or mixtures of the two, i.e. `rows` as one and `cols` as another
* Examples:
    ```{r}
    d[,]  # the whole data frame

    d[,1:3] # all rows, first three columns

    d[c(1,4), ] # first and fourth rows, all columns

    d[-1, -2] # all rows except 1 and all columns except 2

    d[d$sex == "MALE", c("age", "favorite.sport.or.activity")] # age and favorite activities of MALES

    d[d$sex == "FEMALE", c(1,3)] # ages and heights of  FEMALES

    d[d$age == 3, ] # all columns from the one three-year-old
    ```

### Whoa! What happens when _[rows, cols]_ picks out a single column?

* Beware, if your `[rows, cols]` extractor picks out just a _single column_, then by default,
R will just return a (unnamed) vector, not a data frame!
    ```{r}
    # ages of Jon and Terry... What! Where's my data frame?
    d[c("Jon", "Terry"), "age"]  
    ```
* When you want to get a one-column data frame rather than a naked vector, do this:
    ```{r}
    d[c("Jon", "Terry"), "age", drop = FALSE]
    ```
* This is __super-important__ if you are writing functions that grab variable
numbers of columns out of data frames (or matrices)

### Replacement form indexing

* All these indexing measures have replacement forms:
    ```{r}
    # change Terry's favorite activity to soccer
    d["Terry", 4] <- "paint-ball"
    d # print it

    # what if we tried to change it to "mushroom hunting"?
    d["Terry", 4] <- "mushroom hunting"
    d
    ```
Surprise!  What happened?  (Wait till we talk about _factors_ later.)
* Assigning values to columns will recycle to the right length:
    ```{r}
    # make them all five years old...
    d$age <- 5
    d
    ```
## Reading, viewing, and writing data frames {#read-view-write}

* Hooray! We are _finally_ learning what to do to get _our own_ data into R!
* We'll use some data from Big Creek for examples
    + You should pull the _master_ branch of https://github.com/eriqande/rep-res-course.git to get a file in the `data` directory.
    + Then go ahead and open up R Studio in that repository if you want to follow along.
* I have the first 100 lines of the big-creek data set in the `data` directory in both
    + `.xlsx` format (Ahhh! This is just here if you want to see it.  Remember, never house and manipulate the sole copy
    of your data in Excel!)
    + `.csv` format (comma separate values --- a decent format for reading into R)
* Rather than opening .csv files in Excel to look at them, it's possible to just look at them if they are on GitHub.
Try [this link](https://github.com/eriqande/rep-res-course/blob/master/data/big_creek_excerpt.csv). 
    
### _read.table()_ 

* A function that reads in "table-shaped" data and returns a _data frame_
* `read.table()` is a rather generic function, that lets you specify:
    + `file` : the name of the file
    + `header` : TRUE/FALSE depending on it the file has a header row for the columns
    + `sep` : the character used to separate columns
    + `row.names` : column number holding the values to be used for rownames
    + `na.strings` : what strings signify values that should be read as `NA`
And many, many others.  Do `?read.table` for the complete list.

### _read.csv()_

* A function identical to `read.table()` except that the default values are set up
to read in CSV files (like those produced by Excel...)
* Let's try it:
    ```{r}
    bc <- read.csv("data/big_creek_excerpt.csv", stringsAsFactors = FALSE, na.strings = c(""))
    ```
* We are using two extra options:
    + `stringsAsFactors = FALSE` (see next lecture)
    + `na.strings = c("")` : This means count empty cells as missing data
* Did that work?  Check the `dim` of bc:
    ```{r}
    dim(bc)
    ```
Sweet!

### Looking at our data frame

* To figure out what is in our data frame, there are several options.
    1. Just print it: `bc`.  If the data frame is large, this produces a bunch of hard to read output
        + All rows at as many columns as can fit on the screen...then the next set of columns, etc.
    2. use the `head` function.  i.e., `head(bc)`.  Prints just the first 10, rows.  With lots of columns,
    this is hard to read too.
    3. Use indexing to look at just a small part: i.e.:
        ```{r}
        bc[1:5, 1:4]
        ```
    4. Look at the names:
        ```{r}
        names(bc)
        ```
    That is a little cumbersome
    5. Perhaps the most information-rich way of looking at it is with the `str` function, which gives you the
    __str__ucture of an R object:
        ```{r}
        str(bc)
        ```
    6. Finally, RStudio offers the very useful `View` function. Try this: `View(bc)`
        + You can even pop that out into a separate window.
        + They really ought to find a way to keep the headers visible when scrolling.
        

### Writing a data frame back out to a .csv file

* There is a `write.table` function much like `read.table`
* And there is a `write.csv` function that is similar
* Here we pick out just the fish between 60 and 100 mm and write 
the resulting data frame back to a .csv file:
    ```{r}
    bc2 <- bc[ bc$LENGTH >= 6 & bc$LENGTH <= 100, ]

    write.csv(bc2, file = "~/Desktop/bc-bits.csv")
    ```
and you can open that with Excel, even.
    + Note that the numeric rownames are in there by default with no header. 
    + If you read it back in, you would want to use `row.names = 1`.
    + Read `?write.table` for more info.
    
## A tiny blurb about _factors_  {#factors-tiny-blurb}

* In `read.csv` we used the option `stringsAsFactors = FALSE`
    + What does that mean, and why did I use it?
* In all the `read.table` family of functions, columns with character data
(i.e. text strings) get converted to an object of class _factor_.
* In R you will see _factors_ everywhere.  
* The name derives from the idea of factors in experimental design, which is 
a shame (I think) since factors in R are useful in many ways.
* My suggestion: when you see _factor_ think _vector of __categories___

### Factors are vectors that record discrete _categories_

* Anything measured on a disrete scale can be said to fall into
one of a set of categories.
* The discrete scale could be a summary of a continuous scale
    + For example, the categories of _Small_, _Medium_, and _Large_ are (likely) summaries of
    a continuous variable like weight or height.
* If you have measured fish and put them into _Small_, _Medium_, and _Large_, categories
you might have them in a data frame like this:
    ```{r}
    set.seed(17)
    sml <- data.frame(ID = paste("Fish", 1:15, sep="_"),
                      SizeCategory = sample(c("Small", "Medium", "Large"), size = 15, replace = T)
                      )

    # when you print it out it looks pretty normal
    sml                 
    ```

### Underlying structure of a _factor_

* The "SizeCategory" column looks like a vector of strings (a character vector),
but it isn't.
* A factor is a class that contains:
    1. A _levels_ attribute that maps $N$ categories to the integers $1,\ldots,N$
        + (This sounds more complex than it is.  It is just a character vector that gives
        an ordered collection of category names)
    2. An integer vector of values between 1 and $N$ used to describe the occurrence of the
    categories.
* What?  If that's not clear, continuing with the `sml` example from above should help clarify things

### _sml_ data frame's SizeCategory

* We can access the _levels_ attribute of `sml$SizeCategory` like this:
    ```{r}
    levels(sml$SizeCategory)
    ```
* The order these are in the _levels_ tells us that:
    + 1 = "Large" 
    + 2 = "Medium"
    + 3 = "Small"
* And the integer vector part of `sml$SizeCategory` can be visualized by attaching it
on the right side of the `sml` data frame like this:
    ```{r}
    cbind(sml, underlying_integer_vector = unclass(sml$SizeCategory))
    ```
* (Note that, by default, if categories are named by characters, R sorts them
alphabetically to give them an order in the _levels_ of the factor.)

    
### Factors are immensely useful, but tricky

* We will continue talking about factors on Thursday.  
* Before that class, please download [The R Inferno](http://www.burns-stat.com/pages/Tutor/R_inferno.pdf)
and read the Preface on page 8 and the first few
paragraphs of Chapter 1 (because it is fun
to do so---we have all been in R hell at one time or another), then read from section 8.2 through 8.2.8, which 
covers factor hell.

## Your mission {#mission-data-frame}

In lieu of homework on this topic, everyone should just do the following while 
this is fresh in your mind:

1. Read `?read.table`
2. Go get your own data sets that you want to work with (or are working with) and
read them into R and have a look around them. 
    + Look over their structure
    + print them to the console in various ways
    + `View()` them.
    + Change some values
    + Extract just a few, non-adjacent columns
    + Then save those non-adjacent columns to a new csv file.
3. If you don't have your own data and want some practice, play with more files that
I put in the`data` directory of the course repo:
    ```{r}
    # parentage assignments of hatchery salmon
    pbt <- read.table("data/snppit_output_ParentageAssignments.txt", header = TRUE, na.strings = "---")
    dim(pbt)

    # candidate genes involved in avian song development
    bird_genes <- read.table("data/candidate-genes.txt", header = TRUE, sep = "\t")
    ```