---
title: "CSSS 508, Lecture 4"
subtitle: "Data Structures"
author: "Michael Pearce
(based on slides from Chuck Lanfear)"
date: "April 19, 2023"
output:
xaringan::moon_reader:
lib_dir: libs
css: xaringan-themer.css
nature:
highlightStyle: tomorrow-night-bright
highlightLines: true
countIncrementalSlides: false
titleSlideClass: ["center","top"]
---
class:inverse
```{r setup, include=FALSE, purl=FALSE}
options(htmltools.dir.version = FALSE, width = 70)
knitr::opts_chunk$set(comment = "##")
library(dplyr)
```
# Preamble: R Data Structures
So far, we've been manipulating, summarizing, and making visuals out of data. That's pretty great!!
But now, we need to get more into the weeds of programming...
Today is all about *types of data* in R.
---
class: inverse
# Topics
Last time, we learned about,
1. Logical conditions
1. Subsetting data
1. Modifying data
1. Summarizing data
1. Merging together data
--
Today, we will cover,
1. Types of Data
1. Vectors
1. Matrices
1. Lists
---
class: inverse
# 1. Types of Data
+ `numeric`
+ `character`
+ `factor`
+ `logical`
+ `NA`, `NaN`, and `Inf`
---
## How is my data stored?
Under the hood, R stores different types of data in different ways.
--
+ e.g., R knows that `4.0` is a number, and that `"Michael"` is not a number.
--
So what exactly are the common data types, and how do we know what R is doing?
---
## Data Types
* **numeric**: `c(1, 10*3, 4, -3.14)`
--
* **character**: `c("red", "blue", "blue")`
--
* **factor**: `factor(c("red", "blue", "blue"))`
--
* **logical**: `c(FALSE, TRUE, TRUE)`
---
## Note on Factor Vectors
Factors are categorical data that encode a (modest) number of **levels**, like for experimental group or geographic region:
```{r}
test_group <- factor(c("Treatment", "Placebo",
"Placebo", "Treatment"))
test_group
```
--
Why use `factor` instead of `character`? Because `factor` data can go into a statistical model.1
.footnote[[1] Most R models will automatically convert character data to factors. The default reference is chosen alphabetically.]
---
## Note on Logical Vectors
Remember that `logical` data in R takes on boolean `TRUE` or `FALSE` values.
--
You can do math with logical values, because R makes `TRUE`=1 and `FALSE`=0:
```{r}
my_booleans <- c(TRUE, TRUE, FALSE, FALSE, FALSE)
sum(my_booleans)
mean(my_booleans)
```
---
## Missing or Infinite Data Types
Your data may otherwise be missing or infinite:
1. **Not Applicable** `NA`
+ Used when data simply is missing or "not available"
--
2. **Not a Number** `NaN`
+ Used when you try to perform a bad math operation, e.g., `0 / 0 `
--
3. **Infinite** `Inf`, `-Inf`
+ Used when you divide by 0, e.g., `-5/0` or `5/0`
---
## Checking Data Types
`class()` tells us what type of data we have:
```{r}
class4 <- class(4)
classAB <- class(c("A","B"))
classABFac<- class(factor("A","B"))
classTRUE <- class(TRUE)
c(class4,classAB,classABFac,classTRUE)
```
---
## Testing Data Types
There are also functions to **test** for certain data types:
```{r}
c(is.numeric(5), is.character("A"))
is.logical(TRUE)
c(is.infinite(-Inf), is.na(NA), is.nan(NaN))
```
Warning: `NA` is not `NaN`!!!
---
class: inverse
# 2. Vectors
+ Creating Vectors
+ Vector Math
+ Subsetting Vectors
---
## Making Vectors
In R, we call a set of values of the same type a **vector**. We can create vectors using the `c()` function ("c" for **c**ombine or **c**oncatenate).
```{r}
c(1, 3, 7, -0.5)
```
--
Vectors have one dimension: **length**
```{r}
length(c(1, 3, 7, -0.5))
```
--
All elements of a vector are the same type (e.g. numeric or character)!
If you mix character and numeric data, it will convert everything to *characters*!
---
## Generating Numeric Vectors
There are shortcuts for generating numeric vectors:
```{r}
1:10
```
--
```{r}
seq(-3, 6, by = 1.75) # Sequence from -3 to 6, increments of 1.75
```
--
```{r}
rep(c(0, 1), times = 3) # Repeat c(0,1) 3 times
rep(c(0, 1), each = 3) # Repeat each element 3 times
```
---
## Element-wise Vector Math
When doing arithmetic operations on vectors, R handles these *element-wise*:
```{r}
c(1, 2, 3) + c(4, 5, 6)
c(1, 2, 3, 4)^3 # exponentiation with ^
```
Common operations: `*`, `/`, `exp()` = $e^x$, `log()` = $\log_e(x)$
---
## Vector Recycling
If we work with vectors of different lengths, R will **recycle** the shorter one by repeating it to make it match up with the longer one:
```{r}
c(0.5, 3) * c(1, 2, 3, 4)
c(0.5, 3, 0.5, 3) * c(1, 2, 3, 4) # same thing
```
---
## Scalars as Recycling
A special case of recycling involves arithmetic with **scalars** (a single number). These are vectors of length 1 that are recycled to make a longer vector:
```{r}
3 * c(-1, 0, 1, 2) + 1
```
---
## Warning on Recycling
Recycling doesn't work so well with vectors of incommensurate lengths:
```{r, warning=TRUE}
c(1,2) + c(100,200,300)
```
Be careful!!
---
## Example: Standardizing Data
Let's say we had some test scores and we wanted to put these on a standardized scale:
$$z_i = \frac{x_i - \text{mean}(x)}{\text{SD}(x)}$$
--
```{r}
x <- c(97, 68, 75, 77, 69, 81)
z <- (x - mean(x)) / sd(x)
round(z, 2)
```
---
## Math with Missing Values
Even one `NA` "poisons the well": You'll get `NA` out of your calculations unless you add the extra argument `na.rm = TRUE` (in some functions):
--
```{r}
vector_w_missing <- c(1, 2, NA, 4, 5, 6, NA)
mean(vector_w_missing)
mean(vector_w_missing, na.rm=TRUE)
```
---
## Subsetting Vectors
We can **subset** a vector in a number of ways:
* Passing a single index or vector of entries to **keep**:
```{r}
first_names <- c("Andre","Brady","Cecilia",
"Danni","Edgar","Francie")
first_names[1]
first_names[c(1,2)]
```
--
* Passing a single index or vector of entries to **drop**:
```{r}
first_names[-3]
```
---
class: inverse
# 3. Matrices
---
## Matrices: Two Dimensions
**Matrices** extend vectors to two **dimensions**: **rows** and **columns**. We can construct them directly using `matrix`.
R fills in a matrix column-by-column (**not** row-by-row!)
```{r}
a_matrix <- matrix(first_names, nrow=2, ncol=3)
a_matrix
```
---
## Binding Vectors
We can also make matrices by *binding* vectors together with `rbind()` (**r**ow **bind**) and `cbind()` (**c**olumn **bind**).
```{r}
b_matrix <- cbind(c(1, 2), c(3, 4), c(5, 6))
b_matrix
c_matrix <- rbind(c(1, 2, 3), c(4, 5, 6))
c_matrix
```
---
## Subsetting Matrices
We subset matrices using the same methods as with vectors, except we index them with `[rows, columns]`:
```{r}
a_matrix[1, 2] # row 1, column 2
a_matrix[1, c(2,3)] # row 1, columns 2 and 3
```
--
We can obtain the dimensions of a matrix using `dim()`.
```{r}
dim(a_matrix)
```
---
## Matrices Becoming Vectors
If a matrix ends up having just one row or column after subsetting, by default R will make it into a vector.
```{r}
a_matrix[, 1]
```
--
You can prevent this behavior using `drop=FALSE`.
```{r}
a_matrix[, 1, drop=FALSE]
```
---
## Matrix Data Type Warning
Matrices can contain numeric, integer, factor, character, or logical. But just like vectors, *all elements must be the same data type*.
```{r}
bad_matrix <- cbind(1:2, c("Michael","Pearce"))
bad_matrix
```
In this case, everything was converted to characters!
---
## Matrix Dimension Names
We can access dimension names or name them ourselves:
```{r}
rownames(bad_matrix) <- c("First", "Last")
colnames(bad_matrix) <- c("Number", "Name")
bad_matrix
bad_matrix[ ,"Name", drop=FALSE]
```
---
## Matrix Arithmetic
Matrices of the same dimensions can have math performed entry-wise with the usual arithmetic operators:
```{r}
matrix(c(2,4,6,8),nrow=2,ncol=2) / matrix(c(2,1,3,1),nrow=2,ncol=2)
```
---
## "Proper" Matrix Math
To do matrix transpositions, use `t()`.
```{r}
e_matrix <- t(c_matrix)
e_matrix
```
--
To do actual matrix multiplication (not entry-wise), use `%*%`.
```{r}
f_matrix <- c_matrix %*% e_matrix
f_matrix
```
---
## "Proper" Matrix Math (cont.)
To invert an invertible square matrix, use `solve()`.
```{r}
g_matrix <- solve(f_matrix)
g_matrix
```
---
## Matrices vs. data.frames and tibbles
All of these structures display data in two dimensions
+ `matrix`
+ Base R
+ Single data type allowed
--
+ `data.frame`
+ Base R (default for data storage)
+ Stores multiple data types
--
+ `tibbles`
+ tidyverse
+ Stores multiple data types
+ Displays nicely
--
In practice, data.frames and tibbles are very similar!
---
## Creating `data.frame`s
We create a `data.frame` by specifying the columns separately:
```{r}
data.frame(Column1Name = c(1,2,3),
Column2Name = c("A","B","C"))
```
*Note:* `data.frame`s allow for **mixed data types!**
---
class: inverse
# 4. Lists
---
## What are Lists?
**Lists** are objects that can store multiple types of data.
```{r}
my_list <- list("first_thing" = 1:5,
"second_thing" = matrix(8:11, nrow = 2))
my_list
```
---
## Accessing List Elements
You can access a list element by its name or number in `[[ ]]`, or a `$` followed by its name:
```{r}
my_list[["first_thing"]]
my_list[[1]]
my_list$first_thing
```
---
## Why Two Brackets `[[` `]]`?
Double brackets get *the actual element*—as whatever data type it is stored as—in that location in the list.
```{r}
str(my_list[[1]])
```
--
If you use single brackets to access list elements, you get a **list** back.
```{r}
str(my_list[1])
```
---
## `names()` and List Elements
You can use `names()` to get a vector of list element names:
```{r}
names(my_list)
```
---
## Example: Regression Output
When you perform linear regression in R, the output is a list!
```{r}
lm_output <- lm(speed~dist,data=cars)
is.list(lm_output)
names(lm_output)
lm_output$coefficients
```
---
class:inverse
# Summary
1. Types of Data: `numeric`, `character`, `factor`, `logical`, `NA`, `NaN`, `Inf`
1. Vectors: `c()`
1. Matrices: `matrix`, `data.frame`, `tibble`
1. Lists: `list`
*Let's take a 10 minute break, then come back together for some practice!*
---
class:inverse
## Mini-Check 1: Types of Data
In each case, what will R return?
+ `is.numeric(3.14)`
--
+ `TRUE`
--
+ `is.numeric(pi)`
--
+ `TRUE`
--
+ `is.logical(FALSE)`
--
+ `TRUE`
--
+ `is.nan(NA)`
--
+ `FALSE`
---
class:inverse
## Mini-Check 2: Vectors
+ What does `sum(c(1,2,NA))` output?
--
+ `NA`. The code `sum(c(1,2,NA),na.rm=TRUE)` would output `3`.
--
+ What does `rep(c(0,1),times=2)` output?
--
+ `c(0,1,0,1)`
--
+ I want to get the first and second elements of my vector, `a_vector`. What's wrong with the code `a_vector[1,2]` ?
--
+ `a_vector[c(1,2)]`
---
class:inverse
## Activity: Matrices and Lists
1\. Write code to create the following matrix:
```{r echo=FALSE}
matrix(c("A","B","C","D","E","F"),nrow=2,byrow=TRUE)
```
2\. Write a line of code to extract the second column. Ensure the output is still a *matrix*.
```{r echo=FALSE}
matrix(c("A","B","C","D","E","F"),nrow=2,byrow=TRUE)[,2,drop=FALSE]
```
3\. Complete the following sentence: "Lists are to vectors, what data frames are to..."
4\. Create a list that contains 3 elements: "ten_numbers" (integers between 1 and 10), "my_name" (your name as a character), and "booleans" (vector of `TRUE` and `FALSE` alternating three times).
---
## My Answers
1\. Write code to create the following matrix:
```{r eval=FALSE}
mat_test <- matrix(c("A","B","C","D","E","F"),nrow=2,byrow=TRUE)
```
--
2\. Write a line of code to extract the second column. Ensure the output is still a *matrix*.
```{r eval=FALSE}
mat_test[,2,drop=FALSE]
```
---
## My Answers (cont.)
3\. Complete the following sentence: "Lists are to vectors, what data frames are to...**Matrices!**" *Lists and data frames can contain mixed data types, while vectors and matrices can only contain one data type.*
--
4\. Create a list that contains 3 elements:
```{r}
my_new_list <- list("ten_numbers"=1:10,
"my_name"="Michael Pearce",
"booleans"=rep(c(TRUE,FALSE),times=3))
my_new_list
```
---
class: inverse
## Homework 4
For Homework 4, you will fill in an RMarkdown template on my website that walks you through the process of creating, accessing, and manipulating R data structures. Enter values in the RMarkdown document and knit it to check your answers!
+ *Knit after entering each answer!!* If you get an error, check to see if undoing your last edit solves the problem. Coding an assignment to handle all possible mistakes is really hard!
+ This assignment is *long*, so *start early*.
On the due date, I will provide a key for the written answers. You will grade those answers as part of your peer review. In addition, you'll be asked to comment on the style of your peer's code and what you yourself did similarly/different. Please remember to provide a numerical grade (0-3), as always.