---
title: "Data structures in R"
author: Abhijit Dasgupta
date: BIOF 339
---
```{r setup, include=FALSE, child=here::here('slides/templates/setup.Rmd')}
```
layout: true
---
## A quick refresh
+ R is a scripting language for data analysis and statistics
+ R Markdown is a way of combining textual information and R code to produce reproducible documents
+ RStudio is an integrated environment that makes it easier to work with R
.pull-left[
You type commands (_code_) for R to run.
- objects like data (_nouns_)
- functions that do something to R objects (_verbs_)
]
.pull-right[
Examples
```{r 02-DataStructures-1, eval=F}
airquality
diamonds
summary(airquality)
```
]
---
# Objects in R
.pull-left[
Let's start with the `airquality` data.
- It is an object
- of class `class(airquality)` = `r class(airquality)`
How about each column?

Let's look at the `Ozone` and `Wind` columns:
- We can access them using `airquality$Ozone` and `airquality$Wind`
- `class(airquality$Ozone)` = `r class(airquality$Ozone)`
- `class(airquality$Wind)` = `r class(airquality$Wind)`
]
.pull-right[
```{r 02-DataStructures-2, echo = F}
head(airquality, 25)
```
]
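---

# Objects in R

Rather than checking one column at a time, one option is `sapply()`, which applies a function to every column of a data frame. This is only a small sketch using the built-in `airquality` data; the name `column_classes` is made up for illustration.

```{r 02-DataStructures-extra-classes}
# Apply class() to each column of the built-in airquality data
column_classes <- sapply(airquality, class)

# A named character vector: one class per column
column_classes
```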
---
# Objects in R
```{r 02-DataStructures-3}
head(iris)
str(iris)
```
Now we see another type of object, a `factor`.
---
# Objects in R
```{r 02-DataStructures-4}
library(ggplot2)
str(midwest)
```
Here we have a `character`.
---
# Objects in R
The most common types of data we see are `numeric`, `character`, and `factor`. You will also see `Date` and `logical`.

You can test whether data is of a particular type, or convert from one data type to another.
```{r 02-DataStructures-5, echo=F}
library(tibble)
d <- tribble(
~"Data type", ~"Test", ~"Convert",
'numeric', 'is.numeric','as.numeric',
'character','is.character','as.character',
'factor','is.factor','as.factor'
)
knitr::kable(d, format='html')
```
--
This conversion is important. Why?
```{r 02-DataStructures-6, echo=F}
countdown(minutes=5)
```
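---

# Objects in R

One common reason conversion matters: numbers that arrive as text cannot be used in arithmetic until they are converted. A minimal sketch with made-up values (`ages_text` and `ages` are illustrative names):

```{r 02-DataStructures-extra-convert}
# Numbers stored as text (made-up values), as can happen on import
ages_text <- c("23", "35", "41")
is.numeric(ages_text)      # FALSE: these are character strings
is.character(ages_text)    # TRUE

# Convert to numeric so that arithmetic works
ages <- as.numeric(ages_text)
is.numeric(ages)           # TRUE
mean(ages)                 # 33
```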
---
# Factors
Factors are a distinctly R construct.

They are meant to represent categorical data (gender, race, state, phenotype).

They look like character vectors but are stored internally as integers, so you have to be a bit careful with them.
--
Whenever you're in doubt, convert them to characters using `as.character()`.

We'll see the utility of factors when we do data munging, summaries, and modeling.
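---

# Factors

The "acts like integers" behavior is easiest to see when the category labels happen to look like numbers. The sketch below uses made-up dose labels; `dose` is an illustrative name.

```{r 02-DataStructures-extra-factor}
# A factor built from made-up dose labels
dose <- factor(c("10", "50", "10", "100"))

levels(dose)       # the distinct categories, sorted as text
as.integer(dose)   # the internal integer codes, not the doses

# Converting straight to numeric returns the codes, rarely what you want
as.numeric(dose)

# Going through character first recovers the original values
as.numeric(as.character(dose))
```

This is why converting with `as.character()` first is the safe default.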
---
## Every object in R has a name
You give an object a name using the syntax `name <- object`.

Naming conventions:

1. `snake_case` or `pothole_case`
1. `CamelCase`
1. `Some.people.use.periods`

I'm partial to `snake_case`.

The point is to choose expressive English names, so you can tell what is stored under each name.
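---

## Every object in R has a name

As a small illustration of `name <- object`, the sketch below stores one column of the built-in `airquality` data under a snake_case name. The name `ozone_levels` is just an example choice.

```{r 02-DataStructures-extra-naming}
# Store a single column under a descriptive snake_case name
ozone_levels <- airquality$Ozone

# Typing the name retrieves what is stored under it
head(ozone_levels)
class(ozone_levels)
```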
---
## A silly exercise
From the `iris` dataset, save each column into a new object, giving it a name. Then see what kind of data that object contains.
```{r 02-DataStructures-7, echo=F}
countdown(minutes = 5)
```
---
## Bigger objects
### Scalar -> vector (array) -> matrix (2-d array)
--
+ A scalar is a single number or word
+ A vector is a bunch of scalars arranged in a row or column
+ A matrix is a bunch of scalars arranged in rows and columns
#### All the elements within each of these **must** be of the same data type
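---

## Bigger objects

If you mix types in a vector, R does not raise an error; it quietly converts every element to a common type. A minimal sketch (the object names are illustrative):

```{r 02-DataStructures-extra-coercion}
# Mixing a number and a word: everything becomes character
mixed <- c(1, "a")
mixed
class(mixed)

# Mixing a number and a logical: TRUE becomes 1
number_and_logical <- c(2, TRUE)
number_and_logical
class(number_and_logical)
```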
---
## Examples
```{r 02-DataStructures-8}
2
```
--
```{r 02-DataStructures-9}
c(2,3,4,5,6)
```
> `c()` is the combine function: it concatenates its arguments into a vector
--
```{r 02-DataStructures-10}
matrix(c(1,2,3,4), byrow = TRUE, nrow = 2)
```
---
class:middle,center
# Data comes in many flavors
---
## Heterogeneous data
From Excel, we are familiar with keeping different kinds of data together in a spreadsheet:

+ Expression levels (numeric)
+ Gene names (character)
+ Date of experiment (Date)

In R, the objects that can hold heterogeneous data are `data.frame` and `list`.
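---

## Heterogeneous data

The spreadsheet above could be sketched as a `data.frame` with one column per kind of data. All values and column names here are made up for illustration.

```{r 02-DataStructures-extra-hetero}
# Made-up values, for illustration only
experiment <- data.frame(
  gene       = c("BRCA1", "TP53", "EGFR"),
  expression = c(2.4, 5.1, 3.3),
  run_date   = as.Date(c("2020-01-15", "2020-01-16", "2020-01-17")),
  stringsAsFactors = FALSE   # keep gene names as character
)

# Each column keeps its own type
str(experiment)
sapply(experiment, class)
```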
---
class:middle, center
# Data sets
---
## Typical data structure
+ Data is typically in a rectangular format
    + spreadsheet, database table
    + CSV (comma-separated values) or TSV (tab-separated values) files
+ Characteristics
    + Rows are observations
    + Columns are variables
    + Each column has the same number of observations
> [__Tidy data__](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) is a particularly amenable format for data analysis.
---
# The `data.frame`
Data frames are the primary way of storing datasets in R.

They were revolutionary in that they kept heterogeneous data together.

They share properties of both a __matrix__ and a __list__.
```{r 02-DataStructures-11}
class(airquality)
```
> Technically, a data.frame is a list of vectors (or objects, generally) of the same length
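---

# The `data.frame`

The "list of vectors of the same length" description can be checked directly on the built-in `airquality` data. A brief sketch:

```{r 02-DataStructures-extra-dflist}
is.data.frame(airquality)    # TRUE
is.list(airquality)          # TRUE: a data frame is also a list
length(airquality)           # number of list elements, i.e. columns
sapply(airquality, length)   # every column has the same length
```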
---
class: middle, center
# Load some data
---
We'll load the `spine` dataset into R.

To do this, download the data from the web and store it in the main folder of your project.

Then, in the Environment pane, import it using the **Import Dataset** button. You will use the `From Text (readr)` option.
```{r 02-DataStructures-12, echo=F}
data_spine <- read.csv('../data/Dataset_spine.csv', stringsAsFactors=F)
```
---
![](img/readr1.png)
---
![](img/readr2.png)
---
![](img/readr3.png)
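---

Behind the scenes, the import dialog writes a line of code that reads the file (with the readr option, a call to `read_csv()`). The same idea can be sketched in a self-contained way with base R's `read.csv()`. The file below is a temporary one with made-up values, not the real spine data, and its column names are illustrative.

```{r 02-DataStructures-extra-import}
# Write a tiny made-up CSV to a temporary file so the example runs anywhere
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score,group",
             "1,4.5,a",
             "2,3.9,b"), csv_path)

# Read it back in; stringsAsFactors = FALSE keeps text columns as character
small_data <- read.csv(csv_path, stringsAsFactors = FALSE)
str(small_data)
```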
---
class:center, middle
# A digression: Lists and Matrices
---
# Matrices
A __matrix__ is a rectangular array of data _of the same type_
```{r 02-DataStructures-13}
matrix(0, nrow=2, ncol=4)
```
```{r 02-DataStructures-14}
matrix(letters, nrow=2)
```
```{r 02-DataStructures-15}
matrix(letters, nrow=2, byrow=TRUE)
```
---
# Matrices
You can create a matrix from a set of _vectors_ of the same length
```{r 02-DataStructures-16}
x <- c(1,2,3,4)
y <- c(10,20,30,40)
```
Put columns together
```{r 02-DataStructures-17}
cbind(c(1,2,3,4), c(10,20,30,40)) ## Column bind
```
---
# Matrices
You can create a matrix from a set of _vectors_ of the same length
```{r 02-DataStructures-18}
x <- c(1,2,3,4)
y <- c(10,20,30,40)
```
Put rows together
```{r 02-DataStructures-19}
example_matrix <- rbind(c(1,2,3,4), c(10,20,30,40)) ## Row bind
example_matrix
```
---
# Extracting elements
```{r 02-DataStructures-20}
example_matrix
example_matrix[1,] ## Extracts 1st row
example_matrix[,2:3] ## Extracts 2nd & 3rd columns
example_matrix[1,4] ## Extracts the element in row 1, column 4
```
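---

# Extracting elements

One detail worth knowing: extracting a single row or column returns a plain vector by default. If you need the result to stay a matrix, `drop = FALSE` is one way to ask for that. A minimal sketch, rebuilding the same matrix so the chunk stands alone:

```{r 02-DataStructures-extra-drop}
# Same matrix as in the earlier slides
example_matrix <- rbind(c(1, 2, 3, 4), c(10, 20, 30, 40))

first_row <- example_matrix[1, ]   # result is a plain vector
dim(first_row)                     # NULL: no longer a matrix

first_row_matrix <- example_matrix[1, , drop = FALSE]   # result stays a matrix
dim(first_row_matrix)              # 1 row, 4 columns
```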
---
# Matrix properties
```{r 02-DataStructures-21}
example_matrix
nrow(example_matrix) ## Number of rows
ncol(example_matrix) ## Number of columns
dim(example_matrix) ## Both at once: rows, then columns
```
---
# Matrix arithmetic
```{r 02-DataStructures-22}
example_matrix
example_matrix + 5 ## Add 5 to each element
example_matrix * 2 ## Multiply each element by 2
```
---
# Two matrices
```{r 02-DataStructures-23}
example_matrix
example_matrix2 <- rbind(3:6, 9:12)
example_matrix2
example_matrix + example_matrix2
```
---
# Two matrices
```{r 02-DataStructures-24}
example_matrix
example_matrix2
example_matrix * example_matrix2 ## Not matrix multiplication, but element-wise multiplication
```
---
# Two matrices
```{r 02-DataStructures-25}
rbind(example_matrix, example_matrix2)
cbind(example_matrix, example_matrix2)
```
---
# Two matrices
```{r 02-DataStructures-26}
dim(example_matrix2)
t(example_matrix2) ## Transpose of a matrix
example_matrix %*% t(example_matrix2) ## Matrix multiplication
```
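---

# Two matrices

To see what `%*%` computes, it can help to reproduce one entry by hand. The sketch below rebuilds the two matrices from the earlier slides and checks the top-left entry of the product; `product` is an illustrative name.

```{r 02-DataStructures-extra-matmul}
# Same matrices as in the earlier slides
example_matrix  <- rbind(c(1, 2, 3, 4), c(10, 20, 30, 40))
example_matrix2 <- rbind(3:6, 9:12)

# A 2 x 4 matrix times a 4 x 2 matrix gives a 2 x 2 result
product <- example_matrix %*% t(example_matrix2)
dim(product)

# Top-left entry: multiply the first rows element by element, then sum
sum(example_matrix[1, ] * example_matrix2[1, ])   # 1*3 + 2*4 + 3*5 + 4*6 = 50
product[1, 1]
```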
---
# Lists
Lists are collections of arbitrary objects in R.
```{r 02-DataStructures-27}
example_list <- list(c('Andy','Brian','Harry'),
c(12, 16, 16),
c(TRUE, TRUE, FALSE),
matrix(1, nrow=2, ncol=3))
example_list
```
---
# Extracting elements from lists
```{r 02-DataStructures-28}
example_list[[3]]
```
```{r 02-DataStructures-29}
example_list[1:2]
```
---
# Extracting elements from lists
```{r 02-DataStructures-30}
example_list[[4]]
class(example_list[[4]])
example_list[[4]][1,]
```
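---

# Extracting elements from lists

The difference between single and double brackets is easy to miss: `[ ]` returns a (shorter) list, while `[[ ]]` returns the element itself. A short sketch, rebuilding the same list so the chunk stands alone:

```{r 02-DataStructures-extra-brackets}
# Same list as in the earlier slides
example_list <- list(c('Andy','Brian','Harry'),
                     c(12, 16, 16),
                     c(TRUE, TRUE, FALSE),
                     matrix(1, nrow=2, ncol=3))

class(example_list[2])     # "list": single brackets return a list
class(example_list[[2]])   # "numeric": double brackets return the element

# Functions that expect a vector need the element, not a one-element list
mean(example_list[[2]])
```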
---
# Named lists
```{r 02-DataStructures-31}
example_named_list <- list('Names' = c('Andy','Brian','Harry'),
"YearsOfEducation" = c(12, 16, 16),
"Married" = c(TRUE, TRUE, FALSE),
'something' = matrix(1, nrow=2, ncol=3))
example_named_list[['Names']]
example_named_list$Names
example_named_list$Names[3]
```
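---

# Named lists

Names also let you list what a list contains and pick out several elements at once. A minimal sketch, rebuilding the same named list so the chunk stands alone; `people_subset` is an illustrative name.

```{r 02-DataStructures-extra-named}
# Same named list as on the previous slide
example_named_list <- list('Names' = c('Andy','Brian','Harry'),
                           'YearsOfEducation' = c(12, 16, 16),
                           'Married' = c(TRUE, TRUE, FALSE),
                           'something' = matrix(1, nrow=2, ncol=3))

names(example_named_list)   # the element names, in order

# Single brackets with a vector of names return a smaller named list
people_subset <- example_named_list[c('Names', 'Married')]
names(people_subset)

# [[ ]] and $ return the same element
identical(example_named_list[['Names']], example_named_list$Names)
```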
---
class: middle, center
# Back to a Data Frame
---
# Data frames
A `data.frame` object is a __named list__ in which every element has the same length.

You can use both _matrix_ and _list_ functions to operate on `data.frame` objects!
---
# Data Frames
```{r 02-DataStructures-32}
head(data_spine)
```
---
# Data Frames
```{r 02-DataStructures-33}
dim(data_spine)
nrow(data_spine)
data_spine_small <- data_spine[1:4,] ## Matrix operation
```
---
# Data Frames
```{r 02-DataStructures-34}
data_spine_small[,2] ## Matrix extraction by position
data_spine_small[[2]] ## List extraction by position
```
---
# Data Frames
```{r 02-DataStructures-35}
data_spine_small[['Pelvic.tilt']] ## Named list extraction
data_spine_small[,'Pelvic.tilt'] ## Data frame named column extraction
data_spine_small$Pelvic.tilt ## Dollar sign extraction
```
---
# Data Frames
My preference, in order, is for

1. _data frame named column extraction_ `data_spine_small[,'Pelvic.tilt']`
2. _named list extraction_ `data_spine_small[['Pelvic.tilt']]`
3. _dollar sign extraction_ `data_spine_small$Pelvic.tilt`
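---

# Data Frames

For an ordinary data frame, all three styles return the same vector. One practical difference, offered here as a possible consideration rather than as the reason behind the ranking, is that the bracket styles accept a column name stored in a variable while `$` does not. The sketch uses a small stand-in data frame with made-up values (`spine_demo` is not the real spine data).

```{r 02-DataStructures-extra-extract}
# A small stand-in data frame with made-up values
spine_demo <- data.frame(Pelvic.tilt  = c(22.5, 10.1, 22.2),
                         Pelvic.slope = c(0.74, 0.41, 0.47))

by_bracket <- spine_demo[, 'Pelvic.tilt']
by_list    <- spine_demo[['Pelvic.tilt']]
by_dollar  <- spine_demo$Pelvic.tilt

identical(by_bracket, by_list)   # TRUE
identical(by_list, by_dollar)    # TRUE

# A column name held in a variable works with brackets...
column_name <- 'Pelvic.slope'
spine_demo[[column_name]]

# ...but $ looks for a column literally called "column_name"
spine_demo$column_name           # NULL
```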
---
# Data Frames
```{r 02-DataStructures-36}
names(data_spine_small)
data_spine_small[,c('Pelvic.tilt', 'Pelvic.slope','Class.attribute')]
```