---
title: 'STA130 - Class #3: How R You?'
author: "Nathan Taback"
date: '2018-01-22'
output:
  ioslides_presentation:
    widescreen: true
    smaller: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(tidyverse)
```
## Today's Class
- RStudio user interface
- R Objects
- R Functions
- R Scripts
- R Packages
- R Lists
- R Notation
- R Missing Data
- dplyr
## Announcements
- Tutorial grades will be assigned according to the following marking scheme.

| Component                          | Mark |
|------------------------------------|------|
| Attendance for the entire tutorial | 1    |
| Assigned homework completion^a^    | 1    |
| In-class exercises                 | 4    |
| Total                              | 6    |

- You will learn about the mentorship program in this week's tutorial (3% of final grade).
## RStudio User Interface
## R Objects
- R lets you save data by storing it inside an R object.
- What’s an object? Just a name that you can use to call up stored data.
```{r}
x <- 1
x
```
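## R Objects

A minimal sketch of how a stored object can be reused and overwritten. The names `a` and `b` are made up for illustration.

```{r}
a <- 5        # store the value 5 in an object named a
b <- a * 2    # use the stored value in a calculation
b
a <- 100      # assigning again replaces the old value of a
a
b             # b was computed earlier, so it does not change
```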
## Environment Pane in RStudio
- When you create an object, the object will appear in the environment pane of RStudio.
## Functions
- R comes with many functions that you can use to do sophisticated tasks like random sampling.
- For example, you can round a number with `round()`, or calculate its absolute value with `abs()`.
- Write the name of the function and then the data you want the function to operate on in parentheses:
```{r}
round(-2.718282, 2)
abs(-5)
abs(round(-2.718282, 2))
```
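## Functions

Arguments can also be passed by name. The short sketch below uses the `digits` argument of `round()` to show that naming an argument gives the same result as passing it by position.

```{r}
round(-2.718282, digits = 2)   # same as round(-2.718282, 2)
round(-2.718282)               # digits defaults to 0 when it is left out
```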
## Function Constructor
> - Every function in R has three basic parts: a name, a body of code, and a set of arguments.
> - To make your own function, you need to replicate these parts and store them in an R object, which you can do with the `function()` constructor.
> - To do this, call `function()` and follow it with a pair of braces, `{}`: `my_function <- function() {}`
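## Function Constructor

As a rough illustration of the three parts, the sketch below defines a small function and inspects it. The function `add_one()` is hypothetical and is used only for this example.

```{r}
# add_one is a hypothetical function used only for illustration
add_one <- function(x) {
  x + 1
}
formals(add_one)  # the arguments
body(add_one)     # the body of code
add_one(2)        # call the function by its name
```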
## Function Constructor
- We can simulate rolling a pair of dice and adding the result with the code:
```{r}
die <- 1:6
dice <- sample(die, size = 2, replace = TRUE)
sum(dice)
```
## Function Constructor
- We can create our own function with
```{r, cache=TRUE}
roll <- function() {
  die <- 1:6
  dice <- sample(die, size = 2, replace = TRUE)
  sum(dice)
}
```
Call the function `roll()`:
```{r}
roll() # call the function. NB: result will differ with every call
```
## Function Arguments
Instead of rolling a pair of dice, consider rolling four or ten dice and then adding the results of all the rolls together.
```{r,cache=TRUE}
roll2 <- function(numrolls) { # numrolls is the argument of the function roll2
  die <- 1:6
  dice <- sample(die, size = numrolls, replace = TRUE) # the size of the sample
  sum(dice) # add up the roll results
}
```
`numrolls` is called an _argument_ of the function `roll2()`.
Let's simulate rolling ten dice and adding the results together.
```{r}
roll2(10)
```
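## Function Arguments

An argument can be given a default value that is used when the caller leaves it out. The sketch below is one possible variant of `roll2()`. The name `roll3()` and the seed value are assumptions made for illustration.

```{r}
# roll3 is a hypothetical variant of roll2 with a default argument value
roll3 <- function(numrolls = 2) {
  die <- 1:6
  dice <- sample(die, size = numrolls, replace = TRUE)
  sum(dice)
}
set.seed(130)   # fix the random seed so the results can be reproduced
roll3()         # uses the default of two dice
roll3(10)       # overrides the default with ten dice
```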
## Scripts
- If we want to edit the function `roll2()` then we will want to save it in a script.
- To do this in RStudio, choose File > New File > R Script in the menu bar.
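## Scripts

Once a function is saved in a script, `source()` runs the whole script. The sketch below writes a small script to a temporary file so that the example is self-contained. In practice the script would be a file saved from RStudio. The function name `roll_saved()` is hypothetical.

```{r}
# create a small script in a temporary file (a stand-in for a script saved from RStudio)
script_file <- tempfile(fileext = ".R")
writeLines(c(
  "roll_saved <- function(numrolls) {",
  "  sum(sample(1:6, size = numrolls, replace = TRUE))",
  "}"
), script_file)
source(script_file)  # run the script, which defines roll_saved()
roll_saved(4)        # result will differ with every call
```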
## Packages
- You’re not the only person writing your own functions with R.
- Many professors, programmers, and statisticians use R to design tools that can help people analyze data.
- They then make these tools free for anyone to use.
- To use these tools, you just have to download them. They come as preassembled collections of functions and objects called packages.
- We have already used two packages: `ggplot2` and `dplyr`.
## Packages
To install the package `tidyverse` in RStudio, go to the Packages tab and click Install.

To load a package, type
```{r,eval=FALSE}
library(tidyverse)
```
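## Packages

A package can also be installed from the console with `install.packages()`. The sketch below is one way to install `tidyverse` only when it is missing. It is not evaluated here because installing requires an internet connection.

```{r, eval=FALSE}
# install tidyverse only if it is not already installed
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}
library(tidyverse)  # load the package for the current session
```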
## RStudio IDE
- IDE: Integrated Development Environment.
- The RStudio IDE has many features that we will not use in the course.
- The __console__ is where you can type an R command at the prompt and the result is returned.
- Write code in an R script, R Markdown document, or R Notebook.
- Run a script, or run chunks from an R Markdown document or R Notebook by pushing the run button in the chunk.
## R Objects
- R stores data in objects such as vectors, arrays, and matrices.
- In most applications we will usually load data from an external file.
## R Objects - Atomic Vectors
You can make an atomic vector by grouping some values of data together with `c()`:
```{r}
die<-c(1,2,3,4,5,6)
die
is.vector(die)
length(die)
```
## R Objects - Atomic Vectors
You can also make an atomic vector with just one value. R saves single values as an atomic vector of length 1:
```{r}
two <- 2
two
```
## R Objects - Atomic Vectors: Integer and Character
- Each atomic vector can only store one type of data. You can save different types of data in R by using different types of atomic vectors.
- R recognizes six basic types of atomic vectors: doubles, integers, characters, logicals, complex, and raw.
- We will not be using complex or raw types in STA130.
- Integer vectors are entered with a capital L after each number, and character vectors are entered with the values surrounded by quotation marks.
## R Objects - Atomic Vectors: Integer and Character
```{r, error=TRUE}
mynums <- c(2L,3L)
courses <- "STA130"
courses <- c("STA130", "MAT137")
sum(mynums)
sum(courses)
sum(courses == "STA130")
```
## R Objects - Double Vectors
- A double vector stores real numbers. Doubles are often called numerics.
```{r}
die <- c(1,2,3,4,5,6)
typeof(die)
```
## R Objects - Logical Vectors
- Logical vectors store TRUEs and FALSEs, R’s form of Boolean data. Logicals are very helpful for doing things like comparisons:
```{r}
3 > 4
```
- TRUE or FALSE in capital letters (without quotation marks) will be treated as logical data. R also assumes that T and F are shorthand for TRUE and FALSE.
```{r}
logic <- c(TRUE, FALSE, TRUE)
logic
```
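## R Objects - Atomic Vectors: `typeof()`

A quick way to check which type of atomic vector you have is `typeof()`. The values below are chosen only for illustration.

```{r}
typeof(2)         # "double"
typeof(2L)        # "integer"
typeof("STA130")  # "character"
typeof(TRUE)      # "logical"
```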
## R Objects - Atomic Vectors: `dim()`
You can transform an atomic vector into an n-dimensional array by giving it a dimensions attribute with `dim()`.
```{r}
die <- c(1,2,3,4,5,6)
dim(die) <- c(2,3) # a 2x3 matrix
die
```
```{r}
die <- c(1,2,3,4,5,6)
dim(die) <- c(3,2) # a 3x2 matrix
die
```
R fills each matrix by columns instead of by rows. To fill a matrix by rows, use `matrix()` with the argument `byrow = TRUE`.
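## R Objects - Atomic Vectors: `matrix()`

The difference between filling by columns and filling by rows can be seen with `matrix()` and its `byrow` argument. This is a small sketch using the numbers 1 to 6.

```{r}
matrix(1:6, nrow = 2)                # filled by columns (the default)
matrix(1:6, nrow = 2, byrow = TRUE)  # filled by rows
```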
## Factors
- Factors are R’s way of storing categorical information, like ethnicity or eye color.
- A variable such as sex can be stored as a factor because it can only take certain values.
- Factors are very useful for recording the treatment levels of a categorical variable.
```{r}
sex <- factor(c("male", "female", "female", "male"))
typeof(sex)
unclass(sex) # shows how R is storing the factor vector
```
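## Factors

The sketch below looks at the same factor a little more closely: `levels()` lists the allowed values, `table()` counts them, and `as.integer()` shows the codes that R stores.

```{r}
sex <- factor(c("male", "female", "female", "male"))
levels(sex)      # the possible values, in alphabetical order by default
table(sex)       # how many times each level occurs
as.integer(sex)  # the integer codes behind the labels
```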
## Coercion
R always follows the same rules when it coerces data types. Once you are familiar with these rules, you can use R’s coercion behavior to do surprisingly useful things.
For example `sum(c(TRUE, TRUE, FALSE, FALSE))` will become `sum(c(1, 1, 0, 0))`.
```{r}
sum(c(TRUE, TRUE, FALSE, FALSE))
```
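## Coercion

A few more coercion examples, as a sketch with made-up data. Because `TRUE` is treated as 1 and `FALSE` as 0, the mean of a logical vector is the proportion of `TRUE` values.

```{r}
passed <- c(TRUE, TRUE, FALSE, FALSE)  # made-up data
mean(passed)           # proportion of TRUE values
c(1, "a", TRUE)        # mixing types coerces everything to character
as.numeric("130") + 1  # explicit coercion from character to number
```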
## Lists
- Lists are like atomic vectors because they group data into a one-dimensional set.
- Lists do not group together individual values.
- Lists group together R objects, such as atomic vectors and other lists.
- For example, you can make a list that contains a numeric vector of length 31 in its first element, a character vector of length 1 in its second element, and a new list of length 2 in its third element.
```{r}
list1 <- list(1:31, "Prof. Taback", list(TRUE, FALSE))
list1
```
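## Lists

Elements of a list can be pulled out with double square brackets. The sketch below reuses the list from the previous slide.

```{r}
list1 <- list(1:31, "Prof. Taback", list(TRUE, FALSE))
length(list1)     # number of elements in the list
list1[[2]]        # the second element itself
list1[[3]][[1]]   # first element of the nested list
str(list1)        # compact summary of the structure
```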
## Data Frames
- Data frames are the two-dimensional version of a list.
- They are the most useful storage structure for data analysis.
- A data frame is R’s equivalent to the Excel spreadsheet because it stores data in a similar format.
## Data Frames
- Data frames group vectors together into a two-dimensional table.
- Each vector becomes a column in the table.
- As a result, each column of a data frame can contain a different type of data; but within a column, every cell must be the same type of data.
## Data Frames
```{r,cache=TRUE}
student_num <- c(1, 2, 3, 4)
name <- c("Nadia", "Shiyi", "Yizhe", "Wei")
mydat <- data.frame(obsnum = student_num, student_name = name)
mydat
```
- Creating a data frame by hand takes a lot of typing, but you can do it with the `data.frame()` function.
- Give `data.frame()` any number of vectors, each separated with a comma.
- Each vector should be set equal to a name that describes the vector.
- `data.frame()` will turn each vector into a column of the new data frame.
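## Data Frames

To see that each column can hold a different type of data, the sketch below builds a small data frame and inspects it with `str()`. The data frame `grades` and its values are made up for illustration.

```{r}
# grades is a hypothetical data frame with made-up values
grades <- data.frame(student_name = c("Nadia", "Shiyi"),
                     mark = c(85, 90),
                     passed = c(TRUE, TRUE))
str(grades)   # each column has its own type
nrow(grades)  # number of rows
ncol(grades)  # number of columns
```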
## Data Frames
You can view a data frame in RStudio by clicking on the data frame name in the Environment tab.
## R Notation - [ , ]
- To extract a value or set of values from a data frame, write the data frame’s name followed by a pair of square brackets with a comma [ , ].
```{r, eval=FALSE}
mydat[ , ]
```
## R Notation - [ , ]
```{r}
mydat
mydat[1,2] # the value in row 1 and column 2
mydat[c(1,2),2] # all values in rows 1 and 2 in second column
```
## R Notation - $
The `$` tells R to return all of the values in a column as a vector.
```{r}
mydat$student_name
vec <- mydat$student_name # assign it to vec
attributes(vec) # info associated with object vec
vec[2] # get second element of vector
```
## R Notation - combine [,] and $
```{r}
mydat[mydat$obsnum == 1,] # first row of data frame and all columns
mydat[mydat$obsnum == 1 | mydat$obsnum == 4 ,] # first and fourth rows of data frame and all columns
```
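## R Notation - combine [,] and $

A few more ways to combine the two notations, as a sketch. The data frame `roster` is a made-up copy of the student data so that the example stands on its own.

```{r}
# roster is a hypothetical data frame used only for illustration
roster <- data.frame(obsnum = c(1, 2, 3, 4),
                     student_name = c("Nadia", "Shiyi", "Yizhe", "Wei"))
roster[, "student_name"]                 # select a column by name
roster[roster$obsnum > 2, ]              # rows where obsnum is greater than 2
roster$student_name[roster$obsnum == 4]  # one column, filtered by another
```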
## Missing Data - `NA`
- Missing information problems happen frequently in data science.
- For example, a value is missing because the measurement was lost, corrupted, or never recorded.
- The `NA` character is a special symbol in R. It stands for “not available” and can be used as a placeholder for missing information.
```{r, error=TRUE}
1 + NA
```
## Missing Data - `na.rm`
- Suppose you collected the ages of five students, but you forgot to record the fifth student's age.
```{r, error=TRUE}
age <- c(19, 20, 17, 20, NA)
mean(age) # mean will be NA
```
```{r}
age <- c(19, 20, 17, 20, NA)
mean(age, na.rm = TRUE) # R will ignore missing values
```
## Identify and Set Missing Data - `is.na()`
```{r}
age <- c(19, 20, 17, 20, NA)
is.na(age) # check which elements of age are missing
age[1] <- NA # set the first element of age to NA
age
```
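## Identify and Set Missing Data - `is.na()`

Because `is.na()` returns a logical vector, it can be used to count or drop missing values. This is a short sketch with the same made-up ages.

```{r}
age <- c(19, 20, 17, 20, NA)
sum(is.na(age))      # number of missing values
age[!is.na(age)]     # keep only the values that are not missing
NA == NA             # comparing with NA gives NA, so use is.na() instead
```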
## Summary of R Data Structures
## Tidyverse
[https://www.tidyverse.org](https://www.tidyverse.org)
```{r, cache=TRUE, eval=TRUE, echo=FALSE}
# Uncomment next line if the rvest package is not installed
# install.packages("rvest")
library(rvest)
library(tidyverse)
url <- "https://www.canada.ca/en/public-health/services/surveillance/respiratory-virus-detections-canada/2017-2018/respiratory-virus-detections-isolations-week-1-ending-january-6-2018.html"
# download and read table into flu_dat
flu_dat <- url %>%
  read_html() %>%
  html_nodes(xpath = '/html/body/main/div[1]/div[2]/details[1]/table') %>%
  html_table()
# clean up the file
fludat <- flu_dat[[1]]
dat <- as.data.frame(sapply(select(fludat,2:23), as.numeric))
fludat <- cbind(`Reporting Laboratory` = fludat[,1],dat)
fludat_prov <- fludat %>%
  filter(row_number() < 42 & row_number() %in% c(1, 2, 3, 4, 12, 29, 30, 33, 34, 36, 37, 38, 39)) %>%
  select(prov = `Reporting Laboratory`, testpop_size = `Flu Tested`, fluA = `Total Flu A Positive`)
write_csv(fludat_prov,"fludat_prov.csv")
fludat_prov$prov <- recode(fludat_prov$prov, "Province of Québec" = "Quebec", "Province of Ontario" = "Ontario", "Province of Saskatchewan" = "Saskatchewan", "Province of Alberta" = "Alberta")
popurl <- "https://en.wikipedia.org/wiki/List_of_Canadian_provinces_and_territories_by_population_growth_rate"
popdat <- popurl %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table') %>%
  html_table()
popdat <- popdat[[1]]
popdat <- popdat %>%
  select(prov = `Province/Territory`, prov_pop_size = `2016 Census`) %>%
  filter(row_number() < 14)
# remove comma and coerce to numeric
popdat$prov_pop_size <- as.numeric(gsub(",([[:digit:]])", "\\1", popdat$prov_pop_size))
popdat$prov[popdat$prov=="Newfoundland and Labrador"] <- "Newfoundland"
popdat$region <- c("Territories",NA,"West","Territories","West","West","East", NA,"Atlantic","Atlantic","Territories","Atlantic","Atlantic")
write_csv(popdat,"popdat.csv")
```
## Canadian Flu Rates with `dplyr`
The provincial rates for the week ending January 6, 2018 are in the file `fludat_prov.csv`, and the size of the population in each province is in the file `popdat.csv`. The code below reads the files into R data frames.
```{r, cache=TRUE}
library(tidyverse)
fludat_prov <- read_csv("fludat_prov.csv") # import data from file
popdat <- read_csv("popdat.csv") # import data from file
```
## Canadian Flu Rates with `dplyr`
```{r}
head(fludat_prov) # head shows the first six rows of a data frame
head(popdat)
```
## Canadian Flu Rates with `dplyr`
How many Provinces/Territories are in the fludat_prov data frame?
```{r}
fludat_prov %>% summarise(numprov = n()) # n() counts the number of rows in the data frame
```
## Canadian Flu Rates with `dplyr`
Do any variables in `fludat_prov` or `popdat` have missing values?
```{r}
fludat_prov %>% filter(is.na(prov) | is.na(testpop_size) | is.na(fluA))
popdat %>% filter(is.na(prov) | is.na(prov_pop_size) | is.na(region))
```
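## Canadian Flu Rates with `dplyr`

Another way to look for missing values is to count them. The sketch below counts missing values in one column with `summarise()` and in every column with `colSums()`. The data frame `toy` and its values are made up for illustration.

```{r}
library(dplyr)
# toy is a hypothetical data frame with made-up values
toy <- data.frame(prov = c("A", "B", "C"),
                  fluA = c(10, NA, 7),
                  region = c("West", NA, NA))
toy %>% summarise(n_missing_fluA = sum(is.na(fluA)))  # missing values in one column
colSums(is.na(toy))                                   # missing values in each column
```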
## Canadian Flu Rates with `dplyr`
Recode specific values using R data frame notation [,] and $.
```{r}
popdat$region[popdat$prov == "Alberta"] <- "West" #recode only the region value for Alberta
popdat$region[popdat$prov == "Quebec"] <- "East" #recode only the region value for Quebec
popdat$region #print region variable in popdat data
```
## Canadian Flu Rates with `dplyr` - Joining Two Tables with `inner_join()`
We can join two data frames with `inner_join(x, y)`, which returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
```{r}
fludat_prov %>% inner_join(popdat, by = "prov")
```
Why are there only 9 observations when there are 13 Provinces/Territories?
## Canadian Flu Rates with `dplyr` - Joining Two Tables with `inner_join()`
```{r}
fludat_prov$prov
popdat$prov
```
The province names need to be recoded so that they match in both data frames. This is an exercise in this week's practice problems.
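## Canadian Flu Rates with `dplyr` - Finding Unmatched Rows with `anti_join()`

One way to find the rows that a join drops is `anti_join(x, y)`, which returns the rows of x that have no match in y. The sketch below uses two small made-up data frames, `flu_toy` and `pop_toy`, not the real flu data.

```{r}
library(dplyr)
# flu_toy and pop_toy are hypothetical data frames with made-up values
flu_toy <- data.frame(prov = c("Ontario", "Quebec", "B.C."), fluA = c(5, 3, 2))
pop_toy <- data.frame(prov = c("Ontario", "Quebec", "British Columbia"), pop = c(13, 8, 5))
inner_join(flu_toy, pop_toy, by = "prov")  # keeps only the names that match exactly
anti_join(flu_toy, pop_toy, by = "prov")   # rows of flu_toy with no match in pop_toy
```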
## Canadian Flu Rates with `dplyr` - Joining Two Tables with `inner_join()`