---
title: "Investigating objects and data patterns using base R" # potentially push to header
subtitle:  "EDUC 260A: Managing and Manipulating Data Using R"
author: 
date: 
classoption: dvipsnames  # for colors
fontsize: 8pt
urlcolor: blue
output:
  beamer_presentation:
    keep_tex: true
    toc: false
    slide_level: 3
    theme: default # AnnArbor # push to header?
    number_sections: true
    #colortheme: "dolphin" # push to header?
    #fonttheme: "structurebold"
    highlight: tango # Supported styles include "default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn", and "haddock" (specify null to prevent syntax highlighting); push to header
    df_print: default #default # tibble # push to header?    
    latex_engine: xelatex #  Available engines are pdflatex [default], xelatex, and lualatex; The main reasons you may want to use xelatex or lualatex are: (1) They support Unicode better; (2) It is easier to make use of system fonts.
    includes:
      in_header: ../beamer_header.tex
      #after_body: table-of-contents.txt 
---


```{r, echo=FALSE, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE)
#knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE)
  #comment = "#>" makes it so results from a code chunk start with "#>"; default is "##"
```

```{r, echo=FALSE, include=FALSE}
#THIS CODE DOWNLOADS THE MOST RECENT VERSION OF THE FILE beamer_header.tex AND SAVES IT TO THE DIRECTORY ONE LEVEL UP FROM THIS .RMD LECTURE FILE
download.file(url = 'https://raw.githubusercontent.com/anyone-can-cook/rclass1/master/lectures/beamer_header.tex', 
              destfile = '../beamer_header.tex',
              mode = 'wb')
```

```{r, echo=FALSE, include=FALSE, eval = FALSE}
# Download images saved on github site
imgs <- c('transform-logical.png','fp1.JPG', 'fp2.JPG')
for (i in imgs) {
  if(!file.exists(i)){
  download.file(url = paste0('https://raw.githubusercontent.com/anyone-can-cook/rclass1/master/lectures/patterns_base_r/', i), 
                destfile = i,
                mode = 'wb')
  }
}

# download images from Advanced R book

  # the 3 carriage train
  download.file(url = 'https://d33wubrfki0l68.cloudfront.net/1f648d451974f0ed313347b78ba653891cf59b21/8185b/diagrams/subsetting/train.png', 
                destfile = 'three_carriage_train.png',
                mode = 'wb')
  
  # the 1 car train vs. contents of car 1
  download.file(url = 'https://d33wubrfki0l68.cloudfront.net/aea9600956ff6fbbc29d8bd49124cca46c5cb95c/28eaa/diagrams/subsetting/train-single.png', 
                destfile = 'one_carriage_train_vs_contents.png',
                mode = 'wb')
  
  
  # different versions of smaller trains
  download.file(url = 'https://d33wubrfki0l68.cloudfront.net/ef5798a60926462b9fc080afb0145977eca70b83/039f5/diagrams/subsetting/train-multiple.png', 
                destfile = 'smaller_trains.png',
                mode = 'wb')
  

```


### Lecture outline

\tableofcontents
```{r, eval=FALSE, echo=FALSE}
#Use this if you want TOC to show level 2 headings
\tableofcontents
#Use this if you don't want TOC to show level 2 headings
\tableofcontents[subsectionstyle=hide/hide/hide]
```


# Investigate objects, base R

### Load .Rdata data frames we will use today

Data on off-campus recruiting events by public universities

- Data frame object `df_event`
    - One observation per university, recruiting event
- Data frame object `df_school`
    - One observation per high school (visited and non-visited)

```{r}
rm(list = ls()) # remove all objects in current environment

getwd()
#load dataset with one obs per recruiting event
load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_event_somevars.RData"))
#load("../../data/recruiting/recruit_event_somevars.Rdata")

#load dataset with one obs per high school
load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_school_somevars.RData"))
#load("../../data/recruiting/recruit_school_somevars.Rdata")
```

## Functions to describe objects

### Simple base R functions to describe objects

This section introduces some base R functions to describe objects (some of these you have seen before)

- list objects, `list.files()` and `ls()`
- remove objects, `rm()`
- object type, `typeof()`
- object length (number of elements), `length()`
- object structure, `str()`
- number of rows and columns, `ncol()` and `nrow()`

I use the functions `typeof()`, `length()`, `str()` anytime I encounter a new object

- Helps me understand the object before I start working with it

### Listing objects

__Files in your working directory__

`list.files()` function lists files in your current working directory

- if you run this code from a .Rmd file, the working directory is the folder where the .Rmd file is stored
```{r}
getwd() # what is your current working directory
list.files()
```

### Objects currently open in your R session

__Listing objects currently open in your R session__

`ls()` function lists objects currently open in R
```{r}
x <- "hello!"
ls() # Objects open in R
```

__Removing objects currently open in your R session__

`rm()` function removes specified objects open in R
```{r}
rm(x)
ls()
```

Command to remove all objects open in R (I don't run it)
```{r, eval=FALSE}
#rm(list = ls())
```

### Base R functions to describe objects, `typeof()`

`typeof()` function determines the internal storage type of an object (e.g., logical vector, integer vector, list)

- syntax
  - `typeof(x)`
- arguments
  - `x`: any R object
- help:
```{r, eval = FALSE}
?typeof
```

Examples

- Recall that a data frame is an object where __type__ is a list

```{r}
typeof(c(TRUE,TRUE,FALSE,NA))
typeof(df_event)
typeof(x = df_event)
```
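
### Sketch: `typeof()` on a few small objects

A minimal sketch, using made-up objects rather than the lecture data, of what `typeof()` reports for common storage types. The object name `toy_df` is hypothetical and exists only for this illustration.

```{r}
typeof(1.5)          # "double": numbers are stored as doubles by default
typeof(2L)           # "integer": the L suffix requests an integer
typeof("a")          # "character"
typeof(list(1, "a")) # "list"

# hypothetical toy data frame
toy_df <- data.frame(id = 1:3, state = c("CA", "NY", "TX"))
typeof(toy_df)       # "list": a data frame is stored as a list
typeof(toy_df$id)    # "integer": each column has its own type
```
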
### Base R functions to describe objects, `length()`

`length()` function determines the length of an R object

- for atomic vectors and lists, `length()` is the number of elements in the object
- syntax
  - `length(x)`
- arguments
  - `x`: any R object
- help:
```{r, eval = FALSE}
?length
```
  
Example, length of an atomic vector is the number of elements
```{r}
length(c(TRUE,TRUE,FALSE,NA))
```
Example, length of a list or data frame

- length of a list is the number of elements
- data frame is a list
- length of a data frame = number of elements = number of variables
```{r}
length(df_event) # = num elements = num columns
```
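
### Sketch: `length()` of a data frame vs. length of a variable

A minimal sketch on a made-up data frame (the name `toy_len` and its columns are hypothetical) showing that the length of a data frame is its number of columns, while the length of one column is the number of rows.

```{r}
# hypothetical toy data frame with 4 rows and 3 columns
toy_len <- data.frame(id = 1:4,
                      score = c(90, 85, NA, 70),
                      grp = c("a", "b", "a", "b"))

length(toy_len)                   # 3 = number of columns (elements of the list)
length(toy_len) == ncol(toy_len)  # TRUE
length(toy_len$score)             # 4 = number of rows (elements of the vector)
```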

### Base R functions to describe objects, `str()`


`str()` function compactly displays the structure of an R object

- "structure" includes type, length, and attribute of object and also nested objects
- syntax: `str(object)`
- arguments (partial)
  - `object`: any R object
  - `max.level`: max level of nesting to display nested structures; default `NA` = all levels
- help: `?str`
```{r, eval = FALSE, include = FALSE}
?str
```

Example, atomic vectors
```{r}
str(c(TRUE,TRUE,FALSE,NA))
str(object = c(TRUE,TRUE,FALSE,NA))
```
Example, lists/data frames (output omitted)
```{r, results = "hide"}
x <- list(c(1,2), list("apple", "orange"), list(2, 3)) # list
str(x)

str(df_event) # data frame
```
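
### Sketch: limiting nesting depth with `max.level`

A small sketch of the `max.level` argument described above, applied to a made-up nested list (the name `toy_nested` is hypothetical). It is meant only to show how the display changes, not as a recommended default.

```{r}
# hypothetical list with three levels of nesting
toy_nested <- list(a = 1:3, b = list(c = "apple", d = list(e = TRUE)))

str(toy_nested)                 # default: display all levels of nesting
str(toy_nested, max.level = 1)  # display only the top-level elements
```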


### Base R functions to describe objects, `ncol()` and `nrow()`

`ncol()` `nrow()`  and `dim()` functions

- Description
  - `ncol()` = number of columns; `nrow()` = number of rows
- syntax: `ncol(x)` `nrow(x)` `dim(x)`
- arguments 
  - `x`: a vector, array, data frame, or NULL
- value/return:
  - if object `x` is an atomic vector: `ncol()` and `nrow()` return `NULL`
  - if object `x` is a list but not a data frame: `ncol()` and `nrow()` return `NULL`
  - if object `x` is a data frame: `ncol()` and `nrow()` each return an integer vector of length 1


Example, object is a data frame

```{r}
ncol(df_event) # num columns = num elements = num variables
nrow(df_event) # num rows = num observations
# can wrap ncol() or nrow() within str() to see what functions return
#str(ncol(df_event))
```
Example, object is atomic vector or list that is not a data frame (output omitted)
```{r, results = "hide"}
ncol(c(TRUE,TRUE,FALSE,NA)) # atomic vector
x <- list(c(1,2), list("apple", "orange"), list(2, 3)) # list
nrow(x)
```

### Base R functions to describe objects, `dim()`

`dim()` function returns the dimensions of an object (e.g., number of rows and columns)

- syntax: `dim(x)`
- arguments 
  - `x`: a vector, array, data frame, or NULL
- value/return:
  - if object `x` is a data frame: `dim()` returns integer of length 2
    - first element = number of rows; second element = number of columns
  - if object `x` is an atomic vector: `dim()` returns `NULL` 
  - if object `x` is a list but not a data frame: `dim()` returns `NULL` 


Example, object is a data frame

```{r}
dim(df_event) # shows number rows by columns

str(dim(df_event)) # can wrap dim() within str() to see what functions return
```

Example, object is atomic vector or list that is not a data frame (output omitted)
```{r, results = "hide"}
dim(c(TRUE,TRUE,FALSE,NA)) # atomic vector
x <- list(c(1,2), list("apple", "orange"), list(2, 3)) # list
dim(x)
```
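
### Sketch: checking what `dim()` returns

A minimal sketch, on a made-up data frame named `toy_dim` (hypothetical), that checks the return values listed above with `is.null()` and simple comparisons.

```{r}
# hypothetical toy data frame with 4 rows and 2 columns
toy_dim <- data.frame(id = 1:4, grp = c("a", "b", "a", "b"))

dim(toy_dim)                      # 4 2: rows first, then columns
dim(toy_dim)[1] == nrow(toy_dim)  # TRUE
dim(toy_dim)[2] == ncol(toy_dim)  # TRUE

is.null(dim(c(TRUE, FALSE, NA)))  # TRUE: atomic vector has no dimensions
is.null(dim(list(1, "a")))        # TRUE: plain list has no dimensions
```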

## Variable names

### `names()` function

`names()` function gets or sets the names of elements of an object

- syntax: 
  - get the names of an object: `names(x)`
  - set the names of an object: `names(x) <- value`
- arguments (partial)
  - `x`: an R object
  - `value`: a character vector with same length as object `x` or `NULL`
- value/return
  - `names(x)` returns a character vector of length = `length(x)` in which each element is the name of the element of `x`

Example, get names (of atomic vector)
```{r}
a <- c(v1=1,v2=2,3,v4="hi!") # named atomic vector
a 
length(a)
names(a)
length(names(a)) # investigate length of object names(a)
str(names(a)) # investigate structure of object names(a)
```
### `names()` function

`names()` function gets or sets the names of elements of an object

- syntax: 
  - get the names of an object: `names(x)`
  - set the names of an object: `names(x) <- value`
- arguments (partial)
  - `x`: an R object
  - `value`: a character vector with same length as object `x` or `NULL`
- value/return
  - `names(x)` returns a character vector of length = `length(x)` in which each element is the name of the element of `x`

Example, set names (of atomic vector)
```{r}
names(a) <- NULL # set names of vector a to NULL
a
names(a)

names(a) <- c("var1","var2","var3","var4") # set names of vector a
a
names(a)
```


### Applying `names()` function to a data frame


Recall that a data frame is an object where __type__ is a __list__ and each __element__ is __named__

- each element is a variable
- each element name is a variable name

Example (output omitted)
```{r, results = "hide"}
names(df_event)
```

Investigate the object `names(df_event)`
```{r}
typeof(names(df_event)) # type = character vector
length(names(df_event)) # length = number of variables in data frame
str(names(df_event)) # structure of names(df_event)
```
We can even assign a new object based on `names(df_event)`
```{r}
names_event <- names(df_event)
typeof(names_event) # type = character vector
length(names_event) # length = number of variables in data frame
str(names_event) # structure of names(df_event)
```
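
### Sketch: setting names of a data frame

The `names(x) <- value` syntax also works on data frames. Below is a minimal sketch on a made-up data frame (the name `toy_names` and the new column name are hypothetical) that renames a single variable by position.

```{r}
# hypothetical toy data frame
toy_names <- data.frame(v1 = 1:2, v2 = c("a", "b"))
names(toy_names)

names(toy_names)[2] <- "letter"  # rename only the second variable
names(toy_names)
```

Assigning to one position of `names()` leaves the other variable names unchanged.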

### Variable names

Refer to specific named elements of an object using this syntax:

- `object_name$element_name`

When object is data frame, refer to specific variables using this syntax: 

- `data_frame_name$varname`
- __This approach to isolating variables is very useful for investigating data__

```{r}
#df_event$instnm
typeof(df_event$instnm)
typeof(df_event$med_inc)
```
### Variable names

\medskip Data frames are lists with the following criteria:

- each element of the list is (usually) a vector; each element of list is a variable
- length of data frame = number of variables
```{r}
length(df_event)
nrow(df_event)
#str(df_event)
```

- each element of the list (i.e., variable) has the same length
    - Length of each variable is equal to number of observations in data frame

```{r}
typeof(df_event$event_state)
length(df_event$event_state)
str(df_event$event_state)

typeof(df_event$med_inc)
length(df_event$med_inc)
str(df_event$med_inc)
```

### Variable names

The object `df_school` has one obs per high school

- variable `visits_by_100751` shows the number of visits by University of Alabama to each high school
- like all variables in a data frame, the var `visits_by_100751` is just a vector
```{r}
typeof(df_school$visits_by_100751)
length(df_school$visits_by_100751) # num elements in vector = num obs
str(df_school$visits_by_100751)
sum(df_school$visits_by_100751) # sum of values of var across all obs
```
We perform calculations on a variable like we would on any vector of same type
```{r}
v <- c(2,4,6)
typeof(v)
length(v)
sum(v)
```
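
### Sketch: calculations on a variable in a toy data frame

A minimal sketch with made-up data (the data frame `toy_visits` and its values are hypothetical, not from `df_school`) showing a few calculations that treat a variable as an ordinary numeric vector.

```{r}
# hypothetical data: number of visits received by four schools
toy_visits <- data.frame(school = c("A", "B", "C", "D"),
                         visits = c(0, 2, 1, 0))

sum(toy_visits$visits)      # 3: total visits across schools
mean(toy_visits$visits)     # 0.75: average visits per school
max(toy_visits$visits)      # 2: most visits received by one school
sum(toy_visits$visits > 0)  # 2: number of schools with at least one visit
```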

## View and print data

### Viewing and printing, data frames

Many ways to view/print a data frame object. Here are three ways:

1. Simply type the object name (output omitted)

    - number of rows and columns printed depends on YAML header settings and on object "attributes" (attributes discussed in future week)
```{r, results="hide"}
df_event
```
    
2. Use the `View()` function to view data in a spreadsheet-style data viewer
```{r eval=FALSE}
View(df_event)
```
3. `head()` to show the first _n_ rows. The default is 6 rows.
```{r results="hide"}
#?head
#head(df_event)
head(df_event, n=5)
```

### Viewing and printing, data frames

`obj_name[<rows>,<cols>]` to print specific rows and columns of data frame

- particularly powerful when combined with sequences (e.g., `1:10`)

\medskip Examples (output omitted):

- Print first five rows, all vars
```{r results="hide"}
df_event[1:5, ]
```
- Print first five rows and first three columns
```{r results="hide"}
df_event[1:5, 1:3]
```
- Print first three columns of the 100th observation
```{r results="hide"}
df_event[100, 1:3]
```
- Print the 50th observation, all variables
```{r results="hide"}
df_event[50,]
```
### Viewing and printing, variables within data frames

Recall that:

- `obj_name$var_name` prints specific elements (i.e., variables) of a data frame
```{r results="hide"}
df_event$zip
```
- each element (i.e., variable) of data frame is an __atomic vector__ with  __length__ = number of observations

```{r}
typeof(df_event$zip)
length(df_event$zip)
```
- each element of a variable is the value of the variable for one observation

\medskip

Print specific elements (i.e., observations) of variable based on element position

- syntax: `obj_name$var_name[<element position>]`
- vectors don't have "rows" or "columns"; they just have elements
- syntax combined with sequences (e.g., print first 10 observations)
```{r}
df_event$event_state[1:10] # print obs 1-10 of variable "event_state"
df_event$event_type[6:10] # print obs 6-10 of variable "event_type"
```

### Viewing and printing, variables within data frames

Print specific elements (i.e., observations) of variable based on element position

- syntax: `obj_name$var_name[<element position>]`

Example, print individual elements
```{r}
df_event$zip[1:5] # print obs 1-5 of variable for event zip code
df_event$zip[1] # print obs 1 of variable for event zip code
df_event$zip[5] # print obs 5 of variable for event zip code
df_event$zip[c(1,3,5)] # print obs 1, 3, and 5 of variable for event zip code
```

Print specific elements of multiple variables using combine function `c()`

- syntax: `c(obj_name$var1_name[<element position>], obj_name$var2_name[<element position>],...)`
- Example: print first five observations of variables `"event_state"` and `"event_type"`
```{r}
c(df_event$event_state[1:5],df_event$event_type[1:5])
```
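
### Sketch: `c()` returns one atomic vector

One thing to keep in mind with the approach above: `c()` returns a single atomic vector, so combining variables of different types coerces them to a common type. A minimal sketch with made-up data (the names `toy_c` and `combined` are hypothetical):

```{r}
# hypothetical toy data frame; stringsAsFactors = FALSE keeps name as character
toy_c <- data.frame(name = c("A", "B", "C"),
                    visits = c(0, 2, 1),
                    stringsAsFactors = FALSE)

combined <- c(toy_c$name[1:2], toy_c$visits[1:2])
combined          # "A" "B" "0" "2": numbers were coerced to character
typeof(combined)  # "character"

toy_c[1:2, c("name", "visits")]  # by contrast, [rows, cols] keeps each type
```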


### Exercise


Printing exercise using the df_school data frame

1. Use the `obj_name[<rows>,<cols>]` syntax to print the first 5 rows and 3 columns of the `df_school` data frame
1. Use the `head()` function to print the first 4 observations
1. Use the `obj_name$var_name[1:10]` syntax to print the first 10 observations of a variable in the `df_school` data frame
1. Use the combine function `c()` to print the first 3 observations of variables "school_type" & "name"

### Solution

1. Use the `obj_name[<rows>,<cols>]` syntax to print the first 5 rows and 3 columns of the `df_school` data frame
```{r}
df_school[1:5,1:3]
```

### Solution
2. Use the `head()` function to print the first 4 observations
```{r}
head(df_school, n=4)
```

### Solution
3. Use the `obj_name$var_name[1:10]` syntax to print the first 10 observations of a variable in the `df_school` data frame
```{r}
df_school$name[1:10]
```

### Solution
4. Use the combine function `c()` to print the first 3 observations of variables "school_type" & "name"
```{r}
c(df_school$school_type[1:3],df_school$name[1:3])
```


## Missing values

### Missing values

Missing values have the value `NA`

- `NA` is a special keyword, not the same as the character string `"NA"`

use `is.na()` function to determine if a value is missing

- `is.na()` returns a logical vector
```{r}
is.na(5)
is.na(NA)
is.na("NA")
typeof(is.na("NA")) # example of a logical vector

nvector <- c(10,5,NA)
is.na(nvector)
typeof(is.na(nvector)) # example of a logical vector

svector <- c("e","f",NA,"NA")
is.na(svector)
```
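
### Sketch: counting missing values with `is.na()`

Because `is.na()` returns a logical vector, and `TRUE` counts as 1 in arithmetic, it can be combined with `sum()` and `mean()` to count missing values. A minimal sketch on a made-up vector (`toy_na` is a hypothetical name):

```{r}
toy_na <- c(10, NA, 3, NA, 7)  # hypothetical vector with two missing values

is.na(toy_na)
sum(is.na(toy_na))   # 2: number of missing values
mean(is.na(toy_na))  # 0.4: share of values that are missing
sum(!is.na(toy_na))  # 3: number of non-missing values
```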

### Missing values are "contagious"

What does "contagious" mean?

- operations involving a missing value will yield a missing value

```{r}
7>5
7>NA
sum(1,2,NA)
0==NA
2*c(0,1,2,NA)
NA*c(0,1,2,NA)
```
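
### Sketch: removing missing values with `na.rm`

Many summary functions in base R, including `sum()` and `mean()`, take an `na.rm` argument that drops missing values before calculating. A minimal sketch on a made-up vector (`toy_inc` is a hypothetical name):

```{r}
toy_inc <- c(50, 60, NA, 70)  # hypothetical vector with one missing value

sum(toy_inc)                 # NA: the missing value is contagious
sum(toy_inc, na.rm = TRUE)   # 180: NA removed before summing
mean(toy_inc, na.rm = TRUE)  # 60: mean of the three non-missing values
```
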
###  Functions and missing values example, `table()`

`table()` function is useful for investigating categorical variables
```{r}
str(df_event$event_type)
table(df_event$event_type)
```
###  Functions and missing values example, `table()`

By default `table()` ignores `NA` values
```{r}
#?table
str(df_event$school_type_pri)
table(df_event$school_type_pri)
```

`useNA` argument controls if table includes counts of `NA`s. Allowed values:

- never ("no") [DEFAULT VALUE]
- only if count is positive ("ifany");
- even for zero counts ("always")
```{r}
nrow(df_event)
table(df_event$school_type_pri, useNA="always")
```
Broader point: Most functions that create descriptive statistics have options about how to treat missing values

- When investigating data, good practice to _always_ show missing values
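
### Sketch: checking that `table()` counts add up

A minimal sketch on a made-up character vector (`toy_type` is a hypothetical name) showing one way to check whether a table accounts for every observation: compare the sum of the counts to the length of the variable.

```{r}
toy_type <- c("public", "private", NA, "public", NA)  # hypothetical data

table(toy_type)                   # NA values dropped by default
table(toy_type, useNA = "ifany")  # NA values counted

sum(table(toy_type)) == length(toy_type)                   # FALSE: 3 vs. 5
sum(table(toy_type, useNA = "ifany")) == length(toy_type)  # TRUE
```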


# Subsetting using subset operators

### Subsetting to Extract Elements 

"Subsetting" refers to isolating particular elements of an object 

\medskip
Subsetting operators can be used to select/exclude elements (e.g., variables, observations)

- there are three subsetting operators: `[]`, `$` , `[[]]` 
- these operators function differently based on vector types (e.g., atomic vectors, lists, data frames)

### Wickham refers to number of "dimensions" in R objects

An atomic vector is a 1-dimensional object that contains n elements
```{r}
x <- c(1.1, 2.2, 3.3, 4.4, 5.5)
str(x)
```
    
Lists are multi-dimensional objects

- Contains n elements; each element may contain a 1-dimensional atomic vector or a multi-dimensional list. Below list contains 3 dimensions
```{r}
list <- list(c(1,2), list("apple", "orange"))
str(list)
```
Data frames are 2-dimensional lists

- each element is a variable (dimension=columns)
- within each variable, each element is an observation (dimension=rows)
```{r}
ncol(df_school)
nrow(df_school)
```


## Subset atomic vectors using []

### Subsetting elements of atomic vectors

"Subsetting" a vector refers to isolating particular elements of a vector

- I sometimes refer to this as "accessing elements of a vector"
- subsetting elements of a vector is similar to "filtering" rows of a data-frame
- `[]` is the subsetting function for vectors

Six ways to subset an atomic vector using `[]`

1. Using positive integers to return elements at specified positions
2. Using negative integers to exclude elements at specified positions
3. Using logicals to return elements where corresponding logical is `TRUE`
4. Empty `[]` returns original vector (useful for dataframes)
5. Zero vector `[0]`, useful for generating test data
6. If vector is "named," use character vectors to return elements with matching names


### 1. Using positive integers to return elements at specified positions (subset atomic vectors using [])

Create atomic vector `x`
```{r}
(x <- c(1.1, 2.2, 3.3, 4.4, 5.5))
str(x)
```

`[]` is the subsetting function for vectors

- contents inside `[]` can refer to element number (also called "position"). 
    - e.g., `[3]` refers to contents of 3rd element (or position 3)

```{r}
x[5] #return 5th element

x[c(3, 1)] #return 3rd and 1st element

x[c(4,4,4)] #return 4th element, 4th element, and 4th element

#Return 3rd through 5th element
x[3:5]
```


### 2. Using negative integers to exclude elements at specified positions (subset atomic vectors using [])

Before excluding elements based on position, investigate object
```{r}
x

length(x)
str(x)
```

Use negative integers to exclude elements based on element position
```{r}
x[-1] # exclude 1st element

x[c(3,1)] # 3rd and 1st element
x[-c(3,1)] # exclude 3rd and 1st element
```


### 3. Using logicals to return elements where corresponding logical is `TRUE` (subset atomic vectors using [])

```{r}
x
```

When using `x[y]` to subset `x`, good practice to have `length(x)==length(y)`
```{r}
length(x) # length of vector x
length(c(TRUE,FALSE,TRUE,FALSE,TRUE)) # length of y
length(x) == length(c(TRUE,FALSE,TRUE,FALSE,TRUE)) # condition true
x[c(TRUE,TRUE,FALSE,FALSE,TRUE)]
```

Recycling rules:

- in `x[y]`, if `x` is different length than `y`, R "recycles" length of shorter to match length of longer

```{r}
length(c(TRUE,FALSE))
x
x[c(TRUE,FALSE)]
```
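
### Sketch: what recycling does to a short logical index

A minimal sketch of the recycling rule above, using a made-up vector (`toy_vec` is a hypothetical name). The call to `rep()` is only there to spell out the full-length index that the short index `c(TRUE,FALSE)` is treated as.

```{r}
toy_vec <- c(1.1, 2.2, 3.3, 4.4, 5.5)

# the short index, repeated until it is as long as toy_vec
rep(c(TRUE, FALSE), length.out = length(toy_vec))  # TRUE FALSE TRUE FALSE TRUE

toy_vec[c(TRUE, FALSE)]  # 1.1 3.3 5.5: elements 1, 3, and 5

# same result as writing out the full-length index
identical(toy_vec[c(TRUE, FALSE)],
          toy_vec[c(TRUE, FALSE, TRUE, FALSE, TRUE)])  # TRUE
```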


### 3. Using logicals to return elements where corresponding logical is `TRUE` (subset atomic vectors using [])

```{r}
x
```

Note that a missing value (`NA`) in the index always yields a missing value in the output:

```{r}
x[c(TRUE, FALSE, NA, TRUE, NA)]
```

Return all elements of object `x` where element is greater than 3:

```{r}
x # print object X
x>3 # for each element of X, print T/F whether element value > 3
str(x>3)
x[x>3] # prints only the values that had TRUE at that position
```
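
### Sketch: logical subsetting when the vector contains `NA`

When the vector being tested contains missing values, the condition itself contains `NA`, and those `NA`s carry through to the result. Below is a minimal sketch on a made-up vector (`toy_cond` is a hypothetical name) with two common ways to drop them; which one to use is a matter of preference.

```{r}
toy_cond <- c(1, 5, NA, 8)  # hypothetical vector with one missing value

toy_cond > 3                               # FALSE TRUE NA TRUE
toy_cond[toy_cond > 3]                     # 5 NA 8: NA in index gives NA in output
toy_cond[which(toy_cond > 3)]              # 5 8: which() keeps only TRUE positions
toy_cond[!is.na(toy_cond) & toy_cond > 3]  # 5 8: explicitly require non-missing
```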

### 3. Using logicals to return elements where corresponding logical is `TRUE` (subset atomic vectors using []) [cont.]

The `visits_by_100751` column shows how many visits the University of Alabama made to each school. Let's subset this vector to keep only values greater than 2 (i.e., schools that received more than 2 visits):

```{r}
df_school$visits_by_100751[1:100]
df_school$visits_by_100751[1:100]>2
df_school$visits_by_100751[df_school$visits_by_100751>2]
```


### 4. Empty `[]` returns original vector (subset atomic vectors using [])


```{r}
x

x[]
```

This is useful for sub-setting data frames, as we will show below

### 5. Zero vector [0] (subset atomic vectors using [])

Zero vector, `x[0]`

- R has no element 0, so `x[0]` returns a zero-length vector of the same type as `x`
```{r}
x[0]
```

Wickham states:

- "This is not something you usually do on purpose, but it can be helpful for generating test data."
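
### Sketch: what `[0]` returns

A minimal sketch on a made-up vector (`toy_zero` and `empty` are hypothetical names) that inspects the object returned by subsetting with `0`.

```{r}
toy_zero <- c(1.1, 2.2, 3.3)

empty <- toy_zero[0]
empty          # numeric(0)
length(empty)  # 0: no elements
typeof(empty)  # "double": type of the original vector is kept
```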


### 6. If vector is named, character vectors to return elements with matching names (subset atomic vectors using [])


Create vector `y` that has values of vector `x` but each element is named
```{r}
x

(y <- c(a=1.1, b=2.2, c=3.3, d=4.4, e=5.5))
```
Return elements of vector based on name of element

- enclose element names in single `''` or double `""` quotes
```{r}
#show element named "a"
y["a"]

#show elements "a", "b", and "d"
y[c("a", "b", "d" )]
```

## Subsetting lists/data frames using []

### Subsetting lists using []

Using `[]` operator to subset lists works the same as subsetting atomic vector

- Using `[]` with a list always returns a list


```{r}
list_a <- list(list(1,2),3,"apple")
str(list_a)

#create new list that consists of elements 3 and 1 of list_a
list_b <- list_a[c(3, 1)]
str(list_b)

#show elements 3 and 1 of object list_a
#str(list_a[c(3, 1)])
```

### Subsetting data frames using []

Recall that a data frame is just a particular kind of list

- each element = a column = a variable

Using `[]` with a list always returns a list

- Using `[]` with a data frame always returns a data frame

Two ways to use `[]` to extract elements of a data frame

1. use "single index" `df_name[<columns>]` to extract columns (variables) based on element position number (i.e., column number)
1. use "double index" `df_name[<rows>, <columns>]` to extract particular rows and columns of a data frame

### Subsetting data frames using [] to extract columns (variables) based on element position

Use "single index" `df_name[<columns>]` to extract columns (variables) based on element number (i.e., column number)

\medskip

Examples [output omitted]
```{r, results="hide"}
names(df_event)

#extract elements 1 through 4 (elements=columns=variables)
df_event[1:4]
df_event[c(1,2,3,4)]

str(df_event[1:4])
#extract columns 13 and 7
df_event[c(13,7)]
```

### Subsetting Data Frames to extract columns (variables) and rows (observations) based on positionality

use "double index" syntax `df_name[<rows>, <columns>]` to extract particular rows and columns of a data frame

- often combined with sequences (e.g., `1:10`)


```{r}
#Return rows 1-3 and columns 1-4
df_event[1:3, 1:4]

#Return rows 50-52 and columns 10 and 20
df_event[50:52, c(10,20)]
```

### Subsetting Data Frames to extract columns (variables) and rows (observations) based on positionality

use "double index" syntax `df_name[<rows>, <columns>]` to extract particular rows and columns of a data frame

\medskip

recall that empty `[]` returns original object (output omitted)
```{r results="hide"}
#return original data frame
df_event[]

#return specific rows and all columns (variables)
df_event[1:5, ]

#return all rows and specific columns (variables)
df_event[, c(1,2,3)]
```

### Use [] to extract data frame columns based on variable names

Select columns from a data frame by subsetting with `[]` and a character vector of element names (i.e., variable names) enclosed in quotes

\medskip

"single index" approach extracts specific variables, all rows (output omitted)
```{r, results="hide"}
df_event[c("instnm", "univ_id", "event_state")] 
```

"Double index" approach extracts specific variables and specific rows

- syntax `df_name[<rows>, <columns>]`

```{r}
df_event[1:5, c("instnm", "event_state", "event_type")] 
```
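
### Sketch: selecting a single column with "double index"

One caveat worth knowing: for a base R `data.frame`, "double index" subsetting that selects a single column simplifies the result to a vector unless `drop = FALSE` is supplied. Tibbles do not simplify in this way. A minimal sketch on a made-up base R data frame (`toy_drop` is a hypothetical name):

```{r}
# hypothetical base R data frame (not a tibble)
toy_drop <- data.frame(id = 1:3,
                       state = c("CA", "NY", "TX"),
                       stringsAsFactors = FALSE)

str(toy_drop["state"])                  # single index: data frame with 1 variable
str(toy_drop[, "state"])                # double index: simplified to a vector
str(toy_drop[, "state", drop = FALSE])  # double index, drop = FALSE: data frame
```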

### Student exercises

Use subsetting operators from base R in extracting columns (variables), observations:

1. Use both "single index" and "double index" in subsetting to create a new dataframe by extracting the columns `instnm`, `event_date`, `event_type` from the `df_event` data frame. And show what columns (variables) are in the newly created dataframe. 

2. Use subsetting to return rows 1-5 of columns `state_code`, `name`, `address` from the `df_school` data frame.


### Solution to Student Exercises

Solution to 1

__base R__ using subsetting operators
```{r}
# single index
df_event_br <- df_event[c("instnm", "event_date", "event_type")]
#double index
df_event_br <- df_event[, c("instnm", "event_date", "event_type")]
names(df_event_br)
```

Solution to 2

__base R__ using subsetting operators
```{r}
df_school[1:5, c("state_code", "name", "address")]
```

## Subsetting lists/data frames using [[]] and $

### Subset single element from object using [[]] operator, atomic vectors

So far we have used `[]` to extract elements from an object

- Apply `[]` to atomic vector: returns atomic vector with elements you requested
- Apply `[]` to list: returns list with elements you requested

`[[]]` also extract elements from an object

- Applying `[[]]` to atomic vector gives same result as `[]`; that is, an atomic vector with element you request
```{r}
(x <- c(1.1, 2.2, 3.3, 4.4, 5.5))

str(x[3]) 

str(x[[3]])
```

- Caveat: when applying `[[]]` to atomic vector, you can only subset a single element
```{r}
x[c(3,4)] # single bracket; this works

#x[[c(3,4)]] # double bracket; this won't work
```
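
### Sketch: `[]` vs. `[[]]` on a named atomic vector

One visible difference between `[]` and `[[]]` on atomic vectors shows up when the vector is named: `[]` keeps the element name, while `[[]]` returns the value without the name. A minimal sketch on a made-up vector (`toy_named` is a hypothetical name):

```{r}
toy_named <- c(a = 1.1, b = 2.2, c = 3.3)

toy_named["b"]           # value printed with its name
toy_named[["b"]]         # value only
names(toy_named["b"])    # "b"
names(toy_named[["b"]])  # NULL
```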

### Subsetting lists using `[]` vs. `[[]]`, introduce "train metaphor"

Applying `[[]]` to a list

- Understanding what `[]` vs. `[[]]` does to a list is very important but requires some explanation!

_Advanced R_ [chapter 4.3](https://adv-r.hadley.nz/subsetting.html#subset-single) by Wickham uses the "train metaphor" to explain a list vs. **contents** of a list and how this relates to `[]` vs. `[[]]`


Below code chunk makes a list named `list_x` that contains 3 elements
```{r}
list_x <- list(1:3, "a", 4:6) # create list object list_x
```

In our train metaphor, object `list_x` is a train that contains 3 carriages

[![](three_carriage_train.png)](https://adv-r.hadley.nz/subsetting.html#subset-single)

### Subsetting lists using `[]` vs. `[[]]`, introduce "train metaphor"

list object `list_x` is a train that contains 3 carriages

```{r, out.width = "45%", echo = FALSE}
library(knitr)
include_graphics("three_carriage_train.png")
#[![](three_carriage_train.png)](https://adv-r.hadley.nz/subsetting.html#subset-single)
```

When we "subset a list" -- that is, extract one or more elements from the list -- we have two broad choices (image below)

```{r, out.width = "45%", echo = FALSE}
library(knitr)
include_graphics("one_carriage_train_vs_contents.png")
# [![](one_carriage_train_vs_contents.png)](https://adv-r.hadley.nz/subsetting.html#subset-single)
```


1. Extracting elements using `[]` always returns a list, usually one with fewer elements
    - you can think of this as a train with fewer carriages
```{r}
#str(list_x)
str(list_x[1]) # returns a list
```
2. Extracting element using `[[]]` returns **_contents_** of particular carriage
    - I say applying `[[]]` to a list or data frame returns a simpler object that moves up one level of hierarchy
```{r}
str(list_x[[1]]) # returns an atomic vector
```


### Subset lists using `[]` vs. `[[]]`, deepen understanding of `[]`

<!-- Use train metaphor to deepen understanding of using `[]` to subset the list object `list_x` -->

Rules about applying subset operator `[]` to a list

- Applying `[]` to a list always returns a list
- Resulting list contains zero or more elements, depending on what is typed inside `[]`

Here is a list object named `list_x`
```{r}
list_x <- list(1:3, "a", 4:6)
```

Here is an image of a few "trains" that can be created by applying `[]` to `list_x`
<!-- [![](smaller_trains.png)](https://adv-r.hadley.nz/subsetting.html#subset-single){height=50%} -->

```{r, out.width = "45%", echo = FALSE}
library(knitr)
include_graphics("smaller_trains.png")
```

And here is code to create the "trains" shown in above image (output omitted)
```{r, results = "hide"}
list_x[1:2]
list_x[-2]
list_x[c(1,1)]
list_x[0]
list_x[] # returns the original list; not shown in above train picture
```

### Subset lists using `[]` vs. `[[]]`, deepen understanding of `[[]]`

Rules about applying subset operator `[[]]` to a list

- Can apply `[[]]` to return the **contents** of a **single element** of a list


Create list `list_x` and show "train" image of applying `list_x[1]` vs. `list_x[[1]]`

```{r}
list_x <- list(1:3, "a", 4:6)
```

```{r, out.width = "45%", echo = FALSE}
library(knitr)
include_graphics("one_carriage_train_vs_contents.png")
```


Object created by `list_x[1]` is a list with one element (output omitted)
```{r, results = "hide"}
list_x[1]
str(list_x[1])
```

Object created by `list_x[[1]]` is a vector with 3 elements (output omitted)

- `list_x[[1]]` gives us "contents" of element 1
- Since element 1 contains a numeric vector, object created by `list_x[[1]]` is a numeric vector
```{r, results = "hide"}
list_x[[1]]
str(list_x[[1]])
```

### Subset lists using `[]` vs. `[[]]`, deepen understanding of `[[]]`

Rules about applying subset operator `[[]]` to a list

- Can apply `[[]]` to return the **contents** of a **single element** of a list

```{r}
list_x <- list(1:3, "a", 4:6) # create list list_x
```


We cannot use `[[]]` to subset multiple elements of a list (output omitted)

- e.g., we could write `list_x[[2]]` but not `list_x[[2:3]]`
```{r, eval = FALSE}
list_x[[c(2)]] # this works, subset element 2 using [[]]
list_x[[c(2,3)]] # this doesn't work; subset element 2 and 3 using [[]]
list_x[c(2,3)] # this works; subset element 2 and 3 using []
```
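
### Sketch: how R reads a vector inside `[[]]`

Some background on why `list_x[[c(2,3)]]` fails: when `[[]]` is given a vector of positions on a list, R applies the positions one after another (element 3 *within* element 2) rather than selecting elements 2 and 3. In `list_x`, element 2 has only one element, so the request is out of bounds. A minimal sketch on a made-up nested list (`toy_rec` is a hypothetical name) where the same syntax does return something:

```{r}
# hypothetical list whose 2nd element is itself a list of 3 elements
toy_rec <- list(1:3, list("a", "b", "c"))

toy_rec[[c(2, 3)]]                                # "c": 3rd element of 2nd element
identical(toy_rec[[c(2, 3)]], toy_rec[[2]][[3]])  # TRUE
```
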
### Subset lists using `[]` vs. `[[]]`, deepen understanding of `[[]]`

Like `[]`, can use `[[]]` to return contents of __named__ elements specified using quotes

- syntax: `obj_name[["element_name"]]`

```{r}
list_x <- list(var1=1:3, var2="a", var3=4:6) # create list with named elements
```

Subset list `list_x` using `[[]]` and element names
```{r}
list_x[["var1"]] # equivalent to subsetting by position: list_x[[1]]
str(list_x[["var1"]])
str(list_x["var1"]) # note: the element name belongs to the list, not to the atomic vector inside it
```

Can do same thing with data frames because data frames are lists (output omitted)

- e.g., `df_event[["zip"]]` returns contents of element named `"zip"`
- object created by `df_event[["zip"]]` is character vector of length = 18,680
```{r, results='hide'}
# df_event[["zip"]] # this works but long output
str(df_event[["zip"]]) # character vector of length 18,680
typeof(df_event[["zip"]])
length(df_event[["zip"]])
str(df_event["zip"]) # by contrast, this is a dataframe w/ one variable
```


### General rules of applying `[]` vs `[[]]` to (nested) objects

What we just learned about applying `[]` vs `[[]]` to lists applies more generally to "nested objects"

- "nested objects" are objects with a hierarchical structure such that an element of an object contains another object


General rules of applying `[]` vs. `[[]]` to nested objects

- subsetting any object `x` using `[]` returns an object with the same data structure as `x`
- subsetting any object `x` using `[[]]` returns an object that may or may not have the same data structure as `x`
  - if `x` is not a nested object, then applying `[[]]` to a single element of `x` returns an object with the same data structure as `x`
  - if `x` has a nested data structure, then applying `[[]]` to a single element of `x` will "move one level into the hierarchy" to extract the **contents** of that element of `x`

When working w/ data frames, functions that perform calculations typically expect atomic vectors (think `[[]]`), not lists (think `[]`)
```{r}
mean(df_event[['med_inc']], na.rm = TRUE)
# mean(df_event['med_inc'], na.rm = TRUE) # by contrast, this doesn't work
```
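
### General rules of applying `[]` vs `[[]]`, toy example

The rules above can be checked on a small data frame. This is only a sketch: `toy_df` and its values are made up for illustration and are not the lecture data.

```{r}
# toy_df is a hypothetical data frame used only for this illustration
toy_df <- data.frame(
  med_inc = c(40000, 85000, NA),
  zip     = c("90001", "94704", "10001")
)

class(toy_df["med_inc"])    # "data.frame": [] keeps the container
class(toy_df[["med_inc"]])  # "numeric": [[]] extracts the contents

mean(toy_df[["med_inc"]], na.rm = TRUE)  # works on the atomic vector
```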


### Subset lists/data frames using $


```{r}
list_x <- list(var1=1:3, var2="a", var3=4:6)
```

`obj_name$element_name` is shorthand operator for `obj_name[["element_name"]]`


These three lines of code all give the same result
```{r}
list_x[[1]]
list_x[["var1"]]
list_x$var1
```

`df_name$var_name`: easiest way in base R to refer to variable in a data frame

- these two lines of code are equivalent
```{r}
str(df_event[["zip"]])
str(df_event$zip)
```
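
### Subset lists/data frames using $, two caveats

A short sketch of two differences between `$` and `[[]]` that are easy to trip over: `$` takes the element name literally (it cannot use a name stored in another object), and on plain lists `$` allows partial matching of names while `[[]]` matches exactly by default. The list `toy_list` and the object `wanted` are hypothetical names used only here.

```{r}
toy_list <- list(value = 1:3, label = "a")  # illustrative list

# [[]] accepts a name stored in another object; $ takes the name literally
wanted <- "value"
toy_list[[wanted]]  # 1 2 3
toy_list$wanted     # NULL: no element is literally named "wanted"

# $ partially matches names on lists; [[]] requires an exact match
toy_list$val        # 1 2 3 (partial match on "value")
toy_list[["val"]]   # NULL
```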


## Subset Data frames by combining [] and $

### Subset Data Frames by combining `[]` and `$`, Motivation

Motivation

- When working with data frames we often want to isolate those observations that satisfy certain conditions
- This is often referred to as "filtering"
  - We filter observations based on the values of one or more variables
- Perhaps you have seen "filtering" in Microsoft Excel
  - open some spreadsheet that contains variables (columns) and observations (rows)
  - click on `Data` >> `Filter` and then filter observations based on values of variable(s)


Filtering example using data frame `df_school`

- Observations: 
  - One observation per high school (public and private)
- Variables: 
  - high school characteristics; number of off-campus recruiting visits from particular universities
  - NCES ID for UC Berkeley is `110635`
  - variable `visits_by_110635` shows number of visits a high school received from UC Berkeley
- **Task**:
  - Isolate observations where the high school received at least 1 visit from UC Berkeley
  
  
### Subset Data Frames by combining `[]` and `$`

**Task**:

- Isolate obs where school received at least 1 visit from UC Berkeley

General syntax: `df_name[df_name$var_name <condition>, ]`

- where `<condition>` is something that evaluates to `TRUE` or `FALSE` for each element of the atomic vector (i.e., variable)
- Note that syntax uses "double index" `df_name[<rows>, <columns>]` syntax
  - Therefore, the `<condition>` in above syntax is isolating `<rows>`
- __Cannot__ use "single index" syntax `df_name[<columns>]`


Solution to task (output omitted)

- Note: below code filters observations but keeps all variables
```{r results="hide"}
df_school[df_school$visits_by_110635 >= 1, ]
```

### Subset Data Frames by combining `[]` and `$`, decompose syntax

**Task**: Isolate obs where school received at least 1 visit from UC Berkeley

- general syntax: `df_name[df_name$var_name <condition>, ]`
- solution: `df_school[df_school$visits_by_110635 >= 1, ]`

```{r results="hide", include = FALSE}
df_school[df_school$visits_by_110635 >= 1, ]
```

Decomposing syntax `df_school[df_school$visits_by_110635 >= 1, ]`

- `df_school$visits_by_110635 >= 1`: returns a logical (`TRUE`/`FALSE`) atomic vector with length equal to number of obs in `df_school`

```{r results="hide"}
typeof(df_school$visits_by_110635 >= 1)
length(df_school$visits_by_110635 >= 1)
str(df_school$visits_by_110635 >= 1)
```
- `df_school[df_school$visits_by_110635 >= 1, ]`
  - uses "double index" `df_name[<rows>, <columns>]` syntax to extract rows, columns
  - rows: extract rows where `df_school$visits_by_110635 >= 1` is `TRUE`
  - columns: since `<columns>` is empty, extracts all columns
- __key point__: `df_name[df_name$var_name <condition>, ]` is "subset a vector approach #3": "Using logicals to return elements where condition `TRUE`"
- example using atomic vectors (output omitted)
```{r results="hide"}
x <- c(1.1, 2.2, 3.3, 4.4, 5.5)
x[x>3]
```
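
### Subset Data Frames by combining `[]` and `$`, toy example

The decomposition above can be traced step by step on a data frame small enough to read by eye. This is a sketch with made-up data: `toy_school`, its columns, and `keep_row` are hypothetical names.

```{r}
# toy_school is an illustrative stand-in for df_school
toy_school <- data.frame(
  name   = c("A High", "B High", "C High", "D High"),
  visits = c(0, 2, 1, 0)
)

keep_row <- toy_school$visits >= 1  # logical vector, one value per row
keep_row                            # FALSE TRUE TRUE FALSE

toy_school[keep_row, ]              # rows where keep_row is TRUE, all columns
```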


### Subset Data Frames by combining `[]` and `$`, keep desired columns

- General syntax to filter desired observations (rows) and variables (columns) of data frame:

- `df_name[df_name$var_name <condition>, <desired columns>]`

__Tasks__ (output omitted)

- Extract the first three columns for observations where the high school received at least 1 visit from UC Berkeley
```{r results="hide"}
df_school[df_school$visits_by_110635 >= 1, 1:3]
```
- Extract variables `state_code`, `school_type`, and `name` for observations where the high school received at least 1 visit from UC Berkeley
```{r results="hide"}
df_school[df_school$visits_by_110635 >= 1, c("state_code","school_type","name")]
```
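
### Subset Data Frames by combining `[]` and `$`, keeping one column

One detail worth knowing when keeping a single column: for a base `data.frame`, `df[rows, "one_column"]` simplifies the result to a vector unless `drop = FALSE` is given (tibbles, by contrast, keep the data frame structure). The sketch below uses a hypothetical data frame `toy_hs`.

```{r}
# toy_hs is an illustrative base data.frame, not the lecture data
toy_hs <- data.frame(
  name       = c("A High", "B High", "C High"),
  state_code = c("CA", "NY", "CA"),
  visits     = c(0, 2, 1)
)

one_col_vec <- toy_hs[toy_hs$visits >= 1, "name"]                # vector
one_col_df  <- toy_hs[toy_hs$visits >= 1, "name", drop = FALSE]  # data frame

class(one_col_vec)  # "character"
class(one_col_df)   # "data.frame"
```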

### Subset Data Frames by combining `[]` and `$`, more examples

Syntax: 

- filter based on one variable: 
  - `df_name[df_name$var_name <condition>, <columns>]`
- Example syntax to filter based on two conditions being true
  - `df_name[df_name$var_name <condition> & df_name$var_name <condition>, <columns>]`

Pro tip:

- wrap above syntax within `nrow()` function to count how many observations (rows) satisfy the condition (as opposed to printing all rows that satisfy condition)

__Tasks__

- Count obs where high schools received at least 1 visit by Bama (100751) **and** at least one visit by Berkeley (110635)

```{r}
nrow(df_school[df_school$visits_by_110635 >= 1 & 
                 df_school$visits_by_100751 >= 1, ])
# Equivalently:
# nrow(df_school[df_school[["visits_by_110635"]] >= 1 & 
#                df_school[["visits_by_100751"]] >= 1, ])
```

- Count obs where schools received 1+ visit by Bama **or** 1+ visit by Berkeley
```{r}
nrow(df_school[df_school$visits_by_110635 >= 1 
  | df_school$visits_by_100751 >= 1, ])
```
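
### Subset Data Frames by combining `[]` and `$`, counting with `sum()`

An alternative way to count, sketched on made-up data: because `TRUE` counts as 1, summing the logical condition gives the same count as `nrow()` of the filtered data frame when there are no missing values (with missing values, `sum()` needs `na.rm = TRUE`). The data frame `toy_visits` and its columns are hypothetical.

```{r}
# toy_visits is an illustrative data frame with two visit counts
toy_visits <- data.frame(
  visits_a = c(0, 1, 3, 0, 2),
  visits_b = c(1, 0, 2, 0, 0)
)

both   <- toy_visits$visits_a >= 1 & toy_visits$visits_b >= 1
either <- toy_visits$visits_a >= 1 | toy_visits$visits_b >= 1

sum(both)                    # 1
nrow(toy_visits[both, ])     # 1, same count
sum(either)                  # 4
```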

### Logical operators for comparisons

- Logical operators to isolate/filter observations of data frame

Symbol | Meaning
-------|-------
`==` | Equal to
`!=` | Not equal to
`>` | greater than
`>=` | greater than or equal to
`<` | less than
`<=` | less than or equal to
`&` | AND 
`|` | OR
`%in%` | is an element of (matches any value in a vector)

\medskip 

- Visualization of "Boolean" operators (e.g., AND, OR, AND NOT)

!["Boolean" operations, x=left circle, y=right circle, from Wickham (2018)](transform-logical.png){width=40%}
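
### Logical operators for comparisons, toy example

A small sketch of the operators in the table applied to two short vectors, so each result can be checked by eye. The vectors `toy_state` and `toy_n_visits` are made up for illustration; `!` (NOT) is included because it is needed for "AND NOT" combinations.

```{r}
toy_state    <- c("CA", "NY", "TX", "CA")
toy_n_visits <- c(0, 3, 1, 2)

toy_state == "CA"                      # TRUE FALSE FALSE TRUE
toy_state != "CA"                      # FALSE TRUE TRUE FALSE
toy_state %in% c("NY", "TX")           # FALSE TRUE TRUE FALSE
toy_state == "CA" & toy_n_visits >= 1  # FALSE FALSE FALSE TRUE
toy_state == "CA" | toy_n_visits >= 3  # TRUE TRUE FALSE TRUE
!(toy_state == "CA")                   # FALSE TRUE TRUE FALSE
```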

### Subset Data Frames by combining `[]` and `$`, more examples

**Example**: Count the number of out-of-state high schools that UC Berkeley visited.

\smallskip

```{r}
# The `inst_110635` variable contains the home state of UC Berkeley
unique(df_school$inst_110635)

# Filter for schools visited by UC Berkeley AND whose state is not "CA"
nrow(df_school[df_school$visits_by_110635 >= 1 &
                 df_school$state_code != df_school$inst_110635, ])
```

\bigskip

**Example**: Count the number of schools in the Northeast that received a visit from either UC Berkeley, U of Alabama, or CU Boulder.

\smallskip

```{r}
# Vector containing states located in the Northeast region
northeast_states <- c('CT', 'ME', 'MA', 'NH', 'RI', 'VT', 'NJ', 'NY', 'PA')

# Filter for schools in the Northeast AND visited by any of the 3 univs
nrow(df_school[df_school$state_code %in% northeast_states &
                 (df_school$visits_by_110635 >= 1 |
                    df_school$visits_by_100751 >= 1 |
                    df_school$visits_by_126614 >= 1), ])
```


### Subset Data Frames by combining `[]` and `$`, `NA` Observations

Filtering observations of data frame using `[]` combined with `$` is more complicated in the presence of missing values (`NA` values)

\medskip 

The next few slides will explain 

- why it is more complicated
- how to filter correctly when `NA`s are present

### Subset Data Frames by combining `[]` and `$`, `NA` Observations

When sub-setting via `[]` combined with `$`, result will include:

- rows where condition is `TRUE`
- __as well as__ one row filled with `NA` for each observation where `<condition>` evaluates to `NA` (missing)

__Task__ (using `df_event`, which has one obs per university-recruiting event)

- How many events at public HS with at least $50k median household income?

```{r}
sum(is.na(df_event$med_inc)) # number of observations (all events) with missing values for med_inc

#num obs event_type=="public hs" and med_inc is missing
nrow(df_event[df_event$event_type == "public hs" 
  & is.na(df_event$med_inc)==1 , ]) # note TRUE evaluates to 1

#num obs event_type=="public hs" & med_inc is not NA & med_inc >= $50,000
nrow(df_event[df_event$event_type == "public hs" 
  & is.na(df_event$med_inc)==0 & df_event$med_inc>=50000 , ])  # note FALSE evaluates to 0

#num obs event_type=="public hs" and med_inc >= $50,000
nrow(df_event[df_event$event_type == "public hs" 
  & df_event$med_inc>=50000 , ])
```
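
### Subset Data Frames by combining `[]` and `$`, why `NA` rows appear

A sketch of the mechanism on made-up data: a comparison involving `NA` yields `NA`, `TRUE & NA` is `NA`, and `FALSE & NA` is `FALSE`. Each `NA` in the row condition comes back as a row filled with `NA`, which inflates `nrow()`. The names `toy_event` and `toy_cond` are hypothetical.

```{r}
# toy_event is an illustrative stand-in for df_event
toy_event <- data.frame(
  event_type = c("public hs", "public hs", "private hs", "public hs"),
  med_inc    = c(60000, NA, NA, 30000)
)

toy_cond <- toy_event$event_type == "public hs" & toy_event$med_inc >= 50000
toy_cond                     # TRUE NA FALSE FALSE

toy_event[toy_cond, ]        # row 1 plus an all-NA row for the NA condition
nrow(toy_event[toy_cond, ])  # 2, although only 1 row meets the condition
```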

### Subset Data Frames by combining `[]` and `$`, `NA` Observations

To exclude rows where condition is `NA` if subset using `[]` combined w/ `$`

- use `which()` to ask only for values where condition evaluates to `TRUE`
- `which()` returns position numbers for elements where condition is `TRUE`
```{r}
#?which
c(TRUE,FALSE,NA,TRUE)
str(c(TRUE,FALSE,NA,TRUE))
which(c(TRUE,FALSE,NA,TRUE))
```

Task: Count events at public HS with at least $50k median household income
```{r}
#Base R, `[]` combined with `$`; without which()
nrow(df_event[df_event$event_type == "public hs" & df_event$med_inc>=50000, ])

#Base R, `[]` combined with `$`; with which()
nrow(df_event[which(df_event$event_type == "public hs" 
  & df_event$med_inc>=50000), ])
```
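
### Subset Data Frames by combining `[]` and `$`, `which()` vs `is.na()`

A sketch comparing the two ways of excluding `NA` conditions on made-up data: wrapping the condition in `which()`, or adding an explicit `!is.na()` guard. Both give the same count here. The names `toy_inc` and `toy_flag` are hypothetical.

```{r}
# toy_inc is an illustrative data frame with one missing income
toy_inc <- data.frame(id = 1:4, med_inc = c(60000, NA, 30000, 75000))

toy_flag <- toy_inc$med_inc >= 50000  # TRUE NA FALSE TRUE
which(toy_flag)                       # 1 4: positions of TRUE, NA dropped

nrow(toy_inc[toy_flag, ])                       # 3: includes the NA row
nrow(toy_inc[which(toy_flag), ])                # 2
nrow(toy_inc[!is.na(toy_flag) & toy_flag, ])    # 2
```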

### Student Exercises

Subsetting Data Frames with `[]` and `$`:

1. Using `df_school`, show how many public high schools in California have at least 50% Latinx (hispanic in the data) student enrollment.

2. Using `df_event`, show how many out-of-state events at public high schools had median household income above $30K (do not forget to exclude missing values).

### Solution to Student Exercises

Solution to 1

__base R__ using [] and $ 
```{r}
df_school_br1<- df_school[df_school$school_type == "public" 
                  & df_school$pct_hispanic >= 50 
                  & df_school$state_code == "CA", ]
nrow(df_school_br1)
```

### Solution to Student Exercises

Solution to 2:

__base R__ using [] and $
```{r}
# use is.na to exclude NA
nrow(df_event[df_event$event_type == "public hs" & df_event$event_inst =="Out-State" 
              & df_event$med_inc > 30000 & is.na(df_event$med_inc) ==0, ])

# use which to exclude NA
nrow(df_event[which(df_event$event_type == "public hs" & df_event$event_inst =="Out-State" 
              & df_event$med_inc > 30000 ), ])
```


# Subset using subset() function

### Subset function

`subset()` is a base R function to "filter" observations from some object `x`

- object `x` can be a vector, matrix, or data frame
- `subset()` automatically excludes elements/rows with `NA` for condition
- Can also use `subset()` to select variables
- what `subset()` function returns:
  - "An object similar to x contain just the selected \ldots rows and columns (for a matrix or data frame)"
- `subset()` can be combined with:
    - assignment (`<-`) to create new objects
    - `nrow()` to count number of observations that satisfy criteria

```{r, eval=FALSE}
?subset
```

\medskip

Syntax [when object is data frame]: __subset(x, subset, select, drop = FALSE)__

- `x` is object to be subset
- `subset` is the logical expression(s) (evaluates to `TRUE/FALSE`) indicating elements (rows) to keep
- `select` indicates columns to select from data frame (if argument is not used default will keep all columns)
- `drop` to preserve original __dimensions__ [SKIP]


### Subset function, examples 

Recall the previous example where we count events at public HS with at least $50k median household income. 

- _Note_. `subset()` automatically excludes rows where condition is `NA`:

```{r}
#Base R, `[]` combined with `$`, without which(); includes `NA`
nrow(df_event[df_event$event_type == "public hs" 
              & df_event$med_inc>=50000, ])	

#Base R, `[]` combined with `$`, with which(); excludes `NA`
nrow(df_event[which(df_event$event_type == "public hs" 
                    & df_event$med_inc>=50000), ])

#Base R, `subset()`; excludes `NA`
nrow(subset(df_event, event_type == "public hs" 
            & med_inc>=50000))

#Base R, `subset()`; excludes `NA`; explicitly name arguments of subset()
nrow(subset(x = df_event, subset = event_type == "public hs" 
            & med_inc>=50000))
```


### Subset function, examples 

Using `df_school`, show all public high schools in California with at least 50% Latinx student enrollment (var=`pct_hispanic`)

- Using base R, `subset()` [output omitted]
```{r, results="hide"}
#public high schools with at least 50% Latinx student enrollment 
subset(x= df_school, subset = school_type == "public" & pct_hispanic >= 50 
     & state_code == "CA")
```

- Can wrap `subset()` within `nrow()` to count number of observations that satisfy criteria
```{r}
nrow(subset(df_school, school_type == "public" & pct_hispanic >= 50 
     & state_code == "CA"))
```

### Subset function, examples

Note that `subset()` keeps only the observations for which the condition is `TRUE`

```{r}
nrow(subset(x = df_school, subset = TRUE))
nrow(subset(x = df_school, subset = FALSE))
```

### Subset function, examples 

Count all CA public high schools that are at least 50% Latinx and received at least 1 visit from UC Berkeley (var=`visits_by_110635`)

```{r}
nrow(subset(df_school, school_type == "public" & pct_hispanic >= 50 
  & state_code == "CA" & visits_by_110635 >= 1))
```

### Subset function, examples 

Conditions in `subset()` can also use the `%in%` operator, a more concise alternative to chaining several `==` comparisons with the __OR__ operator `|`


- Count number of schools from MA, ME, or VT that received at least one visit from University of Alabama (var=`visits_by_100751`)
```{r}
nrow(subset(df_school, state_code %in% c("MA","ME","VT") 
  & visits_by_100751 >= 1))
```
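
### Subset function, `%in%` vs chained `|`

A sketch comparing `%in%` with the equivalent chain of `==` comparisons on a made-up vector. The two agree except for missing values: `%in%` returns `FALSE` for `NA`, while `==` propagates `NA`. The names `toy_codes`, `via_in`, and `via_or` are hypothetical.

```{r}
toy_codes <- c("MA", "CA", NA, "VT", "ME")  # illustrative state codes

via_in <- toy_codes %in% c("MA", "ME", "VT")
via_or <- toy_codes == "MA" | toy_codes == "ME" | toy_codes == "VT"

via_in  # TRUE FALSE FALSE TRUE TRUE
via_or  # TRUE FALSE NA TRUE TRUE
```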

### Subset function, examples 

Use the `select` argument within `subset()` to keep selected variables

- syntax: `select = c(var_name1,var_name2,...,var_name_n)`

Subset all CA public high schools that are at least 50% Latinx __AND__ only keep variables `name` and `address`

```{r}
subset(x = df_school, subset = school_type == "public" & pct_hispanic >= 50 
             & state_code == "CA", select = c(name, address))
```
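
### Subset function, `select` toy example

A sketch of `subset()` on made-up data showing two things at once: rows whose condition is `NA` are dropped, and `select` accepts either the columns to keep or a negative selection of columns to remove. The data frame `toy_sub` and the result names are hypothetical.

```{r}
# toy_sub is an illustrative data frame with one missing income
toy_sub <- data.frame(
  name    = c("A High", "B High", "C High", "D High"),
  state   = c("CA", "CA", "NY", "CA"),
  med_inc = c(60000, NA, 80000, 40000)
)

# condition is TRUE only for row 1; the NA row is dropped; keep two columns
toy_ca <- subset(toy_sub, state == "CA" & med_inc >= 50000,
                 select = c(name, med_inc))
toy_ca

# negative selection: keep every column except state
toy_no_state <- subset(toy_sub, med_inc >= 50000, select = -state)
names(toy_no_state)
```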

### Subset function, examples 

Combine `subset()` with assignment (`<-`) to create a new data frame

Create a new data frame of all CA public high schools that are at least 50% Latinx __AND__ only keep variables `name` and `address`
```{r}
df_school_v2 <- subset(df_school, school_type == "public" & pct_hispanic >= 50 
  & state_code == "CA", select = c(name, address))

head(df_school_v2, n=5)

nrow(df_school_v2)
```

### Student Exercises

Using `subset()` from base R:

1. Create a new data frame by extracting the columns `instnm`, `event_date`, `event_type` from the `df_event` data frame, then show which columns (variables) are in the newly created data frame.

2. Create a new data frame from the `df_school` data frame that includes out-of-state public high schools with 50%+ Latinx student enrollment that received at least one visit by the University of California Berkeley (var=`visits_by_110635`), then count the number of observations.

3. Count the number of public schools from CA, FL or MA that received one or two visits from UC Berkeley from the `df_school` data frame.

4. Subset all public out-of-state high schools visited by University of California Berkeley that enroll at least 50% Black students, and only keep variables `state_code`, `name` and `zip_code`.

### Solution to Student Exercises 

Solution to 1 

```{r}
df_event_br <- subset(df_event, select=c(instnm, event_date, event_type))
names(df_event_br)
```

Solution to 2

```{r}
df_school_br <- subset(df_school, state_code != "CA" & school_type == "public" 
                        & pct_hispanic >= 50 & visits_by_110635 >=1 )
nrow(df_school_br)
```

Solution to 3

```{r}
nrow(subset(df_school, state_code %in% c("CA", "FL", "MA")  
             & school_type == "public" & visits_by_110635 %in% c(1,2) ))
```


### Solution to Student Exercises 

Solution to 4

```{r}
subset(df_school, school_type == "public" & state_code != "CA" 
       & visits_by_110635 >= 1 & pct_black >= 50, 
       select = c(state_code, name, zip_code))
```

# Creating variables

### Create new data frame based on `df_school_all`

Data frame `df_school_all` has one obs per US high school and includes variables identifying the number of visits by particular universities
```{r}
load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_school_allvars.RData"))
names(df_school_all)
```

### Create new data frame based on `df_school_all`

Create new version of data frame, called `school_v2`, which we'll use to introduce how to create new variables
```{r, results='hide'}
library(tidyverse) # code below uses tidyverse functions and the pipe operator
school_v2 <- df_school_all %>% 
  select(-contains("inst_")) %>% # remove vars whose names contain "inst_"
  rename( # rename selected variables
    visits_by_berkeley = visits_by_110635,
    visits_by_boulder = visits_by_126614,
    visits_by_bama = visits_by_100751,
    visits_by_stonybrook = visits_by_196097,
    visits_by_rutgers = visits_by_186380,
    visits_by_pitt = visits_by_215293,
    visits_by_cinci = visits_by_201885,
    visits_by_nebraska = visits_by_181464,
    visits_by_georgia = visits_by_139959,
    visits_by_scarolina = visits_by_218663,
    visits_by_ncstate = visits_by_199193,
    visits_by_irvine = visits_by_110653,
    visits_by_kansas = visits_by_155317,
    visits_by_arkansas = visits_by_106397,
    visits_by_sillinois = visits_by_149222,
    visits_by_umass = visits_by_166629,
    num_took_read = num_took_rla,
    num_prof_read = num_prof_rla,
    med_inc = avgmedian_inc_2564
  )

glimpse(school_v2)
```

### Base R approach to creating new variables

Create new variables using the assignment operator `<-` together with the subsetting operators `[]` and `$`, which also let us set values conditional on the input variables

\medskip

Pseudo syntax: `df$newvar <- ...` 

- where `...` argument is expression(s)/calculation(s) used to create new variables
    - expressions can include subsetting operators and/or other base R functions

\medskip

__Task__: Create measure of percent of students on free-reduced lunch

__base R approach__
```{r}
school_v2_temp<- school_v2 #create copy of dataset; not necessary
school_v2_temp$pct_fr_lunch <- 
   school_v2_temp$num_fr_lunch/school_v2_temp$total_students

#investigate variable you created
str(school_v2_temp$pct_fr_lunch)
school_v2_temp$pct_fr_lunch[1:5] # print first 5 obs
```

__tidyverse approach (with pipes)__
```{r}
school_v2_temp <- school_v2 %>% 
  mutate(pct_fr_lunch = num_fr_lunch/total_students) 
```

### Base R approach to creating new variables

To create a new variable whose values depend on the conditions/values of input variables, assign to a subset of the new variable; this is the base R counterpart of tidyverse `mutate()` __with__ `if_else()` or `recode()`

\medskip

- Pseudo syntax: `df$newvar[logical condition]<- new value` 
- `logical condition`: a condition that evaluates to `TRUE` or `FALSE`

###  Base R approach to creating new variables

__Task__: Create 0/1 indicator if school has median income greater than $100k

__tidyverse approach (using pipes)__
```{r}
school_v2_temp %>% select(med_inc) %>% 
  mutate(inc_gt_100k= if_else(med_inc>100000,1,0)) %>%
  count(inc_gt_100k) # note how NA values of med_inc treated
```

__Base R approach__
```{r}
school_v2_temp$inc_gt_100k<-NA #initialize an empty column with NAs 
                              # otherwise you'll get warning
school_v2_temp$inc_gt_100k[school_v2_temp$med_inc>100000] <- 1
school_v2_temp$inc_gt_100k[school_v2_temp$med_inc<=100000] <- 0
count(school_v2_temp, inc_gt_100k)
```
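
### Base R approach to creating new variables, `ifelse()` sketch

An alternative sketch that stays entirely in base R: `ifelse()` builds the 0/1 indicator in one step and leaves missing inputs as `NA`, and `table()` with `useNA = "ifany"` gives a frequency count that shows the `NA` category. The vectors `toy_med_inc` and `toy_gt_100k` are made up for illustration.

```{r}
toy_med_inc <- c(120000, 80000, NA, 150000)  # illustrative incomes

toy_gt_100k <- ifelse(toy_med_inc > 100000, 1, 0)
toy_gt_100k                          # 1 0 NA 1

table(toy_gt_100k, useNA = "ifany")  # counts for 0, 1, and NA
```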

### Creating variables

__Task__: Using data frame `wwlist` and input vars `state` and `firstgen`, create a 4-category var with following categories:

- "instate_firstgen"; "instate_nonfirstgen"; "outstate_firstgen"; "outstate_nonfirstgen"

__tidyverse approach (using pipes)__
```{r}
load(url("https://github.com/ozanj/rclass/raw/master/data/prospect_list/wwlist_merged.RData"))
wwlist_temp <- wwlist %>% 
  mutate(state_gen = case_when(
    state == "WA" & firstgen =="Y" ~ "instate_firstgen",
    state == "WA" & firstgen =="N" ~ "instate_nonfirstgen",
    state != "WA" & firstgen =="Y" ~ "outstate_firstgen",
    state != "WA" & firstgen =="N" ~ "outstate_nonfirstgen")
  )
str(wwlist_temp$state_gen)
wwlist_temp %>% count(state_gen)
```

### Base R approach to creating new variables 

__Task__: Using  `wwlist` and input vars `state` and `firstgen`, create a 4-category var

__base R approach__
```{r}
wwlist_temp <- wwlist 

wwlist_temp$state_gen <- NA
wwlist_temp$state_gen[wwlist_temp$state == "WA" 
  & wwlist_temp$firstgen =="Y"] <- "instate_firstgen"
wwlist_temp$state_gen[wwlist_temp$state == "WA" 
  & wwlist_temp$firstgen =="N"] <- "instate_nonfirstgen"
wwlist_temp$state_gen[wwlist_temp$state != "WA" 
  & wwlist_temp$firstgen =="Y"] <- "outstate_firstgen"
wwlist_temp$state_gen[wwlist_temp$state != "WA" 
  & wwlist_temp$firstgen =="N"] <- "outstate_nonfirstgen"

str(wwlist_temp$state_gen)
count(wwlist_temp, state_gen)
```

# Appendix


## Sorting data 

### Base R `sort()` for vectors 


`sort()` is a base R function that sorts vectors

Syntax: `sort(x, decreasing=FALSE, ...)`

- where x is object being sorted
- By default it sorts in ascending order (low to high)
- Need to set decreasing argument to `TRUE` to sort from high to low

```{r}
#?sort()
x<- c(31, 5, 8, 2, 25)
sort(x)
sort(x, decreasing = TRUE)
```
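
### Base R `sort()` for vectors, missing values

A short sketch of how `sort()` treats missing values: by default they are removed from the result, and `na.last = TRUE` keeps them at the end. The vector `toy_x` is made up for illustration.

```{r}
toy_x <- c(31, NA, 8, 2)

sort(toy_x)                  # 2 8 31: NA removed by default
sort(toy_x, na.last = TRUE)  # 2 8 31 NA
```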


### Base R `order()` for dataframes

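`order()` is a base R function that returns the positions that would sort one or more vectors; using those positions as the row index inside `[]` sorts a data frame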

- Syntax: `order(..., na.last = TRUE, decreasing = FALSE)`
- where `...` are variable(s) to sort by
- By default it sorts in ascending order (low to high)
- Need to set decreasing argument to `TRUE` to sort from high to low


A single `decreasing = TRUE` applies to every variable passed to `order()`, so on its own it only helps when sorting by one variable or when all variables should be descending

- use `-` in front of a (numeric) variable to sort that variable descending while the others keep the default ascending order
```{r results="hide"}
df_event[order(df_event$event_date), ] 
df_event[order(df_event$event_date, df_event$total_12), ]

#sort descending via argument
df_event[order(df_event$event_date, decreasing = TRUE), ] 
df_event[order(df_event$event_date, df_event$total_12, decreasing = TRUE), ] 

#sorting by both ascending and descending variables
df_event[order(df_event$event_date, -df_event$total_12), ]
```
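
### Base R `order()` for dataframes, toy example

A sketch on made-up data showing that `order()` returns row positions rather than sorted values, and two ways to mix ascending and descending keys: the `-` prefix (numeric variables only) and a per-variable `decreasing` vector, which `order()` supports with `method = "radix"`. The data frame `toy_ord` is hypothetical.

```{r}
# toy_ord is an illustrative data frame
toy_ord <- data.frame(
  state = c("NY", "CA", "CA", "NY"),
  pop   = c(500, 300, 900, 100)
)

order(toy_ord$pop)  # 4 2 1 3: row positions from smallest to largest pop

# state ascending, pop descending within state
toy_ord[order(toy_ord$state, -toy_ord$pop), ]

# same ordering with one decreasing value per sort variable
toy_ord[order(toy_ord$state, toy_ord$pop,
              decreasing = c(FALSE, TRUE), method = "radix"), ]
```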

### Example, sorting

- Create a new data frame from `df_event` sorted by ascending `event_date`, ascending `event_state`, and descending `pop_total`.

__base R__ using  `order()` function:

```{r results="hide"}
df_event_br1 <- df_event[order(df_event$event_date, df_event$event_state, 
                               -df_event$pop_total), ]
```