---
title: "Lecture 4: Pipes and variable creation"
subtitle:  "Managing and Manipulating Data Using R"
author:
date: 
fontsize: 8pt
classoption: dvipsnames  # for colors
urlcolor: blue
output:
  beamer_presentation:
    keep_tex: true
    toc: false
    slide_level: 3
    theme: default # AnnArbor # push to header?
    #colortheme: "dolphin" # push to header?
    #fonttheme: "structurebold"
    highlight: default # Supported styles include "default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn", and "haddock" (specify null to prevent syntax highlighting); push to header
    df_print: tibble #default # tibble # push to header?    
    latex_engine: xelatex #  Available engines are pdflatex [default], xelatex, and lualatex; The main reasons you may want to use xelatex or lualatex are: (1) They support Unicode better; (2) It is easier to make use of system fonts.
    includes:
      in_header: ../beamer_header.tex
      #after_body: table-of-contents.txt 
---

```{r, echo=FALSE, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE)
#knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE)
  #comment = "#>" makes it so results from a code chunk start with "#>"; default is "##"
```

# Introduction

### What we will do today

\tableofcontents

```{r, eval=FALSE, echo=FALSE}
#Use this if you want TOC to show level 2 headings
\tableofcontents
#Use this if you don't want TOC to show level 2 headings
\tableofcontents[subsectionstyle=hide/hide/hide]
```

### Libraries we will use today

"Load" the package we will use today (output omitted)

- __you must run this code chunk__
```{r, message=FALSE}
library(tidyverse)
```
If package not yet installed, then must install before you load. Install in "console" rather than .Rmd file

- Generic syntax: `install.packages("package_name")`
- Install "tidyverse": `install.packages("tidyverse")`

Note: when we load package, name of package is not in quotes; but when we install package, name of package is in quotes:

- `install.packages("tidyverse")`
- `library(tidyverse)`


## Data for lecture

### Lecture 3 data: prospects purchased by Western Washington U.

The "Student list" business

- Universities identify/target "prospects" by buying "student lists" from College Board/ACT (e.g., $.40 per prospect)
- Prospect lists contain contact info (e.g., address, email), academic achievement, socioeconomic, demographic characteristics
- Universities choose which prospects to purchase by filtering on criteria like zip-code, GPA, test score range, etc.

```{r}
#load prospect list data
load(url("https://github.com/ozanj/rclass/raw/master/data/prospect_list/wwlist_merged.RData"))
```

Object `wwlist`

- De-identified list of prospective students purchased by Western Washington University from College Board
- We collected these data using FOIA request
    - ASIDE: Become an expert on collecting data via FOIA requests and you will become a superstar!


### Lecture 3 data: prospects purchased by Western Washington U.

Observations on `wwlist`

- each observation represents a prospective student
```{r}
typeof(wwlist)
dim(wwlist)
```

Variables on `wwlist`

- some vars provide de-identified data on individual prospects
    - e.g., `psat_range`, `state`, `sex`, `ethn_code`
- some vars provide data about zip-code student lives in
    - e.g., `med_inc`, `pop_total`, `pop_black`
- some vars provide data about school student enrolled in
    - e.g., `fr_lunch` is number of students on free/reduced lunch
    - note: bad merge between prospect-level data and school-level data
```{r, results="hide"}
names(wwlist)
str(wwlist)
glimpse(wwlist) # tidyverse function, similar to str()
```

### Lecture 3 data: prospects purchased by Western Washington U.

Variable `firstgen` identifies whether prospect is a first-generation college student

Imagine we want to isolate all the first-generation prospects

1. Investigate variable type/structure. 

- A dichotomous var, but stored as character in `wwlist`. So must use quotes (`''` or `""`) to filter/subset based on values of `firstgen`
```{r}
str(wwlist$firstgen)
```

2. Create frequency table to identify possible values of `firstgen`
```{r}
table(wwlist$firstgen, useNA = "always")
```
3. Isolate all the first-gen prospects
```{r}
filter(wwlist, firstgen == "Y")
```


# Pipes

### What are "pipes", %>%

__Pipes__ are a means of perfoming multiple steps in a single line of code

- When writing code, the pipe symbol is `%>%`
- Basic flow of using pipes in code:
    - `object %>% some_function %>% some_function %>% some_function`    
- Pipes work from left to right:
    - The object from left of `%>%` pipe symbol is input as the first argument of the function to the right of the `%>%` pipe symbol 
    - In turn, the resulting output becomes the input (the first argument) of the function to the right of the next `%>%` pipe symbol 
- Pipes are part of __tidyverse__ suite of packages, not __base R__
    
\medskip

Intuitive mnemonic device for understanding pipes

- whenever you see a pipe `%>%` think of the words "__and then...__"

\medskip

Example: isolate all the first-generation prospects [output omitted]

- in words: start with object `wwlist` __and then__ filter first generation students
```{r, results="hide"}
wwlist %>% filter(firstgen == "Y")
```
    

### Do task with and without pipes

Task: Using object `wwlist` print data for "first-gen" prospects (`firstgen == "Y"`)
```{r, results='hide'}
# without pipes
filter(wwlist, firstgen == "Y")

# with pipes
wwlist %>% filter(firstgen == "Y")
```
Comparing the two approaches:

- "without pipes", object `wwlist` is the first argument `filter()` function
- In "pipes" approach, you don't specify object `wwlist` as first argument in `filter()`
    - Why? Because `%>%` "pipes" the object to the left of the `%>%` operator into the function to the right of the `%>%` operator
    
Main takeaway:

- When writing code using pipes, functions to right of `%>%` pipe operator should not explicitly name object that is the input to the function. 
- Rather, object to the left of `%>%` pipe operator is automatically the input.
    

### More intuition on the pipe operator, `%>%`

The pipe operator "pipes" (verb) an object from left of `%>%` operator into the function to the right of the %>% operator

Example, the "structure" function `str()`, with and without pipes

- Examine syntax for `str()`: `str(object, ...)`
```{r, eval=FALSE}
?str
```

- Investigate structure of dataframe `wwlist` without and with pipes
```{r, results="hide"}
str(wwlist) # without pipe

wwlist %>% str() # with pipe
```

Questions:

- In the pipes approach, `wwlist %>% str()`, why didn't we need to insert argument values inside `str()`
- What would happen if we just ran this line of code?
```{r, eval=FALSE}
str()
```


### Do task with and without pipes

__Task__: Using object `wwlist`, print data for "first-gen" prospects for selected variables [output omitted]

```{r, results='hide'}
#Without pipes
select(filter(wwlist, firstgen == "Y"), state, hs_city, sex)
#With pipes
wwlist %>% filter(firstgen == "Y") %>% select(state, hs_city, sex)
```


Comparing the two approaches:

- In the "without pipes" approach, code is written "inside out"
    - The first step in the task -- identifying the object -- is the innermost part of code
    - The last step in task -- selecting variables to print -- is the outermost part of code
- In "pipes" approach the left-to-right order of code matches how we think about the task
    - First, we start with an object __*and then*__ (`%>%`) we use `filter()` to isolate first-gen students __*and then*__ (`%>%`) we select which  variables to print

\medskip

__Important__: `str()` function helpful for understanding what object is piped in from one function to another
```{r, results="hide"}
#object that was "piped" into `select()` from `filter()`
wwlist %>% filter(firstgen == "Y") %>% str()

#object that was created after `select()` function
wwlist %>% filter(firstgen == "Y") %>% select(state, hs_city, sex) %>% str()
```

### Aside: `count()` function

\medskip `count()` function from `dplyr` package counts the number of obs by group
```{r, eval=FALSE, echo=FALSE}
?count
```

__Syntax__ [see help file for full syntax]

- `count(x,...)`

__Arguments__ [see help file for full arguments]

- `x`: an object, often a data frame
- `...`: variables to group by

Examples of using `count()`

- Without vars in `...` argument, counts number of obs in object
```{r, results="hide"}
count(wwlist)
wwlist %>% count()
wwlist %>% count() %>% str()
```
- With vars in `...` argument, counts number of obs per variable value
    - This is the best way to create frequency table, better than `table()`
    - note: by default, `count()` always shows `NAs` [this is good!]
```{r, results="hide"}
count(wwlist,school_category)
wwlist %>% count(school_category)
wwlist %>% count(school_category) %>% str()
```
### pipe operators and new lines

\medskip Often want to insert line breaks to make long line of code more readable

- When inserting line breaks, __pipe operator `%>%` should be the last thing before a line break, not the first thing after a line break__

__This works__
```{r, results="hide"}
wwlist %>% filter(firstgen == "Y") %>% 
  select(state, hs_city, sex) %>%
  count(sex)
```
__This works too__
```{r, results="hide"}
wwlist %>% filter(firstgen == "Y",
                  state != "WA") %>% 
  select(state, hs_city, sex) %>%
  count(sex)
```

__This doesn't work__
```{r, eval=FALSE}
wwlist %>% filter(firstgen == "Y") 
  %>% select(state, hs_city, sex) 
  %>% count(sex)
```

### The power of pipes

You might be thinking, "what's the big deal?"

__TasK__:

- in one line of code, modify `wwlist` and create bar chart that counts number of prospects purchased by race/ethnicity, separately for in-state vs. out-of-state
```{r, eval=FALSE}
wwlist %>%  filter(is.na(state)==0) %>% # drop obs where variable state missing
  mutate( # create out-of-state indicator; create recoded ethnicity var
    out_state = as_factor(if_else(state != "WA", "out-of-state", "in-state")),
    ethn_race = recode(ethn_code,
      "american indian or alaska native" = "nativeam",
      "asian or native hawaiian or other pacific islander" = "api",
      "black or african american" = "black",
      "cuban" = "latinx",
      "mexican/mexican american" = "latinx",
      "not reported" = "not_reported",
      "other-2 or more" = "multirace",
      "other spanish/hispanic" = "latinx",
      "puerto rican" = "latinx",
      "white" = "white")) %>% 
    group_by(out_state) %>% # group_by "in-state" vs. "out-of-state"
    count(ethn_race) %>%  # count of number of prospects purchased by race
    ggplot(aes(x=ethn_race, y=n)) +  # plot
    ylab("number of prospects") + xlab("race/ethnicity") +
    geom_col() +  coord_flip() + facet_wrap(~ out_state) 
```

### The power of pipes

__TasK__:

- in one line of code, modify `wwlist` and create bar chart of median income (in zip-code) of prospects purchased by race/ethnicity, separately for in-state vs. out-of-state
```{r, eval=FALSE}
wwlist %>%  filter(is.na(state)==0) %>% # drop obs where variable state missing
  mutate( # create out-of-state indicator; create recoded ethnicity var
    out_state = as_factor(if_else(state != "WA", "out-of-state", "in-state")),
    ethn_race = recode(ethn_code,
      "american indian or alaska native" = "nativeam",
      "asian or native hawaiian or other pacific islander" = "api",
      "black or african american" = "black",
      "cuban" = "latinx",
      "mexican/mexican american" = "latinx",
      "not reported" = "not_reported",
      "other-2 or more" = "multirace",
      "other spanish/hispanic" = "latinx",
      "puerto rican" = "latinx",
      "white" = "white")) %>% 
    group_by(out_state, ethn_race) %>% # group_by "out-state" and ethnicity
    summarize(avg_inc_zip = mean(med_inc_zip, na.rm = TRUE)) %>% # calculate avg. inc
    ggplot(aes(x=out_state, y=avg_inc_zip)) + 
    ylab("avg. income in zip code") + xlab("") +
    geom_col() +  coord_flip() + facet_wrap(~ ethn_race) # plot
```

### The power of pipes

Example R script from Ben Skinner, which creates analysis data for [Skinner (2018)](https://link.springer.com/article/10.1007%2Fs11162-018-9507-1)

- [Link to R script](https://github.com/btskinner/colchoice_rep/blob/master/scripts/makedata.r)

\medskip

Other relevant links

- [Link to Github repository for Skinner (2018)](https://github.com/btskinner/colchoice_rep)
- [Link to published paper](https://link.springer.com/article/10.1007%2Fs11162-018-9507-1)
- [Link to Skinner's Github page](https://github.com/btskinner)
    - A lot of cool stuff here
- [Link to Skinner's personal website](https://www.btskinner.io/)
    - A lot of cool stuff here


### Do task with and without pipes [STUDENTS WORK ON THEIR OWN]

Task: 

- Count the number "first-generation" prospects from the state of Washington

Without pipes
```{r}
count(filter(wwlist, firstgen == "Y", state == "WA"))
```

With pipes
```{r}
wwlist %>% filter(firstgen == "Y", state == "WA") %>% count()
```

### Do task with and without pipes [STUDENTS WORK ON THEIR OWN]

__Task__: frequency table of `school_type` for non first-gen prospects from WA

__without pipes__
```{r}
wwlist_temp <- filter(wwlist, firstgen == "N", state == "WA")
table(wwlist_temp$school_type, useNA = "always")
rm(wwlist_temp) # cuz we don't need after creating table
```
__With pipes__
```{r}
wwlist %>% filter(firstgen == "N", state == "WA") %>% count(school_type)
```

__Comparison of two approaches__

- without pipes, task requires multiple lines of code (this is quite common)
    - first line creates object; second line analyzes object
- with pipes, task can be completed in one line of code and you aren't left with objects you don't care about

### Student exercises with pipes

1. Using object `wwlist` select the following variables (state, firstgen, ethn_code) and assign `<-` them to object `wwlist_temp`.  (ex. wwlist_temp <- wwlist)

2. Using the object you just created `wwlist_temp`, create a frequency table of `ethn_code` for first-gen prospects from California.  

3. **Bonus**: Try doing question 1 and 2 together. Use original object `wwlist`, but do not assign to a new object. 

Once finished you can `rm(wwlist_temp)`

### Solution to exercises with pipes

1. Using object `wwlist` select the following variables (state, firstgen, ethn_code) and assign them to object `wwlist_temp` 
```{r}
wwlist_temp <- wwlist %>%
  select(state, firstgen, ethn_code) 
```

### Solution to exercises with pipes
2. Using the object you just created `wwlist_temp`, create a frequency table of `ethn_code` for first-gen prospects from California. 
```{r}
#names(wwlist)
wwlist_temp %>%
  filter(firstgen == "Y", state == "CA") %>% count(ethn_code)
```

### Solution to exercises with pipes
3. **Bonus**: Try doing question 1 and 2 together. 

```{r}
wwlist %>%
  select(state, firstgen, ethn_code) %>%
  filter(firstgen == "Y", state == "CA") %>% 
  count(ethn_code)
#rm(wwlist_temp)
```

```{r}
rm(wwlist_temp)
```


# Creating variables using mutate (tidyverse approach)

### Our plan for learning how to create new variables

Recall that `dplyr` package within `tidyverse` provide a set of functions that can be described as "verbs": __subsetting__, __sorting__, and __transforming__


What we've done | Where we're going
--------------- | --------------------
__Subsetting data__ | __Transforming data__
- `select()` variables  | - `mutate()` creates new variables
- `filter()` observations | - `summarize()` calculates across rows
__Sorting data__ | - `group_by()` to calculate across rows within groups
- `arrange()` | 

__Today__

- we'll use `mutate()` to create new variables based on calculations across columns within a row

__Next week__

- we'll combine `mutate()` with `summarize()` and `group_by()` to create variables based on calculations across rows

### Create new data frame based on `df_school_all`

Data frame `df_school_all` has one obs per US high school and then variables identifying number of visits by particular universities
```{r}
load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_school_allvars.RData"))
names(df_school_all)
```
### Create new data frame based on `df_school_all`

Create new version of data frame, called `school_v2`, which we'll use to introduce how to create new variables
```{r, results='hide'}
school_v2 <- df_school_all %>% 
  select(-contains("inst_")) %>% # remove vars that start with "inst_"
  rename( # rename selected variables
    visits_by_berkeley = visits_by_110635,
    visits_by_boulder = visits_by_126614,
    visits_by_bama = visits_by_100751,
    visits_by_stonybrook = visits_by_196097,
    visits_by_rutgers = visits_by_186380,
    visits_by_pitt = visits_by_215293,
    visits_by_cinci = visits_by_201885,
    visits_by_nebraska = visits_by_181464,
    visits_by_georgia = visits_by_139959,
    visits_by_scarolina = visits_by_218663,
    visits_by_ncstate = visits_by_199193,
    visits_by_irvine = visits_by_110653,
    visits_by_kansas = visits_by_155317,
    visits_by_arkansas = visits_by_106397,
    visits_by_sillinois = visits_by_149222,
    visits_by_umass = visits_by_166629,
    num_took_read = num_took_rla,
    num_prof_read = num_prof_rla,
    med_inc = avgmedian_inc_2564
  )

glimpse(school_v2)
```

## Introduce mutate() function 

### Introduce `mutate()` function

`mutate()` is __tidyverse__ approach to creating variables (not __Base R__ approach)

Description of `mutate()`

- creates new columns (variables) that are functions of existing columns
- After creating a new variable using `mutate()`, every row of data is retained
- `mutate()` works best with pipes `%>%`

__Task__: 

- Using data frame `school_v2` create new variable that measures the pct of students on free/reduced lunch (output omitted)
```{r, results='hide'}
# create new dataset with fewer vars; not necessary to do this
school_sml <- school_v2 %>% 
  select(ncessch, school_type, num_fr_lunch, total_students)

# create new var
school_sml %>% 
  mutate(pct_fr_lunch = num_fr_lunch/total_students) 

# remove data frame object
rm(school_sml) 
```

### Investigate `mutate()` syntax


```{r, eval=FALSE, echo=FALSE}
?mutate
```

__Usage (i.e., syntax)__

- `mutate(.data,...)`

__Arguments__

- `.data`: a data frame
    - if using `mutate()` after pipe operator `%>%`, then this argument can be omitted
        - Why? Because data frame object to left of `%>%` "piped in" to first argument of `mutate()`
- `...`: expressions used to create new variables
    - "Name-value pairs of expressions"
    - "The name of each argument will be the name of a new variable, and the value will be its corresponding value." 
    - "Use a `NULL` value in mutate to drop a variable." 
    - "New variables overwrite existing variables of the same name"


__Value__

- returns a (data frame) object that contains the original input data frame and new variables that were created by `mutate()`

### Investigate `mutate()` syntax


```{r, eval=FALSE, echo=FALSE}
?mutate
```

__Can create variables using standard mathematical or logical operators__  [output omitted]

```{r, results="hide"}
#glimpse(school_v2)
school_v2 %>% 
  select(state_code,school_type,ncessch,med_inc,num_fr_lunch,total_students,num_took_math) %>%
  mutate( # each argument creates a new variable, name of argument is name of variable
    one = 1,
    med_inc000 = med_inc/1000,
    pct_fr_lunch = num_fr_lunch/total_students*100,
    took_math_na = is.na(num_took_math)==1
  ) %>%
  select(state_code,school_type,ncessch,one,med_inc,med_inc000,num_fr_lunch,total_students,pct_fr_lunch,num_took_math,took_math_na)
```


__Can create variables using "helper functions" called within `mutate()`__ [output omitted]

- These are standalone functions can be called *within* `mutate()`
    - e.g., `if_else()`, `recode()`, `case_when()`
- will walk through helper functions in more detail in subsequent sections of lecture

```{r, results="hide"}
school_v2 %>% 
  select(state_code,ncessch,name,school_type) %>%
  mutate(public = if_else(school_type == "public", 1, 0))
```

### Introduce `mutate()` function

New variable not retained unless we __assign__ `<-` it to an object (existing or new)

\medskip 

- __`mutate()` without assignment__
```{r, results='hide'}
school_v2 %>% mutate(pct_fr_lunch = num_fr_lunch/total_students)

names(school_v2)
```

\medskip 

- __`mutate()` with assignment__
```{r, results="hide"}
school_v2_temp <- school_v2 %>% 
  mutate(pct_fr_lunch = num_fr_lunch/total_students)

names(school_v2_temp)
rm(school_v2_temp)

```

### `mutate()` can create multiple variables at once

`mutate()` can create multiple variables at once

```{r, results='hide'}
school_v2 %>% 
  mutate(pct_fr_lunch = num_fr_lunch/total_students,
         pct_prof_math= num_prof_math/num_took_math) %>%
  select(num_fr_lunch, total_students, pct_fr_lunch, 
         num_prof_math, num_took_math, pct_prof_math)
```

Or we could write code this way:
```{r, results="hide"}
school_v2 %>% 
  select(num_fr_lunch, total_students, num_prof_math, num_took_math) %>%
  mutate(pct_fr_lunch = num_fr_lunch/total_students,
         pct_prof_math= num_prof_math/num_took_math) 
```

`mutate()` can use variables previously created within `mutate()`
```{r, results="hide"}
school_v2 %>% 
  select(num_prof_math, num_took_math, num_took_read,num_prof_read) %>%
  mutate(pct_prof_math = num_prof_math/num_took_math,
         pct_prof_read = num_prof_read/num_took_read,
         avg_pct_prof_math_read = (pct_prof_math + pct_prof_read)/2) 
```
### `mutate()`, removing variables created by `mutate()`

Within `mutate()` use syntax `var_name = NULL` to remove variable from data frame

- note: Variable not permanently removed from data frame unless you use assignment `<-` to create new data frame or overwrite existing data frame
```{r, results='hide'}
ncol(school_v2)
school_v2 %>% 
  select(num_prof_math, num_took_math, num_took_read,num_prof_read) %>% glimpse()

school_v2 %>% 
  select(num_prof_math, num_took_math, num_took_read,num_prof_read) %>% 
  mutate(num_prof_math = NULL, num_took_math = NULL) %>% glimpse()
#But variables not permanently removed because we didn't use assignment
ncol(school_v2)
```
Why would we remove variables within `mutate()` rather `select()`?

- remove temporary "work" variables used to create desired variable
- Example: measure of average of pct who passed math and pct who passed reading
```{r, results='hide'}
school_v2 %>% 
  select(num_prof_math, num_took_math, num_took_read,num_prof_read) %>%
  mutate(pct_prof_math = num_prof_math/num_took_math, # create work var
         pct_prof_read = num_prof_read/num_took_read, # create work var
         avg_pct_prof_math_read = (pct_prof_math + pct_prof_read)/2, #create analysis var
         pct_prof_math = NULL, # remove work var
         pct_prof_read = NULL) %>% # remove work var
  glimpse()
```

### Student exercise using mutate()

1. Using the object `school_v2`, select the following variables (`num_prof_math`, `num_took_math`, `num_prof_read`, `num_took_read`) and create a measure of percent proficient in math `pct_prof_math` and percent proficient in reading `pct_prof_read`.  

2. Now using the code for question 1, filter schools where at least 50% of students are proficient in math **&** reading.  

3. Count the number of schools from question 2.

4. Using `school_v2`, using `mutate()` combined with `is.na()` create a dichotomous indicator variable `med_inc_na` that identifies whether `med_inc` is missing (`NA`) or not. And then use syntax `count(var_name)` to create frequency table of variable `med_inc_na`. How many observations are missing? 


### Solutions for exercise using mutate()

1. Using the object `school_v2`, select the following variables (`num_prof_math`, `num_took_math`, `num_prof_read`, `num_took_read`) and create a measure of percent proficient in math `pct_prof_math` and percent proficient in reading `pct_prof_read`.

```{r}
school_v2 %>%
  select(num_prof_math, num_took_math, num_prof_read, num_took_read) %>%
  mutate(pct_prof_math = num_prof_math/num_took_math,
         pct_prof_read = num_prof_read/num_took_read) 
```
### Solutions for exercise using mutate()

2. Now using the code for question 1, filter schools where at least 50% of students are proficient in math **&** reading.
```{r}
school_v2 %>%
  select(num_prof_math, num_took_math, num_prof_read, num_took_read) %>%
  mutate(pct_prof_math = num_prof_math/num_took_math,
         pct_prof_read = num_prof_read/num_took_read) %>%
  filter(pct_prof_math >= 0.5 & pct_prof_read >= 0.5) 
```

### Solutions for exercise using mutate()
3. Count the number of schools from question 2.
```{r}
school_v2 %>%
  select(num_prof_math, num_took_math, num_prof_read, num_took_read) %>%
  mutate(pct_prof_math = num_prof_math/num_took_math,
         pct_prof_read = num_prof_read/num_took_read) %>%
  filter(pct_prof_math >= 0.5 & pct_prof_read >= 0.5) %>%
  count()
```

### Solutions for exercise using mutate()
4. Using `school_v2`, using `mutate()` combined with `is.na()` create a dichotomous indicator variable `med_inc_na` that identifies whether `med_inc` is missing (`NA`) or not. And then use syntax `count(var_name)` to create frequency table of variable `med_inc_na`. How many observations are missing? 

```{r}
school_v2 %>% 
  mutate(med_inc_na = is.na(med_inc)) %>%
  count(med_inc_na)
```

## Using if_else() function within mutate()

### Using `if_else()` function within `mutate()`

```{r, eval=FALSE, echo=FALSE}
?if_else
```

__Description__

- if `logical condition` `TRUE`, assign a value; if `logical condition` `FALSE` assign a value

__Usage (i.e., syntax)__

- `if_else(logical condition, true, false, missing = NULL)`

__Arguments__

- `logical condition`: a condition that evaluates to `TRUE` or `FALSE`
- `true`: value to assign if condition `TRUE`
- `false`: value to assign if condition `FALSE`
- `missing`: value to assign to rows that have value `NA` for condition
    - default is `missing = NULL`; means that if condition is `NA`, then new_var == `NA`
    - But can assign different values to `NA`s, e.g., `missing = -9`

__Value__

- "Where condition is TRUE, the matching value from true, where it's FALSE, the matching value from false, otherwise NA."
- Unless otherwise specified, `NA`s in "input" var(s) assigned `NA` in "output var"

__Example__: Create 0/1 indicator of whether got at least one visit from Berkeley
```{r, results="hide"}
school_v2 %>% 
  mutate(got_visit_berkeley = if_else(visits_by_berkeley>0,1,0)) %>%
  count(got_visit_berkeley)
```

### `if_else()` within `mutate()` to create 0/1 indicator variables

We often create dichotomous (0/1) indicator variables of whether something happened (or whether something is TRUE)

- Variables that are of substantive interest to project
    - e.g., did student graduate from college
- Variables that help you investigate data, check quality
    - e.g., indicator of whether an observation is missing/non-missing for a particular variable

### Using `if_else()` within `mutate()`

__Task__

- Create 0/1 indicator if school has median income greater than $100,000

Usually a good idea to investigate "input" variables __before__ creating analysis vars
```{r, results="hide"}
str(school_v2$med_inc) # investigate variable type
school_v2 %>% count(med_inc) # frequency count, but this isn't very helpful

school_v2 %>% filter(is.na(med_inc)) %>% count(med_inc) 
school_v2 %>% filter(is.na(med_inc)) %>% count()
# shows number of obs w/ missing med_inc
```
Create variable
```{r}
school_v2 %>% select(med_inc) %>% 
  mutate(inc_gt_100k= if_else(med_inc>100000,1,0)) %>%
  count(inc_gt_100k) # note how NA values of med_inc treated
```

### Using `if_else()` within `mutate()`

__Task__: 

- Create 0/1 indicator if school has median income greater than $100,000. 

This time, let's experiment with the `missing` argument of `if_else()`
```{r, eval=FALSE}
#what we wrote before
school_v2 %>% select(med_inc) %>% 
  mutate(inc_gt_100k= if_else(med_inc>100000,1,0)) %>%
  count(inc_gt_100k)

#manually write out the default value for `missing`
school_v2 %>% select(med_inc) %>% 
  mutate(inc_gt_100k= if_else(med_inc>100000,1,0, missing = NULL)) %>%
  count(inc_gt_100k) # note how NA values of med_inc treated

school_v2 %>% select(med_inc) %>% 
  mutate(inc_gt_100k= if_else(med_inc>100000,1,0, missing = NA_real_)) %>%
  count(inc_gt_100k) # note how NA values of med_inc treated
# NA can be coerced to any other vector type except raw: 
# NA_integer_, NA_real_, NA_complex_ and NA_character_

# Here we give missing values in condition the value of -9 in new variable
school_v2 %>% select(med_inc) %>% 
  mutate(inc_gt_100k= if_else(med_inc>100000,1,0, missing = -9)) %>%
  count(inc_gt_100k) 
```

### Using `if_else()` function within `mutate()`

__Task__

- Create 0/1 indicator variable `nonmiss_math` which indicates whether school has non-missing values for the variable `num_took_math`
    - note: `num_took_math` refers to number of students at school that took state math proficiency test


Usually a good to investigate "input" variables before creating analysis vars
```{r, results="hide"}
school_v2 %>% count(num_took_math) # this isn't very helpful
school_v2 %>% filter(is.na(num_took_math)) %>% count(num_took_math) # shows number of obs w/ missing med_inc
```
Create variable
```{r}
school_v2 %>% select(num_took_math) %>% 
  mutate(nonmiss_math= if_else(!is.na(num_took_math),1,0)) %>%
  count(nonmiss_math) # note how NA values treated
```

### Student exercises `if_else()`

1. Using the object `school_v2`, create 0/1 indicator variable `in_state_berkeley` that equals `1` if the high school is in the same state as UC Berkeley (i.e., `state_code=="CA"`).  
2. Create 0/1 indicator `berkeley_and_irvine` of whether a school got at least one visit from UC Berkeley __AND__ from UC Irvine.  
3. Create 0/1 indicator `berkeley_or_irvine` of whether a school got at least one visit from UC Berkeley __OR__ from UC Irvine.

### Exercise`if_else()` solutions

1. Using the object `school_v2`, create 0/1 indicator variable `in_state_berkeley` that equals `1` if the high school is in the same state as UC Berkeley (i.e., `state_code=="CA"`). 
```{r, results="hide"}
str(school_v2$state_code) # investigate input variable
school_v2 %>% filter(is.na(state_code)) %>% count() # investigate input var

#Create var
school_v2 %>% mutate(in_state_berkeley=if_else(state_code=="CA",1,0)) %>%
  count(in_state_berkeley)
```

### Exercise`if_else()` solutions

2. Create 0/1 indicator `berkeley_and_irvine` of whether a school got at least one visit from UC Berkeley __AND__ from UC Irvine.
```{r, results="hide"}
#investigate input vars
school_v2 %>% select(visits_by_berkeley, visits_by_irvine) %>% str() 
school_v2 %>% filter(is.na(visits_by_berkeley)) %>% count()
school_v2 %>% filter(is.na(visits_by_irvine)) %>% count()

#create variable
school_v2 %>% 
  mutate(berkeley_and_irvine=if_else(visits_by_berkeley>0 
    & visits_by_irvine>0,1,0)) %>%
  count(berkeley_and_irvine)
```

### Exercise`if_else()` solutions

3. Create 0/1 indicator `berkeley_or_irvine` of whether a school got at least one visit from UC Berkeley __OR__ from UC Irvine.
```{r, results="hide"}
school_v2 %>% 
  mutate(berkeley_or_irvine=if_else(visits_by_berkeley>0 | visits_by_irvine>0,1,0)) %>%
  count(berkeley_or_irvine)
```

## Using recode() function within mutate()

### Using `recode()` function within `mutate()`

```{r, eval=FALSE, echo=FALSE}
?recode
```
__Description__: Recode values of a variable

__Usage (i.e., syntax)__

- recode(.x, ..., .default = NULL, .missing = NULL)

__Arguments__ [see help file for further details]

- `.x`	A vector (e.g., variable) to modify
- `...`	Specifications for recode, of form `current_value = new_recoded_value` 
- `.default`:	If supplied, all values not otherwise matched given this value. 
- `.missing`: If supplied, any missing values in .x replaced by this value.

__Example__: Using data frame `wwlist`, create new 0/1 indicator `public_school` from variable `school_type`
```{r, results="hide"}
str(wwlist$school_type)
wwlist %>% count(school_type)

wwlist_temp <- wwlist %>% select(school_type) %>% 
  mutate(public_school = recode(school_type,"public" = 1, "private" = 0))

wwlist_temp %>% head(n=10)
str(wwlist_temp$public_school) # note: numeric variable
wwlist_temp %>% count(public_school) # note the NAs
rm(wwlist_temp)
```
### Using `recode()` function within `mutate()`

Recoding `school_type` could have been accomplished using `if_else()`

- Use `recode()` when new variable has more than two categories

__Task__: Create `school_catv2` based on `school_category` with these categories:

- "regular"; "alternative"; "special"; "vocational"

Investigate input var
```{r, results="hide"}
str(wwlist$school_category) # character variable
wwlist %>% count(school_category) 
```


Recode
```{r, results="hide"}
wwlist_temp <- wwlist %>% select(school_category) %>% 
  mutate(school_catv2 = recode(school_category,
    "Alternative Education School" = "alternative",
    "Alternative/other" = "alternative",
    "Regular elementary or secondary" = "regular",
    "Regular School" = "regular",
    "Special Education School" = "special",
    "Special program emphasis" = "special",
    "Vocational Education School" = "vocational")
  )
str(wwlist_temp$school_catv2) # character variable created
wwlist_temp %>% count(school_catv2)
rm(wwlist_temp)
```
### Using `recode()` within `mutate()`

__Task__: Create `school_catv2` based on `school_category` with these categories:

- "regular"; "alternative"; "special"; "vocational"
- This time use the `.missing` argument to recode `NAs` to "unknown"

```{r, results="hide"}
wwlist_temp <- wwlist %>% select(school_category) %>% 
  mutate(school_catv2 = recode(school_category,
    "Alternative Education School" = "alternative",
    "Alternative/other" = "alternative",
    "Regular elementary or secondary" = "regular",
    "Regular School" = "regular",
    "Special Education School" = "special",
    "Special program emphasis" = "special",
    "Vocational Education School" = "vocational",
    .missing = "unknown")
  )
str(wwlist_temp$school_catv2)
wwlist_temp %>% count(school_catv2)
wwlist %>% count(school_category)
rm(wwlist_temp)
```

### Using `recode()` within `mutate()` 

__Task__: Create `school_catv2` based on `school_category` with these categories:

- "regular"; "alternative"; "special"; "vocational"
- This time use the `.default` argument to assign the value "regular"

```{r, results="hide"}
wwlist_temp <- wwlist %>% select(school_category) %>% 
  mutate(school_catv2 = recode(school_category,
    "Alternative Education School" = "alternative",
    "Alternative/other" = "alternative",
    "Special Education School" = "special",
    "Special program emphasis" = "special",
    "Vocational Education School" = "vocational",
    .default = "regular")
  )
str(wwlist_temp$school_catv2)
wwlist_temp %>% count(school_catv2)
wwlist %>% count(school_category)
rm(wwlist_temp)
```
### Using `recode()` within `mutate()` 

__Task__: Create `school_catv2` based on `school_category` with these categories:

- This time create a numeric variable rather than character:
    - `1` for "regular"; `2` for "alternative"; `3` for "special"; `4` for "vocational"

```{r, results="hide"}
wwlist_temp <- wwlist %>% select(school_category) %>% 
  mutate(school_catv2 = recode(school_category,
    "Alternative Education School" = 2,
    "Alternative/other" = 2,
    "Regular elementary or secondary" = 1,
    "Regular School" = 1,
    "Special Education School" = 3,
    "Special program emphasis" = 3,
    "Vocational Education School" = 4)
  )
str(wwlist_temp$school_catv2) # note: numeric variable now
wwlist_temp %>% count(school_catv2)
wwlist %>% count(school_category)
rm(wwlist_temp)
```
### Student exercise using `recode()` within `mutate()`

```{r, results="hide"}
load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_event_somevars.RData"))
names(df_event)
```

1. Using object `df_event`, assign new object `df_event_temp` and a numeric variable create `event_typev2` based on `event_type` with these categories: 
    - `1` for "2yr college"; `2` for "4yr college"; `3` for "other"; `4` for "private hs"; `5` for "public hs" 

2. This time use the `.default` argument to assign the value `5` for "public hs"

### Exercise using `recode()` within `mutate()` solutions 
Check input variable
```{r, results="hide"}
names(df_event)
str(df_event$event_type)
df_event %>% count(event_type)
```

### Exercise using `recode()` within `mutate()` solutions 
1. Using object `df_event`, assign new object `df_event_temp` and create a numeric variable `event_typev2` based on `event_type` with these categories: 
    - `1` for "2yr college"; `2` for "4yr college"; `3` for "other"; `4` for "private hs"; `5` for "public hs"   

```{r results="hide"}
df_event_temp <- df_event %>% 
  select(event_type) %>%
  mutate(event_typev2 = recode(event_type,
                              "2yr college" = 1,
                              "4yr college" = 2,
                              "other" = 3,
                              "private hs" = 4,
                              "public hs" = 5)
         )
str(df_event_temp$event_typev2)
df_event_temp %>% count(event_typev2)
df_event %>% count(event_type)
```

### Exercise using `recode()` within `mutate()` solutions  

2. This time assign the value use the `.default` argument to assign the value `5` for "public hs"
```{r, results="hide"}
df_event_temp <- df_event %>% select(event_type) %>% 
  mutate(event_typev2 = recode(event_type,
    "2yr college" = 1,
    "4yr college" = 2,
    "other" = 3,
    "private hs" = 4,
    .default = 5)
  )
str(df_event_temp$event_typev2)
df_event_temp %>% count(event_typev2)
df_event %>% count(event_type)
```

## Using case_when() function within mutate()

### Using `case_when()` function within `mutate()` 

```{r, eval=FALSE, echo=FALSE}
?case_when
```
`case_when()` useful for creating variable that is a function of multiple "input" variables

__Usage (i.e., syntax)__: `case_when(...)`

__Arguments__ [from help file; see help file for more details]

- `...`: A sequence of two-sided formulas. 
    - The left hand side (LHS) determines which values match this case. 
        - LHS must evaluate to a logical vector.     
    - The right hand side (RHS) provides the replacement value.
 
 
__Example task__: Using data frame `wwlist` and input vars `state` and `firstgen`, create a 4-category var with following categories:

- "instate_firstgen"; "instate_nonfirstgen"; "outstate_firstgen"; "outstate_nonfirstgen"

```{r, results="hide"}
wwlist_temp <- wwlist %>% select(state,firstgen) %>%
  mutate(state_gen = case_when(
    state == "WA" & firstgen =="Y" ~ "instate_firstgen",
    state == "WA" & firstgen =="N" ~ "instate_nonfirstgen",
    state != "WA" & firstgen =="Y" ~ "outstate_firstgen",
    state != "WA" & firstgen =="N" ~ "outstate_nonfirstgen")
  )
str(wwlist_temp$state_gen)
wwlist_temp %>% count(state_gen)
```


### Using `case_when()` function within `mutate()` 

__Task__: Using data frame `wwlist` and input vars `state` and `firstgen`, create a 4-category var

Let's take a closer look at how values of inputs are coded into values of outputs
```{r, results="hide"}
wwlist %>% select(state,firstgen) %>% str()
count(wwlist,state)
count(wwlist,firstgen)
```

Create variable
```{r, results="hide"}
wwlist_temp <- wwlist %>% select(state,firstgen) %>%
  mutate(state_gen = case_when(
    state == "WA" & firstgen =="Y" ~ "instate_firstgen",
    state == "WA" & firstgen =="N" ~ "instate_nonfirstgen",
    state != "WA" & firstgen =="Y" ~ "outstate_firstgen",
    state != "WA" & firstgen =="N" ~ "outstate_nonfirstgen")
  )
```

Compare values of input vars to value of output var
```{r, results="hide"}
wwlist_temp %>% count(state_gen)
wwlist_temp %>% filter(is.na(state)) %>% count(state_gen)
wwlist_temp %>% filter(is.na(firstgen)) %>% count(state_gen)
wwlist_temp %>% filter(is.na(firstgen) | is.na(state)) %>% count(state_gen)
```

__Take-away__: by default var created by `case_when()` equals `NA` for obs where one of the inputs equals `NA`

### Student exercise using `case_when()` within `mutate()`

1. Using the object `school_v2` and input vars `school_type`, and `state_code` , create a 4-category var `state_type` with following categories:  
    - "instate_public"; "instate_private"; "outstate_public"; "outstate_private" 
    - Note: We are referring to CA as in-state for this example


### Exercise using `case_when()` within `mutate()` solution
Investigate
```{r, results="hide"}
school_v2 %>% select(state_code,school_type) %>% str()
count(school_v2,state_code)
school_v2 %>% filter(is.na(state_code)) %>% count()

count(school_v2,school_type)
school_v2 %>% filter(is.na(school_type)) %>% count()
```

### Exercise using `case_when()` within `mutate()` solution
1. Using the object `school_v2` and input vars `school_type`, and `state_code` , create a 4-category var `state_type` with following categories:  
    - "instate_public"; "instate_private"; "outstate_public"; "outstate_private" 
```{r}
school_v2_temp <- school_v2 %>% select(state_code,school_type) %>%
  mutate(state_type = case_when(
    state_code == "CA" & school_type == "public"  ~ "instate_public",
    state_code == "CA" & school_type == "private" ~ "instate_private",
    state_code != "CA" & school_type == "public" ~ "outstate_public",
    state_code != "CA" & school_type == "private" ~ "outstate_private")
  )

school_v2_temp %>% count(state_type)
#school_v2_temp %>% filter(is.na(state_code)) %>% count(state_type) #no missing
#school_v2_temp %>% filter(is.na(school_type)) %>% count(state_type) #no missing
```


# Base R appraoch to creating new variables 


### Base R approach to creating new variables

Create new variables using assignment operator `<-` and subsetting operators `[]` and `$` to create new variables and set conditions of the input variables 

\medskip

Pseudo syntax: `df$newvar <- ...` 

- where `...` argument is expression(s)/calculation(s) used to create new variables
    - expressions can include subsetting operators and/or other base R functions

\medskip

__Task__: Create measure of percent of students on free-reduced lunch

__base R approach__
```{r}
school_v2_temp<- school_v2 #create copy of dataset; not necessary
school_v2_temp$pct_fr_lunch <- 
   school_v2_temp$num_fr_lunch/school_v2_temp$total_students

#investigate variable you created
str(school_v2_temp$pct_fr_lunch)
school_v2_temp$pct_fr_lunch[1:5] # print first 5 obs
```

__tidyverse approach (with pipes)__
```{r}
school_v2_temp <- school_v2 %>% 
  mutate(pct_fr_lunch = num_fr_lunch/total_students) 
```

### Base R approach to creating new variables

If creating new variable based on the condition/values of input variables, basically the tidyverse equivalent of `mutate()` __with__ `if_else()` or `recode()`

\medskip

- Pseudo syntax: `df$newvar[logical condition]<- new value` 
- `logical condition`: a condition that evaluates to `TRUE` or `FALSE`

###  Base R approach to creating new variables

__Task__: Create 0/1 indicator if school has median income greater than $100k

__tidyverse approach (using pipes)__
```{r}
school_v2_temp %>% select(med_inc) %>% 
  mutate(inc_gt_100k= if_else(med_inc>100000,1,0)) %>%
  count(inc_gt_100k) # note how NA values of med_inc treated
```

__Base R approach__
```{r}
school_v2_temp$inc_gt_100k<-NA #initialize an empty column with NAs 
                              # otherwise you'll get warning
school_v2_temp$inc_gt_100k[school_v2_temp$med_inc>100000] <- 1
school_v2_temp$inc_gt_100k[school_v2_temp$med_inc<=100000] <- 0
count(school_v2_temp, inc_gt_100k)
```

### Base R approach to creating new variables 

__Task__: Using data frame `wwlist` and input vars `state` and `firstgen`, create a 4-category var with following categories:

- "instate_firstgen"; "instate_nonfirstgen"; "outstate_firstgen"; "outstate_nonfirstgen"

__tidyverse approach (using pipes)__
```{r}
wwlist_temp <- wwlist %>% 
  mutate(state_gen = case_when(
    state == "WA" & firstgen =="Y" ~ "instate_firstgen",
    state == "WA" & firstgen =="N" ~ "instate_nonfirstgen",
    state != "WA" & firstgen =="Y" ~ "outstate_firstgen",
    state != "WA" & firstgen =="N" ~ "outstate_nonfirstgen")
  )
str(wwlist_temp$state_gen)
wwlist_temp %>% count(state_gen)
```


### Base R approach to creating new variables 

__Task__: Using  `wwlist` and input vars `state` and `firstgen`, create a 4-category var

__base R approach__
```{r}
wwlist_temp <- wwlist 

wwlist_temp$state_gen <- NA
wwlist_temp$state_gen[wwlist_temp$state == "WA" 
  & wwlist_temp$firstgen =="Y"] <- "instate_firstgen"
wwlist_temp$state_gen[wwlist_temp$state == "WA" 
  & wwlist_temp$firstgen =="N"] <- "instate_nonfirstgen"
wwlist_temp$state_gen[wwlist_temp$state != "WA" 
  & wwlist_temp$firstgen =="Y"] <- "outstate_firstgen"
wwlist_temp$state_gen[wwlist_temp$state != "WA" 
  & wwlist_temp$firstgen =="N"] <- "outstate_nonfirstgen"

str(wwlist_temp$state_gen)
count(wwlist_temp, state_gen)
```