---
title: "Lecture 4: Pipes and variable creation"
subtitle:  "Managing and Manipulating Data Using R"
author:
date: 
fontsize: 8pt
classoption: dvipsnames  # for colors
urlcolor: blue
output:
  beamer_presentation:
    keep_tex: true
    toc: false
    slide_level: 3
    theme: default # AnnArbor # push to header?
    #colortheme: "dolphin" # push to header?
    #fonttheme: "structurebold"
    highlight: default # Supported styles include "default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn", and "haddock" (specify null to prevent syntax highlighting); push to header
    df_print: tibble #default # tibble # push to header?    
    latex_engine: xelatex #  Available engines are pdflatex [default], xelatex, and lualatex; The main reasons you may want to use xelatex or lualatex are: (1) They support Unicode better; (2) It is easier to make use of system fonts.
    includes:
      in_header: ../beamer_header.tex
      #after_body: table-of-contents.txt 
---

```{r, echo=FALSE, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE)
  #comment = "#>" makes it so results from a code chunk start with "#>"; default is "##"
```

# Introduction

### What we will do today

\tableofcontents

```{r, eval=FALSE, echo=FALSE}
#Use this if you want TOC to show level 2 headings
\tableofcontents
#Use this if you don't want TOC to show level 2 headings
\tableofcontents[subsectionstyle=hide/hide/hide]
```

### Libraries we will use today

"Load" the package we will use today (output omitted)

- __you must run this code chunk__
```{r, message=FALSE}
library(tidyverse)
```
If the package is not yet installed, you must install it before you can load it. Install from the "console" rather than in the .Rmd file

- Generic syntax: `install.packages("package_name")`
- Install "tidyverse": `install.packages("tidyverse")`

Note: when we load package, name of package is not in quotes; but when we install package, name of package is in quotes:

- `install.packages("tidyverse")`
- `library(tidyverse)`


## Data for lecture

### Lecture 3 data: prospects purchased by Western Washington U.

The "Student list" business

- Universities identify/target "prospects" by buying "student lists" from College Board/ACT (e.g., $.40 per prospect)
- Prospect lists contain contact info (e.g., address, email), academic achievement, socioeconomic, demographic characteristics
- Universities choose which prospects to purchase by filtering on criteria like zip-code, GPA, test score range, etc.

```{r}
#load prospect list data
load(url("https://github.com/ozanj/rclass/raw/master/data/prospect_list/wwlist_merged.RData"))
```

Object `wwlist`

- De-identified list of prospective students purchased by Western Washington University from College Board
- We collected these data using FOIA request
    - ASIDE: Become an expert on collecting data via FOIA requests and you will become a superstar!


### Lecture 3 data: prospects purchased by Western Washington U.

Observations on `wwlist`

- each observation represents a prospective student
```{r}
typeof(wwlist)
dim(wwlist)
```

Variables on `wwlist`

- some vars provide de-identified data on individual prospects
    - e.g., `psat_range`, `state`, `sex`, `ethn_code`
- some vars provide data about zip-code student lives in
    - e.g., `med_inc`, `pop_total`, `pop_black`
- some vars provide data about school student enrolled in
    - e.g., `fr_lunch` is number of students on free/reduced lunch
    - note: bad merge between prospect-level data and school-level data
```{r, results="hide"}
names(wwlist)
str(wwlist)
```


# Pipes

### What are "pipes", %>%

__Pipes__ are a means of performing multiple steps in a single line of code

- The `%>%` pipe is part of the __tidyverse__ suite of packages (it comes from the `magrittr` package), not __base R__; R 4.1 and later also has a native pipe, `|>`
- When writing code, the pipe symbol is `%>%`
- Basic flow of using pipes in code:
    - `object %>% some_function %>% some_function %>% ...`    
- Pipes work from left to right:
    - The object/result from left of `%>%` pipe symbol is the input of function to the right of the `%>%` pipe symbol 
    - In turn, the resulting output becomes the input of the function to the right of the next `%>%` pipe symbol 
    
Intuitive mnemonic device for understanding pipes

- whenever you see a pipe `%>%` think of the words "__and then...__"
- Example: `wwlist %>% filter(firstgen == "Y")`
    - in words: start with object `wwlist` __and then__ filter first generation students
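
To see the "and then" logic on a small scale, here is a minimal sketch using a made-up tibble (`toy_prospects` and its values are hypothetical, not taken from `wwlist`). It counts first-generation rows with a nested call and with pipes, which should give the same answer.

```{r}
library(dplyr)

# hypothetical toy data, for illustration only
toy_prospects <- tibble(
  state    = c("WA", "CA", "WA", "OR"),
  firstgen = c("Y", "N", "N", "Y")
)

# without pipes: read from the inside out
n_nested <- nrow(filter(toy_prospects, firstgen == "Y"))

# with pipes: start with toy_prospects, and then filter, and then count rows
n_piped <- toy_prospects %>% filter(firstgen == "Y") %>% nrow()

c(n_nested, n_piped) # the two approaches agree
```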

### Do task with and without pipes

Task:

- Using object `wwlist` print data for "first-generation" prospects (`firstgen == "Y"`)
```{r, results='hide'}
filter(wwlist, firstgen == "Y") # without pipes
wwlist %>% filter(firstgen == "Y") # with pipes
```
Comparing the two approaches:

- In the "without pipes" approach, the object is the first argument of the `filter()` function
- In the "pipes" approach, you don't specify the object as the first argument of `filter()`
    - Why? Because `%>%` "pipes" the object to the left of the `%>%` operator into the function to the right of the `%>%` operator
    
Main takeaway:

- When writing code using pipes, functions to right of `%>%` pipe operator should not explicitly name object that is the input to the function. 
- Rather, object to the left of `%>%` pipe operator is automatically the input.
    

### More intuition on the pipe operator, `%>%`

The pipe operator "pipes" (verb) an object from left of `%>%` operator into the function to the right of the `%>%` operator

Example:
```{r, results="hide"}
str(wwlist) # without pipe

wwlist %>% str() # with pipe
```

### Do task with and without pipes

__Task__: Using object `wwlist`, print data for "first-gen" prospects for selected variables [output omitted]

```{r, results='hide'}
#Without pipes
select(filter(wwlist, firstgen == "Y"), state, hs_city, sex)
#With pipes
wwlist %>% filter(firstgen == "Y") %>% select(state, hs_city, sex)
```


Comparing the two approaches:

- In the "without pipes" approach, code is written "inside out"
    - The first step in the task -- identifying the object -- is the innermost part of code
    - The last step in task -- selecting variables to print -- is the outermost part of code
- In "pipes" approach the left-to-right order of code matches how we think about the task
    - First, we start with an object __*and then*__ (`%>%`) we use `filter()` to isolate first-gen students __*and then*__ (`%>%`) we select which  variables to print

Think about what object was "piped" into `select()` from `filter()`
```{r, results="hide"}
wwlist %>% filter(firstgen == "Y") %>% str()
```

### Aside: the `count()` function  [students work on their own]

\medskip `count()` function from `dplyr` package counts the number of obs by group
```{r, eval=FALSE, echo=FALSE}
?count
```

__Syntax__ [see help file for full syntax]

- `count(x,...)`

__Arguments__ [see help file for full arguments]

- `x`: an object, often a data frame
- `...`: variables to group by

Examples of using `count()`

- Without vars in `...` argument, counts number of obs in object
```{r, results="hide"}
count(wwlist)
wwlist %>% count()
```
- With vars in `...` argument, counts number of obs per variable value
    - note: by default, `count()` always shows `NAs` [this is good!]
```{r, results="hide"}
count(wwlist,school_category)
wwlist %>% count(school_category)
```
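
As a small self-contained check of the note about `NA`s, the sketch below runs `count()` on a made-up tibble (`toy_cat` and its values are hypothetical, not from `wwlist`); the missing value should get its own row in the result.

```{r}
library(dplyr)

# hypothetical toy data, for illustration only
toy_cat <- tibble(
  school_category = c("Regular School", "Special Education School",
                      NA, "Regular School")
)

toy_counts <- toy_cat %>% count(school_category)
toy_counts # one row per distinct value, including a row for NA
```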
### Aside: pipe operators and new lines

\medskip Often want to insert line breaks to make long line of code more readable

- When inserting line breaks, __pipe operator `%>%` should be the last thing before a line break, not the first thing after a line break__

__This works__
```{r, results="hide"}
wwlist %>% filter(firstgen == "Y") %>% 
  select(state, hs_city, sex) %>%
  count(sex)
```
__This works too__
```{r, results="hide"}
wwlist %>% filter(firstgen == "Y",
                  state != "WA") %>% 
  select(state, hs_city, sex) %>%
  count(sex)
```

__This doesn't work__
```{r, eval=FALSE}
wwlist %>% filter(firstgen == "Y") 
  %>% select(state, hs_city, sex) 
  %>% count(sex)
```


### Do task with and without pipes

Task: 

- Count the number of "first-generation" prospects from the state of Washington

Without pipes
```{r}
count(filter(wwlist, firstgen == "Y", state == "WA"))
```

With pipes
```{r}
wwlist %>% filter(firstgen == "Y", state == "WA") %>% count()
```

### Do task with and without pipes

__Task__: frequency table of `school_type` for non first-gen prospects from WA

__without pipes__
```{r}
wwlist_temp <- filter(wwlist, firstgen == "N", state == "WA")
table(wwlist_temp$school_type, useNA = "always")
rm(wwlist_temp) # remove temporary object; not needed after creating the table
```
__With pipes__
```{r}
wwlist %>% filter(firstgen == "N", state == "WA") %>% count(school_type)
```

__Comparison of two approaches__

- without pipes, task requires multiple lines of code (this is quite common)
    - first line creates object; second line analyzes object
- with pipes, task can be completed in one line of code and you aren't left with objects you don't care about

### Student exercises with pipes

1. Using object `wwlist` select the following variables (state, firstgen, ethn_code) and assign `<-` them to object `wwlist_temp`.  (ex. wwlist_temp <- wwlist)

2. Using the object you just created `wwlist_temp`, create a frequency table of `ethn_code` for first-gen prospects from California.  

3. **Bonus**: Try doing question 1 and 2 together. Use original object `wwlist`, but do not assign to a new object. 

Once finished you can `rm(wwlist_temp)`

### Solution to exercises with pipes

1. Using object `wwlist` select the following variables (state, firstgen, ethn_code) and assign them to object `wwlist_temp` 
```{r}
wwlist_temp <- wwlist %>%
  select(state, firstgen, ethn_code) 
```

### Solution to exercises with pipes
2. Using the object you just created `wwlist_temp`, create a frequency table of `ethn_code` for first-gen prospects from California. 
```{r}
#names(wwlist)
wwlist_temp %>%
  filter(firstgen == "Y", state == "CA") %>% count(ethn_code)
```

### Solution to exercises with pipes
3. **Bonus**: Try doing question 1 and 2 together. 

```{r}
wwlist %>%
  select(state, firstgen, ethn_code) %>%
  filter(firstgen == "Y", state == "CA") %>% 
  count(ethn_code)
#rm(wwlist_temp)
```

```{r}
rm(wwlist_temp)
```


# Creating variables using mutate (tidyverse approach)

### Our plan for learning how to create new variables

Recall that the `dplyr` package within `tidyverse` provides a set of functions that can be described as "verbs": __subsetting__, __sorting__, and __transforming__


What we've done | Where we're going
--------------- | --------------------
__Subsetting data__ | __Transforming data__
- `select()` variables  | - `mutate()` creates new variables
- `filter()` observations | - `summarize()` calculates across rows
__Sorting data__ | - `group_by()` to calculate across rows within groups
- `arrange()` | 

__Today__

- we'll use `mutate()` to create new variables based on calculations across columns within a row

__Next week__

- we'll combine `mutate()` with `summarize()` and `group_by()` to create variables based on calculations across rows

### Create new data frame based on `df_school_all`

Data frame `df_school_all` has one obs per US high school and then variables identifying number of visits by particular universities
```{r}
load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_school_allvars.RData"))
names(df_school_all)
```
### Create new data frame based on `df_school_all`

Let's create new version of this data frame, called `school_v2`, which we'll use to introduce how to create new variables
```{r, results='hide'}
school_v2 <- df_school_all %>% 
  select(-contains("inst_")) %>% # remove vars whose names contain "inst_"
  rename(
    visits_by_berkeley = visits_by_110635,
    visits_by_boulder = visits_by_126614,
    visits_by_bama = visits_by_100751,
    visits_by_stonybrook = visits_by_196097,
    visits_by_rutgers = visits_by_186380,
    visits_by_pitt = visits_by_215293,
    visits_by_cinci = visits_by_201885,
    visits_by_nebraska = visits_by_181464,
    visits_by_georgia = visits_by_139959,
    visits_by_scarolina = visits_by_218663,
    visits_by_ncstate = visits_by_199193,
    visits_by_irvine = visits_by_110653,
    visits_by_kansas = visits_by_155317,
    visits_by_arkansas = visits_by_106397,
    visits_by_sillinois = visits_by_149222,
    visits_by_umass = visits_by_166629,
    num_took_read = num_took_rla,
    num_prof_read = num_prof_rla,
    med_inc = avgmedian_inc_2564)

names(school_v2)
```

## Introduce mutate() function 

### Introduce `mutate()` function

`mutate()` is __tidyverse__ approach to creating variables (not __Base R__ approach)

Description of `mutate()`

- creates new columns (variables) that are functions of existing columns
- After creating a new variable using `mutate()`, every row of data is retained
- `mutate()` works best with pipes `%>%`

__Task__: 

- Using data frame `school_v2` create new variable that measures the pct of students on free/reduced lunch (output omitted)
```{r, results='hide'}
school_sml <- school_v2 %>% # create new dataset with fewer vars; not necessary to do this
  select(ncessch, school_type, num_fr_lunch, total_students)

school_sml %>% 
  mutate(pct_fr_lunch = num_fr_lunch/total_students) # create new var

rm(school_sml)
```
### Syntax for `mutate()`

Let's spend a couple minutes looking at help file for `mutate()`
```{r, eval=FALSE, echo=FALSE}
?mutate
```

__Usage (i.e., syntax)__

- `mutate(.data,...)`

__Arguments__

- `.data`: a data frame
    - if using `mutate()` after pipe operator `%>%`, then this argument can be omitted
        - Why? Because data frame object to left of `%>%` "piped in" to first argument of `mutate()`
- `...`: expressions used to create new variables
    - Can create multiple variables at once

__Value__

- returns an object that contains the original input data frame and new variables that were created by `mutate()`

__Useful functions (i.e., "helper functions")__

- These are standalone functions that can be called *within* `mutate()`
    - e.g., `if_else()`, `recode()`, `case_when()`
- will show examples of this in subsequent slides
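
The syntax above can be sketched on a small made-up tibble (`toy_schools` and its values are hypothetical, not taken from `school_v2`). The sketch creates two variables in one `mutate()` call, where the second uses the first, and then confirms that the input object is unchanged until the result is assigned.

```{r}
library(dplyr)

# hypothetical toy data, for illustration only
toy_schools <- tibble(
  num_fr_lunch   = c(50, 200, NA),
  total_students = c(100, 400, 300)
)

toy_schools_v2 <- toy_schools %>%
  mutate(
    pct_fr_lunch    = num_fr_lunch / total_students, # a proportion between 0 and 1
    pct_fr_lunch100 = pct_fr_lunch * 100             # uses the var created one line above
  )

toy_schools_v2     # original columns plus the two new ones; all rows retained
names(toy_schools) # input object still has only its original columns
```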

### Introduce `mutate()` function

New variable not retained unless we __assign__ `<-` it to an object (existing or new)

\medskip __`mutate()` without assignment__
```{r, results='hide'}
school_v2 %>% mutate(pct_fr_lunch = num_fr_lunch/total_students)

names(school_v2)
```

__`mutate()` with assignment__
```{r, results="hide"}
school_v2_temp <- school_v2 %>% 
  mutate(pct_fr_lunch = num_fr_lunch/total_students) 

names(school_v2_temp)
rm(school_v2_temp)

```

### `mutate()` can create multiple variables at once

`mutate()` can create multiple variables at once

```{r, results='hide'}
school_v2 %>% 
  mutate(pct_fr_lunch = num_fr_lunch/total_students,
         pct_prof_math= num_prof_math/num_took_math) %>%
  select(num_fr_lunch, total_students, pct_fr_lunch, 
         num_prof_math, num_took_math, pct_prof_math)
```

Or we could write code this way:
```{r, results="hide"}
school_v2 %>% 
  select(num_fr_lunch, total_students, num_prof_math, num_took_math) %>%
  mutate(pct_fr_lunch = num_fr_lunch/total_students,
         pct_prof_math= num_prof_math/num_took_math) 
```

### Student exercise using mutate()

1. Using the object `school_v2`, select the following variables (`num_prof_math`, `num_took_math`, `num_prof_read`, `num_took_read`) and create a measure of percent proficient in math `pct_prof_math` and percent proficient in reading `pct_prof_read`.  

2. Now using the code for question 1, filter schools where at least 50% of students are proficient in math **&** reading.  

3. If you have time, count the number of schools from question 2.


### Solutions for exercise using mutate()

1. Using the object `school_v2`, select the following variables (`num_prof_math`, `num_took_math`, `num_prof_read`, `num_took_read`) and create a measure of percent proficient in math `pct_prof_math` and percent proficient in reading `pct_prof_read`.

```{r}
school_v2 %>%
  select(num_prof_math, num_took_math, num_prof_read, num_took_read) %>%
  mutate(pct_prof_math = num_prof_math/num_took_math,
         pct_prof_read = num_prof_read/num_took_read) 
```
### Solutions for exercise using mutate()

2. Now using the code for question 1, filter schools where at least 50% of students are proficient in math **&** reading.
```{r}
school_v2 %>%
  select(num_prof_math, num_took_math, num_prof_read, num_took_read) %>%
  mutate(pct_prof_math = num_prof_math/num_took_math,
         pct_prof_read = num_prof_read/num_took_read) %>%
  filter(pct_prof_math >= 0.5 & pct_prof_read >= 0.5) 
```

### Solutions for exercise using mutate()
3. If you have time, count the number of schools from question 2.
```{r}
school_v2 %>%
  select(num_prof_math, num_took_math, num_prof_read, num_took_read) %>%
  mutate(pct_prof_math = num_prof_math/num_took_math,
         pct_prof_read = num_prof_read/num_took_read) %>%
  filter(pct_prof_math >= 0.5 & pct_prof_read >= 0.5) %>%
  count()
```


## Using ifelse() function within mutate()

### Using `ifelse()` function within `mutate()`
```{r, eval=FALSE}
?if_else
```

__Description__

- if `condition` `TRUE`, assign a value; if `condition` `FALSE` assign a value

__Usage (i.e., syntax)__

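- `if_else(logical condition, true, false, missing = NULL)`
- note: this is the syntax of `if_else()` from `dplyr`; the examples on these slides use base R `ifelse(test, yes, no)`, which takes its first three arguments in the same order but has no `missing` argument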

__Arguments__

- `logical condition`: a condition that evaluates to `TRUE` or `FALSE`
- `true`: value to assign if condition `TRUE`
- `false`: value to assign if condition `FALSE`

__Value__

- "Where condition is TRUE, the matching value from true, where it's FALSE, the matching value from false, otherwise NA."
- missing values from "input" var are assigned missing values in "output var", unless you specify otherwise

__Example__: Create 0/1 indicator of whether got at least one visit from Berkeley
```{r, results="hide"}
school_v2 %>% 
  mutate(got_visit_berkeley = ifelse(visits_by_berkeley>0,1,0)) %>%
  count(got_visit_berkeley)
```
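
The rule about missing values can be checked on a small made-up vector (`toy_visits` is hypothetical, not a column of `school_v2`). This sketch compares base R `ifelse()` with `dplyr`'s `if_else()` and its `missing` argument; the value `-9` is an arbitrary placeholder chosen for illustration.

```{r}
library(dplyr)

toy_visits <- c(0, 3, NA, 1) # hypothetical visit counts; NA means unknown

res_base    <- ifelse(toy_visits > 0, 1, 0)                # base R: NA in, NA out
res_dplyr   <- if_else(toy_visits > 0, 1, 0)               # dplyr: NA in, NA out
res_missing <- if_else(toy_visits > 0, 1, 0, missing = -9) # dplyr: NA recoded to -9

rbind(res_base, res_dplyr, res_missing) # compare the three results column by column
```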

### `ifelse()` within `mutate()` to create 0/1 indicator variables

We often create dichotomous (0/1) indicator variables of whether something happened (or whether something is TRUE)

- Variables that are of substantive interest to project
    - e.g., did student graduate from college
- Variables that help you investigate data, check quality
    - e.g., indicator of whether an observation is missing/non-missing for a particular variable

### Using `ifelse()` within `mutate()`

__Task__

- Create 0/1 indicator if school has median income greater than $100,000

Usually a good idea to investigate "input" variables __before__ creating analysis vars
```{r, results="hide"}
str(school_v2$med_inc) # investigate variable type
school_v2 %>% count(med_inc) # frequency count, but this isn't very helpful

school_v2 %>% filter(is.na(med_inc)) %>% count(med_inc) 
# shows number of obs w/ missing med_inc
```
Create variable
```{r}
school_v2 %>% select(med_inc) %>% 
  mutate(inc_gt_100k= ifelse(med_inc>100000,1,0)) %>%
  count(inc_gt_100k) # note how NA values of med_inc treated
```


### Using `ifelse()` function within `mutate()`

__Task__

- Create 0/1 indicator variable `nonmiss_math` which indicates whether school has non-missing values for the variable `num_took_math`
    - note: `num_took_math` refers to number of students at school that took state math proficiency test


Usually a good idea to investigate "input" variables before creating analysis vars
```{r, results="hide"}
school_v2 %>% count(num_took_math) # this isn't very helpful
school_v2 %>% filter(is.na(num_took_math)) %>% count(num_took_math) # shows number of obs w/ missing num_took_math
```
Create variable
```{r}
school_v2 %>% select(num_took_math) %>% 
  mutate(nonmiss_math= ifelse(!is.na(num_took_math),1,0)) %>%
  count(nonmiss_math) # note how NA values treated
```
### Student exercises `ifelse()`

1. Using the object `school_v2`, create 0/1 indicator variable `in_state_berkeley` that equals `1` if the high school is in the same state as UC Berkeley (i.e., `state_code=="CA"`).  
2. Create 0/1 indicator `berkeley_and_irvine` of whether a school got at least one visit from UC Berkeley __AND__ from UC Irvine.  
3. Create 0/1 indicator `berkeley_or_irvine` of whether a school got at least one visit from UC Berkeley __OR__ from UC Irvine.

### Exercise `ifelse()` solutions

1. Using the object `school_v2`, create 0/1 indicator variable `in_state_berkeley` that equals `1` if the high school is in the same state as UC Berkeley (i.e., `state_code=="CA"`). 
```{r, results="hide"}
str(school_v2$state_code) # investigate input variable
school_v2 %>% filter(is.na(state_code)) %>% count() # investigate input var

#Create var
school_v2 %>% mutate(in_state_berkeley=ifelse(state_code=="CA",1,0)) %>%
  count(in_state_berkeley)
```

### Exercise `ifelse()` solutions

2. Create 0/1 indicator `berkeley_and_irvine` of whether a school got at least one visit from UC Berkeley __AND__ from UC Irvine.
```{r, results="hide"}
#investigate input vars
school_v2 %>% select(visits_by_berkeley, visits_by_irvine) %>% str() 
school_v2 %>% filter(is.na(visits_by_berkeley)) %>% count()
school_v2 %>% filter(is.na(visits_by_irvine)) %>% count()
#create variable
school_v2 %>% 
  mutate(berkeley_and_irvine=ifelse(visits_by_berkeley>0 & visits_by_irvine>0,1,0)) %>%
  count(berkeley_and_irvine)
```

### Exercise `ifelse()` solutions

3. Create 0/1 indicator `berkeley_or_irvine` of whether a school got at least one visit from UC Berkeley __OR__ from UC Irvine.
```{r, results="hide"}
school_v2 %>% 
  mutate(berkeley_or_irvine=ifelse(visits_by_berkeley>0 | visits_by_irvine>0,1,0)) %>%
  count(berkeley_or_irvine)
```

## Using recode() function within mutate()

### Using `recode()` function within `mutate()`

```{r, eval=FALSE, echo=FALSE}
?recode
```
__Description__: Recode values of a variable

__Usage (i.e., syntax)__

- `recode(.x, ..., .default = NULL, .missing = NULL)`

__Arguments__ [see help file for further details]

- `.x`: A vector (e.g., variable) to modify
- `...`: Specifications for recode, of form `current_value = new_recoded_value` 
- `.default`: If supplied, all values not otherwise matched are given this value. 
- `.missing`: If supplied, any missing values in `.x` are replaced by this value.
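
A minimal sketch of how these arguments interact, using a made-up character vector (`toy_type` and the replacement labels are hypothetical, not values from `wwlist`): listed values are recoded explicitly, other non-missing values fall through to `.default`, and `NA` is handled by `.missing`.

```{r}
library(dplyr)

# hypothetical values, for illustration only
toy_type <- c("public", "private", "charter", NA)

toy_recoded <- recode(toy_type,
  "public"  = "pub",
  "private" = "priv",
  .default  = "other",   # any non-missing value not listed above
  .missing  = "unknown"  # NA values
)
toy_recoded
```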

__Example__: Using data frame `wwlist`, create new 0/1 indicator `public_school` from variable `school_type`
```{r, results="hide"}
str(wwlist$school_type)
wwlist %>% count(school_type)

wwlist_temp <- wwlist %>% select(school_type) %>% 
  mutate(public_school = recode(school_type,"public" = 1, "private" = 0))

wwlist_temp %>% head(n=10)
str(wwlist_temp$public_school)
wwlist_temp %>% count(public_school)
rm(wwlist_temp)
```
### Using `recode()` function within `mutate()`

Recoding `school_type` could have been accomplished using `if_else()`

- Use `recode()` when new variable has more than two categories

__Task__: Create `school_catv2` based on `school_category` with these categories:

- "regular"; "alternative"; "special"; "vocational"

Investigate input var
```{r, results="hide"}
str(wwlist$school_category)
wwlist %>% count(school_category)
```
Recode
```{r, results="hide"}
wwlist_temp <- wwlist %>% select(school_category) %>% 
  mutate(school_catv2 = recode(school_category,
    "Alternative Education School" = "alternative",
    "Alternative/other" = "alternative",
    "Regular elementary or secondary" = "regular",
    "Regular School" = "regular",
    "Special Education School" = "special",
    "Special program emphasis" = "special",
    "Vocational Education School" = "vocational")
  )
str(wwlist_temp$school_catv2)
wwlist_temp %>% count(school_catv2)
wwlist %>% count(school_category)
rm(wwlist_temp)
```
### Using `recode()` within `mutate()` [do in pairs/groups]

__Task__: Create `school_catv2` based on `school_category` with these categories:

- "regular"; "alternative"; "special"; "vocational"
- This time use the `.missing` argument to recode `NAs` to "unknown"

```{r, results="hide"}
wwlist_temp <- wwlist %>% select(school_category) %>% 
  mutate(school_catv2 = recode(school_category,
    "Alternative Education School" = "alternative",
    "Alternative/other" = "alternative",
    "Regular elementary or secondary" = "regular",
    "Regular School" = "regular",
    "Special Education School" = "special",
    "Special program emphasis" = "special",
    "Vocational Education School" = "vocational",
    .missing = "unknown")
  )
str(wwlist_temp$school_catv2)
wwlist_temp %>% count(school_catv2)
wwlist %>% count(school_category)
rm(wwlist_temp)
```

### Using `recode()` within `mutate()` 

__Task__: Create `school_catv2` based on `school_category` with these categories:

- "regular"; "alternative"; "special"; "vocational"
- This time use the `.default` argument to assign the value "regular"

```{r, results="hide"}
wwlist_temp <- wwlist %>% select(school_category) %>% 
  mutate(school_catv2 = recode(school_category,
    "Alternative Education School" = "alternative",
    "Alternative/other" = "alternative",
    "Special Education School" = "special",
    "Special program emphasis" = "special",
    "Vocational Education School" = "vocational",
    .default = "regular")
  )
str(wwlist_temp$school_catv2)
wwlist_temp %>% count(school_catv2)
wwlist %>% count(school_category)
rm(wwlist_temp)
```
### Using `recode()` within `mutate()` 

__Task__: Create `school_catv2` based on `school_category` with these categories:

- This time create a numeric variable rather than character:
    - `1` for "regular"; `2` for "alternative"; `3` for "special"; `4` for "vocational"

```{r, results="hide"}
wwlist_temp <- wwlist %>% select(school_category) %>% 
  mutate(school_catv2 = recode(school_category,
    "Alternative Education School" = 2,
    "Alternative/other" = 2,
    "Regular elementary or secondary" = 1,
    "Regular School" = 1,
    "Special Education School" = 3,
    "Special program emphasis" = 3,
    "Vocational Education School" = 4)
  )
str(wwlist_temp$school_catv2)
wwlist_temp %>% count(school_catv2)
wwlist %>% count(school_category)
rm(wwlist_temp)
```
### Student exercise using `recode()` within `mutate()`

```{r, results="hide"}
load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_event_somevars.RData"))
names(df_event)
```

1. Using object `df_event`, assign new object `df_event_temp` and create `event_typev2` based on `event_type` with these categories: 
    - `1` for "2yr college"; `2` for "4yr college"; `3` for "other"; `4` for "private hs"; `5` for "public hs" 

2. This time use the `.default` argument to assign the value `5` for "public hs"

### Exercise using `recode()` within `mutate()` solutions 
Check input variable
```{r, results="hide"}
names(df_event)
str(df_event$event_type)
df_event %>% count(event_type)
```

### Exercise using `recode()` within `mutate()` solutions 
1. Using object `df_event`, assign new object `df_event_temp` and create `event_typev2` based on `event_type` with these categories: 
    - `1` for "2yr college"; `2` for "4yr college"; `3` for "other"; `4` for "private hs"; `5` for "public hs"   

```{r, results="hide"}
df_event_temp <- df_event %>% 
  select(event_type) %>%
  mutate(event_typev2 = recode(event_type,
                              "2yr college" = 1,
                              "4yr college" = 2,
                              "other" = 3,
                              "private hs" = 4,
                              "public hs" = 5)
         )
str(df_event_temp$event_typev2)
df_event_temp %>% count(event_typev2)
df_event %>% count(event_type)
```

### Exercise using `recode()` within `mutate()` solutions  

2. This time use the `.default` argument to assign the value `5` for "public hs"
```{r, results="hide"}
df_event_temp <- df_event %>% select(event_type) %>% # assign, so checks below reflect this recode
  mutate(event_typev2 = recode(event_type,
    "2yr college" = 1,
    "4yr college" = 2,
    "other" = 3,
    "private hs" = 4,
    .default = 5)
  )
str(df_event_temp$event_typev2)
df_event_temp %>% count(event_typev2)
df_event %>% count(event_type)
```

## Using case_when() function within mutate()

### Using `case_when()` function within `mutate()` 

```{r, eval=FALSE, echo=FALSE}
?case_when
```
__Description__: Useful when the variable you want to create is more complicated than variables that can be created using `ifelse()` or `recode()`

- Useful when new variable is a function of multiple "input" variables

__Usage (i.e., syntax)__: `case_when(...)`

__Arguments__ [from help file; see help file for more details]

- `...`: A sequence of two-sided formulas. 
    - The left hand side (LHS) determines which values match this case. 
        - LHS must evaluate to a logical vector.     
    - The right hand side (RHS) provides the replacement value.
 
 
__Example task__: Using data frame `wwlist` and input vars `state` and `firstgen`, create a 4-category var with following categories:

- "instate_firstgen"; "instate_nonfirstgen"; "outstate_firstgen"; "outstate_nonfirstgen"

```{r, results="hide"}
wwlist_temp <- wwlist %>% select(state,firstgen) %>%
  mutate(state_gen = case_when(
    state == "WA" & firstgen =="Y" ~ "instate_firstgen",
    state == "WA" & firstgen =="N" ~ "instate_nonfirstgen",
    state != "WA" & firstgen =="Y" ~ "outstate_firstgen",
    state != "WA" & firstgen =="N" ~ "outstate_nonfirstgen")
  )
str(wwlist_temp$state_gen)
wwlist_temp %>% count(state_gen)
```


### Using `case_when()` function within `mutate()` 

__Task__: Using data frame `wwlist` and input vars `state` and `firstgen`, create a 4-category var with following categories:

- "instate_firstgen"; "instate_nonfirstgen"; "outstate_firstgen"; "outstate_nonfirstgen"

Let's take a closer look at how values of inputs are coded into values of outputs
```{r, results="hide"}
wwlist %>% select(state,firstgen) %>% str()
count(wwlist,state)
count(wwlist,firstgen)

wwlist_temp <- wwlist %>% select(state,firstgen) %>%
  mutate(state_gen = case_when(
    state == "WA" & firstgen =="Y" ~ "instate_firstgen",
    state == "WA" & firstgen =="N" ~ "instate_nonfirstgen",
    state != "WA" & firstgen =="Y" ~ "outstate_firstgen",
    state != "WA" & firstgen =="N" ~ "outstate_nonfirstgen")
  )

wwlist_temp %>% count(state_gen)
wwlist_temp %>% filter(is.na(state)) %>% count(state_gen)
wwlist_temp %>% filter(is.na(firstgen)) %>% count(state_gen)
```
__Take-away__: by default, var created by `case_when()` equals `NA` for obs that match none of the conditions; here, this includes obs where one of the inputs equals `NA`
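
A minimal sketch of this behavior on made-up data (`toy_ww` and its values are hypothetical, and the label "unknown" is chosen only for illustration): rows where no condition evaluates to `TRUE` come out as `NA`, and adding a final `TRUE ~ ...` case is one common way to give those rows an explicit label instead.

```{r}
library(dplyr)

# hypothetical toy data, for illustration only
toy_ww <- tibble(
  state    = c("WA", "CA", NA, "WA"),
  firstgen = c("Y", "N", "Y", NA)
)

toy_ww_v2 <- toy_ww %>%
  mutate(
    # no fallback: rows 3 and 4 match no condition, so they are NA
    state_gen = case_when(
      state == "WA" & firstgen == "Y" ~ "instate_firstgen",
      state == "WA" & firstgen == "N" ~ "instate_nonfirstgen",
      state != "WA" & firstgen == "Y" ~ "outstate_firstgen",
      state != "WA" & firstgen == "N" ~ "outstate_nonfirstgen"
    ),
    # with fallback: TRUE matches every row not matched by an earlier case
    state_gen_v2 = case_when(
      state == "WA" & firstgen == "Y" ~ "instate_firstgen",
      state == "WA" & firstgen == "N" ~ "instate_nonfirstgen",
      state != "WA" & firstgen == "Y" ~ "outstate_firstgen",
      state != "WA" & firstgen == "N" ~ "outstate_nonfirstgen",
      TRUE ~ "unknown"
    )
  )

toy_ww_v2
```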

### Student exercise using `case_when()` within `mutate()`

1. Using the object `school_v2` and input vars `school_type` and `state_code`, create a 4-category var `state_type` with following categories:  
    - "instate_public"; "instate_private"; "outstate_public"; "outstate_private" 
    - Note: We are referring to CA as in-state for this example


### Exercise using `case_when()` within `mutate()` solution
Investigate
```{r, results="hide"}
school_v2 %>% select(state_code,school_type) %>% str()
count(school_v2,state_code)
school_v2 %>% filter(is.na(state_code)) %>% count()

count(school_v2,school_type)
school_v2 %>% filter(is.na(school_type)) %>% count()
```

### Exercise using `case_when()` within `mutate()` solution
1. Using the object `school_v2` and input vars `school_type` and `state_code`, create a 4-category var `state_type` with following categories:  
    - "instate_public"; "instate_private"; "outstate_public"; "outstate_private" 
```{r}
school_v2_temp <- school_v2 %>% select(state_code,school_type) %>%
  mutate(state_type = case_when(
    state_code == "CA" & school_type == "public"  ~ "instate_public",
    state_code == "CA" & school_type == "private" ~ "instate_private",
    state_code != "CA" & school_type == "public" ~ "outstate_public",
    state_code != "CA" & school_type == "private" ~ "outstate_private")
  )

school_v2_temp %>% count(state_type)
#school_v2_temp %>% filter(is.na(state_code)) %>% count(state_type) #no missing
#school_v2_temp %>% filter(is.na(school_type)) %>% count(state_type) #no missing
```


# Base R approach to creating new variables 


### Base R approach to creating new variables

Subsetting operators `[]` and `$` are used to create new variables and set conditions of the input variables 

\medskip

If creating new variable based on calculation of input variables, this is basically the base R equivalent of tidyverse `mutate()` __without__ `ifelse()` or `recode()`

- Pseudo syntax: `df$newvar <- ...` 
- where `...` is the expression(s)/calculation(s) used to create the new variable 

\medskip

__Task__: Create measure of percent of students on free-reduced lunch

__base R approach__
```{r}
school_v2_temp<- school_v2 #create copy of dataset; not necessary
school_v2_temp$pct_fr_lunch <- 
   school_v2_temp$num_fr_lunch/school_v2_temp$total_students 
```

__tidyverse approach (with pipes)__
```{r}
school_v2_temp <- school_v2 %>% 
  mutate(pct_fr_lunch = num_fr_lunch/total_students) 
```

### Base R approach to creating new variables

If creating new variable based on the condition/values of input variables, this is basically the base R equivalent of tidyverse `mutate()` __with__ `ifelse()` or `recode()`

\medskip

- Pseudo syntax: `df$newvar[logical condition] <- new value` 
- `logical condition`: a condition that evaluates to `TRUE` or `FALSE`
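
The subset-and-assign pattern can be sketched on a small made-up data frame (`toy_inc` and its values are hypothetical, not taken from `school_v2`). Each assignment only touches rows where the condition is `TRUE`, so a row with a missing input keeps the `NA` it was initialized with.

```{r}
# hypothetical toy data, for illustration only
toy_inc <- data.frame(med_inc = c(50000, 150000, NA))

toy_inc$inc_gt_100k <- NA                            # start with all NA
toy_inc$inc_gt_100k[toy_inc$med_inc > 100000]  <- 1  # rows where condition is TRUE
toy_inc$inc_gt_100k[toy_inc$med_inc <= 100000] <- 0  # remaining non-missing rows

toy_inc # the row with missing med_inc keeps NA
```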


###  Base R approach to creating new variables

__Task__: Create 0/1 indicator if school has median income greater than $100k

__tidyverse approach (using pipes)__
```{r}
school_v2_temp %>% select(med_inc) %>% 
  mutate(inc_gt_100k= ifelse(med_inc>100000,1,0)) %>%
  count(inc_gt_100k) # note how NA values of med_inc treated
```

__Base R approach__
```{r}
school_v2_temp$inc_gt_100k<-NA #initialize an empty column with NAs 
                              # otherwise you'll get warning
school_v2_temp$inc_gt_100k[school_v2_temp$med_inc>100000] <- 1
school_v2_temp$inc_gt_100k[school_v2_temp$med_inc<=100000] <- 0
count(school_v2_temp, inc_gt_100k)
```

### Base R approach to creating new variables 

__Task__: Using data frame `wwlist` and input vars `state` and `firstgen`, create a 4-category var with following categories:

- "instate_firstgen"; "instate_nonfirstgen"; "outstate_firstgen"; "outstate_nonfirstgen"

__tidyverse approach (using pipes)__
```{r}
wwlist_temp <- wwlist %>% 
  mutate(state_gen = case_when(
    state == "WA" & firstgen =="Y" ~ "instate_firstgen",
    state == "WA" & firstgen =="N" ~ "instate_nonfirstgen",
    state != "WA" & firstgen =="Y" ~ "outstate_firstgen",
    state != "WA" & firstgen =="N" ~ "outstate_nonfirstgen")
  )
str(wwlist_temp$state_gen)
wwlist_temp %>% count(state_gen)
```


### Base R approach to creating new variables 

__Task__: Using data frame `wwlist` and input vars `state` and `firstgen`, create a 4-category var with following categories:

- "instate_firstgen"; "instate_nonfirstgen"; "outstate_firstgen"; "outstate_nonfirstgen"

__base R approach__
```{r}
wwlist_temp <- wwlist 

wwlist_temp$state_gen <- NA
wwlist_temp$state_gen[wwlist_temp$state == "WA" & wwlist_temp$firstgen =="Y"] <- "instate_firstgen"
wwlist_temp$state_gen[wwlist_temp$state == "WA" & wwlist_temp$firstgen =="N"] <- "instate_nonfirstgen"
wwlist_temp$state_gen[wwlist_temp$state != "WA" & wwlist_temp$firstgen =="Y"] <- "outstate_firstgen"
wwlist_temp$state_gen[wwlist_temp$state != "WA" & wwlist_temp$firstgen =="N"] <- "outstate_nonfirstgen"

str(wwlist_temp$state_gen)
count(wwlist_temp, state_gen)
```