--- title: "Lecture 4: Pipes and variable creation" subtitle: "Managing and Manipulating Data Using R" author: date: fontsize: 8pt classoption: dvipsnames # for colors urlcolor: blue output: beamer_presentation: keep_tex: true toc: false slide_level: 3 theme: default # AnnArbor # push to header? #colortheme: "dolphin" # push to header? #fonttheme: "structurebold" highlight: default # Supported styles include "default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn", and "haddock" (specify null to prevent syntax highlighting); push to header df_print: tibble #default # tibble # push to header? latex_engine: xelatex # Available engines are pdflatex [default], xelatex, and lualatex; The main reasons you may want to use xelatex or lualatex are: (1) They support Unicode better; (2) It is easier to make use of system fonts. includes: in_header: ../beamer_header.tex #after_body: table-of-contents.txt --- ```{r, echo=FALSE, include=FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE) #knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE) #comment = "#>" makes it so results from a code chunk start with "#>"; default is "##" ``` # Introduction ### What we will do today \tableofcontents ```{r, eval=FALSE, echo=FALSE} #Use this if you want TOC to show level 2 headings \tableofcontents #Use this if you don't want TOC to show level 2 headings \tableofcontents[subsectionstyle=hide/hide/hide] ``` ### Libraries we will use today "Load" the package we will use today (output omitted) - __you must run this code chunk__ ```{r, message=FALSE} library(tidyverse) ``` If package not yet installed, then must install before you load. Install in "console" rather than .Rmd file - Generic syntax: `install.packages("package_name")` - Install "tidyverse": `install.packages("tidyverse")` Note: when we load package, name of package is not in quotes; but when we install package, name of package is in quotes: - `install.packages("tidyverse")` - `library(tidyverse)` ## Data for lecture ### Lecture 3 data: prospects purchased by Western Washington U. The "Student list" business - Universities identify/target "prospects" by buying "student lists" from College Board/ACT (e.g., $.40 per prospect) - Prospect lists contain contact info (e.g., address, email), academic achievement, socioeconomic, demographic characteristics - Universities choose which prospects to purchase by filtering on criteria like zip-code, GPA, test score range, etc. ```{r} #load prospect list data load(url("https://github.com/ozanj/rclass/raw/master/data/prospect_list/wwlist_merged.RData")) ``` Object `wwlist` - De-identified list of prospective students purchased by Western Washington University from College Board - We collected these data using FOIA request - ASIDE: Become an expert on collecting data via FOIA requests and you will become a superstar! ### Lecture 3 data: prospects purchased by Western Washington U. Observations on `wwlist` - each observation represents a prospective student ```{r} typeof(wwlist) dim(wwlist) ``` Variables on `wwlist` - some vars provide de-identified data on individual prospects - e.g., `psat_range`, `state`, `sex`, `ethn_code` - some vars provide data about zip-code student lives in - e.g., `med_inc`, `pop_total`, `pop_black` - some vars provide data about school student enrolled in - e.g., `fr_lunch` is number of students on free/reduced lunch - note: bad merge between prospect-level data and school-level data ```{r, results="hide"} names(wwlist) str(wwlist) glimpse(wwlist) # tidyverse function, similar to str() ``` ### Lecture 3 data: prospects purchased by Western Washington U. Variable `firstgen` identifies whether prospect is a first-generation college student Imagine we want to isolate all the first-generation prospects 1. Investigate variable type/structure. - A dichotomous var, but stored as character in `wwlist`. So must use quotes (`''` or `""`) to filter/subset based on values of `firstgen` ```{r} str(wwlist$firstgen) ``` 2. Create frequency table to identify possible values of `firstgen` ```{r} table(wwlist$firstgen, useNA = "always") ``` 3. Isolate all the first-gen prospects ```{r} filter(wwlist, firstgen == "Y") ``` # Pipes ### What are "pipes", %>% __Pipes__ are a means of perfoming multiple steps in a single line of code - When writing code, the pipe symbol is `%>%` - Basic flow of using pipes in code: - `object %>% some_function %>% some_function %>% some_function` - Pipes work from left to right: - The object from left of `%>%` pipe symbol is input as the first argument of the function to the right of the `%>%` pipe symbol - In turn, the resulting output becomes the input (the first argument) of the function to the right of the next `%>%` pipe symbol - Pipes are part of __tidyverse__ suite of packages, not __base R__ \medskip Intuitive mnemonic device for understanding pipes - whenever you see a pipe `%>%` think of the words "__and then...__" \medskip Example: isolate all the first-generation prospects [output omitted] - in words: start with object `wwlist` __and then__ filter first generation students ```{r, results="hide"} wwlist %>% filter(firstgen == "Y") ``` ### Do task with and without pipes Task: Using object `wwlist` print data for "first-gen" prospects (`firstgen == "Y"`) ```{r, results='hide'} # without pipes filter(wwlist, firstgen == "Y") # with pipes wwlist %>% filter(firstgen == "Y") ``` Comparing the two approaches: - "without pipes", object `wwlist` is the first argument `filter()` function - In "pipes" approach, you don't specify object `wwlist` as first argument in `filter()` - Why? Because `%>%` "pipes" the object to the left of the `%>%` operator into the function to the right of the `%>%` operator Main takeaway: - When writing code using pipes, functions to right of `%>%` pipe operator should not explicitly name object that is the input to the function. - Rather, object to the left of `%>%` pipe operator is automatically the input. ### More intuition on the pipe operator, `%>%` The pipe operator "pipes" (verb) an object from left of `%>%` operator into the function to the right of the %>% operator Example, the "structure" function `str()`, with and without pipes - Examine syntax for `str()`: `str(object, ...)` ```{r, eval=FALSE} ?str ``` - Investigate structure of dataframe `wwlist` without and with pipes ```{r, results="hide"} str(wwlist) # without pipe wwlist %>% str() # with pipe ``` Questions: - In the pipes approach, `wwlist %>% str()`, why didn't we need to insert argument values inside `str()` - What would happen if we just ran this line of code? ```{r, eval=FALSE} str() ``` ### Do task with and without pipes __Task__: Using object `wwlist`, print data for "first-gen" prospects for selected variables [output omitted] ```{r, results='hide'} #Without pipes select(filter(wwlist, firstgen == "Y"), state, hs_city, sex) #With pipes wwlist %>% filter(firstgen == "Y") %>% select(state, hs_city, sex) ``` Comparing the two approaches: - In the "without pipes" approach, code is written "inside out" - The first step in the task -- identifying the object -- is the innermost part of code - The last step in task -- selecting variables to print -- is the outermost part of code - In "pipes" approach the left-to-right order of code matches how we think about the task - First, we start with an object __*and then*__ (`%>%`) we use `filter()` to isolate first-gen students __*and then*__ (`%>%`) we select which variables to print \medskip __Important__: `str()` function helpful for understanding what object is piped in from one function to another ```{r, results="hide"} #object that was "piped" into `select()` from `filter()` wwlist %>% filter(firstgen == "Y") %>% str() #object that was created after `select()` function wwlist %>% filter(firstgen == "Y") %>% select(state, hs_city, sex) %>% str() ``` ### Aside: `count()` function \medskip `count()` function from `dplyr` package counts the number of obs by group ```{r, eval=FALSE, echo=FALSE} ?count ``` __Syntax__ [see help file for full syntax] - `count(x,...)` __Arguments__ [see help file for full arguments] - `x`: an object, often a data frame - `...`: variables to group by Examples of using `count()` - Without vars in `...` argument, counts number of obs in object ```{r, results="hide"} count(wwlist) wwlist %>% count() wwlist %>% count() %>% str() ``` - With vars in `...` argument, counts number of obs per variable value - This is the best way to create frequency table, better than `table()` - note: by default, `count()` always shows `NAs` [this is good!] ```{r, results="hide"} count(wwlist,school_category) wwlist %>% count(school_category) wwlist %>% count(school_category) %>% str() ``` ### pipe operators and new lines \medskip Often want to insert line breaks to make long line of code more readable - When inserting line breaks, __pipe operator `%>%` should be the last thing before a line break, not the first thing after a line break__ __This works__ ```{r, results="hide"} wwlist %>% filter(firstgen == "Y") %>% select(state, hs_city, sex) %>% count(sex) ``` __This works too__ ```{r, results="hide"} wwlist %>% filter(firstgen == "Y", state != "WA") %>% select(state, hs_city, sex) %>% count(sex) ``` __This doesn't work__ ```{r, eval=FALSE} wwlist %>% filter(firstgen == "Y") %>% select(state, hs_city, sex) %>% count(sex) ``` ### The power of pipes You might be thinking, "what's the big deal?" __TasK__: - in one line of code, modify `wwlist` and create bar chart that counts number of prospects purchased by race/ethnicity, separately for in-state vs. out-of-state ```{r, eval=FALSE} wwlist %>% filter(is.na(state)==0) %>% # drop obs where variable state missing mutate( # create out-of-state indicator; create recoded ethnicity var out_state = as_factor(if_else(state != "WA", "out-of-state", "in-state")), ethn_race = recode(ethn_code, "american indian or alaska native" = "nativeam", "asian or native hawaiian or other pacific islander" = "api", "black or african american" = "black", "cuban" = "latinx", "mexican/mexican american" = "latinx", "not reported" = "not_reported", "other-2 or more" = "multirace", "other spanish/hispanic" = "latinx", "puerto rican" = "latinx", "white" = "white")) %>% group_by(out_state) %>% # group_by "in-state" vs. "out-of-state" count(ethn_race) %>% # count of number of prospects purchased by race ggplot(aes(x=ethn_race, y=n)) + # plot ylab("number of prospects") + xlab("race/ethnicity") + geom_col() + coord_flip() + facet_wrap(~ out_state) ``` ### The power of pipes __TasK__: - in one line of code, modify `wwlist` and create bar chart of median income (in zip-code) of prospects purchased by race/ethnicity, separately for in-state vs. out-of-state ```{r, eval=FALSE} wwlist %>% filter(is.na(state)==0) %>% # drop obs where variable state missing mutate( # create out-of-state indicator; create recoded ethnicity var out_state = as_factor(if_else(state != "WA", "out-of-state", "in-state")), ethn_race = recode(ethn_code, "american indian or alaska native" = "nativeam", "asian or native hawaiian or other pacific islander" = "api", "black or african american" = "black", "cuban" = "latinx", "mexican/mexican american" = "latinx", "not reported" = "not_reported", "other-2 or more" = "multirace", "other spanish/hispanic" = "latinx", "puerto rican" = "latinx", "white" = "white")) %>% group_by(out_state, ethn_race) %>% # group_by "out-state" and ethnicity summarize(avg_inc_zip = mean(med_inc_zip, na.rm = TRUE)) %>% # calculate avg. inc ggplot(aes(x=out_state, y=avg_inc_zip)) + ylab("avg. income in zip code") + xlab("") + geom_col() + coord_flip() + facet_wrap(~ ethn_race) # plot ``` ### The power of pipes Example R script from Ben Skinner, which creates analysis data for [Skinner (2018)](https://link.springer.com/article/10.1007%2Fs11162-018-9507-1) - [Link to R script](https://github.com/btskinner/colchoice_rep/blob/master/scripts/makedata.r) \medskip Other relevant links - [Link to Github repository for Skinner (2018)](https://github.com/btskinner/colchoice_rep) - [Link to published paper](https://link.springer.com/article/10.1007%2Fs11162-018-9507-1) - [Link to Skinner's Github page](https://github.com/btskinner) - A lot of cool stuff here - [Link to Skinner's personal website](https://www.btskinner.io/) - A lot of cool stuff here ### Do task with and without pipes [STUDENTS WORK ON THEIR OWN] Task: - Count the number "first-generation" prospects from the state of Washington Without pipes ```{r} count(filter(wwlist, firstgen == "Y", state == "WA")) ``` With pipes ```{r} wwlist %>% filter(firstgen == "Y", state == "WA") %>% count() ``` ### Do task with and without pipes [STUDENTS WORK ON THEIR OWN] __Task__: frequency table of `school_type` for non first-gen prospects from WA __without pipes__ ```{r} wwlist_temp <- filter(wwlist, firstgen == "N", state == "WA") table(wwlist_temp$school_type, useNA = "always") rm(wwlist_temp) # cuz we don't need after creating table ``` __With pipes__ ```{r} wwlist %>% filter(firstgen == "N", state == "WA") %>% count(school_type) ``` __Comparison of two approaches__ - without pipes, task requires multiple lines of code (this is quite common) - first line creates object; second line analyzes object - with pipes, task can be completed in one line of code and you aren't left with objects you don't care about ### Student exercises with pipes 1. Using object `wwlist` select the following variables (state, firstgen, ethn_code) and assign `<-` them to object `wwlist_temp`. (ex. wwlist_temp <- wwlist) 2. Using the object you just created `wwlist_temp`, create a frequency table of `ethn_code` for first-gen prospects from California. 3. **Bonus**: Try doing question 1 and 2 together. Use original object `wwlist`, but do not assign to a new object. Once finished you can `rm(wwlist_temp)` ### Solution to exercises with pipes 1. Using object `wwlist` select the following variables (state, firstgen, ethn_code) and assign them to object `wwlist_temp` ```{r} wwlist_temp <- wwlist %>% select(state, firstgen, ethn_code) ``` ### Solution to exercises with pipes 2. Using the object you just created `wwlist_temp`, create a frequency table of `ethn_code` for first-gen prospects from California. ```{r} #names(wwlist) wwlist_temp %>% filter(firstgen == "Y", state == "CA") %>% count(ethn_code) ``` ### Solution to exercises with pipes 3. **Bonus**: Try doing question 1 and 2 together. ```{r} wwlist %>% select(state, firstgen, ethn_code) %>% filter(firstgen == "Y", state == "CA") %>% count(ethn_code) #rm(wwlist_temp) ``` ```{r} rm(wwlist_temp) ``` # Creating variables using mutate (tidyverse approach) ### Our plan for learning how to create new variables Recall that `dplyr` package within `tidyverse` provide a set of functions that can be described as "verbs": __subsetting__, __sorting__, and __transforming__ What we've done | Where we're going --------------- | -------------------- __Subsetting data__ | __Transforming data__ - `select()` variables | - `mutate()` creates new variables - `filter()` observations | - `summarize()` calculates across rows __Sorting data__ | - `group_by()` to calculate across rows within groups - `arrange()` | __Today__ - we'll use `mutate()` to create new variables based on calculations across columns within a row __Next week__ - we'll combine `mutate()` with `summarize()` and `group_by()` to create variables based on calculations across rows ### Create new data frame based on `df_school_all` Data frame `df_school_all` has one obs per US high school and then variables identifying number of visits by particular universities ```{r} load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_school_allvars.RData")) names(df_school_all) ``` ### Create new data frame based on `df_school_all` Create new version of data frame, called `school_v2`, which we'll use to introduce how to create new variables ```{r, results='hide'} school_v2 <- df_school_all %>% select(-contains("inst_")) %>% # remove vars that start with "inst_" rename( # rename selected variables visits_by_berkeley = visits_by_110635, visits_by_boulder = visits_by_126614, visits_by_bama = visits_by_100751, visits_by_stonybrook = visits_by_196097, visits_by_rutgers = visits_by_186380, visits_by_pitt = visits_by_215293, visits_by_cinci = visits_by_201885, visits_by_nebraska = visits_by_181464, visits_by_georgia = visits_by_139959, visits_by_scarolina = visits_by_218663, visits_by_ncstate = visits_by_199193, visits_by_irvine = visits_by_110653, visits_by_kansas = visits_by_155317, visits_by_arkansas = visits_by_106397, visits_by_sillinois = visits_by_149222, visits_by_umass = visits_by_166629, num_took_read = num_took_rla, num_prof_read = num_prof_rla, med_inc = avgmedian_inc_2564 ) glimpse(school_v2) ``` ## Introduce mutate() function ### Introduce `mutate()` function `mutate()` is __tidyverse__ approach to creating variables (not __Base R__ approach) Description of `mutate()` - creates new columns (variables) that are functions of existing columns - After creating a new variable using `mutate()`, every row of data is retained - `mutate()` works best with pipes `%>%` __Task__: - Using data frame `school_v2` create new variable that measures the pct of students on free/reduced lunch (output omitted) ```{r, results='hide'} # create new dataset with fewer vars; not necessary to do this school_sml <- school_v2 %>% select(ncessch, school_type, num_fr_lunch, total_students) # create new var school_sml %>% mutate(pct_fr_lunch = num_fr_lunch/total_students) # remove data frame object rm(school_sml) ``` ### Investigate `mutate()` syntax ```{r, eval=FALSE, echo=FALSE} ?mutate ``` __Usage (i.e., syntax)__ - `mutate(.data,...)` __Arguments__ - `.data`: a data frame - if using `mutate()` after pipe operator `%>%`, then this argument can be omitted - Why? Because data frame object to left of `%>%` "piped in" to first argument of `mutate()` - `...`: expressions used to create new variables - "Name-value pairs of expressions" - "The name of each argument will be the name of a new variable, and the value will be its corresponding value." - "Use a `NULL` value in mutate to drop a variable." - "New variables overwrite existing variables of the same name" __Value__ - returns a (data frame) object that contains the original input data frame and new variables that were created by `mutate()` ### Investigate `mutate()` syntax ```{r, eval=FALSE, echo=FALSE} ?mutate ``` __Can create variables using standard mathematical or logical operators__ [output omitted] ```{r, results="hide"} #glimpse(school_v2) school_v2 %>% select(state_code,school_type,ncessch,med_inc,num_fr_lunch,total_students,num_took_math) %>% mutate( # each argument creates a new variable, name of argument is name of variable one = 1, med_inc000 = med_inc/1000, pct_fr_lunch = num_fr_lunch/total_students*100, took_math_na = is.na(num_took_math)==1 ) %>% select(state_code,school_type,ncessch,one,med_inc,med_inc000,num_fr_lunch,total_students,pct_fr_lunch,num_took_math,took_math_na) ``` __Can create variables using "helper functions" called within `mutate()`__ [output omitted] - These are standalone functions can be called *within* `mutate()` - e.g., `if_else()`, `recode()`, `case_when()` - will walk through helper functions in more detail in subsequent sections of lecture ```{r, results="hide"} school_v2 %>% select(state_code,ncessch,name,school_type) %>% mutate(public = if_else(school_type == "public", 1, 0)) ``` ### Introduce `mutate()` function New variable not retained unless we __assign__ `<-` it to an object (existing or new) \medskip - __`mutate()` without assignment__ ```{r, results='hide'} school_v2 %>% mutate(pct_fr_lunch = num_fr_lunch/total_students) names(school_v2) ``` \medskip - __`mutate()` with assignment__ ```{r, results="hide"} school_v2_temp <- school_v2 %>% mutate(pct_fr_lunch = num_fr_lunch/total_students) names(school_v2_temp) rm(school_v2_temp) ``` ### `mutate()` can create multiple variables at once `mutate()` can create multiple variables at once ```{r, results='hide'} school_v2 %>% mutate(pct_fr_lunch = num_fr_lunch/total_students, pct_prof_math= num_prof_math/num_took_math) %>% select(num_fr_lunch, total_students, pct_fr_lunch, num_prof_math, num_took_math, pct_prof_math) ``` Or we could write code this way: ```{r, results="hide"} school_v2 %>% select(num_fr_lunch, total_students, num_prof_math, num_took_math) %>% mutate(pct_fr_lunch = num_fr_lunch/total_students, pct_prof_math= num_prof_math/num_took_math) ``` `mutate()` can use variables previously created within `mutate()` ```{r, results="hide"} school_v2 %>% select(num_prof_math, num_took_math, num_took_read,num_prof_read) %>% mutate(pct_prof_math = num_prof_math/num_took_math, pct_prof_read = num_prof_read/num_took_read, avg_pct_prof_math_read = (pct_prof_math + pct_prof_read)/2) ``` ### `mutate()`, removing variables created by `mutate()` Within `mutate()` use syntax `var_name = NULL` to remove variable from data frame - note: Variable not permanently removed from data frame unless you use assignment `<-` to create new data frame or overwrite existing data frame ```{r, results='hide'} ncol(school_v2) school_v2 %>% select(num_prof_math, num_took_math, num_took_read,num_prof_read) %>% glimpse() school_v2 %>% select(num_prof_math, num_took_math, num_took_read,num_prof_read) %>% mutate(num_prof_math = NULL, num_took_math = NULL) %>% glimpse() #But variables not permanently removed because we didn't use assignment ncol(school_v2) ``` Why would we remove variables within `mutate()` rather `select()`? - remove temporary "work" variables used to create desired variable - Example: measure of average of pct who passed math and pct who passed reading ```{r, results='hide'} school_v2 %>% select(num_prof_math, num_took_math, num_took_read,num_prof_read) %>% mutate(pct_prof_math = num_prof_math/num_took_math, # create work var pct_prof_read = num_prof_read/num_took_read, # create work var avg_pct_prof_math_read = (pct_prof_math + pct_prof_read)/2, #create analysis var pct_prof_math = NULL, # remove work var pct_prof_read = NULL) %>% # remove work var glimpse() ``` ### Student exercise using mutate() 1. Using the object `school_v2`, select the following variables (`num_prof_math`, `num_took_math`, `num_prof_read`, `num_took_read`) and create a measure of percent proficient in math `pct_prof_math` and percent proficient in reading `pct_prof_read`. 2. Now using the code for question 1, filter schools where at least 50% of students are proficient in math **&** reading. 3. Count the number of schools from question 2. 4. Using `school_v2`, using `mutate()` combined with `is.na()` create a dichotomous indicator variable `med_inc_na` that identifies whether `med_inc` is missing (`NA`) or not. And then use syntax `count(var_name)` to create frequency table of variable `med_inc_na`. How many observations are missing? ### Solutions for exercise using mutate() 1. Using the object `school_v2`, select the following variables (`num_prof_math`, `num_took_math`, `num_prof_read`, `num_took_read`) and create a measure of percent proficient in math `pct_prof_math` and percent proficient in reading `pct_prof_read`. ```{r} school_v2 %>% select(num_prof_math, num_took_math, num_prof_read, num_took_read) %>% mutate(pct_prof_math = num_prof_math/num_took_math, pct_prof_read = num_prof_read/num_took_read) ``` ### Solutions for exercise using mutate() 2. Now using the code for question 1, filter schools where at least 50% of students are proficient in math **&** reading. ```{r} school_v2 %>% select(num_prof_math, num_took_math, num_prof_read, num_took_read) %>% mutate(pct_prof_math = num_prof_math/num_took_math, pct_prof_read = num_prof_read/num_took_read) %>% filter(pct_prof_math >= 0.5 & pct_prof_read >= 0.5) ``` ### Solutions for exercise using mutate() 3. Count the number of schools from question 2. ```{r} school_v2 %>% select(num_prof_math, num_took_math, num_prof_read, num_took_read) %>% mutate(pct_prof_math = num_prof_math/num_took_math, pct_prof_read = num_prof_read/num_took_read) %>% filter(pct_prof_math >= 0.5 & pct_prof_read >= 0.5) %>% count() ``` ### Solutions for exercise using mutate() 4. Using `school_v2`, using `mutate()` combined with `is.na()` create a dichotomous indicator variable `med_inc_na` that identifies whether `med_inc` is missing (`NA`) or not. And then use syntax `count(var_name)` to create frequency table of variable `med_inc_na`. How many observations are missing? ```{r} school_v2 %>% mutate(med_inc_na = is.na(med_inc)) %>% count(med_inc_na) ``` ## Using if_else() function within mutate() ### Using `if_else()` function within `mutate()` ```{r, eval=FALSE, echo=FALSE} ?if_else ``` __Description__ - if `logical condition` `TRUE`, assign a value; if `logical condition` `FALSE` assign a value __Usage (i.e., syntax)__ - `if_else(logical condition, true, false, missing = NULL)` __Arguments__ - `logical condition`: a condition that evaluates to `TRUE` or `FALSE` - `true`: value to assign if condition `TRUE` - `false`: value to assign if condition `FALSE` - `missing`: value to assign to rows that have value `NA` for condition - default is `missing = NULL`; means that if condition is `NA`, then new_var == `NA` - But can assign different values to `NA`s, e.g., `missing = -9` __Value__ - "Where condition is TRUE, the matching value from true, where it's FALSE, the matching value from false, otherwise NA." - Unless otherwise specified, `NA`s in "input" var(s) assigned `NA` in "output var" __Example__: Create 0/1 indicator of whether got at least one visit from Berkeley ```{r, results="hide"} school_v2 %>% mutate(got_visit_berkeley = if_else(visits_by_berkeley>0,1,0)) %>% count(got_visit_berkeley) ``` ### `if_else()` within `mutate()` to create 0/1 indicator variables We often create dichotomous (0/1) indicator variables of whether something happened (or whether something is TRUE) - Variables that are of substantive interest to project - e.g., did student graduate from college - Variables that help you investigate data, check quality - e.g., indicator of whether an observation is missing/non-missing for a particular variable ### Using `if_else()` within `mutate()` __Task__ - Create 0/1 indicator if school has median income greater than $100,000 Usually a good idea to investigate "input" variables __before__ creating analysis vars ```{r, results="hide"} str(school_v2$med_inc) # investigate variable type school_v2 %>% count(med_inc) # frequency count, but this isn't very helpful school_v2 %>% filter(is.na(med_inc)) %>% count(med_inc) school_v2 %>% filter(is.na(med_inc)) %>% count() # shows number of obs w/ missing med_inc ``` Create variable ```{r} school_v2 %>% select(med_inc) %>% mutate(inc_gt_100k= if_else(med_inc>100000,1,0)) %>% count(inc_gt_100k) # note how NA values of med_inc treated ``` ### Using `if_else()` within `mutate()` __Task__: - Create 0/1 indicator if school has median income greater than $100,000. This time, let's experiment with the `missing` argument of `if_else()` ```{r, eval=FALSE} #what we wrote before school_v2 %>% select(med_inc) %>% mutate(inc_gt_100k= if_else(med_inc>100000,1,0)) %>% count(inc_gt_100k) #manually write out the default value for `missing` school_v2 %>% select(med_inc) %>% mutate(inc_gt_100k= if_else(med_inc>100000,1,0, missing = NULL)) %>% count(inc_gt_100k) # note how NA values of med_inc treated school_v2 %>% select(med_inc) %>% mutate(inc_gt_100k= if_else(med_inc>100000,1,0, missing = NA_real_)) %>% count(inc_gt_100k) # note how NA values of med_inc treated # NA can be coerced to any other vector type except raw: # NA_integer_, NA_real_, NA_complex_ and NA_character_ # Here we give missing values in condition the value of -9 in new variable school_v2 %>% select(med_inc) %>% mutate(inc_gt_100k= if_else(med_inc>100000,1,0, missing = -9)) %>% count(inc_gt_100k) ``` ### Using `if_else()` function within `mutate()` __Task__ - Create 0/1 indicator variable `nonmiss_math` which indicates whether school has non-missing values for the variable `num_took_math` - note: `num_took_math` refers to number of students at school that took state math proficiency test Usually a good to investigate "input" variables before creating analysis vars ```{r, results="hide"} school_v2 %>% count(num_took_math) # this isn't very helpful school_v2 %>% filter(is.na(num_took_math)) %>% count(num_took_math) # shows number of obs w/ missing med_inc ``` Create variable ```{r} school_v2 %>% select(num_took_math) %>% mutate(nonmiss_math= if_else(!is.na(num_took_math),1,0)) %>% count(nonmiss_math) # note how NA values treated ``` ### Student exercises `if_else()` 1. Using the object `school_v2`, create 0/1 indicator variable `in_state_berkeley` that equals `1` if the high school is in the same state as UC Berkeley (i.e., `state_code=="CA"`). 2. Create 0/1 indicator `berkeley_and_irvine` of whether a school got at least one visit from UC Berkeley __AND__ from UC Irvine. 3. Create 0/1 indicator `berkeley_or_irvine` of whether a school got at least one visit from UC Berkeley __OR__ from UC Irvine. ### Exercise`if_else()` solutions 1. Using the object `school_v2`, create 0/1 indicator variable `in_state_berkeley` that equals `1` if the high school is in the same state as UC Berkeley (i.e., `state_code=="CA"`). ```{r, results="hide"} str(school_v2$state_code) # investigate input variable school_v2 %>% filter(is.na(state_code)) %>% count() # investigate input var #Create var school_v2 %>% mutate(in_state_berkeley=if_else(state_code=="CA",1,0)) %>% count(in_state_berkeley) ``` ### Exercise`if_else()` solutions 2. Create 0/1 indicator `berkeley_and_irvine` of whether a school got at least one visit from UC Berkeley __AND__ from UC Irvine. ```{r, results="hide"} #investigate input vars school_v2 %>% select(visits_by_berkeley, visits_by_irvine) %>% str() school_v2 %>% filter(is.na(visits_by_berkeley)) %>% count() school_v2 %>% filter(is.na(visits_by_irvine)) %>% count() #create variable school_v2 %>% mutate(berkeley_and_irvine=if_else(visits_by_berkeley>0 & visits_by_irvine>0,1,0)) %>% count(berkeley_and_irvine) ``` ### Exercise`if_else()` solutions 3. Create 0/1 indicator `berkeley_or_irvine` of whether a school got at least one visit from UC Berkeley __OR__ from UC Irvine. ```{r, results="hide"} school_v2 %>% mutate(berkeley_or_irvine=if_else(visits_by_berkeley>0 | visits_by_irvine>0,1,0)) %>% count(berkeley_or_irvine) ``` ## Using recode() function within mutate() ### Using `recode()` function within `mutate()` ```{r, eval=FALSE, echo=FALSE} ?recode ``` __Description__: Recode values of a variable __Usage (i.e., syntax)__ - recode(.x, ..., .default = NULL, .missing = NULL) __Arguments__ [see help file for further details] - `.x` A vector (e.g., variable) to modify - `...` Specifications for recode, of form `current_value = new_recoded_value` - `.default`: If supplied, all values not otherwise matched given this value. - `.missing`: If supplied, any missing values in .x replaced by this value. __Example__: Using data frame `wwlist`, create new 0/1 indicator `public_school` from variable `school_type` ```{r, results="hide"} str(wwlist$school_type) wwlist %>% count(school_type) wwlist_temp <- wwlist %>% select(school_type) %>% mutate(public_school = recode(school_type,"public" = 1, "private" = 0)) wwlist_temp %>% head(n=10) str(wwlist_temp$public_school) # note: numeric variable wwlist_temp %>% count(public_school) # note the NAs rm(wwlist_temp) ``` ### Using `recode()` function within `mutate()` Recoding `school_type` could have been accomplished using `if_else()` - Use `recode()` when new variable has more than two categories __Task__: Create `school_catv2` based on `school_category` with these categories: - "regular"; "alternative"; "special"; "vocational" Investigate input var ```{r, results="hide"} str(wwlist$school_category) # character variable wwlist %>% count(school_category) ``` Recode ```{r, results="hide"} wwlist_temp <- wwlist %>% select(school_category) %>% mutate(school_catv2 = recode(school_category, "Alternative Education School" = "alternative", "Alternative/other" = "alternative", "Regular elementary or secondary" = "regular", "Regular School" = "regular", "Special Education School" = "special", "Special program emphasis" = "special", "Vocational Education School" = "vocational") ) str(wwlist_temp$school_catv2) # character variable created wwlist_temp %>% count(school_catv2) rm(wwlist_temp) ``` ### Using `recode()` within `mutate()` __Task__: Create `school_catv2` based on `school_category` with these categories: - "regular"; "alternative"; "special"; "vocational" - This time use the `.missing` argument to recode `NAs` to "unknown" ```{r, results="hide"} wwlist_temp <- wwlist %>% select(school_category) %>% mutate(school_catv2 = recode(school_category, "Alternative Education School" = "alternative", "Alternative/other" = "alternative", "Regular elementary or secondary" = "regular", "Regular School" = "regular", "Special Education School" = "special", "Special program emphasis" = "special", "Vocational Education School" = "vocational", .missing = "unknown") ) str(wwlist_temp$school_catv2) wwlist_temp %>% count(school_catv2) wwlist %>% count(school_category) rm(wwlist_temp) ``` ### Using `recode()` within `mutate()` __Task__: Create `school_catv2` based on `school_category` with these categories: - "regular"; "alternative"; "special"; "vocational" - This time use the `.default` argument to assign the value "regular" ```{r, results="hide"} wwlist_temp <- wwlist %>% select(school_category) %>% mutate(school_catv2 = recode(school_category, "Alternative Education School" = "alternative", "Alternative/other" = "alternative", "Special Education School" = "special", "Special program emphasis" = "special", "Vocational Education School" = "vocational", .default = "regular") ) str(wwlist_temp$school_catv2) wwlist_temp %>% count(school_catv2) wwlist %>% count(school_category) rm(wwlist_temp) ``` ### Using `recode()` within `mutate()` __Task__: Create `school_catv2` based on `school_category` with these categories: - This time create a numeric variable rather than character: - `1` for "regular"; `2` for "alternative"; `3` for "special"; `4` for "vocational" ```{r, results="hide"} wwlist_temp <- wwlist %>% select(school_category) %>% mutate(school_catv2 = recode(school_category, "Alternative Education School" = 2, "Alternative/other" = 2, "Regular elementary or secondary" = 1, "Regular School" = 1, "Special Education School" = 3, "Special program emphasis" = 3, "Vocational Education School" = 4) ) str(wwlist_temp$school_catv2) # note: numeric variable now wwlist_temp %>% count(school_catv2) wwlist %>% count(school_category) rm(wwlist_temp) ``` ### Student exercise using `recode()` within `mutate()` ```{r, results="hide"} load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_event_somevars.RData")) names(df_event) ``` 1. Using object `df_event`, assign new object `df_event_temp` and a numeric variable create `event_typev2` based on `event_type` with these categories: - `1` for "2yr college"; `2` for "4yr college"; `3` for "other"; `4` for "private hs"; `5` for "public hs" 2. This time use the `.default` argument to assign the value `5` for "public hs" ### Exercise using `recode()` within `mutate()` solutions Check input variable ```{r, results="hide"} names(df_event) str(df_event$event_type) df_event %>% count(event_type) ``` ### Exercise using `recode()` within `mutate()` solutions 1. Using object `df_event`, assign new object `df_event_temp` and create a numeric variable `event_typev2` based on `event_type` with these categories: - `1` for "2yr college"; `2` for "4yr college"; `3` for "other"; `4` for "private hs"; `5` for "public hs" ```{r results="hide"} df_event_temp <- df_event %>% select(event_type) %>% mutate(event_typev2 = recode(event_type, "2yr college" = 1, "4yr college" = 2, "other" = 3, "private hs" = 4, "public hs" = 5) ) str(df_event_temp$event_typev2) df_event_temp %>% count(event_typev2) df_event %>% count(event_type) ``` ### Exercise using `recode()` within `mutate()` solutions 2. This time assign the value use the `.default` argument to assign the value `5` for "public hs" ```{r, results="hide"} df_event_temp <- df_event %>% select(event_type) %>% mutate(event_typev2 = recode(event_type, "2yr college" = 1, "4yr college" = 2, "other" = 3, "private hs" = 4, .default = 5) ) str(df_event_temp$event_typev2) df_event_temp %>% count(event_typev2) df_event %>% count(event_type) ``` ## Using case_when() function within mutate() ### Using `case_when()` function within `mutate()` ```{r, eval=FALSE, echo=FALSE} ?case_when ``` `case_when()` useful for creating variable that is a function of multiple "input" variables __Usage (i.e., syntax)__: `case_when(...)` __Arguments__ [from help file; see help file for more details] - `...`: A sequence of two-sided formulas. - The left hand side (LHS) determines which values match this case. - LHS must evaluate to a logical vector. - The right hand side (RHS) provides the replacement value. __Example task__: Using data frame `wwlist` and input vars `state` and `firstgen`, create a 4-category var with following categories: - "instate_firstgen"; "instate_nonfirstgen"; "outstate_firstgen"; "outstate_nonfirstgen" ```{r, results="hide"} wwlist_temp <- wwlist %>% select(state,firstgen) %>% mutate(state_gen = case_when( state == "WA" & firstgen =="Y" ~ "instate_firstgen", state == "WA" & firstgen =="N" ~ "instate_nonfirstgen", state != "WA" & firstgen =="Y" ~ "outstate_firstgen", state != "WA" & firstgen =="N" ~ "outstate_nonfirstgen") ) str(wwlist_temp$state_gen) wwlist_temp %>% count(state_gen) ``` ### Using `case_when()` function within `mutate()` __Task__: Using data frame `wwlist` and input vars `state` and `firstgen`, create a 4-category var Let's take a closer look at how values of inputs are coded into values of outputs ```{r, results="hide"} wwlist %>% select(state,firstgen) %>% str() count(wwlist,state) count(wwlist,firstgen) ``` Create variable ```{r, results="hide"} wwlist_temp <- wwlist %>% select(state,firstgen) %>% mutate(state_gen = case_when( state == "WA" & firstgen =="Y" ~ "instate_firstgen", state == "WA" & firstgen =="N" ~ "instate_nonfirstgen", state != "WA" & firstgen =="Y" ~ "outstate_firstgen", state != "WA" & firstgen =="N" ~ "outstate_nonfirstgen") ) ``` Compare values of input vars to value of output var ```{r, results="hide"} wwlist_temp %>% count(state_gen) wwlist_temp %>% filter(is.na(state)) %>% count(state_gen) wwlist_temp %>% filter(is.na(firstgen)) %>% count(state_gen) wwlist_temp %>% filter(is.na(firstgen) | is.na(state)) %>% count(state_gen) ``` __Take-away__: by default var created by `case_when()` equals `NA` for obs where one of the inputs equals `NA` ### Student exercise using `case_when()` within `mutate()` 1. Using the object `school_v2` and input vars `school_type`, and `state_code` , create a 4-category var `state_type` with following categories: - "instate_public"; "instate_private"; "outstate_public"; "outstate_private" - Note: We are referring to CA as in-state for this example ### Exercise using `case_when()` within `mutate()` solution Investigate ```{r, results="hide"} school_v2 %>% select(state_code,school_type) %>% str() count(school_v2,state_code) school_v2 %>% filter(is.na(state_code)) %>% count() count(school_v2,school_type) school_v2 %>% filter(is.na(school_type)) %>% count() ``` ### Exercise using `case_when()` within `mutate()` solution 1. Using the object `school_v2` and input vars `school_type`, and `state_code` , create a 4-category var `state_type` with following categories: - "instate_public"; "instate_private"; "outstate_public"; "outstate_private" ```{r} school_v2_temp <- school_v2 %>% select(state_code,school_type) %>% mutate(state_type = case_when( state_code == "CA" & school_type == "public" ~ "instate_public", state_code == "CA" & school_type == "private" ~ "instate_private", state_code != "CA" & school_type == "public" ~ "outstate_public", state_code != "CA" & school_type == "private" ~ "outstate_private") ) school_v2_temp %>% count(state_type) #school_v2_temp %>% filter(is.na(state_code)) %>% count(state_type) #no missing #school_v2_temp %>% filter(is.na(school_type)) %>% count(state_type) #no missing ``` # Base R appraoch to creating new variables ### Base R approach to creating new variables Create new variables using assignment operator `<-` and subsetting operators `[]` and `$` to create new variables and set conditions of the input variables \medskip Pseudo syntax: `df$newvar <- ...` - where `...` argument is expression(s)/calculation(s) used to create new variables - expressions can include subsetting operators and/or other base R functions \medskip __Task__: Create measure of percent of students on free-reduced lunch __base R approach__ ```{r} school_v2_temp<- school_v2 #create copy of dataset; not necessary school_v2_temp$pct_fr_lunch <- school_v2_temp$num_fr_lunch/school_v2_temp$total_students #investigate variable you created str(school_v2_temp$pct_fr_lunch) school_v2_temp$pct_fr_lunch[1:5] # print first 5 obs ``` __tidyverse approach (with pipes)__ ```{r} school_v2_temp <- school_v2 %>% mutate(pct_fr_lunch = num_fr_lunch/total_students) ``` ### Base R approach to creating new variables If creating new variable based on the condition/values of input variables, basically the tidyverse equivalent of `mutate()` __with__ `if_else()` or `recode()` \medskip - Pseudo syntax: `df$newvar[logical condition]<- new value` - `logical condition`: a condition that evaluates to `TRUE` or `FALSE` ### Base R approach to creating new variables __Task__: Create 0/1 indicator if school has median income greater than $100k __tidyverse approach (using pipes)__ ```{r} school_v2_temp %>% select(med_inc) %>% mutate(inc_gt_100k= if_else(med_inc>100000,1,0)) %>% count(inc_gt_100k) # note how NA values of med_inc treated ``` __Base R approach__ ```{r} school_v2_temp$inc_gt_100k<-NA #initialize an empty column with NAs # otherwise you'll get warning school_v2_temp$inc_gt_100k[school_v2_temp$med_inc>100000] <- 1 school_v2_temp$inc_gt_100k[school_v2_temp$med_inc<=100000] <- 0 count(school_v2_temp, inc_gt_100k) ``` ### Base R approach to creating new variables __Task__: Using data frame `wwlist` and input vars `state` and `firstgen`, create a 4-category var with following categories: - "instate_firstgen"; "instate_nonfirstgen"; "outstate_firstgen"; "outstate_nonfirstgen" __tidyverse approach (using pipes)__ ```{r} wwlist_temp <- wwlist %>% mutate(state_gen = case_when( state == "WA" & firstgen =="Y" ~ "instate_firstgen", state == "WA" & firstgen =="N" ~ "instate_nonfirstgen", state != "WA" & firstgen =="Y" ~ "outstate_firstgen", state != "WA" & firstgen =="N" ~ "outstate_nonfirstgen") ) str(wwlist_temp$state_gen) wwlist_temp %>% count(state_gen) ``` ### Base R approach to creating new variables __Task__: Using `wwlist` and input vars `state` and `firstgen`, create a 4-category var __base R approach__ ```{r} wwlist_temp <- wwlist wwlist_temp$state_gen <- NA wwlist_temp$state_gen[wwlist_temp$state == "WA" & wwlist_temp$firstgen =="Y"] <- "instate_firstgen" wwlist_temp$state_gen[wwlist_temp$state == "WA" & wwlist_temp$firstgen =="N"] <- "instate_nonfirstgen" wwlist_temp$state_gen[wwlist_temp$state != "WA" & wwlist_temp$firstgen =="Y"] <- "outstate_firstgen" wwlist_temp$state_gen[wwlist_temp$state != "WA" & wwlist_temp$firstgen =="N"] <- "outstate_nonfirstgen" str(wwlist_temp$state_gen) count(wwlist_temp, state_gen) ```