--- title: "Lecture 4: Pipes and variable creation" subtitle: "Managing and Manipulating Data Using R" author: date: fontsize: 8pt classoption: dvipsnames # for colors urlcolor: blue output: beamer_presentation: keep_tex: true toc: false slide_level: 3 theme: default # AnnArbor # push to header? #colortheme: "dolphin" # push to header? #fonttheme: "structurebold" highlight: default # Supported styles include "default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn", and "haddock" (specify null to prevent syntax highlighting); push to header df_print: tibble #default # tibble # push to header? latex_engine: xelatex # Available engines are pdflatex [default], xelatex, and lualatex; The main reasons you may want to use xelatex or lualatex are: (1) They support Unicode better; (2) It is easier to make use of system fonts. includes: in_header: ../beamer_header.tex #after_body: table-of-contents.txt --- ```{r, echo=FALSE, include=FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE) #knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE) #comment = "#>" makes it so results from a code chunk start with "#>"; default is "##" ``` # Introduction ### What we will do today \tableofcontents ```{r, eval=FALSE, echo=FALSE} #Use this if you want TOC to show level 2 headings \tableofcontents #Use this if you don't want TOC to show level 2 headings \tableofcontents[subsectionstyle=hide/hide/hide] ``` ### Libraries we will use today "Load" the package we will use today (output omitted) - __you must run this code chunk__ ```{r, message=FALSE} library(tidyverse) ``` If package not yet installed, then must install before you load. Install in "console" rather than .Rmd file - Generic syntax: `install.packages("package_name")` - Install "tidyverse": `install.packages("tidyverse")` Note: when we load package, name of package is not in quotes; but when we install package, name of package is in quotes: - `install.packages("tidyverse")` - `library(tidyverse)` ## Data for lecture ### Lecture 3 data: prospects purchased by Western Washington U. The "Student list" business - Universities identify/target "prospects" by buying "student lists" from College Board/ACT (e.g., $.40 per prospect) - Prospect lists contain contact info (e.g., address, email), academic achievement, socioeconomic, demographic characteristics - Universities choose which prospects to purchase by filtering on criteria like zip-code, GPA, test score range, etc. ```{r} #load prospect list data load(url("https://github.com/ozanj/rclass/raw/master/data/prospect_list/wwlist_merged.RData")) ``` Object `wwlist` - De-identified list of prospective students purchased by Western Washington University from College Board - We collected these data using FOIA request - ASIDE: Become an expert on collecting data via FOIA requests and you will become a superstar! ### Lecture 3 data: prospects purchased by Western Washington U. Observations on `wwlist` - each observation represents a prospective student ```{r} typeof(wwlist) dim(wwlist) ``` Variables on `wwlist` - some vars provide de-identified data on individual prospects - e.g., `psat_range`, `state`, `sex`, `ethn_code` - some vars provide data about zip-code student lives in - e.g., `med_inc`, `pop_total`, `pop_black` - some vars provide data about school student enrolled in - e.g., `fr_lunch` is number of students on free/reduced lunch - note: bad merge between prospect-level data and school-level data ```{r, results="hide"} names(wwlist) str(wwlist) ``` # Pipes ### What are "pipes", %>% __Pipes__ are a means of perfoming multiple steps in a single line of code - Pipes are part of __tidyverse__ suite of packages, not __base R__ - When writing code, the pipe symbol is `%>%` - Basic flow of using pipes in code: - `object %>% some_function %>% some_function, \ldots` - Pipes work from left to right: - The object/result from left of `%>%` pipe symbol is the input of function to the right of the `%>%` pipe symbol - In turn, the resulting output becomes the input of the function to the right of the next `%>%` pipe symbol Intuitive mnemonic device for understanding pipes - whenever you see a pipe `%>%` think of the words "__and then...__" - Example: `wwlist %>% filter(firstgen == "Y")` - in words: start with object `wwlist` __and then__ filter first generation students ### Do task with and without pipes Task: - Using object `wwlist` print data for "first-generation" prospects (`firstgen == "Y"`) ```{r, results='hide'} filter(wwlist, firstgen == "Y") # without pipes wwlist %>% filter(firstgen == "Y") # with pipes ``` Comparing the two approaches: - In the "without pipes" approach, the object is the first argument `filter()` function - In the "pipes" approach, you don't specify the object as the first argument of `filter()` - Why? Because `%>%` "pipes" the object to the left of the `%>%` operator into the function to the right of the `%>%` operator Main takeaway: - When writing code using pipes, functions to right of `%>%` pipe operator should not explicitly name object that is the input to the function. - Rather, object to the left of `%>%` pipe operator is automatically the input. ### More intuition on the pipe operator, `%>%` The pipe operator "pipes" (verb) an object from left of `%>%` operator into the function to the right of the %>% operator Example: ```{r, results="hide"} str(wwlist) # without pipe wwlist %>% str() # with pipe ``` ### Do task with and without pipes __Task__: Using object `wwlist`, print data for "first-gen" prospects for selected variables [output omitted] ```{r, results='hide'} #Without pipes select(filter(wwlist, firstgen == "Y"), state, hs_city, sex) #With pipes wwlist %>% filter(firstgen == "Y") %>% select(state, hs_city, sex) ``` Comparing the two approaches: - In the "without pipes" approach, code is written "inside out" - The first step in the task -- identifying the object -- is the innermost part of code - The last step in task -- selecting variables to print -- is the outermost part of code - In "pipes" approach the left-to-right order of code matches how we think about the task - First, we start with an object __*and then*__ (`%>%`) we use `filter()` to isolate first-gen students __*and then*__ (`%>%`) we select which variables to print Think about what object was "piped" into `select()` from `filter()` ```{r, results="hide"} wwlist %>% filter(firstgen == "Y") %>% str() ``` ### Aside: the `count()` function [students work on their own] \medskip `count()` function from `dplyr` package counts the number of obs by group ```{r, eval=FALSE, echo=FALSE} ?count ``` __Syntax__ [see help file for full syntax] - `count(x,...)` __Arguments__ [see help file for full arguments] - `x`: an object, often a data frame - `...`: variables to group by Examples of using `count()` - Without vars in `...` argument, counts number of obs in object ```{r, results="hide"} count(wwlist) wwlist %>% count() ``` - With vars in `...` argument, counts number of obs per variable value - note: by default, `count()` always shows `NAs` [this is good!] ```{r, results="hide"} count(wwlist,school_category) wwlist %>% count(school_category) ``` ### Aside: pipe operators and new lines \medskip Often want to insert line breaks to make long line of code more readable - When inserting line breaks, __pipe operator `%>%` should be the last thing before a line break, not the first thing after a line break__ __This works__ ```{r, results="hide"} wwlist %>% filter(firstgen == "Y") %>% select(state, hs_city, sex) %>% count(sex) ``` __This works too__ ```{r, results="hide"} wwlist %>% filter(firstgen == "Y", state != "WA") %>% select(state, hs_city, sex) %>% count(sex) ``` __This doesn't work__ ```{r, eval=FALSE} wwlist %>% filter(firstgen == "Y") %>% select(state, hs_city, sex) %>% count(sex) ``` ### Do task with and without pipes Task: - Count the number "first-generation" prospects from the state of Washington Without pipes ```{r} count(filter(wwlist, firstgen == "Y", state == "WA")) ``` With pipes ```{r} wwlist %>% filter(firstgen == "Y", state == "WA") %>% count() ``` ### Do task with and without pipes __Task__: frequency table of `school_type` for non first-gen prospects from WA __without pipes__ ```{r} wwlist_temp <- filter(wwlist, firstgen == "N", state == "WA") table(wwlist_temp$school_type, useNA = "always") rm(wwlist_temp) # cuz we don't need after creating table ``` __With pipes__ ```{r} wwlist %>% filter(firstgen == "N", state == "WA") %>% count(school_type) ``` __Comparison of two approaches__ - without pipes, task requires multiple lines of code (this is quite common) - first line creates object; second line analyzes object - with pipes, task can be completed in one line of code and you aren't left with objects you don't care about ### Student exercises with pipes 1. Using object `wwlist` select the following variables (state, firstgen, ethn_code) and assign `<-` them to object `wwlist_temp`. (ex. wwlist_temp <- wwlist) 2. Using the object you just created `wwlist_temp`, create a frequency table of `ethn_code` for first-gen prospects from California. 3. **Bonus**: Try doing question 1 and 2 together. Use original object `wwlist`, but do not assign to a new object. Once finished you can `rm(wwlist_temp)` ### Solution to exercises with pipes 1. Using object `wwlist` select the following variables (state, firstgen, ethn_code) and assign them to object `wwlist_temp` ```{r} wwlist_temp <- wwlist %>% select(state, firstgen, ethn_code) ``` ### Solution to exercises with pipes 2. Using the object you just created `wwlist_temp`, create a frequency table of `ethn_code` for first-gen prospects from California. ```{r} #names(wwlist) wwlist_temp %>% filter(firstgen == "Y", state == "CA") %>% count(ethn_code) ``` ### Solution to exercises with pipes 3. **Bonus**: Try doing question 1 and 2 together. ```{r} wwlist %>% select(state, firstgen, ethn_code) %>% filter(firstgen == "Y", state == "CA") %>% count(ethn_code) #rm(wwlist_temp) ``` ```{r} rm(wwlist_temp) ``` # Creating variables using mutate (tidyverse approach) ### Our plan for learning how to create new variables Recall that `dplyr` package within `tidyverse` provide a set of functions that can be described as "verbs": __subsetting__, __sorting__, and __transforming__ What we've done | Where we're going --------------- | -------------------- __Subsetting data__ | __Transforming data__ - `select()` variables | - `mutate()` creates new variables - `filter()` observations | - `summarize()` calculates across rows __Sorting data__ | - `group_by()` to calculate across rows within groups - `arrange()` | __Today__ - we'll use `mutate()` to create new variables based on calculations across columns within a row __Next week__ - we'll combine `mutate()` with `summarize()` and `group_by()` to create variables based on calculations across rows ### Create new data frame based on `df_school_all` Data frame `df_school_all` has one obs per US high school and then variables identifying number of visits by particular universities ```{r} load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_school_allvars.RData")) names(df_school_all) ``` ### Create new data frame based on `df_school_all` Let's create new version of this data frame, called `school_v2`, which we'll use to introduce how to create new variables ```{r, results='hide'} school_v2 <- df_school_all %>% select(-contains("inst_")) %>% # remove vars that start with "inst_" rename( visits_by_berkeley = visits_by_110635, visits_by_boulder = visits_by_126614, visits_by_bama = visits_by_100751, visits_by_stonybrook = visits_by_196097, visits_by_rutgers = visits_by_186380, visits_by_pitt = visits_by_215293, visits_by_cinci = visits_by_201885, visits_by_nebraska = visits_by_181464, visits_by_georgia = visits_by_139959, visits_by_scarolina = visits_by_218663, visits_by_ncstate = visits_by_199193, visits_by_irvine = visits_by_110653, visits_by_kansas = visits_by_155317, visits_by_arkansas = visits_by_106397, visits_by_sillinois = visits_by_149222, visits_by_umass = visits_by_166629, num_took_read = num_took_rla, num_prof_read = num_prof_rla, med_inc = avgmedian_inc_2564) names(school_v2) ``` ## Introduce mutate() function ### Introduce `mutate()` function `mutate()` is __tidyverse__ approach to creating variables (not __Base R__ approach) Description of `mutate()` - creates new columns (variables) that are functions of existing columns - After creating a new variable using `mutate()`, every row of data is retained - `mutate()` works best with pipes `%>%` __Task__: - Using data frame `school_v2` create new variable that measures the pct of students on free/reduced lunch (output omitted) ```{r, results='hide'} school_sml <- school_v2 %>% # create new dataset with fewer vars; not necessary to do this select(ncessch, school_type, num_fr_lunch, total_students) school_sml %>% mutate(pct_fr_lunch = num_fr_lunch/total_students) # create new var rm(school_sml) ``` ### Syntax for `mutate()` Let's spend a couple minutes looking at help file for `mutate()` ```{r, eval=FALSE, echo=FALSE} ?mutate ``` __Usage (i.e., syntax)__ - `mutate(.data,...)` __Arguments__ - `.data`: a data frame - if using `mutate()` after pipe operator `%>%`, then this argument can be omitted - Why? Because data frame object to left of `%>%` "piped in" to first argument of `mutate()` - `...`: expressions used to create new variables - Can create multiple variables at once __Value__ - returns an object that contains the original input data frame and new variables that were created by `mutate()` __Useful functions (i.e., "helper functions")__ - These are standalone functions can be called *within* `mutate()` - e.g., `if_else()`, `recode()`, `case_when()` - will show examples of this in subsequent slides ### Introduce `mutate()` function New variable not retained unless we __assign__ `<-` it to an object (existing or new) \medskip __`mutate()` without assignment__ ```{r, results='hide'} school_v2 %>% mutate(pct_fr_lunch = num_fr_lunch/total_students) names(school_v2) ``` __`mutate()` with assignment__ ```{r, results="hide"} school_v2_temp <- school_v2 %>% mutate(pct_fr_lunch = num_fr_lunch/total_students) names(school_v2_temp) rm(school_v2_temp) ``` ### `mutate()` can create multiple variables at once `mutate()` can create multiple variables at once ```{r, results='hide'} school_v2 %>% mutate(pct_fr_lunch = num_fr_lunch/total_students, pct_prof_math= num_prof_math/num_took_math) %>% select(num_fr_lunch, total_students, pct_fr_lunch, num_prof_math, num_took_math, pct_prof_math) ``` Or we could write code this way: ```{r, results="hide"} school_v2 %>% select(num_fr_lunch, total_students, num_prof_math, num_took_math) %>% mutate(pct_fr_lunch = num_fr_lunch/total_students, pct_prof_math= num_prof_math/num_took_math) ``` ### Student exercise using mutate() 1. Using the object `school_v2`, select the following variables (`num_prof_math`, `num_took_math`, `num_prof_read`, `num_took_read`) and create a measure of percent proficient in math `pct_prof_math` and percent proficient in reading `pct_prof_read`. 2. Now using the code for question 1, filter schools where at least 50% of students are proficient in math **&** reading. 3. If you have time, count the number of schools from question 2. ### Solutions for exercise using mutate() 1. Using the object `school_v2`, select the following variables (`num_prof_math`, `num_took_math`, `num_prof_read`, `num_took_read`) and create a measure of percent proficient in math `pct_prof_math` and percent proficient in reading `pct_prof_read`. ```{r} school_v2 %>% select(num_prof_math, num_took_math, num_prof_read, num_took_read) %>% mutate(pct_prof_math = num_prof_math/num_took_math, pct_prof_read = num_prof_read/num_took_read) ``` ### Solutions for exercise using mutate() 2. Now using the code for question 1, filter schools where at least 50% of students are proficient in math **&** reading. ```{r} school_v2 %>% select(num_prof_math, num_took_math, num_prof_read, num_took_read) %>% mutate(pct_prof_math = num_prof_math/num_took_math, pct_prof_read = num_prof_read/num_took_read) %>% filter(pct_prof_math >= 0.5 & pct_prof_read >= 0.5) ``` ### Solutions for exercise using mutate() 3. If you have time, count the number of schools from question 2. ```{r} school_v2 %>% select(num_prof_math, num_took_math, num_prof_read, num_took_read) %>% mutate(pct_prof_math = num_prof_math/num_took_math, pct_prof_read = num_prof_read/num_took_read) %>% filter(pct_prof_math >= 0.5 & pct_prof_read >= 0.5) %>% count() ``` ## Using ifelse() function within mutate() ### Using `ifelse()` function within `mutate()` ```{r, eval=FALSE} ?if_else ``` __Description__ - if `condition` `TRUE`, assign a value; if `condition` `FALSE` assign a value __Usage (i.e., syntax)__ - `if_else(logical condition, true, false, missing = NULL)` __Arguments__ - `logical condition`: a condition that evaluates to `TRUE` or `FALSE` - `true`: value to assign if condition `TRUE` - `false`: value to assign if condition `FALSE` __Value__ - "Where condition is TRUE, the matching value from true, where it's FALSE, the matching value from false, otherwise NA." - missing values from "input" var are assigned missing values in "output var", unless you specify otherwise __Example__: Create 0/1 indicator of whether got at least one visit from Berkeley ```{r, results="hide"} school_v2 %>% mutate(got_visit_berkeley = ifelse(visits_by_berkeley>0,1,0)) %>% count(got_visit_berkeley) ``` ### `ifelse()` within `mutate()` to create 0/1 indicator variables We often create dichotomous (0/1) indicator variables of whether something happened (or whether something is TRUE) - Variables that are of substantive interest to project - e.g., did student graduate from college - Variables that help you investigate data, check quality - e.g., indicator of whether an observation is missing/non-missing for a particular variable ### Using `ifelse()` within `mutate()` __Task__ - Create 0/1 indicator if school has median income greater than $100,000 Usually a good idea to investigate "input" variables __before__ creating analysis vars ```{r, results="hide"} str(school_v2$med_inc) # investigate variable type school_v2 %>% count(med_inc) # frequency count, but this isn't very helpful school_v2 %>% filter(is.na(med_inc)) %>% count(med_inc) # shows number of obs w/ missing med_inc ``` Create variable ```{r} school_v2 %>% select(med_inc) %>% mutate(inc_gt_100k= ifelse(med_inc>100000,1,0)) %>% count(inc_gt_100k) # note how NA values of med_inc treated ``` ### Using `ifelse()` function within `mutate()` __Task__ - Create 0/1 indicator variable `nonmiss_math` which indicates whether school has non-missing values for the variable `num_took_math` - note: `num_took_math` refers to number of students at school that took state math proficiency test Usually a good to investigate "input" variables before creating analysis vars ```{r, results="hide"} school_v2 %>% count(num_took_math) # this isn't very helpful school_v2 %>% filter(is.na(num_took_math)) %>% count(num_took_math) # shows number of obs w/ missing med_inc ``` Create variable ```{r} school_v2 %>% select(num_took_math) %>% mutate(nonmiss_math= ifelse(!is.na(num_took_math),1,0)) %>% count(nonmiss_math) # note how NA values treated ``` ### Student exercises `ifelse()` 1. Using the object `school_v2`, create 0/1 indicator variable `in_state_berkeley` that equals `1` if the high school is in the same state as UC Berkeley (i.e., `state_code=="CA"`). 2. Create 0/1 indicator `berkeley_and_irvine` of whether a school got at least one visit from UC Berkeley __AND__ from UC Irvine. 3. Create 0/1 indicator `berkeley_or_irvine` of whether a school got at least one visit from UC Berkeley __OR__ from UC Irvine. ### Exercise`ifelse()` solutions 1. Using the object `school_v2`, create 0/1 indicator variable `in_state_berkeley` that equals `1` if the high school is in the same state as UC Berkeley (i.e., `state_code=="CA"`). ```{r, results="hide"} str(school_v2$state_code) # investigate input variable school_v2 %>% filter(is.na(state_code)) %>% count() # investigate input var #Create var school_v2 %>% mutate(in_state_berkeley=ifelse(state_code=="CA",1,0)) %>% count(in_state_berkeley) ``` ### Exercise`ifelse()` solutions 2. Create 0/1 indicator `berkeley_and_irvine` of whether a school got at least one visit from UC Berkeley __AND__ from UC Irvine. ```{r, results="hide"} #investigate input vars school_v2 %>% select(visits_by_berkeley, visits_by_irvine) %>% str() school_v2 %>% filter(is.na(visits_by_berkeley)) %>% count() school_v2 %>% filter(is.na(visits_by_irvine)) %>% count() #create variable school_v2 %>% mutate(berkeley_and_irvine=ifelse(visits_by_berkeley>0 & visits_by_irvine>0,1,0)) %>% count(berkeley_and_irvine) ``` ### Exercise`ifelse()` solutions 3. Create 0/1 indicator `berkeley_or_irvine` of whether a school got at least one visit from UC Berkeley __OR__ from UC Irvine. ```{r, results="hide"} school_v2 %>% mutate(berkeley_or_irvine=ifelse(visits_by_berkeley>0 | visits_by_irvine>0,1,0)) %>% count(berkeley_or_irvine) ``` ## Using recode() function within mutate() ### Using `recode()` function within `mutate()` ```{r, eval=FALSE, echo=FALSE} ?recode ``` __Description__: Recode values of a variable __Usage (i.e., syntax)__ - recode(.x, ..., .default = NULL, .missing = NULL) __Arguments__ [see help file for further details] - `.x` A vector (e.g., variable) to modify - `...` Specifications for recode, of form `current_value = new_recoded_value` - `.default`: If supplied, all values not otherwise matched given this value. - `.missing`: If supplied, any missing values in .x replaced by this value. __Example__: Using data frame `wwlist`, create new 0/1 indicator `public_school` from variable `school_type` ```{r, results="hide"} str(wwlist$school_type) wwlist %>% count(school_type) wwlist_temp <- wwlist %>% select(school_type) %>% mutate(public_school = recode(school_type,"public" = 1, "private" = 0)) wwlist_temp %>% head(n=10) str(wwlist_temp$public_school) wwlist_temp %>% count(public_school) rm(wwlist_temp) ``` ### Using `recode()` function within `mutate()` Recoding `school_type` could have been accomplished using `if_else()` - Use `recode()` when new variable has more than two categories __Task__: Create `school_catv2` based on `school_category` with these categories: - "regular"; "alternative"; "special"; "vocational" Investigate input var ```{r, results="hide"} str(wwlist$school_category) wwlist %>% count(school_category) ``` Recode ```{r, results="hide"} wwlist_temp <- wwlist %>% select(school_category) %>% mutate(school_catv2 = recode(school_category, "Alternative Education School" = "alternative", "Alternative/other" = "alternative", "Regular elementary or secondary" = "regular", "Regular School" = "regular", "Special Education School" = "special", "Special program emphasis" = "special", "Vocational Education School" = "vocational") ) str(wwlist_temp$school_catv2) wwlist_temp %>% count(school_catv2) wwlist %>% count(school_category) rm(wwlist_temp) ``` ### Using `recode()` within `mutate()` [do in pairs/groups] __Task__: Create `school_catv2` based on `school_category` with these categories: - "regular"; "alternative"; "special"; "vocational" - This time use the `.missing` argument to recode `NAs` to "unknown" ```{r, results="hide"} wwlist_temp <- wwlist %>% select(school_category) %>% mutate(school_catv2 = recode(school_category, "Alternative Education School" = "alternative", "Alternative/other" = "alternative", "Regular elementary or secondary" = "regular", "Regular School" = "regular", "Special Education School" = "special", "Special program emphasis" = "special", "Vocational Education School" = "vocational", .missing = "unknown") ) str(wwlist_temp$school_catv2) wwlist_temp %>% count(school_catv2) wwlist %>% count(school_category) rm(wwlist_temp) ``` ### Using `recode()` within `mutate()` __Task__: Create `school_catv2` based on `school_category` with these categories: - "regular"; "alternative"; "special"; "vocational" - This time use the `.default` argument to assign the value "regular" ```{r, results="hide"} wwlist_temp <- wwlist %>% select(school_category) %>% mutate(school_catv2 = recode(school_category, "Alternative Education School" = "alternative", "Alternative/other" = "alternative", "Special Education School" = "special", "Special program emphasis" = "special", "Vocational Education School" = "vocational", .default = "regular") ) str(wwlist_temp$school_catv2) wwlist_temp %>% count(school_catv2) wwlist %>% count(school_category) rm(wwlist_temp) ``` ### Using `recode()` within `mutate()` __Task__: Create `school_catv2` based on `school_category` with these categories: - This time create a numeric variable rather than character: - `1` for "regular"; `2` for "alternative"; `3` for "special"; `4` for "vocational" ```{r, results="hide"} wwlist_temp <- wwlist %>% select(school_category) %>% mutate(school_catv2 = recode(school_category, "Alternative Education School" = 2, "Alternative/other" = 2, "Regular elementary or secondary" = 1, "Regular School" = 1, "Special Education School" = 3, "Special program emphasis" = 3, "Vocational Education School" = 4) ) str(wwlist_temp$school_catv2) wwlist_temp %>% count(school_catv2) wwlist %>% count(school_category) rm(wwlist_temp) ``` ### Student exercise using `recode()` within `mutate()` ```{r, results="hide"} load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_event_somevars.RData")) names(df_event) ``` 1. Using object `df_event`, assign new object `df_event_temp` and create `event_typev2` based on `event_type` with these categories: - `1` for "2yr college"; `2` for "4yr college"; `3` for "other"; `4` for "private hs"; `5` for "public hs" 2. This time use the `.default` argument to assign the value `5` for "public hs" ### Exercise using `recode()` within `mutate()` solutions Check input variable ```{r, results="hide"} names(df_event) str(df_event$event_type) df_event %>% count(event_type) ``` ### Exercise using `recode()` within `mutate()` solutions 1. Using object `df_event`, assign new object `df_event_temp` and create `event_typev2` based on `event_type` with these categories: - `1` for "2yr college"; `2` for "4yr college"; `3` for "other"; `4` for "private hs"; `5` for "public hs" ```{r results="hide"} df_event_temp <- df_event %>% select(event_type) %>% mutate(event_typev2 = recode(event_type, "2yr college" = 1, "4yr college" = 2, "other" = 3, "private hs" = 4, "public hs" = 5) ) str(df_event_temp$event_typev2) df_event_temp %>% count(event_typev2) df_event %>% count(event_type) ``` ### Exercise using `recode()` within `mutate()` solutions 2. This time use the `.default` argument to assign the value `5` for "public hs" ```{r, results="hide"} df_event %>% select(event_type) %>% mutate(event_typev2 = recode(event_type, "2yr college" = 1, "4yr college" = 2, "other" = 3, "private hs" = 4, .default = 5) ) str(df_event_temp$event_typev2) df_event_temp %>% count(event_typev2) df_event %>% count(event_type) ``` ## Using case_when() function within mutate() ### Using `case_when()` function within `mutate()` ```{r, eval=FALSE, echo=FALSE} ?case_when ``` __Description__ Useful when the variable you want to create is more complicated than variables that can be created using `ifelse()` or `recode()` - Useful when new variable is a function of multiple "input" variables __Usage (i.e., syntax)__: `case_when(...)` __Arguments__ [from help file; see help file for more details] - `...`: A sequence of two-sided formulas. - The left hand side (LHS) determines which values match this case. - LHS must evaluate to a logical vector. - The right hand side (RHS) provides the replacement value. __Example task__: Using data frame `wwlist` and input vars `state` and `firstgen`, create a 4-category var with following categories: - "instate_firstgen"; "instate_nonfirstgen"; "outstate_firstgen"; "outstate_nonfirstgen" ```{r, results="hide"} wwlist_temp <- wwlist %>% select(state,firstgen) %>% mutate(state_gen = case_when( state == "WA" & firstgen =="Y" ~ "instate_firstgen", state == "WA" & firstgen =="N" ~ "instate_nonfirstgen", state != "WA" & firstgen =="Y" ~ "outstate_firstgen", state != "WA" & firstgen =="N" ~ "outstate_nonfirstgen") ) str(wwlist_temp$state_gen) wwlist_temp %>% count(state_gen) ``` ### Using `case_when()` function within `mutate()` __Task__: Using data frame `wwlist` and input vars `state` and `firstgen`, create a 4-category var with following categories: - "instate_firstgen"; "instate_nonfirstgen"; "outstate_firstgen"; "outstate_nonfirstgen" Let's take a closer look at how values of inputs are coded into values of outputs ```{r, results="hide"} wwlist %>% select(state,firstgen) %>% str() count(wwlist,state) count(wwlist,firstgen) wwlist_temp <- wwlist %>% select(state,firstgen) %>% mutate(state_gen = case_when( state == "WA" & firstgen =="Y" ~ "instate_firstgen", state == "WA" & firstgen =="N" ~ "instate_nonfirstgen", state != "WA" & firstgen =="Y" ~ "outstate_firstgen", state != "WA" & firstgen =="N" ~ "outstate_nonfirstgen") ) wwlist_temp %>% count(state_gen) wwlist_temp %>% filter(is.na(state)) %>% count(state_gen) wwlist_temp %>% filter(is.na(firstgen)) %>% count(state_gen) ``` __Take-away__: by default var created by `case_when()` equals `NA` for obs where one of the inputs equals `NA` ### Student exercise using `case_when()` within `mutate()` 1. Using the object `school_v2` and input vars `school_type`, and `state_code` , create a 4-category var `state_type` with following categories: - "instate_public"; "instate_private"; "outstate_public"; "outstate_private" - Note: We are referring to CA as in-state for this example ### Exercise using `case_when()` within `mutate()` solution Investigate ```{r, results="hide"} school_v2 %>% select(state_code,school_type) %>% str() count(school_v2,state_code) school_v2 %>% filter(is.na(state_code)) %>% count() count(school_v2,school_type) school_v2 %>% filter(is.na(school_type)) %>% count() ``` ### Exercise using `case_when()` within `mutate()` solution 1. Using the object `school_v2` and input vars `school_type`, and `state_code` , create a 4-category var `state_type` with following categories: - "instate_public"; "instate_private"; "outstate_public"; "outstate_private" ```{r} school_v2_temp <- school_v2 %>% select(state_code,school_type) %>% mutate(state_type = case_when( state_code == "CA" & school_type == "public" ~ "instate_public", state_code == "CA" & school_type == "private" ~ "instate_private", state_code != "CA" & school_type == "public" ~ "outstate_public", state_code != "CA" & school_type == "private" ~ "outstate_private") ) school_v2_temp %>% count(state_type) #school_v2_temp %>% filter(is.na(state_code)) %>% count(state_type) #no missing #school_v2_temp %>% filter(is.na(school_type)) %>% count(state_type) #no missing ``` # Base R appraoch to creating new variables ### Base R approach to creating new variables Subsetting operators `[]` and `$` are used to create new variables and set conditions of the input variables \medskip If creating new variable based on calculation of input variables, basically the tidyverse equivalent of `mutate()` __without__ `ifelse()` or `recode()` - Sudo syntax: `df$newvar <- ...` - where ... argument is expression(s)/calculation(s) used to create new variables \medskip __Task__: Create measure of percent of students on free-reduced lunch __base R approach__ ```{r} school_v2_temp<- school_v2 #create copy of dataset; not necessary school_v2_temp$pct_fr_lunch <- school_v2_temp$num_fr_lunch/school_v2_temp$total_students ``` __tidyverse approach (with pipes)__ ```{r} school_v2_temp <- school_v2 %>% mutate(pct_fr_lunch = num_fr_lunch/total_students) ``` ### Base R approach to creating new variables If creating new variable based on the condition/values of input variables, basically the tidyverse equivalent of `mutate()` __with__ `ifelse()` or `recode()` \medskip - Sudo syntax: `df$newvar[logical condition]<- new value` - `logical condition`: a condition that evaluates to `TRUE` or `FALSE` ### Base R approach to creating new variables __Task__: Create 0/1 indicator if school has median income greater than $100k __tidyverse approach (using pipes)__ ```{r} school_v2_temp %>% select(med_inc) %>% mutate(inc_gt_100k= ifelse(med_inc>100000,1,0)) %>% count(inc_gt_100k) # note how NA values of med_inc treated ``` __Base R approach__ ```{r} school_v2_temp$inc_gt_100k<-NA #initialize an empty column with NAs # otherwise you'll get warning school_v2_temp$inc_gt_100k[school_v2_temp$med_inc>100000] <- 1 school_v2_temp$inc_gt_100k[school_v2_temp$med_inc<=100000] <- 0 count(school_v2_temp, inc_gt_100k) ``` ### Base R approach to creating new variables __Task__: Using data frame `wwlist` and input vars `state` and `firstgen`, create a 4-category var with following categories: - "instate_firstgen"; "instate_nonfirstgen"; "outstate_firstgen"; "outstate_nonfirstgen" __tidyverse approach (using pipes)__ ```{r} wwlist_temp <- wwlist %>% mutate(state_gen = case_when( state == "WA" & firstgen =="Y" ~ "instate_firstgen", state == "WA" & firstgen =="N" ~ "instate_nonfirstgen", state != "WA" & firstgen =="Y" ~ "outstate_firstgen", state != "WA" & firstgen =="N" ~ "outstate_nonfirstgen") ) str(wwlist_temp$state_gen) wwlist_temp %>% count(state_gen) ``` ### Base R approach to creating new variables __Task__: Using data frame `wwlist` and input vars `state` and `firstgen`, create a 4-category var with following categories: - "instate_firstgen"; "instate_nonfirstgen"; "outstate_firstgen"; "outstate_nonfirstgen" __base R approach__ ```{r} wwlist_temp <- wwlist wwlist_temp$state_gen <- NA wwlist_temp$state_gen[wwlist_temp$state == "WA" & wwlist_temp$firstgen =="Y"] <- "instate_firstgen" wwlist_temp$state_gen[wwlist_temp$state == "WA" & wwlist_temp$firstgen =="N"] <- "instate_nonfirstgen" wwlist_temp$state_gen[wwlist_temp$state != "WA" & wwlist_temp$firstgen =="Y"] <- "outstate_firstgen" wwlist_temp$state_gen[wwlist_temp$state != "WA" & wwlist_temp$firstgen =="N"] <- "outstate_nonfirstgen" str(wwlist_temp$state_gen) count(wwlist_temp, state_gen) ```