--- title: "Enter the tidyverse: pipes and dplyr" subtitle: "Managing and Manipulating Data Using R" author: date: fontsize: 8pt classoption: dvipsnames # for colors urlcolor: blue output: beamer_presentation: keep_tex: true toc: false slide_level: 3 theme: default # AnnArbor # push to header? #colortheme: "dolphin" # push to header? #fonttheme: "structurebold" highlight: default # Supported styles include "default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn", and "haddock" (specify null to prevent syntax highlighting); push to header df_print: default #default # tibble # push to header? latex_engine: xelatex # Available engines are pdflatex [default], xelatex, and lualatex; The main reasons you may want to use xelatex or lualatex are: (1) They support Unicode better; (2) It is easier to make use of system fonts. includes: in_header: ../beamer_header.tex #after_body: table-of-contents.txt --- ```{r, echo=FALSE, include=FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE) #knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE) #comment = "#>" makes it so results from a code chunk start with "#>"; default is "##" ``` ```{r, echo=FALSE, include=FALSE, eval = FALSE} # Download images saved on github site imgs <- c('transform-logical.png') for (i in imgs) { if(!file.exists(i)){ download.file(url = paste0('https://raw.githubusercontent.com/anyone-can-cook/rclass1/master/lectures/enter_the_tidyverse/', i), destfile = i, mode = 'wb') } } ``` ### Lecture outline \tableofcontents ```{r, eval=FALSE, echo=FALSE} #Use this if you want TOC to show level 2 headings \tableofcontents #Use this if you don't want TOC to show level 2 headings \tableofcontents[subsectionstyle=hide/hide/hide] ``` # Introduction ### Libraries we will use today "Load" the package we will use today (output omitted) - __you must run this code chunk__ ```{r, message=FALSE} library(tidyverse) ``` If package not yet installed, then must install before you load. Install in "console" rather than .Rmd file - Generic syntax: `install.packages("package_name")` - Install "tidyverse": `install.packages("tidyverse")` Note: when we load package, name of package is not in quotes; but when we install package, name of package is in quotes: - `install.packages("tidyverse")` - `library(tidyverse)` ## Data for lecture sections on select(), filter(), and arrange() functions ### Load .Rdata data frames, `df_event` and `df_school` Data on off-campus recruiting events by public universities - Data frame object `df_event` - One observation per university, recruiting event - Data frame object `df_school` - One observation per high school (visited and non-visited) ```{r} rm(list = ls()) # remove all objects in current environment getwd() #load dataset with one obs per recruiting event load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_event_somevars.RData")) #load("../../data/recruiting/recruit_event_somevars.Rdata") #load dataset with one obs per high school load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_school_somevars.RData")) #load("../../data/recruiting/recruit_school_somevars.Rdata") ``` ## Data for lecture sections on pipes and mutate() function ### Load .Rdata data frame `wwlist`, "prospects" purchased by Western Washington U. Note: we won't use this data frame until the lecture section on "pipes" - You can ignore `wwlist` data frame for lecture sections on select(), filter(), and arrange() functions The "Student list" business - Universities identify/target "prospects" by buying "student lists" from College Board/ACT (e.g., $.40 per prospect) - Prospect lists contain contact info (e.g., address, email), academic achievement, socioeconomic, demographic characteristics - Universities choose which prospects to purchase by filtering on criteria like zip-code, GPA, test score range, etc. ```{r} #load prospect list data load(url("https://github.com/ozanj/rclass/raw/master/data/prospect_list/wwlist_merged.RData")) ``` Object `wwlist` - De-identified list of prospective students purchased by Western Washington University from College Board - We collected these data using public records requests request ### Data frame `wwlist`, "prospects" purchased by Western Washington U. Observations on `wwlist` - each observation represents a prospective student ```{r} typeof(wwlist) dim(wwlist) ``` Variables on `wwlist` - some vars provide de-identified data on individual prospects - e.g., `psat_range`, `state`, `sex`, `ethn_code` - some vars provide data about zip-code student lives in - e.g., `med_inc`, `pop_total`, `pop_black` - some vars provide data about school student enrolled in - e.g., `fr_lunch` is number of students on free/reduced lunch - note: bad merge between prospect-level data and school-level data ```{r, results="hide"} names(wwlist) str(wwlist) glimpse(wwlist) # tidyverse function, similar to str() ``` ### Data frame `wwlist`, "prospects" purchased by Western Washington U. Variable `firstgen` identifies whether prospect is a first-generation college student Imagine we want to isolate all the first-generation prospects 1. Investigate variable type/structure. - A dichotomous var, but stored as character in `wwlist`. So must use quotes (`''` or `""`) to filter/subset based on values of `firstgen` ```{r} str(wwlist$firstgen) ``` 2. Create frequency table to identify possible values of `firstgen` ```{r} table(wwlist$firstgen, useNA = "always") ``` 3. Isolate all the first-gen prospects (output omitted) ```{r results="hide"} filter(wwlist, firstgen == "Y") ``` # Investigating data patterns ### Introduction to the `dplyr` library `dplyr`, a package within the `tidyverse` suite of packages, provide tools for manipulating data frames - Wickham describes functions within `dplyr` as a set of "verbs" that fall in the broader categories of __subsetting__, __sorting__, and __transforming__ Today | Upcoming weeks ------------- | ------------- __Subsetting data__ | __Transforming data__ - `select()` variables | - `mutate()` creates new variables - `filter()` observations | - `summarize()` calculates across rows __Sorting data__ | - `group_by()` to calculate across rows within groups - `arrange()` | All `dplyr` verbs (i.e., functions) work as follows 1. first argument is a data frame 1. subsequent arguments describe what to do with variables and observations in data frame - refer to variable names without quotes 1. result of the function is a new data frame ## select() variables ### Select variables using `select()` function Printing observations is key to investigating data, but datasets often have hundreds, thousands of variables `select()` function selects __columns__ of data (i.e., variables) you specify - first argument is the name of data frame object - remaining arguments are variable names, which are separated by commas and without quotes Without __assignment__ (`<-`), `select()` by itself simply prints selected vars ```{r} #?select select(df_event,instnm,event_date,event_type,event_state,med_inc) ``` ### Select variables using `select()` function Recall that all `dplyr` functions (e.g., `select()`) return a new data frame object - __type__ equals "list" - __length__ equals number of vars you select ```{r} typeof(select(df_event,instnm,event_date,event_type,event_state,med_inc)) length(select(df_event,instnm,event_date,event_type,event_state,med_inc)) ``` `glimpse()`: tidyverse function for viewing data frames - a cross between `str()` and simply printing data ```{r, eval=FALSE} ?glimpse glimpse(df_event) ``` `glimpse()` a `select()` set of variables ```{r} glimpse(select(df_event,instnm,event_date,event_type,event_state,med_inc)) ``` ### Select variables using `select()` function With __assignment__ (`<-`), `select()` creates a new object containing only the variables you specify ```{r} event_small <- select(df_event,instnm,event_date,event_type,event_state, med_inc) glimpse(event_small) ``` ### Select `select()` can use "helper functions" `starts_with()`, `contains()`, and `ends_with()` to choose columns ```{r, eval=FALSE} ?select ``` Example: ```{r} #names(df_event) select(df_event,instnm,starts_with("event")) ``` ### Rename variables `rename()` function renames variables within a data frame object \medskip Syntax: - `rename(obj_name, new_name = old_name,...)` \medskip ```{r results="hide"} rename(df_event, g12_offered = g12offered, titlei = titlei_status_pub) names(df_event) ``` Variable names do not change permanently unless we combine rename with assignment \medskip ```{r results="hide"} rename_event <- rename(df_event, g12_offered = g12offered, titlei = titlei_status_pub) names(rename_event) rm(rename_event) ``` ## filter() rows ### The `filter()` function `filter()` allows you to __select observations__ based on values of variables - Arguments - first argument is name of data frame - subsequent arguments are _logical expressions_ to filter the data frame - Multiple expressions separated by commas work as __AND__ operators (e.g., condtion 1 `TRUE` AND condition 2 `TRUE`) - What is the result of a `filter()` command? - `filter()` returns a data frame consisting of rows where the condition is `TRUE` ```{r, eval=FALSE} ?filter ``` Example from data frame object `df_school`, each obs is a high school - Show all obs where the high school received 1 visit from UC Berkeley (110635) [output omitted] ```{r results="hide"} filter(df_school,visits_by_110635 == 1) ``` Note that resulting object is list, consisting of obs where condition `TRUE` ```{r} nrow(df_school) nrow(filter(df_school,visits_by_110635 == 1)) ``` ### The `filter()` function, base R equivalents **Task**: Count the number of high schools that received 1 visit from UC Berkeley. \bigskip **[tidyverse]** Using `filter()`:\smallskip ```{r} nrow(filter(df_school, visits_by_110635 == 1)) ``` \bigskip **[base R]** Using `[]` and `$`:\smallskip ```{r} nrow(df_school[df_school$visits_by_110635 == 1, ]) ``` \bigskip **[base R]** Using `subset()`:\smallskip ```{r} nrow(subset(df_school, visits_by_110635 == 1)) ``` ### Filter, character variables Use single quotes `''` or double quotes `""` to refer to values of character variables ```{r} glimpse(select(df_school, school_type, state_code)) ``` Identify all private high schools in CA that got 1 visit by particular universities - Visited once by UC Berkeley (ID=110635) ```{r results="hide"} filter(df_school,visits_by_110635 == 1, school_type == "private", state_code == "CA") ``` - Visited once by University of Alabama (ID=100751) ```{r results="hide"} filter(df_school,visits_by_100751 == 1, school_type == "private", state_code == "CA") ``` - Visited once by Berkeley and University of Alabama ```{r results="hide"} filter(df_school,visits_by_100751 == 1, visits_by_110635 == 1, school_type == "private", state_code == "CA") ``` ### Filter by multiple conditions, base R equivalents **Task**: Count the number of private high schools in CA that received 1 visit each from UC Berkeley and University of Alabama. \medskip **[tidyverse]** Using `filter()`:\smallskip ```{r} nrow(filter(df_school, visits_by_100751 == 1, visits_by_110635 == 1, school_type == "private", state_code == "CA")) ``` \medskip **[base R]** Using `[]` and `$`:\smallskip ```{r} nrow(df_school[df_school$visits_by_100751 == 1 & df_school$visits_by_110635 == 1 & df_school$school_type == "private" & df_school$state_code == "CA", ]) ``` \medskip **[base R]** Using `subset()`:\smallskip ```{r} nrow(subset(df_school, visits_by_100751 == 1 & visits_by_110635 == 1 & school_type == "private" & state_code == "CA")) ``` ### Logical operators for comparisons logical operators useful for: filter obs w/ `filter()`; create variables w/ `mutate()` - logical operators also work when using Base R functions Operator symbol | Operator meaning -------|------- `==` | Equal to `!=` | Not equal to `>` | greater than `>=` | greater than or equal to `<` | less than `<=` | less than or equal to `&` | AND `|` | OR `%in%` | includes - Visualization of "Boolean" operators (e.g., AND, OR, AND NOT) !["Boolean" operations, x=left circle, y=right circle, from Wichkam (2018)](transform-logical.png){width=40%} ### Aside: `count()` function \medskip `count()` function from `dplyr` package counts the number of obs by group ```{r, eval=FALSE, echo=FALSE} ?count ``` __Syntax__ [see help file for full syntax] - `count(x,...)` __Arguments__ [see help file for full arguments] - `x`: an object, often a data frame - `...`: variables to group by Examples of using `count()` - Without vars in `...` argument, counts number of obs in object ```{r, results="hide"} count(df_school) # df_school %>% count() # same as above but using pipes str(count(df_school)) # #df_school %>% count() %>% str() # same as above but using pipes ``` - With vars in `...` argument, counts number of obs per variable value - This is the best way to create frequency table, better than `table()` - note: by default, `count()` always shows `NAs` [this is good!] ```{r, results="hide"} count(df_school,school_type) # df_school %>% count(school_type) # same as above but using pipes str(count(df_school,school_type)) # df_school %>% count(school_type) %>% str() # same as above but using pipes ``` ### Filters and comparisons, Demonstration Schools visited by Bama (100751) and/or Berkeley (110635)\bigskip ```{r results="hide"} # Berkeley AND Bama filter(df_school,visits_by_100751 >= 1, visits_by_110635 >= 1) filter(df_school,visits_by_100751 >= 1 & visits_by_110635 >= 1) # same same df_school[df_school$visits_by_100751 >= 1 & df_school$visits_by_110635 >= 1, ] # using [] and $ subset(df_school,visits_by_100751 >= 1 & visits_by_110635 >= 1) # using subset() ``` \bigskip ```{r results="hide"} # Berkeley OR Bama filter(df_school,visits_by_100751 >= 1 | visits_by_110635 >= 1) df_school[df_school$visits_by_100751 >= 1 | df_school$visits_by_110635 >= 1, ] # using [] and $ subset(df_school,visits_by_100751 >= 1 | visits_by_110635 >= 1) # using subset() ``` ### Filters and comparisons, Demonstration (cont.) Apply `count()` function on top of `filter()` function to count the number of observations that satisfy criteria - Avoids printing individual observations\medskip ```{r} # Number of schools that get visit by Berkeley AND Bama count(filter(df_school, visits_by_100751 >= 1 & visits_by_110635 >= 1)) # Number of schools that get visit by Berkeley OR Bama count(filter(df_school, visits_by_100751 >= 1 | visits_by_110635 >= 1)) ``` - Note: You could also use any of the base R equivalents from the previous slide ### Filters and comparisons, `>=` Number of public high schools that are at least 50% Black in Alabama compared to number of schools that received visit by Bama:\medskip ```{r} # at least 50% black count(filter(df_school, school_type == "public", pct_black >= 50, state_code == "AL")) # at least 50% black and received visit by Bama count(filter(df_school, school_type == "public", pct_black >= 50, state_code == "AL", visits_by_100751 >= 1)) ``` ### Filters and comparisons, `>=` (cont.) Number of public high schools that are at least 50% White in Alabama compared to number of schools that received visit by Bama:\medskip ```{r} # at least 50% white count(filter(df_school, school_type == "public", pct_white >= 50, state_code == "AL")) # at least 50% white and received visit by Bama count(filter(df_school, school_type == "public", pct_white >= 50, state_code == "AL", visits_by_100751 >= 1)) ``` ### Filters and comparisons, not equals (`!=`) Count the number of high schools visited by University of Colorado (126614) that are not located in CO ```{r} #number of high schools visited by U Colorado count(filter(df_school, visits_by_126614 >= 1)) #number of high schools visited by U Colorado not located in CO count(filter(df_school, visits_by_126614 >= 1, state_code != "CO")) #number of high schools visited by U Colorado located in CO #count(filter(df_school, visits_by_126614 >= 1, state_code == "CO")) ``` ### Filters and comparisons, `%in%` operator What if you wanted to count the number of schools visited by Bama (100751) in a group of states? ```{r} count(filter(df_school,visits_by_100751 >= 1, state_code == "MA" | state_code == "VT" | state_code == "ME")) ``` Easier way to do this is with `%in%` operator ```{r} count(filter(df_school,visits_by_100751 >= 1, state_code %in% c("MA","ME","VT"))) ``` Select the private high schools that got either 2 or 3 visits from Bama ```{r} count(filter(df_school, visits_by_100751 %in% 2:3, school_type == "private")) ``` ### Identifying data type and possible values helpful for filtering - `typeof()` and `str()` shows internal data type of a variable - `table()` to show potential values of categorical variables ```{r} typeof(df_event$event_type) str(df_event$event_type) # double quotes indicate character table(df_event$event_type, useNA="always") typeof(df_event$med_inc) str(df_event$med_inc) ``` Now that we know `event_type` is a character, we can filter values ```{r} count(filter(df_event, event_type == "public hs", event_state =="CA")) #below code would return an error because variables are character #count(filter(df_event, event_type == public hs, event_state ==CA)) ``` ### Filtering and missing values Wickham (2018) states: - "`filter()` only includes rows where condition is TRUE; it excludes both `FALSE` and `NA` values. To preserve missing values, ask for them explicitly:" \medskip Investigate var `df_event$fr_lunch`, number of free/reduced lunch students - only available for visits to public high schools ```{r} #visits to public HS with less than 50 students on free/reduced lunch count(filter(df_event,event_type == "public hs", fr_lunch<50)) #visits to public HS, where free/reduced lunch missing count(filter(df_event,event_type == "public hs", is.na(fr_lunch))) #visits to public HS, where free/reduced is less than 50 OR is missing count(filter(df_event,event_type == "public hs", fr_lunch<50 | is.na(fr_lunch))) ``` ### Exercise Task - Create a filter to identify all the high schools that recieved 1 visit from UC Berkeley (110635) AND 1 visit from CU Boulder (126614)[output omitted] ### Solution ```{r, results="hide"} filter(df_school,visits_by_110635 == 1, visits_by_126614==1) nrow(filter(df_school,visits_by_110635 == 1, visits_by_126614==1)) count(filter(df_school,visits_by_110635 == 1, visits_by_126614==1)) ``` - Must __assign__ to create new object based on filter ```{r results="hide"} berk_boulder <- filter(df_school,visits_by_110635 == 1, visits_by_126614==1) count(berk_boulder) ``` ### Exercises Use the data from df_event, which has one observation for each off-campus recruiting event a university attends 1. Count the number of events attended by the University of Pittsburgh (Pitt) `univ_id == 215293` 1. Count the number of recruiting events by Pitt at public or private high schools 1. Count the number of recruiting events by Pitt at public or private high schools located in the state of PA 1. Count the number of recruiting events by Pitt at public high schools not located in PA where median income is less than 100,000 1. Count the number of recruiting events by Pitt at public high schools not located in PA where median income is greater than or equal to 100,000 1. Count the number of out-of-state recruiting events by Pitt at private high schools or public high schools with median income of at least 100,000 ### Solution 1. Count the number of events attended by the University of Pittsburgh (Pitt) `univ_id == 215293` ```{r} count(filter(df_event, univ_id == 215293)) ``` 2. Count the number of recruiting events by Pitt at public or private high schools ```{r } str(df_event$event_type) table(df_event$event_type, useNA = "always") count(filter(df_event, univ_id == 215293, event_type == "private hs" | event_type == "public hs")) ``` ### Solution 3. Count the number of recruiting events by Pitt at public or private high schools located in the state of PA ```{r} count(filter(df_event, univ_id == 215293, event_type == "private hs" | event_type == "public hs", event_state == "PA")) ``` 4. Count the number of recruiting events by Pitt at public high schools not located in PA where median income is less than 100,000 ```{r} count(filter(df_event, univ_id == 215293, event_type == "public hs", event_state != "PA", med_inc < 100000)) ``` ### Solution 5. Count the number of recruiting events by Pitt at public high schools not located in PA where median income is greater than or equal to 100,000 ```{r} count(filter(df_event, univ_id == 215293, event_type == "public hs", event_state != "PA", med_inc >= 100000)) ``` 6. Count the number of out-of-state recruiting events by Pitt at private high schools or public high schools with median income of at least 100,000 ```{r} count(filter(df_event, univ_id == 215293, event_state != "PA", (event_type == "public hs" & med_inc >= 100000) | event_type == "private hs")) ``` ## arrange() rows (i.e., sort rows) ### `arrange()` function `arrange()` function "arranges" rows in a data frame; said different, it sorts observations \medskip Syntax: `arrange(x,...)` - First argument, `x`, is a data frame - Subsequent arguments are a "comma separated list of unquoted variable names" ```{r, results="hide"} df_event arrange(df_event, event_date) ``` Data frame goes back to previous order unless you __assign__ the new order ```{r, results="hide"} df_event df_event <- arrange(df_event, event_date) df_event ``` ### `arrange()` function Ascending and descending order - `arrange()` sorts in __ascending__ order by default - use `desc()` to sort a column by descending order ```{r results="hide"} arrange(df_event, desc(event_date)) ``` Can sort by multiple variables ```{r results="hide"} arrange(df_event, univ_id, desc(event_date), desc(med_inc)) #sort by university and descending by size of 12th grade class; combine with select select(arrange(df_event, univ_id, desc(g12)),instnm,event_type,event_date,g12) ``` ### `arrange()`, missing values sorted at the end Missing values automatically sorted at the end, regardless of whether you sort ascending or descending Below, we sort by university, then by date of event, then by ID of high school ```{r, results="hide"} #by university, date, ascending school id select(arrange(df_event, univ_id, desc(event_date), school_id), instnm,event_date,event_type,school_id) #by university, date, descending school id select(arrange(df_event, univ_id, desc(event_date), desc(school_id)), instnm,event_date,event_type,school_id) ``` Can sort by `is.na` to put missing values first ```{r} select(arrange(df_event, univ_id, desc(event_date), desc(is.na(school_id))), instnm,event_date,event_type,school_id) ``` ### Exercise, arranging Use the data from df_event, which has one observation for each off-campus recruiting event a university attends 1. Sort ascending by "univ_id" and descending by "event_date" 1. Select four variables in total and sort ascending by "univ_id" and descending by "event_date" 1. Now using the same variables from above, sort by `is.na` to put missing values in "school_id" first ### Solution 1. Sort ascending by "univ_id" and descending by "event_date" ```{r} arrange(df_event, univ_id, desc(event_date)) ``` ### Solution 2. Select four variables in total and sort ascending by "univ_id" and descending by "event_date" ```{r} select(arrange(df_event, univ_id, desc(event_date)), univ_id, event_date, instnm, event_type) ``` ### Solution 3. Select the variables "univ_id", "event_date", and "school_id" and sort by `is.na` to put missing values in "school_id" first. ```{r} select(arrange(df_event, univ_id, desc(event_date), desc(is.na(school_id))), univ_id, event_date, school_id) ``` # Pipes ### What are "pipes", %>% __Pipes__ are a means of performing multiple steps in a single line of code - When writing code, the pipe symbol is `%>%` - The pipe operator `%>%` is created by the **magrittr** package, which is not part of base R - However, the magrittr package is automatically loaded when you load the tidyverse package ```{r, eval = FALSE} ?magrittr::`%>%` ``` ### What are "pipes", %>% pipe syntax: `LHS %>% RHS` - `LHS` (refers to "left hand side" of the pipe) is an object or function - `RHS` (refers to "right hand side" of the pipe) is a function How pipes work: - Object created by `LHS` becomes the first argument of the function (`RHS`) to the right of the `%>%` pipe symbol - Basic code flow: `object %>% function1 %>% function2 %>% function3` - Output of `some_function1` becomes the input (the first argument) of the function `some_function2` to the right of the `%>%` pipe symbol Example of using pipes to calculate mean value of atomic vector ```{r} 1:10 # an atomic vector mean(1:10) # calculate mean without pipes 1:10 %>% mean() # calculate mean with pipes ``` - no pipe: (1) write function; (2) data object `1:10` is 1st argument of `mean()` - pipe: (1) write data object; (2) "pipe" (verb) object as 1st argument of `mean()` ### What are "pipes", %>% \medskip Intuitive mnemonic device for understanding pipes - whenever you see a pipe `%>%` think of the words "__and then...__" Example: isolate all the first-generation prospects [output omitted] - in words: start with object `wwlist` __and then__ filter first generation students ```{r, results="hide"} wwlist %>% filter(firstgen == "Y") ``` below code in words: - start with `wwlist` __and then__ select a few vars __and then__ filter __and then__ sort __and then__ investigate structre of object ```{r} wwlist %>% select(firstgen, state, med_inc_zip) %>% filter(firstgen == "Y", state == "WA") %>% arrange(desc(med_inc_zip)) %>% str() ``` ### More intuition on the pipe operator, `%>%` Example: apply "structure" function `str()` to `wwlist` with and without pipes ```{r, results="hide"} str(wwlist) # without pipe wwlist %>% str() # with pipe ``` I use the `str()` when I add new `%>%`; shows what kind of object being piped in - task: select a few vars from `wwlist`; isolate first-gen students in WA; sort descending by income (output omitted) ```{r, eval = FALSE} wwlist %>% select(firstgen, state, med_inc_zip) %>% str() wwlist %>% select(firstgen, state, med_inc_zip) %>% filter(firstgen == "Y", state == "WA") %>% str() wwlist %>% select(firstgen, state, med_inc_zip) %>% filter(firstgen == "Y", state == "WA") %>% arrange(desc(med_inc_zip)) %>% str() ``` ### Compare data tasks, with and without pipes Task: Using object `wwlist` print data for "first-gen" prospects (`firstgen == "Y"`) ```{r, results='hide'} # without pipes filter(wwlist, firstgen == "Y") # with pipes wwlist %>% filter(firstgen == "Y") ``` Comparing the two approaches: - "without pipes", object `wwlist` is the first argument `filter()` function - In "pipes" approach, you don't specify object `wwlist` as first argument in `filter()` - Why? Because `%>%` "pipes" the object to the left of the `%>%` operator into the function to the right of the `%>%` operator ### Compare data tasks, with and without pipes __Task__: Using object `wwlist`, print data for "first-gen" prospects for selected variables ```{r, results='hide'} #Without pipes select(filter(wwlist, firstgen == "Y"), state, hs_city, sex) #With pipes wwlist %>% filter(firstgen == "Y") %>% select(state, hs_city, sex) ``` Comparing the two approaches: - In the "without pipes" approach, code is written "inside out" - The first step in the task -- identifying the object -- is the innermost part of code - The last step in task -- selecting variables to print -- is the outermost part of code - In "pipes" approach the left-to-right order of code matches how we think about the task - First, we start with an object __*and then*__ (`%>%`) we use `filter()` to isolate first-gen students __*and then*__ (`%>%`) we select which variables to print \medskip `str()` helpful to understand object piped in from one function to another ```{r, results="hide"} #object that was "piped" into `select()` from `filter()` wwlist %>% filter(firstgen == "Y") %>% str() #object that was created after `select()` function wwlist %>% filter(firstgen == "Y") %>% select(state, hs_city, sex) %>% str() ``` ### Aside: `count()` function \medskip `count()` function from `dplyr` package counts the number of obs by group ```{r, eval=FALSE, echo=FALSE} ?count ``` __Syntax__ [see help file for full syntax] - `count(x,...)` __Arguments__ [see help file for full arguments] - `x`: an object, often a data frame - `...`: variables to group by Examples of using `count()` - Without vars in `...` argument, counts number of obs in object ```{r, results="hide"} count(wwlist) wwlist %>% count() wwlist %>% count() %>% str() ``` - With vars in `...` argument, counts number of obs per variable value - This is the best way to create frequency table, better than `table()` - note: by default, `count()` always shows `NAs` [this is good!] ```{r, results="hide"} count(wwlist,school_category) wwlist %>% count(school_category) wwlist %>% count(school_category) %>% str() ``` ### pipe operators and new lines \medskip Often want to insert line breaks to make long line of code more readable - When inserting line breaks, __pipe operator `%>%` should be the last thing before a line break, not the first thing after a line break__ __This works__ ```{r, results="hide"} wwlist %>% filter(firstgen == "Y") %>% select(state, hs_city, sex) %>% count(sex) ``` __This works too__ ```{r, results="hide"} wwlist %>% filter(firstgen == "Y", state != "WA") %>% select(state, hs_city, sex) %>% count(sex) ``` __This doesn't work__ ```{r, eval=FALSE} wwlist %>% filter(firstgen == "Y") %>% select(state, hs_city, sex) %>% count(sex) ``` ### The power of pipes You might be thinking, "what's the big deal?" __TasK__: - in one line of code, modify `wwlist` and create bar chart that counts number of prospects purchased by race/ethnicity, separately for in-state vs. out-of-state ```{r, eval=FALSE} wwlist %>% filter(is.na(state)==0) %>% # drop obs where variable state missing mutate( # create out-of-state indicator; create recoded ethnicity var out_state = as_factor(if_else(state != "WA", "out-of-state", "in-state")), ethn_race = recode(ethn_code, "american indian or alaska native" = "nativeam", "asian or native hawaiian or other pacific islander" = "api", "black or african american" = "black", "cuban" = "latinx", "mexican/mexican american" = "latinx", "not reported" = "not_reported", "other-2 or more" = "multirace", "other spanish/hispanic" = "latinx", "puerto rican" = "latinx", "white" = "white")) %>% group_by(out_state) %>% # group_by "in-state" vs. "out-of-state" count(ethn_race) %>% # count of number of prospects purchased by race ggplot(aes(x=ethn_race, y=n)) + # plot ylab("number of prospects") + xlab("race/ethnicity") + geom_col() + coord_flip() + facet_wrap(~ out_state) ``` ### The power of pipes __TasK__: - in one line of code, modify `wwlist` and create bar chart of median income (in zip-code) of prospects purchased by race/ethnicity, separately for in-state vs. out-of-state ```{r, eval=FALSE} wwlist %>% filter(is.na(state)==0) %>% # drop obs where variable state missing mutate( # create out-of-state indicator; create recoded ethnicity var out_state = as_factor(if_else(state != "WA", "out-of-state", "in-state")), ethn_race = recode(ethn_code, "american indian or alaska native" = "nativeam", "asian or native hawaiian or other pacific islander" = "api", "black or african american" = "black", "cuban" = "latinx", "mexican/mexican american" = "latinx", "not reported" = "not_reported", "other-2 or more" = "multirace", "other spanish/hispanic" = "latinx", "puerto rican" = "latinx", "white" = "white")) %>% group_by(out_state, ethn_race) %>% # group_by "out-state" and ethnicity summarize(avg_inc_zip = mean(med_inc_zip, na.rm = TRUE)) %>% # calculate avg. inc ggplot(aes(x=out_state, y=avg_inc_zip)) + ylab("avg. income in zip code") + xlab("") + geom_col() + coord_flip() + facet_wrap(~ ethn_race) # plot ``` ### The power of pipes Example R script from Ben Skinner, which creates analysis data for [Skinner (2018)](https://link.springer.com/article/10.1007%2Fs11162-018-9507-1) - [Link to R script](https://github.com/btskinner/colchoice_rep/blob/master/scripts/makedata.r) \medskip Other relevant links - [Link to Github repository for Skinner (2018)](https://github.com/btskinner/colchoice_rep) - [Link to published paper](https://link.springer.com/article/10.1007%2Fs11162-018-9507-1) - [Link to Skinner's Github page](https://github.com/btskinner) - A lot of cool stuff here - [Link to Skinner's personal website](https://www.btskinner.io/) - A lot of cool stuff here ### Which objects and functions are pipeable Which objects and functions are "pipeable" (i.e., work with pipes) - function is pipeable if it takes a data object as first argument and returns an object of same type - In general, doesn’t seem to be any limit on which kinds of objects are pipeable (could be atomic vector, list, data frame) ```{r, result = "hide"} # applying pipes to atomic vectors 1:10 %>% mean 1:10 %>% mean %>% str() ``` But some pipeable functions restrict which kinds of data objects they accept - In particular, the `dplyr` functions (e.g., `filter`, `arrange`, etc.) expect the first argument to be a data frame. - `dpylr` functions won’t even accept a list as first argument, even though data frames are a particular class of list ```{r, eval = FALSE} wwlist %>% filter(firstgen == "Y") %>% str() as.data.frame(wwlist) %>% str() as.data.frame(wwlist) %>% filter(firstgen == "Y") %>% str() as.list(wwlist) %>% str() # as.list(wwlist) %>% filter(firstgen == "Y") %>% str() # error ``` ### Do task with and without pipes [STUDENTS WORK ON THEIR OWN] Task: - Count the number "first-generation" prospects from the state of Washington Without pipes ```{r} count(filter(wwlist, firstgen == "Y", state == "WA")) ``` With pipes ```{r} wwlist %>% filter(firstgen == "Y", state == "WA") %>% count() ``` ### Do task with and without pipes [STUDENTS WORK ON THEIR OWN] __Task__: frequency table of `school_type` for non first-gen prospects from WA __Without pipes__ ```{r} wwlist_temp <- filter(wwlist, firstgen == "N", state == "WA") table(wwlist_temp$school_type, useNA = "always") rm(wwlist_temp) # cuz we don't need after creating table ``` __With pipes__ ```{r} wwlist %>% filter(firstgen == "N", state == "WA") %>% count(school_type) ``` __Comparison of two approaches__ - without pipes, task requires multiple lines of code (this is quite common) - first line creates object; second line analyzes object - with pipes, task can be completed in one line of code and you aren't left with objects you don't care about ### Student exercises with pipes 1. Using object `wwlist` select the following variables (state, firstgen, ethn_code) and assign `<-` them to object `wwlist_temp`. (ex. wwlist_temp <- wwlist) 2. Using the object you just created `wwlist_temp`, create a frequency table of `ethn_code` for first-gen prospects from California. 3. **Bonus**: Try doing question 1 and 2 together. Use original object `wwlist`, but do not assign to a new object. Once finished you can `rm(wwlist_temp)` ### Solution to exercises with pipes 1. Using object `wwlist` select the following variables (state, firstgen, ethn_code) and assign them to object `wwlist_temp` ```{r} wwlist_temp <- wwlist %>% select(state, firstgen, ethn_code) ``` ### Solution to exercises with pipes 2. Using the object you just created `wwlist_temp`, create a frequency table of `ethn_code` for first-gen prospects from California. ```{r} #names(wwlist) wwlist_temp %>% filter(firstgen == "Y", state == "CA") %>% count(ethn_code) ``` ### Solution to exercises with pipes 3. **Bonus**: Try doing question 1 and 2 together. ```{r} wwlist %>% select(state, firstgen, ethn_code) %>% filter(firstgen == "Y", state == "CA") %>% count(ethn_code) #rm(wwlist_temp) ``` ```{r} rm(wwlist_temp) ``` # Creating variables using mutate ### Our plan for learning how to create new variables Recall that `dplyr` package within `tidyverse` provide a set of functions that can be described as "verbs": __subsetting__, __sorting__, and __transforming__ What we've done | Where we're going --------------- | -------------------- __Subsetting data__ | __Transforming data__ - `select()` variables | - `mutate()` creates new variables - `filter()` observations | - `summarize()` calculates across rows __Sorting data__ | - `group_by()` to calculate across rows within groups - `arrange()` | __Today__ - we'll use `mutate()` to create new variables based on calculations across columns within a row __Next week__ - we'll combine `mutate()` with `summarize()` and `group_by()` to create variables based on calculations across rows ### Create new data frame based on `df_school_all` Data frame `df_school_all` has one obs per US high school and then variables identifying number of visits by particular universities ```{r} load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_school_allvars.RData")) names(df_school_all) ``` ### Create new data frame based on `df_school_all` Create new version of data frame, called `school_v2`, which we'll use to introduce how to create new variables ```{r, results='hide'} school_v2 <- df_school_all %>% select(-contains("inst_")) %>% # remove vars that start with "inst_" rename( # rename selected variables visits_by_berkeley = visits_by_110635, visits_by_boulder = visits_by_126614, visits_by_bama = visits_by_100751, visits_by_stonybrook = visits_by_196097, visits_by_rutgers = visits_by_186380, visits_by_pitt = visits_by_215293, visits_by_cinci = visits_by_201885, visits_by_nebraska = visits_by_181464, visits_by_georgia = visits_by_139959, visits_by_scarolina = visits_by_218663, visits_by_ncstate = visits_by_199193, visits_by_irvine = visits_by_110653, visits_by_kansas = visits_by_155317, visits_by_arkansas = visits_by_106397, visits_by_sillinois = visits_by_149222, visits_by_umass = visits_by_166629, num_took_read = num_took_rla, num_prof_read = num_prof_rla, med_inc = avgmedian_inc_2564 ) glimpse(school_v2) ``` ## Introduce mutate() function ### Introduce `mutate()` function `mutate()` is __tidyverse__ approach to creating variables (not __Base R__ approach) Description of `mutate()` - creates new columns (variables) that are functions of existing columns - After creating a new variable using `mutate()`, every row of data is retained - `mutate()` works best with pipes `%>%` __Task__: - Using data frame `school_v2` create new variable that measures the pct of students on free/reduced lunch (output omitted) ```{r, results='hide'} # create new dataset with fewer vars; not necessary to do this school_sml <- school_v2 %>% select(ncessch, school_type, num_fr_lunch, total_students) # create new var school_sml %>% mutate(pct_fr_lunch = num_fr_lunch/total_students) # remove data frame object rm(school_sml) ``` ### Investigate `mutate()` syntax ```{r, eval=FALSE, echo=FALSE} ?mutate ``` __Usage (i.e., syntax)__ - `mutate(.data,...)` __Arguments__ - `.data`: a data frame - if using `mutate()` after pipe operator `%>%`, then this argument can be omitted - Why? Because data frame object to left of `%>%` "piped in" to first argument of `mutate()` - `...`: expressions used to create new variables - "Name-value pairs of expressions" - "The name of each argument will be the name of a new variable, and the value will be its corresponding value." - "Use a `NULL` value in mutate to drop a variable." - "New variables overwrite existing variables of the same name" __Value__ - returns a (data frame) object that contains the original input data frame and new variables that were created by `mutate()` ### Investigate `mutate()` syntax ```{r, eval=FALSE, echo=FALSE} ?mutate ``` __Can create variables using standard mathematical or logical operators__ [output omitted] ```{r, results="hide"} #glimpse(school_v2) school_v2 %>% select(state_code,school_type,ncessch,med_inc,num_fr_lunch,total_students,num_took_math) %>% mutate( # each argument creates a new variable, name of argument is name of variable one = 1, med_inc000 = med_inc/1000, pct_fr_lunch = num_fr_lunch/total_students*100, took_math_na = is.na(num_took_math)==1 ) %>% select(state_code,school_type,ncessch,one,med_inc,med_inc000,num_fr_lunch,total_students,pct_fr_lunch,num_took_math,took_math_na) ``` __Can create variables using "helper functions" called within `mutate()`__ [output omitted] - These are standalone functions can be called *within* `mutate()` - e.g., `if_else()`, `recode()`, `case_when()` - will walk through helper functions in more detail in subsequent sections of lecture ```{r, results="hide"} school_v2 %>% select(state_code,ncessch,name,school_type) %>% mutate(public = if_else(school_type == "public", 1, 0)) ``` ### Introduce `mutate()` function New variable not retained unless we __assign__ `<-` it to an object (existing or new) \medskip - __`mutate()` without assignment__ ```{r, results='hide'} school_v2 %>% mutate(pct_fr_lunch = num_fr_lunch/total_students) names(school_v2) ``` \medskip - __`mutate()` with assignment__ ```{r, results="hide"} school_v2_temp <- school_v2 %>% mutate(pct_fr_lunch = num_fr_lunch/total_students) names(school_v2_temp) rm(school_v2_temp) ``` ### `mutate()` can create multiple variables at once `mutate()` can create multiple variables at once ```{r, results='hide'} school_v2 %>% mutate(pct_fr_lunch = num_fr_lunch/total_students, pct_prof_math= num_prof_math/num_took_math) %>% select(num_fr_lunch, total_students, pct_fr_lunch, num_prof_math, num_took_math, pct_prof_math) ``` Or we could write code this way: ```{r, results="hide"} school_v2 %>% select(num_fr_lunch, total_students, num_prof_math, num_took_math) %>% mutate(pct_fr_lunch = num_fr_lunch/total_students, pct_prof_math= num_prof_math/num_took_math) ``` `mutate()` can use variables previously created within `mutate()` ```{r, results="hide"} school_v2 %>% select(num_prof_math, num_took_math, num_took_read,num_prof_read) %>% mutate(pct_prof_math = num_prof_math/num_took_math, pct_prof_read = num_prof_read/num_took_read, avg_pct_prof_math_read = (pct_prof_math + pct_prof_read)/2) ``` ### `mutate()`, removing variables created by `mutate()` Within `mutate()` use syntax `var_name = NULL` to remove variable from data frame - note: Variable not permanently removed from data frame unless you use assignment `<-` to create new data frame or overwrite existing data frame ```{r, results='hide'} ncol(school_v2) school_v2 %>% select(num_prof_math, num_took_math, num_took_read,num_prof_read) %>% glimpse() school_v2 %>% select(num_prof_math, num_took_math, num_took_read,num_prof_read) %>% mutate(num_prof_math = NULL, num_took_math = NULL) %>% glimpse() #But variables not permanently removed because we didn't use assignment ncol(school_v2) ``` Why would we remove variables within `mutate()` rather `select()`? - remove temporary "work" variables used to create desired variable - Example: measure of average of pct who passed math and pct who passed reading ```{r, results='hide'} school_v2 %>% select(num_prof_math, num_took_math, num_took_read,num_prof_read) %>% mutate(pct_prof_math = num_prof_math/num_took_math, # create work var pct_prof_read = num_prof_read/num_took_read, # create work var avg_pct_prof_math_read = (pct_prof_math + pct_prof_read)/2, #create analysis var pct_prof_math = NULL, # remove work var pct_prof_read = NULL) %>% # remove work var glimpse() ``` ### Student exercise using mutate() 1. Using the object `school_v2`, select the following variables (`num_prof_math`, `num_took_math`, `num_prof_read`, `num_took_read`) and create a measure of percent proficient in math `pct_prof_math` and percent proficient in reading `pct_prof_read`. 2. Now using the code for question 1, filter schools where at least 50% of students are proficient in math **&** reading. 3. Count the number of schools from question 2. 4. Using `school_v2`, using `mutate()` combined with `is.na()` create a dichotomous indicator variable `med_inc_na` that identifies whether `med_inc` is missing (`NA`) or not. And then use syntax `count(var_name)` to create frequency table of variable `med_inc_na`. How many observations are missing? ### Solutions for exercise using mutate() 1. Using the object `school_v2`, select the following variables (`num_prof_math`, `num_took_math`, `num_prof_read`, `num_took_read`) and create a measure of percent proficient in math `pct_prof_math` and percent proficient in reading `pct_prof_read`. ```{r} school_v2 %>% select(num_prof_math, num_took_math, num_prof_read, num_took_read) %>% mutate(pct_prof_math = num_prof_math/num_took_math, pct_prof_read = num_prof_read/num_took_read) ``` ### Solutions for exercise using mutate() 2. Now using the code for question 1, filter schools where at least 50% of students are proficient in math **&** reading. ```{r} school_v2 %>% select(num_prof_math, num_took_math, num_prof_read, num_took_read) %>% mutate(pct_prof_math = num_prof_math/num_took_math, pct_prof_read = num_prof_read/num_took_read) %>% filter(pct_prof_math >= 0.5 & pct_prof_read >= 0.5) ``` ### Solutions for exercise using mutate() 3. Count the number of schools from question 2. ```{r} school_v2 %>% select(num_prof_math, num_took_math, num_prof_read, num_took_read) %>% mutate(pct_prof_math = num_prof_math/num_took_math, pct_prof_read = num_prof_read/num_took_read) %>% filter(pct_prof_math >= 0.5 & pct_prof_read >= 0.5) %>% count() ``` ### Solutions for exercise using mutate() 4. Using `school_v2`, using `mutate()` combined with `is.na()` create a dichotomous indicator variable `med_inc_na` that identifies whether `med_inc` is missing (`NA`) or not. And then use syntax `count(var_name)` to create frequency table of variable `med_inc_na`. How many observations are missing? ```{r} school_v2 %>% mutate(med_inc_na = is.na(med_inc)) %>% count(med_inc_na) ``` ## Using if_else() function within mutate() ### Using `if_else()` function within `mutate()` ```{r, eval=FALSE, echo=FALSE} ?if_else ``` __Description__ - if `` `TRUE`, assign value; if `` `FALSE` assign value __Usage (i.e., syntax)__ - `if_else(logical condition, true, false, missing = NULL)` __Arguments__ - `logical condition`: a condition that evaluates to `TRUE` or `FALSE` - `true`: value to assign if condition `TRUE` - `false`: value to assign if condition `FALSE` - `missing`: value to assign to rows that have value `NA` for condition - default is `missing = NULL`; means that if condition is `NA`, then new_var == `NA` - But can assign different values to `NA`s, e.g., `missing = -9` __Value__ - "Where condition is TRUE, the matching value from true, where it's FALSE, the matching value from false, otherwise NA." - Unless otherwise specified, `NA`s in "input" var(s) assigned `NA` in "output var" __Example__: Create 0/1 indicator of whether got at least one visit from Berkeley ```{r, results="hide"} school_v2 %>% mutate(got_visit_berkeley = if_else(visits_by_berkeley>0,1,0)) %>% count(got_visit_berkeley) ``` ### `if_else()` within `mutate()` to create 0/1 indicator variables We often create dichotomous (0/1) indicator variables of whether something happened (or whether something is TRUE) - Variables that are of substantive interest to project - e.g., did student graduate from college - Variables that help you investigate data, check quality - e.g., indicator of whether an observation is missing/non-missing for a particular variable ### Using `if_else()` within `mutate()` __Task__ - Create 0/1 indicator if school has median income greater than $100,000 Usually a good idea to investigate "input" variables __before__ creating analysis vars ```{r, results="hide"} str(school_v2$med_inc) # investigate variable type school_v2 %>% count(med_inc) # frequency count, but this isn't very helpful school_v2 %>% filter(is.na(med_inc)) %>% count() # shows number of obs w/ missing med_inc ``` Create variable ```{r} school_v2 %>% select(med_inc) %>% mutate(inc_gt_100k= if_else(med_inc>100000,1,0)) %>% count(inc_gt_100k) # note how NA values of med_inc treated ``` ### Using `if_else()` within `mutate()` __Task__: - Create 0/1 indicator if school has median income greater than $100,000. This time, let's experiment with the `missing` argument of `if_else()` ```{r, eval=FALSE} #what we wrote before school_v2 %>% select(med_inc) %>% mutate(inc_gt_100k= if_else(med_inc>100000,1,0)) %>% count(inc_gt_100k) #manually write out the default value for `missing` school_v2 %>% select(med_inc) %>% mutate(inc_gt_100k= if_else(med_inc>100000,1,0, missing = NULL)) %>% count(inc_gt_100k) # note how NA values of med_inc treated school_v2 %>% select(med_inc) %>% mutate(inc_gt_100k= if_else(med_inc>100000,1,0, missing = NA_real_)) %>% count(inc_gt_100k) # note how NA values of med_inc treated # NA can be coerced to any other vector type except raw: # NA_integer_, NA_real_, NA_complex_ and NA_character_ # Here we give missing values in condition the value of -9 in new variable school_v2 %>% select(med_inc) %>% mutate(inc_gt_100k= if_else(med_inc>100000,1,0, missing = -9)) %>% count(inc_gt_100k) ``` ### Using `if_else()` function within `mutate()` __Task__ - Create 0/1 indicator variable `nonmiss_math` which indicates whether school has non-missing values for the variable `num_took_math` - note: `num_took_math` refers to number of students at school that took state math proficiency test Usually a good to investigate "input" variables before creating analysis vars ```{r, results="hide"} school_v2 %>% count(num_took_math) # this isn't very helpful school_v2 %>% filter(is.na(num_took_math)) %>% count(num_took_math) # shows number of obs w/ missing med_inc ``` Create variable ```{r} school_v2 %>% select(num_took_math) %>% mutate(nonmiss_math= if_else(!is.na(num_took_math),1,0)) %>% count(nonmiss_math) # note how NA values treated ``` ### Student exercises `if_else()` 1. Using the object `school_v2`, create 0/1 indicator variable `in_state_berkeley` that equals `1` if the high school is in the same state as UC Berkeley (i.e., `state_code=="CA"`). 2. Create 0/1 indicator `berkeley_and_irvine` of whether a school got at least one visit from UC Berkeley __AND__ from UC Irvine. 3. Create 0/1 indicator `berkeley_or_irvine` of whether a school got at least one visit from UC Berkeley __OR__ from UC Irvine. ### Exercise`if_else()` solutions 1. Using the object `school_v2`, create 0/1 indicator variable `in_state_berkeley` that equals `1` if the high school is in the same state as UC Berkeley (i.e., `state_code=="CA"`). ```{r, results="hide"} str(school_v2$state_code) # investigate input variable school_v2 %>% filter(is.na(state_code)) %>% count() # investigate input var #Create var school_v2 %>% mutate(in_state_berkeley=if_else(state_code=="CA",1,0)) %>% count(in_state_berkeley) ``` ### Exercise`if_else()` solutions 2. Create 0/1 indicator `berkeley_and_irvine` of whether a school got at least one visit from UC Berkeley __AND__ from UC Irvine. ```{r, results="hide"} #investigate input vars school_v2 %>% select(visits_by_berkeley, visits_by_irvine) %>% str() school_v2 %>% filter(is.na(visits_by_berkeley)) %>% count() school_v2 %>% filter(is.na(visits_by_irvine)) %>% count() #create variable school_v2 %>% mutate(berkeley_and_irvine=if_else(visits_by_berkeley>0 & visits_by_irvine>0,1,0)) %>% count(berkeley_and_irvine) ``` ### Exercise`if_else()` solutions 3. Create 0/1 indicator `berkeley_or_irvine` of whether a school got at least one visit from UC Berkeley __OR__ from UC Irvine. ```{r, results="hide"} school_v2 %>% mutate(berkeley_or_irvine=if_else(visits_by_berkeley>0 | visits_by_irvine>0,1,0)) %>% count(berkeley_or_irvine) ``` ## Using recode() function within mutate() ### Using `recode()` function within `mutate()` ```{r, eval=FALSE, echo=FALSE} ?recode ``` __Description__: Recode values of a variable __Usage (i.e., syntax)__ - recode(.x, ..., .default = NULL, .missing = NULL) __Arguments__ [see help file for further details] - `.x` A vector (e.g., variable) to modify - `...` Specifications for recode, of form `current_value = new_recoded_value` - `.default`: If supplied, all values not otherwise matched given this value. - `.missing`: If supplied, any missing values in .x replaced by this value. __Example__: Using data frame `wwlist`, create new 0/1 indicator `public_school` from variable `school_type` ```{r, results="hide"} str(wwlist$school_type) wwlist %>% count(school_type) wwlist_temp <- wwlist %>% select(school_type) %>% mutate(public_school = recode(school_type,"public" = 1, "private" = 0)) wwlist_temp %>% head(n=10) str(wwlist_temp$public_school) # note: numeric variable wwlist_temp %>% count(public_school) # note the NAs rm(wwlist_temp) ``` ### Using `recode()` function within `mutate()` Recoding `school_type` could have been accomplished using `if_else()` - Use `recode()` when new variable has more than two categories __Task__: Create `school_catv2` based on `school_category` with these categories: - "regular"; "alternative"; "special"; "vocational" Investigate input var ```{r, results="hide"} str(wwlist$school_category) # character variable wwlist %>% count(school_category) ``` Recode ```{r, results="hide"} wwlist_temp <- wwlist %>% select(school_category) %>% mutate(school_catv2 = recode(school_category, "Alternative Education School" = "alternative", "Alternative/other" = "alternative", "Regular elementary or secondary" = "regular", "Regular School" = "regular", "Special Education School" = "special", "Special program emphasis" = "special", "Vocational Education School" = "vocational") ) str(wwlist_temp$school_catv2) # character variable created wwlist_temp %>% count(school_catv2) rm(wwlist_temp) ``` ### Using `recode()` within `mutate()` __Task__: Create `school_catv2` based on `school_category` with these categories: - "regular"; "alternative"; "special"; "vocational" - This time use the `.missing` argument to recode `NAs` to "unknown" ```{r, results="hide"} wwlist_temp <- wwlist %>% select(school_category) %>% mutate(school_catv2 = recode(school_category, "Alternative Education School" = "alternative", "Alternative/other" = "alternative", "Regular elementary or secondary" = "regular", "Regular School" = "regular", "Special Education School" = "special", "Special program emphasis" = "special", "Vocational Education School" = "vocational", .missing = "unknown") ) str(wwlist_temp$school_catv2) wwlist_temp %>% count(school_catv2) wwlist %>% count(school_category) rm(wwlist_temp) ``` ### Using `recode()` within `mutate()` __Task__: Create `school_catv2` based on `school_category` with these categories: - "regular"; "alternative"; "special"; "vocational" - This time use the `.default` argument to assign the value "regular" ```{r, results="hide"} wwlist_temp <- wwlist %>% select(school_category) %>% mutate(school_catv2 = recode(school_category, "Alternative Education School" = "alternative", "Alternative/other" = "alternative", "Special Education School" = "special", "Special program emphasis" = "special", "Vocational Education School" = "vocational", .default = "regular") ) str(wwlist_temp$school_catv2) wwlist_temp %>% count(school_catv2) wwlist %>% count(school_category) rm(wwlist_temp) ``` ### Using `recode()` within `mutate()` __Task__: Create `school_catv2` based on `school_category` with these categories: - This time create a numeric variable rather than character: - `1` for "regular"; `2` for "alternative"; `3` for "special"; `4` for "vocational" ```{r, results="hide"} wwlist_temp <- wwlist %>% select(school_category) %>% mutate(school_catv2 = recode(school_category, "Alternative Education School" = 2, "Alternative/other" = 2, "Regular elementary or secondary" = 1, "Regular School" = 1, "Special Education School" = 3, "Special program emphasis" = 3, "Vocational Education School" = 4) ) str(wwlist_temp$school_catv2) # note: numeric variable now wwlist_temp %>% count(school_catv2) wwlist %>% count(school_category) rm(wwlist_temp) ``` ### Student exercise using `recode()` within `mutate()` ```{r, results="hide"} load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_event_somevars.RData")) names(df_event) ``` 1. Using object `df_event`, assign new object `df_event_temp` and a numeric variable create `event_typev2` based on `event_type` with these categories: - `1` for "2yr college"; `2` for "4yr college"; `3` for "other"; `4` for "private hs"; `5` for "public hs" 2. This time use the `.default` argument to assign the value `5` for "public hs" ### Exercise using `recode()` within `mutate()` solutions Check input variable ```{r, results="hide"} names(df_event) str(df_event$event_type) df_event %>% count(event_type) ``` ### Exercise using `recode()` within `mutate()` solutions 1. Using object `df_event`, assign new object `df_event_temp` and create a numeric variable `event_typev2` based on `event_type` with these categories: - `1` for "2yr college"; `2` for "4yr college"; `3` for "other"; `4` for "private hs"; `5` for "public hs" ```{r results="hide"} df_event_temp <- df_event %>% select(event_type) %>% mutate(event_typev2 = recode(event_type, "2yr college" = 1, "4yr college" = 2, "other" = 3, "private hs" = 4, "public hs" = 5) ) str(df_event_temp$event_typev2) df_event_temp %>% count(event_typev2) df_event %>% count(event_type) ``` ### Exercise using `recode()` within `mutate()` solutions 2. This time assign the value use the `.default` argument to assign the value `5` for "public hs" ```{r, results="hide"} df_event_temp <- df_event %>% select(event_type) %>% mutate(event_typev2 = recode(event_type, "2yr college" = 1, "4yr college" = 2, "other" = 3, "private hs" = 4, .default = 5) ) str(df_event_temp$event_typev2) df_event_temp %>% count(event_typev2) df_event %>% count(event_type) ``` ## Using case_when() function within mutate() ### Using `case_when()` function within `mutate()` ```{r, eval=FALSE, echo=FALSE} ?case_when ``` `case_when()` useful for creating variable that is a function of multiple "input" variables __Usage (i.e., syntax)__: `case_when(...)` __Arguments__ [from help file; see help file for more details] - `...`: A sequence of two-sided formulas. - The left hand side (LHS) determines which values match this case. - LHS must evaluate to a logical vector. - The right hand side (RHS) provides the replacement value. __Example task__: Using data frame `wwlist` and input vars `state` and `firstgen`, create a 4-category var with following categories: - "instate_firstgen"; "instate_nonfirstgen"; "outstate_firstgen"; "outstate_nonfirstgen" ```{r, results="hide"} wwlist_temp <- wwlist %>% select(state,firstgen) %>% mutate(state_gen = case_when( state == "WA" & firstgen =="Y" ~ "instate_firstgen", state == "WA" & firstgen =="N" ~ "instate_nonfirstgen", state != "WA" & firstgen =="Y" ~ "outstate_firstgen", state != "WA" & firstgen =="N" ~ "outstate_nonfirstgen") ) str(wwlist_temp$state_gen) wwlist_temp %>% count(state_gen) ``` ### Using `case_when()` function within `mutate()` __Task__: Using data frame `wwlist` and input vars `state` and `firstgen`, create a 4-category var Let's take a closer look at how values of inputs are coded into values of outputs ```{r, results="hide"} wwlist %>% select(state,firstgen) %>% str() count(wwlist,state) count(wwlist,firstgen) ``` Create variable ```{r, results="hide"} wwlist_temp <- wwlist %>% select(state,firstgen) %>% mutate(state_gen = case_when( state == "WA" & firstgen =="Y" ~ "instate_firstgen", state == "WA" & firstgen =="N" ~ "instate_nonfirstgen", state != "WA" & firstgen =="Y" ~ "outstate_firstgen", state != "WA" & firstgen =="N" ~ "outstate_nonfirstgen") ) ``` Compare values of input vars to value of output var ```{r, results="hide"} wwlist_temp %>% count(state_gen) wwlist_temp %>% filter(is.na(state)) %>% count(state_gen) wwlist_temp %>% filter(is.na(firstgen)) %>% count(state_gen) wwlist_temp %>% filter(is.na(firstgen) | is.na(state)) %>% count(state_gen) ``` __Take-away__: by default var created by `case_when()` equals `NA` for obs where one of the inputs equals `NA` ### Student exercise using `case_when()` within `mutate()` 1. Using the object `school_v2` and input vars `school_type`, and `state_code` , create a 4-category var `state_type` with following categories: - "instate_public"; "instate_private"; "outstate_public"; "outstate_private" - Note: We are referring to CA as in-state for this example ### Exercise using `case_when()` within `mutate()` solution Investigate ```{r, results="hide"} school_v2 %>% select(state_code,school_type) %>% str() count(school_v2,state_code) school_v2 %>% filter(is.na(state_code)) %>% count() count(school_v2,school_type) school_v2 %>% filter(is.na(school_type)) %>% count() ``` ### Exercise using `case_when()` within `mutate()` solution 1. Using the object `school_v2` and input vars `school_type`, and `state_code` , create a 4-category var `state_type` with following categories: - "instate_public"; "instate_private"; "outstate_public"; "outstate_private" ```{r} school_v2_temp <- school_v2 %>% select(state_code,school_type) %>% mutate(state_type = case_when( state_code == "CA" & school_type == "public" ~ "instate_public", state_code == "CA" & school_type == "private" ~ "instate_private", state_code != "CA" & school_type == "public" ~ "outstate_public", state_code != "CA" & school_type == "private" ~ "outstate_private") ) school_v2_temp %>% count(state_type) #school_v2_temp %>% filter(is.na(state_code)) %>% count(state_type) #no missing #school_v2_temp %>% filter(is.na(school_type)) %>% count(state_type) #no missing ```