--- title: "Lecture 5: Processing across rows" # potentially push to header subtitle: "Managing and Manipulating Data Using R" author: date: fontsize: 8pt classoption: dvipsnames # for colors urlcolor: blue output: beamer_presentation: keep_tex: true toc: false slide_level: 3 theme: default # AnnArbor #colortheme: "dolphin" #fonttheme: "structurebold" highlight: default # Supported styles include "default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn", and "haddock" (specify null to prevent syntax highlighting); push to header df_print: tibble #default # tibble # latex_engine: xelatex # Available engines are pdflatex [default], xelatex, and lualatex; The main reasons you may want to use xelatex or lualatex are: (1) They support Unicode better; (2) It is easier to make use of system fonts. includes: in_header: ../beamer_header.tex #after_body: table-of-contents.txt --- ```{r, echo=FALSE, include=FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE) #comment = "#>" makes it so results from a code chunk start with "#>"; default is "##" ``` ```{r, echo=FALSE, include=FALSE} #THIS CODE DOWNLOADS THE MOST RECENT VERSION OF THE FILE beamder_header.tex AND SAVES IT TO THE DIRECTORY ONE LEVEL UP FROM THIS .RMD LECTURE FILE download.file(url = 'https://raw.githubusercontent.com/ozanj/rclass/master/lectures/beamer_header.tex', destfile = '../beamer_header.tex', mode = 'wb') ``` # Introduction ### Logistics __Required reading for next week:__ - Grolemund and Wickham 5.6 - 5.7 (grouped summaries and mutates) - Xie, Allaire, and Grolemund 4.1 (R Markdown, ioslides presentations) [LINK HERE](https://bookdown.org/yihui/rmarkdown/ioslides-presentation.html) and 4.3 (R Markdown, Beamer presentations) [LINK HERE](https://bookdown.org/yihui/rmarkdown/beamer-presentation.html) - Why? Lectures for this class are `beamer_presentation` output type. - `ioslides_presentation` are the most basic presentation output format for RMarkdown, so learning about `ioslides` will help you understand `beamer` - Any slides from lecture we don't cover ### What we will do today \tableofcontents ```{r, eval=FALSE, echo=FALSE} #Use this if you want TOC to show level 2 headings \tableofcontents #Use this if you don't want TOC to show level 2 headings \tableofcontents[subsectionstyle=hide/hide/hide] ``` ### Libraries we will use today "Load" the package we will use today (output omitted) - __you must run this code chunk__ ```{r, message=FALSE} library(tidyverse) ``` If package not yet installed, then must install before you load. 
Install in "console" rather than .Rmd file - Generic syntax: `install.packages("package_name")` - Install "tidyverse": `install.packages("tidyverse")` Note: when we load package, name of package is not in quotes; but when we install package, name of package is in quotes: - `install.packages("tidyverse")` - `library(tidyverse)` ### Data we will use today Data on off-campus recruiting events by public universities - Object `df_event` - One observation per university, recruiting event ```{r} rm(list = ls()) # remove all objects #load dataset with one obs per recruiting event load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_event_somevars.RData")) #load("../../data/recruiting/recruit_event_allvars.Rdata") ``` ### Processing across observations, introduction Creation of analysis datasets often requires calculations across obs Examples: - You have a dataset with one observation per student-term and want to create a variable of credits attempted per term - You have a dataset with one observation per student-term and want to create a variable of GPA for the semester or cumulative GPA for all semesters - Number of off-campus recruiting events university makes to each state - Average household income at visited versus non-visited high schools __Note__ - in today's lecture, I'll use the terms "observations" and "rows" interchangeably ### Processing across variables vs. processing across observations Visits by UC Berkeley to public high schools ```{r, echo=FALSE} #df_event %>% count(event_type) df_event %>% filter(event_type == "public hs", univ_id == 110635) %>% mutate(pct_fr_lunch = fr_lunch/total_students_pub) %>% rename(tot_stu_pub = total_students_pub, state= event_state) %>% select(school_id, state, tot_stu_pub, fr_lunch, pct_fr_lunch, med_inc) %>% slice(1:5) ``` - So far, we have focused on ``processing across variables'' - Performing calculations across columns (i.e., vars), typically within a row (i.e., observation) - Example: percent free-reduced lunch (above) - Processing across obs (focus of today's lecture) - Performing calculations across rows (i.e., obs), often within a column (i.e., variable) - Example: Average household income of visited high schools, by state # Introduce group_by() and summarise() ### Strategy for teaching processing across obs In `tidyverse` the `group_by()` and `summarise()` functions are the primary means of performing calculations across observations - Usually, processing across observations requires using `group_by()` and `summarise()` together - `group_by()` and `summarise()` usually aren't very useful by themselves (like peanut butter and jelly) How we'll teach: - introduce `group_by()` and `summarise()` separately - goal: you understand what each function does - then we'll combine them ## group_by ### group_by() `group_by()` converts a data frame object into groups. 
After grouping, functions performed on data frame are performed "by group" - part of __dplyr__ package within __tidyverse__; not part of __Base R__ - works best with pipes `%>%` and `summarise()` function [described below] Basic syntax: - `group_by(object, vars to group by separated by commas)` Typically, "group_by" variables are character, factor, or integer variables - Possible "group by" variables in `df_event` data: - university name/id; event type (e.g., public HS, private HS); state __Example__: in `df_event`, create frequency count of `event_type` ```{r, results="hide"} names(df_event) #without group_by() df_event %>% count(event_type) df_event %>% count(instnm) #group_by() university df_event %>% group_by(instnm) %>% count(event_type) ``` ### `group_by()` By itself `group_by()` doesn't do much; it just prints data - Below, group `df_event` data by university, event type, and event state ```{r, results="hide"} #without pipes group_by(df_event, univ_id, event_type, event_state) #with pipes df_event %>% group_by(univ_id, event_type, event_state) ``` But once an object is grouped, all subsequent functions are run separately "by group" ```{r, results="hide"} df_event %>% count() df_event %>% group_by(univ_id) %>% count() df_event %>% group_by(univ_id) %>% count() %>% str() df_event %>% group_by(univ_id, event_type) %>% count() df_event %>% group_by(univ_id, event_type) %>% count() %>% str() df_event %>% group_by(univ_id, event_type, event_state) %>% count() ``` ### Grouping not retained unless you __assign__ it Below, we'll use `class()` function to show whether data frame is grouped - will talk more about `class()` next week, but for now, just think of it as a function that provides information about an object - similar to `typeof()`, but `class()` provides different info about object Grouping is not retained unless you __assign__ it ```{r} class(df_event) df_event_grp <- df_event %>% group_by(univ_id, event_type, event_state) # using pipes class(df_event_grp) ``` Use `ungroup(object)` to un-group grouped data ```{r} df_event_grp <- ungroup(df_event_grp) class(df_event_grp) rm(df_event_grp) ``` ### `group_by()` student exercise 1. Group by "instnm" and get a frequency count. - How many rows and columns do you have? What do the number of rows mean? 1. Now group by "instnm" **and** "event_type" and get a frequency count. - How many rows and columns do you have? What do the number of rows mean? 1. **Bonus:** In the same code chunk, group by "instnm" and "event_type", but this time filter for observations where "med_inc" is greater than 75000 and get a frequency count. ### `group_by()` student exercise solutions 1. Group by "instnm" and get a frequency count. - How many rows and columns do you have? What do the number of rows mean? ```{r} df_event %>% group_by(instnm) %>% count() ``` ### `group_by()` student exercise solutions 2. Now group by "instnm" **and** "event_type" and get a frequency count. - How many rows and columns do you have? What do the number of rows mean? ```{r} df_event %>% group_by(instnm, event_type) %>% count() ``` ### `group_by()` student exercise solutions 3. **Bonus:** Group by "instnm" and "event_type", but this time filter for observations where "med_inc" is greater than 75000 and get a frequency count. 
```{r} df_event %>% group_by(instnm, event_type) %>% filter(med_inc > 75000) %>% count() ``` ## summarise() ### `summarise()` function `summarise()` does calculations across rows; then collapses into single row ```{r, eval=FALSE, echo=FALSE} ?summarise ``` __Usage (i.e., syntax)__: `summarise(.data, ...)` __Arguments__ - `.data`: a data frame; omit if using `summarise()` after pipe `%>%` - `...`: Name-value pairs of summary functions. - The name will be the name of the variable in the result. - Value should be expression that returns a single value like `min(x)`, `n()` __Value__ (what `summarise()` returns/creates) - Object of same class as `.data.`; object will have one obs per "by group" __Useful functions (i.e., "helper functions")__ - Standalone functions called *within* `summarise()`, e.g., `mean()`, `n()` - Count function `n()` takes no arguments; returns number of rows in group __Example__: Count total number of events ```{r, results="hide"} summarise(df_event, num_events=n()) # without pipes sum_object <- df_event %>% summarise(num_events=n()) # using pipes df_event %>% summarise(num_events=n()) # using pipes ``` ### Investigate objects created by `summarise()` __Example__: Count total number of events ```{r, results="hide"} df_event %>% summarise(num_events=n()) df_event %>% summarise(num_events=n()) %>% str() ``` __Example__: What is max value of `med_inc` across all events ```{r, results="hide"} df_event %>% summarise(max_inc=max(med_inc, na.rm = TRUE)) df_event %>% summarise(max_inc=max(med_inc, na.rm = TRUE)) %>% str() ``` __Example__: Count total number of events AND max value of median income ```{r, results="hide"} df_event %>% summarise(num_events=n(), max_inc=max(med_inc, na.rm = TRUE)) df_event %>% summarise(num_events=n(), max_inc=max(med_inc, na.rm = TRUE)) %>% str() ``` __Takeaway__ - by default, objects created by `summarise()` are data frames that contain variables created within `summarise()` and one observation [per "by group"] ### Retaining objects created by `summarise()` Object created by summarise() not retained unless you __assign__ it ```{r} event_temp <- df_event %>% summarise(num_events=n(), mean_inc=mean(med_inc, na.rm = TRUE)) event_temp rm(event_temp) ``` ### `summarise()` student exercise 1. What is the min value of `med_inc` across all events? - Hint: Use min() 1. What is the mean value of `fr_lunch` across all events? - Hint: Use mean() ### `summarise()` student exercise 1. What is min value of `med_inc` across all events? ```{r} df_event %>% summarise(min_med_income = min(med_inc, na.rm = TRUE)) ``` ### `summarise()` student exercise 2. What is the mean value of `fr_lunch` across all events? - Hint: Use mean() ```{r} df_event %>% summarise(mean_fr_lunch = mean(fr_lunch, na.rm = TRUE)) ``` # Combining group_by() and summarise() ### Combining `summarise()` and `group_by` `summarise()` on ungrouped vs. 
grouped data: - By itself, `summarise()` performs calculations across all rows of data frame then collapses the data frame to a single row - When data frame is grouped, `summarise()` performs calculations across rows within a group and then collapses to a single row for each group __Example__: Count the number of events for each university ```{r, results="hide"} df_event %>% summarise(num_events=n()) df_event %>% group_by(instnm) %>% summarise(num_events=n()) ``` - Investigate the object created above ```{r, results="hide"} df_event %>% group_by(instnm) %>% summarise(num_events=n()) %>% str() ``` - Or we could retain object for later use ```{r, results="hide"} event_by_univ <- df_event %>% group_by(instnm) %>% summarise(num_events=n()) str(event_by_univ) event_by_univ # print rm(event_by_univ) ``` ### Combining `summarise()` and `group_by` __Task__ - Count number of recruiting events by event_type for each university ```{r, results="hide"} df_event %>% group_by(instnm, event_type) %>% summarise(num_events=n()) df_event %>% group_by(instnm, event_state, event_type) %>% summarise(num_events=n()) #investigate object created df_event %>% group_by(instnm, event_type) %>% summarise(num_events=n()) %>% str() ``` __Task__ - By university and event type, count the number of events and calculate the avg. pct white in the zip-code ```{r, results="hide"} df_event %>% group_by(instnm, event_type) %>% summarise(num_events=n(), mean_pct_white=mean(pct_white_zip, na.rm = TRUE) ) #investigate object you created df_event %>% group_by(instnm, event_type) %>% summarise(num_events=n(), mean_pct_white=mean(pct_white_zip, na.rm = TRUE) ) %>% str() ``` ### Combining `summarise()` and `group_by` Recruiting events by UC Berkeley ```{r, results="hide"} df_event %>% filter(univ_id == 110635) %>% group_by(event_type) %>% summarise(num_events=n()) ``` Let's create a dataset of recruiting events at UC Berkeley ```{r, results="hide"} event_berk <- df_event %>% filter(univ_id == 110635) event_berk %>% count(event_type) ``` The "char" variable `event_inst` equals "In-State" if event is in same state as the university ```{r} event_berk %>% arrange(event_date) %>% select(pid, event_date, event_type, event_state, event_inst) %>% slice(1:8) ``` ## summarise() and Counts (and logical vectors) ### `summarise()`: Counts The count function `n()` takes no arguments and returns the size of the current group ```{r, results="hide"} event_berk %>% group_by(event_type, event_inst) %>% summarise(num_events=n()) ``` Object not retained unless we __assign__ ```{r, results="hide"} berk_temp <- event_berk %>% group_by(event_type, event_inst) %>% summarise(num_events=n()) berk_temp typeof(berk_temp) str(berk_temp) ``` Because counts are so important, `dplyr` package includes separate `count()` function that can be called outside `summarise()` function ```{r, results="hide"} event_berk %>% group_by(event_type, event_inst) %>% count() berk_temp2 <- event_berk %>% group_by(event_type, event_inst) %>% count() berk_temp == berk_temp2 # TAKEAWAY: these two objects are identical! rm(berk_temp,berk_temp2) ``` ### `summarise()`: count with logical vectors and `sum()` Logical vectors have values `TRUE` and `FALSE`. - When used with numeric functions, `TRUE` converted to 1 and `FALSE` to 0. 
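For example, a quick illustration of this coercion (made-up values, not from our data):

```{r}
TRUE + TRUE                       # logicals coerce to 1/0, so this returns 2
as.integer(c(TRUE, FALSE, TRUE))  # returns 1 0 1
```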
`sum()` is a numeric function that returns the sum of values

```{r, results="hide"}
sum(c(5,10))
sum(c(TRUE,TRUE,FALSE,FALSE))
```

`is.na()` returns `TRUE` if a value is `NA` and otherwise returns `FALSE`

```{r}
is.na(c(5,NA,4,NA))
sum(is.na(c(5,NA,4,NA,5)))
sum(!is.na(c(5,NA,4,NA,5)))
```

Application: How many missing/non-missing obs in a variable [__very important__]

```{r, results="hide"}
event_berk %>% group_by(event_type) %>%
  summarise(
    n_events = n(),
    n_miss_inc = sum(is.na(med_inc)),
    n_nonmiss_inc = sum(!is.na(med_inc)),
    n_nonmiss_fr_lunch = sum(!is.na(fr_lunch))
  )
```

### `summarise()` and count student exercise

Use one code chunk for this exercise. You can tackle it one step at a time and run the entire code chunk once you have answered all parts of the question. Create your own variable names.

1. Using the `event_berk` object, filter observations where `event_state` is VA and group by `event_type`.
1. Use the summarise function to create a variable that represents the count for each `event_type`.
1. Create a variable that represents the number of missing obs for `med_inc`.
1. Create a variable that represents the number of non-missing obs for `med_inc`.
1. **Bonus**: Arrange the variable you created representing the count of each `event_type` in descending order.

### `summarise()` and count student exercise SOLUTION

1. Using the `event_berk` object, filter observations where `event_state` is VA and group by `event_type`.
1. Using the summarise function, create a variable that represents the count for each `event_type`.
1. Now get the number of missing obs for `med_inc`.
1. Now get the number of non-missing obs for `med_inc`.

```{r}
event_berk %>% filter(event_state == "VA") %>%
  group_by(event_type) %>%
  summarise(
    n_events = n(),
    n_miss_inc = sum(is.na(med_inc)),
    n_nonmiss_inc = sum(!is.na(med_inc))) %>%
  arrange(desc(n_events))
```

## summarise() and means

### `summarise()`: means

The `mean()` function within `summarise()` calculates means, separately for each group

```{r}
event_berk %>% group_by(event_inst, event_type) %>%
  summarise(
    n_events = n(),
    mean_inc = mean(med_inc, na.rm = TRUE),
    mean_pct_white = mean(pct_white_zip, na.rm = TRUE))
```

### `summarise()`: means and `na.rm` argument

Default behavior of "aggregation functions" (e.g., `mean()`, `sum()`) used within `summarise()`: if the _input_ has any missing values (`NA`), then the output will be missing.

Many functions have the argument `na.rm` (meaning "remove `NA`s")

- `na.rm = FALSE` [the default for `mean()`]
    - Do not remove missing values from input before calculating
    - Therefore, missing values in input will cause output to be missing
- `na.rm = TRUE`
    - Remove missing values from input before calculating
    - Therefore, missing values in input will not cause output to be missing

```{r, results="hide"}
#na.rm = FALSE; the default setting
event_berk %>% group_by(event_inst, event_type) %>%
  summarise(
    n_events = n(),
    n_miss_inc = sum(is.na(med_inc)),
    mean_inc = mean(med_inc, na.rm = FALSE),
    n_miss_frlunch = sum(is.na(fr_lunch)),
    mean_fr_lunch = mean(fr_lunch, na.rm = FALSE))

#na.rm = TRUE
event_berk %>% group_by(event_inst, event_type) %>%
  summarise(
    n_events = n(),
    n_miss_inc = sum(is.na(med_inc)),
    mean_inc = mean(med_inc, na.rm = TRUE),
    n_miss_frlunch = sum(is.na(fr_lunch)),
    mean_fr_lunch = mean(fr_lunch, na.rm = TRUE))
```
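A related option (a minimal sketch, not the approach used in the rest of this lecture): drop rows with missing `med_inc` *before* summarising, using `filter()` with `is.na()`. Note that this removes those rows from every calculation in the pipeline, so counts like `n()` change too.

```{r, results="hide"}
event_berk %>%
  filter(!is.na(med_inc)) %>%        # keep only rows with non-missing med_inc
  group_by(event_inst, event_type) %>%
  summarise(
    n_events = n(),                  # now counts only rows with non-missing med_inc
    mean_inc = mean(med_inc))        # no na.rm needed; NAs already dropped
```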
### Student exercise

1. Using the `event_berk` object, group by `instnm`, `event_inst`, & `event_type`.
1. Create vars for the number of non-missing obs for these racial/ethnic groups (`pct_white_zip`, `pct_black_zip`, `pct_asian_zip`, `pct_hispanic_zip`, `pct_amerindian_zip`, `pct_nativehawaii_zip`)
1. Create vars for the mean percent for each racial/ethnic group

### Student exercise solutions

```{r}
event_berk %>% group_by(instnm, event_inst, event_type) %>%
  summarise(
    n_events = n(),
    n_nonmiss_white = sum(!is.na(pct_white_zip)),
    mean_white = mean(pct_white_zip, na.rm = TRUE),
    n_nonmiss_black = sum(!is.na(pct_black_zip)),
    mean_black = mean(pct_black_zip, na.rm = TRUE),
    n_nonmiss_asian = sum(!is.na(pct_asian_zip)),
    mean_asian = mean(pct_asian_zip, na.rm = TRUE),
    n_nonmiss_lat = sum(!is.na(pct_hispanic_zip)),
    mean_lat = mean(pct_hispanic_zip, na.rm = TRUE),
    n_nonmiss_na = sum(!is.na(pct_amerindian_zip)),
    mean_na = mean(pct_amerindian_zip, na.rm = TRUE),
    n_nonmiss_nh = sum(!is.na(pct_nativehawaii_zip)),
    mean_nh = mean(pct_nativehawaii_zip, na.rm = TRUE)) %>%
  head(6)
```

## summarise() and logical vectors, part II

### `summarise()`: counts with logical vectors, part II

Logical vectors (e.g., from `is.na()`) are useful for counting obs that satisfy some condition

```{r}
is.na(c(5,NA,4,NA))
typeof(is.na(c(5,NA,4,NA)))
sum(is.na(c(5,NA,4,NA)))
```

__Task__: Using object `event_berk`, create object `gt50p_lat_bl` with the following measures for each combination of `event_type` and `event_inst`:

- count of the number of rows for each group
- count of rows non-missing for both `pct_black_zip` and `pct_hispanic_zip`
- count of the number of visits to communities where Black and Latinx people together comprise more than 50% of the total population

```{r, results="hide"}
gt50p_lat_bl <- event_berk %>% group_by(event_inst, event_type) %>%
  summarise(
    n_events = n(),
    n_nonmiss_latbl = sum(!is.na(pct_black_zip) & !is.na(pct_hispanic_zip)),
    n_majority_latbl = sum(pct_black_zip + pct_hispanic_zip > 50, na.rm = TRUE)
  )

gt50p_lat_bl # print object
str(gt50p_lat_bl)
```

### `summarise()`: logical vectors to count _proportions_

Syntax: `group_by(vars) %>% summarise(prop = mean(TRUE/FALSE condition))`

__Task__: separately for in-state/out-of-state, what proportion of visits to public high schools are to communities with median income greater than $100,000?

Steps:

1. Filter public HS visits
2. Group by in-state vs. out-of-state
3. Create measure

```{r}
event_berk %>%
  filter(event_type == "public hs") %>% # filter public hs visits
  group_by(event_inst) %>% # group by in-state vs. out-of-state
  summarise(
    n_events = n(), # number of events by group
    n_nonmiss_inc = sum(!is.na(med_inc)), # number with non-missing median income
    p_incgt100k = mean(med_inc > 100000, na.rm = TRUE)) # proportion of visits to $100K+ communities
```
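Why `mean()` of a logical gives a proportion: with `na.rm = TRUE`, it is the count of `TRUE`s divided by the count of non-missing values. A quick check with made-up values (not from `df_event`):

```{r}
x <- c(150000, 80000, NA, 120000)  # made-up median incomes
sum(x > 100000, na.rm = TRUE)      # 2 values exceed 100,000
sum(!is.na(x))                     # 3 non-missing values
mean(x > 100000, na.rm = TRUE)     # 2/3 = 0.667
```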
### `summarise()`: logical vectors to count _proportions_

__What if we forgot to put `na.rm=TRUE` in the above task?__

\medskip

__Task__: separately for in-state/out-of-state, what proportion of visits to public high schools are to communities with median income greater than $100,000?

```{r}
event_berk %>%
  filter(event_type == "public hs") %>% # filter public hs visits
  group_by(event_inst) %>% # group by in-state vs. out-of-state
  summarise(
    n_events = n(), # number of events by group
    n_nonmiss_inc = sum(!is.na(med_inc)), # number with non-missing median income
    p_incgt100k = mean(med_inc > 100000)) # proportion of visits to $100K+ communities
```

### `summarise()`: Other "helper" functions

Lots of other functions we can use within `summarise()`

\medskip

Common functions to use with `summarise()`:

| Function | Description |
|----------|-------------|
| `n` | count |
| `n_distinct` | count unique values |
| `mean` | mean |
| `median` | median |
| `max` | largest value |
| `min` | smallest value |
| `sd` | standard deviation |
| `sum` | sum of values |
| `first` | first value |
| `last` | last value |
| `nth` | nth value |
| `any` | condition true for at least one value |

*Note: These functions can also be used on their own or with `mutate()`*

### `summarise()`: Other functions

Maximum value in a group

```{r}
max(c(10,50,8))
```

__Task__: For each combination of in-state/out-of-state and event type, what is the maximum value of `med_inc`?

```{r}
event_berk %>% group_by(event_type, event_inst) %>%
  summarise(max_inc = max(med_inc))

event_berk %>% group_by(event_type, event_inst) %>%
  summarise(max_inc = max(med_inc, na.rm = TRUE))
```

What did we do wrong here?

### `summarise()`: Other functions

Isolate the first/last/nth observation in a group

```{r, results="hide"}
x <- c(10,15,20,25,30)
first(x)
last(x)
nth(x,1)
nth(x,3)
nth(x,10)
```

__Task__: after sorting object `event_berk` by `event_type` and `event_datetime_start`, what is the value of `event_date` for:

- the first event for each event type?
- the last event for each event type?
- the 50th event for each event type?

```{r, results="hide"}
event_berk %>% arrange(event_type, event_datetime_start) %>%
  group_by(event_type) %>%
  summarise(
    n_events = n(),
    date_first = first(event_date),
    date_last = last(event_date),
    date_50th = nth(event_date, 50)
  )
```

### Student exercise

Identify the value of `event_date` for the _nth_ event in each by group

__Specific task__:

- arrange (i.e., sort) by `event_type` and `event_datetime_start`, then group by `event_type`, and then identify the value of `event_date` for:
    - the first event in each by group (`event_type`)
    - the second event in each by group
    - the third event in each by group
    - the fourth event in each by group
    - the fifth event in each by group

### Student exercise solution

```{r}
event_berk %>% arrange(event_type, event_datetime_start) %>%
  group_by(event_type) %>%
  summarise(
    n_events = n(),
    date_1st = first(event_date),
    date_2nd = nth(event_date,2),
    date_3rd = nth(event_date,3),
    date_4th = nth(event_date,4),
    date_5th = nth(event_date,5))
```

## Attach aggregate measures to your data frame

### Attach aggregate measures to your data frame

We can attach aggregate measures to a data frame by using `mutate()` on grouped data, that is, `group_by()` without `summarise()`

What do I mean by "attaching aggregate measures to a data frame"?

- Calculate measures at the by-group level, but attach them to the original object rather than creating an object with one row for each by group

__Task__: Using the `event_berk` data frame, create (1) a measure of average income across all events and (2) a measure of average income for each event type

- resulting object should have the same number of observations as `event_berk`

Steps (see the schematic sketch after these steps):

1. Create measure of avg. income across all events without using `group_by()` or `summarise()` and assign as (new) object
1. Using object from previous step, create measure of avg. income by event type using `group_by()` without `summarise()` and assign as new object
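Schematically, attaching a group-level aggregate is a grouped `mutate()` rather than a `summarise()`. A minimal sketch with placeholder names (`my_df`, `grp`, and `x` are stand-ins, not variables in our data):

```{r, eval=FALSE}
my_df %>%
  group_by(grp) %>%                                # define the by-groups
  mutate(grp_mean_x = mean(x, na.rm = TRUE)) %>%   # one value per group, repeated on every row
  ungroup()                                        # number of rows is unchanged
```

Unlike `summarise()`, which collapses to one row per group, a grouped `mutate()` keeps every row and simply repeats the group-level value within each group.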
### Attach aggregate measures to your data frame

__Task__: Using the `event_berk` data frame, create (1) a measure of average income across all events and (2) a measure of average income for each event type

1. Create measure of average income across all events

```{r, results="hide"}
event_berk_temp <- event_berk %>%
  arrange(event_date) %>% # sort by event_date (optional)
  select(event_date, event_type, med_inc) %>% # select vars to be retained (optional)
  mutate(avg_inc = mean(med_inc, na.rm=TRUE)) # create avg. inc measure

dim(event_berk_temp)
event_berk_temp %>% head(5)
```

2. Create measure of average income by event type

```{r, results="hide"}
event_berk_temp <- event_berk_temp %>%
  group_by(event_type) %>% # group by event type
  mutate(avg_inc_type = mean(med_inc, na.rm=TRUE)) # create avg. inc measure by event type

str(event_berk_temp)
event_berk_temp %>% head(5)
```

### Attach aggregate measures to your data frame

__Task__: Using `event_berk_temp` from the previous task, create a measure that identifies whether the `med_inc` associated with an event is higher/lower than the average income for all events of that type

Steps:

1. Create measure of average income for each event type [already done]
1. Create 0/1 indicator that identifies whether median income at the event location is higher than the average median income for events of that type

```{r, results="hide"}
# indicator for whether event med_inc is above the event-type average
event_berk_tempv2 <- event_berk_temp %>%
  mutate(gt_avg_inc_type = med_inc > avg_inc_type) %>%
  select(-(avg_inc)) # drop avg_inc (optional)

event_berk_tempv2 # note how med_inc = NA is treated
```

Same as above, but this time create an integer indicator rather than a logical

```{r, results="hide"}
event_berk_tempv2 <- event_berk_tempv2 %>%
  mutate(gt_avg_inc_type = as.integer(med_inc > avg_inc_type))

event_berk_tempv2 %>% head(4)
```

### Student exercise

Task: is `pct_white_zip` at a particular event higher or lower than the average `pct_white_zip` for that `event_type`?

- Note: all events are attached to a particular zip code
- `pct_white_zip`: pct of people in that zip code who identify as white

Steps in task:

- Create measure of average pct white for each `event_type`
- Compare whether `pct_white_zip` is higher or lower than this average

### Student exercise solution

Task: is `pct_white_zip` at a particular event higher or lower than the average `pct_white_zip` for that `event_type`?

```{r}
event_berk_tempv3 <- event_berk %>%
  arrange(event_date) %>% # sort by event_date (optional)
  select(event_date, event_type, pct_white_zip) %>% # optional
  group_by(event_type) %>% # group by event type
  mutate(avg_pct_white = mean(pct_white_zip, na.rm=TRUE),
         gt_avg_pctwhite_type = as.integer(pct_white_zip > avg_pct_white))

event_berk_tempv3 %>% head(4)
```
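One way to sanity-check the new indicator is to tabulate it (a quick sketch; note that events with missing `pct_white_zip` get `NA` for the indicator):

```{r}
event_berk_tempv3 %>%
  count(gt_avg_pctwhite_type) # data are still grouped by event_type, so counts are within event type
```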