--- title: "Lecture 5: Processing across rows" # potentially push to header subtitle: "Managing and Manipulating Data Using R" author: date: fontsize: 8pt classoption: dvipsnames # for colors urlcolor: blue output: beamer_presentation: keep_tex: true toc: false slide_level: 3 theme: default # AnnArbor #colortheme: "dolphin" #fonttheme: "structurebold" highlight: default # Supported styles include "default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn", and "haddock" (specify null to prevent syntax highlighting); push to header df_print: tibble #default # tibble # latex_engine: xelatex # Available engines are pdflatex [default], xelatex, and lualatex; The main reasons you may want to use xelatex or lualatex are: (1) They support Unicode better; (2) It is easier to make use of system fonts. includes: in_header: ../beamer_header.tex #after_body: table-of-contents.txt --- ```{r, echo=FALSE, include=FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE) #comment = "#>" makes it so results from a code chunk start with "#>"; default is "##" ``` ```{r, echo=FALSE, include=FALSE} #THIS CODE DOWNLOADS THE MOST RECENT VERSION OF THE FILE beamer_header.tex AND SAVES IT TO THE DIRECTORY ONE LEVEL UP FROM THIS .RMD LECTURE FILE download.file(url = 'https://raw.githubusercontent.com/ozanj/rclass/master/lectures/beamer_header.tex', destfile = '../beamer_header.tex', mode = 'wb') ``` # Introduction ### Logistics __Required reading for next week:__ - Grolemund and Wickham 5.6 - 5.7 (grouped summaries and mutates) - Xie, Allaire, and Grolemund 4.1 (R Markdown, ioslides presentations) [LINK HERE](https://bookdown.org/yihui/rmarkdown/ioslides-presentation.html) and 4.3 (R Markdown, Beamer presentations) [LINK HERE](https://bookdown.org/yihui/rmarkdown/beamer-presentation.html). Why?: \medskip - Lectures for this class are `beamer_presentation` output type.
\medskip - `ioslides_presentation` is the most basic presentation output format for RMarkdown, so learning about `ioslides` will help you understand `beamer` ### What we will do today \tableofcontents ```{r, eval=FALSE, echo=FALSE} #Use this if you want TOC to show level 2 headings \tableofcontents #Use this if you don't want TOC to show level 2 headings \tableofcontents[subsectionstyle=hide/hide/hide] ``` ### Libraries we will use today "Load" the package we will use today (output omitted) - __you must run this code chunk__ ```{r, message=FALSE} library(tidyverse) ``` If package not yet installed, then must install before you load. Install in "console" rather than .Rmd file - Generic syntax: `install.packages("package_name")` - Install "tidyverse": `install.packages("tidyverse")` Note: when we load a package, name of package is not in quotes; but when we install a package, name of package is in quotes: - `install.packages("tidyverse")` - `library(tidyverse)` ### Data we will use today Data on off-campus recruiting events by public universities - Object `df_event` - One observation per university, recruiting event ```{r} rm(list = ls()) # remove all objects #load dataset with one obs per recruiting event load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_event_somevars.RData")) #load("../../data/recruiting/recruit_event_allvars.Rdata") ``` ### Processing across variables vs.
processing across observations Visits by UC Berkeley to public high schools ```{r, echo=FALSE} #df_event %>% count(event_type) df_event %>% filter(event_type == "public hs", univ_id == 110635) %>% mutate(pct_fr_lunch = fr_lunch/total_students_pub) %>% rename(tot_stu_pub = total_students_pub, state = event_state) %>% select(school_id, state, tot_stu_pub, fr_lunch, pct_fr_lunch, med_inc) %>% slice(1:5) ``` \medskip So far, we have focused on ``processing across variables'' - Performing calculations across columns (i.e., vars), typically within a row (i.e., observation) - Example: percent free-reduced lunch (above) \medskip Processing across obs (focus of today's lecture) - Performing calculations across rows (i.e., obs), often within a column (i.e., variable) - Example: Average household income of visited high schools, by state ### Why processing across observations __Note__ - in today's lecture, I'll use the terms "observations" and "rows" interchangeably __Creation of analysis datasets often requires calculations across obs__ Examples: - You have a dataset with one observation per student-term and want to create a variable of credits attempted per term - You have a dataset with one observation per student-term and want to create a variable of GPA for the semester or cumulative GPA for all semesters - Number of off-campus recruiting events university makes to each state - Average household income at visited versus non-visited high schools __Creating graphs and tables of descriptive stats usually requires calculations across obs__ Example: Want to create a graph that shows number of recruiting events by event "type" (e.g., public HS, private HS) for each university - Start with `df_event` dataset that has one observation per university, recruiting event - Create new data frame object that has one observation per university and event type and has variable for number of events - this variable calculated by counting number of rows in each combination of university and event
type - This new data frame object is the input for creating desired graph # Introduce group_by() and summarize() ### Strategy for teaching processing across obs In the `tidyverse`, the `group_by()` and `summarize()` functions are the primary means of performing calculations across observations - Usually, processing across observations requires using `group_by()` and `summarize()` together - `group_by()` typically not very useful by itself - `summarize()` [with or without `group_by()`] can be helpful for creating summary statistics that are the inputs for tables or graphs you create How we'll teach: - introduce `group_by()` and `summarize()` separately - goal: you understand what each function does - then we'll combine them ## group_by() ### group_by() ```{r, eval=FALSE, include=FALSE} ?group_by ``` __Description__: "`group_by()` takes an existing data frame and converts it into a grouped data frame where operations are performed "by group". `ungroup()` removes grouping." - part of __dplyr__ package within __tidyverse__; not part of __Base R__ - works best with pipes `%>%` and `summarize()` function [described below] __Basic syntax__: `group_by(.data, ...)` - `.data` argument refers to name of data frame - `...` argument refers to names of "group_by" variables, separated by commas - Can "group by" one or many variables - Typically, "group_by" variables are character, factor, or integer variables \medskip Possible "group by" variables in `df_event` data: - university name/id; event type (e.g., public HS, private HS); state __Example__: in `df_event`, create frequency count of `event_type` [output omitted] ```{r, results="hide"} names(df_event) #without group_by() df_event %>% count(event_type) df_event %>% count(instnm) #group_by() university df_event %>% group_by(instnm) %>% count(event_type) ``` ### `group_by()` By itself `group_by()` doesn't do much; it just prints data - Below, group `df_event` data by university, event type, and event state ```{r, results="hide"}
#print object df_event #group_by (without pipes) group_by(df_event, univ_id, event_type, event_state) #group_by (with pipes) df_event %>% group_by(univ_id, event_type, event_state) df_event %>% group_by(univ_id, event_type, event_state) %>% glimpse() ``` But once an object is grouped, all subsequent functions are run separately "by group" - recall that `count()` counts number of observations by group ```{r, results="hide"} # count number of observations in group, ungrouped data df_event %>% count() #group by and then count obs df_event %>% group_by(univ_id) %>% count() df_event %>% group_by(univ_id) %>% count() %>% glimpse() df_event %>% group_by(univ_id, event_type) %>% count() df_event %>% group_by(univ_id, event_type) %>% count() %>% glimpse() df_event %>% group_by(univ_id, event_type, event_state) %>% count() #df_event %>% group_by(as_factor(univ_id), as_factor(event_type), as_factor(event_state), .drop = FALSE) %>% count() df_event %>% group_by(univ_id, event_type, event_state) %>% count() %>% glimpse() ``` ### Grouping not retained unless you __assign__ it Below, we'll use `class()` function to show whether data frame is grouped - will talk more about `class()` next week, but for now, just think of it as a function that provides information about an object - similar to `typeof()`, but `class()` provides different info about object \medskip Grouping is not retained unless you __assign__ it ```{r} class(df_event) df_event %>% group_by(univ_id, event_type, event_state) df_event_grp <- df_event %>% group_by(univ_id, event_type, event_state) # using pipes class(df_event_grp) ``` Use `ungroup(object)` to un-group grouped data ```{r} class(df_event_grp) df_event_grp <- ungroup(df_event_grp) class(df_event_grp) rm(df_event_grp) ``` ### `group_by()` student exercise 1. Group by "instnm" and get a frequency count. - How many rows and columns do you have? What does the number of rows mean? 1. Now group by "instnm" **and** "event_type" and get a frequency count.
- How many rows and columns do you have? What does the number of rows mean? 1. **Bonus:** In the same code chunk, group by "instnm" and "event_type", but this time filter for observations where "med_inc" is greater than 75000 and get a frequency count. ### `group_by()` student exercise solutions 1. Group by "instnm" and get a frequency count. - How many rows and columns do you have? What does the number of rows mean? ```{r} df_event %>% group_by(instnm) %>% count() ``` ### `group_by()` student exercise solutions 2. Now group by "instnm" **and** "event_type" and get a frequency count. - How many rows and columns do you have? What does the number of rows mean? ```{r} df_event %>% group_by(instnm, event_type) %>% count() ``` ### `group_by()` student exercise solutions 3. **Bonus:** Group by "instnm" and "event_type", but this time filter for observations where "med_inc" is greater than 75000 and get a frequency count. ```{r} df_event %>% group_by(instnm, event_type) %>% filter(med_inc > 75000) %>% count() ``` ## summarize() ### `summarize()` function ```{r, eval=FALSE, echo=FALSE} ?summarize ``` __Description__: `summarize()` calculates across rows; then collapses into single row - `summarize()` creates scalar vars summarizing variables of existing data frame - if you first group data frame using `group_by()`, `summarize()` creates summary vars separately for each group, returning object with one row per group - if data frame not grouped, `summarize()` will result in one row.
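A minimal sketch of this grouped vs. ungrouped behavior on a small made-up tibble (`df_toy` is hypothetical, just for illustration; it is not part of the lecture data):

```{r}
# toy data: two groups, three rows
df_toy <- tibble(grp = c("a", "a", "b"), x = c(1, 2, 10))

# ungrouped: collapses to a single row
df_toy %>% summarize(n_rows = n(), mean_x = mean(x))

# grouped: collapses to one row per group
df_toy %>% group_by(grp) %>% summarize(n_rows = n(), mean_x = mean(x))
```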
__Syntax__: `summarize(.data, ...)` - `.data`: a data frame; omit if using `summarize()` after pipe `%>%` - `...`: Name-value pairs of summary functions separated by commas - "name" will be the name of new variable you will create - "value" should be expression that returns a single value like `min(x)`, `n()` - variable names do not need to be placed within quotes __Value__ (what `summarize()` returns/creates) - Object of same class as `.data`; object will have one obs per "by group" __Useful functions (i.e., "helper functions")__ - Standalone functions called *within* `summarize()`, e.g., `mean()`, `n()` - e.g., count function `n()` takes no arguments; returns number of rows in group __Example__: Count total number of events (output omitted) ```{r, results="hide"} summarize(df_event, num_events=n()) # without pipes df_event %>% summarize(num_events=n()) # using pipes ``` ### Investigate objects created by `summarize()` __Example__: Count total number of events ```{r, results="hide"} df_event %>% summarize(num_events=n()) df_event %>% summarize(num_events=n()) %>% str() ``` __Example__: What is max value of `med_inc` across all events ```{r, results="hide"} df_event %>% summarize(max_inc=max(med_inc, na.rm = TRUE)) df_event %>% summarize(max_inc=max(med_inc, na.rm = TRUE)) %>% str() ``` __Example__: Count total number of events AND max value of median income ```{r, results="hide"} df_event %>% summarize(num_events=n(), max_inc=max(med_inc, na.rm = TRUE)) df_event %>% summarize(num_events=n(), max_inc=max(med_inc, na.rm = TRUE)) %>% str() #keep object created by summarize df_event_temp <- df_event %>% summarize(num_events=n(), max_inc=max(med_inc, na.rm = TRUE)) df_event_temp rm(df_event_temp) #what if we forgot na.rm = TRUE df_event %>% summarize(num_events=n(), max_inc=max(med_inc, na.rm = FALSE)) ``` __Takeaways__ - by default, objects created by `summarize()` are data frames that contain variables created within `summarize()` and one observation [per "by
group"] - most "helper" functions (e.g., `max()`, `mean()`) have option `na.rm` to keep/remove missing obs before performing calculations - `na.rm = FALSE` (default); don't remove `NAs` prior to calculation - if any obs missing, then result of calculation is `NA` - `na.rm = TRUE`; remove `NAs` prior to calculation ### Retaining objects created by `summarize()` Object created by summarize() not retained unless you __assign__ it ```{r} event_temp <- df_event %>% summarize(num_events=n(), mean_inc=mean(med_inc, na.rm = TRUE)) event_temp rm(event_temp) ``` ### Using [] operator to filter observations within summarize Imagine we want to calculate avg. income, separately for in-state vs. out-of-state visits - first, let's use `filter()` to make sure we can identify in-state vs. out-of-state ```{r} #in state df_event %>% filter(event_state == instst) %>% count() %>% as_vector() #out state df_event %>% filter(event_state != instst) %>% count() %>% as_vector() ``` - calculate mean income for: all events; in-state events; out-of-state events ```{r} df_event %>% summarize(avg_inc = mean(med_inc, na.rm = TRUE), # all events avg_inc_inst = mean(med_inc[event_state == instst], na.rm = TRUE), # in-state avg_inc_outst = mean(med_inc[event_state != instst], na.rm = TRUE) # out-state ) ``` ### Using `summarize()` to create descriptive statistics table Often helpful to use `summarize()` to calculate summary statistics that are the basis for a table of descriptive statistics __Task__: create a table of descriptive statistics about variable `med_inc` - want these measures: number of non-missing obs; mean; standard deviation ```{r, results="hide"} df_event %>% mutate(non_miss_inc = is.na(med_inc)==0) %>% summarize( n = sum(non_miss_inc, na.rm = TRUE), #SAMPLE SIZE all avg_inc = mean(med_inc, na.rm = TRUE), # MEAN std_inc = sd(med_inc, na.rm = TRUE) # STANDARD DEVIATION all events ) ``` __Task__: same as above but separate measures for: all events; in-state; out-of-state
```{r, results="hide"} df_event %>% mutate(non_miss_inc = is.na(med_inc)==0) %>% summarize( n = sum(non_miss_inc, na.rm = TRUE), #SAMPLE SIZE n_inst = sum(non_miss_inc[event_state == instst], na.rm = TRUE), n_outst = sum(non_miss_inc[event_state != instst], na.rm = TRUE), avg_inc = mean(med_inc, na.rm = TRUE), # MEAN avg_inc_inst = mean(med_inc[event_state == instst], na.rm = TRUE), avg_inc_outst = mean(med_inc[event_state != instst], na.rm = TRUE), std_inc = sd(med_inc, na.rm = TRUE), # STANDARD DEVIATION std_inc_inst = sd(med_inc[event_state == instst], na.rm = TRUE), std_inc_outst = sd(med_inc[event_state != instst], na.rm = TRUE) ) ``` ### `summarize()` student exercise 1. What is the min value of `med_inc` across all events? - Hint: Use `min()` 1. What is the mean value of `fr_lunch` across all events? - Hint: Use `mean()` ### `summarize()` student exercise solutions 1. What is the min value of `med_inc` across all events? ```{r} df_event %>% summarize(min_med_income = min(med_inc, na.rm = TRUE)) ``` ### `summarize()` student exercise solutions 2. What is the mean value of `fr_lunch` across all events? - Hint: Use `mean()` ```{r} df_event %>% summarize(mean_fr_lunch = mean(fr_lunch, na.rm = TRUE)) ``` # Combining group_by() and summarize() ### Combining `summarize()` and `group_by()` `summarize()` on ungrouped vs.
grouped data: - By itself, `summarize()` performs calculations across all rows of data frame then collapses the data frame to a single row - When data frame is grouped, `summarize()` performs calculations across rows within a group and then collapses to a single row for each group __Example__: Count the number of events for each university - remember: `df_event` has one observation per university, recruiting event ```{r, results="hide"} df_event %>% summarize(num_events=n()) df_event %>% group_by(instnm) %>% summarize(num_events=n()) ``` - Investigate the object created above ```{r, results="hide"} df_event %>% group_by(instnm) %>% summarize(num_events=n()) %>% str() ``` - Or we could retain object for later use ```{r, results="hide"} event_by_univ <- df_event %>% group_by(instnm) %>% summarize(num_events=n()) str(event_by_univ) event_by_univ # print rm(event_by_univ) ``` ### Combining `summarize()` and `group_by()` __Task__ - Count number of recruiting events by institution and event_type ```{r, results="hide"} df_event %>% group_by(instnm, event_type) %>% summarize(num_events=n()) #investigate object created df_event %>% group_by(instnm, event_type) %>% summarize(num_events=n()) %>% glimpse() ``` Note that data frame object created by `group_by()` and `summarize()` can be input to graph ```{r, eval=FALSE} #bar chart of number of events, all universities combined df_event %>% group_by(instnm, event_type) %>% summarize(num_events=n()) %>% ggplot(aes(x=event_type, y=num_events)) + # plot ylab("Number of events") + xlab("Event type") + geom_col() #bar chart of number of events, separate chart for each university df_event %>% group_by(instnm, event_type) %>% summarize(num_events=n()) %>% ggplot(aes(x=event_type, y=num_events)) + # plot ylab("Number of events") + xlab("Event type") + geom_col() + coord_flip() + facet_wrap(~ instnm) ``` ### Combining `summarize()` and `group_by()` __Task__.
Count number of recruiting events by institution, event_type, and whether event is in- or out-of-state (var=`event_inst`) - Note: in `group_by()`, the optional `.drop` argument controls whether empty groups are dropped; default is `.drop = TRUE` ```{r, results="hide"} df_event %>% group_by(instnm, event_type, event_inst) %>% summarize(num_events=n()) df_event %>% group_by(instnm, event_type, event_inst, .drop = TRUE) %>% summarize(num_events=n()) df_event %>% group_by(as.factor(instnm), as.factor(event_type), as.factor(event_inst), .drop = FALSE) %>% summarize(num_events=n()) %>% arrange(num_events) # .drop=FALSE affects only grouping columns that are coded as factors # combinations that include non-factor grouping variables are still # silently dropped even with .drop=FALSE. ``` Make a graph, showing in/out state as fill color of bar ```{r, eval=FALSE} df_event %>% group_by(instnm, event_type, event_inst) %>% summarize(num_events=n()) %>% ggplot(aes(x=event_type, y=num_events, fill = event_inst)) + # plot ylab("Number of events") + xlab("Event type") + geom_col() + coord_flip() + facet_wrap(~ instnm) ``` ### Combining `summarize()` and `group_by()` __Task__ - By university, event type, event_inst count the number of events and calculate the avg.
pct white in the zip-code ```{r, results="hide"} df_event %>% group_by(instnm, event_type, event_inst) %>% summarize(num_events=n(), mean_pct_white=mean(pct_white_zip, na.rm = TRUE) ) #investigate object you created df_event %>% group_by(instnm, event_type, event_inst) %>% summarize(num_events=n(), mean_pct_white=mean(pct_white_zip, na.rm = TRUE) ) %>% glimpse() ``` ### Combining `summarize()` and `group_by()` Recruiting events by UC Berkeley ```{r, results="hide"} df_event %>% filter(univ_id == 110635) %>% group_by(event_type) %>% summarize(num_events=n()) ``` Let's create a dataset of recruiting events at UC Berkeley ```{r, results="hide"} event_berk <- df_event %>% filter(univ_id == 110635) event_berk %>% count(event_type) ``` ## summarize() and Counts ### `summarize()`: Counts The count function `n()` takes no arguments and returns the size of the current group ```{r, results="hide"} event_berk %>% group_by(event_type, event_inst) %>% summarize(num_events=n()) ``` Because counts are so important, the `dplyr` package includes a separate `count()` function that can be called outside the `summarize()` function ```{r, results="hide"} event_berk %>% group_by(event_type, event_inst) %>% count() ``` ### `summarize()`: count with logical vectors and `sum()` Logical vectors have values `TRUE` and `FALSE`. - When used with numeric functions, `TRUE` is converted to 1 and `FALSE` to 0.
`sum()` is a numeric function that returns the sum of values ```{r, results="hide"} sum(c(5,10)) sum(c(TRUE,TRUE,FALSE,FALSE)) ``` `is.na()` returns `TRUE` if value is `NA` and otherwise returns `FALSE` ```{r} is.na(c(5,NA,4,NA)) sum(is.na(c(5,NA,4,NA,5))) sum(!is.na(c(5,NA,4,NA,5))) ``` Application: How many missing/non-missing obs in variable [__very important__] ```{r, results="hide"} event_berk %>% group_by(event_type) %>% summarize( n_events = n(), n_miss_inc = sum(is.na(med_inc)), n_nonmiss_inc = sum(!is.na(med_inc)), n_nonmiss_fr_lunch = sum(!is.na(fr_lunch)) ) ``` ### `summarize()` and count student exercise Use one code chunk for this exercise. You could tackle this a step at a time and run the entire code chunk when you have answered all parts of this question. Create your own variable names. 1. Using the `event_berk` object, filter observations where `event_state` is VA and group by `event_type`. 1. Using the summarize function, create a variable that represents the count for each `event_type`. 1. Create a variable that represents the sum of missing obs for `med_inc`. 1. Create a variable that represents the sum of non-missing obs for `med_inc`. 1. **Bonus**: Arrange variable you created representing the count of each `event_type` in descending order. ### `summarize()` and count student exercise SOLUTION 1. Using the `event_berk` object, filter observations where `event_state` is VA and group by `event_type`. 1. Using the summarize function, create a variable that represents the count for each `event_type`. 1. Now get the sum of missing obs for `med_inc`. 1. Now get the sum of non-missing obs for `med_inc`.
```{r} event_berk %>% filter(event_state == "VA") %>% group_by(event_type) %>% summarize( n_events = n(), n_miss_inc = sum(is.na(med_inc)), n_nonmiss_inc = sum(!is.na(med_inc))) %>% arrange(desc(n_events)) ``` ## summarize() and means ### `summarize()`: means The `mean()` function within `summarize()` calculates means, separately for each group ```{r} event_berk %>% group_by(event_inst, event_type) %>% summarize( n_events=n(), mean_inc=mean(med_inc, na.rm = TRUE), mean_pct_white=mean(pct_white_zip, na.rm = TRUE)) ``` ### `summarize()`: means and `na.rm` argument Default behavior of "aggregation functions" (e.g., `mean()`, `sum()`) - if _input_ has any missing values (`NA`), then output will be missing. Many functions have argument `na.rm` (means "remove `NAs`") - `na.rm = FALSE` [the default for `mean()`] - Do not remove missing values from input before calculating - Therefore, missing values in input will cause output to be missing - `na.rm = TRUE` - Remove missing values from input before calculating - Therefore, missing values in input will not cause output to be missing ```{r, results="hide"} #na.rm = FALSE; the default setting event_berk %>% group_by(event_inst, event_type) %>% summarize( n_events=n(), n_miss_inc = sum(is.na(med_inc)), mean_inc=mean(med_inc, na.rm = FALSE), n_miss_frlunch = sum(is.na(fr_lunch)), mean_fr_lunch=mean(fr_lunch, na.rm = FALSE)) #na.rm = TRUE event_berk %>% group_by(event_inst, event_type) %>% summarize( n_events=n(), n_miss_inc = sum(is.na(med_inc)), mean_inc=mean(med_inc, na.rm = TRUE), n_miss_frlunch = sum(is.na(fr_lunch)), mean_fr_lunch=mean(fr_lunch, na.rm = TRUE)) ``` ### Student exercise 1. Using the `event_berk` object, group by `instnm`, `event_inst`, & `event_type`. 1. Create vars for number of non-missing obs for these racial/ethnic groups (`pct_white_zip`, `pct_black_zip`, `pct_asian_zip`, `pct_hispanic_zip`, `pct_amerindian_zip`, `pct_nativehawaii_zip`) 1.
Create vars for mean percent for each racial/ethnic group ### Student exercise solutions ```{r} event_berk %>% group_by(instnm, event_inst, event_type) %>% summarize( n_events=n(), n_nonmiss_white = sum(!is.na(pct_white_zip)), mean_white = mean(pct_white_zip, na.rm = TRUE), n_nonmiss_black = sum(!is.na(pct_black_zip)), mean_black = mean(pct_black_zip, na.rm = TRUE), n_nonmiss_asian = sum(!is.na(pct_asian_zip)), mean_asian = mean(pct_asian_zip, na.rm = TRUE), n_nonmiss_lat = sum(!is.na(pct_hispanic_zip)), mean_lat = mean(pct_hispanic_zip, na.rm = TRUE), n_nonmiss_na = sum(!is.na(pct_amerindian_zip)), mean_na = mean(pct_amerindian_zip, na.rm = TRUE), n_nonmiss_nh = sum(!is.na(pct_nativehawaii_zip)), mean_nh = mean(pct_nativehawaii_zip, na.rm = TRUE)) %>% head(6) ``` ## summarize() and logical vectors, part II ### `summarize()`: counts with logical vectors, part II Logical vectors (e.g., `is.na()`) useful for counting obs that satisfy some condition ```{r} is.na(c(5,NA,4,NA)) typeof(is.na(c(5,NA,4,NA))) sum(is.na(c(5,NA,4,NA))) ``` __Task__: Using object `event_berk`, calculate the following measures for each combination of `event_type` and `event_inst`: - count of number of rows for each group - count of rows non-missing for both `pct_black_zip` and `pct_hispanic_zip` - count of number of visits to communities where the `sum` of Black and Latinx people comprise more than 50% of the total population ```{r, results="hide"} event_berk %>% group_by(event_inst, event_type) %>% summarize( n_events=n(), n_nonmiss_latbl = sum(!is.na(pct_black_zip) & !is.na(pct_hispanic_zip)), n_majority_latbl = sum(pct_black_zip + pct_hispanic_zip > 50, na.rm = TRUE) ) ``` ### `summarize()`: logical vectors to count _proportions_ Syntax: `group_by(vars) %>% summarize(prop = mean(TRUE/FALSE condition))` __Task__: separately for in-state/out-of-state, what proportion of visits to public high schools are to communities with median income greater than $100,000? Steps: 1. Filter public HS visits 2.
group by in-state vs. out-of-state 3. Create measure ```{r} event_berk %>% filter(event_type == "public hs") %>% # filter public hs visits group_by(event_inst) %>% # group by in-state vs. out-of-state summarize( n_events=n(), # number of events by group n_nonmiss_inc = sum(!is.na(med_inc)), # w/ nonmissing values of median inc p_incgt100k = mean(med_inc>100000, na.rm=TRUE)) # proportion visits to $100K+ communities ``` ### `summarize()`: logical vectors to count _proportions_ __What if we forgot to put `na.rm=TRUE` in the above task?__ \medskip __Task__: separately for in-state/out-of-state, what proportion of visits to public high schools are to communities with median income greater than $100,000? ```{r} event_berk %>% filter(event_type == "public hs") %>% # filter public hs visits group_by(event_inst) %>% # group by in-state vs. out-of-state summarize( n_events=n(), # number of events by group n_nonmiss_inc = sum(!is.na(med_inc)), # w/ nonmissing values of median inc p_incgt100k = mean(med_inc>100000)) # proportion visits to $100K+ communities ``` ### `summarize()`: Other "helper" functions Lots of other functions we can use within `summarize()` \medskip Common functions to use with `summarize()`: | Function | Description | |----------|-------------| | `n` | count | | `n_distinct` | count unique values | | `mean` | mean | | `median` | median | | `max` | largest value | | `min` | smallest value | | `sd` | standard deviation | | `sum` | sum of values | | `first` | first value | | `last` | last value | | `nth` | nth value | | `any` | condition true for at least one value | *Note: These functions can also be used on their own or with `mutate()`* ### `summarize()`: Other functions Maximum value in a group ```{r} max(c(10,50,8)) ``` __Task__: For each combination of in-state/out-of-state and event type, what is the maximum value of `med_inc`?
```{r} event_berk %>% group_by(event_type, event_inst) %>% summarize(max_inc = max(med_inc)) # oops, we forgot to remove NAs! event_berk %>% group_by(event_type, event_inst) %>% summarize(max_inc = max(med_inc, na.rm = TRUE)) ``` What did we do wrong here? ### `summarize()`: Other functions Isolate first/last/nth observation in a group ```{r, results="hide"} x <- c(10,15,20,25,30) first(x) last(x) nth(x,1) nth(x,3) nth(x,10) ``` __Task__: after sorting object `event_berk` by `event_type` and `event_datetime_start`, what is the value of `event_date` for: - the first event for each event type? - the last event for each event type? - the 50th event for each event type? ```{r, results="hide"} event_berk %>% arrange(event_type, event_datetime_start) %>% group_by(event_type) %>% summarize( n_events = n(), date_first = first(event_date), date_last = last(event_date), date_50th = nth(event_date, 50) ) ``` ### Student exercise Identify value of `event_date` for the _nth_ event in each by group __Specific task__: - arrange (i.e., sort) by `event_type` and `event_datetime_start`, then group by `event_type`, and then identify the value of `event_date` for: - the first event in each by group (`event_type`) - the second event in each by group - the third event in each by group - the fourth event in each by group - the fifth event in each by group ### Student exercise solution ```{r} event_berk %>% arrange(event_type, event_datetime_start) %>% group_by(event_type) %>% summarize( n_events = n(), date_1st = first(event_date), date_2nd = nth(event_date,2), date_3rd = nth(event_date,3), date_4th = nth(event_date,4), date_5th = nth(event_date,5)) ``` # Summarize multiple columns ### What are "scoped" variants of a function? ```{r, eval=FALSE, include=FALSE} ?scoped ``` "Scoped" variants of a function apply the function to a selection of variables. \medskip Three kinds of scoped variants exist: 1. Verbs (i.e., functions) suffixed with `_all()` apply an operation on all variables.
- e.g.: `summarize_all()`, `mutate_all()` 1. Verbs suffixed with `_at()` (e.g., `summarize_at()`) apply an operation on a subset of variables specified with quoting function `vars()`. - This quoting function accepts helper functions like `starts_with()` 1. Verbs suffixed with `_if()` apply an operation on the subset of variables for which a predicate function returns TRUE. \medskip Arguments of scoped variants - `.tbl` A tbl object (data frame) - `.funs` specifies which function(s) to perform (e.g., calculate mean) - Argument values: A function `fun`; a quosure style lambda ~ fun(.); or a list of either form (e.g., `list(mean, min, max)`). - `.vars` which variables to apply function to: - argument values: A list of columns generated by `vars()`, a character vector of column names, a numeric vector of column positions, or `NULL`. - `.predicate` A predicate function to be applied to the columns or a logical vector. The variables for which .predicate is or returns `TRUE` are selected. - `...` Additional arguments for function calls in `.funs`, evaluated once w/ tidy dots support ### What are "scoped" variants of a function? ```{r, eval=FALSE, include=FALSE} ?scoped ``` Why/when use "scoped" variants of a function - When you want to perform an operation on multiple variables without naming each individual variable \medskip "verbs" (i.e., functions) from the `dplyr` package that have scoped variants `_all()`, `_at()`, and `_if()` - `mutate()` and `transmute()` [see `?mutate_all`] - `summarize()` [see `?summarize_all`] - `filter()` - `group_by()` - `rename()` and `select()` - `arrange()` ### Scoped variants of summarize() ```{r, eval=FALSE, include=FALSE} ?summarize_all ``` __Description__: The "scoped variants" of `summarize()` apply `summarize()` to multiple variables.
Three variants: - `summarize_all()` affects every variable - `summarize_at()` affects variables selected with a character vector or `vars()` - `summarize_if()` affects variables selected with a predicate function __Syntax__ - `summarize_all(.tbl, .funs, ...)` - `summarize_at(.tbl, .vars, .funs, ...)` - `summarize_if(.tbl, .predicate, .funs, ...)` __Arguments__ - `.tbl` A tbl object (data frame) - `.funs` specifies which function(s) to perform (e.g., calculate mean) - Argument values: A function `fun`; a quosure style lambda ~ fun(.); or a list of either form (e.g., `list(mean, min, max)`). - `.vars` which variables to apply function to: - argument values: A list of columns generated by `vars()`, a character vector of column names, a numeric vector of column positions, or `NULL`. - `.predicate` A predicate function to be applied to the columns or a logical vector. The variables for which .predicate is or returns `TRUE` are selected. - `...` Additional arguments for the function calls in `.funs`. - These are evaluated only once, with tidy dots support. ### summarize_all() affects every variable ```{r, eval=FALSE, include=FALSE} ?summarize_all ``` Syntax: `summarize_all(.tbl, .funs, ...)` - `.tbl` A tbl object (data frame) - `.funs` specifies which function(s) to perform. Argument values: - A function `fun`; a quosure style lambda ~ fun(.); a list (e.g., `list(mean, min)`) - `...` Additional arguments for function calls in `.funs`. These are evaluated once __Task__: - For U.
Pittsburgh (`univ_id = 215293`) events at public and private high schools, calculate the __mean__ value of `med_inc` and `pct_white_zip` for each combination of `event_type` and `event_inst`

```{r, results="hide"}
df_event %>%
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst, med_inc, pct_white_zip) %>%
  group_by(event_type, event_inst) %>%
  summarize_all(.funs = mean)
```

\medskip
Try again, this time applying `na.rm = TRUE` - this is an example of a `...` argument "for the function calls in `.funs`."

```{r, results="hide"}
df_event %>%
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst, med_inc, pct_white_zip) %>%
  group_by(event_type, event_inst) %>%
  summarize_all(.funs = mean, na.rm = TRUE)
```

### summarize_all() affects every variable

```{r, eval=FALSE, include=FALSE}
?summarize_all
```

Syntax: `summarize_all(.tbl, .funs, ...)`

- `.tbl` A tbl object (data frame)
- `.funs` specifies which function(s) to perform. Argument values:
    - A function `fun`; a quosure style lambda `~ fun(.)`; a list (e.g., `list(mean, min)`)
- `...` Additional arguments for function calls in `.funs`. These are evaluated once

__Task__:

- For U.
Pittsburgh (`univ_id = 215293`) events at public and private high schools, calculate the __mean__ and __standard deviation__ of `med_inc` and `pct_white_zip` for each combination of `event_type` and `event_inst`

```{r, results="hide"}
df_event %>%
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst, med_inc, pct_white_zip) %>%
  group_by(event_type, event_inst) %>%
  summarize_all(.funs = list(mean, sd), na.rm = TRUE)
```

Use this syntax to control variable name suffixes:

- `.funs = list(var_name_suffix = function_name, ...)`

```{r, results="hide"}
df_event %>%
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst, med_inc, pct_white_zip) %>%
  group_by(event_type, event_inst) %>%
  summarize_all(.funs = list(avg = mean, std = sd), na.rm = TRUE)
```

### summarize_all() affects every variable

__Task__:

- Same task as before, but now calculate the __mean__, __standard deviation__, __min__, and __max__ of `med_inc` and `pct_white_zip` for each combination of `event_type` and `event_inst`

\medskip
```{r}
df_event %>%
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst, med_inc, pct_white_zip) %>%
  group_by(event_type, event_inst) %>%
  summarize_all(.funs = list(avg = mean, std = sd, low = min, high = max),
    na.rm = TRUE)
```

### summarize_all(), quosure style lambdas ~ func_name(.)

Syntax: `summarize_all(.tbl, .funs, ...)`

- `.funs` specifies which function(s) to perform.
Argument values:

- A function `fun`; a quosure style lambda `~ fun(.)`; a list (e.g., `list(mean, min)`)

\medskip
__Task__: Calculate the mean, number of obs, and number of non-missing obs for each variable

- Functions you specify within `.funs` require different options (e.g., some require `na.rm = TRUE` but others don't take arguments)
- Within the `.funs` argument, specify functions using a ["quosure style lambda"](https://adv-r.hadley.nz/evaluation.html)
    - Syntax: `.funs = list(~ func_name(., options), ~ func_name(., options))`

\medskip
```{r, results="hide"}
df_event %>%
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst, med_inc, pop_total) %>%
  group_by(event_type, event_inst) %>%
  summarize_all(.funs = list(~ mean(., na.rm = TRUE), ~ n(), ~ sum(!is.na(.))))
```

\medskip
Specify the suffix of the variable name

```{r, results="hide"}
df_event %>%
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst, med_inc, pop_total) %>%
  group_by(event_type, event_inst) %>%
  summarize_all(.funs = list(avg = ~ mean(., na.rm = TRUE), nrow = ~ n(),
    n_nonmiss = ~ sum(!is.na(.))))
```

### summarize_at() affects selected variables

Syntax: `summarize_at(.tbl, .vars, .funs, ...)`

- `.tbl` A tbl object (data frame)
- `.vars` which variables to operate on. Argument values:
    - A list of columns generated by `vars()`, a character vector of column names, a numeric vector of column positions, or `NULL`.
- `.funs` specifies which function(s) to perform. Argument values:
    - A function `fun`; a quosure style lambda `~ fun(.)`; a list (e.g., `list(mean, min)`)
- `...` Additional arguments for function calls in `.funs`. These are evaluated once

\medskip
__Task__: For U.
Pittsburgh events at public and private high schools, calculate the __mean__, __min__, and __max__ of the variables `med_inc` and `event_date` for each combination of `event_type` and `event_inst`

```{r, results="hide"}
df_event %>%
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  group_by(event_type, event_inst) %>%
  summarize_at(.vars = vars(med_inc, event_date),
    .funs = list(avg = mean, low = min, high = max), na.rm = TRUE)
```

Alternative:

```{r, results="hide"}
df_event %>%
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  group_by(event_type, event_inst) %>%
  summarize_at(.vars = c("med_inc","event_date"),
    .funs = list(avg = mean, low = min, high = max), na.rm = TRUE)
```

### summarize_if() affects variables that satisfy some condition

Useful if you want to apply functions to variables of a particular `type` or `class`

\medskip
Syntax: `summarize_if(.tbl, .predicate, .funs, ...)`

- `.tbl` A tbl object (data frame)
- `.predicate` A predicate function to be applied to columns or a logical vector. The variables for which `.predicate` is or returns `TRUE` are selected.
- `.funs` specifies which function(s) to perform.
- `...` Additional arguments for function calls in `.funs`.

\bigskip
__Task__: For events by U.
Pittsburgh at public and private high schools, calculate the mean and standard deviation for __numeric variables__

```{r, results="hide"}
#First, which vars are numeric
df_event %>%
  select(event_type, event_inst, instnm, school_id, med_inc, pct_white_zip) %>%
  glimpse()

df_event %>%
  filter(univ_id == 215293, event_type %in% c("private hs","public hs")) %>%
  select(event_type, event_inst, instnm, school_id, med_inc, pct_white_zip) %>%
  group_by(event_type, event_inst) %>%
  summarize_if(.predicate = is.numeric, .funs = list(avg = mean, std = sd),
    na.rm = TRUE)
```

# Attach aggregate measures to your data frame

### Attach aggregate measures to your data frame

We can attach aggregate measures to a data frame by using `group_by()` without `summarize()`

What do I mean by "attaching aggregate measures to a data frame"?

- Calculate measures at the by-group level, but attach them to the original object rather than creating an object with one row for each by-group

__Task__: Using the `event_berk` data frame, create (1) a measure of average income across all events and (2) a measure of average income for each event type

- The resulting object should have the same number of observations as `event_berk`

Steps:

1. Create measure of avg. income across all events without using `group_by()` or `summarize()` and assign as (new) object
1. Using object from previous step, create measure of avg. income by event type using `group_by()` without `summarize()` and assign as new object

### Attach aggregate measures to your data frame

__Task__: Using the `event_berk` data frame, create (1) a measure of average income across all events and (2) a measure of average income for each event type

1. Create measure of average income across all events

```{r, results="hide"}
event_berk_temp <- event_berk %>%
  arrange(event_date) %>% # sort by event_date (optional)
  select(event_date, event_type, med_inc) %>% # select vars to be retained (optional)
  mutate(avg_inc = mean(med_inc, na.rm=TRUE)) # create avg. inc measure

dim(event_berk_temp)
event_berk_temp %>% head(5)
```

2. Create measure of average income by event type

```{r, results="hide"}
event_berk_temp <- event_berk_temp %>%
  group_by(event_type) %>% # grouping by event type
  mutate(avg_inc_type = mean(med_inc, na.rm=TRUE)) # create avg. inc measure

str(event_berk_temp)
event_berk_temp %>% head(5)
```

### Attach aggregate measures to your data frame

__Task__: Using `event_berk_temp` from the previous question, create a measure that identifies whether the `med_inc` associated with the event is higher/lower than the average income for all events of that type

Steps:

1. Create measure of average income for each event type [already done]
1. Create 0/1 indicator that identifies whether median income at the event location is higher than the average median income for events of that type

```{r, results="hide"}
# indicator: is med_inc at the event higher than avg income for that event type?
event_berk_tempv2 <- event_berk_temp %>%
  mutate(gt_avg_inc_type = med_inc > avg_inc_type) %>%
  select(-(avg_inc)) # drop avg_inc (optional)

event_berk_tempv2 # note how med_inc = NA are treated
```

Same as above, but this time create an integer indicator rather than a logical

```{r, results="hide"}
event_berk_tempv2 <- event_berk_tempv2 %>%
  mutate(gt_avg_inc_type = as.integer(med_inc > avg_inc_type))
event_berk_tempv2 %>% head(4)
```

### Student exercise

Task: is `pct_white_zip` at a particular event higher or lower than the average `pct_white_zip` for that `event_type`?

- Note: all events are attached to a particular zip code
- `pct_white_zip`: pct of people in that zip code who identify as white

Steps in task:

- Create measure of average pct white for each `event_type`
- Compare whether `pct_white_zip` is higher or lower than this average

### Student exercise solution

Task: is `pct_white_zip` at a particular event higher or lower than the average `pct_white_zip` for that `event_type`?
```{r}
event_berk_tempv3 <- event_berk %>%
  arrange(event_date) %>% # sort by event_date (optional)
  select(event_date, event_type, pct_white_zip) %>% # optional
  group_by(event_type) %>% # grouping by event type
  mutate(avg_pct_white = mean(pct_white_zip, na.rm=TRUE),
    gt_avg_pctwhite_type = as.integer(pct_white_zip > avg_pct_white))
event_berk_tempv3 %>% head(4)
```
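
### Scoped variants: self-contained recap

The same `_all()` / `_at()` / `_if()` pattern works on any data frame. A minimal sketch using the built-in `mtcars` data (rather than this lecture's `df_event` object), including the `starts_with()` helper inside `vars()`, which the slides mention but do not demonstrate:

```{r, results="hide"}
library(dplyr)

# summarize_all(): mean of every (non-grouping) variable, by number of cylinders
mtcars %>% group_by(cyl) %>% summarize_all(.funs = mean)

# summarize_at(): only variables selected with vars() and the starts_with() helper
mtcars %>% group_by(cyl) %>%
  summarize_at(.vars = vars(starts_with("d")), .funs = mean) # disp, drat

# summarize_if(): only variables for which the predicate function returns TRUE
mtcars %>% group_by(cyl) %>%
  summarize_if(.predicate = is.numeric, .funs = mean)
```

Because every column of `mtcars` is numeric, the `_all()` and `_if()` calls return the same result here; they would differ on a data frame with character columns, like `df_event`.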