--- title: "Lecture 5: Processing across rows" # potentially push to header subtitle: "Managing and Manipulating Data Using R" author: date: fontsize: 8pt classoption: dvipsnames # for colors urlcolor: blue output: beamer_presentation: keep_tex: true toc: false slide_level: 3 theme: default # AnnArbor #colortheme: "dolphin" #fonttheme: "structurebold" highlight: default # Supported styles include "default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn", and "haddock" (specify null to prevent syntax highlighting); push to header df_print: tibble #default # tibble # latex_engine: xelatex # Available engines are pdflatex [default], xelatex, and lualatex; The main reasons you may want to use xelatex or lualatex are: (1) They support Unicode better; (2) It is easier to make use of system fonts. includes: in_header: ../beamer_header.tex #after_body: table-of-contents.txt --- ```{r, echo=FALSE, include=FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE) #comment = "#>" makes it so results from a code chunk start with "#>"; default is "##" ``` ```{r, echo=FALSE, include=FALSE} #THIS CODE DOWNLOADS THE MOST RECENT VERSION OF THE FILE beamder_header.tex AND SAVES IT TO THE DIRECTORY ONE LEVEL UP FROM THIS .RMD LECTURE FILE download.file(url = 'https://raw.githubusercontent.com/ozanj/rclass/master/lectures/beamer_header.tex', destfile = '../beamer_header.tex', mode = 'wb') ``` # Introduction ### Logistics __Required reading for next week:__ - Grolemund and Wickham 5.6 - 5.7 (grouped summaries and mutates) - Xie, Allaire, and Grolemund 4.1 (R Markdown, ioslides presentations) [LINK HERE](https://bookdown.org/yihui/rmarkdown/ioslides-presentation.html) and 4.3 (R Markdown, Beamer presentations) [LINK HERE](https://bookdown.org/yihui/rmarkdown/beamer-presentation.html) - Why? Lectures for this class are `beamer_presentation` output type. - `ioslides_presentation` are the most basic presentation output format for RMarkdown, so learning about `ioslides` will help you understand `beamer` - Any slides from lecture we don't cover ### What we will do today \tableofcontents ```{r, eval=FALSE, echo=FALSE} #Use this if you want TOC to show level 2 headings \tableofcontents #Use this if you don't want TOC to show level 2 headings \tableofcontents[subsectionstyle=hide/hide/hide] ``` ### Libraries we will use today "Load" the package we will use today (output omitted) - __you must run this code chunk__ ```{r, message=FALSE} library(tidyverse) ``` If package not yet installed, then must install before you load. 
Install in "console" rather than .Rmd file - Generic syntax: `install.packages("package_name")` - Install "tidyverse": `install.packages("tidyverse")` Note: when we load package, name of package is not in quotes; but when we install package, name of package is in quotes: - `install.packages("tidyverse")` - `library(tidyverse)` ### Data we will use today Data on off-campus recruiting events by public universities - Object `df_event` - One observation per university, recruiting event ```{r} rm(list = ls()) # remove all objects #load dataset with one obs per recruiting event load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_event_somevars.RData")) #load("../../data/recruiting/recruit_event_allvars.Rdata") ``` ### Processing across observations, introduction Creation of analysis datasets often requires calculations across obs Examples: - You have a dataset with one observation per student-term and want to create a variable of credits attempted per term - You have a dataset with one observation per student-term and want to create a variable of GPA for the semester or cumulative GPA for all semesters - Number of off-campus recruiting events university makes to each state - Average household income at visited versus non-visited high schools __Note__ - in today's lecture, I'll use the terms "observations" and "rows" interchangeably ### Processing across variables vs. processing across observations Visits by UC Berkeley to public high schools ```{r, echo=FALSE} #df_event %>% count(event_type) df_event %>% filter(event_type == "public hs", univ_id == 110635) %>% mutate(pct_fr_lunch = fr_lunch/total_students_pub) %>% rename(tot_stu_pub = total_students_pub, state= event_state) %>% select(school_id, state, tot_stu_pub, fr_lunch, pct_fr_lunch, med_inc) %>% slice(1:5) ``` - So far, we have focused on ``processing across variables'' - Performing calculations across columns (i.e., vars), typically within a row (i.e., observation) - Example: percent free-reduced lunch (above) - Processing across obs (focus of today's lecture) - Performing calculations across rows (i.e., obs), often within a column (i.e., variable) - Example: Average household income of visited high schools, by state # Introduce group_by() and summarise() ### Strategy for teaching processing across obs In `tidyverse` the `group_by()` and `summarise()` functions are the primary means of performing calculations across observations - Usually, processing across observations requires using `group_by()` and `summarise()` together - `group_by()` and `summarise()` usually aren't very useful by themselves (like peanut butter and jelly) How we'll teach: - introduce `group_by()` and `summarise()` separately - goal: you understand what each function does - then we'll combine them ## group_by ### group_by() `group_by()` converts a data frame object into groups. 
After grouping, functions performed on data frame are performed "by group" - part of __dplyr__ package within __tidyverse__; not part of __Base R__ - works best with pipes `%>%` and `summarise()` function [described below] Basic syntax: - `group_by(object, vars to group by separated by commas)` Typically, "group_by" variables are character, factor, or integer variables - Possible "group by" variables in `df_event` data: - university name/id; event type (e.g., public HS, private HS); state __Example__: in `df_event`, create frequency count of `event_type` ```{r, results="hide"} names(df_event) #without group_by() df_event %>% count(event_type) df_event %>% count(instnm) #group_by() university df_event %>% group_by(instnm) %>% count(event_type) ``` ### `group_by()` By itself `group_by()` doesn't do much; it just prints data - Below, group `df_event` data by university, event type, and event state ```{r, results="hide"} #without pipes group_by(df_event, univ_id, event_type, event_state) #with pipes df_event %>% group_by(univ_id, event_type, event_state) ``` But once an object is grouped, all subsequent functions are run separately "by group" ```{r, results="hide"} df_event %>% count() df_event %>% group_by(univ_id) %>% count() df_event %>% group_by(univ_id) %>% count() %>% str() df_event %>% group_by(univ_id, event_type) %>% count() df_event %>% group_by(univ_id, event_type) %>% count() %>% str() df_event %>% group_by(univ_id, event_type, event_state) %>% count() ``` ### Grouping not retained unless you __assign__ it Below, we'll use `class()` function to show whether data frame is grouped - will talk more about `class()` next week, but for now, just think of it as a function that provides information about an object - similar to `typeof()`, but `class()` provides different info about object Grouping is not retained unless you __assign__ it ```{r} class(df_event) df_event_grp <- df_event %>% group_by(univ_id, event_type, event_state) # using pipes class(df_event_grp) ``` Use `ungroup(object)` to un-group grouped data ```{r} df_event_grp <- ungroup(df_event_grp) class(df_event_grp) rm(df_event_grp) ``` ### `group_by()` student exercise 1. Group by "instnm" and get a frequency count. - How many rows and columns do you have? What do the number of rows mean? 1. Now group by "instnm" **and** "event_type" and get a frequency count. - How many rows and columns do you have? What do the number of rows mean? 1. **Bonus:** In the same code chunk, group by "instnm" and "event_type", but this time filter for observations where "med_inc" is greater than 75000 and get a frequency count. ### `group_by()` student exercise solutions 1. Group by "instnm" and get a frequency count. - How many rows and columns do you have? What do the number of rows mean? ```{r} df_event %>% group_by(instnm) %>% count() ``` ### `group_by()` student exercise solutions 2. Now group by "instnm" **and** "event_type" and get a frequency count. - How many rows and columns do you have? What do the number of rows mean? ```{r} df_event %>% group_by(instnm, event_type) %>% count() ``` ### `group_by()` student exercise solutions 3. **Bonus:** Group by "instnm" and "event_type", but this time filter for observations where "med_inc" is greater than 75000 and get a frequency count. 
```{r} df_event %>% group_by(instnm, event_type) %>% filter(med_inc > 75000) %>% count() ``` ## summarise() ### `summarise()` function `summarise()` does calculations across rows; then collapses into single row ```{r, eval=FALSE, echo=FALSE} ?summarise ``` __Usage (i.e., syntax)__: `summarise(.data, ...)` __Arguments__ - `.data`: a data frame; omit if using `summarise()` after pipe `%>%` - `...`: Name-value pairs of summary functions. - The name will be the name of the variable in the result. - Value should be expression that returns a single value like `min(x)`, `n()` __Value__ (what `summarise()` returns/creates) - Object of same class as `.data.`; object will have one obs per "by group" __Useful functions (i.e., "helper functions")__ - Standalone functions called *within* `summarise()`, e.g., `mean()`, `n()` - Count function `n()` takes no arguments; returns number of rows in group __Example__: Count total number of events ```{r, results="hide"} summarise(df_event, num_events=n()) # without pipes sum_object <- df_event %>% summarise(num_events=n()) # using pipes df_event %>% summarise(num_events=n()) # using pipes ``` ### Investigate objects created by `summarise()` __Example__: Count total number of events ```{r, results="hide"} df_event %>% summarise(num_events=n()) df_event %>% summarise(num_events=n()) %>% str() ``` __Example__: What is max value of `med_inc` across all events ```{r, results="hide"} df_event %>% summarise(max_inc=max(med_inc, na.rm = TRUE)) df_event %>% summarise(max_inc=max(med_inc, na.rm = TRUE)) %>% str() ``` __Example__: Count total number of events AND max value of median income ```{r, results="hide"} df_event %>% summarise(num_events=n(), max_inc=max(med_inc, na.rm = TRUE)) df_event %>% summarise(num_events=n(), max_inc=max(med_inc, na.rm = TRUE)) %>% str() ``` __Takeaway__ - by default, objects created by `summarise()` are data frames that contain variables created within `summarise()` and one observation [per "by group"] ### Retaining objects created by `summarise()` Object created by summarise() not retained unless you __assign__ it ```{r} event_temp <- df_event %>% summarise(num_events=n(), mean_inc=mean(med_inc, na.rm = TRUE)) event_temp rm(event_temp) ``` ### `summarise()` student exercise 1. What is the min value of `med_inc` across all events? - Hint: Use min() 1. What is the mean value of `fr_lunch` across all events? - Hint: Use mean() ### `summarise()` student exercise 1. What is min value of `med_inc` across all events? ```{r} df_event %>% summarise(min_med_income = min(med_inc, na.rm = TRUE)) ``` ### `summarise()` student exercise 2. What is the mean value of `fr_lunch` across all events? - Hint: Use mean() ```{r} df_event %>% summarise(mean_fr_lunch = mean(fr_lunch, na.rm = TRUE)) ``` # Combining group_by() and summarise() ### Combining `summarise()` and `group_by` `summarise()` on ungrouped vs. 
grouped data: - By itself, `summarise()` performs calculations across all rows of data frame then collapses the data frame to a single row - When data frame is grouped, `summarise()` performs calculations across rows within a group and then collapses to a single row for each group __Example__: Count the number of events for each university ```{r, results="hide"} df_event %>% summarise(num_events=n()) df_event %>% group_by(instnm) %>% summarise(num_events=n()) ``` - Investigate the object created above ```{r, results="hide"} df_event %>% group_by(instnm) %>% summarise(num_events=n()) %>% str() ``` - Or we could retain object for later use ```{r, results="hide"} event_by_univ <- df_event %>% group_by(instnm) %>% summarise(num_events=n()) str(event_by_univ) event_by_univ # print rm(event_by_univ) ``` ### Combining `summarise()` and `group_by` __Task__ - Count number of recruiting events by event_type for each university ```{r, results="hide"} df_event %>% group_by(instnm, event_type) %>% summarise(num_events=n()) df_event %>% group_by(instnm, event_state, event_type) %>% summarise(num_events=n()) #investigate object created df_event %>% group_by(instnm, event_type) %>% summarise(num_events=n()) %>% str() ``` __Task__ - By university and event type, count the number of events and calculate the avg. pct white in the zip-code ```{r, results="hide"} df_event %>% group_by(instnm, event_type) %>% summarise(num_events=n(), mean_pct_white=mean(pct_white_zip, na.rm = TRUE) ) #investigate object you created df_event %>% group_by(instnm, event_type) %>% summarise(num_events=n(), mean_pct_white=mean(pct_white_zip, na.rm = TRUE) ) %>% str() ``` ### Combining `summarise()` and `group_by` Recruiting events by UC Berkeley ```{r, results="hide"} df_event %>% filter(univ_id == 110635) %>% group_by(event_type) %>% summarise(num_events=n()) ``` Let's create a dataset of recruiting events at UC Berkeley ```{r, results="hide"} event_berk <- df_event %>% filter(univ_id == 110635) event_berk %>% count(event_type) ``` The "char" variable `event_inst` equals "In-State" if event is in same state as the university ```{r} event_berk %>% arrange(event_date) %>% select(pid, event_date, event_type, event_state, event_inst) %>% slice(1:8) ``` ## summarise() and Counts (and logical vectors) ### `summarise()`: Counts The count function `n()` takes no arguments and returns the size of the current group ```{r, results="hide"} event_berk %>% group_by(event_type, event_inst) %>% summarise(num_events=n()) ``` Object not retained unless we __assign__ ```{r, results="hide"} berk_temp <- event_berk %>% group_by(event_type, event_inst) %>% summarise(num_events=n()) berk_temp typeof(berk_temp) str(berk_temp) ``` Because counts are so important, `dplyr` package includes separate `count()` function that can be called outside `summarise()` function ```{r, results="hide"} event_berk %>% group_by(event_type, event_inst) %>% count() berk_temp2 <- event_berk %>% group_by(event_type, event_inst) %>% count() berk_temp == berk_temp2 # TAKEAWAY: these two objects are identical! rm(berk_temp,berk_temp2) ``` ### `summarise()`: count with logical vectors and `sum()` Logical vectors have values `TRUE` and `FALSE`. - When used with numeric functions, `TRUE` converted to 1 and `FALSE` to 0. 
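For example, a quick illustration of this coercion (made-up values, not from our data):

```{r}
TRUE + TRUE                       # logicals coerce to 1/0, so this returns 2
as.integer(c(TRUE, FALSE, TRUE))  # returns 1 0 1
```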
`sum()` is a numeric function that returns the sum of values

```{r, results="hide"}
sum(c(5,10))
sum(c(TRUE,TRUE,FALSE,FALSE))
```

`is.na()` returns `TRUE` if a value is `NA` and otherwise returns `FALSE`

```{r}
is.na(c(5,NA,4,NA))
sum(is.na(c(5,NA,4,NA,5)))
sum(!is.na(c(5,NA,4,NA,5)))
```

Application: How many missing/non-missing obs in a variable [__very important__]

```{r, results="hide"}
event_berk %>% group_by(event_type) %>%
  summarise(
    n_events = n(),
    n_miss_inc = sum(is.na(med_inc)),
    n_nonmiss_inc = sum(!is.na(med_inc)),
    n_nonmiss_fr_lunch = sum(!is.na(fr_lunch))
  )
```

### `summarise()` and count student exercise

Use one code chunk for this exercise. You can tackle it one step at a time and run the entire code chunk once you have answered all parts of the question. Create your own variable names.

1. Using the `event_berk` object, filter observations where `event_state` is VA and group by `event_type`.
1. Use the summarise function to create a variable that represents the count for each `event_type`.
1. Create a variable that represents the number of missing obs for `med_inc`.
1. Create a variable that represents the number of non-missing obs for `med_inc`.
1. **Bonus**: Arrange the variable you created representing the count of each `event_type` in descending order.

### `summarise()` and count student exercise SOLUTION

1. Using the `event_berk` object, filter observations where `event_state` is VA and group by `event_type`.
1. Using the summarise function, create a variable that represents the count for each `event_type`.
1. Now get the number of missing obs for `med_inc`.
1. Now get the number of non-missing obs for `med_inc`.

```{r}
event_berk %>% filter(event_state == "VA") %>%
  group_by(event_type) %>%
  summarise(
    n_events = n(),
    n_miss_inc = sum(is.na(med_inc)),
    n_nonmiss_inc = sum(!is.na(med_inc))) %>%
  arrange(desc(n_events))
```

## summarise() and means

### `summarise()`: means

The `mean()` function within `summarise()` calculates means, separately for each group

```{r}
event_berk %>% group_by(event_inst, event_type) %>%
  summarise(
    n_events = n(),
    mean_inc = mean(med_inc, na.rm = TRUE),
    mean_pct_white = mean(pct_white_zip, na.rm = TRUE))
```

### `summarise()`: means and `na.rm` argument

Default behavior of "aggregation functions" (e.g., `mean()`, `sum()`) used within `summarise()`: if the _input_ has any missing values (`NA`), then the output will be missing.

Many functions have the argument `na.rm` (meaning "remove `NA`s")

- `na.rm = FALSE` [the default for `mean()`]
    - Do not remove missing values from input before calculating
    - Therefore, missing values in input will cause output to be missing
- `na.rm = TRUE`
    - Remove missing values from input before calculating
    - Therefore, missing values in input will not cause output to be missing

```{r, results="hide"}
#na.rm = FALSE; the default setting
event_berk %>% group_by(event_inst, event_type) %>%
  summarise(
    n_events = n(),
    n_miss_inc = sum(is.na(med_inc)),
    mean_inc = mean(med_inc, na.rm = FALSE),
    n_miss_frlunch = sum(is.na(fr_lunch)),
    mean_fr_lunch = mean(fr_lunch, na.rm = FALSE))

#na.rm = TRUE
event_berk %>% group_by(event_inst, event_type) %>%
  summarise(
    n_events = n(),
    n_miss_inc = sum(is.na(med_inc)),
    mean_inc = mean(med_inc, na.rm = TRUE),
    n_miss_frlunch = sum(is.na(fr_lunch)),
    mean_fr_lunch = mean(fr_lunch, na.rm = TRUE))
```
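A related option (a minimal sketch, not the approach used in the rest of this lecture): drop rows with missing `med_inc` *before* summarising, using `filter()` with `is.na()`. Note that this removes those rows from every calculation in the pipeline, so counts like `n()` change too.

```{r, results="hide"}
event_berk %>%
  filter(!is.na(med_inc)) %>%        # keep only rows with non-missing med_inc
  group_by(event_inst, event_type) %>%
  summarise(
    n_events = n(),                  # now counts only rows with non-missing med_inc
    mean_inc = mean(med_inc))        # no na.rm needed; NAs already dropped
```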
### Student exercise

1. Using the `event_berk` object, group by `instnm`, `event_inst`, & `event_type`.
1. Create vars for the number of non-missing obs for these racial/ethnic groups (`pct_white_zip`, `pct_black_zip`, `pct_asian_zip`, `pct_hispanic_zip`, `pct_amerindian_zip`, `pct_nativehawaii_zip`)
1. Create vars for the mean percent for each racial/ethnic group

### Student exercise solutions

```{r}
event_berk %>% group_by(instnm, event_inst, event_type) %>%
  summarise(
    n_events = n(),
    n_nonmiss_white = sum(!is.na(pct_white_zip)),
    mean_white = mean(pct_white_zip, na.rm = TRUE),
    n_nonmiss_black = sum(!is.na(pct_black_zip)),
    mean_black = mean(pct_black_zip, na.rm = TRUE),
    n_nonmiss_asian = sum(!is.na(pct_asian_zip)),
    mean_asian = mean(pct_asian_zip, na.rm = TRUE),
    n_nonmiss_lat = sum(!is.na(pct_hispanic_zip)),
    mean_lat = mean(pct_hispanic_zip, na.rm = TRUE),
    n_nonmiss_na = sum(!is.na(pct_amerindian_zip)),
    mean_na = mean(pct_amerindian_zip, na.rm = TRUE),
    n_nonmiss_nh = sum(!is.na(pct_nativehawaii_zip)),
    mean_nh = mean(pct_nativehawaii_zip, na.rm = TRUE)) %>%
  head(6)
```

## summarise() and logical vectors, part II

### `summarise()`: counts with logical vectors, part II

Logical vectors (e.g., from `is.na()`) are useful for counting obs that satisfy some condition

```{r}
is.na(c(5,NA,4,NA))
typeof(is.na(c(5,NA,4,NA)))
sum(is.na(c(5,NA,4,NA)))
```

__Task__: Using object `event_berk`, create object `gt50p_lat_bl` with the following measures for each combination of `event_type` and `event_inst`:

- count of the number of rows for each group
- count of rows non-missing for both `pct_black_zip` and `pct_hispanic_zip`
- count of the number of visits to communities where Black and Latinx people together comprise more than 50% of the total population

```{r, results="hide"}
gt50p_lat_bl <- event_berk %>% group_by(event_inst, event_type) %>%
  summarise(
    n_events = n(),
    n_nonmiss_latbl = sum(!is.na(pct_black_zip) & !is.na(pct_hispanic_zip)),
    n_majority_latbl = sum(pct_black_zip + pct_hispanic_zip > 50, na.rm = TRUE)
  )

gt50p_lat_bl # print object
str(gt50p_lat_bl)
```

### `summarise()`: logical vectors to count _proportions_

Syntax: `group_by(vars) %>% summarise(prop = mean(TRUE/FALSE condition))`

__Task__: separately for in-state/out-of-state, what proportion of visits to public high schools are to communities with median income greater than $100,000?

Steps:

1. Filter public HS visits
2. Group by in-state vs. out-of-state
3. Create measure

```{r}
event_berk %>%
  filter(event_type == "public hs") %>% # filter public hs visits
  group_by(event_inst) %>% # group by in-state vs. out-of-state
  summarise(
    n_events = n(), # number of events by group
    n_nonmiss_inc = sum(!is.na(med_inc)), # number with non-missing median income
    p_incgt100k = mean(med_inc > 100000, na.rm = TRUE)) # proportion of visits to $100K+ communities
```
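Why `mean()` of a logical gives a proportion: with `na.rm = TRUE`, it is the count of `TRUE`s divided by the count of non-missing values. A quick check with made-up values (not from `df_event`):

```{r}
x <- c(150000, 80000, NA, 120000)  # made-up median incomes
sum(x > 100000, na.rm = TRUE)      # 2 values exceed 100,000
sum(!is.na(x))                     # 3 non-missing values
mean(x > 100000, na.rm = TRUE)     # 2/3 = 0.667
```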
### `summarise()`: logical vectors to count _proportions_

__What if we forgot to put `na.rm=TRUE` in the above task?__

\medskip

__Task__: separately for in-state/out-of-state, what proportion of visits to public high schools are to communities with median income greater than $100,000?

```{r}
event_berk %>%
  filter(event_type == "public hs") %>% # filter public hs visits
  group_by(event_inst) %>% # group by in-state vs. out-of-state
  summarise(
    n_events = n(), # number of events by group
    n_nonmiss_inc = sum(!is.na(med_inc)), # number with non-missing median income
    p_incgt100k = mean(med_inc > 100000)) # proportion of visits to $100K+ communities
```

### `summarise()`: Other "helper" functions

Lots of other functions we can use within `summarise()`

\medskip

Common functions to use with `summarise()`:

| Function | Description |
|----------|-------------|
| `n` | count |
| `n_distinct` | count unique values |
| `mean` | mean |
| `median` | median |
| `max` | largest value |
| `min` | smallest value |
| `sd` | standard deviation |
| `sum` | sum of values |
| `first` | first value |
| `last` | last value |
| `nth` | nth value |
| `any` | condition true for at least one value |

*Note: These functions can also be used on their own or with `mutate()`*

### `summarise()`: Other functions

Maximum value in a group

```{r}
max(c(10,50,8))
```

__Task__: For each combination of in-state/out-of-state and event type, what is the maximum value of `med_inc`?

```{r}
event_berk %>% group_by(event_type, event_inst) %>%
  summarise(max_inc = max(med_inc))

event_berk %>% group_by(event_type, event_inst) %>%
  summarise(max_inc = max(med_inc, na.rm = TRUE))
```

What did we do wrong here?

### `summarise()`: Other functions

Isolate the first/last/nth observation in a group

```{r, results="hide"}
x <- c(10,15,20,25,30)
first(x)
last(x)
nth(x,1)
nth(x,3)
nth(x,10)
```

__Task__: after sorting object `event_berk` by `event_type` and `event_datetime_start`, what is the value of `event_date` for:

- the first event for each event type?
- the last event for each event type?
- the 50th event for each event type?

```{r, results="hide"}
event_berk %>% arrange(event_type, event_datetime_start) %>%
  group_by(event_type) %>%
  summarise(
    n_events = n(),
    date_first = first(event_date),
    date_last = last(event_date),
    date_50th = nth(event_date, 50)
  )
```

### Student exercise

Identify the value of `event_date` for the _nth_ event in each by group

__Specific task__:

- arrange (i.e., sort) by `event_type` and `event_datetime_start`, then group by `event_type`, and then identify the value of `event_date` for:
    - the first event in each by group (`event_type`)
    - the second event in each by group
    - the third event in each by group
    - the fourth event in each by group
    - the fifth event in each by group

### Student exercise solution

```{r}
event_berk %>% arrange(event_type, event_datetime_start) %>%
  group_by(event_type) %>%
  summarise(
    n_events = n(),
    date_1st = first(event_date),
    date_2nd = nth(event_date,2),
    date_3rd = nth(event_date,3),
    date_4th = nth(event_date,4),
    date_5th = nth(event_date,5))
```

## Attach aggregate measures to your data frame

### Attach aggregate measures to your data frame

We can attach aggregate measures to a data frame by using `mutate()` on grouped data, that is, `group_by()` without `summarise()`

What do I mean by "attaching aggregate measures to a data frame"?

- Calculate measures at the by-group level, but attach them to the original object rather than creating an object with one row for each by group

__Task__: Using the `event_berk` data frame, create (1) a measure of average income across all events and (2) a measure of average income for each event type

- resulting object should have the same number of observations as `event_berk`

Steps (see the schematic sketch after these steps):

1. Create measure of avg. income across all events without using `group_by()` or `summarise()` and assign as (new) object
1. Using object from previous step, create measure of avg. income by event type using `group_by()` without `summarise()` and assign as new object
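Schematically, attaching a group-level aggregate is a grouped `mutate()` rather than a `summarise()`. A minimal sketch with placeholder names (`my_df`, `grp`, and `x` are stand-ins, not variables in our data):

```{r, eval=FALSE}
my_df %>%
  group_by(grp) %>%                                # define the by-groups
  mutate(grp_mean_x = mean(x, na.rm = TRUE)) %>%   # one value per group, repeated on every row
  ungroup()                                        # number of rows is unchanged
```

Unlike `summarise()`, which collapses to one row per group, a grouped `mutate()` keeps every row and simply repeats the group-level value within each group.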
### Attach aggregate measures to your data frame

__Task__: Using the `event_berk` data frame, create (1) a measure of average income across all events and (2) a measure of average income for each event type

1. Create measure of average income across all events

```{r, results="hide"}
event_berk_temp <- event_berk %>%
  arrange(event_date) %>% # sort by event_date (optional)
  select(event_date, event_type, med_inc) %>% # select vars to be retained (optional)
  mutate(avg_inc = mean(med_inc, na.rm=TRUE)) # create avg. inc measure

dim(event_berk_temp)
event_berk_temp %>% head(5)
```

2. Create measure of average income by event type

```{r, results="hide"}
event_berk_temp <- event_berk_temp %>%
  group_by(event_type) %>% # group by event type
  mutate(avg_inc_type = mean(med_inc, na.rm=TRUE)) # create avg. inc measure by event type

str(event_berk_temp)
event_berk_temp %>% head(5)
```

### Attach aggregate measures to your data frame

__Task__: Using `event_berk_temp` from the previous task, create a measure that identifies whether the `med_inc` associated with an event is higher/lower than the average income for all events of that type

Steps:

1. Create measure of average income for each event type [already done]
1. Create 0/1 indicator that identifies whether median income at the event location is higher than the average median income for events of that type

```{r, results="hide"}
# indicator for whether event med_inc is above the event-type average
event_berk_tempv2 <- event_berk_temp %>%
  mutate(gt_avg_inc_type = med_inc > avg_inc_type) %>%
  select(-(avg_inc)) # drop avg_inc (optional)

event_berk_tempv2 # note how med_inc = NA is treated
```

Same as above, but this time create an integer indicator rather than a logical

```{r, results="hide"}
event_berk_tempv2 <- event_berk_tempv2 %>%
  mutate(gt_avg_inc_type = as.integer(med_inc > avg_inc_type))

event_berk_tempv2 %>% head(4)
```

### Student exercise

Task: is `pct_white_zip` at a particular event higher or lower than the average `pct_white_zip` for that `event_type`?

- Note: all events are attached to a particular zip code
- `pct_white_zip`: pct of people in that zip code who identify as white

Steps in task:

- Create measure of average pct white for each `event_type`
- Compare whether `pct_white_zip` is higher or lower than this average

### Student exercise solution

Task: is `pct_white_zip` at a particular event higher or lower than the average `pct_white_zip` for that `event_type`?

```{r}
event_berk_tempv3 <- event_berk %>%
  arrange(event_date) %>% # sort by event_date (optional)
  select(event_date, event_type, pct_white_zip) %>% # optional
  group_by(event_type) %>% # group by event type
  mutate(avg_pct_white = mean(pct_white_zip, na.rm=TRUE),
         gt_avg_pctwhite_type = as.integer(pct_white_zip > avg_pct_white))

event_berk_tempv3 %>% head(4)
```
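One way to sanity-check the new indicator is to tabulate it (a quick sketch; note that events with missing `pct_white_zip` get `NA` for the indicator):

```{r}
event_berk_tempv3 %>%
  count(gt_avg_pctwhite_type) # data are still grouped by event_type, so counts are within event type
```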