--- title: "Lecture 4 problem set" author: "INSERT YOUR NAME HERE" date: "October 18, 2019" urlcolor: blue output: pdf_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ```{r, echo=FALSE, include=FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE) ``` # General instructions In this homework, you will specify `pdf_document` as the output format. You must have LaTeX installed in order to create pdf documents. If you have not yet installed MiKTeX/MacTeX, I recommend installing TinyTeX, which is much simpler to install! - Instructions for installation of TinTeX can be found [HERE](https://bookdown.org/yihui/rmarkdown/installation.html#installation) - General Instructions for Problem Sets [Here](https://github.com/ozanj/rclass/raw/master/lectures/problemset_resources.pdf) # Make changes to YAML header Read XAG section 3.3 before answering these questions 1. Add a table of contents to YAML header 1. table of contents should have "depth" of 2 1. Add section numbering to headers 1. Change "data frame printing" option to "tibble" # Load packages, load data, and rename variables 1. Load the tidyverse package ```{r} #install.packages("tidyverse") #install if you do not have tidyverse installed library(tidyverse) ``` 2. Load the data frame data frame `df_school_all` - The URL for this data frame is: (https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_school_allvars.RData) - The data frame `df_school_all` has one observation for each high school (public and private). - The variables that begin with `visits_by_...` identify how many off-campus recruiting visits the high school received from a particular public university. For example, UC Berkeley has the ID `110635` so the variable `visits_by_110635` identifies how many visits the high school received from UC Berkeley. - The variable `total_visits` identifies the number of visits the high school received from all (16) public research universities in this data collection sample. ```{r} load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_school_allvars.RData")) ``` 3. Run the following code which drops some variables, renames other variables, and assigns these changes to the existing object `df_school_all` and then print the names of all the variables using the `names()` function. ```{r} df_school_all <- df_school_all %>% select(-contains("inst_")) %>% # remove vars that start with "inst_" rename( visits_by_berkeley = visits_by_110635, visits_by_boulder = visits_by_126614, visits_by_bama = visits_by_100751, visits_by_stonybrook = visits_by_196097, visits_by_rutgers = visits_by_186380, visits_by_pitt = visits_by_215293, visits_by_cinci = visits_by_201885, visits_by_nebraska = visits_by_181464, visits_by_georgia = visits_by_139959, visits_by_scarolina = visits_by_218663, visits_by_ncstate = visits_by_199193, visits_by_irvine = visits_by_110653, visits_by_kansas = visits_by_155317, visits_by_arkansas = visits_by_106397, visits_by_sillinois = visits_by_149222, visits_by_umass = visits_by_166629, num_took_read = num_took_rla, num_prof_read = num_prof_rla, med_inc = avgmedian_inc_2564 ) names(df_school_all) ``` # Filter and arrange questions For the questions below, imagine that you have been asked by a major news outlet to identify which high schools receive the most off-campus recruiting visits from the 16 public universities in the sample. Therefore, you will focus on the variable `total_visits`, which counts the total number of visits to the high school across all public 16 public research universities in the sample - For questions that ask you to print the "top 10" observations, you can either: - just print the object and rely on the fact that the default option for printing tibbles is to print the first 10 observations - OR you can wrap the command in the `head()` function and explicitly tell R to print 10 observations. 1. Without using pipes (`%>%`), sort (i.e., `arrange()` function) descending by `total_visits` and print the the following variables for the top 10 schools in terms of total number of visits: - variables to print: `name`, `state_code`, `city`, `school_type`,`total_visits`, `med_inc`, `pct_white`, `pct_black`, `pct_hispanic`, `pct_asian`, `pct_amerindian` - Note: You can do this in one step by wrapping the `select()` function around the `arrange()` (i.e., sort) function; or you can do this in two steps by creating a new data frame first. ```{r} ``` 2. Answer the question above, but this time use pipes (`%>%`) to answer the question in one line of code ```{r} ``` 3. Without using pipes, print the following (same variables as above): - (A) the top 10 public high schools in terms of total number of visits and then - (B) the top 10 private high schoools in terms of total number of visits ```{r} ``` 4. Answer the question above, but this time using pipes (`%>%`) to answer the question in one line of code for part (A) and one line of code for part (B) ```{r} ``` 5. Using pipe operator (`%>%`), print the following (same variables as above; one line of code for each part (A), (B), (C), (D)): - (A) the top 10 public high schools in Massachusetts in terms of total number of visits and then - (B) the top 10 private high schools in Massachusetts in terms of total number of visits - (C) the top 10 public high schools in California in terms of total number of visits and then - (D) the top 10 private high schools in California in terms of total number of visits ```{r} ``` # Creating variables using mutate() The focus of this set of questions will be practicing creating some variables from the data frame `df_school_all`. You will be using the `mutate()` function, often combined with the `if_else()` function. Additionally, questions will ask you to investigate the values of "input" variables before creating new "analysis" variables using `mutate()` Before presenting questions, here are some examples of code that may be useful in checking variable values. The below lines of code count: - the number of observations in the data frame `df_school_all` - the number of observations that have missing values for the variable `state_code` - the number of observations that have missing values for the variable `school_type` - a frequency count of the variable `school_type` ```{r} df_school_all %>% count() count(df_school_all) # same as above df_school_all %>% filter(is.na(state_code)) %>% count() # number with NA for state_code df_school_all %>% filter(is.na(school_type)) %>% count() # number with NA for school_type df_school_all %>% count(school_type) # frequency count of school_type ``` 1. Using `mutate()` with `ifelse()` create a 0/1 indicator called `ca_school` that indicates whether the high school is in California and then use `count()` to create a frequency table for the values of `ca_school` (you don't need to assign/retain the new variable) ```{r} ``` 2. Using `mutate()` with `ifelse()` create a 0/1 indicator called `ca_pub_school` that indicates whether the school is a public high school in California and then use `count()` to create a frequency table for the values of `ca_pub_school` (you don't need to assign/retain the new variable) ```{r} ``` 3. By combining the `is.na()` function with the `filter()` function, identify the number of observations that have missing values for the following variables: - `pct_black`, `pct_hispanic`, `pct_amerindian` ```{r} ``` 4. Create a new variable pct_bl_hisp_nat that represents the percent of students at the school that identify as black, hispanic, or american indian. Retain this variable by assigning it to the object `df_school_all` ```{r} ``` 5. Create a new 0/1 indicator variable gt50pct_bl_hisp_nat that identifies whether more than 50% of students identify as black, hispanic, or american indian and create a frequency count of this variable (no need to retain this variable) ```{r} ``` 6. Create the following 0/1 indicator variables, retain them (assign to object `df_school_all`), and then create frequency counts of these variables: - Variable `miss_took_math` for whether the school has missing values for the variable `num_took_math` - Variable `miss_prof_math` for whether the school has missing values for the variable `num_prof_math` - Variable `miss_took_or_prof_math` for whether the school has missing values for the variable `num_took_math` OR `num_prof_math` ```{r} ``` 7. create a variable of `pct_prof_math` that measures the percent of students who score proficient in the state math assessment(assign to object `df_school_all`). ```{r} ``` 8. create a frequency count of value of the variable `pct_prof_math` separately for the three following filters: - Observations where `miss_took_math==1` - Observations where `miss_prof_math==1` - Observations where `miss_took_or_prof_math==1` ```{r} ``` # Using case_when() function within mutate() For this set of questions, you will work with the data frame `wwlist` which has one observation for each prospective student purchased by Western Washington University from the College Board. The objective of this set of questions is to create a three-category variable that identifies whether the prospect lives: - (1) in-state (i.e., in Washington), (2) out-of-state but in a US state/territory; (3) not in the US 1. Load the data frame `wwlist` which has information on prospects purchased by Western Washington University ```{r} load(url("https://github.com/ozanj/rclass/raw/master/data/prospect_list/wwlist_merged.RData")) ``` 2. Apply the `str()` function to the variables `state` and `for_country`; and using the `count()` function to create frequency tables for the variables `state` - `state` - `for_country` ```{r} ``` 3. Using the `filter()` function and `is.na()` function do the following: - count how many missing observations (`NAs`) the variable `state` has - count how many missing observations the variable `for_country` has ```{r} ``` 4. Create a frequency count for the variable `for_country` for the observations where `state` equals `NA` (hint: use the `is.na()`) function ```{r} ``` 5. Create a frequency count for the variable `for_country` for the observations where `state` does not equal `NA` (hint: use `!is.na()`) function ```{r} ``` 6. Count the number of observations that have the value "No Response" for the variable `for_country` ```{r} ``` 7. Using the `case_when` function within `mutate()` create a character variable called `residency` that has the following values: "in_state"; "out_state_us"; "not_in_us" - This variable should have the value `NA` for observations where `for_country=="No Response"` - Retain this variable (assign to object `wwlist`) and create a frequency count of this variable ```{r} ``` Once finished, knit to (pdf) and upload both .Rmd and PDF files to class website under the week 3 tab *Remeber to use this naming convention "lastname_firstname_ps3"*