--- title: "Biostat 200C Homework 1" subtitle: Due Apr 17 @ 11:59PM output: html_document: toc: true toc_depth: 4 --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, cache = FALSE) ``` To sumbit homework, please email Rmd and html files to TA Shanpeng Li by the deadline. ## Q1. Concavity of logistic regression log-likelihood ### Q1.1 Write down the log-likelihood function of logistic regresion for binomial responses. ### Q1.2 Derive the gradient vector and Hessian matrix of the log-likelhood function with respect to the regression coefficients $\boldsymbol{\beta}$. ### Q1.3 Show that the log-likelihood function of logistic regression is a concave function in regression coefficients $\boldsymbol{\beta}$. (Hint: show that the negative Hessian is a positive semidefinite matrix.) ## Q2. CFR of COVID-19 Of primary interest to public is the risk of dying from COVID-19. A commonly used measure is case fatality rate/ratio/risk (CFR), which is defined as $$ \frac{\text{number of deaths from disease}}{\text{number of diagnosed cases of disease}}. $$ Apparently CFR is not a fixed constant; it changes with time, location, and other factors. Also CFR is different from the infection fatality rate (IFR), the probability that someone infected with COVID-19 dies from it. In this exercise, we use logistic regression to study how US county-level CFR changes according to demographic information and some health-, education-, and economy-indicators. ### Data sources - `04-04-2020.csv.gz`: The data on COVID-19 confirmed cases and deaths on 2020-04-04 is retrieved from the [Johns Hopkins COVID-19 data repository](https://github.com/CSSEGISandData/COVID-19). It was downloaded from this [link](https://github.com/CSSEGISandData/COVID-19) (commit 0174f38). - `us-county-health-rankings-2020.csv.gz`: The 2020 County Health Ranking Data was released by [County Health Rankings](https://www.countyhealthrankings.org). The data was downloaded from the [Kaggle Uncover COVID-19 Challenge](https://www.kaggle.com/roche-data-science-coalition/uncover) (version 1). ### Sample code for data preparation Load the `tidyverse` package for data manipulation and visualization. ```{r} # tidyverse of data manipulation and visualization library(tidyverse) ``` Read in the data of COVID-19 cases reported on 2020-04-04. ```{r} county_count <- read_csv("./04-04-2020.csv.gz") %>% # cast fips into dbl for use as a key for joining tables mutate(FIPS = as.numeric(FIPS)) %>% filter(Country_Region == "US") %>% print(width = Inf) ``` Standardize the variable names by changing them to lower case. ```{r} names(county_count) <- str_to_lower(names(county_count)) ``` Sanity check by displaying the unique US states and territories: ```{r} county_count %>% select(province_state) %>% distinct() %>% arrange(province_state) %>% print(n = Inf) ``` We want to exclude entries from `Diamond Princess`, `Grand Princess`, `Guam`, `Northern Mariana Islands`, `Puerto Rico`, `Recovered`, and `Virgin Islands`, and only consider counties from 50 states and DC. ```{r} county_count <- county_count %>% filter(!(province_state %in% c("Diamond Princess", "Grand Princess", "Recovered", "Guam", "Northern Mariana Islands", "Puerto Rico", "Virgin Islands"))) %>% print(width = Inf) ``` Graphical summarize the COVID-19 confirmed cases and deaths on 2020-04-04 by state. 
Graphically summarize the COVID-19 confirmed cases and deaths on 2020-04-04 by state.

```{r}
county_count %>%
  # turn into long format for easy plotting
  pivot_longer(confirmed:recovered, names_to = "case", values_to = "count") %>%
  group_by(province_state) %>%
  ggplot() +
  geom_col(mapping = aes(x = province_state, y = `count`, fill = `case`)) +
  # scale_y_log10() +
  labs(title = "US COVID-19 Situation on 2020-04-04", x = "State") +
  theme(axis.text.x = element_text(angle = 90))
```

Read in the 2020 county-level health ranking data.

```{r}
county_info <- read_csv("./us-county-health-rankings-2020.csv.gz") %>%
  filter(!is.na(county)) %>%
  # cast fips into dbl for use as a key for joining tables
  mutate(fips = as.numeric(fips)) %>%
  select(fips,
         state,
         county,
         percent_fair_or_poor_health,
         percent_smokers,
         percent_adults_with_obesity,
         # food_environment_index,
         percent_with_access_to_exercise_opportunities,
         percent_excessive_drinking,
         # teen_birth_rate,
         percent_uninsured,
         # primary_care_physicians_rate,
         # preventable_hospitalization_rate,
         # high_school_graduation_rate,
         percent_some_college,
         percent_unemployed,
         percent_children_in_poverty,
         # `80th_percentile_income`,
         # `20th_percentile_income`,
         percent_single_parent_households,
         # violent_crime_rate,
         percent_severe_housing_problems,
         overcrowding,
         # life_expectancy,
         # age_adjusted_death_rate,
         percent_adults_with_diabetes,
         # hiv_prevalence_rate,
         percent_food_insecure,
         # percent_limited_access_to_healthy_foods,
         percent_insufficient_sleep,
         percent_uninsured_2,
         median_household_income,
         average_traffic_volume_per_meter_of_major_roadways,
         percent_homeowners,
         # percent_severe_housing_cost_burden,
         population_2,
         percent_less_than_18_years_of_age,
         percent_65_and_over,
         percent_black,
         percent_asian,
         percent_hispanic,
         percent_female,
         percent_rural) %>%
  print(width = Inf)
```

For stability in estimating CFR, we restrict to counties with $\ge 5$ confirmed cases.

```{r}
county_count <- county_count %>%
  filter(confirmed >= 5)
```

We join the COVID-19 count data and the county-level information using FIPS (Federal Information Processing System) codes as the key.

```{r}
county_data <- county_count %>%
  left_join(county_info, by = "fips") %>%
  print(width = Inf)
```

Numerical summaries of each variable:

```{r}
summary(county_data)
```

List rows in `county_data` that don't have a match in `county_info`:

```{r}
county_data %>%
  filter(is.na(state) & is.na(county)) %>%
  print(n = Inf)
```

We find that some rows are missing `fips`.

```{r}
county_count %>%
  filter(is.na(fips)) %>%
  select(fips, admin2, province_state) %>%
  print(n = Inf)
```
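As a cross-check, the unmatched counties can also be listed directly with an anti-join. A minimal sketch using the two tables above; it should flag essentially the same rows as the missing `state`/`county` check:

```{r}
# Sketch: rows of the count data with no matching fips in the health-ranking data
county_count %>%
  anti_join(county_info, by = "fips") %>%
  select(fips, admin2, province_state) %>%
  print(n = Inf)
```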
We need to (1) manually set the `fips` for some counties, (2) discard the `Unassigned`, `unassigned`, or `Out of` entries, and (3) try to join with `county_info` again.

```{r}
county_data <- county_count %>%
  # manually set FIPS for some counties
  mutate(fips = ifelse(admin2 == "DeKalb" & province_state == "Tennessee", 47041, fips)) %>%
  mutate(fips = ifelse(admin2 == "DeSoto" & province_state == "Florida", 12027, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Dona Ana" & province_state == "New Mexico", 35013, fips)) %>%
  mutate(fips = ifelse(admin2 == "Dukes and Nantucket" & province_state == "Massachusetts", 25019, fips)) %>%
  mutate(fips = ifelse(admin2 == "Fillmore" & province_state == "Minnesota", 27045, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Harris" & province_state == "Texas", 48201, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Kenai Peninsula" & province_state == "Alaska", 2122, fips)) %>%
  mutate(fips = ifelse(admin2 == "LaSalle" & province_state == "Illinois", 17099, fips)) %>%
  #mutate(fips = ifelse(admin2 == "LaSalle" & province_state == "Louisiana", 22059, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Lac qui Parle" & province_state == "Minnesota", 27073, fips)) %>%
  mutate(fips = ifelse(admin2 == "Manassas" & province_state == "Virginia", 51683, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Matanuska-Susitna" & province_state == "Alaska", 2170, fips)) %>%
  mutate(fips = ifelse(admin2 == "McDuffie" & province_state == "Georgia", 13189, fips)) %>%
  #mutate(fips = ifelse(admin2 == "McIntosh" & province_state == "Georgia", 13191, fips)) %>%
  #mutate(fips = ifelse(admin2 == "McKean" & province_state == "Pennsylvania", 42083, fips)) %>%
  mutate(fips = ifelse(admin2 == "Weber" & province_state == "Utah", 49057, fips)) %>%
  filter(!(is.na(fips) | str_detect(admin2, "Out of") | str_detect(admin2, "Unassigned"))) %>%
  left_join(county_info, by = "fips") %>%
  print(width = Inf)
```

Summarize again:

```{r}
summary(county_data)
```

If there are variables with missing values for many counties, we go back and remove those variables from consideration. Let's create a final data frame for analysis.

```{r}
county_data <- county_data %>%
  mutate(state = as.factor(state)) %>%
  select(county, confirmed, deaths, state,
         percent_fair_or_poor_health:percent_rural)
summary(county_data)
```

Display the 10 counties with the highest CFR.

```{r}
county_data %>%
  mutate(cfr = deaths / confirmed) %>%
  select(county, state, confirmed, deaths, cfr) %>%
  arrange(desc(cfr)) %>%
  top_n(10)
```

Write the final data to a csv file for future use.

```{r}
write_csv(county_data, "covid19-county-data-20200404.csv.gz")
```
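As a final sanity check, one might compute the overall crude CFR pooled over the analysis counties. A minimal sketch using `county_data` as constructed above:

```{r}
# Sketch: overall crude CFR pooled over the analysis counties
county_data %>%
  summarise(total_confirmed = sum(confirmed),
            total_deaths    = sum(deaths),
            crude_cfr       = total_deaths / total_confirmed)
```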
### Q2.1

Read and run the above code to generate a data frame `county_data` that includes county-level COVID-19 confirmed cases and deaths, along with demographic and health-related information.

### Q2.2

What assumptions behind CFR might be violated by defining CFR as `deaths / confirmed` from this data set?

Acknowledging these severe limitations, we continue to use `deaths / confirmed` as a very rough proxy of CFR.

### Q2.3

What assumptions of logistic regression may be violated by this data set?

### Q2.4

Run a logistic regression, using the variables `state`, ..., `percent_rural` as predictors.

### Q2.5

Interpret the regression coefficients of 3 significant predictors with p-value < 0.01.

### Q2.6

Apply analysis of deviance to (1) evaluate the goodness of fit of the model and (2) compare the model to the intercept-only model.

### Q2.7

Perform analysis of deviance to evaluate the significance of each predictor. Display the 10 most significant predictors.

### Q2.8

Construct confidence intervals for the regression coefficients.

### Q2.9

Plot the deviance residuals against the fitted values. Are there potential outliers?

### Q2.10

Make a half-normal plot. Are there potential outliers in predictor space?

### Q2.11

Find the best sub-model using the AIC criterion.

### Q2.12

Find the best sub-model using the lasso with cross-validation.