---
title: "Problem Set 4"
date: ""
format: html
---
[Download source](https://raw.githubusercontent.com/ywanglab/STAT3000/refs/heads/main/psets/hw4.qmd)
## Introduction
In this problem set, you will work with US Census and CDC COVID-19 data. The goal is not to produce a formal causal analysis, but rather to build a clean weekly state-level dataset and use graphics to investigate patterns in COVID-19 burden and vaccination uptake.
We will focus on the following questions:
1. How did COVID-19 case, hospitalization, and death rates evolve during 2020 and 2021?
2. Were there regional differences in COVID-19 burden and vaccination progress?
3. During large national waves, did states with higher vaccination or booster coverage tend to have lower hospitalization or death rates?
## Objective
You will create a single weekly data frame for each jurisdiction in 2020 and 2021. Your final table should contain, at a minimum, the following variables:
- `date`
- `state`
- `state_name`
- `region`
- `region_name`
- `population`
- `cases`
- `hosp`
- `deaths`
- `vax`
- `booster`
You will then use this table to create several figures.
## Instructions {.unnumbered}
### 1. Obtain a Census API key
Go to and request a Census API key.
Do **not** hard-code your key directly into your homework file. Instead, create a file in your working directory called `_census-key.R` containing exactly one line of code:
``` r
census_key <- "YOUR_REAL_CENSUS_KEY"
```
Note that the API key must be enclosed inside the quotes. Your homework code should then load the key by sourcing that file.
### 2. Create a project structure
In your course Git repository, create a folder `hw4` that includes the following directories:
- `data`
- `code`
- `figs`
Inside `code`, include:
- `funcs.R`
- `wrangle.R`
- `analysis.qmd`
Your final wrangling script should save the completed dataset as an RDS file in `data`, and your figures should be saved as PNG files in `figs`.
### 3. General guidance
- The CDC datasets do **not** all use the same variable names, so inspect the column names carefully.
- The CDC APIs can change over time, so part of the assignment is to verify the current schema before wrangling.
- Include comments in your code so a grader can follow your logic.
- In each code chunk, change `#| eval: false` to `#| eval: true`.
- Remove `NULL` in each code chunk after you add your own code.
------------------------------------------------------------------------
## Part I: Download the data
For this part, put your code in `code/wrangle.R` unless otherwise specified.
### 1. Load the required packages and source the Census key
Add code that loads the packages you will need and defines `census_key` by sourcing `census-key.R`.
```{r}
#| eval: false
# Load packages you plan to use
library(tidyverse)
library(httr2)
library(janitor)
library(jsonlite)
library(purrr)
library(lubridate)
library(knitr)
# Define census_key by sourcing the local file
# Your code here
```
### 2. Download Census population data
Use the Census API to request 2020 and 2021 state population estimates. The code below constructs and performs the request. Add **at least two lines of your own code** to do the following:
- check that the response was successful,
- determine the content type,
- and extract the body into an object called `population_raw`.
```{r}
#| eval: false
census_url <- "https://api.census.gov/data/2021/pep/population"
population_request <- request(census_url) |>
req_url_query(
get = "POP_2020,POP_2021,NAME",
"for" = "state:*",
key = census_key
)
population_response <- population_request |>
req_perform()
# Add your code below
# 1. Check the response status
# 2. Determine the content type
# 3. Extract the body into population_raw
```
### 3. Tidy the population data
Convert `population_raw` into a tidy tibble called `population` with one row per state-year. Your final table should include:
- `state_name`
- `state`
- `year`
- `population`
Also add state abbreviations, making sure to handle DC and Puerto Rico correctly.
```{r}
#| eval: false
## Create tidy population table
population <- population_raw |>
# remove the header row problem
# convert to tibble
# pivot to long format
# parse year and population as numeric
# add state abbreviations
# keep a tidy set of variables
NULL
```
### 4. Download the regions file
The following JSON file lists the Public Health Service regions:
```{r}
#| eval: false
regions_url <- "https://github.com/datasciencelabs/2025/raw/refs/heads/main/data/regions.json"
```
Write code to create a data frame called `regions` with the variables:
- `state_name`
- `region`
- `region_name`
One of the region names is long. Replace it with a shorter version.
```{r}
#| eval: false
regions_json <- fromJSON(regions_url, simplifyVector = FALSE)
regions <-
# convert the parsed list into a tidy data frame
# shorten the longest region name
NULL
```
### 5. Join population and region information
Join `regions` onto `population`.
```{r}
#| eval: false
population <-
# join population and regions
NULL
```
### 6. Write a reusable CDC downloader
Save the following function in `code/funcs.R`. This function should download a CDC Socrata endpoint, remove the default row limit by setting a very large `$limit`, parse the JSON, and return a tibble with clean column names.
```{r}
#| eval: false
get_cdc_data <- function(endpoint) {
request(endpoint) |>
req_url_query("$limit" = 10000000) |>
req_perform() |>
resp_check_status() |>
resp_body_json(simplifyVector = TRUE) |>
as_tibble() |>
janitor::clean_names()
}
```
In `wrangle.R`, source `funcs.R`.
```{r}
#| eval: false
source("code/funcs.R")
```
### 7. Download the CDC datasets
Use the function above to download the four datasets below. Save them as `cases_raw`, `hosp_raw`, `deaths_raw`, and `vax_raw`.
- cases: `https://data.cdc.gov/resource/pwn4-m3yp.json`
- hospitalizations: `https://data.cdc.gov/resource/39z2-9zu6.json`
- deaths: `https://data.cdc.gov/resource/r8kw-7aab.json`
- vaccinations: `https://data.cdc.gov/resource/rh2h-3yt2.json`
The download code is provided. Add **one or two lines of your own** to inspect the result, for example by checking dimensions or variable names.
```{r}
#| eval: false
cases_raw <- get_cdc_data("https://data.cdc.gov/resource/pwn4-m3yp.json")
hosp_raw <- get_cdc_data("https://data.cdc.gov/resource/39z2-9zu6.json")
deaths_raw <- get_cdc_data("https://data.cdc.gov/resource/r8kw-7aab.json")
vax_raw <- get_cdc_data("https://data.cdc.gov/resource/rh2h-3yt2.json")
# Add one or two quick inspection lines here
```
------------------------------------------------------------------------
## Part II: Wrangle the weekly state-level dataset
In this section, you will create a weekly panel for 2020 and 2021.
### 8. Compare the raw schemas
Create a small summary table that reports, for each raw CDC dataset:
1. the column containing the jurisdiction identifier,
2. whether the data are daily or weekly,
3. and the columns that contain time information.
You may render the table with `knitr::kable()`.
```{r}
#| eval: false
raw_summary <- tibble(
outcome = c("cases", "hospitalizations", "deaths", "vaccinations")
# add the remaining columns
)
# render the table
```
### 9. Wrangle the cases data
Create a table called `cases` that contains one row per state-week with:
- `state`
- `mmwr_year`
- `mmwr_week`
- `date` (the week-ending date)
- `cases` (weekly total cases)
Keep only states that appear in the `population` table.
Hints:
- The date information may need to be parsed.
- Use `epiweek()` and `epiyear()` from **lubridate**.
- The cases dataset is already weekly, but you should still verify the time variable you use.
```{r}
#| eval: false
## Wrangle weekly cases
cases <- cases_raw |>
# choose the relevant state column
# choose the relevant date column
# choose the relevant cases column
# parse dates
# restrict to jurisdictions in population
# compute MMWR year and week
# aggregate to one row per state-week if needed
NULL
```
### 10. Wrangle the hospitalization data
Create a table called `hosp` with the same weekly structure as `cases`, but containing weekly hospitalizations in a variable named `hosp`.
Because hospitalization data are daily, collapse them to state-week totals. Remove weeks with fewer than 7 reporting days.
```{r}
#| eval: false
## Wrangle weekly hospitalizations
hosp <- hosp_raw |>
# identify the state, date, and hospitalization columns
# parse the dates
# compute MMWR year and week
# aggregate to weekly totals
# count number of reporting days per week
# remove incomplete weeks
NULL
```
### 11. Wrangle the deaths data
Create a table called `deaths` with weekly COVID-19 deaths by state. Keep the variables:
- `state`
- `mmwr_year`
- `mmwr_week`
- `date`
- `deaths`
Be careful: this dataset may contain national rows and grouping variables.
```{r}
#| eval: false
## Wrangle weekly deaths
# Possible tasks:
# - keep only state-level observations
# - filter to a weekly grouping if needed
# - choose the COVID death count column
# - parse the week-ending date
# - keep the needed variables
deaths <- deaths_raw |>
NULL
```
### 12. Wrangle the vaccination data
Create a table called `vax` containing the most recent vaccination record within each state-week. Keep:
- `state`
- `mmwr_year`
- `mmwr_week`
- `date`
- `vax` (primary series cumulative count)
- `booster` (booster cumulative count)
```{r}
#| eval: false
## Wrangle weekly vaccination data
vax <- vax_raw |>
# identify state and date columns
# choose cumulative primary-series and booster variables
# parse dates
# compute MMWR year and week
# keep the last record within each state-week
NULL
```
### 13. Create the full state-week panel
We will include all state-week combinations from 2020 and 2021, even if one of the datasets is missing a value for a particular week.
Use the following scaffold to create all possible weeks and join population information:
```{r}
#| eval: false
all_dates <- data.frame(
date = seq(make_date(2020, 1, 25), make_date(2021, 12, 31), by = "week")
) |>
mutate(date = ceiling_date(date, unit = "week", week_start = 7) - days(1)) |>
mutate(mmwr_year = epiyear(date), mmwr_week = epiweek(date))
dates_and_pop <- cross_join(
all_dates,
data.frame(state = unique(population$state))
) |>
left_join(population, by = c("state", "mmwr_year" = "year"))
```
Now join `cases`, `hosp`, `deaths`, and `vax` to create a final table called `dat`.
Requirements:
- order observations by `state` and `date`,
- fill vaccination and booster counts forward within state,
- replace missing counts for `cases`, `hosp`, and `deaths` with 0,
- save the final table as `data/dat.rds`.
```{r}
#| eval: false
## Join the weekly tables
dat <- dates_and_pop |>
# join weekly cases
# join weekly hospitalizations
# join weekly deaths
# join weekly vaccinations
# arrange within state by date
# fill cumulative vaccination variables forward
# replace missing weekly counts with 0
NULL
## Save the dataset
# Your code here
```
------------------------------------------------------------------------
## Part III: Create figures in `analysis.qmd`
In your `analysis.qmd` file, load the weekly dataset from `data/dat.rds`. Each figure should have:
- a short descriptive paragraph,
- a code chunk that is shown in the rendered document,
- and appropriate titles and axis labels.
```{r}
#| eval: false
# Setup chunk for analysis.qmd
library(tidyverse)
library(lubridate)
library(scales)
dat <- readRDS("data/dat.rds")
```
### 14. Figure 1: National weekly burden
Create a trend plot showing national weekly **cases**, **hospitalizations**, and **deaths** per 100,000 people. Place the three panels on top of one another.
Hint: first aggregate to the national level by week, then use `pivot_longer()` and `facet_wrap()`.
```{r}
#| eval: false
## Figure 1: National weekly burden
fig1_dat <- dat |>
# aggregate to national weekly rates per 100,000
NULL
# make the plot
```
### 15. Figure 2: Regional case rates over time
Create a line plot of weekly **case rates per 100,000** over time for each `region_name`. Use one line per region.
```{r}
#| eval: false
## Figure 2: Regional case rates
fig2_dat <- dat |>
# aggregate to regional weekly rates
NULL
# make the plot
```
### 16. Figure 3: Vaccination and booster progress by region
For each region, compute the percent of the regional population that completed the primary series and the percent that received a booster dose. Plot both over time.
One reasonable approach is to create a long data frame with one row per region-date-series.
```{r}
#| eval: false
## Figure 3: Vaccination and booster progress
fig3_dat <- dat |>
# compute regional percentages
NULL
# make the plot
```
### 17. Figure 4: Top 10 states by cumulative death rate
By the end of 2021, which states had the highest cumulative COVID-19 death rates per 100,000? Create a bar plot of the top 10 states.
```{r}
#| eval: false
## Figure 4: Top 10 cumulative death rates
fig4_dat <- dat |>
# compute cumulative deaths by state
# convert to a per-100,000 rate
# keep the top 10
NULL
# make the plot
```
### 18. Figure 5: Vaccination and hospitalization during a major wave
Use your earlier figures to identify **one major national COVID-19 wave in 2021** during which a meaningful share of the population had already completed the primary series.
For your chosen period:
- compute **hospitalizations per day per 100,000** for each state,
- compute the **primary-series vaccination rate** in each state at the end of the period,
- create a scatter plot with vaccination rate on the x-axis and hospitalization rate on the y-axis.
Briefly explain how you chose the time window.
```{r}
#| eval: false
## Figure 5: Vaccination vs hospitalization
start_date <- as_date("YYYY-MM-DD")
end_date <- as_date("YYYY-MM-DD")
fig5_dat <- dat |>
# filter to the selected period
# summarize hospitalization burden
# join vaccination coverage at the end date
# compute rates
NULL
# make the plot
```
### 19. Figure 6: Booster coverage and death rates
Repeat the previous exercise, but now examine the relationship between:
- **booster coverage** at the end of the period, and
- **COVID-19 deaths per day per 100,000** during the period.
Use a late-2021 window for which booster coverage is nontrivial.
```{r}
#| eval: false
## Figure 6: Booster coverage vs death rate
start_date <- as_date("YYYY-MM-DD")
end_date <- as_date("YYYY-MM-DD")
fig6_dat <- dat |>
# filter to the selected period
# summarize deaths during the period
# join booster coverage at the end date
# compute rates
NULL
# make the plot
```
------------------------------------------------------------------------
## Deliverables
1. Push your `hw4` folder to your Git repo that contains:
- `code/funcs.R`
- `code/wrangle.R`
- `code/analysis.qmd`
- `data/dat.rds`
- at least six PNG files in `figs`
Submit a link to your GitHub repo.
2. A rendered PDF of this QMD file that should display both code and output:
```
quarto render hw4.qmd --to pdf
```