---
title: "Lab 1 Tutorial | PAF 516"
subtitle: "Measurement, Indexing & Multi-Scale Mapping"
format:
  html:
    theme: readable
    highlight-style: tango
    toc: true
    toc-title: "Steps"
    self-contained: true
    number-sections: false
execute:
  echo: true
  warning: false
  message: false
  fig-width: 10
---

::: {.callout-tip}
## How to use this file
**First time:** Click the **Render** button (or Cmd+Shift+K / Ctrl+Shift+K) to produce a complete HTML with all results. Then come back and run chunk by chunk to understand each step.

**Chunk by chunk:** Place cursor inside any chunk → Cmd+Return (Mac) / Ctrl+Enter (PC).

This file is for learning — do not edit it. When you are ready for the assignment, download `Lab1_Assignment.qmd` from the [Lab 1 Assignment page](https://antjam-howell.github.io/paf-516-labs/lab1-assignment.html).
:::


```{r}
#| label: setup
#| include: false
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, fig.width = 10)
options(tigris_use_cache = TRUE)
options(tigris_progress_bar = FALSE)
```


```{r}
#| label: packages
# ── One-time course environment setup ─────────────────────────────────────
if (!file.exists("~/.paf516_ready")) {
  if (!requireNamespace("renv", quietly = TRUE)) install.packages("renv")
  tmp <- tempfile(fileext = ".lock")
  download.file(
    url      = "https://raw.githubusercontent.com/AntJam-Howell/paf-516-labs/main/renv.lock",
    destfile = tmp, mode = "wb"
  )
  renv::restore(lockfile = tmp, prompt = FALSE)
  file.create("~/.paf516_ready")
}

library(tidyverse)
library(tidycensus)
library(sf)
library(tigris)
library(psych)
library(corrplot)
library(viridis)
library(ggplot2)
library(patchwork)
```

```{r}
#| label: build-tools
#| eval: false
# ── One-time build tools setup ─────────────────────────────────────────────
# R needs a system-level compiler to install certain packages in this course.
# This is a one-time setup — you will not need to repeat it in future labs.
#
# Windows: run these two lines in your RStudio Console:
#   install.packages("installr")
#   installr::install.Rtools()
#
# Mac: run this line in your Terminal (Applications → Utilities → Terminal):
#   xcode-select --install
#
# Once complete, restart RStudio, then run the packages chunk above.
```

```{r}
#| label: api-key
# Get a free Census API key at: https://api.census.gov/data/key_signup.html
# After receiving your key by email, run this line ONCE in the RStudio console:
# census_api_key("YOUR_KEY_HERE", install = TRUE)
# Then restart R: Session → Restart R
# You only need to do this once — the key is saved to your .Renviron file.
```

```{r}
#| label: fallback-data
# ── FALLBACK: Load pre-built datasets if Census API is unavailable ─────
# Uncomment Option B lines below and skip Steps 1–4.
#
# Option A — from local file (save .rds files to PAF516/Lab1/data/ first):
# county_complete <- readRDS("../data/hardship_counties_us.rds")
# CenDF_complete  <- readRDS("../data/hardship_tracts_az.rds")
# cbg_complete    <- readRDS("../data/hardship_cbg_maricopa.rds")
#
# Option B — directly from GitHub:
# county_complete <- readRDS(url("https://raw.githubusercontent.com/AntJam-Howell/paf-516-labs/main/Lab1/data/hardship_counties_us.rds"))
# CenDF_complete  <- readRDS(url("https://raw.githubusercontent.com/AntJam-Howell/paf-516-labs/main/Lab1/data/hardship_tracts_az.rds"))
# cbg_complete    <- readRDS(url("https://raw.githubusercontent.com/AntJam-Howell/paf-516-labs/main/Lab1/data/hardship_cbg_maricopa.rds"))
#
# If using fallback data, skip to Step 5.
```


# Step 1: Retrieve Census Variables

We build an **Economic Hardship Index** from 5 computed rates at the Arizona tract level. Each rate requires a numerator/denominator pair from the ACS, so the variable list contains 10 census codes — but they collapse into 5 measures: poverty rate, unemployment rate, no high school diploma rate, no health insurance rate, and median income.

```{r}
#| label: step.01
#| results: hide

hardship_vars <- c(
  pov_below50    = "C17002_002",   # Income-to-poverty ratio under .50
  pov_50to99     = "C17002_003",   # Income-to-poverty ratio .50 to .99
  pov_den        = "C17002_001",   # Total population (poverty universe)
  median_income  = "B19013_001",   # Median household income
  pct_unemp_num  = "B23025_005",   # Unemployed population
  pct_unemp_den  = "B23025_002",   # In labor force
  pct_no_hs_num  = "B15003_002",   # Population with less than high school
  pct_no_hs_den  = "B15003_001",   # Total population 25+
  pct_no_ins_num = "B27010_017",   # Without health insurance (18–34)
  pct_no_ins_den = "B27010_001"    # Total population (insurance universe)
)

CenDF <- get_acs(
  geography    = "tract",
  variables    = hardship_vars,
  state        = "AZ",
  year         = 2023,
  survey       = "acs5",
  geometry     = TRUE,
  progress_bar = FALSE
)
```

```{r}
#| label: step.01-offline
#| eval: false
# ── Offline fallback — run this block instead if the Census API is down ──
# Replace the get_acs() chunk above with these readRDS() calls.
# All datasets are pre-built and stored in data/ for offline use.
#
# CenDF_complete <- readRDS("../data/hardship_tracts_az.rds")
# CenDF_complete <- readRDS("../data/CenDF_complete_5var.rds")
```

**Variable justification:**

| Variable | Census Code | Why Included |
|----------|------------|--------------|
| Poverty rate | C17002_002 + _003 / _001 | Direct measure of economic deprivation |
| Unemployment rate | B23025_005 / B23025_002 | Labor market hardship |
| No high school diploma | B15003_002 / B15003_001 | Education predicts economic outcomes |
| No health insurance | B27010_017 / B27010_001 | Reflects economic vulnerability |
| Median household income | B19013_001 | Core income measure (reverse coded) |


# Step 2: Compute Rates and Standardize

Reshape from long to wide, compute percentage variables, and standardize to z-scores.

```{r}
#| label: step.02

CenDF_wide <- CenDF %>%
  select(-moe) %>%
  pivot_wider(names_from = variable, values_from = estimate)

CenDF_wide <- CenDF_wide %>%
  mutate(
    poverty_rate = (pov_below50 + pov_50to99) / pov_den,
    unemp_rate   = pct_unemp_num / pct_unemp_den,
    no_hs_rate   = pct_no_hs_num / pct_no_hs_den,
    no_ins_rate  = pct_no_ins_num / pct_no_ins_den
  )

# Select indicator variables and standardize
index_vars <- CenDF_wide %>%
  st_drop_geometry() %>%
  select(poverty_rate, unemp_rate, no_hs_rate, no_ins_rate, median_income) %>%
  drop_na()

index_z <- index_vars %>%
  mutate(across(everything(), ~ as.numeric(scale(.)))) %>%
  mutate(median_income = -1 * median_income)  # reverse-code: high income = low hardship
```


# Step 3: Cronbach's Alpha

Assess whether the 5 items reliably measure the same underlying construct.

```{r}
#| label: step.03

alpha_result <- psych::alpha(index_z)
print(alpha_result)
```

```{r}
#| label: alpha-summary

alpha_value <- alpha_result$total$raw_alpha
cat("Cronbach's alpha:", round(alpha_value, 3), "\n")
```

**Interpreting the output:** The `raw_alpha` is our Cronbach's alpha. The "Reliability if an item is dropped" section shows what alpha would be without each variable — if dropping an item increases alpha, it may not fit the construct well.


# Step 4: Build the Index and Visualize

```{r}
#| label: step.04a

cor_matrix <- cor(index_z, use = "complete.obs")

corrplot(
  cor_matrix,
  method      = "color",
  type        = "upper",
  addCoef.col = "black",
  tl.col      = "black",
  tl.srt      = 45,
  title       = "Inter-Item Correlation Matrix: Economic Hardship Index",
  mar         = c(0, 0, 2, 0)
)
```

```{r}
#| label: step.04b

index_z$hardship_index <- rowMeans(index_z, na.rm = TRUE)

ggplot(index_z, aes(x = hardship_index)) +
  geom_histogram(bins = 40, fill = "#8c1d40", color = "white", alpha = 0.8) +
  geom_vline(xintercept = mean(index_z$hardship_index, na.rm = TRUE),
             linetype = "dashed", color = "#ffc627", linewidth = 1) +
  theme_minimal() +
  labs(
    title    = "Distribution of Economic Hardship Index",
    subtitle = "Arizona Census Tracts (ACS 2023)",
    x        = "Economic Hardship Index (z-score scale)",
    y        = "Number of Tracts",
    caption  = "Dashed line = mean"
  )
```

```{r}
#| label: step.04c

CenDF_complete <- CenDF_wide %>%
  drop_na(poverty_rate, unemp_rate, no_hs_rate, no_ins_rate, median_income)
CenDF_complete$hardship_index <- index_z$hardship_index
```


# Step 5: National County-Level Map

Pull the same 5 variables at the **county level** for the entire US and map national hardship patterns.

```{r}
#| label: step.05-pull
#| results: hide

county_df <- get_acs(
  geography    = "county",
  variables    = hardship_vars,
  year         = 2023,
  survey       = "acs5",
  geometry     = TRUE,
  progress_bar = FALSE
)
```

```{r}
#| label: step.05-pull-offline
#| eval: false
# ── Offline fallback — run this block instead if the Census API is down ──
# Replace the get_acs() chunk above with these readRDS() calls.
# All datasets are pre-built and stored in data/ for offline use.
#
# county_complete <- readRDS("../data/hardship_counties_us.rds")
# county_complete <- readRDS("../data/county_complete.rds")
```

```{r}
#| label: step.05-index

county_wide <- county_df %>%
  select(-moe) %>%
  pivot_wider(names_from = variable, values_from = estimate) %>%
  mutate(
    poverty_rate = (pov_below50 + pov_50to99) / pov_den,
    unemp_rate   = pct_unemp_num / pct_unemp_den,
    no_hs_rate   = pct_no_hs_num / pct_no_hs_den,
    no_ins_rate  = pct_no_ins_num / pct_no_ins_den
  )

county_index <- county_wide %>%
  st_drop_geometry() %>%
  select(poverty_rate, unemp_rate, no_hs_rate, no_ins_rate, median_income) %>%
  drop_na() %>%
  mutate(across(everything(), ~ as.numeric(scale(.)))) %>%
  mutate(median_income = -1 * median_income)

county_index$hardship_index <- rowMeans(county_index, na.rm = TRUE)

county_complete <- county_wide %>%
  drop_na(poverty_rate, unemp_rate, no_hs_rate, no_ins_rate, median_income)
county_complete$hardship_index <- county_index$hardship_index
```

```{r}
#| label: step.05-map

county_shifted <- shift_geometry(county_complete)

ggplot(county_shifted) +
  geom_sf(aes(fill = hardship_index), color = NA, linewidth = 0) +
  scale_fill_viridis_c(option = "inferno", name = "Hardship\nIndex", na.value = "grey80") +
  theme_void() +
  labs(
    title    = "Economic Hardship Across U.S. Counties",
    subtitle = "ACS 5-Year (2023) | Higher values = greater hardship",
    caption  = "Alaska and Hawaii shown as insets"
  )
```


# Step 6: Arizona Census Tract Map

```{r}
#| label: step.06-map

ggplot(CenDF_complete) +
  geom_sf(aes(fill = hardship_index), color = NA) +
  scale_fill_viridis(option = "inferno", direction = -1, name = "Hardship\nIndex") +
  theme_void() +
  labs(
    title    = "Economic Hardship Index by Census Tract",
    subtitle = "Arizona — ACS 5-Year (2023)",
    caption  = "Higher values = greater economic hardship"
  )
```


# Step 7: Maricopa County Block Group Map

Block groups are the smallest ACS geography, revealing fine-grained variation within tracts.

```{r}
#| label: step.07-pull
#| results: hide

cbg_df <- get_acs(
  geography    = "block group",
  variables    = hardship_vars,
  state        = "AZ",
  county       = "Maricopa",
  year         = 2023,
  survey       = "acs5",
  geometry     = TRUE,
  progress_bar = FALSE
)
```

```{r}
#| label: step.07-pull-offline
#| eval: false
# ── Offline fallback — run this block instead if the Census API is down ──
# Replace the get_acs() chunk above with these readRDS() calls.
# All datasets are pre-built and stored in data/ for offline use.
#
# cbg_complete <- readRDS("../data/hardship_cbg_maricopa.rds")
# cbg_complete <- readRDS("../data/cbg_complete.rds")
```

```{r}
#| label: step.07-index

cbg_wide <- cbg_df %>%
  select(-moe) %>%
  pivot_wider(names_from = variable, values_from = estimate) %>%
  mutate(
    poverty_rate = (pov_below50 + pov_50to99) / pov_den,
    unemp_rate   = pct_unemp_num / pct_unemp_den,
    no_hs_rate   = pct_no_hs_num / pct_no_hs_den,
    no_ins_rate  = pct_no_ins_num / pct_no_ins_den
  )

cbg_index <- cbg_wide %>%
  st_drop_geometry() %>%
  select(poverty_rate, unemp_rate, no_hs_rate, no_ins_rate, median_income) %>%
  drop_na() %>%
  mutate(across(everything(), ~ as.numeric(scale(.)))) %>%
  mutate(median_income = -1 * median_income)

cbg_index$hardship_index <- rowMeans(cbg_index, na.rm = TRUE)

cbg_complete <- cbg_wide %>%
  drop_na(poverty_rate, unemp_rate, no_hs_rate, no_ins_rate, median_income)
cbg_complete$hardship_index <- cbg_index$hardship_index
```

```{r}
#| label: step.07-map

ggplot(cbg_complete) +
  geom_sf(aes(fill = hardship_index), color = NA) +
  scale_fill_viridis(option = "inferno", direction = -1, name = "Hardship\nIndex") +
  theme_void() +
  labs(
    title    = "Economic Hardship Index by Block Group",
    subtitle = "Maricopa County, AZ — ACS 5-Year (2023)",
    caption  = "Higher values = greater economic hardship"
  )
```


# Step 8: Enriched Index — Does Measurement Choice Change the Map?

The 5-variable baseline captures five well-established dimensions of economic hardship. But hardship is multi-dimensional. A broader index might also reflect:

| New Variable | Dimension | Census Code |
|---|---|---|
| Renter-occupied rate | Housing instability / wealth exclusion | B25003_003 / B25003_001 |
| No vehicle access | Transportation barrier | B08201_002 / B08201_001 |
| Single-parent household rate | Family structure stress | B11012_010 / B11001_001 |
| Public assistance income rate | Direct economic need | B19057_002 / B19057_001 |
| No internet access rate | Digital divide / economic barrier | B28002_013 / B28002_001 |

Adding these five produces a **10-variable enriched index**. The key question: does this broader measurement change which communities we identify as hardship hotspots — and how much?

```{r}
#| label: step.08-pull
#| results: hide

hardship_vars_10 <- c(
  # ── Original 5 variable pairs ──────────────────────────────────────
  pov_below50    = "C17002_002",   pov_50to99     = "C17002_003",
  pov_den        = "C17002_001",   median_income  = "B19013_001",
  pct_unemp_num  = "B23025_005",   pct_unemp_den  = "B23025_002",
  pct_no_hs_num  = "B15003_002",   pct_no_hs_den  = "B15003_001",
  pct_no_ins_num = "B27010_017",   pct_no_ins_den = "B27010_001",
  # ── 5 new dimensions ───────────────────────────────────────────────
  renter_num     = "B25003_003",   renter_den     = "B25003_001",
  noveh_num      = "B08201_002",   noveh_den      = "B08201_001",
  sinpar_num     = "B11012_010",   sinpar_den     = "B11001_001",
  pubast_num     = "B19057_002",   pubast_den     = "B19057_001",
  noinet_num     = "B28002_013",   noinet_den     = "B28002_001"
)

CenDF_10 <- get_acs(
  geography    = "tract",
  variables    = hardship_vars_10,
  state        = "AZ",
  year         = 2023,
  survey       = "acs5",
  geometry     = TRUE,
  progress_bar = FALSE
)
```

```{r}
#| label: step.08-pull-offline
#| eval: false
# ── Offline fallback — run this block instead if the Census API is down ──
# Replace the get_acs() chunk above with these readRDS() calls.
# All datasets are pre-built and stored in data/ for offline use.
#
# CenDF_complete_10 <- readRDS("../data/CenDF_complete_10var.rds")
```

```{r}
#| label: step.08-build10

CenDF_wide_10 <- CenDF_10 %>%
  select(-moe) %>%
  pivot_wider(names_from = variable, values_from = estimate) %>%
  mutate(
    # Original 5
    poverty_rate = (pov_below50 + pov_50to99) / pov_den,
    unemp_rate   = pct_unemp_num  / pct_unemp_den,
    no_hs_rate   = pct_no_hs_num  / pct_no_hs_den,
    no_ins_rate  = pct_no_ins_num / pct_no_ins_den,
    # New 5
    renter_rate  = renter_num / renter_den,
    noveh_rate   = noveh_num  / noveh_den,
    sinpar_rate  = sinpar_num / sinpar_den,
    pubast_rate  = pubast_num / pubast_den,
    noinet_rate  = noinet_num / noinet_den
  )

CenDF_complete_10 <- CenDF_wide_10 %>%
  drop_na(poverty_rate, unemp_rate, no_hs_rate, no_ins_rate,
          median_income, renter_rate, noveh_rate,
          sinpar_rate, pubast_rate, noinet_rate)

index_z_10 <- CenDF_complete_10 %>%
  st_drop_geometry() %>%
  select(poverty_rate, unemp_rate, no_hs_rate, no_ins_rate,
         median_income, renter_rate, noveh_rate,
         sinpar_rate, pubast_rate, noinet_rate) %>%
  mutate(across(everything(), ~ as.numeric(scale(.)))) %>%
  mutate(median_income = -1 * median_income)   # reverse-code

# Compare alpha: 5-variable vs 10-variable
alpha_5  <- psych::alpha(index_z  |> select(-hardship_index))
alpha_10 <- psych::alpha(index_z_10)

cat("5-variable alpha: ", round(alpha_5$total$raw_alpha,  3), "\n")
cat("10-variable alpha:", round(alpha_10$total$raw_alpha, 3), "\n")
cat("Change:           ", round(alpha_10$total$raw_alpha - alpha_5$total$raw_alpha, 3), "\n")

# Build enriched index
index_z_10$hardship_index_10 <- rowMeans(index_z_10, na.rm = TRUE)
CenDF_complete_10$hardship_index_10 <- index_z_10$hardship_index_10
```

```{r}
#| label: step.08-compare-maps
#| fig-width: 16
#| fig-height: 6

p_5var <- ggplot(CenDF_complete) +
  geom_sf(aes(fill = hardship_index), color = NA) +
  scale_fill_viridis_c(option = "inferno", name = "Hardship",
                       na.value = "grey80") +
  theme_void() +
  labs(title = "5-Variable Index",
       subtitle = "Poverty · Unemployment · Education\nInsurance · Income")

p_10var <- ggplot(CenDF_complete_10) +
  geom_sf(aes(fill = hardship_index_10), color = NA) +
  scale_fill_viridis_c(option = "inferno", name = "Hardship",
                       na.value = "grey80") +
  theme_void() +
  labs(title = "10-Variable Enriched Index",
       subtitle = "Adds renter · no vehicle · single-parent\npublic assistance · no internet")

p_5var + p_10var +
  plot_annotation(
    title    = "Original vs. Enriched Economic Hardship Index",
    subtitle = "Does expanding measurement change the spatial story?",
    caption  = "ACS 5-Year (2023) | Arizona Census Tracts"
  )
```

```{r}
#| label: step.08-difference
#| fig-width: 9
#| fig-height: 6

compare_df <- CenDF_complete_10 %>%
  select(GEOID, geometry, hardship_index_10) %>%
  left_join(
    CenDF_complete %>% st_drop_geometry() %>% select(GEOID, hardship_index),
    by = "GEOID"
  ) %>%
  mutate(diff_10_minus_5 = hardship_index_10 - hardship_index)

ggplot(compare_df) +
  geom_sf(aes(fill = diff_10_minus_5), color = NA) +
  scale_fill_gradient2(
    low = "#4575b4", mid = "#f7f7f7", high = "#d73027",
    midpoint = 0, name = "10-var\nminus\n5-var"
  ) +
  theme_void() +
  labs(
    title    = "Where Does Measurement Choice Matter Most?",
    subtitle = "Red = tracts scoring higher with enriched index\nBlue = tracts scoring lower",
    caption  = "ACS 5-Year (2023) | Difference = 10-variable minus 5-variable index"
  )
```

**Interpreting the difference map:** Red tracts are communities where the enriched index reveals *more hardship* than the baseline captured — likely places with high renter rates, low vehicle access, or many single-parent households that census income and poverty measures undercount. Blue tracts are places where those new dimensions don't add to the hardship signal — often wealthier suburbs where some of those variables don't apply.

---

::: {.callout-note}
## Ready for the assignment?
Head to the [Lab 1 Assignment page](https://antjam-howell.github.io/paf-516-labs/lab1-assignment.html) to download `Lab1_Assignment.qmd`. The assignment asks you to add **SNAP receipt** as an 11th variable and show whether one more dimension still meaningfully shifts the difference map.
:::