---
title: "Lab 3 Tutorial | PAF 516"
subtitle: "Spatial Data Integration & Environmental Justice"
format:
  html:
    theme: readable
    highlight-style: tango
    toc: true
    toc-title: "Steps"
    self-contained: true
    number-sections: false
execute:
  echo: true
  warning: false
  message: false
  fig-width: 10
---

::: {.callout-tip}
## How to use this file
**First time:** Click **Render** (Cmd+Shift+K / Ctrl+Shift+K) to see all results. Then run chunk by chunk to understand each step.

This file is for learning — do not edit it. When ready for the assignment, go to:
[https://antjam-howell.github.io/paf-516-labs/lab3-assignment.html](https://antjam-howell.github.io/paf-516-labs/lab3-assignment.html)
:::

```{r}
#| label: setup
#| include: false
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, fig.width = 10)
options(tigris_use_cache = TRUE, tigris_progress_bar = FALSE)
```

```{r}
#| label: packages
# ── One-time course environment setup ─────────────────────────────────────
if (!file.exists("~/.paf516_ready")) {
  if (!requireNamespace("renv", quietly = TRUE)) install.packages("renv")
  tmp <- tempfile(fileext = ".lock")
  download.file(
    url      = "https://raw.githubusercontent.com/AntJam-Howell/paf-516-labs/main/renv.lock",
    destfile = tmp, mode = "wb"
  )
  renv::restore(lockfile = tmp, prompt = FALSE)
  file.create("~/.paf516_ready")
}

library(tidyverse)
library(tidycensus)
library(sf)
library(tigris)
library(psych)
library(patchwork)
library(viridis)
```

```{r}
#| label: api-key
# Get a free Census API key at: https://api.census.gov/data/key_signup.html
# Run ONCE in RStudio console: census_api_key("YOUR_KEY_HERE", install = TRUE)
# Then restart R.
```

```{r}
#| label: fallback-data
# Option A (local): cbg_raw <- readRDS("../data/cbg_hardship_maricopa.rds")
# Option B (GitHub):
# cbg_raw <- readRDS(url("https://raw.githubusercontent.com/AntJam-Howell/paf-516-labs/main/Lab3/data/cbg_hardship_maricopa.rds"))
# If using fallback, skip to Step 2.
```


# Step 1: Pull Maricopa County Block Group Data

```{r}
#| label: step.01-vars

hardship_vars <- c(
  pov_below50    = "C17002_002",  pov_50to99  = "C17002_003",
  pov_den        = "C17002_001",  median_income = "B19013_001",
  pct_unemp_num  = "B23025_005",  pct_unemp_den = "B23025_002",
  pct_no_hs_num  = "B15003_002",  pct_no_hs_den = "B15003_001",
  pct_no_ins_num = "B27010_017",  pct_no_ins_den= "B27010_001"
)
```

```{r}
#| label: step.01-pull
#| results: hide

cbg_raw <- get_acs(
  geography = "block group", variables = hardship_vars,
  state = "AZ", county = "Maricopa",
  year = 2023, survey = "acs5", geometry = TRUE, progress_bar = FALSE
)
```

```{r}
#| label: step.01-pull-offline
#| eval: false
# ── Offline fallback — run this block instead if the Census API is down ──
# Replace the get_acs() chunk above with these readRDS() calls.
# All datasets are pre-built and stored in data/ for offline use.
#
# cbg_complete <- readRDS("../data/cbg_complete.rds")
# cbg_proj     <- readRDS("../data/cbg_proj.rds")
```


# Step 2: Build the 5-Variable Economic Hardship Index

```{r}
#| label: step.02-wrangle

cbg_wide <- cbg_raw %>%
  select(-moe) %>%
  pivot_wider(names_from = variable, values_from = estimate) %>%
  mutate(
    poverty_rate = (pov_below50 + pov_50to99) / pov_den,
    unemp_rate   = pct_unemp_num / pct_unemp_den,
    no_hs_rate   = pct_no_hs_num / pct_no_hs_den,
    no_ins_rate  = pct_no_ins_num / pct_no_ins_den
  )

cbg_complete <- cbg_wide %>%
  drop_na(poverty_rate, unemp_rate, no_hs_rate, no_ins_rate, median_income)

index_vars <- cbg_complete %>%
  st_drop_geometry() %>%
  select(poverty_rate, unemp_rate, no_hs_rate, no_ins_rate, median_income)

index_z <- index_vars %>%
  mutate(across(everything(), ~ as.numeric(scale(.)))) %>%
  mutate(median_income = -1 * median_income)

index_z$hardship_index <- rowMeans(index_z[, 1:5], na.rm = TRUE)
cbg_complete$hardship_index <- index_z$hardship_index

cat("CBGs with valid index:", nrow(cbg_complete), "\n")
```


# Step 3: Load External Point Data (Environmental Facilities)

```{r}
#| label: step.04-facilities

facilities <- tibble(
  name = c("APS Ocotillo Power Plant","Intel Chandler Fab 32",
            "Honeywell Aerospace","Univar Solutions",
            "Goodyear Airport Industrial Park","Former Motorola 52nd Street Site",
            "19th Avenue Landfill","Arizona Plating Works",
            "Phoenix Sky Harbor Area (TRI cluster)","Mesa NW Industrial (TRI cluster)"),
  type = c("Power Generation","Semiconductor Mfg","Aerospace Mfg",
            "Chemical Distribution","Industrial/Airport","Superfund",
            "Superfund/Landfill","Superfund","TRI Cluster","TRI Cluster"),
  lat  = c(33.3919,33.2974,33.4352,33.4423,33.4350,
            33.4162,33.5070,33.4445,33.4372,33.4512),
  lon  = c(-111.8828,-111.8156,-112.0694,-111.9738,-112.3797,
            -111.9763,-112.1171,-112.0738,-112.0167,-111.8935)
)
cat("Facilities:", nrow(facilities), "\n")
```

::: {.callout-warning}
## Note on facility data
This is a **representative sample** of 10 major EPA-regulated and Superfund sites in Maricopa County — not a comprehensive listing. The full EPA TRI database contains hundreds of facilities. Using only 10 sites produces sparse inner buffers (roughly 14 block groups within 0.5 miles, ~59 within 1 mile), so treat the distance-decay pattern as illustrative of the method rather than a definitive EJ finding. For a real analysis, download the full TRI dataset from `https://www.epa.gov/toxics-release-inventory-tri-program`.
:::

```{r}
#| label: step.04-note
#| include: false
# Placeholder — note displayed above
```


# Step 4: Convert to sf and Align CRS

```{r}
#| label: step.05-crs

facilities_sf   <- st_as_sf(facilities, coords = c("lon", "lat"), crs = 4326)
facilities_proj <- st_transform(facilities_sf, crs = 2868)
cbg_proj        <- st_transform(cbg_complete, crs = 2868) |> st_make_valid()
cat("Both layers in EPSG:2868 (Arizona Central, US survey feet)\n")
```


# Step 5: Compute Distance to Nearest Facility

Rather than just asking "is this block group inside a buffer?" we compute the **exact distance from each block group's centroid to the nearest EPA facility**. This gives a continuous proximity measure and enables the distance-decay analysis in Step 7.

```{r}
#| label: step.06-distance

# Compute centroids of each block group
cbg_centroids <- st_centroid(cbg_proj)

# Distance matrix: every CBG centroid to every facility (feet)
dist_matrix <- st_distance(cbg_centroids, facilities_proj)

# Keep only the distance to the NEAREST facility
cbg_proj$dist_nearest_ft <- apply(dist_matrix, 1, min) |> as.numeric()
cbg_proj$dist_nearest_mi <- cbg_proj$dist_nearest_ft / 5280

cat("Median distance to nearest facility:", round(median(cbg_proj$dist_nearest_mi), 2), "miles\n")
cat("% of CBGs within 1 mile:", round(100 * mean(cbg_proj$dist_nearest_mi < 1), 1), "%\n")
cat("% of CBGs within 0.5 miles:", round(100 * mean(cbg_proj$dist_nearest_mi < 0.5), 1), "%\n")
```


# Step 6: Facility Density Map + Hardship Overlay

The compelling EJ visual: where are facilities located, and does hardship follow the same pattern?

```{r}
#| label: step.07-map
#| fig-width: 14
#| fig-height: 6

cbg_wgs84 <- st_transform(cbg_proj, 4326)
fac_wgs84 <- st_transform(facilities_proj, 4326)

# Hardship choropleth with facility locations overlaid
p_hardship_fac <- ggplot() +
  geom_sf(data = cbg_wgs84, aes(fill = hardship_index), color = NA) +
  scale_fill_viridis_c(option = "inferno", name = "Hardship\nIndex",
                       na.value = "grey80") +
  geom_sf(data = fac_wgs84, color = "white", fill = "#ffc627",
          shape = 21, size = 3.5, stroke = 1.2) +
  theme_void() +
  labs(
    title    = "Economic Hardship & EPA Facility Locations",
    subtitle = "Maricopa County Block Groups — Gold dots = EPA-regulated / Superfund facilities",
    caption  = "ACS 5-Year (2023) + EPA TRI / Superfund data"
  )

# Distance to nearest facility choropleth
p_distance <- ggplot(cbg_wgs84) +
  geom_sf(aes(fill = dist_nearest_mi), color = NA) +
  scale_fill_viridis_c(option = "plasma", direction = -1,
                       name = "Miles to\nNearest\nFacility",
                       na.value = "grey80") +
  geom_sf(data = fac_wgs84, color = "white", fill = "black",
          shape = 21, size = 3, stroke = 1) +
  theme_void() +
  labs(
    title    = "Distance to Nearest EPA Facility",
    subtitle = "Darker = closer to facility",
    caption  = "CRS: EPSG:2868 → displayed in WGS84"
  )

p_hardship_fac + p_distance +
  plot_annotation(
    title = "Do High-Hardship Block Groups Cluster Near EPA Facilities?"
  )
```


# Step 7: Distance-Decay Analysis — The EJ Finding

**The core environmental justice question:** Does hardship *decrease* as you move farther from EPA facilities? If yes, proximity to environmental burden compounds socioeconomic disadvantage — exactly what EJ research predicts.

```{r}
#| label: step.08-decay

# Create distance bands (the exposure zones you are comparing)
cbg_decay <- cbg_proj |>
  st_drop_geometry() |>
  mutate(
    dist_band = cut(
      dist_nearest_mi,
      breaks = c(0, 0.5, 1.0, 2.0, 5.0, Inf),
      labels = c("< 0.5 mi", "0.5–1 mi", "1–2 mi", "2–5 mi", "> 5 mi"),
      right  = TRUE, include.lowest = TRUE
    )
  )

# Compute mean hardship and poverty rate within each distance band
decay_summary <- cbg_decay |>
  group_by(dist_band) |>
  summarise(
    n_cbg           = n(),
    mean_hardship   = round(mean(hardship_index,   na.rm = TRUE), 3),
    mean_poverty    = round(mean(poverty_rate * 100, na.rm = TRUE), 1),
    mean_unemp      = round(mean(unemp_rate * 100,   na.rm = TRUE), 1),
    .groups = "drop"
  )

print(decay_summary)
```

```{r}
#| label: step.08-decay-plot
#| fig-width: 12
#| fig-height: 5

# Hardship distance-decay bar chart
p_decay_hardship <- ggplot(decay_summary, aes(x = dist_band, y = mean_hardship)) +
  geom_col(aes(fill = mean_hardship), color = "white") +
  scale_fill_viridis_c(option = "inferno", direction = 1, guide = "none") +
  geom_text(aes(label = round(mean_hardship, 2)),
            vjust = -0.4, fontface = "bold", size = 3.5) +
  theme_minimal(base_size = 12) +
  labs(
    title    = "Economic Hardship Decreases with Distance from EPA Facilities",
    subtitle = "Mean economic hardship index by distance band | Maricopa County Block Groups",
    x        = "Distance to Nearest EPA Facility",
    y        = "Mean Economic Hardship Index",
    caption  = "Distance-decay pattern: UGCoP predicts this relationship changes with buffer threshold choice"
  )

# Poverty rate distance-decay
p_decay_poverty <- ggplot(decay_summary, aes(x = dist_band, y = mean_poverty)) +
  geom_col(aes(fill = mean_poverty), color = "white") +
  scale_fill_viridis_c(option = "rocket", direction = 1, guide = "none") +
  geom_text(aes(label = paste0(round(mean_poverty, 1), "%")),
            vjust = -0.4, fontface = "bold", size = 3.5) +
  theme_minimal(base_size = 12) +
  labs(
    title    = "Poverty Rate by Distance from EPA Facilities",
    subtitle = "Higher poverty near facilities — lower near clean suburbs",
    x        = "Distance to Nearest EPA Facility",
    y        = "Mean Poverty Rate (%)"
  )

p_decay_hardship + p_decay_poverty +
  plot_annotation(
    caption = "ACS 5-Year (2023) + EPA TRI data | Maricopa County Block Groups"
  )
```

**Reading the chart:** If hardship and poverty are highest in the "< 0.5 mi" bar and decrease across the distance bands, the data support the EJ hypothesis: **poorer, more disadvantaged communities are disproportionately located near EPA-regulated facilities**. Wealthier neighborhoods are farther away.


# Step 8: Multi-Buffer UGCoP Comparison

**The UGCoP question:** How sensitive is this finding to the buffer threshold? We run the analysis at three distances — 0.5, 1.0, and 2.0 miles — and compare results.

```{r}
#| label: step.09-ugcop

# Compute mean hardship for CBGs inside vs. outside each buffer distance
buffer_comparison <- tibble(
  buffer_mi = c(0.5, 1.0, 2.0)
) |>
  rowwise() |>
  mutate(
    inside  = mean(cbg_proj$hardship_index[cbg_proj$dist_nearest_mi <= buffer_mi], na.rm = TRUE),
    outside = mean(cbg_proj$hardship_index[cbg_proj$dist_nearest_mi >  buffer_mi], na.rm = TRUE),
    n_inside  = sum(cbg_proj$dist_nearest_mi <= buffer_mi, na.rm = TRUE),
    n_outside = sum(cbg_proj$dist_nearest_mi >  buffer_mi, na.rm = TRUE),
    difference = inside - outside
  ) |>
  ungroup() |>
  mutate(across(c(inside, outside, difference), ~ round(., 3)))

cat("=== UGCoP: How buffer distance changes the EJ finding ===\n\n")
print(buffer_comparison |>
        select(buffer_mi, n_inside, inside, outside, difference) |>
        rename(`Buffer (mi)` = buffer_mi,
               `N inside`   = n_inside,
               `Mean hardship inside` = inside,
               `Mean hardship outside`= outside,
               `Difference (in - out)` = difference))
```

```{r}
#| label: step.09-ugcop-plot
#| fig-width: 11
#| fig-height: 5

# Reshape for grouped bar chart
buf_long <- buffer_comparison |>
  select(buffer_mi, inside, outside) |>
  pivot_longer(cols = c(inside, outside),
               names_to  = "group",
               values_to = "mean_hardship") |>
  mutate(
    group = factor(group, levels = c("inside","outside"),
                   labels = c("Inside buffer","Outside buffer"))
  )

ggplot(buf_long,
       aes(x = factor(buffer_mi), y = mean_hardship, fill = group)) +
  geom_col(position = "dodge", color = "white", width = 0.65) +
  geom_text(aes(label = round(mean_hardship, 2)),
            position = position_dodge(width = 0.65),
            vjust = -0.4, size = 3.5, fontface = "bold") +
  scale_fill_manual(
    values = c("Inside buffer" = "#d73027", "Outside buffer" = "#4575b4"),
    name   = NULL
  ) +
  scale_x_discrete(labels = paste0(c(0.5, 1.0, 2.0), "-mile buffer")) +
  theme_minimal(base_size = 12) +
  labs(
    title    = "UGCoP: Mean Hardship Inside vs. Outside Buffer — Three Distances",
    subtitle = "Does the 'proximity penalty' shrink or grow as the buffer expands?",
    x        = "Buffer Distance",
    y        = "Mean Economic Hardship Index",
    caption  = "Red = inside buffer | Blue = outside | ACS 2023 + EPA TRI data"
  )
```

**Reading the chart:** If the gap between inside and outside is consistent across all three buffer distances, the finding is **robust** — it doesn't depend on where you draw the line. If the gap disappears or reverses at one distance, the finding is **threshold-sensitive** and the UGCoP is producing a misleading result at that threshold. This is exactly what Kwan (2012) warned about.

---

::: {.callout-note}
## Ready for the assignment?
Head to the [Lab 3 Assignment page](https://antjam-howell.github.io/paf-516-labs/lab3-assignment.html) to download `Lab3_Assignment.qmd`. The assignment asks you to test a different distance threshold and interpret what the distance-decay pattern tells you about the UGCoP.
:::