--- title: "Lab 4 Tutorial | PAF 516" subtitle: "Spatial Autocorrelation & Hot Spot Analysis" format: html: theme: readable highlight-style: tango toc: true toc-title: "Steps" self-contained: true number-sections: false execute: echo: true warning: false message: false fig-width: 10 --- ::: {.callout-tip} ## How to use this file **First time:** Click **Render** (Cmd+Shift+K / Ctrl+Shift+K) to see all results. Then run chunk by chunk to understand each step. This file is for learning — do not edit it. When ready for the assignment, go to: [https://antjam-howell.github.io/paf-516-labs/lab4-assignment.html](https://antjam-howell.github.io/paf-516-labs/lab4-assignment.html) ::: ```{r} #| label: setup #| include: false knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, fig.width = 10) options(tigris_use_cache = TRUE, tigris_progress_bar = FALSE) ``` ```{r} #| label: packages # ── One-time course environment setup ───────────────────────────────────── if (!file.exists("~/.paf516_ready")) { if (!requireNamespace("renv", quietly = TRUE)) install.packages("renv") tmp <- tempfile(fileext = ".lock") download.file( url = "https://raw.githubusercontent.com/AntJam-Howell/paf-516-labs/main/renv.lock", destfile = tmp, mode = "wb" ) renv::restore(lockfile = tmp, prompt = FALSE) file.create("~/.paf516_ready") } library(tidyverse) library(tidycensus) library(sf) library(spdep) # Spatial dependence: weights, Moran's I, LISA library(viridis) library(patchwork) ``` ```{r} #| label: api-key # Get a free Census API key at: https://api.census.gov/data/key_signup.html # Run ONCE in RStudio console: census_api_key("YOUR_KEY_HERE", install = TRUE) # Then restart R. ``` ```{r} #| label: fallback-data # Option A (local): cbg_raw <- readRDS("../data/cbg_hardship_maricopa_2023.rds") # Option B (GitHub): # cbg_raw <- readRDS(url("https://raw.githubusercontent.com/AntJam-Howell/paf-516-labs/main/Lab4/data/cbg_hardship_maricopa_2023.rds")) # If using fallback, skip to Step 2. ``` # Step 1: Pull Maricopa County Block Group Data By now you have done this four times. We pull the same five ACS hardship variables used in Lab 1, but this time at the **Census Block Group (CBG)** level for Maricopa County. CBGs are smaller than tracts (roughly 600--3,000 people), giving us finer spatial resolution to detect local clustering. ```{r} #| label: step.01-vars hardship_vars <- c( pov_below50 = "C17002_002", pov_50to99 = "C17002_003", pov_den = "C17002_001", median_income = "B19013_001", unemp_num = "B23025_005", unemp_den = "B23025_002", no_hs_num = "B15003_002", no_hs_den = "B15003_001", no_ins_num = "B27010_017", no_ins_den = "B27010_001" ) ``` ```{r} #| label: step.01-pull #| results: hide cbg_raw <- get_acs( geography = "block group", variables = hardship_vars, state = "AZ", county = "Maricopa", year = 2023, survey = "acs5", geometry = TRUE, progress_bar = FALSE ) ``` ```{r} #| label: step.01-pull-offline #| eval: false # ── Offline fallback — run this block instead if the Census API is down ── # Replace the get_acs() chunk above with these readRDS() calls. # All datasets are pre-built and stored in data/ for offline use. # # cbg_raw <- readRDS("../data/cbg_index_lab4_complete.rds") ``` # Step 2: Build the 5-Variable Economic Hardship Index ```{r} #| label: step.02-wrangle cbg_wide <- cbg_raw %>% select(-moe) %>% pivot_wider(names_from = variable, values_from = estimate) %>% mutate( poverty_rate = (pov_below50 + pov_50to99) / pov_den, unemp_rate = unemp_num / unemp_den, no_hs_rate = no_hs_num / no_hs_den, no_ins_rate = no_ins_num / no_ins_den ) cbg_index <- cbg_wide %>% filter(!is.na(poverty_rate), !is.na(unemp_rate), !is.na(no_hs_rate), !is.na(no_ins_rate), !is.na(median_income)) %>% mutate( z_poverty = as.numeric(scale(poverty_rate)), z_unemp = as.numeric(scale(unemp_rate)), z_no_hs = as.numeric(scale(no_hs_rate)), z_no_ins = as.numeric(scale(no_ins_rate)), z_income = -1 * as.numeric(scale(median_income)) ) %>% mutate( hardship_index = (z_poverty + z_unemp + z_no_hs + z_no_ins + z_income) / 5 ) cat("CBGs with valid index:", nrow(cbg_index), "\n") ``` ```{r} #| label: step.02-map #| fig-height: 8 ggplot(cbg_index) + geom_sf(aes(fill = hardship_index), color = NA) + scale_fill_viridis(option = "inferno", name = "Hardship\nIndex") + labs( title = "Economic Hardship Index — Maricopa County CBGs", subtitle = "Higher values = greater hardship (z-score composite)", caption = "Source: ACS 2019-2023 5-Year Estimates" ) + theme_void() ``` The map shows a clear visual pattern: high-hardship CBGs appear to cluster together in certain parts of the county, particularly in south Phoenix and parts of the west valley. But is this clustering statistically significant, or could it be random? That is the question spatial autocorrelation answers. # Step 3: Build Spatial Weights Before we can test for spatial clustering, we need to define what "neighbor" means. We use **queen contiguity**: two CBGs are neighbors if they share any boundary point (including corners), similar to how a queen moves in chess. ```{r} #| label: step.03-weights # Fix any geometry errors before building neighbor lists cbg_index <- st_make_valid(cbg_index) # poly2nb(): identify which CBGs share a boundary nb <- poly2nb(cbg_index, queen = TRUE) # nb2listw(): convert to row-standardized spatial weights # style="W" means each neighbor gets weight 1/n (n = number of neighbors) # zero.policy=TRUE handles any island CBGs with no neighbors lw <- nb2listw(nb, style = "W", zero.policy = TRUE) ``` ::: {.callout-tip} ## Row-standardization explained If CBG A has 3 neighbors, each gets weight 1/3. If CBG B has 6 neighbors, each gets weight 1/6. This way, the **spatial lag** (weighted average of neighbors) is a true average regardless of how many neighbors a polygon has. ::: # Step 4: Global Moran's I Global Moran's I is a single summary statistic for the **entire study area**. It answers: "Overall, is the economic hardship index spatially clustered, randomly distributed, or dispersed?" - **I near +1** = strong positive spatial autocorrelation (clustering) - **I near 0** = spatial randomness - **I near -1** = spatial dispersion (checkerboard) ```{r} #| label: step.04-morans moran_global <- moran.test( cbg_index$hardship_index, lw, zero.policy = TRUE ) moran_global ``` ```{r} #| label: step.04-interpret cat("Moran's I statistic:", round(moran_global$estimate["Moran I statistic"], 4), "\n") cat("Expected I under randomness:", round(moran_global$estimate["Expectation"], 4), "\n") cat("p-value:", format.pval(moran_global$p.value, digits = 3), "\n") ``` **Interpretation:** The Global Moran's I is positive and the p-value is extremely small (well below 0.05), meaning we reject the null hypothesis of spatial randomness. Hardship values are **significantly clustered** in Maricopa County --- CBGs with high hardship tend to be surrounded by other high-hardship CBGs, and vice versa. # Step 5: Moran Scatter Plot The Moran scatter plot visualizes the relationship between each CBG's hardship value (x-axis) and the average hardship of its neighbors --- the **spatial lag** (y-axis). The slope of the fitted line equals Global Moran's I. The four quadrants define the type of spatial association: - **HH (upper-right):** High value surrounded by high neighbors (hot spot) - **LL (lower-left):** Low value surrounded by low neighbors (cold spot) - **HL (lower-right):** High value surrounded by low neighbors (spatial outlier) - **LH (upper-left):** Low value surrounded by high neighbors (spatial outlier) ```{r} #| label: step.05-scatter #| fig-height: 7 moran.plot( cbg_index$hardship_index, lw, zero.policy = TRUE, xlab = "Economic Hardship Index (standardized)", ylab = "Spatial Lag of Economic Hardship Index", main = "Moran Scatter Plot — Economic Hardship", pch = 20, col = "steelblue" ) ``` Most points fall in the HH and LL quadrants, confirming positive spatial autocorrelation. The relatively few points in the HL and LH quadrants are spatial outliers --- CBGs that differ markedly from their neighbors. # Step 6: Local Moran's I (LISA) Global Moran's I tells us clustering exists *overall*, but not *where*. **Local Indicators of Spatial Association (LISA)** decompose the global statistic into a value for **each CBG**, telling us: 1. Whether that specific CBG is part of a significant cluster 2. What *type* of cluster it belongs to (HH, LL, HL, or LH) ```{r} #| label: step.06-lisa lisa <- localmoran( cbg_index$hardship_index, lw, zero.policy = TRUE ) # Attach LISA results to the spatial data frame cbg_index$lisa_i <- lisa[, "Ii"] cbg_index$lisa_z <- lisa[, "Z.Ii"] cbg_index$lisa_p <- lisa[, "Pr(z != E(Ii))"] # Compute the spatial lag for quadrant classification cbg_index$lag_hardship <- lag.listw(lw, cbg_index$hardship_index, zero.policy = TRUE) ``` ```{r} #| label: step.06-classify cbg_index <- cbg_index %>% mutate( val_centered = hardship_index - mean(hardship_index, na.rm = TRUE), lag_centered = lag_hardship - mean(lag_hardship, na.rm = TRUE), quadrant = case_when( val_centered > 0 & lag_centered > 0 ~ "HH", val_centered < 0 & lag_centered < 0 ~ "LL", val_centered > 0 & lag_centered < 0 ~ "HL", val_centered < 0 & lag_centered > 0 ~ "LH", TRUE ~ "NS" ), lisa_cluster = if_else(lisa_p <= 0.05, quadrant, "NS") ) table(cbg_index$lisa_cluster) ``` # Step 7: LISA Cluster Map This is the payoff map. Each CBG is colored by its LISA classification: - **Red (HH):** Hot spots --- high-hardship CBGs surrounded by high-hardship neighbors - **Blue (LL):** Cold spots --- low-hardship CBGs surrounded by low-hardship neighbors - **Light coral (HL):** High-hardship outliers in low-hardship areas - **Light blue (LH):** Low-hardship outliers in high-hardship areas - **Grey (NS):** Not statistically significant at p <= 0.05 ```{r} #| label: step.07-lisa-map #| fig-height: 9 lisa_colors <- c( "HH" = "red", "LL" = "blue", "HL" = "#F08080", "LH" = "#ADD8E6", "NS" = "grey90" ) cbg_index$lisa_cluster <- factor( cbg_index$lisa_cluster, levels = c("HH", "LL", "HL", "LH", "NS") ) ggplot(cbg_index) + geom_sf(aes(fill = lisa_cluster), color = "white", linewidth = 0.05) + scale_fill_manual( values = lisa_colors, labels = c("HH (Hot Spot)", "LL (Cold Spot)", "HL (High-Low Outlier)", "LH (Low-High Outlier)", "NS (Not Significant)"), name = "LISA Cluster" ) + labs( title = "LISA Cluster Map — Economic Hardship Index", subtitle = "Maricopa County CBGs | Queen contiguity, p <= 0.05", caption = "Source: ACS 2019-2023 5-Year Estimates" ) + theme_void() + theme(legend.position = "right") ``` ```{r} #| label: step.07-side-by-side #| fig-height: 8 p1 <- ggplot(cbg_index) + geom_sf(aes(fill = hardship_index), color = NA) + scale_fill_viridis(option = "inferno", name = "Hardship\nIndex") + labs(title = "Economic Hardship Index (Raw Values)") + theme_void() p2 <- ggplot(cbg_index) + geom_sf(aes(fill = lisa_cluster), color = "white", linewidth = 0.05) + scale_fill_manual(values = lisa_colors, name = "LISA\nCluster") + labs(title = "LISA Clusters (Significant Only)") + theme_void() p1 + p2 ``` The side-by-side comparison shows the power of LISA analysis: the left map shows the raw pattern (which we could see visually), while the right map tells us which parts of that pattern are **statistically significant**. Many areas that *look* like they might be clusters turn out to be not significant at the 0.05 level. --- ::: {.callout-note} ## Ready for the assignment? Head to the [Lab 4 Assignment page](https://antjam-howell.github.io/paf-516-labs/lab4-assignment.html) to download `Lab4_Assignment.qmd`. The assignment asks you to run LISA on a single component variable (poverty rate) and compare the hot spot pattern to the composite index. :::