---
title: "Week 15: Activity"
format: gfm
editor: source
editor_options: 
  chunk_output_type: console
---

### Last Time

- Intro to Areal Data
- Areal Data Visualization
- Assessing Spatial Structure in Areal Data

### This Time

- Assessing Spatial Structure in Areal Data
- Spatial Smoothing with Areal Data


---


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(message = FALSE)
library(tidyverse)
library(ggmap)
library(knitr)
library(gtools)
library(mgcv)
library(mnormt)
library(arm)
library(rstanarm)
library(rstan)
library(viridis)
library(spdep)
library(urbnmapr)

```


## Spatial Association

There are two common statistics used for assessing spatial association:
Moran's I and Geary's C.

\vfill

Moran's I
$I =n \sum_i \sum_j w_{ij} (Y_i - \bar{Y})(Y_j -\bar{Y}) / (\sum_{i\neq j \;w_{ij}})\sum_i(Y_i - \bar{Y})^2$

\vfill

Moran's I is analogous to correlation, where values close to 1 exhibit spatial clustering and values near -1 show spatial regularity (checkerboard effect).

\vfill

Geary's C
$C=(n-1)\sum_i \sum_j w_{ij}(Y_i-Y_j)^2 / 2(\sum_{i \neq j \; w_{ij}})\sum_i (Y_i - \bar{Y})^2$

\vfill

Geary's C is more similar to a variogram (has a connection to Durbin-Watson in 1-D). The statistics ranges from 0 to 2; values close to 2 exhibit clustering and values close to 0 show repulsion. A Geary's C near 1 would be expected with no spatial structure.

\vfill

## Spatial Association Exercise

Consider the following scenarios and use the following 4-by-4 grid

```{r, echo = F}
d4 <- tibble(xmin = rep(c(3.5, 2.5, 1.5, 0.5), each = 4),
             x = rep(4:1, each =4),      
             xmax = rep(c(3.5, 2.5, 1.5, 0.5), each = 4) +1,
             ymin = rep(c(3.5, 2.5, 1.5, 0.5), 4), 
             y = rep(4:1, 4),
             ymax = rep(c(3.5, 2.5, 1.5, 0.5), 4) +1,
             rpos = rep(4:1, 4),
             cpos = rep(1:4, each = 4),
             id=16:1)
ggplot() + 
  scale_x_continuous(name="column") + 
  scale_y_continuous(name="row",breaks = 1:4,labels = c('4','3','2', '1') ) +
  geom_rect(data=d4, mapping=aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax), color="black", alpha=0.05) +
  geom_text(data=d4, aes(x=xmin+(xmax-xmin)/2, y=ymin+(ymax-ymin)/2, label=id), size=4) +
  theme_minimal()
```

and proximity matrix 

```{r}
W <- matrix(0, 16, 16)
for (i in 1:16){
  W[i,] <- as.numeric((d4$rpos[i] == d4$rpos & (abs(d4$cpos[i] - d4$cpos) == 1)) | 
                        (d4$cpos[i] == d4$cpos & (abs(d4$rpos[i] - d4$rpos) == 1)))
}
head(W)
```
\vfill

for each scenario plot the grid, calculate I and G, along with permutation-based p-values.


1. Simulate data where the responses are i.i.d. N(0,1). 

```{r, echo = T}
#| warning: false
d4$z <- rnorm(16)

ggplot() + 
  geom_tile(data=d4, mapping=aes(x = x, y = y, fill = z)) +
  geom_text(data=d4, aes(x=xmin+(xmax-xmin)/2, y=ymin+(ymax-ymin)/2, label=id), size=4) +
  theme_minimal() + scale_fill_gradient2(midpoint=0, low="blue", mid="white",
                     high="red", space ="Lab" )

moran.test(d4$z, mat2listw(W), alternative = 'two.sided')
#moran.plot(d4$z, mat2listw(W))
geary.test(d4$z, mat2listw(W), alternative = 'two.sided')
```


\vfill


2. Simulate data and calculate I and G for a 4-by-4 grid with a chess board approach, where "black squares" $\sim N(-2,1)$ and "white squares" $\sim N(2,1)$.


\vfill


3. Simulate multivariate normal response on a 4-by-4 grid where $y \sim N(0, (I- \rho W)^{-1})$, where $\rho = .3$ is a correlation parameter and $W$ is a proximity matrix.


## Spatial Smoothing

Spatial smoothing results in a "smoother" spatial surface, by sharing information from across the neighborhood structure.

\vfill

This smoothing is akin to fitted values (expected values) in a traditional modeling framework.

\vfill

One option is replacing $Y_i$ with

$$\hat{Y}_i = \sum_j w_{ij} Y_j / w_{i+}$$
where $w_{i+} = \sum_j w_{ij}$.


\vfill

What are some pros and cons of this smoother?

\vfill

### "Exponential" smoother

Another option would be to use:
$$\hat{Y}_i^* = (1 - \alpha) Y_i + \hat{Y}_i$$

\vfill

Compare $\hat{Y}_i^*$ with $\hat{Y}_i$.

\vfill

What is the impact of $\alpha?$ _This is essentially the exponential smoother from time series._

## Areal Data Example


Recall the motivating image, 

```{r, out.width = "95%", echo = F, fig.align = 'center', fig.cap='source: https://www.politico.com/election-results/2018/montana/'}
knitr::include_graphics("MT_Map.png") 
```

using the following dataset

```{r}
Tester <- read_csv('Tester_Results.csv')
Tester <- Tester %>% 
  mutate(Tester_Prop = TESTER / (TESTER + ROSENDALE + BRECKENRIDGE),
         county_fips = as.character(FIPS))
```

recreate the image. Recall `urbnmapr::counties` has a shape file with county level boundaries.

recreate the image.

## Adjacency Matrix
Follow the code below to create an adjacency matrix for Montana. 

```{r, eval = T, echo = T}

MT_sf <- get_urbn_map("counties", sf = TRUE) |>
  filter(state_abbv == 'MT') |>
  arrange(county_fips)

mt_W <- nb2mat(poly2nb(MT_sf), style = 'B')
MT_list <- nb2listw(poly2nb(MT_sf), style = 'B')

```


## Moran's I / Geary's C

Using the Tester - Rosendale election results and the adjacency matrix compute and interpret Moran's I and Geary's C with the proportion voting for Tester.


## Now consider some covariates to explain the response

Consider a linear model with county population

```{r, echo = T}
library(usmap)
usmap::countypop
```

Extract the residuals create a choropleth and test for spatial structure.


<!-- ### Areal Data Models: Disease Mapping -->

<!-- Areal data with counts is often associated with disease mapping, where there are two quantities for each areal unit: -->
<!-- $Y_i =$  observed number of cases of disease in county i and $E_i =$ expected number of cases of disease in county i. -->

<!-- Note this can also be used to model generic count data on area units. -->


<!-- \vfill -->
<!-- One way to think about the expected counts is -->
<!-- $$E_i = n_i \bar{r} = n_i \left(\sum_j y_j / \sum_j n_j  \right),$$ -->
<!-- where $\bar{r}$ is the overall disease rate and $n_i$ is the population for region $i$. -->

<!-- \vfill -->
<!-- However note that $\bar{r},$  and hence, $E_i$ is a not fixed, but is a function of the data. This is called *internal standardization*. -->

<!-- \vfill -->

<!-- An alternative is to use some standard rate for a given age group, such that $E_i = \sum_j n_{ij} r_j.$ This is *external standardization.* -->

<!-- \vfill -->


<!-- Often counts are assumed to follow the Poisson model where -->
<!-- $$Y_i|\eta_i \sim Poisson(E_i \eta_i),$$ -->
<!-- where $\eta_i$ is the relative risk of the disease in region $i$.  This quantity is known as the *standardized morbidity ratio* (SMR). -->

<!-- \vfill -->

<!-- Then the MLE of $\eta_i$ is $Y_i /E_i$. -->


<!-- #### Poisson-Gamma Model -->

<!-- Consider the following framework -->
<!-- $Y_i | \eta_i \sim Pois(E_i \eta_i)$ and $\eta_i \sim Gamma(a,b)$ -->
<!-- where the gamma distribution has mean $a/b$ and variance is $a/b^2$. -->

<!-- \vfill -->

<!-- This can be reparameterized such that $a = \mu^2 / \sigma^2$ and $b = \mu /\sigma^2$. -->


<!-- \vfill -->

<!-- For the Poisson sampling model, the gamma prior is conjugate. This means that the posterior distribution $p(\eta_i | y_i)$ is also a gamma distribution, and in particular, the posterior distribution $p(\eta_i|y_i)$ is $Gamma(y_i + a, E_i + b)$. -->

<!-- \vfill -->

<!-- The mean of this distribution is -->
<!-- \begin{eqnarray*} -->
<!-- E(\eta_i | \boldsymbol{y}) = E(\eta_i| y_i) &=& \frac{y_i + a}{E_i + b} = \frac{y_i + \frac{\mu^2}{\sigma^2}}{E_i + \frac{\mu}{\sigma^2}}\\ -->
<!-- &=& \frac{E_i (\frac{y_i}{E_i})}{E_i + \frac{\mu}{\sigma^2}} + \frac{(\frac{\mu}{\sigma^2})\mu}{E_i + \frac{\mu}{\sigma2}}\\ -->
<!-- &=& w_i SMR_i + (1 - w_i) \mu, -->
<!-- \end{eqnarray*} -->
<!-- where $w_i = \frac{E_i}{E_i + (\mu / \sigma^2)}$ -->


<!-- \vfill -->

<!-- Thus the point estimate is a weighted average of the data-based SMR for region $i$ and the prior mean $\mu$. -->

<!-- #### Poisson-lognormal models -->
<!-- The model can be written as  -->
<!-- \begin{eqnarray*} -->
<!-- Y_i | \psi_i &\sim& Poisson(E_i \exp(\psi_i))\\ -->
<!-- \psi_i &=& \boldsymbol{x_i^T}\boldsymbol{\beta} + \theta_i + \phi_i -->
<!-- \end{eqnarray*} -->

<!-- where $\boldsymbol{x_i}$ are spatial covariates, $\theta_i$ corresponds to region wide heterogeneity, and $\psi_i$ captures local clustering. -->


<!-- ### Brook's Lemma and Markov Random Fields -->

<!-- To consider areal data from a model-based perspective, it is necessary to obtain the joint distribution of the responses  -->
<!-- $$p(y_1, \dots, y_n).$$ -->

<!-- \vfill -->

<!-- From the joint distribution, the *full conditional distribution* -->
<!-- $$p(y_i|y_j, j \neq i),$$ -->
<!-- is uniquely determined. -->

<!-- \vfill -->

<!-- Brook's Lemma states that the joint distribution can be obtained from the full conditional distributions. -->

<!-- \vfill -->

<!-- When the areal data set is large, working with the full conditional distributions can be preferred to the full joint distribution. -->

<!-- \vfill -->
<!-- More specifically, the response $Y_i$ should only directly depend on the neighbors, hence, -->
<!-- $$p(y_i|y_j, j \neq i) = p(y_i|y_j, j \in \delta_i)$$ -->
<!-- where $\delta_i$ denotes the neighborhood around $i$. -->


<!-- ### Markov Random Field  -->

<!-- The idea of using the local specification for determining the global form of the distribution is Markov random field. -->

<!-- \vfill -->

<!-- An essential element of a MRF is a *clique*, which is a group of units where each unit is a neighbor of all units in the clique -->

<!-- \vfill -->

<!-- A *potential function* is a function that is exchangeable in the arguments. With continuous data a common potential is $(Y_i - Y_j)^2$ if $i \sim j$ ($i$ is a neighbor of $j$). -->

<!-- \vfill -->

<!-- #### Gibbs Distribution -->

<!-- A joint distribution $p(y_1, \dots, y_n)$ is a Gibbs distribution if it is a function of $Y_i$ only through the potential on cliques. -->

<!-- \vfill -->

<!-- Mathematically, this can be expressed as: -->
<!-- $$p(y_1, \dots, y_n) \propto \exp \left(\gamma \sum_k \sum_{\alpha \in \mathcal{M}_k} \phi^{(k)}(y_{\alpha_1},y_{\alpha_2}, \dots, y_{\alpha_k} ) \right),$$ -->
<!-- where $\phi^{(k)}$ is a potential of order $k$, $\mathcal{M}_k$ is the collection of all subsets of size $k = {1, 2, ...}$(typically restricted to 2 in spatial settings), $\alpha$ indexes the set in $\mathcal{M}_k$. -->

<!-- \vfill -->

<!-- \newpage -->

<!-- ### Hammersley-Clifford Theorem -->

<!-- The Hammersley-Clifford Theorem demonstrates that if we have a MRF that defines a unique joint distribution, then that joint distribution is a Gibbs distribution. -->

<!-- \vfill -->

<!-- The converse was later proved, showing that a MRF could be sampled from the associated Gibbs distribution (origination of Gibbs sampler). -->

<!-- \vfill -->

<!-- ### Model Specification -->

<!-- With continuous data, a common choice for the joint distribution is the pairwise difference -->
<!-- $$p(y_1, \dots, y_n) \propto \exp \left(-\frac{1}{2\tau^2} \sum_{i,j}(y_i - y_j)^2 I(i \sim j) \right)$$ -->
<!-- \vfill -->

<!-- Then the full conditional distributions can be written as -->
<!-- $$p(y_i|y_j, j \neq i) = N \left(\sum_{j \in \delta_i} y_i / m_i, \tau^2 / m_i \right)$$ -->
<!-- where $m_i$ are the number of neighbors for unit $i$. -->

<!-- \vfill -->
<!-- This results in a spatial smoother, where the mean of a response is the average of the neighbors. -->

<!-- ## Conditional Autoregressive Models -->
<!-- ## Gaussian Model -->

<!-- Suppose the full conditionals are specifed as -->
<!-- $$Y_i|y_j, j\neq i \sim N \left(\sum_j b_{ij} y_j, \tau_i^2 \right)$$ -->

<!-- \vfill -->

<!-- Then using Brooks' Lemma, the joint distribution is -->
<!-- $$p(y_1, \dots, y_n) \propto \exp \left(-\frac{1}{2}\boldsymbol{y}^T D^{-1} (I - B) \boldsymbol{y} \right),$$ -->
<!-- where $B$ is a matrix with entries $b_{ij}$ and D is a diagonal matrix with diagonal elements $D_{ii} = \tau_i^2$. -->

<!-- \vfill -->

<!-- The previous equation suggests a multivariate normal distribution, but $D^{-1}(I - B)$ should be symmetric. -->

<!-- \vfill -->

<!-- Symmetry requires $$\frac{b_{ij}}{\tau^2_i}=\frac{b_{ji}}{\tau^2_j}, \; \; \forall \; \; i , j$$ -->

<!-- \vfill -->

<!-- In general, $B$ is not symmetric, but setting $b_{ij} = w_{ij}/ w_{i+}$ and $\tau_i^2 = \tau^2 / w_{i+}$ satisfies the symmetry assumptions (given that we assume W is symmetric) -->

<!-- \vfill -->

<!-- \newpage -->

<!-- Now the full conditional distribution can be written as -->
<!-- $$Y_i|y_j, j\neq i \sim N \left(\sum_j w_{ij} y_j / w_{i+}, \tau^2 / w_{i+} \right)$$ -->

<!-- \vfill -->

<!-- Similarly the joint distribution is now -->
<!-- $$p(y_1, \dots, y_n) \propto \exp \left(-\frac{1}{2 \tau^2}\boldsymbol{y}^T  (D_w - W) \boldsymbol{y} \right)$$ -->
<!-- where $D_w$ is a diagonal matrix with diagonal entries $(D_w)_{ii} = w_{i+}$  -->

<!-- \vfill -->

<!-- The joint distribution can also be re-written as -->
<!-- $$p(y_1, \dots, y_n) \propto \exp \left(-\frac{1}{2 \tau^2} \sum_{i \neq j} w_{ij} (y_i - y_j)^2\right)$$ -->

<!-- \vfill -->

<!-- However, both these formulations results in an improper distribution. This could be solved with a constraint, such as $Y_i = 0$. -->

<!-- \vfill -->

<!-- The result is the joint distribution is improper, despite proper full conditional distributions. This model specification is often referred to as an *intrinsically autoregressive* model (IAR). -->

<!-- \vfill -->

<!-- \newpage -->

<!-- ## IAR -->

<!-- The IAR cannot be used to model data directly, rather this is used a prior specification and attached to random effects specified at the second stage of the hierarchical model. -->

<!-- \vfill -->

<!-- The impropriety can be remedied by defining a parameter $\rho$ such that $(D_w - W)$ becomes $(D_w - \rho W)$ such that this matrix is nonsingular. -->

<!-- \vfill -->

<!-- The parameter $\rho$ can be considered an extra parameter in the CAR model. -->

<!-- \vfill -->

<!-- With or without $\rho,$ $p(\boldsymbol{y})$ (or the Bayesian posterior when the CAR specification is placed on the spatial random effects) is proper. -->

<!-- \vfill -->

<!-- When using $\rho$, the full conditional becomes $$Y_i|y_j, j\neq i \sim N \left(\rho \sum_j w_{ij} y_j / w_{i+}, \tau^2 / w_{i+} \right)$$ -->

<!-- \vfill -->

<!-- ## Simultaneous Autoregression Model -->

<!-- Rather than specifying the distribution on $\boldsymbol{Y}$, as in the CAR specification, the distribution can be specified for $\boldsymbol{\epsilon}$ which induces a distribution for $\boldsymbol{Y}$. -->

<!-- \vfill -->

<!-- Let $\boldsymbol{\epsilon} \sim N(\boldsymbol{0}, \tilde{D})$, where $\tilde{D}$ is a diagonal matrix with elements $(\tilde{D})_{ii} = \sigma^2_i$. -->

<!-- \vfill -->

<!-- Now $Y_i = \sum_j b_{ij} Y_j + \epsilon_i$ or equivalently $(I-B)\boldsymbol{Y} = \boldsymbol{\epsilon}$. -->

<!-- \vfill -->

<!-- \newpage -->
<!-- ## SAR Model -->

<!-- If the matrix $(I - B)$ is full rank, then  -->
<!-- $$\boldsymbol{Y} \sim N \left(\boldsymbol{0},(I - B)^{-1} \tilde{D} ((I - B)^{-1})^T \right)$$ -->

<!-- \vfill -->

<!-- If $\tilde{D} = \sigma^2 I$, then $\boldsymbol{Y} \sim N \left(\boldsymbol{0},\sigma^2 \left[(I - B)  (I - B)^T \right]^{-1} \right)$ -->

<!-- \vfill -->

<!-- #### Choosing B -->

<!-- There are two common approaches for choosing B -->

<!-- \vfill -->

<!-- 1. $B = \rho W,$ where $W$ is a contiguity matrix with entries 1 and 0. The parameter $\rho$ is called the spatial autoregression parameter. -->

<!-- \vfill -->

<!-- 2. $B = \alpha \tilde{W}$ where $(\tilde{W})_{ij} = w_{ij} / w_{i+}.$ The $\alpha$ parameter is called the spatial autocorrelation parameter. -->

<!-- \vfill -->

<!-- SAR Models are often introduced in a regression context, where the residuals $(\boldsymbol{U})$ follow a SAR model. -->

<!-- \vfill -->

<!-- Let $\boldsymbol{U} = \boldsymbol{Y} - X \boldsymbol{\beta}$ and then $\boldsymbol{U} = B \boldsymbol{U} + \boldsymbol{\epsilon}$ which results in -->
<!-- $$\boldsymbol{Y} = B \boldsymbol{Y} + (I-B) X \boldsymbol{\beta} + \boldsymbol{\epsilon}$$ -->

<!-- \vfill -->

<!-- Hence the model contains a spatial weighting of neighbors $(B \boldsymbol{Y})$ and a regression component $((I-B) X \boldsymbol{\beta} )$. -->

<!-- \vfill -->

<!-- The SAR specification is not typically used in a GLM setting. -->

<!-- \vfill -->

<!-- SAR models are well suited for maximum likelihood -->

<!-- \vfill -->

<!-- Without specifying a hierarchical form, Bayesian sampling of random effects is more difficult than the CAR specification. -->