--- title: "Week 11b: Activity" format: pdf editor: source editor_options: chunk_output_type: console --- ### Last Time - Spatial EDA - GP models to spatial data - Spatial Prediction / Model Choice ### This week - More spatial prediction - Anisotropic Spatial Models --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE) knitr::opts_chunk$set(message = FALSE, warning = F) library(tidyverse) library(scoringRules) library(knitr) ``` ## Conditional Multivariate Normal Theory: Kriging Recall, the conditional distribution, $p(\boldsymbol{Y_1}| \boldsymbol{Y_2}, \boldsymbol{\beta}, \sigma^2, \phi, \tau^2)$ is normal with: - $E[\boldsymbol{Y_1}| \boldsymbol{Y_2}] = \boldsymbol{\mu_1} + \Sigma_{12} \Sigma_{22}^{-1} (\boldsymbol{Y_2} - \mu_2)$ \vfill - $Var[\boldsymbol{Y_1}| \boldsymbol{Y_2}] = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$ \vfill ## Posterior Predictive Distribution The (posterior) predictive distribution $p(Y(\boldsymbol{s_0})|y)$ can be written as $$p(Y(\boldsymbol{s_0})|y) = \int p(Y(\boldsymbol{s_0})|y, \boldsymbol{\theta})p( \boldsymbol{\theta}|y) d \boldsymbol{\theta}$$ where $\boldsymbol{\theta} = \{\boldsymbol{\beta}, \sigma^2, \phi, \tau^2\}$. \vfill The posterior predictive distribution gives a probabilistic forecast for the outcome of interest that does not depend on any unknown parameters. \vfill These previous STAN code uses the Kriging idea to create a posterior predictive distribution at other locations, which results in a probabilitic prediction. \vfill ## Model Evaluation for Prediction We will use cross-validation or a test/training approach to compare predictive models. \vfill Consider three data structures: continuous, count, and binary; how should we evaluate predictions in these situations? ## Loss Functions - Loss functions penalize predictions that deviate from the true values. \vfill - For continuous or count data, squared error loss and absolute error loss are common. \vfill - With binary data, a zero-one loss is frequently used. However, these metrics are all focused on point estimates. \vfill If we think about outcomes distributionally, empirical coverage probability can be considered. For instance, our 95 % prediction intervals should, on average, have roughly 95 % coverage. \vfill With interval predictions, the goal is to have a concentrated predictive distribution around the outcome. \vfill ## CRPS The Continuous Rank Probability Score (CRPS) defined as $$CRPS(F,y) = \int_{-\infty}^{\infty}\left(F(u) - 1(u \geq y) \right)^2 du,$$ where $F$ is the CDF of the predictive distribution, is a metric that measures distance between an observed value and a distribution. ```{r, echo = T} crps_sample(y = 0, dat = 0) crps_sample(y = 0, dat = 2) crps_sample(y = 0, dat = c(2,-2)) crps_sample(y = 0, dat = c(2,0,-2)) crps_sample(y = 0, dat = c(2,0,0,0,0,-2)) ``` Consider four situations and sketch the predictive distribution and the resultant CRPS for each scenario. How does the MSE function in each setting? 1. Narrow predictive interval centered around outcome. \vfill 2. Wide predictive interval centered around outcome. \vfill 3. Narrow predictive interval with outcome in tail. \vfill 4. Wide predictive interval with outcome in tail. \vfill ## Covariance Functions Given the assumption that a Gaussian process is reasonable for the spatial process, a valid covariance function needs to be specified. \vfill Up to this point, we have largely worked with isotropic covariance functions. 
## Covariance Functions

Given the assumption that a Gaussian process is reasonable for the spatial process, a valid covariance function needs to be specified.
\vfill

Up to this point, we have largely worked with isotropic covariance functions. In particular, the exponential covariance function has primarily been used. However, a Gaussian process is flexible and can use any valid covariance function.
\vfill

A valid covariance function $C(\boldsymbol{h})$ needs to be a positive definite function, which includes the following properties:

1. $C(\boldsymbol{0}) \geq 0$
2. $|C(\boldsymbol{h})| \leq C(\boldsymbol{0})$

There are three approaches for building correlation functions. For all cases let $C_1, \dots, C_m$ be valid correlation functions:
\vfill

1. *Mixing:* $C(\boldsymbol{h}) = \sum_{i} p_i C_i(\boldsymbol{h})$ is also valid if $p_i \geq 0$ and $\sum_{i} p_i = 1$.
\vfill
2. *Products:* $C(\boldsymbol{h}) = \prod_{i} C_i(\boldsymbol{h})$ is also valid.
\vfill
3. *Convolution:* $C_{12}(\boldsymbol{h}) = \int C_1(\boldsymbol{h} - \boldsymbol{t})\, C_2(\boldsymbol{t})\, d\boldsymbol{t}$; this is based on a Fourier transform.
\vfill

## Smoothness

Many one-parameter isotropic covariance functions will be quite similar. Another consideration for choosing the correlation function is the theoretical smoothness of the process.
\vfill

The Matern class of covariance functions contains a parameter, $\nu$, that controls smoothness. With $\nu = \infty$ this is a Gaussian correlation function, and with $\nu = 1/2$ it results in an exponential correlation function.
\vfill

"Expressed in a different way, use of the Matern covariance function as a model enables the data to inform about $\nu$; we can learn about process smoothness despite observing the process at only a finite number of locations."
\vfill

## Anisotropy

Anisotropy means that the covariance function is not just a function of the distance $||\boldsymbol{h}||$, but also of the direction.
\vfill

Geometric anisotropy refers to the case where the coordinate space is anisotropic but can be transformed to an isotropic space.
\vfill

If the differences in spatial structure are directly related to the two coordinate sets (latitude and longitude), we can create a stationary, anisotropic covariance function.
\vfill

Let
$$cor(Y(\boldsymbol{s + h}), Y(\boldsymbol{s})) = \rho_1(h_y)\, \rho_2(h_x),$$
where $\rho_1()$ and $\rho_2()$ are proper correlation functions.
\vfill

In general, consider the correlation function
$$\rho(\boldsymbol{h}; \phi) = \rho_0(||L\boldsymbol{h}||; \phi),$$
where $L$ is a $d \times d$ matrix that controls the transformation.
\vfill

Let $\boldsymbol{Y}(\boldsymbol{s}) = \mu(\boldsymbol{s}) + w(\boldsymbol{s}) + \epsilon(\boldsymbol{s})$, so that $\boldsymbol{Y}(\boldsymbol{s}) \sim N(\mu(\boldsymbol{s}), \Sigma(\tau^2, \sigma^2, \phi, B))$, where $B = L^T L$.
\vfill

The covariance matrix is defined as $\Sigma(\tau^2, \sigma^2, \phi, B) = \tau^2 I + \sigma^2 H\left((\boldsymbol{h}^T B \boldsymbol{h})^{\frac{1}{2}}\right)$, where $H\left((\boldsymbol{h}^T B \boldsymbol{h})^{\frac{1}{2}}\right)$ has entries $\rho\left((\boldsymbol{h_{ij}}^T B \boldsymbol{h_{ij}})^{\frac{1}{2}}\right)$, with $\rho()$ a valid correlation function, typically including $\phi$, and $\boldsymbol{h_{ij}} = \boldsymbol{s_i} - \boldsymbol{s_j}$.
\vfill

$B$ is often referred to as a transformation matrix; it rotates and scales the coordinates such that the resulting transformation can be simplified to a distance.
\vfill
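As a minimal sketch of how this anisotropic correlation might be computed, the function below evaluates $\rho(\boldsymbol{h}) = \exp\left(-(\boldsymbol{h}^T B \boldsymbol{h})^{\frac{1}{2}}\right)$ for a pair of locations; absorbing $\phi$ into $B$ and the particular example matrix are simplifying assumptions for illustration.

```{r, echo = T}
# Sketch: anisotropic exponential correlation rho(h) = exp(-(h' B h)^(1/2)),
# assuming phi has been absorbed into B for simplicity
aniso_cor <- function(s1, s2, B) {
  h <- s1 - s2                          # separation vector
  exp(-sqrt(drop(t(h) %*% B %*% h)))    # anisotropic "distance", then exponential decay
}

B_example <- matrix(c(2, 0, 0, 1), nrow = 2)  # stretches the x-direction

# same Euclidean distance from the origin, different correlations
aniso_cor(c(0, 0), c(1, 0), B_example)
aniso_cor(c(0, 0), c(0, 1), B_example)
```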
\newpage

## Geometric Anisotropy Visual

- Consider four points positioned on a unit circle.

```{r, fig.width=4, fig.height = 4, fig.align = 'center'}
x <- c(-1, 0, 0, 1)
y <- c(0, -1, 1, 0)

# helper to draw a circle of radius r centered at (xc, yc)
gg_circle <- function(r, xc, yc, color = "black", fill = NA, ...) {
  x <- xc + r * cos(seq(0, pi, length.out = 100))
  ymax <- yc + r * sin(seq(0, pi, length.out = 100))
  ymin <- yc + r * sin(seq(0, -pi, length.out = 100))
  annotate("ribbon", x = x, ymin = ymin, ymax = ymax, color = color, fill = fill, ...)
}

data.frame(x = x, y = y) %>%
  ggplot(aes(x = x, y = y)) +
  gg_circle(r = 1, xc = 0, yc = 0, color = 'gray') +
  geom_point(shape = c('1', '2', '3', '4'), size = 5) +
  theme_minimal()
```

\vfill

Now consider a set of correlation functions. For each, calculate the correlation matrix and discuss the impact of $B$ on the correlation. Furthermore, how does $B$ change the geometry of the correlation between points 1, 2, 3, and 4?
\vfill

1. $\rho(\boldsymbol{h_{ij}}) = \exp\left(-(\boldsymbol{h_{ij}}^T B \boldsymbol{h_{ij}})^{\frac{1}{2}}\right),$ where $B = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ \end{pmatrix}$
\vfill
2. $\rho(\boldsymbol{h_{ij}}) = \exp\left(-(\boldsymbol{h_{ij}}^T B \boldsymbol{h_{ij}})^{\frac{1}{2}}\right),$ where $B = \begin{pmatrix} 2 & 0 \\ 0 & 1 \\ \end{pmatrix}$
\vfill
3. $\rho(\boldsymbol{h_{ij}}) = \exp\left(-(\boldsymbol{h_{ij}}^T B \boldsymbol{h_{ij}})^{\frac{1}{2}}\right),$ where $B = \begin{pmatrix} 3 & 1 \\ 1 & 1 \\ \end{pmatrix}$
\vfill

\newpage
\vfill

Okay, so if we suspect that geometric anisotropy is present, how do we fit the model? That is, what is necessary to estimate this model?
\vfill

- In addition to $\sigma^2$ and $\tau^2$, we need to fit $B$.
\vfill
- While $B$ is a matrix, it is just another unknown parameter.
\vfill
- To fit a Bayesian model, we need a prior distribution for $B$. One option for a positive definite matrix is the Wishart distribution, which is a bit like a matrix-variate gamma distribution (a sampling sketch follows).
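\vfill

As a rough illustration (not part of the original activity), the sketch below draws candidate $B$ matrices from a Wishart prior using `rWishart()`; the degrees of freedom and scale matrix are illustrative choices, not values from a fitted model.

```{r, echo = T}
# Sketch: prior draws of the 2 x 2 positive definite matrix B from a Wishart
# distribution; df = 3 and an identity scale matrix are illustrative choices
set.seed(11)
B_draws <- rWishart(n = 3, df = 3, Sigma = diag(2))

B_draws[, , 1]  # one prior draw of B

# each draw implies a different anisotropic "distance" for the same separation h
h <- c(1, 0)
apply(B_draws, 3, function(B) sqrt(drop(t(h) %*% B %*% h)))
```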