--- title: "Week 9: Activity Key" format: gfm editor: source editor_options: chunk_output_type: console --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, fig.height = 6, fig.width = 6) library(tidyverse) library(knitr) library(mnormt) library(plgp) library(reshape2) library(rstan) eps <- .Machine$double.eps ``` ### Last Week's Recap - Gaussian Process Intro - Bayesian inference with `stan` - Correlation functions ### This week - GPs in 2D - Bayesian inference with `stan` --- ### Geostatistical Data At last, we will look at simulated 2-d "spatial" data. ##### 1. Create Sampling Locations ```{r} #set.seed(03062025) num.locations <- 50 coords <- data.frame(x = runif(num.locations), y = runif(num.locations)) coords %>% ggplot(aes(x=x,y=y)) + geom_point() + ggtitle('Hypothetical Sampling Locations') + xlim(0,1) + ylim(0,1) + theme_bw() ``` ##### 2. Calculate Distances ```{r, echo = T} dist.mat <- sqrt(plgp::distance(coords)) ``` ##### 3. Define Covariance Function and Set Parameters Use exponential covariance with no nugget: ```{r} sigma.sq <- 1 phi <- .1 Sigma <- sigma.sq * exp(- dist.mat/phi) + diag(eps, num.locations) ``` ##### 4. Sample realization of the process - This requires a distributional assumption, we will use the Gaussian distribution ```{r} Y <- rmnorm(n=1, mean = 0, varcov = Sigma) ``` - What about the rest of the locations on the map? ##### 5. Vizualize Spatial Process Start with a coarse grid and them move to a finer grid ```{r} coords %>% mutate(Y = Y) %>% ggplot(aes(x=x,y=y)) + geom_point(aes(color=Y), size=2) + ggtitle(label = 'Simulated Spatial Process', subtitle = 'Exponential Covariance: sigma.sq = 1, phi = .1') + xlim(0,1) + ylim(0,1) + scale_colour_gradient2() + theme_dark() ``` \newpage Now we can look at more sampling locations ```{r, echo = F} dim.grid <- 10 grid.coords <- data.frame(x.grid = rep(seq(.05, .95, length.out=dim.grid), dim.grid), y.grid = rep(seq(.05, .95, length.out = dim.grid), each = dim.grid)) dist.grid <- sqrt(plgp::distance(grid.coords)) sigma.sq <- 1 phi <- .1 Sigma <- sigma.sq * exp(- dist.grid/phi) + diag(eps, dim.grid ^ 2) Y <- rmnorm(n=1, mean = 0, varcov = Sigma) grid.coords %>% mutate(Y = Y) %>% ggplot(aes(x=x.grid,y=y.grid)) + geom_point(aes(color=Y), size=3) + ggtitle('Simulated Spatial Process', subtitle = 'Exponential Covariance: sigma.sq = 1, phi = .1') + xlim(0,1) + ylim(0,1) + scale_colour_gradient2() + theme_dark() ``` ```{r, echo = F} dim.grid <- 50 grid.coords <- data.frame(x.grid = rep(seq(.05, .95, length.out=dim.grid), dim.grid), y.grid = rep(seq(.05, .95, length.out = dim.grid), each = dim.grid)) dist.grid <- sqrt(plgp::distance(grid.coords)) sigma.sq <- 1 phi <- .1 Sigma <- sigma.sq * exp(- dist.grid/phi) + diag(eps, dim.grid ^ 2) Y <- rmnorm(n=1, mean = 0, varcov = Sigma ) grid.coords %>% mutate(Y = Y) %>% ggplot(aes(x=x.grid,y=y.grid)) + geom_point(aes(color=Y), size=3) + ggtitle('Simulated Spatial Process', subtitle = 'Exponential Covariance: sigma.sq = 1, phi = .1') + xlim(0,1) + ylim(0,1) + scale_colour_gradient2() + theme_dark() ``` How does the spatial process change with: - another draw with same parameters? - a different value of $\phi$ - a different value of $\sigma^2$ --- ### Visual Overview of Bayesian Inference Using some Bridger Bowl weather data we will provide a visual overview of Bayesian Inference. *The goal will be to model the average winter high temperature at the base of Bridger Bowl.* 1. Prior Specification \vfill - First sketch a prior distribution that encapsulates your belief about what you believe the average high temperature would be. *Note this should obey law of total probability* \vfill - *Next we generally need to parameterize (perhaps approximately) this belief with some sort of probability distribution.* \vfill ```{r} temp_seq <- -10:50 prob_seq <- dnorm(temp_seq, mean = 20, sd = 10) tibble(temp = temp_seq, prob = prob_seq) %>% ggplot(aes(temp, prob)) + geom_line() + theme_bw() + ggtitle("Andy's Prior Belief: N(20, 10^2)") ``` *Formally, my prior is on the mean high temp, which we will denote $\mu$.* $$\mu \sim N(20, 10^2)$$ \vfill 2. Specify the sampling distribution for the data or perhaps in more familiar language, state the likelihood for the statistical model \vfill - *We will assume that the temperature readings are continuous (or "nearly continuous")* \vfill - *It seems reasonable to start with a normal distribution, so:* $$X|\mu, \sigma^2 \sim N(\mu, \sigma^2)$$ \vfill - Note that we also need to estimate $\sigma$ in this model and need a prior for that parameter too. \vfill - Grab some weather data from Bridger Bowl (roughly the first half of January 2021) ```{r} temp <- c(26, 45, 44, 36, 22, 25, 31, 31, 37, 34, 35, 37, 32, 31) ``` \vfill - Any concerns about using this data to inform our research question? \vfill 3. Posterior Inference \vfill - Using classical inference, how would you estimate $\mu$. \vfill - *Using maximum likelihood, $\hat{\mu}_{MLE} = \bar{X}$ = `r round(mean(temp))`.* \vfill - *With Bayesian inference, our posterior belief is based on the data __and__ our prior belief. Note this can be a blessing or a curse.* \vfill - Formally, we have a distribution for the maximum temperature (a posterior distribution): $p(\mu|x) = \int p(x|\mu,\sigma) \times p(\mu)p(\sigma) /p(\mu)d\sigma$, note solving this is not trivial and isn't something we will handle in this class. \vfill - Luckily, there is an elegant computational procedure that will allow us to approximate $p(\mu|x)$ by taking samples from the distribution. *This is, of course, MCMC.* \vfill STAN code for this situation can be written as below. Note that the prior values are hard coded, these could also be passed in as arguments to the model. \vfill ``` data { int N; vector[N] y; } parameters { real mu; real sigma; } model { y ~ normal(mu, sigma); mu ~ normal(20, 10); } ``` \vfill ```{r, results = F} temp_data <- stan("normal.stan", data=list(N = length(temp), y=temp)) ``` \vfill ```{r} print(temp_data) ``` ```{r} plot(temp_data) ``` \vfill We can also view the posterior and prior beliefs together on a single figure. ```{r} tibble(sims = c(extract(temp_data, pars = 'mu')$mu,rnorm(4000, 20, 10)), Distribution = rep(c('posterior','prior'), each = 4000)) %>% ggplot(aes(x = sims, color = Distribution)) + geom_density() + theme_bw() + xlab('Temperature (F)') + ylab('') + ggtitle("Prior and posterior belief for winter temperature in Bozeman") ``` #### Multivariate Normal Distribution Next we will segue from standard linear models to analyzing correlated data. First we will start with the a bivariate normal distribution: y \~ N(theta,sigma), where theta is a mean vector and sigma = sigmasq \* I is a covariance matrix. To provide a motivating context, not consider jointly estimating the temperature at Bridger Bowl *and* Big Sky Resort. ##### 1. Simulate independent bivariate normal Simulate a set of temperature values from each location, where the temperature values are independent (sigma = sigmasq \* I) ```{r} library(mnormt) n <- 100 theta <- c(15,25) sigma <- diag(2) * 100 fake_temperatures <- rmnorm(n, theta , sigma) ``` Then create a few graphs to show marginal distribution of temperature as well as how the temperatures evolve in time. ```{r} library(reshape2) melt(fake_temperatures, value.name = 'temp') %>% rename(location = Var2) %>% mutate(location = factor(location)) %>% ggplot(aes(x =temp, fill = location)) + geom_histogram() + facet_wrap(.~location) + theme_bw() ``` ```{r} melt(fake_temperatures, value.name = 'temp') %>% rename(location = Var2, day = Var1) %>% mutate(location = factor(location)) %>% ggplot(aes(y =temp, x = day, color = location )) + geom_line() + theme_bw() + xlim(0,30) + ggtitle('First 30 observations of independent response') ``` ##### 2. Simulate correlated bivariate normal Simulate a set of temperature values from each location, where the temperature values are not independent (sigma = sigmasq \* H), where H is a correlation matrix. (Note there are some constraints we will discuss later) ```{r} sigma <- matrix(c(1, .9, .9, 1), nrow = 2, ncol = 2) * 100 fake_temperatures_corr <- rmnorm(n, theta , sigma) ``` Then create a few graphs to show marginal distribution of temperature as well as how the temperatures evolve in time. ```{r} melt(fake_temperatures_corr, value.name = 'temp') %>% rename(location = Var2) %>% mutate(location = factor(location)) %>% ggplot(aes(x =temp, fill = location)) + geom_histogram() + facet_wrap(.~location) + theme_bw() ``` ```{r} melt(fake_temperatures_corr, value.name = 'temp') %>% rename(location = Var2, day = Var1) %>% mutate(location = factor(location)) %>% ggplot(aes(y =temp, x = day,color = location )) + geom_line() + theme_bw() + xlim(0,30) + ggtitle('First 30 observations of correlated response') ``` ##### 3. Write STAN code for bivariate normal Write stan code that will allow you to estimate theta and sigma (including H) ``` data { int p; int N; matrix[N,p] y; } parameters { vector[p] theta; corr_matrix[p] H; real sigma; } model { for(i in 1:N){ y[i,:] ~ multi_normal(theta, sigma*H); } } ``` ##### 4. Use STAN to estimate bivariate normal parameters Use your stan code to estimate theta and sigma (including H and sigmasq) ```{r} indep_mvn <- stan("multi_norm.stan", data=list(N = nrow(fake_temperatures), p = ncol(fake_temperatures), y=fake_temperatures)) ``` ```{r} print(indep_mvn) ``` ```{r} corr_mvn <- stan("multi_norm.stan", data=list(N = nrow(fake_temperatures_corr), p = ncol(fake_temperatures_corr), y=fake_temperatures_corr)) ``` ```{r} print(corr_mvn) ``` ##### 5. Final Thoughts About Correlation In many statistical models there is an assumption about independence. When independence is violated, uncertainty is under estimated and in incorrect inferences can be made. While lack of independence often has a negative connotation, in spatial statistics we can actually exploit correlation. For instance, by knowing the temperature at the weather station at Bozeman High School or Bridger Bowl, we can estimate temperature at other locations.