--- title: "Overview of Linear Models, Bayesian Inference, and STAN" output: pdf_document --- \renewcommand{\vec}[1]{\mathbf{#1}} ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, fig.height = 4, fig.width = 6) library(tidyverse) library(gridExtra) library(arm) library(rstanarm) ``` ## Key Concepts - Linear Model Specification - Simulating Data in R - Fitting Linear Models in R - Bayesian Inference - Fitting Bayesian Models with Stan \vfill ### Linear Model Specification Linear models provide the foundation for most statistical analyses: \vfill \vfill \vfill One assumption, that is often violated in spatial statistics is that the errors are independently distributed. \newpage ### Simulating Data in R Simulating "fake" data will be a cornerstone of fitting models in this class. ```{r} set.seed(01112021) # initialize parameters num_obs <- 100 beta <- 1 sigma <- 2 # simulate data x <- runif(num_obs, min = -10, max = 10) y <- x * beta + rnorm(num_obs, mean = 0, sd = sigma) ``` \vfill ```{r, fig.cap = "Scatterplot of synthetic data with best fit regression line."} # create figure tibble(x=x, y=y) %>% ggplot(aes(x=x, y=y)) + geom_point() + theme_bw() + geom_smooth(formula = 'y~x', method = 'lm') ``` \newpage ### Fitting Linear Models in R The standard method for fitting linear models in R is with `lm()`. ```{r} fit_lm <- lm(y ~ x) summary(fit_lm) ``` A quick interlude about model interpretation. \vfill - P-values... The ASA Statement on p-values is ["required reading"](https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108). \vfill \newpage \vfill ```{r} display(fit_lm) ``` \vfill An alternative framework for fitting regression models is to use the `rstanarm` package and the associated `stan_glm()` functionality. \vfill ```{r} synthetic_data <- tibble(x = x, y = y) fit_stan <- stan_glm(y ~ x, data = synthetic_data, refresh = 0) print(fit_stan) ``` \newpage ### Bayesian Inference \vfill While the coefficient estimates are similar to `lm` ```{r} coef(fit_stan) ``` \vfill ```{r} fit_stan %>% as.data.frame() %>% head(10) ``` \vfill \vfill ```{r} posterior_interval(fit_stan, prob = .95) confint(fit_lm) ``` \newpage So what is hiding in this model specification? ```{r} synthetic_data <- tibble(x = x, y = y) fit_stan <- stan_glm(y ~ x, data = synthetic_data, refresh = 0) ``` \vfill \vfill ```{r} prior_summary(fit_stan) ``` \newpage ##### Bayes Rule \vfill $$Pr(A|B) = \frac{Pr(A \cap B)}{Pr(B)} = \frac{Pr(B|A) Pr(A)}{Pr(B)},$$ where $Pr(A|B)$ is a conditional probability of event A, given event B. \vfill The classic example of Bayes Rule focuses on medical testing. So consider a COVID-19 testing scenario. Assume we want to know the probability that an individual is positive given they received a positive test or Pr(An individual is positive | test is positive). Let: - Pr(An individual is Positive) = .10 \vfill - Pr(test is positive | an individual is positive) = .93 \vfill - Pr(test is positive | an individual is not positive) = .02 \vfill \vfill \vfill \vfill \newpage As mentioned, using Bayes's theorem does not equate with Bayesian inference, but rather what we have just done is an exercise with conditional probability. \vfill \vfill \vfill Maximum likelihood estimator use the sampling distribution, or more specifically the likelihood, to select parameter values ($\hat{\theta}_{MLE}$) that maximize the likelihood. \vfill \vfill \vfill \newpage In some situations, the posterior distribution can be analytically calculated. However, in many scenarios, $p(x)$ requires integration and does not have an analytical solution. \vfill \vfill STAN can be called directly from R (or R Markdown documents.). The basic structure of a STAN script looks like: ```{r, eval = F} data { int N; vector[N] x; vector[N] y; } parameters { real alpha; real beta; real sigma; } model { y ~ normal(alpha + beta * x, sigma); } ``` \vfill