---
title: "Overview of Linear Models, Bayesian Inference, and STAN"
output: pdf_document
---

\renewcommand{\vec}[1]{\mathbf{#1}}

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.height = 4, fig.width = 6)
library(tidyverse) 
library(gridExtra)
library(arm)
library(rstanarm)
```


## Key Concepts

- Linear Model Specification
- Simulating Data in R
- Fitting Linear Models in R
- Bayesian Inference
- Fitting Bayesian Models with Stan

\vfill

### Linear Model Specification

Linear models provide the foundation for most statistical analyses: 

\vfill

\vfill

\vfill

One assumption, that is often violated in spatial statistics is that the errors are independently distributed.

\newpage

### Simulating Data in R

Simulating "fake" data will be a cornerstone of fitting models in this class.

```{r}
set.seed(01112021)
# initialize parameters
num_obs <- 100
beta <- 1
sigma <- 2

# simulate data
x <- runif(num_obs, min = -10, max = 10)
y <- x * beta + rnorm(num_obs, mean = 0, sd = sigma)
```

\vfill

```{r, fig.cap = "Scatterplot of synthetic data with best fit regression line."}
# create figure
tibble(x=x, y=y) %>%
  ggplot(aes(x=x, y=y)) +
  geom_point() + 
  theme_bw() + 
  geom_smooth(formula = 'y~x', method = 'lm')
```

\newpage

### Fitting Linear Models in R

The standard method for fitting linear models in R is with `lm()`.

```{r}
fit_lm <- lm(y ~ x)
summary(fit_lm)
```

A quick interlude about model interpretation.


\vfill

- P-values... The ASA Statement on p-values is ["required reading"](https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108). 
\vfill

\newpage


\vfill

```{r}
display(fit_lm)
```

\vfill

An alternative framework for fitting regression models is to use the `rstanarm` package and the associated `stan_glm()` functionality.

\vfill

```{r}
synthetic_data <- tibble(x = x, y = y)
fit_stan <- stan_glm(y ~ x, data = synthetic_data, refresh = 0)
print(fit_stan)
```

\newpage


### Bayesian Inference


\vfill

While the coefficient estimates are similar to `lm`

```{r}
coef(fit_stan)
```

\vfill


```{r}
fit_stan %>% as.data.frame() %>% head(10)
```
\vfill


\vfill

```{r}
posterior_interval(fit_stan, prob = .95)
confint(fit_lm)
```

\newpage

So what is hiding in this model specification? 

```{r}
synthetic_data <- tibble(x = x, y = y)
fit_stan <- stan_glm(y ~ x, data = synthetic_data, refresh = 0)
```

\vfill


\vfill

```{r}
prior_summary(fit_stan)
```

\newpage

##### Bayes Rule


\vfill

$$Pr(A|B) = \frac{Pr(A \cap B)}{Pr(B)} = \frac{Pr(B|A) Pr(A)}{Pr(B)},$$

where $Pr(A|B)$ is a conditional probability of event A, given event B.

\vfill

The classic example of Bayes Rule focuses on medical testing. So consider a COVID-19 testing scenario. Assume we want to know the probability that an individual is positive given they received a positive test  or Pr(An individual is positive | test is positive).
Let:

- Pr(An individual is Positive) = .10 

\vfill

- Pr(test is positive | an individual is positive) = .93 

\vfill

- Pr(test is positive | an individual is not positive) = .02 

\vfill

\vfill

\vfill

\vfill

\newpage

As mentioned, using Bayes's theorem does not equate with Bayesian inference, but rather what we have just done is an exercise with conditional probability.

\vfill


\vfill


\vfill

Maximum likelihood estimator use the sampling distribution, or more specifically the likelihood, to select parameter values ($\hat{\theta}_{MLE}$) that maximize the likelihood.

\vfill


\vfill


\vfill

\newpage

In some situations, the posterior distribution can be analytically calculated. However, in many scenarios, $p(x)$ requires integration and does not have an analytical solution.

\vfill


\vfill

STAN can be called directly from R (or R Markdown documents.). The basic structure of a STAN script looks like:

```{r, eval = F}
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}
```

\vfill