---
title: "Bayesian Statistics with Stan"
---

## Packages for this section

Installation instructions for the last three of these are below.

```{r rstan-1}
library(tidyverse)
library(cmdstanr)
library(posterior)
library(bayesplot)
```

```{r rstan-2, echo=FALSE}
set.seed(457299)
```

## Installation 1/2

- cmdstanr:

```{r rstan-3, eval=FALSE}
install.packages("cmdstanr",
  repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
```

- posterior and bayesplot, from the same place:

```{r rstan-4, eval=FALSE}
install.packages("posterior",
  repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
install.packages("bayesplot",
  repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
```

## Installation 2/2

Then, to check that you have the C++ toolchain needed to compile Stan code:

```{r rstan-5}
check_cmdstan_toolchain()
```

and then:

```{r rstan-6, eval=FALSE}
install_cmdstan(cores = 4)
```

If you happen to know how many cores (processors) your computer has, insert the appropriate number. (My laptop has 4 and my desktop 6.) All of this is done once. If you have problems, go [here (link)](https://mc-stan.org/cmdstanr/articles/cmdstanr.html).

## Bayesian and frequentist inference 1/2

- The inference philosophy that we have learned so far says that:
  - the parameters to be estimated are *fixed* but *unknown*
  - the data are random: if we took another sample, we would get different data.
- This is called "frequentist" or "repeated-sampling" inference.

## Bayesian and frequentist inference 2/2

- Bayesian inference says:
  - the *parameters* are random, the *data* are *given*.
- Ingredients:
  - **prior distribution**: the distribution of the parameters before seeing the data
  - **likelihood**: the model for the data if the parameters are known
  - **posterior distribution**: the distribution of the parameters *after* seeing the data.

## Distribution of parameters

- Instead of having a point or interval estimate of a parameter, we have an entire distribution, so in Bayesian statistics we can talk about, e.g.:
  - the probability that a parameter is bigger than some value
  - the probability that a parameter is close to some value
  - the probability that one parameter is bigger than another.
- The name comes from Bayes' Theorem, which here says

> posterior is proportional to likelihood times prior

- There is more discussion of this in [**a blog post**](http://ritsokiguess.site/docs/2018/02/28/working-my-way-back-to-you-a-re-investigation-of-rstan/).

## An example

- Suppose we have these (integer) observations:

```{r rstan-7}
(x <- c(0, 4, 3, 6, 3, 3, 2, 4))
```

- Suppose we believe that these come from a Poisson distribution with a mean $\lambda$ that we want to estimate.
- We need a prior distribution for $\lambda$. I will (for some reason) take a Weibull distribution with parameters 1.1 and 6, which has quartiles 2 and 6. Normally this would come from your knowledge of the data-generating *process*.
- The Poisson likelihood can be written down (see over).

## Some algebra

- We have $n=8$ observations $x_i$, so the Poisson likelihood is proportional to
$$\prod_{i=1}^n e^{-\lambda} \lambda^{x_i} = e^{-n\lambda} \lambda^S,$$
where $S=\sum_{i=1}^n x_i$.
- Then you write the Weibull prior density (as a function of $\lambda$):
$$C (\lambda/6)^{0.1} e^{-(\lambda/6)^{1.1}},$$
where $C$ is a constant.
- Then you multiply these together and try to recognize the distributional form. Only here you can't: the powers 0.1 and 1.1 get in the way.

## Sampling from the posterior distribution

- Wouldn't it be nice if we could just *sample* from the posterior distribution? Then we could approximate it as accurately as we wanted.
- Metropolis and Hastings: devise a Markov chain (C62) whose limiting distribution is the posterior you want, then sample from that Markov chain (easy), allowing enough time to get close enough to the limiting distribution.
- Stan uses a modern, more efficient variant of this idea called Hamiltonian Monte Carlo, used from R via the package cmdstanr.
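The Metropolis–Hastings idea can be sketched directly in R for this example: a minimal random-walk sampler for the Poisson/Weibull posterior above (this is not what Stan does internally, just the simplest version of the same idea; it assumes the data `x` from above):

```r
# Minimal random-walk Metropolis sampler for the Poisson/Weibull posterior.
# A sketch only, assuming the data x above.
x <- c(0, 4, 3, 6, 3, 3, 2, 4)
log_post <- function(lambda) {
  if (lambda <= 0) return(-Inf)  # lambda must be positive
  sum(dpois(x, lambda, log = TRUE)) +                     # log-likelihood
    dweibull(lambda, shape = 1.1, scale = 6, log = TRUE)  # log-prior
}
n_iter <- 10000
lambda <- numeric(n_iter)
lambda[1] <- 3  # starting value
for (i in 2:n_iter) {
  proposal <- lambda[i - 1] + rnorm(1, 0, 1)  # symmetric proposal
  # accept with probability min(1, posterior ratio), on the log scale
  if (log(runif(1)) < log_post(proposal) - log_post(lambda[i - 1])) {
    lambda[i] <- proposal
  } else {
    lambda[i] <- lambda[i - 1]
  }
}
mean(lambda[-(1:1000)])  # posterior mean, discarding burn-in
```

The kept values of `lambda` (after burn-in) are a (dependent) sample from the posterior, so their mean and quantiles estimate the posterior mean and interval.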
- Write the Stan code in a file, compile it, and sample from it.

## Components of Stan code: the model

```
model {
  // likelihood
  x ~ poisson(lambda);
}
```

This is how you say "$X$ has a Poisson distribution with mean $\lambda$". **Note that lines of Stan code end with semicolons.**

## Components of Stan code: the prior distribution

```
model {
  // prior
  lambda ~ weibull(1.1, 6);
  // likelihood
  x ~ poisson(lambda);
}
```

## Components of Stan code: data and parameters

- These come first in the Stan code:

```
data {
  int x[8];
}

parameters {
  real<lower=0> lambda;
}
```

## Compile and sample from the model 1/2

- compile:

```{r rstan-8, message=FALSE}
poisson1 <- cmdstan_model("poisson1.stan")
```

```{r}
poisson1
```

## Compile and sample from the model 2/2

- set up the data:

```{r rstan-9}
poisson1_data <- list(x = x)
poisson1_data
```

- sample:

```{r rstan-10}
poisson1_fit <- poisson1$sample(data = poisson1_data)
```

## The output

```{r rstan-11}
poisson1_fit
```

## Comments

- This summarizes the posterior distribution of $\lambda$:
  - the posterior mean is 3.19,
  - with a 90% posterior interval of 2.25 to 4.33.
- The probability that $\lambda$ is between these two values really is 90%.

## Making the code more general

- The coder in you is probably offended by hard-coding the sample size and the parameters of the prior distribution. More generally:

```
data {
  int n;
  real a;
  real b;
  int x[n];
}
...
```
```
model {
  // prior
  lambda ~ weibull(a, b);
  // likelihood
  x ~ poisson(lambda);
}
```

## Set up again and sample

- Compile again:

```{r rstan-12, message=FALSE}
poisson2 <- cmdstan_model("poisson2.stan")
```

- Set up the data again, including the new things we need:

\footnotesize

```{r rstan-13}
poisson2_data <- list(x = x, n = length(x), a = 1.1, b = 6)
poisson2_data
```

\normalsize

## Sample again

The output should be the same (to within randomness):

```{r rstan-15}
#| results = "hide",
#| message = FALSE
poisson2_fit <- poisson2$sample(data = poisson2_data)
```

```{r}
poisson2_fit
```

## Picture of posterior

```{r rstan-16}
mcmc_hist(poisson2_fit$draws("lambda"), binwidth = 0.25)
```

## Extracting actual sampled values

A little awkward at first:

```{r rstan-17}
poisson2_fit$draws()
```

This is a three-dimensional array; a dataframe would be much better.

## Sampled values as dataframe

```{r rstan-18}
as_draws_df(poisson2_fit$draws()) %>% as_tibble() -> poisson2_draws
poisson2_draws
```

## Posterior predictive distribution

- Another use for the actual sampled values is to see what kind of *response* values we might get in the future. These should look something like our data.
- For a Poisson distribution, the response values are integers:

```{r rstan-19}
poisson2_draws %>%
  rowwise() %>%
  mutate(xsim = rpois(1, lambda)) -> d
```

## The simulated posterior predictive distribution (in xsim)

```{r}
d %>% select(lambda, xsim)
```

## Comparison

Our actual data values were these:

```{r rstan-20}
x
```

- None of these are very unlikely according to our posterior predictive distribution, so our model is believable.
- Or make a plot: a bar chart of the simulated values, with the data on it as well (over):

```{r rstan-21}
ggplot(d, aes(x = xsim)) + geom_bar() +
  geom_dotplot(data = tibble(x), aes(x = x), binwidth = 1) +
  scale_y_continuous(NULL, breaks = NULL) -> g
```

- This also shows that the distribution of the data conforms well enough to the posterior predictive distribution (over).

## The plot

```{r rstan-22}
g
```

## Do they have the same distribution?
```{r rstan-23, fig.height=4}
qqplot(d$xsim, x, plot.it = FALSE) %>% as_tibble() -> dd
dd
```

## The plot

```{r, fig.height=4}
ggplot(dd, aes(x = x, y = y)) + geom_point()
```

The observed zero is a bit too small compared with what we would expect from the posterior predictive distribution, but the other points lie pretty well on a line.

## Analysis of variance, the Bayesian way

Recall the jumping rats data:

```{r rstan-24, message=F}
my_url <- "http://ritsokiguess.site/datafiles/jumping.txt"
rats0 <- read_delim(my_url, " ")
rats0
```

## Our aims here

- Estimate the mean bone density of all rats under each of the experimental conditions.
- Model: given the group means, each observation is normally distributed with common variance $\sigma^2$.
- So there are three mean parameters to estimate, plus the common variance.
- Obtain posterior distributions for the group means.
- Ask whether the posterior distributions of these means are sufficiently different.

## Numbering the groups

- Stan doesn't handle categorical variables (everything is real or int).
- Turn the groups into group *numbers* first.
- Take the opportunity to put the groups in a logical order:

```{r rstan-25}
rats0 %>% mutate(
  group_fct = fct_inorder(group),
  group_no = as.integer(group_fct)
) -> rats
rats
```

## Plotting the data 1/2

Most obviously, boxplots:

```{r rstan-26, fig.height=4.2}
ggplot(rats, aes(x = group_fct, y = density)) + geom_boxplot()
```

## Plotting the data 2/2

Another way: a density plot (a smoothed-out histogram); the groups can be distinguished by colour:

```{r density_plot, fig.height=4}
ggplot(rats, aes(x = density, fill = group_fct)) + geom_density(alpha = 0.6)
```

## The procedure

- For each observation, find out which (numeric) group it belongs to,
- then model it as having a normal distribution with that group's mean and the common variance.
- Stan does for loops.

## The model part

Suppose we have n_obs observations:

```
model {
  // likelihood
  for (i in 1:n_obs) {
    g = group_no[i];
    density[i] ~ normal(mu[g], sigma);
  }
}
```

## The variables here {.scrollable}

- n_obs is data.
- g is a temporary integer variable, used only here.
- i is used only in the loop (an integer) and does not need to be declared.
- density is data: a real vector of length n_obs.
- mu is a parameter: a real vector of length 3 (3 groups).
- sigma is a real parameter.

mu and sigma need prior distributions:

- for mu: each component independently normal with mean 600 and SD 50 (my guess at how big and variable they will be)
- for sigma: chi-squared with 50 df (my guess at the typical amount of variability from observation to observation).

## Complete the model section

```
model {
  int g;
  // priors
  mu ~ normal(600, 50);
  sigma ~ chi_square(50);
  // likelihood
  for (i in 1:n_obs) {
    g = group_no[i];
    density[i] ~ normal(mu[g], sigma);
  }
}
```

## Parameters

The elements of mu, one per group, and also sigma, a scalar with lower limit zero:

```
parameters {
  real mu[n_group];
  real<lower=0> sigma;
}
```

- Declaring sigma to have lower limit zero here makes the sampling run smoothly.
- n_group is declared in the data section.

## Data

Everything else:

```
data {
  int n_obs;
  int n_group;
  real density[n_obs];
  int group_no[n_obs];
}
```

## Compile

Arrange these in the order data, parameters, model in a file anova.stan, then:

```{r rstan-27, message=FALSE}
anova <- cmdstan_model("anova.stan")
```

## Set up data and sample

Supply values for *everything* declared in data:

```{r anova_data_samples}
anova_data <- list(
  n_obs = 30,
  n_group = 3,
  density = rats$density,
  group_no = rats$group_no
)
anova_fit <- anova$sample(data = anova_data)
```

## Check that the sampling worked properly

\scriptsize

```{r rstan-28, fig.height=4}
anova_fit$cmdstan_diagnose()
```

\normalsize

## Look at the results

```{r anova_samples}
anova_fit
```

## Comments

- The posterior 95% intervals for control (group 1) and highjump (group 3) do not quite overlap, suggesting that these exercise groups really are different.
- The Bayesian approach does not normally do tests: look at the posterior distributions and decide whether they are different enough to be worth treating as different.
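The 95% intervals being compared here can also be pulled out of the fit directly; a sketch, using the fact that cmdstanr's `$summary()` passes extra summary functions through to `posterior::summarise_draws()`:

```r
# 95% posterior intervals for each group mean, from anova_fit above
anova_fit$summary("mu", ~ quantile(.x, probs = c(0.025, 0.975)))
```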
## Plotting the posterior distributions of the mu

```{r rstan-29}
mcmc_hist(anova_fit$draws("mu"), binwidth = 5)
```

## Extract the sampled values

```{r anova_ext}
as_draws_df(anova_fit$draws()) %>% as_tibble() -> anova_draws
anova_draws
```

## Estimated probability that $\mu_3 > \mu_1$

```{r rstan-30}
anova_draws %>%
  count(`mu[3]` > `mu[1]`) %>%
  mutate(prob = n / sum(n))
```

The high-jumping group almost certainly has a larger mean than the control group.

## More organizing

In preparation for another plot:

- make the dataframe longer
- give the group values their proper names back:

```{r rstan-31, warning=F}
anova_draws %>%
  pivot_longer(starts_with("mu"),
    names_to = "group", values_to = "bone_density"
  ) %>%
  mutate(group = fct_recode(group,
    Control = "mu[1]",
    Lowjump = "mu[2]",
    Highjump = "mu[3]"
  )) -> sims
```

## What we have now

```{r rstan-32}
sims
```

## Density plots of the posterior mean distributions

```{r rstan-33, fig.height=4}
ggplot(sims, aes(x = bone_density, fill = group)) + geom_density(alpha = 0.6)
```

## Posterior predictive distributions

Randomly sample from the posterior means and SDs in sims. There are 12000 rows in sims:

```{r rstan-34}
sims %>% mutate(sim_data = rnorm(12000, bone_density, sigma)) -> ppd
ppd
```

## Compare the posterior predictive distribution with the actual data

- Check that the model works: the distribution of the data should be similar to what we would predict.
- Idea: make plots of the posterior predictive distribution, and plot the actual data as points on them.
- Use facets, one for each treatment group:

\scriptsize

```{r ppdgraph}
my_binwidth <- 15
ggplot(ppd, aes(x = sim_data)) +
  geom_histogram(binwidth = my_binwidth) +
  geom_dotplot(
    data = rats, aes(x = density),
    binwidth = my_binwidth
  ) +
  facet_wrap(~group) +
  scale_y_continuous(NULL, breaks = NULL) -> g
```

\normalsize

- See (over) that the data values are mainly in the middle of the predictive distributions,
- even for the control group, which had outliers.
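As a numeric companion to the plot, one could also compare group-wise summaries of the posterior predictive draws with the same summaries of the data (a sketch, assuming the `ppd` and `rats` dataframes above):

```r
# simulated mean and SD by group...
ppd %>%
  group_by(group) %>%
  summarize(sim_mean = mean(sim_data), sim_sd = sd(sim_data))
# ...compared with the observed ones
rats %>%
  group_by(group_fct) %>%
  summarize(obs_mean = mean(density), obs_sd = sd(density))
```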
## The plot

```{r rstan-35, fig.height=3.75}
g
```

## Extensions

- If you want a model other than the normal, change the distribution in the model section.
- If you want to allow unequal spreads, create sigma[n_group] and, in the model, use `density[i] ~ normal(mu[g], sigma[g]);`.
- Stan will work just fine after you recompile.
- Very flexible.
- Typical modelling strategy: start simple, then add complexity as warranted by the data.
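For example, the unequal-spreads version might look like this in the parameters and model sections (a sketch only, not compiled here; the data section is unchanged):

```
parameters {
  real mu[n_group];
  real<lower=0> sigma[n_group];  // one sigma per group now
}

model {
  int g;
  // priors
  mu ~ normal(600, 50);
  sigma ~ chi_square(50);
  // likelihood
  for (i in 1:n_obs) {
    g = group_no[i];
    density[i] ~ normal(mu[g], sigma[g]);
  }
}
```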