--- title: "Why I like generalized fiducial inference" author: "Stéphane Laurent" date: '2020-11-08' tags: R, maths, statistics rbloggers: yes output: md_document: variant: markdown preserve_yaml: true html_document: highlight: kate keep_md: no highlighter: pandoc-solarized --- {r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, collapse = TRUE, attr.source = ".numberLines", fig.width = 6, fig.height = 5, fig.path = "./figures/gfi_rstanarm-", warning = FALSE, message = FALSE )  Throughout this article, one considers the balanced one-way ANOVA model with a random factor (group). The between standard deviation and the within standard deviation are denoted by$\sigma_{\mathrm{b}}$and$\sigma_{\mathrm{w}}$respectively. The grand mean is denoted by$\mu$. The number of levels of the group factor is denoted by$I$and the number of individuals within each group is denoted by$J$. Thus the model is: $$\mu_i \sim_{\text{iid}} \mathcal{N}(\mu, \sigma_{\mathrm{b}}^2), \, i = 1, \ldots, I \qquad (y_{i,j} \mid \mu_i) \sim_{\text{iid}} \mathcal{N}(\mu_i, \sigma_{\mathrm{w}}^2), \, j = 1, \ldots, J.$$ ## Using 'rstanarm' with the default priors Below I fit the model with the 'rstanarm' package for fifteen simulated datasets with$I = 10$,$J = 5$,$\mu = 10000$,$\sigma_{\mathrm{b}} = 50$,$\sigma_{\mathrm{w}} = 1$. I assign a "vague" half-Cauchy prior distribution to$\sigma_{\mathrm{w}}$and the other prior distributions are the default prior distributions of stan_lmer. {r sims1, eval=FALSE} library(rstanarm) options(mc.cores = parallel::detectCores()) mu <- 10000 sigmaWithin <- 1 ratio <- 50 sigmaBetween <- sigmaWithin * ratio I <- 10L J <- 5L nsims <- 15L stanIntervals <- list( # to store the confidence intervals within = colnames<-(matrix(NA_real_, nrow = nsims, ncol = 3), c("estimate", "lwr", "upr")), between = colnames<-(matrix(NA_real_, nrow = nsims, ncol = 3), c("estimate", "lwr", "upr")) ) set.seed(666L) for(i in 1L:nsims){ groupMeans <- rnorm(I, mu, sigmaBetween) y <- c( vapply(groupMeans, function(gmean) rnorm(J, gmean, sigmaWithin), numeric(J)) ) dat <- data.frame( y = y, group = gl(I, J) ) rstanarm <- stan_lmer( y ~ (1|group), data = dat, iter = 5000L, prior_aux = cauchy(0, 5) ) pstrr <- as.data.frame( # extract posterior draws stan, pars = c( "(Intercept)", "sigma", "Sigma[group:(Intercept),(Intercept)]" ) ) names(pstrr)[2L:3L] <- c("sigma_error", "sigma_group") pstrr[["sigma_group"]] <- sqrt(pstrr[["sigma_group"]]) x <- t(vapply(pstrr, quantile, numeric(3L), probs = c(50, 2.5, 97.5)/100)) stanIntervals$within[i, ] <- x["sigma_error", ] stanIntervals$between[i, ] <- x["sigma_group", ] }  {r loadRDS, echo=FALSE} Results <- readRDS("./RDS/AOV1R_highRatio_simulations.rds") stanIntervals <- Results$stanIntervals stanIntervals2 <- Results$stanIntervals2 freqIntervals <- Results$freqIntervals fidIntervals <- Results$fidIntervals mu <- 10000 sigmaWithin <- 1 ratio <- 50 sigmaBetween <- sigmaWithin * ratio I <- 10L J <- 5L nsims <- 15L  Let's plot the intervals now. {r stanIntervalsWithin} library(ggplot2) stanWithin <- as.data.frame(stanIntervals[["within"]]) stanWithin[["simulation"]] <- factor(1L:nsims) ggplot( stanWithin, aes( x = simulation, y = estimate, ymin = lwr, ymax = upr ) ) + geom_pointrange() + ylab("interval") + geom_hline(yintercept = 1, linetype = "dashed") + ggtitle("Confidence intervals about the within standard deviation")  The horizontal line shows the value of$\sigma_{\mathrm{w}}$. As you can see, the confidence intervals dramatically fail to catch this value. And this is also the case for$\sigma_{\mathrm{b}}$: {r stanIntervalsBetween} stanBetween <- as.data.frame(stanIntervals[["between"]]) stanBetween[["simulation"]] <- factor(1L:nsims) ggplot( stanBetween, aes( x = simulation, y = estimate, ymin = lwr, ymax = upr ) ) + geom_pointrange() + ylab("interval") + geom_hline(yintercept = 1, linetype = "dashed") + ggtitle("Confidence intervals about the between standard deviation")  ## Resolving the issue Why? The explanation is simple: stan_lmer assigns a unit exponential prior distribution to the between standard deviation, which is equal to$50$. So we have to change this prior distribution, and stan_lmer allows to use a Gamma distribution as the prior distribution of the between standard deviation. Its parameters shape and scale are settable in the decov function which is passed on to the prior_covariance option: {r setGammaParameters, eval=FALSE} rstanarm <- stan_lmer( y ~ (1|group), data = dat, iter = 5000L, prior_aux = cauchy(0, 5), prior_covariance = decov(shape = 2, scale = 30) )  I choose the$\textrm{Gamma}(\textrm{shape}=2, \textrm{scale=30})$distribution because it has median$\approx 50$and is "vague" enough: its equi-tailed$95\%$-dispersion interval is$\approx (7, 167)$. ☛ *However it took me some time to pick up these parameters. I firstly tried a more dispersed Gamma distribution but stan_lmer returned a bunch of warnings and non-sensible results.* Below are the confidence intervals obtained with this Gamma prior distribution. I compare them with the frequentist intervals obtained with the 'AOV1R' package. {r stanIntervals2Within, echo=FALSE, fig.width=8} stan2Within <- as.data.frame(stanIntervals2[["within"]]) freqWithin <- as.data.frame(freqIntervals[["within"]]) stan2Within[["simulation"]] <- freqWithin[["simulation"]] <- factor(1L:nsims) freqWithin$inference <- "frequentist" stan2Within$inference <- "Bayesian" freqAndStan2 <- rbind(freqWithin, stan2Within) ggplot( freqAndStan2, aes( x = simulation, y = estimate, ymin = lwr, ymax = upr, group = simulation, color = inference ) ) + geom_pointrange(position = position_dodge2(width = 0.5)) + scale_discrete_manual("colour", values = c("green", "blue")) + ylab("interval") + geom_hline(yintercept = 1, linetype = "dashed") + ggtitle("Confidence intervals about the within standard deviation", subtitle = "Frequentist and Bayesian")  {r stanIntervals2Between, echo=FALSE, fig.width=8} stan2Between <- as.data.frame(stanIntervals2[["between"]]) freqBetween <- as.data.frame(freqIntervals[["between"]]) stan2Between[["simulation"]] <- freqBetween[["simulation"]] <- factor(1L:nsims) freqBetween$inference <- "frequentist" stan2Between$inference <- "Bayesian" freqAndStan2 <- rbind(freqBetween, stan2Between) ggplot( freqAndStan2, aes( x = simulation, y = estimate, ymin = lwr, ymax = upr, group = simulation, color = inference ) ) + geom_pointrange(position = position_dodge2(width = 0.5)) + scale_discrete_manual("colour", values = c("green", "blue")) + ylab("interval") + geom_hline(yintercept = 50, linetype = "dashed") + ggtitle("Confidence intervals about the between standard deviation", subtitle = "Frequentist and Bayesian")  Quite good. ☛ *I also noticed that the sampling was slower with this Gamma distribution.* ## Try the generalized fiducial inference. My new package 'gfilmm' allows to perform the *generalized fiducial inference* for any Gaussian linear mixed model with categorical random effects. Fiducial inference and Bayesian inference have something in common: they are both based on a distribution representing the uncertainty about the parameters: the fiducial distribution and the posterior distribution, respectively. A notable difference between these two methods of inference is that *there's no prior distribution in fiducial statistics*. Here is how to run the fiducial sampler: {r gfi1, eval=FALSE} library(gfilmm) fiducialSimulations <- gfilmm( y = ~ cbind(y - 0.01, y + 0.01), fixed = ~ 1, random = ~ group, data= dat, N = 10000L )  Note the form of the response variable: ~ cbind(y - 0.01, y + 0.01). That's because the generalized fiducial inference applies to interval data. Below are the fiducial confidence intervals for the same simulated datasets as before. {r fidIntervalsWithin, echo=FALSE, fig.width=8} fidWithin <- as.data.frame(fidIntervals[["within"]]) fidWithin[["simulation"]] <- factor(1L:nsims) fidWithin$inference <- "fiducial" fidAndFreq <- rbind(freqWithin, fidWithin) ggplot( fidAndFreq, aes( x = simulation, y = estimate, ymin = lwr, ymax = upr, group = simulation, color = inference ) ) + geom_pointrange(position = position_dodge2(width = 0.5)) + scale_discrete_manual("colour", values = c("red", "blue")) + ylab("interval") + geom_hline(yintercept = 1, linetype = "dashed") + ggtitle("Confidence intervals about the within standard deviation", subtitle = "Frequentist and fiducial")  {r fidIntervalsBetween, echo=FALSE, fig.width=8} fidBetween <- as.data.frame(fidIntervals[["between"]]) fidBetween[["simulation"]] <- factor(1L:nsims) fidBetween\$inference <- "fiducial" fidAndFreq <- rbind(freqBetween, fidBetween) ggplot( fidAndFreq, aes( x = simulation, y = estimate, ymin = lwr, ymax = upr, group = simulation, color = inference ) ) + geom_pointrange(position = position_dodge2(width = 0.5)) + scale_discrete_manual("colour", values = c("red", "blue")) + ylab("interval") + geom_hline(yintercept = 50, linetype = "dashed") + ggtitle("Confidence intervals about the between standard deviation", subtitle = "Frequentist and fiducial")  Quite good too. And let me insist on this point: *there is no prior distribution, there is nothing to set, except the number of simulations*. I implemented the algorithm (due to J. Cisewski and J. Hannig) in C++ and it takes less than 1000 lines of code. Let's increase the between standard deviation now. {r gfiRatio1000, cache=TRUE} ratio <- 1000 sigmaBetween <- ratio * sigmaWithin set.seed(31415926L) groupMeans <- rnorm(I, mu, sigmaBetween) y <- c( vapply(groupMeans, function(gmean) rnorm(J, gmean, sigmaWithin), numeric(J)) ) dat <- data.frame( y = y, group = gl(I, J) ) library(AOV1R) library(gfilmm) aovfit <- aov1r(y ~ group, data = dat) gf <- gfilmm(~ cbind(y-0.01, y+0.01), ~ 1, ~ group, data = dat, N = 5000L) confint(aovfit) gfiSummary(gf)  The fiducial confidence interval about the within standard deviation does not match the frequentist interval, and does not catch the true value. Nothing to tinker with, except the number of simulations: {r gfi30000, cache=TRUE} gf <- gfilmm(~ cbind(y-0.01, y+0.01), ~ 1, ~ group, data = dat, N = 30000L) gfiSummary(gf)  Now the fiducial intervals match the frequentist ones. ## Epilogue As you have seen, using the generalized fiducial inference is easy, easier than the Bayesian inference. The difficulty I mentioned regarding the Bayesian inference is not severe, but this is because the one-way ANOVA model with a random factor is the simplest Gaussian linear mixed model. Namely, it has only one between standard deviation. Things get more complicated for a mixed model with multiple random effects. With rstanarm::stan_lmer, one has to assign a Gamma prior distribution on each between standard deviation. I cheated for the above example: I did multiple attempts to select the parameters of the Gamma prior, until I found results close to the frequentist ones!