---
title: "Why I like generalized fiducial inference"
author: "Stéphane Laurent"
date: '2020-11-08'
tags: R, maths, statistics
rbloggers: yes
output:
  md_document:
    variant: markdown
    preserve_yaml: true
  html_document:
    highlight: kate
    keep_md: no
highlighter: pandoc-solarized
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE, collapse = TRUE,
  attr.source = ".numberLines",
  fig.width = 6, fig.height = 5, fig.path = "./figures/gfi_rstanarm-",
  warning = FALSE, message = FALSE
)
```

Throughout this article, one considers the balanced one-way ANOVA model with a 
random factor (`group`). The between standard deviation and the within standard 
deviation are denoted by $\sigma_{\mathrm{b}}$ and $\sigma_{\mathrm{w}}$
respectively. The grand mean is denoted by $\mu$. The number of levels of the 
`group` factor is denoted by $I$ and the number of individuals within each group
is denoted by $J$. Thus the model is:
$$
\mu_i \sim_{\text{iid}} \mathcal{N}(\mu, \sigma_{\mathrm{b}}^2), \, 
i = 1, \ldots, I \qquad 
(y_{i,j} \mid \mu_i) \sim_{\text{iid}} 
\mathcal{N}(\mu_i, \sigma_{\mathrm{w}}^2), \, j = 1, \ldots, J.
$$

## Using 'rstanarm' with the default priors

Below I fit the model with the 'rstanarm' package for fifteen simulated datasets
with $I = 10$, $J = 5$, $\mu = 10000$, $\sigma_{\mathrm{b}} = 50$, 
$\sigma_{\mathrm{w}} = 1$. I assign a "vague" half-Cauchy prior distribution to 
$\sigma_{\mathrm{w}}$ and the other prior distributions are the default prior 
distributions of `stan_lmer`.

```{r sims1, eval=FALSE}
library(rstanarm)
options(mc.cores = parallel::detectCores())

mu           <- 10000 
sigmaWithin  <- 1
ratio        <- 50
sigmaBetween <- sigmaWithin * ratio
I            <- 10L 
J            <- 5L 

nsims <- 15L 

stanIntervals <- list( # to store the confidence intervals
  within = `colnames<-`(matrix(NA_real_, nrow = nsims, ncol = 3), 
                      c("estimate", "lwr", "upr")),
  between = `colnames<-`(matrix(NA_real_, nrow = nsims, ncol = 3), 
                      c("estimate", "lwr", "upr"))
)

set.seed(666L)

for(i in 1L:nsims){
  groupMeans <- rnorm(I, mu, sigmaBetween)
  y          <- c(
    vapply(groupMeans, function(gmean) rnorm(J, gmean, sigmaWithin), numeric(J))
  )
  dat        <- data.frame(
    y = y,
    group = gl(I, J)
  )
  
  rstanarm <- stan_lmer(
    y ~ (1|group), data = dat, iter = 5000L,
    prior_aux = cauchy(0, 5)
  )
  
  pstrr <- as.data.frame( # extract posterior draws
    stan,
    pars = c(
      "(Intercept)",
      "sigma",
      "Sigma[group:(Intercept),(Intercept)]"
    )
  )
  names(pstrr)[2L:3L] <- c("sigma_error", "sigma_group")
  pstrr[["sigma_group"]] <- sqrt(pstrr[["sigma_group"]])

  x <- t(vapply(pstrr, quantile, numeric(3L), probs = c(50, 2.5, 97.5)/100))
  stanIntervals$within[i, ] <- x["sigma_error", ]
  stanIntervals$between[i, ] <- x["sigma_group", ]
}
```

```{r loadRDS, echo=FALSE}
Results <- readRDS("./RDS/AOV1R_highRatio_simulations.rds")
stanIntervals <- Results$stanIntervals
stanIntervals2 <- Results$stanIntervals2
freqIntervals <- Results$freqIntervals
fidIntervals <- Results$fidIntervals
mu           <- 10000
sigmaWithin  <- 1
ratio        <- 50
sigmaBetween <- sigmaWithin * ratio
I            <- 10L 
J            <- 5L 
nsims <- 15L 
```

Let's plot the intervals now.

```{r stanIntervalsWithin}
library(ggplot2)
stanWithin <- as.data.frame(stanIntervals[["within"]])
stanWithin[["simulation"]] <- factor(1L:nsims)
ggplot(
  stanWithin, 
  aes(
    x = simulation, y = estimate, ymin = lwr, ymax = upr
  )
) + 
  geom_pointrange() + 
  ylab("interval") + 
  geom_hline(yintercept = 1, linetype = "dashed") + 
  ggtitle("Confidence intervals about the within standard deviation")
```

The horizontal line shows the value of $\sigma_{\mathrm{w}}$. As you can see, 
the confidence intervals dramatically fail to catch this value.

And this is also the case for $\sigma_{\mathrm{b}}$:

```{r stanIntervalsBetween}
stanBetween <- as.data.frame(stanIntervals[["between"]])
stanBetween[["simulation"]] <- factor(1L:nsims)
ggplot(
  stanBetween, 
  aes(
    x = simulation, y = estimate, ymin = lwr, ymax = upr
  )
) + 
  geom_pointrange() + 
  ylab("interval") + 
  geom_hline(yintercept = 1, linetype = "dashed") + 
  ggtitle("Confidence intervals about the between standard deviation")
```

## Resolving the issue 

Why? The explanation is simple: `stan_lmer` assigns a unit exponential prior 
distribution to the between standard deviation, which is equal to $50$. 

So we have to change this prior distribution, and `stan_lmer` allows to use a 
Gamma distribution as the prior distribution of the between standard deviation.
Its parameters `shape` and `scale` are settable in the `decov` function which is 
passed on to the `prior_covariance` option:

```{r setGammaParameters, eval=FALSE}
rstanarm <- stan_lmer(
  y ~ (1|group), data = dat, iter = 5000L,
  prior_aux = cauchy(0, 5),
  prior_covariance = decov(shape = 2, scale = 30)
)
```

I choose the $\textrm{Gamma}(\textrm{shape}=2, \textrm{scale=30})$ distribution 
because it has median $\approx 50$ and is "vague" enough: its equi-tailed 
$95\%$-dispersion interval is $\approx (7, 167)$. 

<span style="color:firebrick;font-size:1.25em">
&#9755; *However it took me some time to pick up these parameters. I firstly 
tried a more dispersed Gamma distribution but `stan_lmer` returned a bunch of 
warnings and non-sensible results.*
</span>

Below are the confidence intervals obtained with this Gamma prior distribution. 
I compare them with the frequentist intervals obtained with the 'AOV1R' package.

```{r stanIntervals2Within, echo=FALSE, fig.width=8}
stan2Within <- as.data.frame(stanIntervals2[["within"]])
freqWithin <- as.data.frame(freqIntervals[["within"]])
stan2Within[["simulation"]] <- freqWithin[["simulation"]] <- factor(1L:nsims)
freqWithin$inference <- "frequentist"
stan2Within$inference <- "Bayesian"
freqAndStan2 <- rbind(freqWithin, stan2Within)
ggplot(
  freqAndStan2, 
  aes(
    x = simulation, y = estimate, ymin = lwr, ymax = upr, 
    group = simulation, color = inference
  )
) + 
  geom_pointrange(position = position_dodge2(width = 0.5)) + 
  scale_discrete_manual("colour", values = c("green", "blue")) + 
  ylab("interval") + 
  geom_hline(yintercept = 1, linetype = "dashed") + 
  ggtitle("Confidence intervals about the within standard deviation", 
          subtitle = "Frequentist and Bayesian")
```

```{r stanIntervals2Between, echo=FALSE, fig.width=8}
stan2Between <- as.data.frame(stanIntervals2[["between"]])
freqBetween <- as.data.frame(freqIntervals[["between"]])
stan2Between[["simulation"]] <- freqBetween[["simulation"]] <- factor(1L:nsims)
freqBetween$inference <- "frequentist"
stan2Between$inference <- "Bayesian"
freqAndStan2 <- rbind(freqBetween, stan2Between)
ggplot(
  freqAndStan2, 
  aes(
    x = simulation, y = estimate, ymin = lwr, ymax = upr, 
    group = simulation, color = inference
  )
) + 
  geom_pointrange(position = position_dodge2(width = 0.5)) + 
  scale_discrete_manual("colour", values = c("green", "blue")) + 
  ylab("interval") + 
  geom_hline(yintercept = 50, linetype = "dashed") + 
  ggtitle("Confidence intervals about the between standard deviation", 
          subtitle = "Frequentist and Bayesian")
```

Quite good. 

<span style="color:firebrick;font-size:1.25em">
&#9755;
*I also noticed that the sampling was slower with this Gamma distribution.*
</span>

## Try the generalized fiducial inference.

My new package 'gfilmm' allows to perform the *generalized fiducial inference* 
for any Gaussian linear mixed model with categorical random effects. 

Fiducial inference and Bayesian inference have something in common: they are 
both based on a distribution representing the uncertainty about the parameters: 
the fiducial distribution and the posterior distribution, respectively.

A notable difference between these two methods of inference is that *there's no 
prior distribution in fiducial statistics*.

Here is how to run the fiducial sampler:

```{r gfi1, eval=FALSE}
library(gfilmm)
fiducialSimulations <- gfilmm(
  y = ~ cbind(y - 0.01, y + 0.01), fixed = ~ 1, random = ~ group, 
  data= dat, N = 10000L
)
```

Note the form of the response variable: `~ cbind(y - 0.01, y + 0.01)`. That's 
because the generalized fiducial inference applies to interval data. 

Below are the fiducial confidence intervals for the same simulated datasets as 
before.

```{r fidIntervalsWithin, echo=FALSE, fig.width=8}
fidWithin <- as.data.frame(fidIntervals[["within"]])
fidWithin[["simulation"]] <- factor(1L:nsims)
fidWithin$inference <- "fiducial"
fidAndFreq <- rbind(freqWithin, fidWithin)
ggplot(
  fidAndFreq, 
  aes(
    x = simulation, y = estimate, ymin = lwr, ymax = upr, 
    group = simulation, color = inference
  )
) + 
  geom_pointrange(position = position_dodge2(width = 0.5)) + 
  scale_discrete_manual("colour", values = c("red", "blue")) + 
  ylab("interval") + 
  geom_hline(yintercept = 1, linetype = "dashed") + 
  ggtitle("Confidence intervals about the within standard deviation", 
          subtitle = "Frequentist and fiducial")
```

```{r fidIntervalsBetween, echo=FALSE, fig.width=8}
fidBetween <- as.data.frame(fidIntervals[["between"]])
fidBetween[["simulation"]] <- factor(1L:nsims)
fidBetween$inference <- "fiducial"
fidAndFreq <- rbind(freqBetween, fidBetween)
ggplot(
  fidAndFreq, 
  aes(
    x = simulation, y = estimate, ymin = lwr, ymax = upr, 
    group = simulation, color = inference
  )
) + 
  geom_pointrange(position = position_dodge2(width = 0.5)) + 
  scale_discrete_manual("colour", values = c("red", "blue")) + 
  ylab("interval") + 
  geom_hline(yintercept = 50, linetype = "dashed") + 
  ggtitle("Confidence intervals about the between standard deviation", 
          subtitle = "Frequentist and fiducial")
```

Quite good too. And let me insist on this point: *there is no prior distribution, 
there is nothing to set, except the number of simulations*. I implemented the 
algorithm (due to J. Cisewski and J. Hannig) in C++ and it takes less than 
1000 lines of code.

Let's increase the between standard deviation now.

```{r gfiRatio1000, cache=TRUE}
ratio <- 1000
sigmaBetween <- ratio * sigmaWithin
set.seed(31415926L)
groupMeans <- rnorm(I, mu, sigmaBetween)
y          <- c(
  vapply(groupMeans, function(gmean) rnorm(J, gmean, sigmaWithin), numeric(J))
)
dat        <- data.frame(
  y = y,
  group = gl(I, J)
)
library(AOV1R)
library(gfilmm)
aovfit <- aov1r(y ~ group, data = dat)
gf <- gfilmm(~ cbind(y-0.01, y+0.01), ~ 1, ~ group, data = dat, N = 5000L)
confint(aovfit)
gfiSummary(gf)
```

The fiducial confidence interval about the within standard deviation does not 
match the frequentist interval, and does not catch the true value. Nothing to 
tinker with, except the number of simulations:

```{r gfi30000, cache=TRUE}
gf <- gfilmm(~ cbind(y-0.01, y+0.01), ~ 1, ~ group, data = dat, N = 30000L)
gfiSummary(gf)
```

Now the fiducial intervals match the frequentist ones.

## Epilogue

As you have seen, using the generalized fiducial inference is easy, easier than 
the Bayesian inference. The difficulty I mentioned regarding the Bayesian 
inference is not severe, but this is because the one-way ANOVA model with a 
random factor is the simplest Gaussian linear mixed model. Namely, it has only 
one between standard deviation. Things get more complicated for a mixed model 
with multiple random effects. With `rstanarm::stan_lmer`, one has to assign a 
Gamma prior distribution on each between standard deviation. I cheated for 
the above example: I did multiple attempts to select the parameters of the 
Gamma prior, until I found results close to the frequentist ones!