knitr::opts_chunk$set(
echo = TRUE,
dev = "png",
dpi = 150,
fig.align = "center",
comment = NA,
cache = TRUE
)
library(rstan)
library(dplyr)
library(lubridate)
library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default())
# seed for R's pseudo-RNGs (not Stan's)
set.seed(1123)
Imagine that you are a statistician or data scientist working as an independent contractor. One of your clients is a company that owns many residential buildings throughout New York City. The property manager explains that they are concerned about the number of cockroach complaints that they receive from their buildings. Previously the company has offered monthly visits from a pest inspector as a solution to this problem. While this is the default solution of many property managers in NYC, the tenants are rarely home when the inspector visits, and so the manager reasons that this is a relatively expensive solution that is currently not very effective.
One alternative is to deploy long-term bait stations. In this alternative, child- and pet-safe bait stations are installed throughout the apartment building. Cockroaches obtain quick-acting poison from these stations and distribute it throughout the colony. The manufacturer of these bait stations provides some indication of the space-to-bait efficacy, but the manager suspects that this guidance was not calculated with NYC roaches in mind. NYC roaches, the manager rationalizes, have more hustle than traditional roaches; and NYC buildings are built differently than other common residential buildings in the US. This is particularly important as the unit cost for each bait station per year is quite high.
The manager wishes to employ your services to help them find the optimal number of roach bait stations to place in each of their buildings in order to minimize the number of cockroach complaints while also keeping expenditure on pest control affordable.
A subset of the company’s buildings has been randomly selected for an experiment.
Fortunately, Bayesian data analysis provides a coherent framework for us to tackle this problem.
Formally, we are interested in finding
\[ \arg\max_{\textrm{traps} \in \mathbb{N}} \mathbb{E}_{\text{complaints}}[R(\textrm{complaints}(\textrm{traps})) - \textrm{TC}(\textrm{traps})], \]
where \(R(\cdot)\) is the revenue the company makes, a decreasing function of the number of complaints, and \(\textrm{TC}(\cdot)\) is the total cost of installing and maintaining a given number of traps.
The property manager would also, if possible, like to learn how these results generalize to buildings they haven’t treated so they can understand the potential costs of pest control at buildings they are acquiring as well as for the rest of their building portfolio.
As the property manager has complete control over the number of traps set, the random variable contributing to this expectation is the number of complaints given the number of traps. We will model the number of complaints as a function of the number of traps.
The data provided to us is in a file called pest_data.RDS. Let’s load the data and see what the structure is:
pest_data <- readRDS('data/pest_data.RDS')
str(pest_data)
'data.frame': 120 obs. of 13 variables:
$ building_id : int 37 37 37 37 37 37 37 37 37 37 ...
$ date : Date, format: "2017-01-15" "2017-02-14" ...
$ traps : num 8 8 9 10 11 11 10 10 9 9 ...
$ floors : num 8 8 8 8 8 8 8 8 8 8 ...
$ sq_footage_p_floor : num 5149 5149 5149 5149 5149 ...
$ live_in_super : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthly_average_rent: num 3847 3847 3847 3847 3847 ...
$ average_tenant_age : num 53.9 53.9 53.9 53.9 53.9 ...
$ age_of_building : num 47 47 47 47 47 47 47 47 47 47 ...
$ total_sq_foot : num 41192 41192 41192 41192 41192 ...
$ month : num 1 2 3 4 5 6 7 8 9 10 ...
$ complaints : num 1 3 0 1 0 0 4 3 2 2 ...
$ log_sq_foot_1e4 : num 1.42 1.42 1.42 1.42 1.42 ...
We have access to the following fields:
- complaints: Number of complaints per building per month
- building_id: The unique building identifier
- traps: The number of traps used per month per building
- date: The date at which the number of complaints are recorded
- live_in_super: An indicator for whether the building has a live-in super
- age_of_building: The age of the building
- total_sq_foot: The total square footage of the building
- average_tenant_age: The average age of the tenants per building
- monthly_average_rent: The average monthly rent per building
- floors: The number of floors per building

First, let’s see how many buildings we have in the data:
N_buildings <- length(unique(pest_data$building_id))
N_buildings
[1] 10
And make some plots of the raw data:
ggplot(pest_data, aes(x = complaints)) +
geom_bar()
ggplot(pest_data, aes(x = traps, y = complaints, color = live_in_super == TRUE)) +
geom_jitter()
ggplot(pest_data, aes(x = date, y = complaints, color = live_in_super == TRUE)) +
geom_line(aes(linetype = "Number of complaints")) +
geom_point(color = "black") +
geom_line(aes(y = traps, linetype = "Number of traps"), color = "black", size = 0.25) +
facet_wrap(~building_id, scales = "free", ncol = 2, labeller = label_both) +
scale_x_date(name = "Month", date_labels = "%b") +
scale_y_continuous(name = "", limits = range(pest_data$complaints)) +
scale_linetype_discrete(name = "") +
scale_color_discrete(name = "Live-in super")
The first question we’ll look at is just whether the number of complaints per building per month is associated with the number of bait stations per building per month, ignoring the temporal and across-building variation (we’ll get to that later). This requires only two variables, \(\textrm{complaints}\) and \(\textrm{traps}\). How should we model the number of complaints?
See slides
We already know some rudimentary information about what we should expect. The number of complaints over a month should be a non-negative integer. The property manager tells us that it is possible but unlikely that the number of complaints in a given month is zero. Occasionally there are a very large number of complaints in a single month. A common way of modeling this sort of skewed count data, bounded below at zero, is as a Poisson random variable. One concern about modeling the outcome variable as Poisson is that the data may be over-dispersed, but we’ll start with the Poisson model and then check whether over-dispersion is a problem by comparing our model’s predictions to the data.
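Before fitting anything, a quick heuristic check of the raw counts can hint at over-dispersion. This snippet is an aside, not part of the original analysis:
# compare sample mean and variance of the raw counts;
# a variance much larger than the mean hints at over-dispersion
mean(pest_data$complaints)
var(pest_data$complaints)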
Given that we have chosen a Poisson regression, we define the likelihood to be the Poisson probability mass function for the number of complaints, with a rate that depends on the number of bait stations placed in the building, denoted below as traps. This model assumes that the conditional mean and variance of the outcome variable complaints (number of complaints) are the same. We’ll investigate whether this is a good assumption after we fit the model.
For building \(b = 1,\dots,10\) at time (month) \(t = 1,\dots,12\), we have
\[ \begin{align*} \textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\lambda_{b,t}) \\ \lambda_{b,t} & = \exp{(\eta_{b,t})} \\ \eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t} \end{align*} \]
Let’s encode this probability model in a Stan program, simple_poisson_regression.stan.
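The program itself is collapsed in the original document; a minimal sketch consistent with the data passed in later and the y_rep quantities used for posterior predictive checking might look like this (the priors are assumptions, loosely matching the data-generating values used below):
data {
  int<lower=1> N;
  vector<lower=0>[N] traps;          // number of bait stations per building per month
  int<lower=0> complaints[N];        // number of complaints per building per month
}
parameters {
  real alpha;                        // intercept on the log scale
  real beta;                         // coefficient on traps
}
model {
  // priors are assumptions for this sketch
  alpha ~ normal(log(4), 1);
  beta ~ normal(-0.25, 1);
  // poisson_log(eta) is poisson(exp(eta))
  complaints ~ poisson_log(alpha + beta * traps);
}
generated quantities {
  int y_rep[N];                      // in-sample posterior predictive draws
  for (n in 1:N)
    y_rep[n] = poisson_log_rng(alpha + beta * traps[n]);
}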
However, before we fit the model to real data, we should check that our model works well with simulated data. We’ll simulate data according to the model and then check that we can sufficiently recover the parameter values used in the simulation (ideally we would do this for many different draws from the prior). You can think of this step as software testing.
In R we can simulate data like this:
set.seed(106) # seed for R (not Stan)
N <- nrow(pest_data)
alpha_prior <- rnorm(1, mean = log(4), sd = .1)
beta_prior <- rnorm(1, mean = -0.25, sd = .1)
traps_sim <- rpois(N, lambda = mean(pest_data$traps))
complaints_sim <- rpois(N, lambda = exp(alpha_prior + beta_prior * traps_sim))
samps_dgp <- list(
alpha = alpha_prior,
beta = beta_prior,
traps = matrix(traps_sim, nrow = 1),
complaints = matrix(complaints_sim, nrow = 1)
)
But we can also use Stan to simulate the fake data:
simple_poisson_regression_dgp.stan
We can use the stan() function to compile and fit the model, but here we will do the compilation and fitting in two stages to demonstrate what is really happening under the hood.
First we will compile the Stan program (simple_poisson_regression_dgp.stan
) that will generate the fake data.
comp_dgp_simple <- stan_model('stan_programs/simple_poisson_regression_dgp.stan')
Printing this object shows the Stan program:
print(comp_dgp_simple)
S4 class stanmodel 'simple_poisson_regression_dgp' coded as follows:
data {
int<lower=1> N;
real<lower=0> mean_traps;
}
model {
}
generated quantities {
int traps[N];
int complaints[N];
real alpha = normal_rng(log(4), .1);
real beta = normal_rng(-0.25, .1);
for (n in 1:N) {
traps[n] = poisson_rng(mean_traps);
complaints[n] = poisson_log_rng(alpha + beta * traps[n]);
}
}
Now we can simulate the data by calling the sampling() function.
fitted_model_dgp <- sampling(
comp_dgp_simple,
data = list(N = nrow(pest_data), mean_traps = mean(pest_data$traps)),
algorithm = 'Fixed_param', # set this if there's no parameter block
seed = 123,
chains = 1,
iter = 1) # we'll just use 1 simulated data set but you can (and should!) use more
SAMPLING FOR MODEL 'simple_poisson_regression_dgp' NOW (CHAIN 1).
Chain 1: Iteration: 1 / 1 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0 seconds (Warm-up)
Chain 1: 5.6e-05 seconds (Sampling)
Chain 1: 5.6e-05 seconds (Total)
Chain 1:
Warning in throw_sampler_warnings(nfit): Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
Warning in throw_sampler_warnings(nfit): Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess
# see https://mc-stan.org/rstan/articles/stanfit_objects.html for various
# ways of extracting the contents of the stanfit object
samps_dgp <- rstan::extract(fitted_model_dgp)
str(samps_dgp)
List of 5
$ traps : num [1, 1:120] 7 5 8 11 9 6 5 6 8 9 ...
..- attr(*, "dimnames")=List of 2
.. ..$ iterations: NULL
.. ..$ : NULL
$ complaints: num [1, 1:120] 0 1 0 0 0 0 0 0 1 0 ...
..- attr(*, "dimnames")=List of 2
.. ..$ iterations: NULL
.. ..$ : NULL
$ alpha : num [1(1d)] 1.29
..- attr(*, "dimnames")=List of 1
.. ..$ iterations: NULL
$ beta : num [1(1d)] -0.283
..- attr(*, "dimnames")=List of 1
.. ..$ iterations: NULL
$ lp__ : num [1(1d)] 0
..- attr(*, "dimnames")=List of 1
.. ..$ iterations: NULL
In order to pass the fake data to our Stan program using RStan, we need to arrange the data into a named list. The names must match the names used in the data
block of the Stan program.
stan_dat_fake <- list(
N = nrow(pest_data),
traps = samps_dgp$traps[1, ],
complaints = samps_dgp$complaints[1, ]
)
str(stan_dat_fake)
List of 3
$ N : int 120
$ traps : num [1:120] 7 5 8 11 9 6 5 6 8 9 ...
$ complaints: num [1:120] 0 1 0 0 0 0 0 0 1 0 ...
Now that we have the simulated data we fit the model to see if we can recover the alpha
and beta
parameters used in the simulation.
comp_model_P <- stan_model('stan_programs/simple_poisson_regression.stan')
fit_model_P <- sampling(comp_model_P, data = stan_dat_fake, seed = 123)
posterior_alpha_beta <- as.matrix(fit_model_P, pars = c('alpha','beta'))
head(posterior_alpha_beta)
true_alpha_beta <- c(samps_dgp$alpha, samps_dgp$beta)
mcmc_recover_hist(posterior_alpha_beta, true = true_alpha_beta)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Because we’re not simulating so many observations, we shouldn’t expect to recover the parameters perfectly, but it looks at least plausible (\(\alpha\) and \(\beta\) are contained within the histograms). If we did the simulation with many more observations the parameters would be estimated much more precisely. It’s also a good idea to do this many times using different draws from the prior to get a sense for the variety of the data that can be generated from the model.
We should also check if the y_rep
datasets (in-sample predictions) that we coded in the generated quantities
block are similar to the y
(complaints) values we conditioned on when fitting the model. (The bayesplot package vignettes are a good resource on this topic.)
Here is a plot of the density estimate of the observed data compared to 200 of the y_rep
datasets:
y_rep <- as.matrix(fit_model_P, pars = "y_rep")
# https://mc-stan.org/bayesplot/reference/PPC-distributions#plot-descriptions
ppc_dens_overlay(y = stan_dat_fake$complaints, yrep = y_rep[1:200, ])
In the plot above we have the kernel density estimate of the observed data (\(y\), thicker curve) and 200 simulated data sets (\(y_{rep}\), thin curves) from the posterior predictive distribution. If the model fits the data well, as it does here, there is little difference between the observed dataset and the simulated datasets.
Another plot we can make for count data is a rootogram. This is a plot of the (square roots of) expected counts vs the observed counts (blue histogram). We can see the model fits well because the observed histogram matches the expected counts relatively well.
# https://mc-stan.org/bayesplot/reference/PPC-discrete#plot-descriptions
ppc_rootogram(y = stan_dat_fake$complaints, yrep = y_rep, prob = 0.9)
The ppc_bars()
function will make a bar plot of the observed values and overlay prediction intervals but not on the square root scale (unlike the rootogram).
# https://mc-stan.org/bayesplot/reference/PPC-discrete#plot-descriptions
ppc_bars(stan_dat_fake$complaints, yrep = y_rep)
To fit the model to the actual observed data we’ll first create a list to pass to Stan using the variables in the pest_data
data frame:
stan_dat_simple <- list(
N = nrow(pest_data),
complaints = pest_data$complaints,
traps = pest_data$traps
)
str(stan_dat_simple)
List of 3
$ N : int 120
$ complaints: num [1:120] 1 3 0 1 0 0 4 3 2 2 ...
$ traps : num [1:120] 8 8 9 10 11 11 10 10 9 9 ...
As we have already compiled the model, we can jump straight to sampling from it.
fit_P_real_data <- sampling(comp_model_P, data = stan_dat_simple,
# these are the defaults but specifying them anyway
# so you can see how to use them:
# posterior sample size = chains * (iter-warmup)
chains = 4, iter = 2000, warmup = 1000)
SAMPLING FOR MODEL 'simple_poisson_regression' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 1.5e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.15 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0.098702 seconds (Warm-up)
Chain 1: 0.099506 seconds (Sampling)
Chain 1: 0.198208 seconds (Total)
Chain 1:
SAMPLING FOR MODEL 'simple_poisson_regression' NOW (CHAIN 2).
Chain 2:
Chain 2: Gradient evaluation took 1e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.1 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2:
Chain 2:
Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 2:
Chain 2: Elapsed Time: 0.08977 seconds (Warm-up)
Chain 2: 0.086348 seconds (Sampling)
Chain 2: 0.176118 seconds (Total)
Chain 2:
SAMPLING FOR MODEL 'simple_poisson_regression' NOW (CHAIN 3).
Chain 3:
Chain 3: Gradient evaluation took 1.9e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.19 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3:
Chain 3:
Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 3:
Chain 3: Elapsed Time: 0.111037 seconds (Warm-up)
Chain 3: 0.100781 seconds (Sampling)
Chain 3: 0.211818 seconds (Total)
Chain 3:
SAMPLING FOR MODEL 'simple_poisson_regression' NOW (CHAIN 4).
Chain 4:
Chain 4: Gradient evaluation took 1e-05 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.1 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4:
Chain 4:
Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 4:
Chain 4: Elapsed Time: 0.096769 seconds (Warm-up)
Chain 4: 0.090637 seconds (Sampling)
Chain 4: 0.187406 seconds (Total)
Chain 4:
Print the summary of the intercept and slope parameters:
print(fit_P_real_data, pars = c('alpha','beta'))
Inference for Stan model: simple_poisson_regression.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
alpha 2.57 0 0.15 2.26 2.47 2.58 2.67 2.85 1062 1
beta -0.19 0 0.02 -0.23 -0.21 -0.19 -0.18 -0.15 1025 1
Samples were drawn using NUTS(diag_e) at Mon Jun 17 17:14:31 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
We can also plot the posterior distributions:
draws <- as.matrix(fit_P_real_data, pars = c('alpha','beta'))
mcmc_hist(draws) # marginal posteriors of alpha and beta
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
mcmc_scatter(draws, alpha = 0.2) # posterior of (alpha,beta)
From the posterior of beta, it appears the number of bait stations set in a building is negatively associated with the number of complaints about cockroaches that were made in the following month. However, we still need to consider how well the model fits.
y_rep <- as.matrix(fit_P_real_data, pars = "y_rep")
ppc_dens_overlay(y = stan_dat_simple$complaints, y_rep[1:200,])
As opposed to when we fit the model to simulated data above, here the simulated datasets are not as dispersed as the observed data and don’t seem to capture the rate of zeros in the observed data. The Poisson model may not be sufficient for this data.
Let’s explore this further by looking directly at the proportion of zeros in the real data and predicted data.
# calculate the proportion of zeros in a vector
prop_zero <- function(x) mean(x == 0)
# https://mc-stan.org/bayesplot/reference/PPC-test-statistics#plot-descriptions
ppc_stat(y = stan_dat_simple$complaints, yrep = y_rep, stat = "prop_zero")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The plot above shows the observed proportion of zeros (thick vertical line) and a histogram of the proportion of zeros in each of the simulated data sets. It is clear that the model does not capture this feature of the data well at all.
The rootogram is sometimes a useful plot to compare the observed vs expected number of complaints (on the square-root scale to make large differences easier to plot). This is a plot of the square root of the expected counts (continuous line and interval) vs the observed counts (blue histogram):
ppc_rootogram(y = stan_dat_simple$complaints, yrep = y_rep, prob = 0.9)
If the model were fitting well these would be relatively similar; however, in this figure we can see that the number of complaints is underestimated when there are few complaints, overestimated for medium numbers of complaints, and underestimated when there are many complaints.
We can also view how the predicted number of complaints varies with the number of traps. From this we can see that the model doesn’t seem to fully capture the data.
# Note: the jitter argument will be ignored unless you install a development
# version of bayesplot from GitHub:
# devtools::install_github("stan-dev/bayesplot", ref = "ppd-functions")
ppc_intervals(
y = stan_dat_simple$complaints,
yrep = y_rep,
x = stan_dat_simple$traps,
# jitter helps since we have multiple observations at each (or many) values of x
jitter = 0.2
) +
labs(x = "Number of traps",
y = "Predicted number of complaints")
Specifically, the model doesn’t capture the tails of the observed data very well.
The model relating complaints to bait stations alone is the simplest model we could write down. We can expand it, however, in a few ways that will be beneficial for our client. Moreover, the manager has told us that they expect there are a number of other reasons that one building might have more roach complaints than another.
Currently, our model’s mean parameter is a rate of complaints per 30 days, but we’re modeling a process that occurs over an area as well as over time. We have the square footage of each building, so if we add that information into the model, we can interpret our parameters as a rate of complaints per square foot per 30 days.
\[ \begin{align*} \textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\textrm{sq_foot}_b\,\lambda_{b,t}) \\ \lambda_{b,t} & = \exp{(\eta_{b,t} )} \\ \eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t} \end{align*} \]
The term \(\text{sq_foot}\) is called an exposure term. If we take its logarithm, we can move it into \(\eta_{b,t}\):
\[ \begin{align*} \textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\lambda_{b,t}) \\ \lambda_{b,t} & = \exp{(\eta_{b,t} )} \\ \eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t} + \textrm{log_sq_foot}_b \end{align*} \]
A quick test shows us that there appears to be a relationship between the square footage of the building and the number of complaints received:
ggplot(pest_data, aes(x = log(total_sq_foot), y = log1p(complaints))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
Using the property manager’s intuition, we include two extra pieces of information we know about each building, the (log of the) square footage and whether or not there is a live-in super, into both the simulated and real data.
Adding in live_in_super
gives us: \[
\eta_{b,t} = \alpha + \beta \, \textrm{traps}_{b,t} +
\beta_{\rm super} \textrm{live_in_super}_{b,t} +
\textrm{log_sq_foot}_b
\]
Add them to the list of data to pass to Stan:
stan_dat_simple$log_sq_foot <- pest_data$log_sq_foot_1e4 # log(total_sq_foot/1e4)
stan_dat_simple$live_in_super <- pest_data$live_in_super
Now we need a new Stan model, multiple_poisson_regression.stan, that uses multiple predictors.
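As with the previous program, the file is collapsed in the original; a sketch consistent with the data list and parameters used below (priors again assumed) might be:
data {
  int<lower=1> N;
  vector<lower=0>[N] traps;
  vector<lower=0,upper=1>[N] live_in_super;
  vector[N] log_sq_foot;             // exposure term, included with coefficient fixed to 1
  int<lower=0> complaints[N];
}
parameters {
  real alpha;
  real beta;
  real beta_super;
}
model {
  // priors are assumptions for this sketch
  alpha ~ normal(log(4), 1);
  beta ~ normal(-0.25, 1);
  beta_super ~ normal(-0.5, 1);
  complaints ~ poisson_log(alpha + beta * traps
                           + beta_super * live_in_super + log_sq_foot);
}
generated quantities {
  int y_rep[N];
  for (n in 1:N)
    y_rep[n] = poisson_log_rng(alpha + beta * traps[n]
                               + beta_super * live_in_super[n] + log_sq_foot[n]);
}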
As before, we first compile a corresponding data-generating program and simulate fake data from it:
comp_dgp_multiple <- stan_model('stan_programs/multiple_poisson_regression_dgp.stan')
fitted_model_dgp <-
sampling(
comp_dgp_multiple,
data = list(N = nrow(pest_data)),
refresh = 0, # suppress progress output from Stan
chains = 1,
iter = 1,
algorithm = 'Fixed_param',
seed = 123
)
Warning in throw_sampler_warnings(nfit): Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
Warning in throw_sampler_warnings(nfit): Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess
samps_dgp <- rstan::extract(fitted_model_dgp)
Now pop that simulated data into a list ready for Stan.
stan_dat_fake <- list(
N = nrow(pest_data),
log_sq_foot = samps_dgp$log_sq_foot[1, ],
live_in_super = samps_dgp$live_in_super[1, ],
traps = samps_dgp$traps[1, ],
complaints = samps_dgp$complaints[1, ]
)
And then compile and fit the model we wrote for the multiple regression.
comp_model_P_mult <- stan_model('stan_programs/multiple_poisson_regression.stan')
fit_model_P_mult <-
sampling(
comp_model_P_mult,
data = stan_dat_fake,
refresh = 0 # suppress progress output from Stan
)
Then compare these parameters to the true parameters:
posterior_alpha_beta <- as.matrix(fit_model_P_mult, pars = c('alpha','beta','beta_super'))
true_alpha_beta <- c(samps_dgp$alpha, samps_dgp$beta, samps_dgp$beta_super)
mcmc_recover_hist(posterior_alpha_beta, true = true_alpha_beta)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We’ve recovered the parameters sufficiently well, so we’ve probably coded the Stan program correctly and we’re ready to fit the real data.
Now let’s use the real data and explore the fit.
fit_model_P_mult_real <- sampling(comp_model_P_mult, data = stan_dat_simple,
refresh = 0)
y_rep <- as.matrix(fit_model_P_mult_real, pars = "y_rep")
ppc_dens_overlay(stan_dat_simple$complaints, y_rep[1:200,])
# in this case we have very high effective sample sizes, but if there is
# nontrivial autocorrelation then it's better to take a random sample of the draws,
# for example:
ids <- sample(nrow(y_rep), size = 200)
ppc_dens_overlay(stan_dat_simple$complaints, y_rep[ids, ])
This again looks like we haven’t captured the smaller counts very well, nor have we captured the larger counts.
We’re still severely underestimating the proportion of zeros in the data:
prop_zero <- function(x) mean(x == 0)
ppc_stat(
y = stan_dat_simple$complaints,
yrep = y_rep,
stat = "prop_zero",
binwidth = 0.01
)
Ideally this vertical bar would fall somewhere within the histogram.
We can also plot uncertainty intervals for the predicted complaints for different numbers of traps.
ppc_intervals(
y = stan_dat_simple$complaints,
yrep = y_rep,
x = stan_dat_simple$traps,
# jitter will be ignored without dev version of bayesplot
# (see instructions above the first time ppc_intervals is used)
jitter = 0.2
) +
labs(x = "Number of traps",
y = "Predicted number of complaints")
We can see that we’ve increased the tails a bit more at the larger numbers of traps but we still have some large observed numbers of complaints that the model would consider extremely unlikely events.
When we considered modeling the data using a Poisson distribution, we saw that the model didn’t appear to fit the data as well as we would like. In particular, the model underpredicted low and high numbers of complaints and overpredicted medium numbers of complaints. This is one indication of over-dispersion, where the variance is larger than the mean. A Poisson model doesn’t fit over-dispersed count data very well because the same parameter \(\lambda\) controls both the expected counts and the variance of these counts. The natural alternative to this is the negative binomial model:
\[ \begin{align*} \text{complaints}_{b,t} & \sim \text{Neg-Binomial}(\lambda_{b,t}, \phi) \\ \lambda_{b,t} & = \exp{(\eta_{b,t})} \\ \eta_{b,t} &= \alpha + \beta \, {\rm traps}_{b,t} + \beta_{\rm super} \, {\rm super}_{b} + \text{log_sq_foot}_{b} \end{align*} \]
The negative binomial mass function we’ll use in Stan is called \(\texttt{neg_binomial_2_log}(\text{ints} \, y, \text{reals} \, \eta, \text{reals} \, \phi)\). Like the poisson_log function, this negative binomial mass function is parameterized in terms of its log-mean, \(\eta\), but it also has a precision parameter \(\phi\) such that
\[ \mathbb{E}[y] \, = \lambda = \exp(\eta) \]
\[ \text{Var}[y] = \lambda + \lambda^2/\phi = \exp(\eta) + \exp(\eta)^2 / \phi. \]
As \(\phi\) gets larger the term \(\lambda^2 / \phi\) approaches zero and so the variance of the negative-binomial approaches \(\lambda\), i.e., the negative-binomial gets closer and closer to the Poisson.
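As an aside, we can verify this mean-variance relationship by simulation in R, since the mu/size parameterization of rnbinom() corresponds to \(\lambda\) and \(\phi\) above:
# simulate from the negative binomial with mean lambda and precision phi
lambda <- 5
phi <- 2
y_sim <- rnbinom(1e6, mu = lambda, size = phi)
mean(y_sim)  # close to lambda = 5
var(y_sim)   # close to lambda + lambda^2 / phi = 17.5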
The inferential model lives in multiple_NB_regression.stan.
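The file is collapsed in the original document; a minimal sketch consistent with the quantities extracted later (alpha, beta, beta_super, inv_phi, phi, y_rep) might look like this, with the priors again being assumptions:
data {
  int<lower=1> N;
  vector<lower=0>[N] traps;
  vector<lower=0,upper=1>[N] live_in_super;
  vector[N] log_sq_foot;
  int<lower=0> complaints[N];
}
parameters {
  real alpha;
  real beta;
  real beta_super;
  real<lower=0> inv_phi;             // 1 / phi, so inv_phi -> 0 recovers the Poisson
}
transformed parameters {
  real phi = inv(inv_phi);
}
model {
  // priors are assumptions for this sketch
  alpha ~ normal(log(4), 1);
  beta ~ normal(-0.25, 1);
  beta_super ~ normal(-0.5, 1);
  inv_phi ~ normal(0, 1);
  complaints ~ neg_binomial_2_log(alpha + beta * traps
                                  + beta_super * live_in_super + log_sq_foot, phi);
}
generated quantities {
  int y_rep[N];
  for (n in 1:N)
    y_rep[n] = neg_binomial_2_log_rng(alpha + beta * traps[n]
                                      + beta_super * live_in_super[n]
                                      + log_sq_foot[n], phi);
}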
As before, we start by compiling the corresponding data-generating program:
comp_dgp_multiple_NB <- stan_model('stan_programs/multiple_NB_regression_dgp.stan')
We’re going to generate one draw from the fake data model so we can use the data to fit our model and compare the known values of the parameters to the posterior density of the parameters.
fitted_model_dgp_NB <-
sampling(
comp_dgp_multiple_NB,
data = list(N = nrow(pest_data)),
chains = 1,
cores = 1,
iter = 1,
algorithm = 'Fixed_param',
seed = 123
)
SAMPLING FOR MODEL 'multiple_NB_regression_dgp' NOW (CHAIN 1).
Chain 1: Iteration: 1 / 1 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0 seconds (Warm-up)
Chain 1: 9.5e-05 seconds (Sampling)
Chain 1: 9.5e-05 seconds (Total)
Chain 1:
Warning in throw_sampler_warnings(nfit): Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
Warning in throw_sampler_warnings(nfit): Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess
samps_dgp_NB <- rstan::extract(fitted_model_dgp_NB)
Create a dataset to feed into the Stan model.
stan_dat_fake_NB <- list(
N = nrow(pest_data),
log_sq_foot = samps_dgp_NB$log_sq_foot[1, ],
live_in_super = samps_dgp_NB$live_in_super[1, ],
traps = samps_dgp_NB$traps[1, ],
complaints = samps_dgp_NB$complaints[1, ]
)
Compile the inferential model.
comp_model_NB <- stan_model('stan_programs/multiple_NB_regression.stan')
Now we run our NB regression over the fake data and extract the samples to examine posterior predictive checks and to check whether we’ve sufficiently recovered our known parameters: \(\texttt{alpha}\), \(\texttt{beta}\), \(\texttt{beta_super}\), and \(\texttt{inv_phi}\).
fitted_model_NB <- sampling(comp_model_NB, data = stan_dat_fake_NB,
cores = 4)
posterior_alpha_beta_NB <-
as.matrix(fitted_model_NB,
pars = c('alpha',
'beta',
'beta_super',
'inv_phi'))
Construct the vector of true values from your simulated dataset and compare to the recovered parameters.
true_alpha_beta_NB <-
c(samps_dgp_NB$alpha,
samps_dgp_NB$beta,
samps_dgp_NB$beta_super,
samps_dgp_NB$inv_phi
)
mcmc_recover_hist(posterior_alpha_beta_NB, true = true_alpha_beta_NB)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
With the parameters recovered well, let’s fit the model to the actual data.
fitted_model_NB <- sampling(comp_model_NB, data = stan_dat_simple, cores = 4)
samps_NB <- rstan::extract(fitted_model_NB)
str(samps_NB)
List of 7
$ alpha : num [1:4000(1d)] 0.748 0.983 1.037 0.666 0.617 ...
..- attr(*, "dimnames")=List of 1
.. ..$ iterations: NULL
$ beta : num [1:4000(1d)] -0.162 -0.188 -0.191 -0.162 -0.138 ...
..- attr(*, "dimnames")=List of 1
.. ..$ iterations: NULL
$ beta_super: num [1:4000(1d)] -0.3229 0.4158 -0.2176 -0.0167 0.0113 ...
..- attr(*, "dimnames")=List of 1
.. ..$ iterations: NULL
$ inv_phi : num [1:4000(1d)] 0.742 0.738 0.84 0.639 0.677 ...
..- attr(*, "dimnames")=List of 1
.. ..$ iterations: NULL
$ phi : num [1:4000(1d)] 1.35 1.35 1.19 1.56 1.48 ...
..- attr(*, "dimnames")=List of 1
.. ..$ iterations: NULL
$ y_rep : num [1:4000, 1:120] 0 5 3 4 4 2 1 2 2 0 ...
..- attr(*, "dimnames")=List of 2
.. ..$ iterations: NULL
.. ..$ : NULL
$ lp__ : num [1:4000(1d)] 231 227 232 232 232 ...
..- attr(*, "dimnames")=List of 1
.. ..$ iterations: NULL
Let’s look at our predictions vs. the data.
y_rep <- samps_NB$y_rep
ppc_dens_overlay(stan_dat_simple$complaints, y_rep[1:200,])
It appears that our model now captures both the smaller counts and the tails of the data better than the Poisson model did.
Let’s check if the negative binomial model does a better job capturing the number of zeros:
ppc_stat(y = stan_dat_simple$complaints, yrep = y_rep, stat = "prop_zero")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
These look OK, but let’s look at the standardized residual plot.
mean_inv_phi <- mean(samps_NB$inv_phi)
mean_y_rep <- colMeans(y_rep)
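# standardized residuals using the NB variance: Var(y) = mu + mu^2 / phi = mu + mu^2 * inv_phi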
std_resid <- (stan_dat_simple$complaints - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_resid) + hline_at(2) + hline_at(-2)
Looks OK, but we still have some very large standardized residuals. This might be because we are currently ignoring that the data are clustered by building, and that the probability of roach issues may vary substantially across buildings.
ppc_rootogram(stan_dat_simple$complaints, yrep = y_rep)
The rootogram now looks much more plausible: the expected number of complaints matches the observed number of complaints much more closely. However, we still have some larger counts that appear to be outliers for the model.
Check predictions by number of traps:
ppc_intervals(
y = stan_dat_simple$complaints,
yrep = y_rep,
x = stan_dat_simple$traps,
jitter = 0.2
) +
labs(x = "Number of traps", y = "Number of complaints")
We haven’t used the fact that the data are clustered by building yet. A posterior predictive check might elucidate whether it would be a good idea to add the building information into the model.
ppc_stat_grouped(
y = stan_dat_simple$complaints,
yrep = y_rep,
group = pest_data$building_id,
stat = 'mean',
binwidth = 0.2
)
We’re getting plausible predictions for most building means but some are estimated better than others and some have larger uncertainties than we might expect. If we explicitly model the variation across buildings we may be able to get much better estimates.
Let’s add a hierarchical intercept parameter, \(\mu_b\), at the building level to our model.
\[ \text{complaints}_{b,t} \sim \text{Neg-Binomial}(\lambda_{b,t}, \phi) \\ \lambda_{b,t} = \exp{(\eta_{b,t})} \\ \eta_{b,t} = \mu_b + \beta \, {\rm traps}_{b,t} + \beta_{\rm super}\, {\rm super}_b + \text{log_sq_foot}_b \\ \mu_b \sim \text{Normal}(\alpha, \sigma_{\mu}) \]
In our Stan model, \(\mu_b\) is the \(b\)-th element of the vector \(\texttt{mu}\) which has one element per building.
One of our predictors varies only by building, so we can rewrite the above model more efficiently like so:
\[ \eta_{b,t} = \mu_b + \beta \, {\rm traps}_{b,t} + \text{log_sq_foot}_b\\ \mu_b \sim \text{Normal}(\alpha + \beta_{\text{super}} \, \text{super}_b , \sigma_{\mu}) \]
We have more information at the building level as well, like the average age of the residents, the average age of the buildings, and the average per-apartment monthly rent, so we can add that data into a matrix called building_data, which will have one row per building and four columns:
- live_in_super
- age_of_building
- average_tenant_age
- monthly_average_rent
We’ll write the Stan model like:
\[ \eta_{b,t} = \mu_b + \beta \, {\rm traps}_{b,t} + \text{log_sq_foot}_b \\ \mu \sim \text{Normal}(\alpha + \texttt{building_data} \, \zeta, \,\sigma_{\mu}) \]
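The Stan file hier_NB_regression.stan is again collapsed in the original; a sketch consistent with the data list assembled below and the parameters printed later (priors assumed) might be:
data {
  int<lower=1> N;
  int<lower=0> complaints[N];
  vector<lower=0>[N] traps;
  vector[N] log_sq_foot;
  int<lower=1> J;                    // number of buildings
  int<lower=1> K;                    // number of building-level predictors
  int<lower=1,upper=J> building_idx[N];
  matrix[J,K] building_data;
}
parameters {
  real alpha;                        // population-level intercept
  vector[J] mu;                      // building-level intercepts (centered parameterization)
  real<lower=0> sigma_mu;
  real beta;
  vector[K] zeta;                    // coefficients on building-level predictors
  real<lower=0> inv_phi;
}
transformed parameters {
  real phi = inv(inv_phi);
}
model {
  // priors are assumptions for this sketch
  alpha ~ normal(log(4), 1);
  beta ~ normal(-0.25, 1);
  zeta ~ normal(0, 1);
  sigma_mu ~ normal(0, 1);
  inv_phi ~ normal(0, 1);
  mu ~ normal(alpha + building_data * zeta, sigma_mu);  // centered parameterization
  complaints ~ neg_binomial_2_log(mu[building_idx] + beta * traps + log_sq_foot, phi);
}
generated quantities {
  int y_rep[N];
  for (n in 1:N)
    y_rep[n] = neg_binomial_2_log_rng(mu[building_idx[n]] + beta * traps[n]
                                      + log_sq_foot[n], phi);
}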
We’ll need to do some more data prep before we can fit our models. First, to use the building variable in Stan we need to transform it from a factor into an integer index.
N_months <- length(unique(pest_data$date))
N_buildings <- length(unique(pest_data$building_id))
# Add some IDs for building and month
pest_data <- pest_data %>%
mutate(
building_fac = factor(building_id, levels = unique(building_id)),
building_idx = as.integer(building_fac),
ids = rep(1:N_months, N_buildings),
mo_idx = lubridate::month(date)
)
# Center and rescale the building specific data
building_data <- pest_data %>%
select(
building_idx,
live_in_super,
age_of_building,
total_sq_foot,
average_tenant_age,
monthly_average_rent
) %>%
unique() %>%
arrange(building_idx) %>%
select(-building_idx) %>%
scale(scale=FALSE) %>%
as.data.frame() %>%
mutate( # scale by constants
age_of_building = age_of_building / 10,
total_sq_foot = total_sq_foot / 10000,
average_tenant_age = average_tenant_age / 10,
monthly_average_rent = monthly_average_rent / 1000
) %>%
as.matrix()
# Make data list for Stan
stan_dat_hier <-
with(pest_data,
list(complaints = complaints,
traps = traps,
N = length(traps),
J = N_buildings,
M = N_months,
log_sq_foot = log(pest_data$total_sq_foot/1e4),
building_data = building_data[,-3],
mo_idx = as.integer(as.factor(date)),
K = 4,
building_idx = building_idx
)
)
Let’s compile the model.
comp_model_NB_hier <- stan_model('stan_programs/hier_NB_regression.stan')
Fit the model to data.
fitted_model_NB_hier <-
sampling(
comp_model_NB_hier,
data = stan_dat_hier,
iter = 4000,
cores = 4, # can run chains in parallel
refresh = 1000 # only report progress every 1000 iterations
)
Warning: There were 727 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: Examine the pairs() plot to diagnose sampling problems
Warning in throw_sampler_warnings(nfits): Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
Warning in throw_sampler_warnings(nfits): Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess
We get a bunch of warnings from Stan about divergent transitions, which is an indication that there may be regions of the posterior that have not been explored by the Markov chains.
Divergences are discussed in more detail in the course slides as well as the bayesplot MCMC diagnostics vignette and A Conceptual Introduction to Hamiltonian Monte Carlo.
In this example we will see that we have divergent transitions because we need to reparameterize our model. That is, we will retain the overall structure of the model, but transform some of the parameters so that it is easier for Stan to sample from the parameter space. Before we go through exactly how to do this reparameterization, we will first look at the diagnostics that indicate reparameterization will resolve the problem.
First we print the fits for the parameters that are of most interest.
print(fitted_model_NB_hier, pars = c('sigma_mu','beta','alpha','phi','mu'))
Inference for Stan model: hier_NB_regression.
4 chains, each with iter=4000; warmup=2000; thin=1;
post-warmup draws per chain=2000, total post-warmup draws=8000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
sigma_mu 0.27 0.01 0.17 0.05 0.15 0.23 0.35 0.68 552 1.01
beta -0.23 0.00 0.06 -0.35 -0.27 -0.23 -0.19 -0.11 752 1.00
alpha 1.27 0.02 0.43 0.42 0.98 1.27 1.55 2.13 775 1.00
phi 1.56 0.01 0.35 1.02 1.32 1.52 1.76 2.40 943 1.01
mu[1] 1.28 0.02 0.55 0.15 0.92 1.30 1.65 2.38 780 1.00
mu[2] 1.24 0.02 0.53 0.19 0.89 1.26 1.60 2.28 763 1.00
mu[3] 1.42 0.02 0.48 0.47 1.10 1.43 1.75 2.39 912 1.00
mu[4] 1.46 0.02 0.48 0.52 1.15 1.46 1.76 2.42 908 1.00
mu[5] 1.08 0.01 0.43 0.24 0.81 1.10 1.36 1.93 1082 1.00
mu[6] 1.17 0.02 0.49 0.21 0.84 1.18 1.51 2.11 617 1.01
mu[7] 1.47 0.02 0.52 0.47 1.12 1.47 1.80 2.51 1058 1.00
mu[8] 1.26 0.01 0.42 0.44 0.97 1.26 1.54 2.10 1004 1.00
mu[9] 1.40 0.02 0.57 0.23 1.03 1.41 1.79 2.49 836 1.00
mu[10] 0.87 0.01 0.37 0.16 0.62 0.88 1.12 1.60 708 1.00
Samples were drawn using NUTS(diag_e) at Mon Jun 17 17:18:13 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
You can see that the effective sample sizes are quite low for many of the parameters relative to the total number of post-warmup draws. This alone isn’t indicative of the need to reparameterize, but it indicates that we should look further at the trace plots and pairs plots. First let’s look at the traceplots to see if the divergent transitions form a pattern.
# change bayesplot color scheme (to make chains very different colors)
color_scheme_set("brewer-Spectral")
mcmc_trace(
# use as.array to keep the markov chains separate for trace plots
as.array(fitted_model_NB_hier,pars = 'sigma_mu'),
np = nuts_params(fitted_model_NB_hier),
window = c(500,1000) # zoom into a window to see more clearly
)
color_scheme_set("blue")
It looks like the divergences (marked in red below the traceplot) correspond to spots where the sampler gets stuck.
# assign to object so we can compare to another plot later
scatter_with_divs <- mcmc_scatter(
as.array(fitted_model_NB_hier),
pars = c("mu[4]", "sigma_mu"),
transform = list("sigma_mu" = "log"),
size = 1,
alpha = 0.3,
np = nuts_params(fitted_model_NB_hier),
np_style = scatter_style_np(div_size = 1.5) # change style of divergences points
)
scatter_with_divs + coord_equal()
What we have here is a cloud-like shape, with most of the divergences clustering towards the bottom. We’ll see a bit later that we actually want this to look more like a funnel than a cloud, but the divergences are indicating that the sampler can’t explore the narrowing neck of the funnel.
One way to see why we should expect some version of a funnel is to look at some simulations from the prior, which we can do without MCMC and thus with no risk of sampling problems:
N_sims <- 1000
log_sigma <- rep(NA, N_sims)
theta <- rep(NA, N_sims)
for (j in 1:N_sims) {
log_sigma[j] <- rnorm(1, mean = 0, sd = 1)
theta[j] <- rnorm(1, mean = 0, sd = exp(log_sigma[j]))
}
draws <- cbind("mu" = theta, "log(sigma_mu)" = log_sigma)
mcmc_scatter(draws)
Of course, if the data is at all informative we shouldn’t expect the posterior to look exactly like the prior. But unless the data is incredibly informative about the parameters and the posterior concentrates away from the narrow neck of the funnel, the sampler is going to have to confront the funnel geometry. (See the bayesplot Visual MCMC Diagnostics vignette for more on this.)
Another way to look at the divergences is via a parallel coordinates plot.
# https://mc-stan.org/bayesplot/reference/MCMC-parcoord.html#plot-descriptions
# this plot is a bit computationally intensive
parcoord_with_divs <- mcmc_parcoord(
as.array(fitted_model_NB_hier, pars = c("sigma_mu", "mu")),
np = nuts_params(fitted_model_NB_hier)
)
parcoord_with_divs
Again, we see evidence that our problems concentrate when \(\texttt{sigma_mu}\) is small.
Instead, we should use the non-centered parameterization for \(\mu_b\). We can reparameterize the random intercept \(\mu_b\), which is distributed
\[ \mu_b \sim \text{Normal}(\alpha + \texttt{building_data} \, \zeta, \sigma_{\mu}) \]
by defining a vector of auxiliary variables \(\texttt{mu_raw}\) in the parameters block, giving it a \(\text{Normal}(0, 1)\) prior in the model block, and making \(\texttt{mu}\) a transformed parameter:
transformed parameters {
vector[J] mu = alpha + building_data * zeta + sigma_mu * mu_raw;
}
This gives \(\texttt{mu}\) a \(\text{Normal}(\alpha + \texttt{building_data}\, \zeta, \sigma_\mu)\) distribution, but it decouples the dependence of the density of each element of \(\texttt{mu}\) from \(\texttt{sigma_mu}\) (\(\sigma_\mu\)). hier_NB_regression_ncp.stan uses the non-centered parameterization for \(\texttt{mu}\). We will examine the effective sample size of the fitted model to see whether we’ve fixed the problem with our reparameterization.
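A sketch of the blocks that change relative to the centered sketch shown earlier (isolated blocks in the same style as the snippet above, not a complete program):
parameters {
  // ... other parameters as before, but mu is replaced by:
  vector[J] mu_raw;
  real<lower=0> sigma_mu;
}
transformed parameters {
  // mu is now a deterministic function of parameters with fixed-scale priors
  vector[J] mu = alpha + building_data * zeta + sigma_mu * mu_raw;
}
model {
  mu_raw ~ normal(0, 1);  // implies mu ~ normal(alpha + building_data * zeta, sigma_mu)
  // ... rest of the model unchanged
}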
comp_model_NB_hier_ncp <- stan_model('stan_programs/hier_NB_regression_ncp.stan')
Fit the model to the data.
fitted_model_NB_hier_ncp <- sampling(comp_model_NB_hier_ncp, data = stan_dat_hier,
chains = 4, cores = 4, refresh = 1000)
You should have no divergences (or only a few) this time and much improved effective sample sizes:
print(fitted_model_NB_hier_ncp, pars = c('sigma_mu','beta','alpha','phi','mu'))
Inference for Stan model: hier_NB_regression_ncp.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
sigma_mu 0.23 0.00 0.17 0.01 0.10 0.19 0.31 0.67 1513 1
beta -0.23 0.00 0.06 -0.35 -0.27 -0.23 -0.19 -0.11 2775 1
alpha 1.25 0.01 0.43 0.41 0.96 1.25 1.53 2.10 2617 1
phi 1.58 0.01 0.35 1.02 1.33 1.54 1.78 2.40 4290 1
mu[1] 1.26 0.01 0.54 0.19 0.91 1.27 1.61 2.36 2547 1
mu[2] 1.21 0.01 0.53 0.18 0.86 1.21 1.56 2.27 2781 1
mu[3] 1.38 0.01 0.48 0.43 1.07 1.39 1.70 2.32 3105 1
mu[4] 1.42 0.01 0.48 0.43 1.10 1.41 1.73 2.39 2954 1
mu[5] 1.07 0.01 0.40 0.28 0.80 1.06 1.34 1.85 3577 1
mu[6] 1.17 0.01 0.48 0.26 0.84 1.18 1.49 2.09 2753 1
mu[7] 1.44 0.01 0.51 0.46 1.10 1.44 1.78 2.47 3233 1
mu[8] 1.23 0.01 0.42 0.39 0.95 1.23 1.51 2.08 3387 1
mu[9] 1.41 0.01 0.56 0.31 1.04 1.42 1.78 2.46 2740 1
mu[10] 0.86 0.01 0.37 0.17 0.60 0.86 1.11 1.60 3483 1
Samples were drawn using NUTS(diag_e) at Mon Jun 17 17:19:11 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Let’s make the same scatter plot we made for the model with many divergences and compare:
scatter_no_divs <- mcmc_scatter(
as.array(fitted_model_NB_hier_ncp),
pars = c("mu[4]", 'sigma_mu'),
transform = list('sigma_mu' = "log"),
size = 1,
alpha = 0.3,
np = nuts_params(fitted_model_NB_hier_ncp)
)
bayesplot_grid(
scatter_with_divs, scatter_no_divs,
grid_args = list(ncol = 2),
ylim = c(-11, 1),
subtitles = c("Centered parameterization",
"Non-centered parameterization")
)
Comparing the plots, we see that the divergences were a sign that there was part of the posterior that the chains were not exploring.
Let’s look at the marginal PPC plot again.
y_rep <- as.matrix(fitted_model_NB_hier_ncp, pars = "y_rep")
ppc_dens_overlay(stan_dat_hier$complaints, y_rep[1:200,])
This looks quite nice. If we’ve captured the building-level means well, then the posterior predictive distribution of mean complaints by building should match well with the observed mean complaints for each building.
ppc_stat_grouped(
y = stan_dat_hier$complaints,
yrep = y_rep,
group = pest_data$building_id,
stat = 'mean',
binwidth = 0.5
)
We weren’t doing terribly with the building-specific means before, but now they are all estimated a bit better by the model. The model is also able to do a decent job estimating within-building variability and, for the most part, the probability of zero complaints:
ppc_stat_grouped(
y = stan_dat_hier$complaints,
yrep = y_rep,
group = pest_data$building_id,
stat = 'sd',
binwidth = 0.5
)
prop_zero <- function(x) mean(x == 0)
ppc_stat(
y = stan_dat_hier$complaints,
yrep = y_rep,
stat = prop_zero,
binwidth = 0.025
)
# plot separately for each building
ppc_stat_grouped(
y = stan_dat_hier$complaints,
yrep = y_rep,
group = pest_data$building_id,
stat = prop_zero,
binwidth = 0.025
)
Predictions by number of traps:
ppc_intervals_grouped(
y = stan_dat_hier$complaints,
yrep = y_rep,
x = stan_dat_hier$traps,
# difference b/t grouping by pest_data$building_id and stan_dat_hier$building_idx
# is that the latter is consecutive from 1 to N_buildings to make things easier
# to code in Stan.
group = stan_dat_hier$building_idx,
jitter = 0.15
) +
labs(x = "Number of traps",
y = "Predicted number of complaints")
ppc_rootogram(stan_dat_hier$complaints, yrep = y_rep)
We have some new data that extends the number of time points for which we have observations for each building. This will let us explore how to expand the model a bit more with varying slopes in addition to the varying intercepts and later also model temporal variation.
stan_dat_hier <- readRDS('data/pest_data_longer_stan_dat.RDS')
Perhaps, if the levels of complaints differ by building, the effect of traps on complaints differs by building too. We can add varying slopes to our model and observe the fit.
\[ \text{complaints}_{b,t} \sim \text{Neg-Binomial}(\lambda_{b,t}, \phi) \\ \lambda_{b,t} = \exp{(\eta_{b,t})}\\ \eta_{b,t} = \mu_b + \kappa_b \, \texttt{traps}_{b,t} + \text{log_sq_foot}_b \\ \mu_b \sim \text{Normal}(\alpha + \texttt{building_data} \, \zeta, \sigma_{\mu}) \\ \kappa_b \sim \text{Normal}(\beta + \texttt{building_data} \, \gamma, \sigma_{\kappa}) \]
Let’s compile the model.
comp_model_NB_hier_slopes <- stan_model('stan_programs/hier_NB_regression_ncp_slopes_mod.stan')
Fit the model to data and extract the posterior draws needed for our posterior predictive checks.
fitted_model_NB_hier_slopes <-
sampling(
comp_model_NB_hier_slopes,
data = stan_dat_hier,
chains = 4,
cores = 4,
control = list(adapt_delta = 0.95)
)
Warning: There were 11 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: Examine the pairs() plot to diagnose sampling problems
Warning in throw_sampler_warnings(nfits): Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
Warning in throw_sampler_warnings(nfits): Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess
To see if the model infers building-to-building differences in these slopes, we can plot a histogram of our marginal posterior distribution for sigma_kappa.
mcmc_hist(
as.matrix(fitted_model_NB_hier_slopes, pars = "sigma_kappa"),
binwidth = 0.005
)
print(fitted_model_NB_hier_slopes, pars = c('kappa','beta','alpha','phi','sigma_mu','sigma_kappa','mu'))
Inference for Stan model: hier_NB_regression_ncp_slopes_mod.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
kappa[1] -0.01 0.00 0.08 -0.14 -0.07 -0.03 0.03 0.18 522 1.01
kappa[2] -0.42 0.00 0.10 -0.64 -0.48 -0.41 -0.35 -0.24 1632 1.00
kappa[3] -0.59 0.00 0.10 -0.79 -0.65 -0.58 -0.52 -0.39 4844 1.00
kappa[4] -0.22 0.00 0.07 -0.37 -0.27 -0.22 -0.18 -0.09 2306 1.00
kappa[5] -0.60 0.00 0.09 -0.78 -0.66 -0.60 -0.53 -0.42 4437 1.00
kappa[6] -0.44 0.00 0.11 -0.67 -0.50 -0.43 -0.37 -0.25 2787 1.00
kappa[7] -0.31 0.00 0.07 -0.44 -0.35 -0.31 -0.26 -0.18 4567 1.00
kappa[8] -0.23 0.00 0.16 -0.57 -0.33 -0.22 -0.13 0.05 1444 1.00
kappa[9] 0.08 0.00 0.06 -0.03 0.04 0.08 0.12 0.20 1252 1.00
kappa[10] -0.71 0.01 0.16 -1.00 -0.82 -0.72 -0.62 -0.33 444 1.01
beta -0.35 0.00 0.06 -0.47 -0.38 -0.35 -0.31 -0.22 1739 1.00
alpha 1.41 0.01 0.31 0.76 1.22 1.42 1.61 1.99 2369 1.00
phi 1.62 0.00 0.19 1.27 1.48 1.60 1.74 2.03 2883 1.00
sigma_mu 0.51 0.02 0.42 0.02 0.18 0.41 0.74 1.54 452 1.01
sigma_kappa 0.14 0.02 0.15 0.03 0.07 0.11 0.17 0.45 89 1.04
mu[1] 0.25 0.03 0.77 -1.58 -0.13 0.37 0.77 1.48 490 1.01
mu[2] 1.66 0.01 0.54 0.71 1.30 1.62 2.00 2.86 1525 1.00
mu[3] 2.12 0.00 0.32 1.51 1.90 2.12 2.33 2.77 4909 1.00
mu[4] 1.51 0.01 0.53 0.48 1.17 1.50 1.85 2.58 2321 1.00
mu[5] 2.38 0.01 0.42 1.58 2.09 2.37 2.66 3.23 4922 1.00
mu[6] 1.90 0.01 0.40 1.19 1.64 1.87 2.13 2.79 2862 1.00
mu[7] 2.68 0.00 0.26 2.21 2.51 2.68 2.85 3.21 4269 1.00
mu[8] -0.52 0.03 0.98 -2.39 -1.15 -0.57 0.08 1.61 1505 1.00
mu[9] 0.21 0.02 0.59 -0.96 -0.17 0.23 0.60 1.35 1305 1.00
mu[10] 1.77 0.06 1.16 -1.35 1.19 1.96 2.54 3.56 340 1.01
Samples were drawn using NUTS(diag_e) at Mon Jun 17 17:20:38 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
While the model can’t specifically rule out zero from the posterior, it does have mass at small non-zero numbers, so we will leave in the hierarchy over \(\texttt{kappa}\). Plotting the marginal data density again, we can see the model still looks well calibrated.
y_rep <- as.matrix(fitted_model_NB_hier_slopes, pars = "y_rep")
ppc_dens_overlay(
y = stan_dat_hier$complaints,
yrep = y_rep[1:200,]
)
Check quantiles of predictions:
q90 <- function(x) quantile(x, probs = 0.9)
ppc_stat(
y = stan_dat_hier$complaints,
yrep = y_rep,
stat = "q90"
)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We haven’t looked at how cockroach complaints change over time. Let’s look at whether there’s any pattern over time.
select_vec <- which(stan_dat_hier$mo_idx %in% 1:12)
ppc_stat_grouped(
y = stan_dat_hier$complaints[select_vec],
yrep = y_rep[,select_vec],
group = stan_dat_hier$mo_idx[select_vec],
stat = 'mean'
) + xlim(0, 11)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We might augment our model with a log-additive monthly effect, \(\texttt{mo}_t\).
\[ \eta_{b,t} = \mu_b + \kappa_b \, \texttt{traps}_{b,t} + \texttt{mo}_t + \text{log_sq_foot}_b \]
We have complete freedom over how to specify the prior for \(\texttt{mo}_t\). There are several competing factors for how the number of complaints might change over time. It makes sense that there might be more roaches in the environment during the summer, but we might also expect that there is more roach control in the summer as well. Given that we’re modeling complaints, maybe after the first sighting of roaches in a building, residents are more vigilant, and thus complaints of roaches would increase.
This can be a motivation for using an autoregressive prior for our monthly effects. The model is as follows:
\[ \texttt{mo}_t \sim \text{Normal}(\rho \, \texttt{mo}_{t-1}, \sigma_\texttt{mo}) \\ \equiv \\ \texttt{mo}_t = \rho \, \texttt{mo}_{t-1} +\epsilon_t , \quad \epsilon_t \sim \text{Normal}(0, \sigma_\texttt{mo}) \\ \quad \rho \in [-1,1] \]
This equation says that the monthly effect in month \(t\) is directly related to the last month’s monthly effect. Given the description of the process above, it seems like there could be either positive or negative associations between the months, but there should be a bit more weight placed on positive \(\rho\)s, so we’ll put an informative prior that pushes the parameter \(\rho\) towards 0.5.
Before we write our prior, however, we have a problem: Stan doesn’t implement any densities that have support on \([-1,1]\). We can get around this by transforming a raw variable defined on \([0,1]\) into one on \([-1,1]\). Specifically,
\[ \rho_{\text{raw}} \in [0, 1] \\ \rho = 2 \times \rho_{\text{raw}} - 1 \]
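In Stan this transformation takes only a couple of lines; a sketch of how the program might encode it (names assumed):
parameters {
  real<lower=0,upper=1> rho_raw;  // raw autoregressive parameter on [0, 1]
}
transformed parameters {
  real rho = 2.0 * rho_raw - 1;   // maps [0, 1] onto [-1, 1]
}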
Then we can use a beta prior on \(\rho_\text{raw}\) to push our estimate of \(\rho\) towards 0.5.
For example a rho_raw ~ beta(10, 5)
prior would look like:
suppressPackageStartupMessages(library(ggformula))
gf_dist("beta", shape1 = 10, shape2 = 5) + xlab("rho_raw")
One further wrinkle is that we have a prior for \(\texttt{mo}_t\) that depends on \(\texttt{mo}_{t-1}\). That is, we are working with the conditional distribution of \(\texttt{mo}_t\) given \(\texttt{mo}_{t-1}\). But what should we do about the prior for \(\texttt{mo}_1\), for which we don’t have a previous time period in the data?
We need to work out the marginal distribution of the first observation. Thankfully we can use the fact that AR models are stationary, so \(\text{Var}(\texttt{mo}_t) = \text{Var}(\texttt{mo}_{t-1})\) and \(\mathbb{E}(\texttt{mo}_t) = \mathbb{E}(\texttt{mo}_{t-1})\) for all \(t\). Therefore the marginal distribution of \(\texttt{mo}_1\) is the same as the marginal distribution of any \(\texttt{mo}_t\).
First we derive the marginal variance of \(\texttt{mo}_{t}\).
\[ \text{Var}(\texttt{mo}_t) = \text{Var}(\rho \texttt{mo}_{t-1} + \epsilon_t) \\ \text{Var}(\texttt{mo}_t) = \text{Var}(\rho \texttt{mo}_{t-1}) + \text{Var}(\epsilon_t) \] where the second line holds by the independence of \(\epsilon_t\) and \(\texttt{mo}_{t-1}\). Then, using the fact that \(\text{Var}(cX) = c^2 \, \text{Var}(X)\) for a constant \(c\) and the fact that, by stationarity, \(\textrm{Var}(\texttt{mo}_{t-1}) = \textrm{Var}(\texttt{mo}_{t})\), we then obtain:
\[ \text{Var}(\texttt{mo}_t)= \rho^2 \text{Var}( \texttt{mo}_{t}) + \sigma_\texttt{mo}^2 \\ \text{Var}(\texttt{mo}_t) = \frac{\sigma_\texttt{mo}^2}{1 - \rho^2} \]
For the mean of \(\texttt{mo}_t\) things are a bit simpler:
\[ \mathbb{E}(\texttt{mo}_t) = \mathbb{E}(\rho \, \texttt{mo}_{t-1} + \epsilon_t) \\ \mathbb{E}(\texttt{mo}_t) = \mathbb{E}(\rho \, \texttt{mo}_{t-1}) + \mathbb{E}(\epsilon_t) \\ \]
Since \(\mathbb{E}(\epsilon_t) = 0\) by assumption we have
\[ \mathbb{E}(\texttt{mo}_t) = \mathbb{E}(\rho \, \texttt{mo}_{t-1}) + 0\\ \mathbb{E}(\texttt{mo}_t) = \rho \, \mathbb{E}(\texttt{mo}_{t}) \\ \mathbb{E}(\texttt{mo}_t) - \rho \mathbb{E}(\texttt{mo}_t) = 0 \\ \mathbb{E}(\texttt{mo}_t) = 0/(1 - \rho) \]
which for \(\rho \neq 1\) yields \(\mathbb{E}(\texttt{mo}_{t}) = 0\).
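As a quick numerical sanity check of these derivations, we can simulate a long AR(1) series (a sketch using stats::arima.sim with illustrative values \(\rho = 0.7\) and \(\sigma_\texttt{mo} = 1\)) and compare the empirical moments to the theoretical ones:
rho <- 0.7
sigma_mo <- 1
mo_sim <- as.numeric(arima.sim(list(ar = rho), n = 1e5, sd = sigma_mo))
round(c(mean = mean(mo_sim),
        var = var(mo_sim),
        theoretical_var = sigma_mo^2 / (1 - rho^2)), 2)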
We now have the marginal distribution for \(\texttt{mo}_{t}\), which, in our case, we will use for \(\texttt{mo}_1\). The full AR(1) specification is then:
\[ \texttt{mo}_1 \sim \text{Normal}\left(0, \frac{\sigma_\texttt{mo}}{\sqrt{1 - \rho^2}}\right) \\ \texttt{mo}_t \sim \text{Normal}\left(\rho \, \texttt{mo}_{t-1}, \sigma_\texttt{mo}\right) \quad \forall \, t > 1 \]
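Before fitting, here is a minimal sketch of how this AR(1) prior could be written in Stan, using a centered parameterization for readability (the program we actually fit below, hier_NB_regression_ncp_slopes_mod_mos.stan, uses a non-centered variant; the normal(0, 1) prior on \(\sigma_\texttt{mo}\) matches the formal specification given later):
parameters {
  vector[M] mo;
  real<lower=0,upper=1> rho_raw;
  real<lower=0> sigma_mo;
}
model {
  real rho = 2 * rho_raw - 1;                       // transform to [-1, 1]
  rho_raw ~ beta(10, 5);                            // favors positive rho
  sigma_mo ~ normal(0, 1);
  mo[1] ~ normal(0, sigma_mo / sqrt(1 - rho^2));    // stationary marginal
  mo[2:M] ~ normal(rho * mo[1:(M - 1)], sigma_mo);  // AR(1) transitions
}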
comp_model_NB_hier_mos <- stan_model('stan_programs/hier_NB_regression_ncp_slopes_mod_mos.stan')
fitted_model_NB_hier_mos <-
sampling(
comp_model_NB_hier_mos,
data = stan_dat_hier,
chains = 4,
cores = 4,
control = list(adapt_delta = 0.95)
)
In the interest of brevity, we won’t go on expanding the model, though we certainly could. What other information would help us understand the data generating process better? What other aspects of the data generating process might we want to capture that we’re not capturing now?
As usual, we run through our posterior predictive checks.
y_rep <- as.matrix(fitted_model_NB_hier_mos, pars = "y_rep")
ppc_dens_overlay(
y = stan_dat_hier$complaints,
yrep = y_rep[1:200,]
)
select_vec <- which(stan_dat_hier$mo_idx %in% 1:12)
ppc_stat_grouped(
y = stan_dat_hier$complaints[select_vec],
yrep = y_rep[,select_vec],
group = stan_dat_hier$mo_idx[select_vec],
stat = 'mean'
)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As we can see, our monthly random intercept has captured a monthly pattern across all the buildings. We can also compare the prior and posterior for the autoregressive parameter to see how much we’ve learned. Here are two different ways of comparing the prior and posterior visually:
# 1) compare draws from prior and draws from posterior
rho_draws <- cbind(
2 * rbeta(4000, 10, 5) - 1, # draw from prior
as.matrix(fitted_model_NB_hier_mos, pars = "rho")
)
colnames(rho_draws) <- c("prior", "posterior")
mcmc_hist(rho_draws, freq = FALSE, binwidth = 0.025,
facet_args = list(nrow = 2)) + xlim(-1, 1)
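A second option (one possibility, using bayesplot’s mcmc_areas on the same draws) is to overlay the prior and posterior as density areas in a single panel:
# 2) overlay prior and posterior densities in one plot
mcmc_areas(rho_draws, prob = 0.9) + xlim(-1, 1)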
print(fitted_model_NB_hier_mos, pars = c('rho','sigma_mu','sigma_kappa','gamma'))
Inference for Stan model: hier_NB_regression_ncp_slopes_mod_mos.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
rho 0.78 0.00 0.08 0.60 0.73 0.78 0.83 0.91 1522 1
sigma_mu 0.30 0.01 0.23 0.01 0.12 0.26 0.43 0.85 1388 1
sigma_kappa 0.09 0.00 0.05 0.01 0.05 0.08 0.11 0.21 1178 1
gamma[1] -0.18 0.00 0.11 -0.39 -0.25 -0.18 -0.12 0.04 2020 1
gamma[2] 0.11 0.00 0.08 -0.03 0.07 0.11 0.16 0.28 1724 1
gamma[3] 0.10 0.00 0.14 -0.20 0.02 0.10 0.19 0.39 2212 1
gamma[4] 0.00 0.00 0.06 -0.12 -0.04 0.00 0.04 0.12 1945 1
Samples were drawn using NUTS(diag_e) at Mon Jun 17 17:22:12 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
ppc_intervals(
y = stan_dat_hier$complaints,
yrep = y_rep,
x = stan_dat_hier$traps,
jitter = 0.2
) +
labs(x = "Number of traps",
y = "Number of complaints")
It looks as if our model finally generates a reasonable posterior predictive distribution for all numbers of traps, and appropriately captures the tails of the data generating process.
Our model seems to be fitting well, so now we will go ahead and use the model to help us make a decision about how many traps to put in our buildings. We’ll make a forecast for 12 months forward, to match the annual costs we compute below.
comp_rev <- stan_model('stan_programs/hier_NB_regression_ncp_slopes_mod_mos_predict.stan')
An important input to the revenue model is how much revenue is lost due to each complaint. The client has a policy that for every 10 complaints they’ll call an exterminator, costing the client $100, which amounts to $10 per complaint.
rev_model <- sampling(
comp_rev,
data = stan_dat_hier,
cores = 4,
control = list(adapt_delta = 0.9)
)
Warning: There were 5 divergent transitions after warmup. Increasing adapt_delta above 0.9 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: Examine the pairs() plot to diagnose sampling problems
Below we’ve generated our revenue curves for the buildings. These curves quantify our uncertainty around the revenue projections at each number of traps for each building.
A key input to our analysis will be the cost of installing bait stations. We’re simulating the number of complaints we receive over the course of a year, so we need to understand the cost associated with maintaining each bait station over the course of a year. There’s the cost attributed to the raw bait station, which is the plastic housing and the bait material, a peanut-buttery substance that’s injected with insecticide. The cost of maintaining one bait station for a year plus monthly replenishment of the bait material is about $20.
N_traps <- 20
costs <- 20 * (0:N_traps)  # $20 per bait station per year (housing plus bait)
We’ll also need labor for maintaining the bait stations, which need to be serviced every two months. If there are fewer than five traps, our in-house maintenance staff can manage the stations (about one hour of work every two months at $20/hour), but above five traps we need to hire outside pest control to help out. They’re a bit more expensive, so we’ve put their cost at $30/hour. Each additional five traps requires an extra person-hour of work, so that’s factored in as well; the marginal person-hours above five traps are charged at the higher pest-control labor rate.
N_months_forward <- 12
N_months_labor <- N_months_forward / 2   # stations serviced every two months
hourly_rate_low <- 20    # in-house maintenance staff
hourly_rate_high <- 30   # outside pest control
# labor cost in blocks of five traps: in-house only below five traps,
# plus one contractor person-hour per additional block of five
costs <- costs +
  (0:N_traps < 5 & 0:N_traps > 0) * (N_months_labor * hourly_rate_low) +
  (0:N_traps >= 5 & 0:N_traps < 10) * (N_months_labor * (hourly_rate_low + 1 * hourly_rate_high)) +
  (0:N_traps >= 10 & 0:N_traps < 15) * (N_months_labor * (hourly_rate_low + 2 * hourly_rate_high)) +
  (0:N_traps >= 15) * (N_months_labor * (hourly_rate_low + 3 * hourly_rate_high))
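To sanity check the schedule, we can print the total annual cost at each trap count:
data.frame(traps = 0:N_traps, annual_cost = costs)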
We can now plot curves with number of traps on the x-axis and profit/loss forecasts and uncertainty intervals on the y-axis. When we do this we see that the optimal number of bait stations differs by building.
# extract as a list instead of matrix for convenience below
samps_rev <- rstan::extract(rev_model)
# get draws of total profits (in this case negative, so losses)
total_profits <- sweep(samps_rev$rev_pred, 3, STATS = costs, FUN = '-')
# calculate medians and central intervals (for example 80% intervals)
median_profit <- t(apply(total_profits, c(2, 3), median))
lower_profit <- t(apply(total_profits, c(2, 3), quantile, 0.1))
upper_profit <- t(apply(total_profits, c(2, 3), quantile, 0.9))
buildings <- seq(1, ncol(median_profit))
traps <- seq(0, nrow(median_profit) - 1)
profit_df <-
data.frame(
profit = as.vector(median_profit),
lower = as.vector(lower_profit),
upper = as.vector(upper_profit),
traps = rep(traps, times = max(buildings)),
building = rep(buildings, each = max(traps) + 1)
)
p <- ggplot(data = profit_df, aes(x = traps, y = profit)) +
geom_ribbon(aes(ymin = lower, ymax = upper), fill = "skyblue") +
geom_line()
# p + facet_wrap(~ building, scales = 'free', ncol = 2)
p + facet_wrap(~ building, scales = 'fixed', ncol = 3)
Left as an exercise for the reader (unless we have time):
How would we build a revenue curve for a new building?
Let’s say our utility function is revenue. If we wanted to maximize expected revenue, we could take expectations at each trap count for each building and choose the number of traps that maximizes expected revenue. Call this the maximum revenue strategy. How can we generate the distribution of portfolio revenue (i.e., the sum of revenue across all the buildings) under the maximum revenue strategy from the draws of rev_pred we already have? One possible approach is sketched below.
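Here is one possible sketch of the portfolio question (not the only approach), reusing the total_profits array (revenue net of costs) computed above: pick the expected-profit-maximizing trap count for each building, then sum the per-draw profits across buildings at those counts:
# expected profit per building at each trap count (buildings x trap counts)
exp_profit <- apply(total_profits, c(2, 3), mean)
# index of the profit-maximizing trap count for each building
best_idx <- apply(exp_profit, 1, which.max)
# per-draw portfolio profit under the maximum revenue strategy
n_buildings <- dim(total_profits)[2]
portfolio_draws <- rowSums(
  sapply(seq_len(n_buildings), function(b) total_profits[, b, best_idx[b]])
)
quantile(portfolio_draws, probs = c(0.1, 0.5, 0.9))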
We can derive the joint distribution for the AR(1) process before we move to the Gaussian process (GP), which will give us a little more insight into what a GP is. Remember that we’ve specified the AR(1) prior as:
\[ \begin{aligned} \texttt{mo}_1 & \sim \text{Normal}\left(0, \frac{\sigma_\texttt{mo}}{\sqrt{1 - \rho^2}}\right) \\ \texttt{mo}_t & \sim \text{Normal}\left(\rho \, \texttt{mo}_{t-1}, \sigma_\texttt{mo}\right) \quad \forall \, t > 1 \end{aligned} \] Rewriting the process in terms of its errors makes the derivation of the joint distribution clearer:
\[ \begin{aligned} \texttt{mo}_1 & \sim \text{Normal}\left(0, \frac{\sigma_\texttt{mo}}{\sqrt{1 - \rho^2}}\right) \\ \texttt{mo}_t & = \rho \, \texttt{mo}_{t-1} + \sigma_\texttt{mo}\epsilon_t \\ \epsilon_t & \sim \text{Normal}\left(0, 1\right) \end{aligned} \]
Given that our first term \(\texttt{mo}_1\) is normally distributed, and subsequent terms are sums of normal random variables, the vector mo is jointly multivariate normal.
More formally, if we have a vector \(x \in \mathbb{R}^M\) which is multivariate normal, \(x \sim \text{MultiNormal}(0, \Sigma)\), and we left-multiply \(x\) by a nonsingular matrix \(L \in \mathbb{R}^{M\times M}\), then \(y = Lx \sim \text{MultiNormal}(0, L\Sigma L^T)\). We can use this fact to show that our vector mo is jointly multivariate normal.
Just as before with the noncentered parameterization, we’ll take a vector \(\texttt{mo_raw} \in \mathbb{R}^M\) (in which each element is univariate \(\text{Normal}(0,1)\)) and transform it into mo. But instead of doing this with scalar transformations like we did previously (e.g., in the section Time varying effects and structured priors) we will now do it with linear algebra operations.
The trick is that by specifying each element of \(\texttt{mo_raw}\) to be distributed \(\text{Normal}(0,1)\) we are implicitly defining \(\texttt{mo_raw} \sim \text{MultiNormal}(0, I_M)\), where \(I_M\) is the \(M \times M\) identity matrix. Then we do a linear transformation using a matrix \(L\) and assign the result to mo, i.e., \(\texttt{mo} = L\times\texttt{mo_raw}\). This gives us \(\texttt{mo} \sim \text{MultiNormal}(0, LI_M L^T)\) and \(LI_M L^T = LL^T\).
Consider the case where we have three elements in mo and we want to figure out the form of \(L\).
The first element of mo is fairly straightforward, because it mirrors our earlier parameterization of the AR(1) prior. The only difference is that we’re explicitly adding the last two terms of mo_raw into the equation so we can use matrix algebra for our transformation.
\[ \texttt{mo}_1 = \frac{\sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1 + 0 \times \texttt{mo_raw}_2 + 0 \times \texttt{mo_raw}_3 \]
The second element is a bit more complicated:
\[ \begin{aligned} \texttt{mo}_2 & = \rho \texttt{mo}_1 + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_2 + 0 \times \texttt{mo_raw}_3 \\ & = \rho \left(\frac{\sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1\right) + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_2 + 0 \times \texttt{mo_raw}_3 \\[5pt] & = \frac{\rho \sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1 + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_2 + 0 \times \texttt{mo_raw}_3 \\ \end{aligned} \]
The third element involves all three terms: \[ \begin{aligned} \texttt{mo}_3 & = \rho \, \texttt{mo}_2 + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_3 \\ & = \rho \left(\frac{\rho \sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1 + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_2\right) + \sigma_{\texttt{mo}} \texttt{mo_raw}_3 \\[5pt] & = \frac{\rho^2 \sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1 + \rho \, \sigma_{\texttt{mo}}\,\texttt{mo_raw}_2 + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_3 \\ \end{aligned} \]
Writing this all together:
\[
\begin{aligned}
\texttt{mo}_1 & = \frac{\sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1 + 0 \times \texttt{mo_raw}_2 + 0 \times \texttt{mo_raw}_3\\[3pt]
\texttt{mo}_2 & = \frac{\rho \sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1 + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_2 + 0 \times \texttt{mo_raw}_3 \\[3pt]
\texttt{mo}_3 & = \frac{\rho^2 \sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1 + \rho \, \sigma_{\texttt{mo}}\,\texttt{mo_raw}_2 + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_3 \\
\end{aligned}
\] Separating this into a matrix of coefficients \(L\) and the vector mo_raw:
\[ \texttt{mo} = \begin{bmatrix} \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & 0 & 0 \\ \rho \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & \sigma_\texttt{mo} & 0 \\ \rho^2 \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & \rho \,\sigma_\texttt{mo} & \sigma_\texttt{mo} \end{bmatrix} \times \texttt{mo_raw} \]
If we multiply \(L\) on the right by its transpose \(L^T\), we’ll get expressions for the covariance matrix of our multivariate random vector mo:
\[ \begin{bmatrix} \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & 0 & 0 \\ \rho \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & \sigma_\texttt{mo} & 0 \\ \rho^2 \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & \rho \,\sigma_\texttt{mo} & \sigma_\texttt{mo} \end{bmatrix} \times \begin{bmatrix} \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & \rho \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & \rho^2 \sigma_\texttt{mo} / \sqrt{1 - \rho^2} \\ 0 & \sigma_\texttt{mo} & \rho \,\sigma_\texttt{mo} \\ 0 & 0 & \sigma_\texttt{mo} \end{bmatrix} \]
which results in:
\[ \begin{bmatrix} \sigma^2_\texttt{mo} / (1 - \rho^2) & \rho \, \sigma^2_\texttt{mo} / (1 - \rho^2) & \rho^2 \, \sigma^2_\texttt{mo} / (1 - \rho^2)\\ \rho \, \sigma^2_\texttt{mo} / (1 - \rho^2) & \sigma^2_\texttt{mo} / (1 - \rho^2) & \rho \, \sigma^2_\texttt{mo} / (1 - \rho^2) \\ \rho^2 \sigma^2_\texttt{mo} / (1 - \rho^2) & \rho \, \sigma^2_\texttt{mo} / (1 - \rho^2) & \sigma^2_\texttt{mo} / (1 - \rho^2) \end{bmatrix} \] We can simplify this result by dividing the matrix by \(\sigma^2_\texttt{mo} / (1 - \rho^2)\) to get
\[ \begin{bmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & \rho \\ \rho^2 & \rho & 1 \end{bmatrix} \]
This should generalize to higher dimensions pretty easily.
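As a quick numerical check of the \(M = 3\) case (an R sketch with arbitrary illustrative values of \(\rho\) and \(\sigma_\texttt{mo}\)), we can build \(L\) and verify that \(LL^T\), rescaled by \(\sigma^2_\texttt{mo}/(1-\rho^2)\), reproduces the correlation matrix above:
rho <- 0.6
sigma_mo <- 0.5
L <- matrix(c(sigma_mo / sqrt(1 - rho^2),         0,              0,
              rho * sigma_mo / sqrt(1 - rho^2),   sigma_mo,       0,
              rho^2 * sigma_mo / sqrt(1 - rho^2), rho * sigma_mo, sigma_mo),
            nrow = 3, byrow = TRUE)
round((L %*% t(L)) / (sigma_mo^2 / (1 - rho^2)), 10)  # should be [1 rho rho^2; ...]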
We can replace the Stan code defining mo in stan_programs/hier_NB_regression_ncp_slopes_mod_mos.stan with the following:
vector[M] mo;
{
  matrix[M,M] A = rep_matrix(0, M, M);
  // first column: coefficients on mo_raw[1], scaled by the stationary sd
  A[1,1] = sigma_mo / sqrt(1 - rho^2);
  for (m in 2:M) {
    A[m,1] = rho^(m-1) * sigma_mo / sqrt(1 - rho^2);
  }
  // remaining lower-triangular columns: coefficients on mo_raw[2], ..., mo_raw[M]
  for (m in 2:M) {
    A[m,m] = sigma_mo;
    for (i in (m + 1):M) {
      A[i,m] = rho^(i-m) * sigma_mo;
    }
  }
  mo = A * mo_raw;
}
If we only knew the covariance matrix of our process, say a matrix called \(\Sigma\), and we had a way of decomposing \(\Sigma\) into \(L L^T\), we wouldn’t need to write out the equation for the vector. Luckily, there is a matrix decomposition called the Cholesky decomposition that does just that. The Stan function for the decomposition is called cholesky_decompose. Instead of writing out the explicit equation we can just do the following:
vector[M] mo;
{
mo = cholesky_decompose(Sigma) * mo_raw;
}
provided we’ve defined Sigma appropriately elsewhere in the transformed parameters block. Note that the matrix \(L\) is lower-triangular (i.e., all elements above the diagonal are zero).
We’ve already derived the covariance matrix Sigma for the three-dimensional AR(1) process above by explicitly calculating \(L L^T\), but we can also derive it directly from the rules of covariance and the definition of our process. We already know that each element \(\texttt{mo}_t\) has marginal variance \(\sigma^2_\texttt{mo} / (1- \rho^2)\), but we don’t yet know the covariance of \(\texttt{mo}_t\) and \(\texttt{mo}_{t+h}\). We can compute it recursively. First we derive the covariance for two elements separated by one month:
\[ \text{Cov}(\texttt{mo}_{t+1},\texttt{mo}_{t}) = \text{Cov}(\rho \, \texttt{mo}_{t} + \sigma_\texttt{mo}\epsilon_{t+1},\texttt{mo}_{t}) \\ \text{Cov}(\texttt{mo}_{t+1},\texttt{mo}_{t}) = \rho \text{Cov}(\texttt{mo}_{t},\texttt{mo}_{t}) + \sigma_\texttt{mo}\text{Cov}(\epsilon_{t+1},\texttt{mo}_{t}) \\ \text{Cov}(\texttt{mo}_{t+1},\texttt{mo}_{t}) = \rho \text{Var}(\texttt{mo}_t) + 0 \]
Then we define the covariance \(\text{Cov}(\texttt{mo}_{t+h},\texttt{mo}_{t})\) in terms of \(\text{Cov}(\texttt{mo}_{t+h-1},\texttt{mo}_{t})\):
\[ \text{Cov}(\texttt{mo}_{t+h},\texttt{mo}_{t}) = \text{Cov}(\rho \, \texttt{mo}_{t+h-1} + \sigma_\texttt{mo}\epsilon_{t+h},\texttt{mo}_{t}) \\ \text{Cov}(\texttt{mo}_{t+h},\texttt{mo}_{t}) = \rho \, \text{Cov}(\texttt{mo}_{t+h-1},\texttt{mo}_{t}) \\ \] which we can use recursively to get the covariance we need:
\[ \text{Cov}(\texttt{mo}_{t+h},\texttt{mo}_{t}) = \rho \, \text{Cov}(\texttt{mo}_{t+h-1},\texttt{mo}_{t}) \\ \text{Cov}(\texttt{mo}_{t+h},\texttt{mo}_{t}) = \rho \,( \rho \, \text{Cov}(\texttt{mo}_{t+h-2},\texttt{mo}_{t}) )\\ \dots \\ \text{Cov}(\texttt{mo}_{t+h},\texttt{mo}_{t}) = \rho^h \, \text{Cov}(\texttt{mo}_{t},\texttt{mo}_{t}) \\ \text{Cov}(\texttt{mo}_{t+h},\texttt{mo}_{t}) = \rho^h \, \sigma_\texttt{mo}^2/(1 - \rho^2) \\ \]
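Again we can sanity check the lag-\(h\) result by simulation (a sketch with illustrative values \(\rho = 0.7\), \(\sigma_\texttt{mo} = 1\), and \(h = 3\)):
rho <- 0.7
sigma_mo <- 1
h <- 3
mo_sim <- as.numeric(arima.sim(list(ar = rho), n = 1e5, sd = sigma_mo))
n <- length(mo_sim)
round(c(simulated = cov(mo_sim[(h + 1):n], mo_sim[1:(n - h)]),
        theoretical = rho^h * sigma_mo^2 / (1 - rho^2)), 2)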
One way to write this in Stan code is:
vector[M] mo;
{
  matrix[M,M] Sigma;
  // build the correlation matrix, with entries rho^|i - m|
  for (m in 1:M) {
    Sigma[m,m] = 1.0;
    for (i in (m + 1):M) {
      Sigma[i,m] = rho^(i - m);
      Sigma[m,i] = Sigma[i,m];  // symmetric
    }
  }
  // scale by the stationary marginal variance
  Sigma = Sigma * sigma_mo^2 / (1 - rho^2);
  mo = cholesky_decompose(Sigma) * mo_raw;
}
The prior we defined for mo is, strictly speaking, a Gaussian process: a stochastic process whose values are jointly multivariate normal for any finite value of \(M\). Formally, we could write the above prior for mo like so:
\[ \begin{aligned} \sigma_\texttt{mo} & \sim \text{Normal}(0, 1) \\ \rho & \sim \text{GenBeta}(-1,1,10, 5) \\ \texttt{mo}_t & \sim \text{GP}\left( 0, K(t | \sigma_\texttt{mo},\rho) \right) \\ \end{aligned} \] The notation \(K(t | \sigma_\texttt{mo},\rho)\) defines the covariance matrix of the process over the domain \(t\), which is months.
In other words:
\[ \text{Cov}(\texttt{mo}_t,\texttt{mo}_{t+h}) = k(t, t+h | \sigma_\texttt{mo}, \rho) \]
We’ve already derived the covariance for our process. What if we want to use a different definition of Sigma?
As the above example shows, defining a proper covariance matrix yields a proper multivariate normal prior on a parameter, so we need a way of defining proper covariance matrices, i.e., symmetric positive definite matrices. It turns out there is a class of functions, called kernel functions, that define proper covariance matrices. These functions are applied to each pair of points in the domain to construct a covariance matrix \(K\):
\[ K_{[t,t+h]} = k(t, t+h | \theta) \] where \(\theta\) are the hyperparameters that define the behavior of the covariance matrix.
One such function is called the exponentiated quadratic function, and it happens to be implemented in Stan as cov_exp_quad. The function is defined as:
\[ \begin{aligned} k(t, t+h | \theta) & = \alpha^2 \exp \left( - \dfrac{1}{2\ell^2} ((t+h) - t)^2 \right) \\ & = \alpha^2 \exp \left( - \dfrac{h^2}{2\ell^2} \right) \end{aligned} \]
The exponentiated quadratic kernel has two hyperparameters in \(\theta\): \(\alpha\), the marginal standard deviation of the stochastic process \(f\), and \(\ell\), the process length-scale.
The length-scale defines how quickly the covariance decays between time points, with large values of \(\ell\) yielding a covariance that decays slowly and small values of \(\ell\) yielding a covariance that decays rapidly. It can be interpreted as a measure of how nonlinear the mo process is in time.
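To build intuition (a sketch, with arbitrary example length-scales), we can plot how the implied correlation between two months decays with their separation \(h\) for a few values of \(\ell\), holding \(\alpha = 1\):
h <- seq(0, 12, by = 0.1)
kernel_df <- expand.grid(h = h, ell = c(1, 3, 6))
kernel_df$k <- exp(-kernel_df$h^2 / (2 * kernel_df$ell^2))
ggplot(kernel_df, aes(x = h, y = k, color = factor(ell))) +
  geom_line() +
  labs(x = "separation in months (h)", y = "correlation", color = "length-scale")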
The marginal standard deviation defines how large the fluctuations are on the output side, which in our case is the number of roach complaints per month across all buildings. It can be seen as a scale parameter akin to the scale parameter for our building-level hierarchical intercept, though it now defines the scale of the monthly deviations.
This kernel’s defining quality is its smoothness; the function is infinitely differentiable. That will present problems for our example, but if we add some noise to the diagonal of our covariance matrix, the model will fit well.
\[ k(t, t+h | \theta) = \alpha^2 \exp \left( - \dfrac{h^2}{2\ell^2} \right) + \begin{cases} \sigma^2_\texttt{noise} & \text{if } h = 0 \\ 0 & \text{otherwise} \end{cases} \]
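A minimal sketch of how this could look in a Stan transformed parameters block, assuming the months are available as a real array time_points and using sigma_noise as a hypothetical name for the noise scale (the fitted program below instead separates the noise out into the mo_noise term plotted later):
vector[M] mo;
{
  // exponentiated quadratic covariance over the month indices
  matrix[M,M] Sigma = cov_exp_quad(time_points, sigma_gp, gp_len);
  // noise on the diagonal keeps Sigma well-conditioned for the Cholesky factor
  for (m in 1:M)
    Sigma[m,m] = Sigma[m,m] + square(sigma_noise);
  mo = cholesky_decompose(Sigma) * mo_raw;
}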
comp_model_NB_hier_gp <- stan_model('stan_programs/hier_NB_regression_ncp_slopes_mod_mos_gp.stan')
fitted_model_NB_hier_gp <-
  sampling(
    comp_model_NB_hier_gp,
    data = stan_dat_hier,
    chains = 4,
    cores = 4,
    control = list(adapt_delta = 0.9)
  )
samps_gp <- rstan::extract(fitted_model_NB_hier_gp)
Let’s look at the prior vs. posterior for the GP length scale parameter:
length_scale_draws <- cbind(
prior = rgamma(4000, 10, 2),
posterior = samps_gp$gp_len
)
mcmc_areas(length_scale_draws)
From the plot above it looks like we learned only a small amount; however, we can see a bigger difference between the prior and posterior if we consider how much we learned about the ratio of sigma_gp to the length scale gp_len:
noise_to_length_scale_ratio_draws <- cbind(
prior = abs(rnorm(4000)) / rgamma(4000, 10, 2),
posterior = samps_gp$sigma_gp / samps_gp$gp_len
)
mcmc_areas(noise_to_length_scale_ratio_draws)
This is a classic problem with Gaussian processes. Marginally, the length-scale parameter isn’t very well-identified by the data, but jointly the length-scale and the marginal standard deviation are well-identified.
Now let’s compare the estimates of the time-varying parameters under the AR(1) and GP priors. In this case the posterior mean of the time trend is essentially the same for the AR(1) and GP priors, but the 50% uncertainty intervals are narrower for the AR(1):
# visualizing 50% intervals
mo_ar_intervals <- mcmc_intervals_data(as.matrix(fitted_model_NB_hier_mos, pars = "mo"), prob = 0.5)
mo_gp_intervals <- mcmc_intervals_data(as.matrix(fitted_model_NB_hier_gp, pars = "gp"), prob = 0.5)
plot_data <- bind_rows(mo_ar_intervals, mo_gp_intervals)
plot_data$prior <- factor(rep(c("AR1", "GP"), each = 36), levels = c("GP", "AR1"))
plot_data$time <- rep(1:36, times = 2)
ggplot(plot_data, aes(x = time, y = m, ymin = l, ymax = h, fill = prior)) +
geom_ribbon(alpha = 2/3)
The way we coded the GP also lets us plot a decomposition of the GP into a monthly noise component (mo_noise in the Stan code) and the underlying smoothly varying trend (gp_exp_quad in the Stan code):
# visualizing 50% intervals
mo_noise_intervals <- mcmc_intervals_data(as.matrix(fitted_model_NB_hier_gp, pars = "mo_noise"), prob = 0.5)
gp_exp_quad_intervals <- mcmc_intervals_data(as.matrix(fitted_model_NB_hier_gp, pars = "gp_exp_quad"), prob = 0.5)
plot_data <- bind_rows(mo_noise_intervals, gp_exp_quad_intervals)
plot_data$time <- rep(1:36, times = 2)
plot_data$term <- factor(rep(c("Monthly Noise", "Smooth Trend"), each = 36),
levels = c("Smooth Trend", "Monthly Noise"))
ggplot(plot_data, aes(x = time, y = m, ymin = l, ymax = h, fill = term)) +
geom_ribbon(alpha = 0.5) +
geom_line(aes(color = term), size = 0.5) +
ylab(NULL)