This notebook steps through the post-processing of hcv_bdsky.xml, the analysis of the Egyptian HCV data with a Birth Death Skyline Contemporary model.

Load the trace file and check convergence

We can load the log file using the readLog() function.

    bdsky_trace   <- beastio::readLog(params$logfile, burnin=0.1)

With the log file loaded as a coda::mcmc object we can use functions from the coda package to explore the trace. (see the coda manual for more examples)

    summary(bdsky_trace)
    varnames(bdsky_trace)

We can use the checkESS() function to find which parameters have ESS < 200,

  beastio::checkESS(bdsky_trace)
##                            posterior                                prior 
##                            138.37906                             60.38113 
##                          Tree.height                      Tree.treeLength 
##                             78.67775                             71.17478 
##                           gammaShape                               rateAC 
##                            116.30581                             99.85803 
##                               rateAG                               rateAT 
##                             80.79500                             93.91429 
##                               rateCG                               rateGT 
##                            159.01652                            123.13430 
##                      freqParameter.1                      freqParameter.2 
##                             50.14282                             62.07893 
##                      freqParameter.3                      freqParameter.4 
##                             56.17979                             49.70493 
##                        BDSKY_Contemp                 origin_BDSKY_Contemp 
##                             71.79820                             41.83838 
## becomeUninfectiousRate_BDSKY_Contemp   reproductiveNumber_BDSKY_Contemp.1 
##                             30.05652                             58.51121 
##   reproductiveNumber_BDSKY_Contemp.2   reproductiveNumber_BDSKY_Contemp.3 
##                             64.45153                             50.02300 
##   reproductiveNumber_BDSKY_Contemp.4   reproductiveNumber_BDSKY_Contemp.5 
##                             39.84365                             66.48599 
##   reproductiveNumber_BDSKY_Contemp.6   reproductiveNumber_BDSKY_Contemp.7 
##                             61.24581                            136.80401 
##   reproductiveNumber_BDSKY_Contemp.8   reproductiveNumber_BDSKY_Contemp.9 
##                            109.00646                             42.14392 
##  reproductiveNumber_BDSKY_Contemp.10 
##                             36.00705

or use the same function to plot the ESS values of all parameters.

  beastio::checkESS(bdsky_trace,   cutoff=200, plot=TRUE, log='y', ylim=c(1,10000), title="All parameters", plot.grid=TRUE)

We see that most parameters have very low ESS values, because we didn’t run the chain long enough. We can also investigate the ESS values of the analysis we ran for 30 million steps, where most parameters should have an ESS > 200:

    bdsky_trace_long   <- beastio::readLog(params$logfile_long, burnin=0.1)
    
    beastio::checkESS(bdsky_trace_long,   cutoff=200, plot=TRUE, log='y', ylim=c(1,10000), title="All parameters", plot.grid=TRUE)

## becomeUninfectiousRate_BDSKY_Contemp   reproductiveNumber_BDSKY_Contemp.9 
##                             111.4578                             196.5225 
##  reproductiveNumber_BDSKY_Contemp.10 
##                             191.1606

Extract parameter estimates and HPDs

Next we can extract the \(R_e\) parameter values and their HPDs.

  Re_sky <- beastio::getLogFileSubset(bdsky_trace, "reproductiveNumber_BDSKY_Contemp")
  Re_hpd <- t(beastio::getHPDMedian(Re_sky))
  
  delta_hpd <- beastio::getHPDMedian(bdsky_trace[, "becomeUninfectiousRate_BDSKY_Contemp"])

Plotting non-gridded BDSKY estimates

We can plot the raw \(R_e\) HPD intervals. This is equivalent to the output in Tracer.

    bdskytools::plotSkyline(1:10, Re_hpd, type='step', ylab="Re")

Since the intervals in this analysis are equidistant between the origin and the present and the origin was also estimated, they should probably not be plotted in this way. Use this only when the interval times in bdsky are fixed (requires editing the XML) or when the origin is fixed (automatically fixes the shift-times).

When doing this do NOT plot with type=‘smooth’ as it gives a misleading result!!!

Plotting a “smooth” skyline

In order to plot the smooth skyline we have to marginalize our \(R_e\) estimates on a regular time grid and calculate the HPD at each grid point. It is usually a good idea to use a grid with more cells than the dimension of \(R_e\) (but using too many can result in noisy estimates).

To do this we first calculate the marginal posterior at every time of interest using the function bdskytools::gridSkyline() and then calculate the HPD for each of the finer time intervals. Here we choose to look at params$gridsize equidistantly spaced points between the median tMRCA and the most recent sequence (params$mostrecent). The times to grid the skyline on (gridTimes), refers to years in the past.

    tmrca_med  <- median(bdsky_trace[, "Tree.height"])
    gridTimes  <- seq(0, median(tmrca_med), length.out=params$gridsize)  
    
    Re_gridded <- mcmc(bdskytools::gridSkyline(Re_sky, bdsky_trace[, "origin_BDSKY_Contemp"], gridTimes))
    Re_gridded_hpd <- t(getHPDMedian(Re_gridded))

Now we are ready to plot the smooth skyline.

    times <- params$mostrecent - gridTimes
    plotSkyline(times, Re_gridded_hpd, xlab="Date", ylab="Re", type="smooth")   

    plotSkyline(times, Re_gridded_hpd, xlab="Date", ylab="Re", type="lines")   

We can plot the gridded \(R_e\) skyline (not its HPDs) for a few of the states in our MCMC chain to see what it really looks like as the Markov chain samples parameters. Note that the intervals overlap between different posterior samples. This is because the origin estimate is a different sample from the origin’s posterior density in each of the plotted samples. As we add more samples to the plot we start to see the smooth skyline appear.

    layout(matrix(c(1:4), nrow=4))
    plotSkyline(times, Re_gridded, type='steplines', traces=1, 
                col=cols$blue,ylims=c(0,3.5), xlab="Time", ylab="Re", main="1 random sample")
    plotSkyline(times, Re_gridded, type='steplines', traces=10, 
                    col=set_alpha(cols$blue,0.5),ylims=c(0,3.5), xlab="Time", ylab="Re", main="10 random samples")
    plotSkyline(times, Re_gridded, type='steplines', traces=100, 
                    col=set_alpha(cols$blue,0.5),ylims=c(0,3.5), xlab="Time", ylab="Re", main="100 random samples")
    plotSkyline(times, Re_gridded, type='steplines', traces=1000, 
                    col=set_alpha(cols$blue,0.1),ylims=c(0,3.5), xlab="Time", ylab="Re", main="1000 random samples")

Combined plot

Finally, we can plot both Re and delta on one set of axes for comparison. In this analysis delta has a dimension of 1, so the skyline is not really insightful. We can also use this type of plot to compare skylines of the same parameter between different models (eg. changing the priors or number of shifts).

    par(mar=c(4,4,1,4))
    plot(1, type='n', xlim=c(1700,2000), ylim=c(0,1), 
         xlab='Year', ylab="", yaxt='n', xaxs='i', yaxs='i')

    plotSkyline(range(times), as.matrix(delta_hpd), type='step', lwd=2, 
                xlab="", ylab="", add=TRUE, new=FALSE, axes=FALSE, fill=set_alpha(cols$blue, 0.5), col=cols$blue)
    axis(4, las=1)
    mtext(expression(delta), side=4, line=3)
    
    plotSkyline(times, Re_gridded_hpd, lwd=2, xlims=c(1700,2000), xaxs='i', yaxs='i', 
                xlab="", ylab="", add=TRUE, new=TRUE, axes=FALSE, fill=set_alpha(cols$orange, 0.5), col=cols$orange)   
    axis(2, las=1)
    mtext(expression("R"[e]), side=2, line=3)
    
    abline(h = 1, lty=2, col=cols$red, lwd=2)
    
    legend("topright", legend=c(expression("R"[e]), expression(delta)), bty='n',
           fill=set_alpha(c(cols$orange, cols$blue), 0.5), border=c(cols$orange, cols$blue))