Model checking
===============
css: custom.css
transition: none
```{r setup, include=FALSE}
library(knitr)
library(magrittr)
library(viridis)
library(ggplot2)
library(reshape2)
library(animation)
opts_chunk$set(cache=TRUE, echo=FALSE, fig.height=10, fig.width=10)
```
```{r initialmodeletc, echo=FALSE, message=FALSE, warning=FALSE}
load("../spermwhaledata/R_import/spermwhale.RData")
library(Distance)
library(dsm)
df <- ds(dist, truncation=6000)
dsm_tw_xy_depth <- dsm(count ~ s(x, y) + s(Depth), ddf.obj=df, observation.data=obs, segment.data=segs, family=tw())
dsm_x_tw <- dsm(count~s(x), ddf.obj=df,
segment.data=segs, observation.data=obs,
family=tw())
```
intro
======
type:section
title:none
*"perhaps the most important part of applied statistical modelling"*
Simon Wood
Model checking
===============
- Checking $\neq$ validation!
- As with detection function, checking is important
- Want to know the model conforms to assumptions
- What assumptions should we check?
What to check
===============
- Convergence
- Basis size
- Residuals
Convergence
===========
type:section
Convergence
===========
- Fitting the GAM involves an optimization
- By default this is REstricted Maximum Likelihood (REML) score
- Sometimes this can go wrong
- R will warn you!
A model that converges
======================
```{r convcheck, echo=TRUE, fig.keep="none"}
gam.check(dsm_tw_xy_depth)
```
A bad model
===========
```
Error in while (mean(ldxx/(ldxx + ldss)) > 0.4) { :
missing value where TRUE/FALSE needed
In addition: Warning message:
In sqrt(w) : NaNs produced
Error in while (mean(ldxx/(ldxx + ldss)) > 0.4) { :
missing value where TRUE/FALSE needed
```
This is **rare**
The Folk Theorem of Statistical Computing
============
type:section
"most statistical computational problems are due not to the algorithm being used but rather the model itself"
Andrew Gelman
Basis size
===========
type:section
Basis size (k)
===========
- Set `k` per term
- e.g. `s(x, k=10)` or `s(x, y, k=100)`
- Penalty removes "extra" wigglyness
- *up to a point!*
- (But computation is slower with bigger `k`)
Checking basis size
====================
```{r gamcheck-text, fig.keep="none", echo=TRUE}
gam.check(dsm_x_tw)
```
Increasing basis size
====================
```{r gamcheck-kplus-text, fig.keep="none", echo=TRUE}
dsm_x_tw_k <- dsm(count~s(x, k=20), ddf.obj=df,
segment.data=segs, observation.data=obs,
family=tw())
gam.check(dsm_x_tw_k)
```
Sometimes basis size isn't the issue...
========================================
- Generally, double `k` and see what happens
- Didn't increase the EDF much here
- Other things can cause low "`p-value`" and "`k-index`"
- Increasing `k` can cause problems (nullspace)
k is a maximum
==============
- (Usually) Don't need to worry about things being too wiggly
- `k` gives the maximum complexity
- Penalty deals with the rest
```{r plotk, fig.width=18, fig.height=9 }
dsm_sst_k5 <- dsm(count~s(SST, k=5), ddf.obj=df,
segment.data=segs, observation.data=obs,
family=tw())
dsm_sst_k20 <- dsm(count~s(SST, k=20), ddf.obj=df,
segment.data=segs, observation.data=obs,
family=tw())
dsm_sst_k50 <- dsm(count~s(SST, k=50), ddf.obj=df,
segment.data=segs, observation.data=obs,
family=tw())
par(mfrow=c(1,3), cex.lab=2, lwd=3, cex.main=4, cex.axis=2, cex.lab=4, mar=c(5,6.5,4,2)+0.1)
plot(dsm_sst_k5, main="k=5")
plot(dsm_sst_k20, main="k=20")
plot(dsm_sst_k50, main="k=50")
```
Residuals
==================================
type:section
What are residuals?
====================
- Generally residuals = observed value - fitted value
- BUT hard to see patterns in these "raw" residuals
- Need to standardise $\Rightarrow$ **deviance residuals**
- Residual sum of squares $\Rightarrow$ linear model
- deviance $\Rightarrow$ GAM
- Expect these residuals $\sim N(0,1)$
Residual checking
===================
```{r gamcheck, results="hide"}
gam.check(dsm_x_tw)
```
Shortcomings
=============
- `gam.check` can be helpful
- "Resids vs. linear pred" is victim of artifacts
- Need an alternative
- "Randomised quanitle residuals" (*experimental*)
- `rqgam.check`
- Exactly normal residuals
Randomised quantile residuals
==============================
```{r rqgamcheck}
rqgam.check(dsm_x_tw)
```
Residuals vs. covariates
=========================
```{r covar-resids, fig.width=15}
par(cex.axis=2, lwd=2, cex.lab=2)
library(statmod)
par(mfrow=c(1,2))
plot(dsm_x_tw$data$x, residuals(dsm_x_tw), ylab="Deviance residuals", xlab="x")
environment(dsm_x_tw$family$variance)$p <- dsm_x_tw$family$getTheta(TRUE)
plot(dsm_x_tw$data$x, qres.tweedie(dsm_x_tw), ylab="Randomised quantile residuals", xlab="x")
```
Residuals vs. covariates (boxplots)
=========================
```{r covar-resids-boxplot, fig.width=15}
par(cex.axis=2, lwd=2, cex.lab=2, las=2)
library(statmod)
par(mfrow=c(1,2))
resid_dat <- data.frame(x = dsm_x_tw$data$x,
x_cut = cut(dsm_x_tw$data$x,
seq(min(dsm_x_tw$data$x),
max(dsm_x_tw$data$x),
len=20)),
dres = residuals(dsm_x_tw),
qres = qres.tweedie(dsm_x_tw))
plot(dres~x_cut, data=resid_dat, ylab="Deviance residuals", xlab="x")
plot(qres~x_cut, data=resid_dat, ylab="Randomised quantile residuals", xlab="x")
```
Example of "bad" plots
=======================
![Bad residual check plot from Wood 2006](images/badgam.png)
Example of "bad" plots
=======================
![Bad residual check plot from Wood 2006](images/badgam-annotate.png)
Residual checks
================
- Looking for patterns (not artifacts)
- This can be tricky
- Need to use a mixture of techniques
- Cycle through checks, make changes recheck
- Each dataset is different
Summary
=======
- Convergence
- Rarely an issue
- Check your thinking about the model
- Basis size
- k is a maximum
- Double and see what happens
- Residuals
- Deviance and randomised quantile
- check for artifacts
- `gam.check` is your friend