---
title: "extremeStat: quantile estimation"
author: "Berry Boessenkool, "
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    # toc_float: true # not possible as of march 2016
vignette: >
  %\VignetteIndexEntry{extremeStat: quantile estimation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

Vignette [Rmd source](https://raw.githubusercontent.com/brry/extremeStat/master/vignettes/extremeStat.Rmd)

*The [R](https://www.r-project.org/) package `extremeStat`, available at [github.com/brry](https://github.com/brry/extremeStat), contains code to fit, plot and compare several (extreme value) distribution functions. It can also compute (truncated) distribution quantile estimates and draw a plot with return periods on a linear scale.*

Main focus of this document:

* **Quantile estimation via distribution fitting**
* **Comparison of GPD implementations in several R packages**
* **Extreme value statistics**

> Note: in some disciplines, quantiles are called percentiles, but technically, percentiles are only one kind of quantile (as are deciles, quartiles, etc.).

## Motivational example

```{r extreme, fig.height=3.5, fig.width=5, message=FALSE, echo=-1}
par(mar=c(3.2,3.2,1.5,0.7), mgp=c(2.1,0.7,0))
library(extremeStat)
data("annMax")
extremes <- distLextreme(annMax, RPs=c(2,5,10,20,50,100,200), gpd=FALSE)
plotLextreme(extremes, log="x", xlim=c(1,200), ylim=c(35,140), nbest=7)
head(extremes$returnlev)
```

More on extreme value statistics in the last section.

[TOC](#top)

## Package installation

```{r instcran, eval=FALSE}
install.packages("extremeStat")
# install the development version from github, incl. vignette:
if(!requireNamespace("remotes", quietly=TRUE)) install.packages("remotes")
remotes::install_github("brry/extremeStat", build_vignettes=TRUE)
```

`extremeStat` has 33 dependencies because of the GPD comparison across packages.

```{r library, echo=TRUE}
library(extremeStat)
```

[TOC](#top)

## Example dataset

Let's use the dataset `rain` with 17k values. Removing the very small values, which might be considered uncertain records, leaves us with 6k values.

```{r dataHist, fig.show='hold', echo=-3}
data(rain, package="ismev")
rain <- rain[rain>2]
par(mar=c(3.2,3.2,1.5,0.7), mgp=c(2.1,0.7,0))
hist(rain, breaks=80, col=4, las=1)
# Visual inspection is easier on a logarithmic scale:
berryFunctions::logHist(rain, breaks=80, col=3, las=1)
```

[TOC](#top)

## Fitting distributions

The function `distLfit` fits 17 of the distribution types available in the R package `lmomco` (there are more, but some of them require quite a bit of computation time and often cannot be fitted to this type of data anyway; turn them on with `speed=FALSE`). The parameters are estimated via L-moments. These are analogous to the conventional statistical moments (mean, variance, skewness and kurtosis), but "robust [and] suitable for analysis of rare events of non-Normal data. L-moments are consistent and often have smaller sampling variances than maximum likelihood in small to moderate sample sizes. L-moments are especially useful in the context of quantile functions" [Asquith, W. (2015): lmomco package](https://www.rdocumentation.org/packages/lmomco).

`distLfit` ranks the distributions according to their goodness of fit (RMSE between ECDF and CDF), see section [GOF](#goodness-of-fit).
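A minimal sketch of such a fit, using the `annMax` dataset from the motivational example (arguments and the `gof` element are the ones used later in this vignette):

```{r fitsketch, eval=FALSE}
# Sketch: fit the distributions to the annual discharge maxima
dlf <- distLfit(annMax, quiet=TRUE) # quiet: suppress progress messages
plotLfit(dlf, nbest=5)              # draw the 5 best-fitting distributions
dlf$gof                             # goodness-of-fit table, best fit first
```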
[TOC](#top)

## Quantile estimation

To estimate quantiles of (small) samples via a distribution function, you can use `distLquantile`, which internally calls `distLfit`:

```{r dlf}
dlf <- distLquantile(rain[1:900], probs=c(0.8,0.9,0.99,0.999), list=TRUE, quiet=TRUE)
```

If `list` is set to `TRUE`, it will return an object that can be examined with `printL`:

```{r dlprint, eval=1}
printL(dlf)
```

Detailed documentation on dlf objects can be found in [?extremeStat](https://www.rdocumentation.org/packages/extremeStat). dlf objects can be plotted with several functions, e.g. `plotLquantile`:

```{r dlplot, echo=-1, fig.height=3.5, fig.width=5.5}
par(mar=c(3.9,3.9,1.5,0.7), mgp=c(2.8,0.7,0))
plotLquantile(dlf, nbest=8, linargs=list(lwd=2), heights=seq(0.04, 0.01, len=8), breaks=80)
```

The resulting **parametric quantiles** can be obtained with

```{r dlquant}
dlf$quant # distLquantile output if list=FALSE (the default)
```

* The first 17 rows, one per distribution function, are sorted by goodness of fit (see below). The distributions each yield different results, especially for very high quantile probabilities (values that are exceeded very rarely). Note that the deviation from each other increases for badly fitted distributions.
* The row `quantileMean` is an average of R's 9 methods implemented in `stats::quantile` to determine **empirical quantiles** (order-based statistics, keyword: plotting positions).
* The rows `GPD_*` are the Generalized Pareto Distribution quantiles, as estimated by a range of different R packages and methods (specified in the row names), computed by `q_gpd`. More on that in the section [GPD](#pot-gpd) below.
* The rows `weighted*` are averages of the quantiles estimated from the distribution functions, weighted by their goodness of fit (RMSE between ECDF and CDF) in three default (and one custom) weighting schemes (see the next section).

[TOC](#top)

## Goodness of Fit

```{r weight, echo=-1, fig.height=3.5, fig.width=5.5}
par(mar=c(3.2,3.6,2.6,0.7), mgp=c(2.1,0.7,0))
plotLweights(dlf, legargs=list(cex=0.8, bg="transparent"))
```

The way RMSE (as a measure of GOF) is computed can be visualized nicely with a smaller dataset:

```{r rmse_vis, fig.height=4, fig.width=7, echo=-1}
par(mar=c(2,4,1.5,0.5))
data("annMax") # annual discharge maxima in the extremeStat package itself
sel <- c("wak","lap","revgum")
dlf <- distLfit(annMax, sel=sel, quiet=TRUE)
col <- plotLfit(dlf, cdf=TRUE, sel=sel)$distcol
x <- sort(annMax)
for(d in 3:1) segments(x0=x, y0=lmomco::plmomco(x, dlf$parameter[[sel[d]]]),
                       y1=ecdf(x)(x), col=col[d])
```

The vertical bars mark the deviation of the distribution CDF from the ECDF. They are aggregated by [?rmse](https://www.rdocumentation.org/packages/berryFunctions) to mark distribution goodness of fit (GOF).

[TOC](#top)

## POT, GPD

The Generalized Pareto Distribution ('GPD', or 'gpa' in the package `lmomco`) is often used to obtain **parametric quantile values** because of the [Pickands-Balkema-DeHaan theorem](https://en.wikipedia.org/wiki/Pickands-Balkema-de_Haan_theorem). It states that the tails of many (empirical) distributions converge to the GPD if a Peak-Over-Threshold (POT) method is used, i.e. the distribution is fitted only to the largest values of a sample. The resulting percentiles can be called **censored or truncated quantiles**.

This package is based on the philosophy that, in order to compare parametric with empirical quantiles, the threshold must be set at some percentage of the full sample. That way, the probabilities given to the quantile functions can be updated.
For example, if the censored Q0.99 (**`p = 0.99`**) is to be computed from the top 20 % of the full dataset (the truncated proportion is 80 %: **`t = 0.8`**), then Q0.95 (**`p2 = 0.95`**) of the truncated sample must be used. The probability adjustment for censored quantiles with truncation percentage **`t`** is done with the equation
$$ p2 = \frac{p-t}{1-t} $$
derived from
$$ \frac{1-p}{1-t} = \frac{1-p2}{1-0} $$

as visualized along a probability line:

```{r prob, echo=FALSE, fig.height=1, fig.width=5.5}
par(mar=rep(0,4))
plot(1:6, type="n", ylim=c(1,6), axes=FALSE, ann=FALSE)
arrows(x0=1, y0=4, x1=6, code=3, angle=90, length=0.07)
arrows(x0=4, y0=4, x1=6, code=1, angle=90, length=0.07)
arrows(x0=4, y0=3, x1=6, code=3, angle=90, length=0.07)
text(x=c(1,4,5.5,6), y=rep(5,4), c(0,"t","p",1), col=c(1,1,2,1))
text(x=c(4,5.5,6), y=rep(2,3), c(0,"p2",1), col=c(1,2,1))
text(3.8, 3, "truncated sample", adj=1)
points(x=rep(5.5,2), y=c(3,4), col=2, pch=16)
#text(7, 5.5, expression(frac(1-p, 1-t)*" = "* frac(1-p2, 1-0)), adj=0, cex=1.2)
#text(7, 2.5, expression("p2 = "*frac(p-t, 1-t)), adj=0, cex=1.2)
```

In `distLquantile`, you can set the proportion of data discarded with the argument `truncate` (= **`t`** above):

```{r trunc, echo=-1, fig.height=3.5, fig.width=5.5}
par(mar=c(3.2,3.6,2.6,0.7), mgp=c(2.1,0.7,0))
d <- distLquantile(rain, truncate=0.9, probs=0.999, quiet=TRUE, list=TRUE)
plotLquantile(d, breaks=50)
```

[TOC](#top)

## Truncation effect

To examine the effect of the truncation percentage, we can compute the quantiles for a range of cutoff percentages. This takes a few minutes to compute, so the code is not run during vignette creation. The [result](https://github.com/brry/extremeStat/raw/master/vignettes/qq.Rdata) is loaded instead.

```{r teff}
canload <- suppressWarnings(try(load("qq.Rdata"), silent=TRUE))
if(inherits(canload, "try-error")) {
  tt <- c(seq(0,0.5,0.1), seq(0.55,0.99,len=50))
  if(interactive()) lapply <- pbapply::pblapply # for progress bars
  qq <- lapply(tt, function(t) distLquantile(rain, truncate=t, probs=0.999,
                                             quiet=TRUE, gpd=FALSE))
  dlf00 <- plotLfit(distLfit(rain), nbest=17)
  dlf99 <- plotLfit(distLfit(rain, tr=0.99), nbest=17)
  save(tt,qq,dlf99,dlf00, file="qq.Rdata")
}
```

We can visualize the truncation dependency with

```{r teffplot, fig.height=3.5, fig.width=5.5}
load("qq.Rdata")
par(mar=c(3,2.8,2.2,0.4), mgp=c(1.8,0.5,0))
plot(tt, tt, type="n", xlab="truncation proportion", ylab="Quantile estimate",
     main="truncation effect for 6k values of rain", ylim=c(22,90), las=1,
     xlim=0:1, xaxs="i", xaxt="n")
axis(1, at=0:5*0.2, labels=c("0",1:4*0.2,"1"))
dn <- dlf00$distnames
names(dlf00$distcols) <- dn
for(d in dn) lines(tt, sapply(qq, "[", d, j=1), col=dlf00$distcols[d], lwd=2)
abline(h=berryFunctions::quantileMean(rain, probs=0.999), lty=3)
gof <- formatC(round(dlf00$gof[dn,"RMSE"],3), format='f', digits=3)
legend("center", paste(gof,dn), col=dlf00$distcols, lty=1, bg="white", cex=0.5)
text(0.02, 62, "empirical quantile (full sample)", adj=0)
```

The 17 different distribution quantiles (and 11 different GPD estimates, not in the figure) seem to converge with increasing truncation percentage. However, at least 5 values must remain in the truncated sample to fit distributions via L-moments, so don't truncate too much. I found 0.8 to be a good cutoff percentage: if you fit to the top 20% of the data, you get good results while needing 'only' approximately 25 values in a sample to infer a quantile estimate.
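As a quick numeric check of the probability adjustment from the [POT, GPD](#pot-gpd) section, here is a sketch with the cutoff suggested above:

```{r p2check, eval=FALSE}
# probability adjustment p2 = (p-t)/(1-t) for censored quantiles
t <- 0.8    # truncation proportion (suggested cutoff)
p <- 0.999  # desired quantile probability of the full sample
(p - t)/(1 - t)  # p2 = 0.995: probability to use on the truncated sample
```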
The goodness of fit also improves (lower RMSE) up to a truncation proportion of about 0.8, before worsening again:

```{r teffplotrmse, fig.height=3.5, fig.width=5.5, echo=FALSE}
par(mfrow=c(1,2), mar=c(1.5,2,0,0.5), oma=c(2,2,2,0))
plot(1, type="n", xlim=0:1, xaxs="i", log="y", ylim=c(0.01,0.15), axes=FALSE,
     ylab="", xlab="")
axis(1, at=0:5*0.2, labels=c("0",1:4*0.2,"1"))
berryFunctions::logAxis(2)
for(d in dn) lines(tt, sapply(qq, "[", d, j="RMSE"), col=dlf00$distcols[d])
plot(1, type="n", xlim=0:1, xaxs="i", ylim=c(0.011,0.025), xaxt="n", ylab="RMSE")
title(xlab="truncation proportion", main="truncation effect for 6k values of rain",
      mgp=c(0.5,0,0), outer=TRUE)
title(ylab="RMSE", mgp=c(1,0,0), outer=TRUE)
axis(1, at=0:5*0.2, labels=c("0",1:4*0.2,"1"))
for(d in dn) lines(tt, sapply(qq, "[", d, j="RMSE"), col=dlf00$distcols[d])
```

[TOC](#top)

## Sample size dependency

One motivation behind the development of this package is the finding that high empirical quantiles depend not only on the values of a sample (as they should), but also on the number of observations available. That is not surprising: given the distribution of a population, small samples are less likely to include the high (and rare) values. The cool thing about parametric quantiles is that they don't systematically underestimate the actual quantile in small samples. Here's a quick demonstration.

```{r ssdep, eval=FALSE}
canload <- suppressWarnings(try(load("sq.Rdata"), silent=TRUE))
if(inherits(canload, "try-error")) {
  set.seed(1)
  ss <- c(30,50,70,100,200,300,400,500,1000)
  rainsamplequantile <- function() sapply(ss, function(s)
    distLquantile(sample(rain,s), probs=0.999, plot=FALSE, truncate=0.8,
                  quiet=TRUE, sel="wak", gpd=FALSE, weight=FALSE))
  sq <- pbapply::pbreplicate(n=100, rainsamplequantile())
  save(ss,sq, file="sq.Rdata")
}
```

After loading the [resulting R objects](https://github.com/brry/extremeStat/raw/master/vignettes/sq.Rdata), the sample size dependency can be visualized as follows:

```{r ssdepplot, fig.height=3.5, fig.width=5.5}
load("sq.Rdata")
par(mar=c(3,2.8,2.2,0.4), mgp=c(1.7,0.5,0))
sqs <- function(prob,row) apply(sq, 1:2, quantile, na.rm=TRUE, probs=prob)[row,]
berryFunctions::ciBand(yu=sqs(0.6,1), yl=sqs(0.4,1), ym=sqs(0.5,1), x=ss,
   ylim=c(25,75), xlim=c(30,900), xlab="sample size",
   ylab="estimated 99.9% quantile",
   main="quantile estimations of small random samples", colm="blue")
berryFunctions::ciBand(yu=sqs(0.6,2), yl=sqs(0.4,2), ym=sqs(0.5,2), x=ss, add=TRUE)
abline(h=quantile(rain,0.999))
text(250, 50, "empirical", col="forestgreen")
text(400, 62, "Wakeby", col="blue")
text(0, 61, "'True' population value", adj=0)
text(600, 40, "median and central 20% of 100 simulations")
```

[TOC](#top)

## Extreme value statistics, Return Periods

Once you have a quantile estimator, you can easily compute extremes (= return levels) for given return periods. A value `x` in a time series has a certain expected frequency to occur or be exceeded: the exceedance probability Pe.
The Return Period (RP) of `x` can be computed as follows:
$$ RP = \frac{1}{Pe} = \frac{1}{1-Pne} $$

From that follows, for the probability of non-exceedance Pne:
$$ Pne = 1 - \frac{1}{RP} $$

The Return Level RL can be computed as
$$ RL = quantile(x, ~~ prob=Pne) $$

The Return Periods are given in years, so if you use daily values in a POT approach, you should set `npy=365.24`, as the formula in this package uses
$$ returnlev = distLquantile(x, ~~ probs=1-\frac{1}{RP \cdot npy}) $$

An example with actual numbers (only valid in stationary cases!): A once-in-a-century flood (an event with a return period of 100 years) is expected to be reached or exceeded ca. ten times in a dataset spanning a millennium. Let the associated discharge (river flow) be 5400 m^3/s. The exceedance probability of 5400 m^3/s is 1 %, which means that in any given year, the probability of a flood larger than that is 1 %. The probability that the largest flood in 2017 is smaller than 5400 m^3/s is thus 99 %: RP = 100, Pe = 0.01, Pne = 0.99.

Let there be 60 years of daily discharge records. In the annual block maxima approach with 60 values, the probability passed to the quantile function to compute the 100-year flood is 0.99. In the peak-over-threshold approach with daily values above a threshold, it is 1 - 1/(100*365.24) = 0.9999726.

Here is an example with annual block maxima of stream discharge in Austria:

```{r RP, echo=-1, warning=FALSE, fig.height=4, fig.width=5.5}
par(mar=c(3,2.8,1.2,0.4), mgp=c(1.8,0.5,0))
dlf <- distLextreme(annMax, quiet=TRUE)
plotLextreme(dlf, log=TRUE, legargs=list(cex=0.6, bg="transparent"), nbest=17)
dlf$returnlev[1:20,]
```

Explore the other possibilities of the package by reading the function help files. A good place to start is the package help:

```{r help, eval=FALSE}
?extremeStat
```

[TOC](#top)

Any feedback on this package (or this vignette) is very welcome via [github](https://github.com/brry/extremeStat) or !