{ "cells": [ { "cell_type": "markdown", "id": "b2c9f140-e483-4ad6-af39-b4832216aa55", "metadata": {}, "source": [ "# Resampling methods: the Bootstrap and the Jackknife" ] }, { "cell_type": "markdown", "id": "2da12b94-c909-4d93-a1b4-c47c4131966e", "metadata": {}, "source": [ "## The Bootstrap\n", "\n", "We observe an IID sample of size $n$,\n", "$\\{X_j \\}_{j=1}^n$ IID $F$.\n", "Each observation is real-valued.\n", "We wish to estimate some parameter of the distribution of\n", "$F$ that can be written as a functional of $F$, $T(F)$.\n", "Examples include the mean, $T(F) = \\int x dF(x)$, other moments, _etc_.\n", "\n", "The (unpenalized) nonparametric maximum likelihood estimator of $F$ from the data\n", "$\\{X_j \\}$ is\n", "just the empirical distribution $\\hat{F}_n$,\n", "which assigns mass $1/n$ to each observation:\n", "\\begin{equation}\n", "\\arg \\max_{\\mbox{distributions }G} \\mathbb{P}_G \\{ X_j = x_j, \\; j=1, \\ldots, n \\} = \\hat{F}_n.\n", "\\end{equation}\n", "(Note, however, that the MLE of $F$ is not generally consistent in\n", "problems with an\n", "infinite number of parameters, such as estimating a density or a\n", "distribution function.)\n", "\n", "Using the general principle that the maximum likelihood estimator of a\n", "function of a parameter is that function of the maximum likelihood\n", "estimator of the parameter, we might be led to consider $T(\\hat{F}_n)$\n", "as an estimator of $T(F)$.\n", "\n", "That is exactly what the sample mean does, as an estimator of the mean:\n", "\\begin{equation}\n", " T(\\hat{F}_n) = \\int x d\\hat{F}_n(x) = \\sum_{j=1}^n\\frac{1}{n}X_j =\n", " \\frac{1}{n} \\sum_j X_j.\n", "\\end{equation}\n", "\n", "Similarly, the maximum likelihood estimator of\n", "\\begin{equation}\n", " \\mathrm{Var}(X) = T(F) = \\int \\left ( x - \\int x dF \\right )^2dF\n", "\\end{equation}\n", "is\n", "\\begin{equation}\n", " T(\\hat{F}_n) = \\int \\left ( x - \\int x d\\hat{F}_n \\right )^2 d\\hat{F}_n =\n", " \\frac{1}{n}\\sum_j \\left (X_j - \\frac{1}{n} \\sum_k X_k \\right )^2.\n", "\\end{equation}\n", "In these cases, we get analytically tractable expressions for $T(\\hat{F}_n)$.\n", "\n", "What is often more interesting is to estimate a property of the sampling\n", "distribution of the estimator $T(\\hat{F}_n)$, for example the variance of \n", "the estimator $T(\\hat{F}_n)$.\n", "The bootstrap approximates the sampling distribution of\n", "$T(\\hat{F}_n)$ by the sampling distribution of $T(\\hat{F}_n^*)$,\n", "where $\\hat{F}_n^*$ is a size-$n$ IID random sample drawn\n", "from $\\hat{F}_n$.\n", "That is, the bootstrap approximates\n", "the sampling distribution of an estimator applied to the empirical\n", "distribution $\\hat{F}_n$ of a random sample of size\n", "$n$ from a distribution $F$ by the sampling distribution of that estimator\n", "applied to a random sample $\\hat{F}_n^*$ of size $n$ from a particular\n", "realization $\\hat{F}_n$ of\n", "the empirical distribution of a sample of size $n$ from $F$.\n", "\n", "When $T$ is the mean $\\int x dF$, so $T(\\hat{F}_n)$ is the sample mean, \n", "we could obtain\n", "the variance of the distribution of $T( \\hat{F}_n^* )$ analytically:\n", "Let $\\{ X_j^* \\}_{j=1}^n$ be an IID sample of size $n$ from $\\hat{F}_n$.\n", "Then\n", "\\begin{equation}\n", " \\mathrm{Var}_{\\hat{F}_n} \\frac{1}{n}\\sum_{j=1}^n X_j^*\n", " = \\frac{1}{n^2} \\sum_{j=1}^n (X_j - \\bar{X})^2,\n", "\\end{equation}\n", "where $\\{ X_j \\}$ are the original data and $\\bar{X}$ is their mean.\n", "When we do not get a 
tractable expression\n",
"for the variance of an estimator under resampling from the empirical\n",
"distribution, we could still approximate the distribution of\n",
"$T(\\hat{F}_n)$ by generating a large number of\n",
"size-$n$ IID $\\hat{F}_n$ data sets (drawing samples of\n",
"size $n$ with replacement from $\\{ x_j \\}_{j=1}^n$), and applying \n",
"$T$ to each of those sets.\n",
"\n",
"The idea of the bootstrap is to approximate the distribution (under $F$)\n",
"of an estimator\n",
"$T(\\hat{F}_n)$ by the distribution of the estimator under $\\hat{F}_n$,\n",
"and to approximate _that_ distribution by using a computer to\n",
"take a large number of pseudo-random samples of size $n$ from \n",
"$\\hat{F}_n$.\n",
"\n",
"This basic idea is quite flexible, and can be applied to a wide variety of\n",
"testing and estimation problems, including finding confidence sets for\n",
"functional parameters. \n",
"(It is not a panacea, though: we will see later how delicate it can be.)\n",
"It is related to some other \"resampling\" schemes in which\n",
"one re-weights the data to form other distributions.\n",
"Before doing more theory with the bootstrap, let's examine the jackknife." ] }, { "cell_type": "markdown", "id": "76223dd5-a941-475c-99a9-07e89d855ae5", "metadata": {}, "source": [ "## The Jackknife\n",
"\n",
"The idea behind the jackknife, which is originally due to Quenouille and\n",
"Tukey, is to\n",
"form from the data $\\{ X_j \\}_{j=1}^n$ $n$ data sets of size $n-1$,\n",
"leaving each datum out\n",
"in turn.\n",
"The \"distribution\" of $T$ applied to these $n$ sets is used to approximate\n",
"the distribution of $T(\\hat{F}_n)$.\n",
"Let $\\hat{F}_{(i)}$ denote the empirical distribution of the data\n",
"set with the $i$th value deleted;\n",
"$T_{(i)} = T( \\hat{F}_{(i)})$ is the corresponding\n",
"estimate of $T(F)$.\n",
"An estimate of the expected value of $T(\\hat{F}_n)$ is\n",
"\\begin{equation}\n",
" \\hat{T}_{(\\cdot)} = \\frac{1}{n} \\sum_{i=1}^n T( \\hat{F}_{(i)}) .\n",
"\\end{equation}\n",
"Consider the bias of $T(\\hat{F}_n)$:\n",
"\\begin{equation}\n",
" \\mathbb{E}_F T(\\hat{F}_n) - T(F).\n",
"\\end{equation}\n",
"Quenouille's jackknife estimate of the bias is\n",
"\\begin{equation}\n",
" \\widehat{\\mbox{BIAS}} = (n-1) (\\hat{T}_{(\\cdot)} - T(\\hat{F}_n) ).\n",
"\\end{equation}\n",
"It can be shown that if the bias of $T$ has a homogeneous\n",
"polynomial expansion\n",
"in $n^{-1}$ whose coefficients do not depend on $n$,\n",
"then the bias of the bias-corrected estimate\n",
"\\begin{equation}\n",
" \\tilde{T} = nT(\\hat{F}_n) - (n-1) T_{(\\cdot)}\n",
"\\end{equation}\n",
"is $O(n^{-2})$ instead of $O(n^{-1})$.\n",
"\n",
"Applying the jackknife estimate of bias to correct\n",
"the plug-in estimate of\n",
"variance reproduces the formula for the sample variance (with\n",
"$1/(n-1)$) from the formula with $1/n$:\n",
"Define\n",
"\\begin{equation}\n",
" \\bar{X} = \\frac{1}{n} \\sum_{j=1}^n X_j,\n",
"\\end{equation}\n",
"\\begin{equation}\n",
" \\bar{X}_{(i)} = \\frac{1}{n-1} \\sum_{j \\ne i} X_j,\n",
"\\end{equation}\n",
"\\begin{equation}\n",
" T(\\hat{F}_n) = \\hat{\\sigma}^2 = \\frac{1}{n} \\sum_{j=1}^n (X_j - \\bar{X})^2,\n",
"\\end{equation}\n",
"\\begin{equation}\n",
" T(\\hat{F}_{(i)}) = \\frac{1}{n-1} \\sum_{j \\ne i} ( X_j - \\bar{X}_{(i)})^2,\n",
"\\end{equation}\n",
"\\begin{equation}\n",
" T(\\hat{F}_{(\\cdot)}) = \\frac{1}{n} \\sum_{i=1}^n T(\\hat{F}_{(i)}).\n",
"\\end{equation}\n",
"Now\n",
"\\begin{equation}\n",
" \\bar{X}_{(i)} = \\frac{n\\bar{X} - X_i}{n-1} = 
\\bar{X} + \\frac{1}{n-1} (\\bar{X} - X_i),\n", "\\end{equation}\n", "so\n", "\\begin{eqnarray}\n", " ( X_j - \\bar{X}_{(i)})^2 &=&\n", " \\left ( X_j - \\bar{X} - \\frac{1}{n-1} (\\bar{X} - X_i) \\right )^2 \\nonumber \\\\\n", " &=& (X_j - \\bar{X})^2 + \\frac{2}{n-1} (X_j - \\bar{X})(X_i - \\bar{X}) +\n", " \\frac{1}{(n-1)^2}(X_i - \\bar{X})^2.\n", "\\end{eqnarray}\n", "Note also that\n", "\\begin{equation}\n", " \\sum_{j \\ne i} (X_j - \\bar{X}_{(i)})^2 =\n", " \\sum_{j=1}^n (X_j - \\bar{X}_{(i)})^2 - (X_i - \\bar{X}_{(i)})^2.\n", "\\end{equation}\n", "Thus\n", "\\begin{eqnarray}\n", " \\sum_{i=1}^n \\sum_{j \\ne i} (X_j - \\bar{X}_{(i)})^2 &=&\n", " \\frac{1}{n-1} \\sum_{i=1}^n \\left [ \\sum_{j=1}^n \\left [\n", " (X_j - \\bar{X})^2 + \\frac{2}{n-1}(X_j - \\bar{X})(X_i - \\bar{X}) \\right . \\right . + \\nonumber \\\\\n", " && + \\left . \\left . \\frac{1}{(n-1)^2}(X_i - \\bar{X})^2\n", " \\right ] - (X_i - \\bar{X})^2 - \\right . \\nonumber \\\\\n", " && - \\left . \\left .\n", " \\frac{2}{n-1}(X_i - \\bar{X})^2 - \\frac{1}{(n-1)^2}(X_i - \\bar{X})^2 \\right . \\right ].\n", "\\end{eqnarray}\n", "The last three terms all are multiples of $(X_i - \\bar{X})^2$; the sum of the coefficients\n", "is\n", "\\begin{equation}\n", " 1 + 2/(n-1) + 1/(n-1)^2 = n^2/(n-1)^2.\n", "\\end{equation}\n", "The middle term of the inner sum is a constant times $(X_j - \\bar{X})$, which sums to zero over $j$.\n", "Simplifying the previous displayed equation yields\n", "\\begin{eqnarray}\n", " \\sum_{i=1}^n \\sum_{j \\ne i} (X_j - \\bar{X}_{(i)})^2\n", " &=&\n", " \\frac{1}{n-1} \\sum_{i=1}^n \\left ( n \\hat{\\sigma}^2 + \\frac{n}{(n-1)^2}(X_i - \\bar{X})^2 - \\frac{n^2}{(n-1)^2} (X_i - \\bar{X})^2 \\right ) \\nonumber \\\\\n", " &=&\n", " \\frac{1}{n-1} \\sum_{i=1}^n (n \\hat{\\sigma}^2 - \\frac{n}{n-1} (X_i - \\bar{X})^2 ) \\nonumber \\\\\n", " &=&\n", " \\frac{1}{n-1} \\left [ n^2 \\hat{\\sigma}^2 - \\frac{n^2}{n-1} \\hat{\\sigma}^2 \\right ] \\nonumber \\\\\n", " &=&\n", " \\frac{n(n-2)}{(n-1)^2}\\hat{\\sigma}^2.\n", "\\end{eqnarray}\n", "The jackknife bias estimate is thus\n", "\\begin{equation}\n", " \\widehat{\\mbox{BIAS}} = (n-1)\\left ( T(\\hat{F}_{(\\cdot)}) - T(\\hat{F}_n) \\right )\n", " = \\hat{\\sigma}^2 \\frac{n(n-2) - (n-1)^2}{n-1} = \\frac{-\\hat{\\sigma}^2}{n-1}.\n", "\\end{equation}\n", "The bias-corrected MLE variance estimate is therefore\n", "\\begin{equation}\n", " \\hat{\\sigma}^2 \\left ( 1 - \\frac{1}{n-1} \\right ) =\n", " \\hat{\\sigma}^2 \\frac{n}{n-1} = \\frac{1}{n-1} \\sum_{j=1}^n (X_j - \\bar{X})^2 = S^2,\n", "\\end{equation}\n", "the usual sample variance.\n", "\n", "The jackknife also can be used to estimate other properties of an\n", "estimator, such as its variance.\n", "The jackknife estimate of the variance of $T(\\hat{F}_n)$ is\n", "\\begin{equation}\n", "\\hat{\\mathrm{Var}}(T) = \\frac{n-1}{n} \\sum_{j=1}^n ( T_{(j)} - T_{(\\cdot)} )^2.\n", "\\end{equation}" ] }, { "cell_type": "markdown", "id": "4f2709ca-200e-4994-a510-151eb3354c67", "metadata": {}, "source": [ "It is convenient to think of distributions on data sets to compare\n", "the jackknife and the bootstrap.\n", "We shall follow the notation in Efron (1982).\n", "We condition on $(X_i = x_i)$ and treat the data as fixed in what\n", "follows.\n", "Let $\\mathcal{S}_n$ be the $n$-dimensional simplex\n", "\\begin{equation}\n", " \\mathcal{S}_n \\equiv \\{ \\mathbb{P}^* = (P_i^*)_{i=1}^n \\in \\Re^n\n", " : P_i^* \\ge 0 \\mbox{ and } \\sum_{i=1}^n P_i^* = 1 \\}.\n", "\\end{equation}\n", "A _resampling 
vector_\n", "$\\mathbb{P}^* = (P_k^*)_{k=1}^n $ is any element of $\\mathcal{S}_n$;\n", "_i.e._, an $n$-dimensional discrete probability vector.\n", "To each $\\mathbb{P}^* = (P_k^*) \\in \\mathcal{S}_n$ there corresponds\n", "a re-weighted empirical measure $\\hat{F}(\\mathbb{P}^*)$ which\n", "puts mass $P_k^* $ on $x_k$, and a value of the estimator\n", "$T^* = T(\\hat{F}(\\mathbb{P}^*)) = T(\\mathbb{P}^*)$.\n", "The resampling vector $\\mathbb{P}^0 = (1/n)_{j=1}^n$ corresponds to the\n", "empirical distribution $\\hat{F}_n$ (each datum $x_j$ has the same\n", "mass).\n", "The resampling vector\n", "\\begin{equation}\n", " \\mathbb{P}_i = \\frac{1}{n-1}(1, 1, \\ldots, 0, 1, \\ldots, 1),\n", "\\end{equation}\n", "which has the zero in the $i$th place, is one of the $n$\n", "resampling vectors the jackknife visits; denote the\n", "corresponding value of the estimator $T$ by $T_{(i)}$.\n", "The bootstrap visits all resampling vectors whose components are\n", "multiples of $1/n$.\n", "\n", "The bootstrap estimate of variance tends to be better than the\n", "jackknife estimate of variance for nonlinear estimators because of\n", "the distance between the empirical measure and the resampled measures:\n", "\\begin{equation}\n", " \\| \\mathbb{P}^* - \\mathbb{P}^0 \\| = O_P(n^{-1/2}),\n", "\\end{equation}\n", "while\n", "\\begin{equation}\n", " \\| \\mathbb{P}_k - \\mathbb{P}^0 \\| = O(n^{-1}).\n", "\\end{equation}\n", "To see the former, recall that the difference between the\n", "empirical distribution and the true distribution is $O_P(n^{-1/2})$:\n", "For any two probability distributions $\\mathbb{P}_1$, $\\mathbb{P}_2$, on\n", "$\\Re$, define the\n", "Kolmogorov-Smirnov distance\n", "\\begin{equation}\n", " d_{KS}(\\mathbb{P}_1, \\mathbb{P}_2) \\equiv\n", " \\| \\mathbb{P}_1 - \\mathbb{P}_2 \\|_{KS} \\equiv \\sup_{x \\in \\Re} |\n", " \\mathbb{P}_1\\{(-\\infty, x]\\} - \\mathbb{P}_2\\{(-\\infty, x]\\} |.\n", "\\end{equation}\n", "There exist universal constants $\\chi_n(\\alpha)$\n", "so that for every continuous (w.r.t. Lebesgue measure) distribution\n", "$F$,\n", "\\begin{equation}\n", " \\mathbb{P}_F\n", " \\left \\{ \\| F - \\hat{F}_n \\|_{KS} \\ge \\chi_n(\\alpha) \\right \\}\n", " = \\alpha.\n", "\\end{equation}\n", "This is the Dvoretzky-Kiefer-Wolfowitz inequality.\n", "Massart (_Ann. 
Prob., 18_, 1269--1283, 1990)\n",
"showed that the constant\n",
"\\begin{equation}\n",
" \\chi_n(\\alpha) \\le \\sqrt{\\frac{\\ln \\frac{2}{\\alpha}}{2n}}\n",
"\\end{equation}\n",
"is _tight_.\n",
"Thinking of the bootstrap distribution (the empirical distribution $\\hat{F}_n$) as the true\n",
"cdf and the resamples from it as the data gives the result that the distance between\n",
"the cdf of the bootstrap resample and the empirical cdf of the original data\n",
"is $O_P(n^{-1/2})$.\n",
"\n",
"To see that the cdfs of the jackknife samples are $O(n^{-1})$ from the\n",
"empirical cdf $\\hat{F}_n$, note that for univariate real-valued data, the difference\n",
"between $\\hat{F}_n$ and the cdf of the jackknife data set that\n",
"leaves out the $j$th ranked observation $X_{(j)}$ is largest either at $X_{(j-1)}$ or\n",
"at $X_{(j)}$.\n",
"For $j = 1$ or $j = n$, the jackknife\n",
"samples that omit the smallest or largest observation, the K-S\n",
"distance between the jackknife measure and the empirical distribution\n",
"is exactly $1/n$.\n",
"Consider the jackknife cdf $\\hat{F}_{n,(j)}$, the cdf of the sample without $X_{(j)}$,\n",
"$1 < j < n$.\n",
"\\begin{equation}\n",
" \\hat{F}_{n,(j)}(X_{(j)}) = (j-1)/(n-1),\n",
"\\end{equation}\n",
"while $\\hat{F}_n(X_{(j)}) = j/n$; the difference is\n",
"\\begin{equation}\n",
" \\frac{j}{n} - \\frac{j-1}{n-1} = \\frac{j(n-1) - n(j-1)}{n(n-1)} =\n",
" \\frac{n-j}{n(n-1)} = \\frac{1}{n-1} - \\frac{j}{n(n-1)}.\n",
"\\end{equation}\n",
"On the other hand,\n",
"\\begin{equation}\n",
" \\hat{F}_{n,(j)}(X_{(j-1)}) = (j-1)/(n-1),\n",
"\\end{equation}\n",
"while $\\hat{F}_n(X_{(j-1)}) = (j-1)/n$; the difference is\n",
"\\begin{equation}\n",
" \\frac{j-1}{n-1} - \\frac{j-1}{n} = \\frac{n(j-1) - (n-1)(j-1)}{n(n-1)} =\n",
" \\frac{j - 1}{n(n-1)}.\n",
"\\end{equation}\n",
"Thus\n",
"\\begin{equation}\n",
" \\| \\hat{F}_{n,(j)} - \\hat{F}_{n} \\| = \\frac{1}{n(n-1)} \\max\\{n-j, j-1\\}.\n",
"\\end{equation}\n",
"But $(n-1)/2 \\le \\max\\{n-j, j-1\\} \\le n-1$, so\n",
"\\begin{equation}\n",
" \\| \\hat{F}_{n,(j)} - \\hat{F}_n\\| = O(n^{-1}).\n",
"\\end{equation}\n",
"\n",
"The neighborhood that the bootstrap samples is larger, and is\n",
"probabilistically of the right size to correspond to the uncertainty\n",
"of the empirical distribution function as an estimator of the\n",
"underlying distribution function $F$ (recall the\n",
"Dvoretzky-Kiefer-Wolfowitz inequality---a K-S ball of radius $O(n^{-1/2})$\n",
"has fixed coverage probability).\n",
"For linear functionals, this does not matter, but for strongly nonlinear\n",
"functionals, the bootstrap estimate of the variability tends to be more accurate than\n",
"the jackknife estimate of the variability.\n",
"\n",
"Let us have a quick look at the distribution of the K-S distance between\n",
"a continuous distribution and the empirical distribution of a\n",
"sample $\\{X_j\\}_{j=1}^n$ IID $F$.\n",
"The discussion follows _Feller_ (1971, pp.36ff).\n",
"First we show that for continuous distributions $F$, the distribution of\n",
"$\\| \\hat{F}_n - F \\|_{KS}$ does not depend on $F$.\n",
"To see this, note that $F(X_j) \\sim U[0, 1]$:\n",
"Let $x_t \\equiv \\inf \\{x \\in \\Re : F(x) = t \\}$.\n",
"Continuity of $F$ ensures that $x_t$ exists for all $t \\in [0, 1]$.\n",
"Now the event $\\{ X_j \\le x_t \\}$ is equivalent to the\n",
"event $\\{F(X_j) \\le F(x_t)\\}$ up to a set of $F$-measure zero.\n",
"Thus\n",
"\\begin{equation}\n",
" t = \\mathbb{P}_F \\{ X_j \\le x_t \\} = \\mathbb{P}_F \\{ F(X_j) \\le F(x_t) \\} 
=\n", " \\mathbb{P}_F \\{ F(X_j) \\le t \\}, \\,\\, t \\in [0, 1];\n", "\\end{equation}\n", "i.e., $\\{ F(X_j) \\}_{j=1}^n$ are IID $U[0, 1]$.\n", "Let\n", "\\begin{equation}\n", " \\hat{G}_n(t) \\equiv \\#\\{F(X_j) \\le t\\}/n =\n", " \\#\\{X_j \\le x_t\\}/n = \\hat{F}_n (x_t)\n", "\\end{equation}\n", "be the empirical cdf of $\\{ F(X_j) \\}_{j=1}^n$.\n", "Note that\n", "\\begin{equation}\n", " \\sup_{x \\in \\Re} | \\hat{F}_n(x) - F(x) | =\n", " \\sup_{t \\in [0, 1]} | \\hat{F}_n(x_t) - F(x_t) | =\n", " \\sup_{t \\in [0, 1]} | \\hat{G}_n (t) - t |.\n", "\\end{equation}\n", "The probability distribution of $\\hat{G}_n$ is that of the cdf of $n$ IID\n", "$U[0, 1]$ random variables (it does not depend on $F$), so the distribution\n", "of the K-S distance between the empirical cdf and the true cdf is the\n", "same for every continuous distribution.\n", "It turns out that for distributions with atoms, the K-S distance between\n", "the empirical and the true distribution functions is stochastically\n", "smaller than it is for continuous distributions." ] }, { "cell_type": "markdown", "id": "5718016a-118e-4b1d-b31f-0849d0496e74", "metadata": {}, "source": [ "## Bootstrap Confidence Sets\n", "\n", "Let $\\mathcal{U}$ be an index set (not necessarily countable).\n", "Recall that a collection $\\{ \\mathcal{I}_u \\}_{u \\in \\mathcal{U}}$ of confidence intervals for\n", "parameters $\\{\\theta_u \\}_{u \\in \\mathcal{U}}$ has simultaneous $1-\\alpha$\n", "coverage probability if\n", "\\begin{equation}\n", "\\mathbb{P}_{\\theta} \\left \\{ \\cap_{u \\in \\mathcal{U}} \\{\\mathcal{I}_u \\ni \\theta_u \\} \\right \\}\n", "\\ge 1-\\alpha.\n", "\\end{equation}\n", "If $\\mathbb{P} \\{ \\mathcal{I}_u \\ni \\theta_u\\}$ does not depend on $u$, the confidence\n", "intervals are said to be _balanced_.\n", "\n", "Many of the procedures for forming joint confidence sets we have seen depend on\n", "_pivots_, which are functions of the data and the parameter(s) whose\n", "distribution is known (even though the parameter and the parent distribution are not).\n", "For example, the Scheff\\'{e} method relies on the fact that (for samples from\n", "a multivariate Gaussian with independent components)\n", "the sum of squared differences between the data and the corresponding parameters,\n", "divided by the variance estimate, has an $F$ distribution, regardless of the\n", "parameter values.\n", "Similarly, Tukey's maximum modulus method relies on the fact that\n", "(again, for independent Gaussian data) the distribution\n", "of the maximum of the studentized\n", "absolute differences between the data and the corresponding\n", "parameters does not depend on the parameters.\n", "Both of those examples are parametric, but the idea is more general:\n", "the procedure we looked at for finding bounds on the density function subject\n", "to shape restrictions just relied on the fact that there are uniform\n", "bounds on the probability that the K-S distance between the empirical\n", "distribution and the true distribution exceeds some threshold.\n", "\n", "Even in cases where there is no known exact pivot, one can sometimes show that\n", "some function of the data and parameters is asymptotically a pivot.\n", "Working out the distributions of the functions involved is not typically\n", "straightforward, and a general method of constructing (possibly\n", "simultaneous) confidence sets would be nice.\n", "\n", "Efron gives several methods of basing confidence sets on the bootstrap.\n", "Those methods are substantially 
improved (in theory, and in my experience)\n",
"by Beran's pre-pivoting approach, which leads to iterating the bootstrap.\n",
"\n",
"Let $X_n$ denote a sample of size $n$ from $F$.\n",
"Let $R_n(\\theta) = R_n(X_n, \\theta)$ have cdf\n",
"$H_n$, and let $H_n^{-1}(\\alpha)$\n",
"be the largest $\\alpha$ quantile of the distribution of $R_n$.\n",
"Then\n",
"\\begin{equation}\n",
" \\{ \\gamma \\in \\Theta : R_n(\\gamma) \\le H_n^{-1}(1-\\alpha) \\}\n",
"\\end{equation}\n",
"is a $1-\\alpha$ confidence set for $\\theta$." ] }, { "cell_type": "markdown", "id": "8af0a94a-890d-454a-85f1-0f63aac8e5e1", "metadata": {}, "source": [ "### The Percentile Method\n",
"\n",
"The idea of the percentile method is to use the empirical bootstrap percentiles of\n",
"some quantity to approximate the true percentiles.\n",
"Consider constructing a confidence interval for a single real parameter $\\theta = T(F)$.\n",
"We will estimate $\\theta$ by $\\hat{\\theta} = T(\\hat{F}_n)$.\n",
"We would like to know the distribution function $H_n = H_n(\\cdot, F)$ of\n",
"$D_n (\\theta) = T(\\hat{F}_n) - \\theta$.\n",
"Suppose we did.\n",
"Let $H_n^{-1}(\\cdot) = H_n^{-1}(\\cdot, F)$ be the inverse cdf of $D_n$.\n",
"Then\n",
"\\begin{equation}\n",
" \\mathbb{P}_F \\{ H_n^{-1}(\\alpha/2) \\le T(\\hat{F}_n) - \\theta \\le H_n^{-1}(1- \\alpha/2)\n",
" \\} = 1-\\alpha,\n",
"\\end{equation}\n",
"so\n",
"\\begin{equation}\n",
" \\mathbb{P}_F \\{ \\theta \\le T(\\hat{F}_n) - H_n^{-1}(\\alpha/2) \\mbox{ and }\n",
" \\theta \\ge T(\\hat{F}_n) - H_n^{-1}(1-\\alpha/2) \\} = 1-\\alpha,\n",
"\\end{equation}\n",
"or, equivalently,\n",
"\\begin{equation}\n",
" \\mathbb{P}_F \\{ [T(\\hat{F}_n) - H_n^{-1}(1-\\alpha/2), T(\\hat{F}_n) - H_n^{-1}(\\alpha/2)]\n",
" \\ni \\theta \\} = 1-\\alpha,\n",
"\\end{equation}\n",
"so the interval\n",
"$[T(\\hat{F}_n) - H_n^{-1}(1-\\alpha/2), T(\\hat{F}_n) - H_n^{-1}(\\alpha/2)]$\n",
"would be a $1-\\alpha$ confidence interval for $\\theta$.\n",
"\n",
"The idea behind the percentile method is to approximate $H_n(\\cdot, F)$ by\n",
"$\\hat{H}_n = H_n(\\cdot, \\hat{F}_n)$, the distribution of $D_n$\n",
"under resampling from $\\hat{F}_n$ rather than $F$.\n",
"(This tends not to be a good approximation for constructing confidence sets.\n",
"See [confidence sets](./confidence-sets.ipynb).)\n",
"\n",
"An alternative approach is to take\n",
"$D_n (\\theta) = |T(\\hat{F}_n) - \\theta|$; then\n",
"\\begin{equation}\n",
" \\mathbb{P}_F \\{ | T(\\hat{F}_n) - \\theta | \\le H_n^{-1}(1- \\alpha)\n",
" \\} = 1-\\alpha,\n",
"\\end{equation}\n",
"so\n",
"\\begin{equation}\n",
" \\mathbb{P}_F \\{ [ T(\\hat{F}_n) - H_n^{-1}(1-\\alpha) , T(\\hat{F}_n) + H_n^{-1}(1-\\alpha)]\n",
" \\ni \\theta \\} = 1-\\alpha.\n",
"\\end{equation}\n",
"In either case, the \"raw\" bootstrap approach is to approximate $H_n$ by resampling\n",
"under $\\hat{F}_n$.\n",
"\n",
"Beran proves a variety of results under the following condition:\n",
"\n",
"> **Condition 1.** (Beran, 1987) \n",
"For any sequence $\\{F_n\\}$ that converges to\n",
"$F$ in a metric $d$ on\n",
"cdfs, $H_n(\\cdot, F_n)$ converges weakly to a continuous cdf\n",
"$H = H(\\cdot, F)$ that depends only on $F$, and not the sequence $\\{F_n\\}$.\n",
"\n",
"Suppose Condition 1 holds.\n",
"Then because $\\hat{F}_n$ is consistent for $F$, the estimate $\\hat{H}_n$ converges\n",
"in probability to $H$ in sup norm; moreover, the distribution of\n",
"$\\hat{H}_n(R_n(\\theta))$ converges to $U[0,1]$.\n",
"\n",
"Instead of $D_n$, consider $R_n(\\theta) = | T(\\hat{F}_n) - \\theta |$ or 
some\n",
"other (approximate) pivot.\n",
"Let $\\hat{H}_n = H_n(\\cdot, \\hat{F}_n)$ be the bootstrap estimate of the cdf of $R_n$.\n",
"The set\n",
"\\begin{eqnarray}\n",
" B_n &=& \\{ \\gamma \\in \\Theta : \\hat{H}_n (R_n(\\gamma)) \\le 1-\\alpha \\}\n",
" \\nonumber \\\\\n",
" &=& \\{ \\gamma \\in \\Theta : R_n(\\gamma) \\le \\hat{H}_n^{-1}(1-\\alpha) \\}\n",
"\\label{eq:BnDef}\n",
"\\end{eqnarray}\n",
"is (asymptotically) a $1-\\alpha$ confidence set for $\\theta$.\n",
"\n",
"The level of this set for finite samples tends to be inaccurate.\n",
"It can be improved in the following way, due to Beran.\n",
"\n",
"The original root, $R_n(\\theta)$, whose limiting distribution depends on $F$,\n",
"was transformed into a new root $R_{n,1}(\\theta) = \\hat{H}_n(R_n(\\theta) )$,\n",
"whose limiting distribution is $U[0,1]$. The distribution of $R_{n,1}$\n",
"depends less strongly on $F$ than does that of $R_n$; Beran calls\n",
"mapping $R_n$ into $R_{n,1}$ _prepivoting_. \n",
"The confidence set acts as if the distribution of $R_{n,1}$ really is uniform,\n",
"which is not generally true. One could instead treat $R_{n,1}$ itself as a root,\n",
"and prepivot again to reduce the dependence on $F$.\n",
"\n",
"Let $H_{n,1} = H_{n,1}(\\cdot, F)$ be the cdf of the new root $R_{n,1}(\\theta)$,\n",
"estimate $H_{n,1}$ by $\\hat{H}_{n,1} = H_{n,1}(\\cdot, \\hat{F}_n)$, and define\n",
"\\begin{eqnarray}\n",
" B_{n,1} &=& \\{ \\gamma \\in \\Theta : \\hat{H}_{n,1}(R_{n,1}(\\gamma)) \\le 1-\\alpha \\}\n",
" \\nonumber \\\\\n",
" &=& \\{ \\gamma \\in \\Theta : \\hat{H}_{n,1}(\\hat{H}_n(R_n(\\gamma))) \\le 1-\\alpha \\}\n",
" \\nonumber \\\\\n",
" &=& \\{ \\gamma \\in \\Theta :\n",
" R_n(\\gamma) \\le \\hat{H}_n^{-1}(\\hat{H}_{n,1}^{-1}(1-\\alpha))\n",
" \\}.\n",
"\\label{eq:Bn1Def}\n",
"\\end{eqnarray}\n",
"Beran shows that this confidence set tends to have smaller error in its level than\n",
"does $B_n$.\n",
"The transformation can be iterated further, typically\n",
"resulting in additional reductions in the level error."
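,
"\n",
"\n",
"The following is a minimal Monte-Carlo sketch (an illustration added here, not Beran's construction) of the raw percentile method for a single real parameter, taking $T$ to be the mean and using the root $D_n(\\theta) = T(\\hat{F}_n) - \\theta$. The function name and the `numpy` resampling details are illustrative choices.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def percentile_method_ci(x, T=np.mean, alpha=0.05, n_boot=10_000, seed=None):\n",
"    # Approximate H_n, the cdf of D_n = T(F_n) - theta, by resampling from\n",
"    # the empirical distribution and recomputing T on each resample.\n",
"    rng = np.random.default_rng(seed)\n",
"    x = np.asarray(x)\n",
"    n = len(x)\n",
"    theta_hat = T(x)\n",
"    t_star = np.array([T(rng.choice(x, size=n, replace=True)) for _ in range(n_boot)])\n",
"    d_star = t_star - theta_hat          # bootstrap analogue of D_n\n",
"    lo, hi = np.quantile(d_star, [alpha / 2, 1 - alpha / 2])\n",
"    # interval [T(F_n) - H_n^{-1}(1 - alpha/2), T(F_n) - H_n^{-1}(alpha/2)]\n",
"    return theta_hat - hi, theta_hat - lo\n",
"\n",
"x = np.random.default_rng(0).lognormal(size=100)   # synthetic data, for illustration\n",
"print(percentile_method_ci(x))\n",
"```\n",
"\n",
"Prepivoting would treat $R_{n,1}(\\theta) = \\hat{H}_n(R_n(\\theta))$ as a new root and repeat the same kind of resampling one level deeper, as in the Monte-Carlo recipe of the next section."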
] }, { "cell_type": "markdown", "id": "b7015724-ae8d-4f39-8ab1-0eb43a3ac460", "metadata": {}, "source": [ "## Approximating $B_{n,1}$ by Monte Carlo\n", "I follow Beran's (1987) notation (mostly).\n", "\n", "Let $x_n$ denote the \"real\" sample of size $n$.\n", "Let $x_n^*$ be a bootstrap sample of size $n$ drawn from the empirical cdf $\\hat{F}_n$.\n", "The components of $x_n^*$ are conditionally IID given $x_n$.\n", "Let $\\hat{F}_n^*$ denote the \"empirical\" cdf of the bootstrap sample $x_n^*$.\n", "Let $x_n^{**}$ denote a sample of size $n$ drawn from $\\hat{F}_n^*$; the components of\n", "$x_n^{**}$ are conditionally IID given $x_n$ and $x_n^*$.\n", "Let $\\hat{\\theta}_n = T(\\hat{F}_n)$, and $\\hat{\\theta}_n^* = T(\\hat{F}_n^*)$.\n", "Then\n", "\\begin{equation}\n", "H_n(s, F) = \\mathbb{P}_F \\{ R_n (x_n, \\theta) \\le s \\} ,\n", "\\end{equation}\n", "and\n", "\\begin{equation}\n", "H_{n,1}(s, F) = \\mathbb{P}_F \\left \\{ \\mathbb{P}_{\\hat{F}_n} \\{ R_n ( x_n^*, \\hat{\\theta}_n )\n", "< R_n(x_n, \\theta) \\} \\le s \\right \\}.\n", "\\end{equation}\n", "The bootstrap estimates of these cdfs are\n", "\\begin{equation}\n", "\\hat{H}_n(s) = H_n(s, \\hat{F}_n ) = \\mathbb{P}_{\\hat{F}_n} \\{ R_n ( x_n^*, \\hat{\\theta}_n ) \\le s\n", "\\},\n", "\\end{equation}\n", "and\n", "\\begin{equation}\n", "\\hat{H}_{n,1}(s) = H_{n,1}(s, \\hat{F}_n) = \\mathbb{P}_{\\hat{F}_n}\n", "\\left \\{ \\mathbb{P}_{\\hat{F}_n^*} \\{ R_n(x_n^{**}, \\hat{\\theta}_n^* )\n", "< R_n(x_n^*, \\hat{\\theta}_n) \\} \\le s \\right \\}.\n", "\\end{equation}\n", "\n", "The Monte Carlo approach is as follows:\n", "\n", "1. Draw $\\{ y_k^* \\}_{k=1}^M$ bootstrap samples of size $n$ from $\\hat{F}_n$.\n", "The ecdf of $\\{ R_n(y_k^*, \\hat{\\theta}_n) \\}_{k=1}^M $ is an approximation to $\\hat{H}_n$.\n", "\n", "1. For $k = 1, \\cdots, M$, let $\\{ y_{k\\ell}^{**} \\}_{\\ell=1}^N$ be\n", "$N$ size $n$ bootstrap samples from the ecdf of $y_k^*$.\n", "Let $\\hat{\\theta}_{n,k}^* = T(\\hat{F}_{n,k}^*)$.\n", "Let $Z_k$ be the fraction of the values\n", "$$ \\{ R_n(y_{k,\\ell}^{**}, \\hat{\\theta}_{n,k}^* ) \\}_{\\ell=1}^N$$\n", "that are less than or equal to $R_n(y_k^*, \\hat{\\theta}_n)$.\n", "\n", "1. The ecdf of $\\{ Z_k \\}$ is an approximation to $\\hat{H}_{n,1}$ that improves\n", "(in probability) as $M$ and $N$ grow.\n", "\n", "Note that this approach is extremely general. \n", "Beran gives examples for confidence sets for directions, _etc_. \n", "The pivot can in principle be\n", "a function of any number of parameters, which can yield simultaneous confidence\n", "sets for parameters of any dimension." ] }, { "cell_type": "markdown", "id": "02ee54da-0712-4256-ad6b-ac67b0587d4b", "metadata": {}, "source": [ "## Other approaches to improving coverage probability\n", "\n", "There are other ways of iterating the bootstrap to improve the level accuracy of\n", "bootstrap confidence sets.\n", "Efron suggests trying to attain a different coverage probability so that\n", "the coverage attained in the second generation samples is the nominal coverage probability.\n", "That is, if one wants a 95\\% confidence set, one tries different percentiles so that in\n", "resampling from the sample, the attained coverage probability is 95\\%. 
Typically, the\n",
"percentile one uses in the second generation will be higher than 95\\%.\n",
"Here is a sketch of the Monte-Carlo approach:\n",
"\n",
"+ Set a value of $\\alpha^*$ (initially taking $\\alpha^* = \\alpha$ is reasonable).\n",
"\n",
"+ From the sample, draw $M$ size-$n$ samples that are each IID\n",
"$\\hat{F}_n$. Denote the ecdfs of the samples by $\\{ \\hat{F}_{n,j}^*\\}$.\n",
"\n",
"+ For each $j = 1, \\ldots, M$, apply the percentile method to make a (nominal) level\n",
"$1-\\alpha^*$ confidence interval for $T(\\hat{F}_n)$.\n",
"This gives $M$ confidence intervals; a fraction $1-\\alpha'$ will cover\n",
"$T(\\hat{F}_n)$. Typically,\n",
"$1- \\alpha' \\ne 1-\\alpha$.\n",
"\n",
"+ If $1-\\alpha' < 1 - \\alpha$, decrease $\\alpha^*$ and return to the previous step.\n",
"If $1-\\alpha' > 1 - \\alpha$, increase $\\alpha^*$ and return to the previous step.\n",
"If $1-\\alpha' \\approx 1-\\alpha$ to the desired level of precision, go to the next\n",
"step.\n",
"\n",
"+ Report as a $1-\\alpha$ confidence interval for $T(F)$ the (first generation)\n",
"bootstrap quantile confidence interval\n",
"that has nominal $1 - \\alpha^*$ coverage probability.\n",
"\n",
"An alternative approach to increasing coverage probability by iterating\n",
"the bootstrap is to use the same root, but to use a quantile\n",
"(among second-generation bootstrap samples) of\n",
"its $1-\\alpha$ quantile rather than the quantile observed in the first generation.\n",
"The heuristic justification is that we would ideally like to know the $1-\\alpha$ quantile\n",
"of the pivot under sampling from the true distribution $F$.\n",
"We don't.\n",
"The percentile method estimates the $1-\\alpha$ quantile of the pivot under $F$ by the\n",
"$1-\\alpha$ quantile of the pivot under $\\hat{F}_n$, but this is subject to sampling variability.\n",
"To try to be conservative, we could use the bootstrap a second time to find an (approximate)\n",
"upper\n",
"$1-\\alpha^*$ confidence interval for the $1-\\alpha$ quantile of the pivot.\n",
"\n",
"Here is a sketch of the Monte-Carlo approach (a code sketch follows below):\n",
"\n",
"+ Pick a value $\\alpha^* \\in (0, 1/2)$ (e.g., $\\alpha^* = \\alpha$). This is a tuning\n",
"parameter.\n",
"\n",
"+ From the sample, draw $M$ size-$n$ samples that are each IID\n",
"$\\hat{F}_n$. Denote the ecdfs of the samples by $\\{ \\hat{F}_{n,j}^*\\}$.\n",
"\n",
"+ For each $j = 1, \\ldots, M$, draw $N$ size-$n$ samples, each IID\n",
"$\\hat{F}_{n,j}^*$. Find the $1-\\alpha$ quantile of the pivot.\n",
"This gives $M$ values of the $1-\\alpha$ quantile.\n",
"Let $c$ be the $1-\\alpha^*$ quantile of the $M$ $1-\\alpha$ quantiles.\n",
"\n",
"+ Report as a $1-\\alpha$ confidence interval for $T(F)$ the interval one gets\n",
"by taking $c$ to be the estimate of the $1-\\alpha$ quantile of the pivot.\n",
"\n",
"In a variety of simulations, this tends to be more conservative than Beran's method, and more\n",
"often attains at least the nominal coverage probability."
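,
"\n",
"\n",
"A compact sketch (an added illustration with hypothetical names, not code from the notes) of the quantile-of-quantiles recipe just described, using the pivot $| \\bar{X} - \\theta |$:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def quantile_of_quantiles_ci(x, alpha=0.05, alpha_star=0.05, M=1000, N=1000, seed=None):\n",
"    # For each of M first-generation resamples, estimate the 1-alpha quantile of\n",
"    # the pivot |mean - theta| from N second-generation resamples; use the\n",
"    # 1-alpha_star quantile of those M estimates as the critical value c.\n",
"    rng = np.random.default_rng(seed)\n",
"    x = np.asarray(x)\n",
"    n = len(x)\n",
"    q = np.empty(M)\n",
"    for j in range(M):\n",
"        xs = rng.choice(x, size=n, replace=True)         # sample from the ecdf of x\n",
"        xss = rng.choice(xs, size=(N, n), replace=True)  # N samples from the ecdf of xs\n",
"        pivot = np.abs(xss.mean(axis=1) - xs.mean())\n",
"        q[j] = np.quantile(pivot, 1 - alpha)\n",
"    c = np.quantile(q, 1 - alpha_star)                   # conservative critical value\n",
"    return x.mean() - c, x.mean() + c\n",
"```\n",
"\n",
"With $M = N = 1000$ this already requires $10^6$ second-generation resamples, which is why the exercise below warns about implementation efficiency."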
] }, { "cell_type": "markdown", "id": "078cc566-1b12-443a-82ea-22040cf5cae4", "metadata": {}, "source": [ "### Exercise.\n", "\n", "Consider forming a two-sided 95\\% confidence interval for the mean $\\theta$ of\n", "a distribution $F$ based on the sample mean,\n", "using $| \\bar{X} - \\theta |$ as a pivot.\n", "\n", "+ Implement the three \"double-bootstrap\" approaches to finding a confidence interval\n", "(Beran's pre-pivoting, Efron's calibrated target percentile, and the\n", "percentile-of-percentile).\n", "+ Generate 100 synthetic samples of size 100 from the following distributions: normal,\n", "lognormal, Cauchy,\n", "mixtures of normals with the same mean but quite different variances (try different\n", "mixture coefficients), and mixtures of normals with different means and different variances\n", "(the means should differ enough that the result is bimodal).\n", "+ Apply the three double bootstrap methods to each, resampling 1000 times from each of\n", "1000 first-generation bootstrap samples.\n", "+ Which method on the average has the lowest level error? Which method tends to be most\n", "conservative? Try to provide some intuition about the circumstances under\n", "which each method fails, and the circumstances under which each method would be expected\n", "to perform well.\n", "+ How do you interpret coverage for the Cauchy?\n", "\n", "\n", "*Warning*: You might need to be clever in how you implement this to make it\n", "a feasible calculation. \n", "If you try to store all the intermediate results,\n", "the memory requirement is huge. On the other hand, if you use too many loops, the\n", "execution time will be long." ] }, { "cell_type": "markdown", "id": "e58e3093-a167-4650-830f-278d32718b5f", "metadata": {}, "source": [ "### Bootstrap confidence sets based on Stein (shrinkage) estimates\n", "Beran (1995) discusses finding a confidence region for the mean vector $\\theta \\in \\Re^q$,\n", "$q \\ge 3$,\n", "from data $X \\sim N(\\theta, I)$.\n", "This is an example illustrating that _what_ one bootstraps is important, and that\n", "naive plug-in bootstrapping doesn't always work.\n", "\n", "The sets are spheres centered at the shrinkage estimate\n", "\\begin{equation}\n", "\\hat{\\theta}_S = \\left ( 1 - \\frac{q-2}{\\|X\\|^2} \\right ) X,\n", "\\end{equation}\n", "with random diameter $\\hat{d}$.\n", "That is, the confidence sets $C$ are of the form\n", "\\begin{equation}\n", "C(\\hat{\\theta}_S, \\hat{d}) =\n", "\\left \\{ \\gamma \\in \\Re^q : \\| \\hat{\\theta}_S - \\gamma \\| \\le \\hat{d}\n", "\\right \\}.\n", "\\end{equation}\n", "The problem is how to find $\\hat{d} = \\hat{d}(X; \\alpha)$ such that\n", "\\begin{equation}\n", "\\mathbb{P}_\\gamma \\{ C(\\hat{\\theta}_S, \\hat{d}) \\ni \\gamma \\} \\ge 1-\\alpha\n", "\\end{equation}\n", "whatever be $\\gamma \\in \\Re^q$.\n", "\n", "This problem is parametric: $F$ is known up to the $q$-dimensional mean vector\n", "$\\theta$.\n", "We can thus use a \"parametric bootstrap\" to generate data that are approximately from\n", "$F$, instead of drawing directly\n", "from $\\hat{F}_n$: if we have an estimate $\\hat{\\theta}$ of $\\theta$,\n", "we can generate artificial data\n", "distributed as $N( \\hat{\\theta}, I)$.\n", "If $\\hat{\\theta}$ is a good estimator, the artificial data will be distributed nearly\n", "as $F$. 
The issue is in what sense $\\hat{\\theta}$ needs to be good.\n",
"\n",
"Beran shows (somewhat surprisingly) that resampling from $N(\\hat{\\theta}_S,I)$\n",
"or from $N(X,I)$\n",
"does not tend to work well in calibrating $\\hat{d}$.\n",
"The crucial thing in using the bootstrap to calibrate the radius of the\n",
"confidence sphere seems to be to estimate $\\| \\theta \\|$ well.\n",
"\n",
"**Definition.**\n",
"The _geometrical risk_ of a confidence set $C$ for the parameter $\\theta \\in \\Re^q$\n",
"is\n",
"\\begin{equation}\n",
"G_q(C, \\theta) \\equiv q^{-1/2} E_\\theta \\sup_{\\gamma \\in C} \\| \\gamma - \\theta \\|.\n",
"\\end{equation}\n",
"That is, the geometrical risk is the expected distance to the parameter from the\n",
"most distant point in the confidence set.\n",
"\n",
"For confidence spheres\n",
"\\begin{equation}\n",
"C = C(\\hat{\\theta}, \\hat{d}) = \\{ \\gamma \\in \\Re^q : \\| \\gamma - \\hat{\\theta} \\| \\le\n",
"\\hat{d} \\},\n",
"\\end{equation}\n",
"the geometrical risk can be decomposed further: the distance from $\\theta$\n",
"to the most distant\n",
"point in the confidence set is the distance from $\\theta$ to the center of the sphere,\n",
"plus the radius of the sphere, so\n",
"\\begin{eqnarray}\n",
"G_q(C(\\hat{\\theta}, \\hat{d}), \\theta) &=&\n",
"q^{-1/2} E_\\theta \\left ( \\| \\hat{\\theta} - \\theta \\| + \\hat{d} \\right )\n",
"\\nonumber \\\\\n",
"&=&\n",
"q^{-1/2} E_\\theta \\| \\hat{\\theta} - \\theta \\| + q^{-1/2} E_\\theta \\hat{d} .\n",
"\\end{eqnarray}\n",
"\n",
"\n",
"> **Lemma** \n",
"(Beran, 1995, Lemma 4.1).\n",
"Define\n",
"\\begin{equation}\n",
" W_q(X, \\gamma) \\equiv (q^{-1/2} ( \\|X - \\gamma \\|^2 - q ), q^{-1/2} \\gamma'(X - \\gamma) ).\n",
"\\end{equation}\n",
"Suppose $\\{ \\gamma_q \\in \\Re^q \\}$ is any sequence such that\n",
"\\begin{equation} \\label{eq:gammaqCond}\n",
" \\frac{\\| \\gamma_q \\|^2}{q} \\rightarrow a < \\infty \\mbox{ as } q \\rightarrow \\infty .\n",
"\\end{equation}\n",
"Then\n",
"\\begin{equation}\n",
" W_q(X, \\gamma_q) \\overset{W}{\\rightarrow} (\\sqrt{2} Z_1, \\sqrt{a} Z_2 )\n",
"\\end{equation}\n",
"under $\\mathbb{P}_{\\gamma_q}$, where $Z_1$ and $Z_2$ are IID standard normal random variables.\n",
"(The symbol $\\overset{W}{\\rightarrow}$ denotes weak convergence of distributions.)\n",
"\n",
"\n",
"\n",
"**Proof.**\n",
"Under $ \\mathbb{P}_{\\gamma_q}$, the distribution of $X - \\gamma$ is rotationally invariant,\n",
"so the distribution of the components of $W_q$ depends on $\\gamma$ only through\n",
"$\\| \\gamma \\|$. 
Wlog, we may take each component of $\\gamma_q$ to be\n",
"$q^{-1/2}\\| \\gamma_q\\|$.\n",
"The distribution of the first component of $W_q$ is then that of the sum of squares\n",
"of $q$ IID standard normals (a chi-square rv with $q$ df),\n",
"minus the expected value of that sum, times $q^{-1/2}$.\n",
"The standard deviation of a chi-square random variable with $q$ df is $\\sqrt{2q}$,\n",
"so the first component of $W_q$ is $\\sqrt{2}$ times a standardized variable whose\n",
"distribution is asymptotically (in $q$) normal.\n",
"The second component of $W_q$ is a linear combination of IID standard normals; by\n",
"symmetry (as argued above), its distribution is that of\n",
"\\begin{eqnarray}\n",
"q^{-1/2} \\sum_{j=1}^q q^{-1/2}\\| \\gamma_q\\| Z_j &=&\n",
"\\frac{\\| \\gamma_q \\|}{q}\\sum_{j=1}^q Z_j\n",
"\\nonumber \\\\\n",
"&\\overset{W}{\\rightarrow}& a^{1/2} Z_2.\n",
"\\end{eqnarray}\n",
"\n",
"Recall that the squared-error risk (normalized by $q^{-1}$)\n",
"of the James-Stein estimator is\n",
"$1 - q^{-1} E_\\theta \\{ (q-2)^2/\\|X\\|^2 \\} < 1$.\n",
"The difference between the loss of $\\hat{\\theta}_S$ and an unbiased estimate of\n",
"its risk is\n",
"\\begin{equation}\n",
"D_q(X, \\theta) = q^{-1/2} \\{ \\| \\hat{\\theta}_S - \\theta \\|^2 -\n",
"[q - (q-2)^2/\\|X\\|^2] \\}.\n",
"\\end{equation}\n",
"By rotational invariance, the distribution of this quantity depends on $\\theta$ only\n",
"through $\\| \\theta\\|$; Beran writes the distribution as $H_q(\\| \\theta \\|^2/q)$.\n",
"Beran shows that if $\\{ \\gamma_q \\in \\Re^q \\}$ satisfies \\ref{eq:gammaqCond},\n",
"then\n",
"\\begin{equation}\n",
"H_q(\\|\\gamma_q\\|^2/q) \\overset{W}{\\rightarrow} N(0, \\sigma^2(a)),\n",
"\\end{equation}\n",
"where\n",
"\\begin{equation}\n",
"\\sigma^2(t) \\equiv 2 - 4t/(1+t)^2 \\ge 1.\n",
"\\end{equation}\n",
"Define\n",
"\\begin{equation}\n",
"\\hat{\\theta}_{\\mbox{CL}} = [ 1 - (q-2)/\\|X\\|^2]_+^{1/2} X.\n",
"\\end{equation}\n",
"\n",
"\n",
"> **Theorem.**\n",
"(Beran, 1995, Theorem 3.1)\n",
"Suppose $\\{ \\gamma_q \\in \\Re^q \\}$ satisfies \\ref{eq:gammaqCond}.\n",
"Then\n",
"\\begin{equation}\n",
"H_q( \\|\\hat{\\theta}_{\\mbox{CL}}\\|^2/q) \\overset{W}{\\rightarrow} N(0, \\sigma^2(a)) ,\n",
"\\end{equation}\n",
"\\begin{equation}\n",
"H_q(\\|X\\|^2/q) \\overset{W}{\\rightarrow} N(0, \\sigma^2(1+a)),\n",
"\\end{equation}\n",
"and\n",
"\\begin{equation}\n",
"H_q(\\| \\hat{\\theta}_S\\|^2/q ) \\overset{W}{\\rightarrow} N(0, \\sigma^2(a^2/(1+a))),\n",
"\\end{equation}\n",
"all in $P_{\\gamma_q}$ probability.\n",
"\n",
"It follows that to estimate $H_q$ by the bootstrap consistently,\n",
"one should use\n",
"\\begin{equation}\n",
"\\hat{H}_B = H_q( \\|\\hat{\\theta}_{\\mbox{CL}}\\|^2/q )\n",
"\\end{equation}\n",
"rather than estimating using either the norm of $X$ or the norm of the\n",
"James-Stein estimate $\\hat{\\theta}_{S}$ of $\\theta$.\n",
"\n",
"\n",
"**Proof.**\n",
"The Lemma implies that under the conditions of the theorem,\n",
"$\\| \\hat{\\theta}_{\\mbox{CL}}\\|^2/q \\rightarrow a$,\n",
"$\\|X\\|^2/q \\rightarrow 1+a$, and $\\| \\hat{\\theta}_S \\|^2 /q \\rightarrow a^2/(1+a)$."
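,
"\n",
"\n",
"A quick simulation (an added illustration, not from Beran's paper) that checks these three limits numerically for a single large $q$:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"q, a = 10_000, 2.0\n",
"gamma = np.full(q, np.sqrt(a))          # so that ||gamma||^2 / q = a\n",
"X = gamma + rng.standard_normal(q)      # X ~ N(gamma, I)\n",
"\n",
"shrink = (q - 2) / np.sum(X**2)\n",
"theta_S = (1 - shrink) * X                    # James-Stein estimate\n",
"theta_CL = np.sqrt(max(1 - shrink, 0)) * X    # [1 - (q-2)/||X||^2]_+^{1/2} X\n",
"\n",
"print(np.sum(theta_CL**2) / q, a)             # close to a\n",
"print(np.sum(X**2) / q, 1 + a)                # close to 1 + a\n",
"print(np.sum(theta_S**2) / q, a**2 / (1 + a)) # close to a^2/(1+a)\n",
"```\n",
"\n",
"Of the three plug-ins, only $\\|\\hat{\\theta}_{\\mbox{CL}}\\|^2/q$ converges to $a = \\lim \\|\\gamma_q\\|^2/q$, which is why the theorem says to base the bootstrap estimate of $H_q$ on $\\hat{\\theta}_{\\mbox{CL}}$."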
] }, { "cell_type": "code", "execution_count": null, "id": "d289982f-0bbc-4ee9-8149-ffd64b2ce4d7", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.4" } }, "nbformat": 4, "nbformat_minor": 5 }