{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fonnesbeck/Bios8366/blob/master/notebooks/Section3_2-Bootstrapping.ipynb)\n",
    "\n",
    "# Bootstrapping\n",
    "\n",
    "Parametric inference can be non-robust:\n",
    "\n",
    "- inaccurate if parametric assumptions are violated\n",
    "- if we rely on asymptotic results, we may not achieve an acceptable level of accuracy\n",
    "\n",
    "Parmetric inference can be difficult:\n",
    "- derivation of sampling distribution may not be possible\n",
    "\n",
    "An alternative is to estimate the sampling distribution of a statistic **empirically** without making assumptions about the form of the population.\n",
    "\n",
    "The bootstrap is a simulation tool for assessing accuracy. It is related to cross-validation, which we will encounter later in the course when we talk about tuning machine learning models.\n",
    "\n",
    "This approach is most commonly used as a non-parametric method for calculating standard errors and confidence intervals. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "sns.set_context('notebook')\n",
    "\n",
    "np.random.seed(42)\n",
    "\n",
    "DATA_URL = 'https://raw.githubusercontent.com/fonnesbeck/Bios8366/master/data/'\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The Jackknife\n",
    "\n",
    "A simple precursor to bootstrapping is the jackknife (Quenouille 1949), which facilitates the estimation of the bias and variance of an estimator. Recall that the bias of an estimator $\\widehat{\\theta}$ of $\\theta$ is:\n",
    "\n",
    "$$Bias(\\widehat{\\theta}) = E(\\widehat{\\theta}) - \\theta$$\n",
    "\n",
    "Consider calculating an estimate using $n-1$ values from the dataset $\\widehat{\\theta}_{(-i)}$ (that is, the $i^{th}$ value is removed). If this is repeated for each $i=1,\\ldots,n$ observation, we can average them to obtain:\n",
    "\n",
    "$$\\bar{\\theta}_{(n)} = n^{-1} \\sum_i \\widehat{\\theta}_{(-i)}$$\n",
    "\n",
    "The **jackknife bias estimate** is:\n",
    "\n",
    "$$b_{jack} = (n-1)(\\bar{\\theta}_{(n)} - \\widehat{\\theta})$$\n",
    "\n",
    "It can be shown that $b_{jk}$ estimates the bias up to order $O(n^{-2})$.\n",
    "\n",
    "Thus, a bias-corrected estimator is:\n",
    "\n",
    "$$\\widehat{\\theta}_{jack} = \\widehat{\\theta} - b_{jack}$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def jackknife_bias(data, func, **kwargs):\n",
    "    theta_hat = func(data, **kwargs)\n",
    "    n = data.shape[0]\n",
    "    idx = np.arange(n)\n",
    "    theta_jack = sum(func(data[idx!=i], **kwargs) for i in range(n))/n\n",
    "    return (n-1) * (theta_jack - theta_hat)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.random.normal(0, 2, 100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.std(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.std(x) - jackknife_bias(x, np.std)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another way of expressing $\\widehat{\\theta}_{jack}$ is:\n",
    "\n",
    "$$\\widehat{\\theta}_{jack} = (n^{-1}) \\sum_i  \\widetilde{\\theta}_i$$\n",
    "\n",
    "where $\\widetilde{\\theta}_i$ are known as **pseudovalues** and are defined as:\n",
    "\n",
    "$$\\widetilde{\\theta}_i = n \\widehat{\\theta} - (n-1) \\widehat{\\theta}_{(-i)}$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def jackknife(data, func, **kwargs):\n",
    "    theta_hat = func(data, **kwargs)\n",
    "    n = data.shape[0]\n",
    "    idx = np.arange(n)\n",
    "    return sum(n*theta_hat - (n-1)*func(data[idx!=i], **kwargs) for i in range(n))/n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "jackknife(x, np.std)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Correspondingly, the **jackknife variance estimate** of the estimator $\\widehat{\\theta}$ is:\n",
    "\n",
    "$$v_{jack} = \\frac{\\widetilde{s}^2}{n}$$\n",
    "\n",
    "where $\\widetilde{s}^2$ is the sample variance of the pseudo-values:\n",
    "\n",
    "$$\\widetilde{s}^2 = \\frac{\\sum_i (\\widetilde{\\theta}_i - n^{-1}\\sum_j \\widetilde{\\theta}_j)^2}{n-1}$$\n",
    "\n",
    "Under certain regularity conditions, $v_{jack}$ is a consistent estimator of $V(\\widehat{\\theta})$. It can be inconsistent for some statistics, such as the median, that are not smooth."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise\n",
    "\n",
    "Write a function that implements a jackknife variance estimator."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# write your answer here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The Bootstrap\n",
    "\n",
    "The bootstrap is a resampling method discovered by [Brad Efron](http://www.jstor.org/discover/10.2307/2958830?uid=3739568&uid=2&uid=4&uid=3739256&sid=21102342537691) that allows one to approximate the true sampling distribution of a dataset, and thereby obtain estimates of the mean and variance of some function of the distribution.\n",
    "\n",
    "In general, consider a statistic $T_n = g(X_1, \\ldots, X_n)$, which has variance $V_F(T_n)$. Note this implies that the variance is a function of the underlying $F$, which is unknown. \n",
    "\n",
    "One approach is to estimate $V_F(T_n)$ with some $V_{F_n}(T_n)$, which is a plug-in estimator of the variance. But, $V_{F_n}(T_n)$ may be difficult to compute, so we can attempt to approximate it with a **simulation estimate** $V_{boot}$. \n",
    "\n",
    "Here is the algorithm:\n",
    "\n",
    "1. Draw $X_{1}^*, \\ldots, X_{n}^* \\sim \\widehat{F}_n$\n",
    "2. For each r in R iterations: \n",
    "    Calculate statistic $T^*_{rn} = g(X^*_{r1}, \\ldots, X^*_{rn})$\n",
    "3. Calculate variance: $V_{boot} = \\frac{1}{R} \\sum_r \\left(T^*_{rn} - \\bar{T}^*_{.n} \\right)^2$\n",
    "\n",
    "Rather than finding a way of drawing $n$ points at random from $\\widehat{F}_n$, we can instead draw a sample of size $n$ **with replacement from the original data**. So, step 1 can be modified to:\n",
    "\n",
    "> 1. Draw $X_{1}^*, \\ldots, X_{n}^*$ with replacement from $X_{1}, \\ldots, X_{n}$\n",
    "\n",
    "\n",
    "These are called **bootrstrap samples**:\n",
    "\n",
    "$$S_1^* = \\{X_{11}^*, X_{12}^*, \\ldots, X_{1n}^*\\}$$\n",
    "\n",
    "We regard S as an \"estimate\" of population P\n",
    "\n",
    "> population : sample :: sample : bootstrap sample"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To generate samples with replacement in Python, there are a handful of approaches. We can use NumPy to generate random integers (`np,random.randint`), and use these to index DataFrame rows with `iloc`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    prostate_data = pd.read_table('../data/prostate.data.txt', index_col=0)\n",
    "except FileNotFoundError:\n",
    "    prostate_data = pd.read_table(DATA_URL + 'prostate.data.txt', index_col=0)\n",
    "prostate_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "random_ind = np.random.randint(0, prostate_data.shape[0], 5)\n",
    "prostate_data.iloc[random_ind]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "pandas' `sample` method makes this even easier, and allows for custom sampling probabilities when non-uniform sampling is desired."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prostate_data.sample(5, replace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise\n",
    "\n",
    "Generate 20 samples with replacement from the prostate dataset, weighting samples inversely by age."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your answer here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Bootstrap Estimates\n",
    "\n",
    "From our bootstrapped samples, we can extract *estimates* of the expectation and its variance:\n",
    "\n",
    "$$\\bar{T}^* = \\hat{E}(T^*) = \\frac{\\sum_r T_r^*}{R}$$\n",
    "\n",
    "$$\\hat{\\text{Var}}(T^*) = \\frac{\\sum_r (T_r^* - \\bar{T}^*)^2}{R-1}$$\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Under appropriate regularity conditions, $\\frac{s^2}{\\hat{\\text{Var}}(T^*)} \\rightarrow 1$ as $n \\rightarrow \\infty$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bootstrap Confidence Intervals\n",
    "\n",
    "There are a handful of ways for constructing confidence intervals from bootstrap samples, varying in ease of calculation and accuracy."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Bootstrap Normal Interval\n",
    "\n",
    "Perhaps the simplest bootstrap interval is the normal interval:\n",
    "\n",
    "$$T_n ± z_{α/2} \\widehat{SE}_{boot}$$\n",
    "\n",
    "where $\\widehat{SE}_{boot}$ is an estimate of the standard error using the bootstrap sample. Of course, this interval is not accurate unless the distribution of $T_n$ is close to Gaussian.\n",
    "\n",
    "\n",
    "We can first define a **pivotal interval**. Let $\\theta = T(F)$ and $\\widehat{\\theta}_n = T(\\widehat{F}_n)$, and further define $P_n = \\widehat{\\theta}_n - \\theta$. \n",
    "\n",
    "Let $H(p)$ denote the CDF of the pivot:\n",
    "\n",
    "$$\n",
    "H(p) = Pr_F(P_n \\le p)\n",
    "$$\n",
    "\n",
    "Now define the interval $C_n = (a, b)$ where:\n",
    "\n",
    "$$\\begin{aligned}\n",
    "a &= \\widehat{\\theta}_n - H^{-1}(1-\\alpha/2) \\\\\n",
    "b &= \\widehat{\\theta}_n - H^{-1}(\\alpha/2).\n",
    "\\end{aligned}$$\n",
    "\n",
    "It can be shown that $C_n$ is an exact $1 - \\alpha$ confidence interval for $\\theta$. But clearly, $a$ and $b$ depend on the unknown $H$. \n",
    "\n",
    "A pivot confidence interval uses a bootstrap estimate of H as an approximation:\n",
    "\n",
    "$$\\widehat{H}_P = \\frac{1}{R} \\sum_r I\\left[(\\widehat{\\theta}_{nr} - \\widehat{\\theta}_{n.}) < p\\right]$$\n",
    "\n",
    "From this, an approximate $1-\\alpha$ confidence interval is $C_n = (\\widehat{a}, \\widehat{b})$:\n",
    "\n",
    "$$\\begin{aligned}\n",
    "\\widehat{a} &= \\widehat{\\theta}_n - \\widehat{H}^{-1}(1-\\alpha/2) = 2\\widehat{\\theta}_n - \\theta^*_{1 - \\alpha/2} \\\\\n",
    "\\widehat{b} &= \\widehat{\\theta}_n - \\widehat{H}^{-1}(\\alpha/2) = 2\\widehat{\\theta}_n - \\theta^*_{\\alpha/2}.\n",
    "\\end{aligned}$$\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's calculate the pivot interval for the standard deviation of the log-weights in the prostate dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "weights = prostate_data.lweight.copy()\n",
    "n = weights.shape[0]\n",
    "weights.std(ddof=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "R = 1000\n",
    "samples = np.array([weights.sample(n, replace=True) for r in range(R)])\n",
    "samples.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "theta_hat = weights.std(ddof=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "estimates = samples.std(axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "2*theta_hat - np.percentile(estimates, [97.5, 2.5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An alternative approach for calculating a pivotal interval is to define:\n",
    "\n",
    "$$Z_n = \\frac{T_n - \\theta}{\\widehat{SE}_{boot}}$$\n",
    "\n",
    "which can be approximated with: \n",
    "\n",
    "$$Z^*_{nr} = \\frac{T^*_{nr} - T_n}{\\widehat{SE}^*_{boot}}$$\n",
    "\n",
    "where $\\widehat{SE}^*_{boot}$ is the standard error of $T^*_{nr}$.\n",
    "\n",
    "Here, the quantiles of $Z^*_{n1}, \\ldots, Z^*_{nR}$ should approximate the true quantiles of the distribution of $Z_n$. We can then calculate the following interval:\n",
    "\n",
    "$$C_n = (T_n - z^*_{1-\\alpha/2}\\widehat{SE}_{boot}, T_n - z^*_{\\alpha/2}\\widehat{SE}_{boot})$$\n",
    "\n",
    "where $z^*_{a}$ is the $a$ sample quantile of $Z^*_{n1}, \\ldots, Z^*_{nR}$.\n",
    "\n",
    "This is a **studentized pivotal interval**.\n",
    "\n",
    "This interval is more computationally-intensive because $\\widehat{SE}^*_{boot}$ has to be calculated for each bootstrap sample, but it has better accuracy than the non-studentized interval."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise\n",
    "\n",
    "Write a function to calculate studentized pivotal intervals for arbitrary functions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your answer here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Bootstrap Percentile Intervals\n",
    "\n",
    "An even simpler interval involves using the empirical quantiles of the bootstrapped statistics. Consider a monotone transformation $U = m(T)$ such that $U \\sim N(m(\\theta), c^2)$. Importantly, we don't need to know what $m$ is, only that it exists. If we let $U^*_r = m(T^*_r)$, then $U^*_{(R\\alpha/2)} = m(T^*_{(R\\alpha/2)})$ because the monotone transformation preserves the quantiles.\n",
    "\n",
    "Since $U \\sim N(m(\\theta), c^2)$, the $\\alpha/2$ quantile of $U$ is $m(\\theta) - z_{\\alpha/2} c$. From this, it can be shown that:\n",
    "\n",
    "$$Pr(T^*_{R\\alpha/2} \\le \\theta \\le T^*_{R(1 - \\alpha/2)}) = Pr\\left( -z_{\\alpha/2} \\le \\frac{U - m(T)}{c} \\le z_{\\alpha/2} \\right) = 1 - \\alpha$$\n",
    "\n",
    "This employs the *ordered* bootstrap replicates:\n",
    "\n",
    "$$T_{(1)}^*, T_{(2)}^*, \\ldots, T_{(R)}^*$$\n",
    "\n",
    "Simply extract the $100(\\alpha/2)$ and $100(1-\\alpha/2)$ percentiles:\n",
    "\n",
    "$$T_{[(R+1)\\alpha/2]}^* \\lt \\theta \\lt T_{[(R+1)(1-\\alpha/2)]}^*$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Extract subset of varibles, and take bootstrap samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_subset = prostate_data[['lcavol', 'lweight', 'lbph', 'lcp', 'lpsa']]\n",
    "\n",
    "bootstrap_data = (data_subset.sample(data_subset.shape[0], replace=True) for _ in range(1000))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will user scikit-learn's `LinearRegression` model to fit a regression model to each bootstrap sample, and store the resulting coefficients."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "regmod = LinearRegression()\n",
    "\n",
    "# Empty array to store estimates\n",
    "coefs = np.empty((1000, 4))\n",
    "\n",
    "for i,X in enumerate(bootstrap_data):\n",
    "    y = X.pop('lpsa')\n",
    "    regmod.fit(X, y)\n",
    "    coefs[i] = regmod.coef_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "coefs.mean(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sort coefficients, and extract percentiles:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "coefs.sort(axis=0)\n",
    "boot_se = coefs[[25, 975], :].T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot means\n",
    "coef_means = coefs.mean(0)\n",
    "plt.plot(coef_means, 'ro')\n",
    "\n",
    "# Plot bootstrap intervals\n",
    "for i in range(len(coef_means)):\n",
    "    plt.errorbar(x=[i,i], y=boot_se[i], color='red')\n",
    "    \n",
    "plt.xlim(-0.5, 3.5)\n",
    "plt.xticks(range(len(coef_means)), ['lcavol', 'lweight', 'lbph', 'lcp'])\n",
    "plt.axhline(0, color='k', linestyle='--');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note, however, that the pivot confidence interval is generally more accurate than the percentile interval.\n",
    "\n",
    "Also, its important to remember that the bootstrap is an **asymptotic method**. So, the coverage of the confidence interval is expected to be $1 - \\alpha + r_n$ where $r_n \\propto 1/\\sqrt{n}$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parametric Bootstrap\n",
    "\n",
    "The bootstrap approaches we have outlined above are all non-parametric, though parametric inference is possible. If we have $X_1, \\ldots, X_n \\sim p(x|\\theta)$, where $\\widehat{\\theta}$ is the MLE, then let $\\phi = g(\\theta)$ and $\\widehat{\\phi} = g(\\widehat{\\theta})$.\n",
    "\n",
    "An approach for estimating the standard error of $\\widehat{\\phi}$ is to compute the standard deviation of the bootstrapped $\\hat{\\phi^*}_1, \\ldots, \\hat{\\phi^*}_R$. To do this, we can draw bootstrap samples from the **parametric** distribution $p(x|\\widehat{\\theta})$:\n",
    "\n",
    "$$\\widehat{X}_1^*, \\ldots, \\widehat{X}_n^* \\sim p(x|\\widehat{\\theta})$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## References\n",
    "\n",
    "Wasserman, L. (2006). All of Nonparametric Statistics. Springer Science & Business Media."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}