{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bootstrap confidence intervals\n", "> A Summary of lecture \"Statistical Thinking in Python (Part 2)\", via datacamp\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Data Science, Statistics]\n", "- image: images/bootstrap-reg.png" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "sns.set()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def ecdf(data):\n", " \"\"\"Compute ECDF for a one-dimensional array of measurements.\"\"\"\n", " # Number of data points: n\n", " n = len(data)\n", "\n", " # x-data for the ECDF: x\n", " x = np.sort(data)\n", "\n", " # y-data for the ECDF: y\n", " y = np.arange(1, n + 1) / n\n", "\n", " return x, y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating bootstrap replicates\n", "### Bootstrapping\n", "- The use of resampled data to perform statistical inference\n", "- Bootstrap sample : A resampled array of the data\n", "- Bootstrap replicate : A statistic computed from a resampled array\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizing bootstrap samples\n", "In this exercise, you will generate bootstrap samples from the set of annual rainfall data measured at the Sheffield Weather Station in the UK from 1883 to 2015. The data are stored in the NumPy array ```rainfall``` in units of millimeters (mm). By graphically displaying the bootstrap samples with an ECDF, you can get a feel for how bootstrap sampling allows probabilistic descriptions of data." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df = pd.read_fwf('./dataset/sheffield_weather_station.csv', skiprows=8)\n", "rainfall = df['rain'].astype('float')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for _ in range(50):\n", " # Generate bootstrap sample: bs_sample\n", " bs_sample = np.random.choice(rainfall, size=len(rainfall))\n", " \n", " # Compute and plot ECDF from bootstrap sample\n", " x, y = ecdf(bs_sample)\n", " _ = plt.plot(x, y, marker='.', linestyle='none', color='gray', alpha=0.1)\n", " \n", "# Compute and plot ECDF from original data\n", "x, y = ecdf(rainfall)\n", "_ = plt.plot(x, y, marker='.')\n", "\n", "# Make margins and label axes\n", "plt.margins(0.02)\n", "_ = plt.xlabel('yearly rainfall (mm)')\n", "_ = plt.ylabel('ECDF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bootstrap confidence intervals\n", "- Confidence interval of a statistic\n", " - If we repeated the measurements over and over again, ```p%``` of the observed values of the statistic would lie within the ```p%``` confidence interval" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generating many bootstrap replicates\n", "The function ```bootstrap_replicate_1d()``` from the video is available in your namespace. Now you'll write another function, ```draw_bs_reps(data, func, size=1)```, which generates many bootstrap replicates from the data set. 
This function will come in handy for you again and again as you compute confidence intervals and later when you do hypothesis tests.\n", "\n", "For your reference, the ```bootstrap_replicate_1d()``` function is provided below:\n", "```python\n", "def bootstrap_replicate_1d(data, func):\n", " \"\"\"Generate bootstrap replicate of 1D data.\"\"\"\n", " bs_sample = np.random.choice(data, len(data))\n", " return func(bs_sample)\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def bootstrap_replicate_1d(data, func):\n", " \"\"\"Generate bootstrap replicate of 1D data.\"\"\"\n", " bs_sample = np.random.choice(data, len(data))\n", " return func(bs_sample)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def draw_bs_reps(data, func, size=1):\n", " \"\"\"Draw bootstrap replicates.\"\"\"\n", " \n", " # Initialize array of replicates: bs_replicates\n", " bs_replicates = np.empty(size)\n", " \n", " # Generate replicates\n", " for i in range(size):\n", " bs_replicates[i] = bootstrap_replicate_1d(data, func)\n", " \n", " return bs_replicates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bootstrap replicates of the mean and the SEM\n", "In this exercise, you will compute a bootstrap estimate of the probability density function of the mean annual rainfall at the Sheffield Weather Station. Remember, we are estimating the mean annual rainfall we would get if the Sheffield Weather Station could repeat all of the measurements from 1883 to 2015 over and over again. This is a probabilistic estimate of the mean. You will plot the PDF as a histogram, and you will see that it is Normal.\n", "\n", "In fact, it can be shown theoretically that under not-too-restrictive conditions, the value of the mean will always be Normally distributed. (This does not hold in general, just for the mean and a few other statistics.) 
The standard deviation of this distribution, called the standard error of the mean, or SEM, is given by the standard deviation of the data divided by the square root of the number of data points. I.e., for a data set, ```sem = np.std(data) / np.sqrt(len(data))```. Using hacker statistics, you get this same result without the need to derive it, but you will verify this result from your bootstrap replicates." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9488593574676786\n", "0.956034708235844\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Take 10,000 bootstrap replicates of the mean: bs_replicates\n", "bs_replicates = draw_bs_reps(rainfall, np.mean, size=10000)\n", "\n", "# Compute and print SEM\n", "sem = np.std(rainfall) / np.sqrt(len(rainfall))\n", "print(sem)\n", "\n", "# Compute and print standard deviation of bootstrap replicates\n", "bs_std = np.std(bs_replicates)\n", "print(bs_std)\n", "\n", "# Make a histogram of the results\n", "_ = plt.hist(bs_replicates, bins=50, density=True)\n", "_ = plt.xlabel('mean annual rainfall (mm)')\n", "_ = plt.ylabel('PDF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Confidence intervals of rainfall data\n", "A confidence interval gives upper and lower bounds on the range of parameter values you might expect to get if we repeat our measurements. For named distributions, you can compute them analytically or look them up, but one of the many beautiful properties of the bootstrap method is that you can take percentiles of your bootstrap replicates to get your confidence interval. Conveniently, you can use the ```np.percentile()``` function.\n", "\n", "Use the bootstrap replicates you just generated to compute the 95% confidence interval. That is, give the 2.5th and 97.5th percentile of your bootstrap replicates stored as ```bs_replicates```. What is the 95% confidence interval?" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([64.86782928, 68.62363296])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.percentile(bs_replicates, [2.5, 97.5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bootstrap replicates of other statistics\n", "We saw in a previous exercise that the mean is Normally distributed. This does not necessarily hold for other statistics, but no worry: as hackers, we can always take bootstrap replicates! 
In this exercise, you'll generate bootstrap replicates for the variance of the annual rainfall at the Sheffield Weather Station and plot the histogram of the replicates.\n", "\n", "Here, you will make use of the draw_bs_reps() function you defined a few exercises ago." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Generate 10,000 bootstrap replicates of the variance: bs_replicates\n", "bs_replicates = draw_bs_reps(rainfall, np.var, size=10000)\n", "\n", "# Put the variance in units of square centimeters\n", "bs_replicates /= 100\n", "\n", "# Make a histogram of the results\n", "_ = plt.hist(bs_replicates, bins=50, density=True)\n", "_ = plt.xlabel('variance of annual rainfall (sq. cm)')\n", "_ = plt.ylabel('PDF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Confidence interval on the rate of no-hitters\n", "Consider again the inter-no-hitter intervals for the modern era of baseball. Generate 10,000 bootstrap replicates of the optimal parameter $\\tau$. Plot a histogram of your replicates and report a 95% confidence interval." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "nohitter_times = np.array([ 843, 1613, 1101, 215, 684, 814, 278, 324, 161, 219, 545,\n", " 715, 966, 624, 29, 450, 107, 20, 91, 1325, 124, 1468,\n", " 104, 1309, 429, 62, 1878, 1104, 123, 251, 93, 188, 983,\n", " 166, 96, 702, 23, 524, 26, 299, 59, 39, 12, 2,\n", " 308, 1114, 813, 887, 645, 2088, 42, 2090, 11, 886, 1665,\n", " 1084, 2900, 2432, 750, 4021, 1070, 1765, 1322, 26, 548, 1525,\n", " 77, 2181, 2752, 127, 2147, 211, 41, 1575, 151, 479, 697,\n", " 557, 2267, 542, 392, 73, 603, 233, 255, 528, 397, 1529,\n", " 1023, 1194, 462, 583, 37, 943, 996, 480, 1497, 717, 224,\n", " 219, 1531, 498, 44, 288, 267, 600, 52, 269, 1086, 386,\n", " 176, 2199, 216, 54, 675, 1243, 463, 650, 171, 327, 110,\n", " 774, 509, 8, 197, 136, 12, 1124, 64, 380, 811, 232,\n", " 192, 731, 715, 226, 605, 539, 1491, 323, 240, 179, 702,\n", " 156, 82, 1397, 354, 778, 603, 1001, 385, 986, 203, 149,\n", " 576, 445, 180, 1403, 252, 675, 1351, 2983, 1568, 45, 899,\n", " 3260, 1025, 31, 100, 2055, 4043, 79, 238, 3931, 2351, 595,\n", " 110, 215, 0, 563, 206, 660, 242, 577, 179, 157, 192,\n", " 192, 
1848, 792, 1693, 55, 388, 225, 1134, 1172, 1555, 31,\n", " 1582, 1044, 378, 1687, 2915, 280, 765, 2819, 511, 1521, 745,\n", " 2491, 580, 2072, 6450, 578, 745, 1075, 1103, 1549, 1520, 138,\n", " 1202, 296, 277, 351, 391, 950, 459, 62, 1056, 1128, 139,\n", " 420, 87, 71, 814, 603, 1349, 162, 1027, 783, 326, 101,\n", " 876, 381, 905, 156, 419, 239, 119, 129, 467])" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "95% confidence interval = [663.32300797 871.85756972] games\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Draw bootstrap replicates of the mean no-hitter time (equal to tau): bs_replicates\n", "bs_replicates = draw_bs_reps(nohitter_times, np.mean, 10000)\n", "\n", "# Compute the 95% confidence interval: conf_int\n", "conf_int = np.percentile(bs_replicates, [2.5, 97.5])\n", "\n", "# Print the confidence interval\n", "print('95% confidence interval =', conf_int, 'games')\n", "\n", "# Plot the histogram of the replicates\n", "_ = plt.hist(bs_replicates, bins=50, density=True)\n", "_ = plt.xlabel(r'$\\tau$ (games)')\n", "_ = plt.ylabel('PDF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pairs bootstrap\n", "- Nonparametric inference\n", " - Make no assumptions about the model or probability distribution underlying the data\n", "- Pairs bootstrap for linear regression\n", " - Resample data in pairs\n", " - Compute slope and intercept from resampled data\n", " - Each slope and intercept is a bootstrap replicate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A function to do pairs bootstrap\n", "As discussed in the video, pairs bootstrap involves resampling pairs of data. Each collection of pairs is fit with a line, in this case using ```np.polyfit()```. We do this again and again, getting bootstrap replicates of the parameter values. To have a useful tool for doing pairs bootstrap, you will write a function to perform pairs bootstrap on a set of ```x,y``` data." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def draw_bs_pairs_linreg(x, y, size=1):\n", " \"\"\"Perform pairs bootstrap for linear regression.\"\"\"\n", "\n", " # Set up array of indices to sample from: inds\n", " inds = np.arange(len(x))\n", "\n", " # Initialize replicates: bs_slope_reps, bs_intercept_reps\n", " bs_slope_reps = np.empty(size)\n", " bs_intercept_reps = np.empty(size)\n", "\n", " # Generate replicates\n", " for i in range(size):\n", " bs_inds = np.random.choice(inds, size=len(inds))\n", " bs_x, bs_y = x[bs_inds], y[bs_inds]\n", " bs_slope_reps[i], bs_intercept_reps[i] = np.polyfit(bs_x, bs_y, deg=1)\n", "\n", " return bs_slope_reps, bs_intercept_reps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pairs bootstrap of literacy/fertility data\n", "Using the function you just wrote, perform pairs bootstrap to plot a histogram describing the estimate of the slope from the illiteracy/fertility data. Also report the 95% confidence interval of the slope. The data is available to you in the NumPy arrays ```illiteracy``` and ```fertility```.\n", "\n", "As a reminder, ```draw_bs_pairs_linreg()``` has a function signature of ```draw_bs_pairs_linreg(x, y, size=1)```, and it returns two values: ```bs_slope_reps``` and ```bs_intercept_reps```." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('./dataset/female_literacy_fertility.csv')\n", "fertility = np.array(df['fertility'])\n", "illiteracy = np.array(100 - df['female literacy'])" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.04409522 0.05545012]\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Generate replicates of slope and intercept using pairs bootstrap\n", "bs_slope_reps, bs_intercept_reps = draw_bs_pairs_linreg(illiteracy, fertility, 1000)\n", "\n", "# Compute and print 95% CI for slope\n", "print(np.percentile(bs_slope_reps, [2.5, 97.5]))\n", "\n", "# Plot the histogram\n", "_ = plt.hist(bs_slope_reps, bins=50, density=True)\n", "_ = plt.xlabel('slope')\n", "_ = plt.ylabel('PDF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotting bootstrap regressions\n", "A nice way to visualize the variability we might expect in a linear regression is to plot the line you would get from each bootstrap replicate of the slope and intercept. Do this for the first 100 of your bootstrap replicates of the slope and intercept.\n", "\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Generate array of x-values for bootstrap lines: x\n", "x = np.array([0, 100])\n", "\n", "# Plot the bootstrap lines\n", "for i in range(100):\n", " _ = plt.plot(x, \n", " bs_slope_reps[i] * x + bs_intercept_reps[i],\n", " linewidth=0.5, alpha=0.2, color='red')\n", "\n", "# Plot the data\n", "_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')\n", "\n", "# Label axes, set the margins, and show the plot\n", "_ = plt.xlabel('illiteracy')\n", "_ = plt.ylabel('fertility')\n", "plt.margins(0.02)\n", "plt.savefig('../images/bootstrap-reg.png')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }