{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Week 9 Day 1: Confidence intervals\n",
    "\n",
    "## Objectives:\n",
    "* Look at sampling confidence intervals\n",
    "* Look at frequentist confidence intervals"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Confidence intervals example\n",
    "\n",
    "Based loosely on [A very friendly introduction to confidence intervals](https://towardsdatascience.com/a-very-friendly-introduction-to-confidence-intervals-9add126e714)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import scipy.stats\n",
    "import scipy.stats as stats\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at taking a small sample of a larger distribution. We'll use the same example given in the reference; this was a poll of the people in the US to see what fraction likes Soccer. Let's assume for the moment we know the real mean, and have the original distribution handy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "love_soccer_prop = 0.65  # Real percentage of people who love soccer\n",
    "total_population = 325 * 10 ** 6  # Total population in the U.S. (325M)\n",
    "# Note: 325e6 would be a floating point number\n",
    "\n",
    "love_soccer_num = int(love_soccer_prop * total_population)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Bools are smaller than ints, and we can still take the mean, sum, etc. of an array of bools, so we'll use bools. We'll start off with a \"sorted\" array (people who love Soccer first), we we'll need to be careful when we select samples from it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "population = np.zeros(total_population, dtype=bool)\n",
    "population[:love_soccer_num] = True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.mean(population)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for i in range(10):\n",
    "    sample = np.random.choice(population, size=1000)\n",
    "    print(f\"Sample {i}: {np.mean(sample)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "values = np.empty(10_000)\n",
    "for i in range(len(values)):\n",
    "    values[i] = np.mean(np.random.choice(population, size=1_000))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.mean(values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.hist(values, bins=20)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "confidence = 0.95\n",
    "sem = np.std(sample) / np.sqrt(len(sample))  # scipy.stats.sem(sample)\n",
    "h = sem * scipy.stats.t.ppf((1 + confidence) / 2, len(sample) - 1)\n",
    "print(f\"{np.mean(sample)} +/- {h:.3g} ({confidence:.1%})\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For larger samples (more than 100), we can approximate the student's t distribution with a Gaussian:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sem = scipy.stats.sem(sample)\n",
    "h = sem * scipy.stats.norm.ppf((1 + confidence) / 2)\n",
    "print(f\"{np.mean(sample)} +/- {h:.3g} ({confidence:.1%})\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In SciPy, we can do this even quicker:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "h = scipy.stats.t.interval(confidence, len(sample) - 1, loc=np.mean(sample), scale=sem)\n",
    "print(h)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Anaconda includes another popular package for statistics: StatsModels, which complements `scipy.stats` and `pandas`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import statsmodels.stats.api as sms"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "inter = sms.DescrStatsW(sample).tconfint_mean()\n",
    "inter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Physics confidence intervals\n",
    "\n",
    "Based on the excellent slides here: <http://www.phas.ubc.ca/~oser/p509/Lec_16.pdf>\n",
    "\n",
    "For simplicity, we will assume that we are using 90% confidence intervals in the notes below.\n",
    "\n",
    "In Physics, frequentist confidence intervals are often used for measurements. We make a measurement, then construct a confidence interval from the expected underlying probability distribution for the different possible measurements; we quote the true values for which our measurement is in the 90% range.\n",
    "\n",
    "This does not tell us that there is a 90% chance the true value is in this range, rather that if we did this experiment many times, 90% of the experiments would contain the true value."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Neyman confidence interval construction:\n",
    "\n",
    "1. Calculate the PDF of obtaining a measurement $\\hat{a}$ from the true value $a$: $\\mathrm{P}(\\hat{a}|a)$\n",
    "2. Define the interval where $\\hat{a}$ has a 90% chance of occurring\n",
    "3. Repeat for all possible true $a$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we have the following distribution, the Neyman confidence interval construction *does not specify* how to draw the confidence region; any region that contains 90% of the probability could be a 90% confidence interval.\n",
    "\n",
    "$$\n",
    "\\mathrm{PDF}(x) = x e^{-\\frac{1}{2} x^2}\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sympy as s\n",
    "\n",
    "s.init_printing()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = s.symbols(\"x\")\n",
    "f = x * s.exp(-(x ** 2) / 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "f.integrate(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is normalized:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "f.integrate((x, 0, s.oo))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is one region:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "f.integrate((x, 0.32, 2.45))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fn = s.lambdify([x], f)\n",
    "\n",
    "xs = np.linspace(0, 5, 100)\n",
    "plt.plot(xs, fn(xs))\n",
    "xs = np.linspace(0.32, 2.45, 50)\n",
    "plt.fill_between(xs, fn(xs), color=\"C1\", alpha=0.3)\n",
    "plt.ylim(ymin=0)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also choose an upper limit:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "f.integrate((x, 0, 2.15))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xs = np.linspace(0, 5, 100)\n",
    "plt.plot(xs, fn(xs))\n",
    "xs = np.linspace(0, 2.15, 50)\n",
    "plt.fill_between(xs, fn(xs), color=\"C1\", alpha=0.3)\n",
    "plt.ylim(ymin=0)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How about this?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "f.integrate((x, 0, 1.09)) + f.integrate((x, 1.26, s.oo))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xs = np.linspace(0, 5, 100)\n",
    "plt.plot(xs, fn(xs))\n",
    "xs = np.linspace(0, 1.09, 50)\n",
    "plt.fill_between(xs, fn(xs), color=\"C1\", alpha=0.3)\n",
    "xs = np.linspace(1.26, 5, 50)\n",
    "plt.fill_between(xs, fn(xs), color=\"C1\", alpha=0.3)\n",
    "plt.ylim(ymin=0)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Flip flop problem\n",
    "\n",
    "Let's take the following example. Assume I am measuring from a Gaussian distribution where I know $\\mu\\ge0$:\n",
    "\n",
    "$$\n",
    "\\mathrm{P}(\\hat{\\mu} | \\mu) =\n",
    " \\frac{1}{\\sqrt{2\\pi}}\n",
    " \\exp{\\left[\n",
    "  -\\frac{1}{2} \\left(\n",
    "  \\mu - \\hat{\\mu}\n",
    "  \\right)^2\n",
    "\\right]}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For fixed true value $\\mu$, 90% of the probability lies in $\\mu-1.28<\\hat{\\mu}<\\infty$. Or $\\mu-1.65<\\hat{\\mu}<\\mu+1.65$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "\n",
    "# Upper limit quote\n",
    "x = np.linspace(0, 3)\n",
    "ax.fill_between(x, 1.28 + x, alpha=0.3)\n",
    "\n",
    "# Upper and lower bounds\n",
    "x = np.linspace(3, 4)\n",
    "ax.fill_between(x, 1.65 + x, x - 1.65, alpha=0.3)\n",
    "\n",
    "# Quote as 0\n",
    "x = np.linspace(-2, 0)\n",
    "ax.fill_between(x, 1.28 + x * 0, alpha=0.3)\n",
    "\n",
    "ax.set_xlim(-2, 4)\n",
    "ax.set_ylim(0, 6)\n",
    "ax.grid()\n",
    "ax.set_aspect(\"equal\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Feldman-Cousins\n",
    "\n",
    "A special prescription for building the limits intervals that transitions properly and avoids empty set when near a physical boundary (like 0) by using the following ordering ratio:\n",
    "\n",
    "$$\n",
    "R = \\frac{P(x∣\\mu)}\n",
    "         {P(x∣\\mu_\\mathrm{best})}\n",
    "$$\n",
    "\n",
    "You simply select the 90% with the highest R."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From out previous example, $\\mu_\\mathrm{best}=x$ if $x>0$ or  $\\mu_\\mathrm{best}=0$ if $x\\le 0$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Feldman-cousins from GammaPy (from pip)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gammapy.stats as gstats"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_bins = np.arange(0, 50)\n",
    "mu_bins = np.linspace(0, 15, int(15 / 0.005) + 1, endpoint=True)\n",
    "matrix = [stats.poisson(mu + 3.0).pmf(x_bins) for mu in mu_bins]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(matrix[0], \".\")\n",
    "plt.plot(matrix[1000], \".\")\n",
    "plt.plot(matrix[2000], \".\")\n",
    "plt.plot(matrix[-1], \".\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "acceptance_intervals = gstats.fc_construct_acceptance_intervals_pdfs(matrix, 0.9)\n",
    "LowerLimitNum, UpperLimitNum, other = gstats.fc_get_limits(\n",
    "    mu_bins, x_bins, acceptance_intervals\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "u = np.array([gstats.fc_find_limit(i, UpperLimitNum, mu_bins) for i in range(14)])\n",
    "l = [gstats.fc_find_limit(i, LowerLimitNum, mu_bins) for i in range(14)]\n",
    "plt.step(range(14), u, where=\"post\", label=\"Upper limit\")\n",
    "plt.step(range(14), l, where=\"post\", label=\"Lower limit\")\n",
    "plt.xticks(range(15))\n",
    "plt.yticks(range(16))\n",
    "plt.grid()\n",
    "plt.xlabel(\"Measured\")\n",
    "plt.ylabel(\"True\")\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "compclass",
   "language": "python",
   "name": "compclass"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}