{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy import stats\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculate confidence intervals on prevalence that appear in Skowronski et al preprint (July 15, 2020)\n", "\n", "For each value of prev (=prevalence), calculate the distribution of possible outcomes (ie. number of observed cases). The 95% confidence belt contains the central 95% of possible outcomes. The 95% confidence interval, are those values of prev for which the actual observation is in the central 95% of possible outcomes.\n", "\n", "* n_trial: number of trials\n", "* obs: observed number of positives\n", "* n_control: number of negative samples used to means false positive rate (ie. specificity)\n", "* n_fp: number of false positives in the control\n", "* sensitiviy: fraction of positive samples that would be identifies as positive\n", "\n", "The sensitivity is assumed to be known well. The confidence intervals are not very sensitive to this uncertainty.\n", "\n", "Distribution of possible outcomes: use binomial for signal+background on \"n_trial\" trials\n", "\n", "The background is from false positives. We know the control: \"n_fp\" false positives in \"n_control\" negative sample size. \n", "The knowledge from the control test is encapsulated as the posterior probability distribution of fpr from that control study. Use a uniform prior for fpr. The posterior probability for fpr proportional to the likelihood.\n", "\n", "Make many repetitions of the experiment. For each repetition of the experiment draw fpr from its distributions, and then do a binomial throw given the expectation of signal+background." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "study = 7\n", "joint = False\n", "\n", "if study == 1:\n", " obs = [7]\n", " n_trial = [870]\n", " n_control = [470 + 390 + 99]\n", " n_fp = [3] \n", " sensitivity = [0.85]\n", "elif study == 2:\n", " obs = [4]\n", " n_trial = [869]\n", " n_control = [149 + 105]\n", " n_fp = [1]\n", " sensitivity = [0.93]\n", "elif study == 3:\n", " obs = [7,4]\n", " n_trial = [870,869]\n", " n_control = [470 + 390 + 99,149 + 105]\n", " n_fp = [3,1] \n", " sensitivity = [0.85,0.93]\n", "elif study == 4:\n", " obs = [2]\n", " n_trial = [869]\n", " n_control = [1091+399+100]\n", " n_fp = [2]\n", " sensitivity = [0.85*0.86]\n", "elif study == 5:\n", " obs = [6]\n", " n_trial = [889]\n", " n_control = [470 + 390 + 99]\n", " n_fp = [3] \n", " sensitivity = [0.85]\n", "elif study == 6:\n", " obs = [7]\n", " n_trial = [885]\n", " n_control = [149 + 105]\n", " n_fp = [1]\n", " sensitivity = [0.93]\n", "elif study == 7:\n", " joint = True\n", " obs = [4]\n", " n_trial = [885]\n", " n_control = [470 + 390 + 99,149 + 105]\n", " n_fp = [3,1] \n", " sensitivity = [0.93*0.85]\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This is the un-normalized posterior for fpr after the control experiment is done (uniform prior)\n", "fpr_plot = np.arange(0.,0.1,0.001)\n", "for i in range(len(n_fp)):\n", " p_fpr_plot = stats.binom.pmf(n_fp[i], n_control[i], fpr_plot)\n", " plt.plot(fpr_plot,p_fpr_plot,label=str(n_fp[i])+'/'+str(n_control[i]))\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# draw a fpr from that distribution\n", "def get_fpr(i_fp, i_control):\n", " nu = 1.*i_fp/i_control\n", " big = stats.binom.pmf(i_fp, i_control, nu)\n", " max_fpr = 0.05 # choose this value using the plot above, fpr will be restricted to be below this value\n", " fpr_result = -1.\n", " while fpr_result < 0.:\n", " fpr_trial = stats.uniform.rvs(0.,max_fpr)\n", " pdf = stats.binom.pmf(i_fp, i_control, fpr_trial)\n", " if stats.uniform.rvs(0.,big) < pdf:\n", " fpr_result = fpr_trial\n", " return fpr_result" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Do repetions of experiment to find boundaries of the x% CL interval on prevalence\n", "# \n", "cl = 0.95\n", "upper = 1.-(1-cl)/2.\n", "lower = (1-cl)/2.\n", "\n", "# Do two studys, one coarse to find the approximate boundaries (3 digits) then a fine one to 4 digits\n", "\n", "# number of repititions to use:\n", "n_rep_coarse = 1000\n", "n_rep_fine = 10000\n", "n_reps = [n_rep_coarse, n_rep_fine]\n", "\n", "# range of prevalences to consider:\n", "prevs_course = np.arange(0., 0.025, 0.001)\n", "prevs_fine = np.arange(0.0142,0.0148,0.0001)\n", "prevs = [prevs_course,prevs_fine]\n", "\n", "sum_obs = 0\n", "for ob in obs:\n", " sum_obs += ob\n", "\n", "for i_study in range(1):\n", " first_in = -1\n", " last_in = -1\n", " prev_list = prevs[i_study]\n", " for prev in prev_list:\n", " results = []\n", " for i in range(n_reps[i_study]):\n", " total = 0\n", " if not joint:\n", " for j in range(len(n_fp)):\n", " fpr = get_fpr(n_fp[j], n_control[j])\n", " prob = prev*sensitivity[j] + fpr\n", " total += stats.binom.rvs(n_trial[j], prob)\n", " results.append(total)\n", " else:\n", " fpr = 1.\n", " for j in range(len(n_fp)):\n", " fpr *= get_fpr(n_fp[j], n_control[j])\n", " prob = prev*sensitivity[0] + fpr\n", " total += 
stats.binom.rvs(n_trial[0], prob)\n", " results.append(total)\n", "\n", " results.sort()\n", " i_upper = int(upper*n_reps[i_study])\n", " i_lower = int(lower*n_reps[i_study])\n", " \n", " print('prevalence = {:.4f}'.format(prev)+' '+str(cl*100)+'% range =['+\n", " str(results[i_lower])+','+str(results[i_upper])+']')\n", "\n", " if first_in<0:\n", " if sum_obs>=results[i_lower] and sum_obs<=results[i_upper]:\n", " first_in = prev\n", " if sum_obs>=results[i_lower] and sum_obs<=results[i_upper]:\n", " last_in = prev\n", " if last_in > 0 and sum_obs