{ "cells": [ { "cell_type": "markdown", "id": "b4f4d03e-4b26-4d96-aee9-dc525795355c", "metadata": {}, "source": [ "# The Olkin-Petkau-Zidek example of MLE fragility" ] }, { "cell_type": "code", "execution_count": 1, "id": "e121995a-db28-4656-b343-bc2ba95c4ea2", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import scipy.stats as st\n", "import scipy.special\n", "import scipy.optimize" ] }, { "cell_type": "markdown", "id": "44338151-2e46-493c-94af-5cc0b6b87b0d", "metadata": {}, "source": [ "
\n", "\n", "In 1981, [Olkin, Petkau, and Zidek](https://doi.org/10.1080/01621459.1981.10477697) demonstrated an example in which MLE estimates can very wildly with only small changes in the data. Here, we will work through their example.\n", "\n", "The problem is thusly stated: Say you are measuring the outcomes of $N$ Bernoulli trials, but you can only measure a positive result; negative results are not detected in your experiment. You do know, however that $N$, while unknown, is the same for all experiments. The number of positive results you get from a set of measurements (sorted for convenience) are *n* = 16, 18, 22, 25, 27. Modeling the generative process with Binomial distribution, $n_i \\sim \\text{Binom}(\\theta, N)\\;\\;\\forall i$, we can obtain maximum likelihood estimates for $\\theta$ and $N$. \n", "\n", "Let us first attempt to make some analytical progress on computing the MLE. We start by writing the likelihood and log likelihood. Let $M$ be the number of measurements made; in this case, $M = 5$.\n", "\n", "\\begin{align}\n", "&L(\\theta, N;\\mathbf{n}) = \\prod_{i=1}^M \\frac{N!}{n_i!(N-n_i)!}\\,\\theta^{n_i}(1-\\theta)^{N-n_i},\\\\[1em]\n", "&\\ell(\\theta, N;\\mathbf{n}) = M\\ln N! + \\sum_{i=1}^M \\left[-\\ln n_i! - \\ln\\left(N-n_i\\right)! + n_i \\ln \\theta + (N-n_i)\\ln(1-\\theta)\\right].\n", "\\end{align}\n", "\n", "A necessary condition for the MLE is that\n", "\n", "\\begin{align}\n", "\\frac{\\partial \\ell}{\\partial \\theta} = \\sum_{i=1}^M\\,\\frac{n_i}{\\theta} - \\frac{N-n_i}{1-\\theta} \n", "= M\\left(\\frac{\\bar{n}}{\\theta} - \\frac{N}{1-\\theta} + \\frac{\\bar{n}}{1-\\theta}\\right)= 0,\n", "\\end{align}\n", "\n", "where $\\bar{n}$ denotes the average of the $M$ observed $n$ values. Solving for the MLE value of $\\theta$ yields\n", "\n", "\\begin{align}\n", "\\theta^* = \\frac{\\bar{n}}{N},\n", "\\end{align}\n", "\n", "which is perhaps not surprising. We are left to find the MLE for $N$, which is the value of $N$ for which the log likelihood\n", "\n", "\\begin{align}\n", "\\ell = M\\ln N! + \\sum_{i=1}^M \\left[-\\ln n_i! - \\ln\\left(N-n_i\\right)! + n_i\\,\\ln \\frac{\\bar{n}}{N} + (N-n_i)\\ln\\left(1-\\frac{\\bar{n}}{N}\\right)\\right]\n", "\\end{align}\n", "\n", "is maximal. Equivalently, we seek $N$ such that the quantity\n", "\n", "\\begin{align}\n", "\\ln N! - \\bar{n}\\,\\ln N + (N-\\bar{n}) \\ln\\left(1-\\frac{\\bar{n}}{N}\\right) - \\frac{1}{M}\\sum_{i=1}^M \\ln(N-n_i)!\n", "\\end{align}\n", "\n", "is maximal.\n", "\n", "Computing the MLE for a discrete parameter can be a bit tricky. We could take the approach of exhaustively enumerating the log likelihood for many values of $N$ and then selecting the value of $N$ that give the largest log likelihood. Here is an implementation of that." 
] }, { "cell_type": "code", "execution_count": 2, "id": "0551a4e5-ad48-441c-8508-0dd0cef34704", "metadata": {}, "outputs": [], "source": [ "def mle_exhaustive(n, N_min=None, N_max=1000):\n", " \"\"\"Compute MLE for N and theta.\"\"\"\n", " # Values of N we will enumerate\n", " N = np.arange(np.max(n) if N_min is None else N_min, N_max+1)\n", "\n", " # MLE for theta as a function of N\n", " theta = np.mean(n) / N\n", "\n", " # Compute log likelihood\n", " log_L = [\n", " np.sum(st.binom.logpmf(n, N_val, theta_val)) for N_val, theta_val in zip(N, theta)\n", " ]\n", "\n", " # Find index of MLE of N\n", " i_mle = np.argmax(log_L)\n", "\n", " # Compute MLE\n", " N_mle = N[i_mle]\n", " theta_mle = np.mean(n) / N_mle\n", " \n", " return N_mle, theta_mle" ] }, { "cell_type": "markdown", "id": "5b8c076b-631c-49ba-87d2-c111d071db84", "metadata": {}, "source": [ "Let's put our function to use and compute the MLE!" ] }, { "cell_type": "code", "execution_count": 3, "id": "f6309687-e856-4ef7-b85c-ac3d576f95a1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(99, 0.2181818181818182)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Observed data\n", "n = np.array([16, 18, 22, 25, 27])\n", "\n", "mle_exhaustive(n)" ] }, { "cell_type": "markdown", "id": "57e02982-9589-4721-a4bd-bb0230aa448c", "metadata": {}, "source": [ "We get $N^* = 99$ and $\\theta^* = 0.22$.\n", "\n", "But now, let's say that the final measurement has 28 positive results instead of 27. That's a change of a single positive result in just one of the five experiments. Now let's compute the MLE for this new data set." ] }, { "cell_type": "code", "execution_count": 4, "id": "f5e2e281-7276-445b-af01-548a159af405", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(191, 0.11413612565445026)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Observed data\n", "n = np.array([16, 18, 22, 25, 28])\n", "\n", "mle_exhaustive(n)" ] }, { "cell_type": "markdown", "id": "42bb0a25-1009-49c4-872f-b29ba83cf1d7", "metadata": {}, "source": [ "Oof! $N^*$ almost doubled to 191 and $\\theta^*$ halved to 0.11. With just a tiny change in the data, we can get wildly different estimates for the parameters. This is an example of the fragility of the MLE.\n", "\n", "To get a handle on how our MLE would change with newly acquired data sets, we will do parametric bootstrap to compute confidence intervals for $N^*$ and $\\theta^*$ using the original data set with 27 successes for the fifth trial. Computing the bootstrapped confidence interval becomes difficult because exhaustive enumeration is slow and $N$ can be very large. 
Another approach is to consider $N$ as a continuous variable such that $N! = \\Gamma(N+1)$, and instead search for a maximizer of\n", "\n", "\\begin{align}\n", "h(N) = \\ln \\Gamma(N+1) - \\bar{n}\\,\\ln N + (N-\\bar{n}) \\ln\\left(1-\\frac{\\bar{n}}{N}\\right) - \\frac{1}{M}\\sum_{i=1}^M \\ln \\Gamma(N-n_i+1).\n", "\\end{align}\n", "\n", "Noting that the derivative of the log of a Gamma function is the digamma function (the zeroth-order polygamma function),\n", "\n", "\\begin{align}\n", "\\frac{\\mathrm{d}}{\\mathrm{d}x}\\,\\ln \\Gamma(x) = \\psi_0(x),\n", "\\end{align}\n", "\n", "we can write\n", "\n", "\\begin{align}\n", "\\frac{\\mathrm{d}h}{\\mathrm{d}N} = \\ln\\left(1 - \\frac{\\bar{n}}{N}\\right) + \\psi_0(N+1) - \\frac{1}{M}\\sum_{i=1}^M\\psi_0(N-n_i+1) = 0.\n", "\\end{align}\n", "\n", "Our strategy to find $N^*$ is to solve the above root-finding problem for continuous $N$, and then check which integer $N$ on either side of the root gives the maximal log likelihood. Note that $\\mathrm{d}h/\\mathrm{d}N \\to 0$ as $N\\to \\infty$, so a finite root may not exist, in which case the MLE is $N\\to\\infty$ and $\\theta \\to 0$. In this limit, the numbers of positive results are Poisson distributed (the $N\\to\\infty$, $\\theta \\to 0$ limit in which $N\\theta$ remains finite), and the MLE for $N\\theta$ is $\\bar{n}$. We will not include those results in our bootstrap samples." ] }, { "cell_type": "code", "execution_count": 5, "id": "9c3a5dac-a2be-44d6-9223-9ec7d9a8ed9f", "metadata": {}, "outputs": [], "source": [ "def dh_dN(N, n):\n", "    \"\"\"Return dh/dN for the root-finding problem.\"\"\"\n", "    sum_term = np.mean(scipy.special.polygamma(0, N - n + 1))\n", "    return np.log(1 - np.mean(n) / N) + scipy.special.polygamma(0, N + 1) - sum_term\n", "\n", "\n", "def mle(n, N_max=1_000_000):\n", "    \"\"\"Compute the MLE for the OPZ problem.\"\"\"\n", "    try:\n", "        root = scipy.optimize.brentq(dh_dN, np.max(n)+1, N_max, args=(n,))\n", "        return mle_exhaustive(n, N_min=int(root), N_max=int(root)+1)\n", "    except ValueError:\n", "        # brentq found no sign change on the interval: no finite root,\n", "        # and therefore no finite MLE, exists\n", "        return (np.nan, np.nan)" ] }, { "cell_type": "markdown", "id": "92da034e-fd57-4bbf-966d-08d558d9cdcc", "metadata": {}, "source": [ "Let's now compute the bootstrap replicates." ] }, { "cell_type": "code", "execution_count": 6, "id": "791623aa-badf-449d-9836-765d485f48e2", "metadata": {}, "outputs": [], "source": [ "# Observed data\n", "n = np.array([16, 18, 22, 25, 27])\n", "\n", "# Get MLE\n", "N, theta = mle(n)\n", "\n", "# Parametric bootstrap\n", "rng = np.random.default_rng()\n", "bs_reps = np.array([mle(rng.binomial(N, theta, size=len(n))) for _ in range(100_000)])" ] }, { "cell_type": "markdown", "id": "b5d7483b-88f8-46f4-87c9-49f9e4c4b03b", "metadata": {}, "source": [ "Let's first check what fraction of the bootstrap replicates had $N\\to\\infty$." ] }, { "cell_type": "code", "execution_count": 7, "id": "7c250f78-20a7-44e4-8e27-e9c167778a07", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.25038" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sum(np.isnan(bs_reps[:,0])) / len(bs_reps)" ] }, { "cell_type": "markdown", "id": "a8593cb9-3695-4d19-8cea-e53b84117a41", "metadata": {}, "source": [ "Wow. About 25% of the bootstrap replicates had $N\\to\\infty$, meaning we cannot even compute an MLE for the Binomial model."
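, "\n", "\n", "For those replicates, we are effectively in the Poisson limit discussed above. As a quick check of that claim (our own illustration; the values of $N$ are arbitrary), the Binomial log PMF with $\\theta = \\bar{n}/N$ should approach the Poisson log PMF with rate $\\bar{n}$ as $N$ grows." ] }, { "cell_type": "code", "execution_count": null, "id": "6e7f8a9b-0c1d-4e2f-b3a4-5c6d7e8f9a0b", "metadata": {}, "outputs": [], "source": [ "# Illustrative check of the Poisson limit: hold N * theta = nbar fixed and\n", "# let N grow; the Binomial log PMF approaches the Poisson log PMF\n", "nbar = np.mean(n)\n", "\n", "for N_val in [100, 1000, 10_000, 100_000]:\n", "    print(f\"N = {N_val:>6d}: {np.sum(st.binom.logpmf(n, N_val, nbar / N_val)):.4f}\")\n", "\n", "print(f\"Poisson:    {np.sum(st.poisson.logpmf(n, nbar)):.4f}\")" ] }, { "cell_type": "markdown", "id": "0a1b2c3d-4e5f-4a6b-8c9d-e0f1a2b3c4d5", "metadata": {}, "source": [ "Now, let's compute confidence intervals from the samples where a finite MLE exists, first for $N$."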
] }, { "cell_type": "code", "execution_count": 8, "id": "cbcfc638-c777-438c-b5b8-b5cd324b6ed3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 24., 487.])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.percentile(bs_reps[~np.isnan(bs_reps[:, 0]), 0], [2.5, 97.5])" ] }, { "cell_type": "markdown", "id": "5cbe0d2e-e1ab-4528-922d-441bcf2d772c", "metadata": {}, "source": [ "And now for theta." ] }, { "cell_type": "code", "execution_count": 9, "id": "bfa13165-fe2b-4283-afb0-09f51baa367c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.04417178, 0.85 ])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.percentile(bs_reps[~np.isnan(bs_reps[:, 1]), 1], [2.5, 97.5])" ] }, { "cell_type": "markdown", "id": "70ff1dda-bb2c-42d9-bb98-d8f6f23883ed", "metadata": {}, "source": [ "These results verify that from observing first positive results from five sets of experiments, we cannot really know anything about the parameters of the model." ] }, { "cell_type": "markdown", "id": "4be9194e-bf90-474b-b9bd-fdb5a803d0a2", "metadata": {}, "source": [ "## Computing environment" ] }, { "cell_type": "code", "execution_count": 10, "id": "d2afda26-e264-4dcb-a647-a0e359fc4ecb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python implementation: CPython\n", "Python version : 3.11.4\n", "IPython version : 8.12.0\n", "\n", "numpy : 1.24.3\n", "scipy : 1.10.1\n", "jupyterlab: 4.0.4\n", "\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -v -p numpy,scipy,jupyterlab" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 5 }