{ "cells": [ { "cell_type": "markdown", "id": "b4f4d03e-4b26-4d96-aee9-dc525795355c", "metadata": {}, "source": [ "# The Olkin-Petkau-Zidek example of MLE fragility" ] }, { "cell_type": "code", "execution_count": 1, "id": "e121995a-db28-4656-b343-bc2ba95c4ea2", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import scipy.stats as st\n", "import scipy.special\n", "import scipy.optimize" ] }, { "cell_type": "markdown", "id": "44338151-2e46-493c-94af-5cc0b6b87b0d", "metadata": {}, "source": [ "
\n", "\n", "In 1981, [Olkin, Petkau, and Zidek](https://doi.org/10.1080/01621459.1981.10477697) demonstrated an example in which MLE estimates can very wildly with only small changes in the data. Here, we will work through their example.\n", "\n", "The problem is thusly stated: Say you are measuring the outcomes of $N$ Bernoulli trials, but you can only measure a positive result; negative results are not detected in your experiment. You do know, however that $N$, while unknown, is the same for all experiments. The number of positive results you get from a set of measurements (sorted for convenience) are *n* = 16, 18, 22, 25, 27. Modeling the generative process with Binomial distribution, $n_i \\sim \\text{Binom}(\\theta, N)\\;\\;\\forall i$, we can obtain maximum likelihood estimates for $\\theta$ and $N$. \n", "\n", "Let us first attempt to make some analytical progress on computing the MLE. We start by writing the likelihood and log likelihood. Let $M$ be the number of measurements made; in this case, $M = 5$.\n", "\n", "\\begin{align}\n", "&L(\\theta, N;\\mathbf{n}) = \\prod_{i=1}^M \\frac{N!}{n_i!(N-n_i)!}\\,\\theta^{n_i}(1-\\theta)^{N-n_i},\\\\[1em]\n", "&\\ell(\\theta, N;\\mathbf{n}) = M\\ln N! + \\sum_{i=1}^M \\left[-\\ln n_i! - \\ln\\left(N-n_i\\right)! + n_i \\ln \\theta + (N-n_i)\\ln(1-\\theta)\\right].\n", "\\end{align}\n", "\n", "A necessary condition for the MLE is that\n", "\n", "\\begin{align}\n", "\\frac{\\partial \\ell}{\\partial \\theta} = \\sum_{i=1}^M\\,\\frac{n_i}{\\theta} - \\frac{N-n_i}{1-\\theta} \n", "= M\\left(\\frac{\\bar{n}}{\\theta} - \\frac{N}{1-\\theta} + \\frac{\\bar{n}}{1-\\theta}\\right)= 0,\n", "\\end{align}\n", "\n", "where $\\bar{n}$ denotes the average of the $M$ observed $n$ values. Solving for the MLE value of $\\theta$ yields\n", "\n", "\\begin{align}\n", "\\theta^* = \\frac{\\bar{n}}{N},\n", "\\end{align}\n", "\n", "which is perhaps not surprising. We are left to find the MLE for $N$, which is the value of $N$ for which the log likelihood\n", "\n", "\\begin{align}\n", "\\ell = M\\ln N! + \\sum_{i=1}^M \\left[-\\ln n_i! - \\ln\\left(N-n_i\\right)! + n_i\\,\\ln \\frac{\\bar{n}}{N} + (N-n_i)\\ln\\left(1-\\frac{\\bar{n}}{N}\\right)\\right]\n", "\\end{align}\n", "\n", "is maximal. Equivalently, we seek $N$ such that the quantity\n", "\n", "\\begin{align}\n", "\\ln N! - \\bar{n}\\,\\ln N + (N-\\bar{n}) \\ln\\left(1-\\frac{\\bar{n}}{N}\\right) - \\frac{1}{M}\\sum_{i=1}^M \\ln(N-n_i)!\n", "\\end{align}\n", "\n", "is maximal.\n", "\n", "Computing the MLE for a discrete parameter can be a bit tricky. We could take the approach of exhaustively enumerating the log likelihood for many values of $N$ and then selecting the value of $N$ that give the largest log likelihood. Here is an implementation of that." 
] }, { "cell_type": "code", "execution_count": 2, "id": "0551a4e5-ad48-441c-8508-0dd0cef34704", "metadata": {}, "outputs": [], "source": [ "def mle_exhaustive(n, N_min=None, N_max=1000):\n", " \"\"\"Compute MLE for N and theta.\"\"\"\n", " # Values of N we will enumerate\n", " N = np.arange(np.max(n) if N_min is None else N_min, N_max+1)\n", "\n", " # MLE for theta as a function of N\n", " theta = np.mean(n) / N\n", "\n", " # Compute log likelihood\n", " log_L = [\n", " np.sum(st.binom.logpmf(n, N_val, theta_val)) for N_val, theta_val in zip(N, theta)\n", " ]\n", "\n", " # Find index of MLE of N\n", " i_mle = np.argmax(log_L)\n", "\n", " # Compute MLE\n", " N_mle = N[i_mle]\n", " theta_mle = np.mean(n) / N_mle\n", " \n", " return N_mle, theta_mle" ] }, { "cell_type": "markdown", "id": "5b8c076b-631c-49ba-87d2-c111d071db84", "metadata": {}, "source": [ "Let's put our function to use and compute the MLE!" ] }, { "cell_type": "code", "execution_count": 3, "id": "f6309687-e856-4ef7-b85c-ac3d576f95a1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(99, 0.2181818181818182)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Observed data\n", "n = np.array([16, 18, 22, 25, 27])\n", "\n", "mle_exhaustive(n)" ] }, { "cell_type": "markdown", "id": "57e02982-9589-4721-a4bd-bb0230aa448c", "metadata": {}, "source": [ "We get $N^* = 99$ and $\\theta^* = 0.22$.\n", "\n", "But now, let's say that the final measurement has 28 positive results instead of 27. That's a change of a single positive result in just one of the five experiments. Now let's compute the MLE for this new data set." ] }, { "cell_type": "code", "execution_count": 4, "id": "f5e2e281-7276-445b-af01-548a159af405", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(191, 0.11413612565445026)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Observed data\n", "n = np.array([16, 18, 22, 25, 28])\n", "\n", "mle_exhaustive(n)" ] }, { "cell_type": "markdown", "id": "42bb0a25-1009-49c4-872f-b29ba83cf1d7", "metadata": {}, "source": [ "Oof! $N^*$ almost doubled to 191 and $\\theta^*$ halved to 0.11. With just a tiny change in the data, we can get wildly different estimates for the parameters. This is an example of the fragility of the MLE.\n", "\n", "To get a handle on how our MLE would change with newly acquired data sets, we will do parametric bootstrap to compute confidence intervals for $N^*$ and $\\theta^*$ using the original data set with 27 successes for the fifth trial. Computing the bootstrapped confidence interval becomes difficult because exhaustive enumeration is slow and $N$ can be very large. 
Another approach is to consider $N$ as a continuous variable such that $N! = \\Gamma(N+1)$, and instead search for a maximizer of\n", "\n", "\\begin{align}\n", "h(N) = \\ln \\Gamma(N+1) - \\bar{n}\\,\\ln N + (N-\\bar{n}) \\ln\\left(1-\\frac{\\bar{n}}{N}\\right) - \\frac{1}{M}\\sum_{i=1}^M \\ln \\Gamma(N-n_i+1).\n", "\\end{align}\n", "\n", "Noting that the derivative of the log of a Gamma function is the digamma function (the zeroth-order polygamma function),\n", "\n", "\\begin{align}\n", "\\frac{\\mathrm{d}}{\\mathrm{d}x}\\,\\ln \\Gamma(x) = \\psi_0(x),\n", "\\end{align}\n", "\n", "we can write\n", "\n", "\\begin{align}\n", "\\frac{\\mathrm{d}h}{\\mathrm{d}N} = \\ln\\left(1 - \\frac{\\bar{n}}{N}\\right) + \\psi_0(N+1) - \\frac{1}{M}\\sum_{i=1}^M\\psi_0(N-n_i+1) = 0.\n", "\\end{align}\n", "\n", "Our strategy to find $N^*$ is to solve the above root-finding problem for continuous $N$, and then check which integer $N$ on either side of the root gives the maximal log likelihood. Note that $\\mathrm{d}h/\\mathrm{d}N \\to 0$ as $N\\to \\infty$, so a finite root may not exist, in which case the MLE is $N\\to\\infty$ and $\\theta \\to 0$. In this limit, the numbers of positive results are Poisson distributed (the $N\\to\\infty$, $\\theta \\to 0$ limit in which $N\\theta$ remains finite), and the MLE for $N\\theta$ is $\\bar{n}$. We will not include those results in our bootstrap samples." ] }, { "cell_type": "code", "execution_count": 5, "id": "9c3a5dac-a2be-44d6-9223-9ec7d9a8ed9f", "metadata": {}, "outputs": [], "source": [ "def dh_dN(N, n):\n", "    \"\"\"Return dh/dN for the root-finding problem.\"\"\"\n", "    sum_term = np.mean(scipy.special.polygamma(0, N - n + 1))\n", "    return np.log(1 - np.mean(n) / N) + scipy.special.polygamma(0, N + 1) - sum_term\n", "\n", "\n", "def mle(n, N_max=1_000_000):\n", "    \"\"\"Compute the MLE for the OPZ problem.\"\"\"\n", "    try:\n", "        root = scipy.optimize.brentq(dh_dN, np.max(n)+1, N_max, args=(n,))\n", "        return mle_exhaustive(n, N_min=int(root), N_max=int(root)+1)\n", "    except ValueError:\n", "        # brentq found no sign change on the interval: no finite root,\n", "        # and therefore no finite MLE, exists\n", "        return (np.nan, np.nan)" ] }, { "cell_type": "markdown", "id": "92da034e-fd57-4bbf-966d-08d558d9cdcc", "metadata": {}, "source": [ "Let's now compute the bootstrap replicates." ] }, { "cell_type": "code", "execution_count": 6, "id": "791623aa-badf-449d-9836-765d485f48e2", "metadata": {}, "outputs": [], "source": [ "# Observed data\n", "n = np.array([16, 18, 22, 25, 27])\n", "\n", "# Get MLE\n", "N, theta = mle(n)\n", "\n", "# Parametric bootstrap\n", "rng = np.random.default_rng()\n", "bs_reps = np.array([mle(rng.binomial(N, theta, size=len(n))) for _ in range(100_000)])" ] }, { "cell_type": "markdown", "id": "b5d7483b-88f8-46f4-87c9-49f9e4c4b03b", "metadata": {}, "source": [ "Let's first check what fraction of the bootstrap replicates had $N\\to\\infty$." ] }, { "cell_type": "code", "execution_count": 7, "id": "7c250f78-20a7-44e4-8e27-e9c167778a07", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.25038" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sum(np.isnan(bs_reps[:,0])) / len(bs_reps)" ] }, { "cell_type": "markdown", "id": "a8593cb9-3695-4d19-8cea-e53b84117a41", "metadata": {}, "source": [ "Wow. About 25% of the bootstrap replicates had $N\\to\\infty$, meaning we cannot even compute an MLE for the Binomial model."
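, "\n", "\n", "For those replicates, we are effectively in the Poisson limit discussed above. As a quick check of that claim (our own illustration; the values of $N$ are arbitrary), the Binomial log PMF with $\\theta = \\bar{n}/N$ should approach the Poisson log PMF with rate $\\bar{n}$ as $N$ grows." ] }, { "cell_type": "code", "execution_count": null, "id": "6e7f8a9b-0c1d-4e2f-b3a4-5c6d7e8f9a0b", "metadata": {}, "outputs": [], "source": [ "# Illustrative check of the Poisson limit: hold N * theta = nbar fixed and\n", "# let N grow; the Binomial log PMF approaches the Poisson log PMF\n", "nbar = np.mean(n)\n", "\n", "for N_val in [100, 1000, 10_000, 100_000]:\n", "    print(f\"N = {N_val:>6d}: {np.sum(st.binom.logpmf(n, N_val, nbar / N_val)):.4f}\")\n", "\n", "print(f\"Poisson:    {np.sum(st.poisson.logpmf(n, nbar)):.4f}\")" ] }, { "cell_type": "markdown", "id": "0a1b2c3d-4e5f-4a6b-8c9d-e0f1a2b3c4d5", "metadata": {}, "source": [ "Now, let's compute confidence intervals from the samples where a finite MLE exists, first for $N$."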
] }, { "cell_type": "code", "execution_count": 8, "id": "cbcfc638-c777-438c-b5b8-b5cd324b6ed3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 24., 487.])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.percentile(bs_reps[~np.isnan(bs_reps[:, 0]), 0], [2.5, 97.5])" ] }, { "cell_type": "markdown", "id": "5cbe0d2e-e1ab-4528-922d-441bcf2d772c", "metadata": {}, "source": [ "And now for theta." ] }, { "cell_type": "code", "execution_count": 9, "id": "bfa13165-fe2b-4283-afb0-09f51baa367c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.04417178, 0.85 ])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.percentile(bs_reps[~np.isnan(bs_reps[:, 1]), 1], [2.5, 97.5])" ] }, { "cell_type": "markdown", "id": "70ff1dda-bb2c-42d9-bb98-d8f6f23883ed", "metadata": {}, "source": [ "These results verify that from observing first positive results from five sets of experiments, we cannot really know anything about the parameters of the model." ] }, { "cell_type": "markdown", "id": "4be9194e-bf90-474b-b9bd-fdb5a803d0a2", "metadata": {}, "source": [ "## Computing environment" ] }, { "cell_type": "code", "execution_count": 10, "id": "d2afda26-e264-4dcb-a647-a0e359fc4ecb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python implementation: CPython\n", "Python version : 3.11.4\n", "IPython version : 8.12.0\n", "\n", "numpy : 1.24.3\n", "scipy : 1.10.1\n", "jupyterlab: 4.0.4\n", "\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -v -p numpy,scipy,jupyterlab" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 5 }