{ "cells": [ { "cell_type": "markdown", "id": "dc1abc34", "metadata": {}, "source": [ "\n", "" ] }, { "cell_type": "markdown", "id": "8bbc1018", "metadata": {}, "source": [ "# Multivariate Hypergeometric Distribution" ] }, { "cell_type": "markdown", "id": "4ab201f1", "metadata": {}, "source": [ "## Contents\n", "\n", "- [Multivariate Hypergeometric Distribution](#Multivariate-Hypergeometric-Distribution) \n", " - [Overview](#Overview) \n", " - [The Administrator’s Problem](#The-Administrator’s-Problem) \n", " - [Usage](#Usage) " ] }, { "cell_type": "markdown", "id": "faac732d", "metadata": {}, "source": [ "## Overview\n", "\n", "This lecture describes how an administrator deployed a **multivariate hypergeometric distribution** in order to access the fairness of a procedure for awarding research grants.\n", "\n", "In the lecture we’ll learn about\n", "\n", "- properties of the multivariate hypergeometric distribution \n", "- first and second moments of a multivariate hypergeometric distribution \n", "- using a Monte Carlo simulation of a multivariate normal distribution to evaluate the quality of a normal approximation \n", "- the administrator’s problem and why the multivariate hypergeometric distribution is the right tool " ] }, { "cell_type": "markdown", "id": "3a38a92d", "metadata": {}, "source": [ "## The Administrator’s Problem\n", "\n", "An administrator in charge of allocating research grants is in the following situation.\n", "\n", "To help us forget details that are none of our business here and to protect the anonymity of the administrator and the subjects, we call\n", "research proposals **balls** and continents of residence of authors of a proposal a **color**.\n", "\n", "There are $ K_i $ balls (proposals) of color $ i $.\n", "\n", "There are $ c $ distinct colors (continents of residence).\n", "\n", "Thus, $ i = 1, 2, \\ldots, c $\n", "\n", "So there is a total of $ N = \\sum_{i=1}^c K_i $ balls.\n", "\n", "All $ N $ of these balls are placed in an urn.\n", "\n", "Then $ n $ balls are drawn randomly.\n", "\n", "The selection procedure is supposed to be **color blind** meaning that **ball quality**, a random variable that is supposed to be independent of **ball color**, governs whether a ball is drawn.\n", "\n", "Thus, the selection procedure is supposed randomly to draw $ n $ balls from the urn.\n", "\n", "The $ n $ balls drawn represent successful proposals and are awarded research funds.\n", "\n", "The remaining $ N-n $ balls receive no research funds." ] }, { "cell_type": "markdown", "id": "76798dfe", "metadata": {}, "source": [ "### Details of the Awards Procedure Under Study\n", "\n", "Let $ k_i $ be the number of balls of color $ i $ that are drawn.\n", "\n", "Things have to add up so $ \\sum_{i=1}^c k_i = n $.\n", "\n", "Under the hypothesis that the selection process judges proposals on their quality and that quality is independent of continent of the author’s continent of residence, the administrator views the outcome of the selection procedure as a random vector\n", "\n", "$$\n", "X = \\begin{pmatrix} k_1 \\cr k_2 \\cr \\vdots \\cr k_c \\end{pmatrix}.\n", "$$\n", "\n", "To evaluate whether the selection procedure is **color blind** the administrator wants to study whether the particular realization of $ X $ drawn can plausibly\n", "be said to be a random draw from the probability distribution that is implied by the **color blind** hypothesis.\n", "\n", "The appropriate probability distribution is the one described [here](https://en.wikipedia.org/wiki/Hypergeometric_distribution).\n", "\n", "Let’s now instantiate the administrator’s problem, while continuing to use the colored balls metaphor.\n", "\n", "The administrator has an urn with $ N = 238 $ balls.\n", "\n", "157 balls are blue, 11 balls are green, 46 balls are yellow, and 24 balls are black.\n", "\n", "So $ (K_1, K_2, K_3, K_4) = (157 , 11 , 46 , 24) $ and $ c = 4 $.\n", "\n", "15 balls are drawn without replacement.\n", "\n", "So $ n = 15 $.\n", "\n", "The administrator wants to know the probability distribution of outcomes\n", "\n", "$$\n", "X = \\begin{pmatrix} k_1 \\cr k_2 \\cr \\vdots \\cr k_4 \\end{pmatrix}.\n", "$$\n", "\n", "In particular, he wants to know whether a particular\n", "outcome - in the form of a $ 4 \\times 1 $ vector of integers recording the\n", "numbers of blue, green, yellow, and black balls, respectively, - contains\n", "evidence against the hypothesis that the selection process is *fair*, which\n", "here means *color blind* and truly are random draws without replacement from\n", "the population of $ N $ balls.\n", "\n", "The right tool for the administrator’s job is the **multivariate hypergeometric distribution**." ] }, { "cell_type": "markdown", "id": "0585b1d9", "metadata": {}, "source": [ "### Multivariate Hypergeometric Distribution\n", "\n", "Let’s start with some imports." ] }, { "cell_type": "code", "execution_count": null, "id": "c91b0831", "metadata": { "hide-output": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "plt.rcParams[\"figure.figsize\"] = (11, 5) #set default figure size\n", "import numpy as np\n", "from scipy.special import comb\n", "from scipy.stats import normaltest\n", "from numba import njit, prange" ] }, { "cell_type": "markdown", "id": "9068fb00", "metadata": {}, "source": [ "To recapitulate, we assume there are in total $ c $ types of objects in an urn.\n", "\n", "If there are $ K_{i} $ type $ i $ object in the urn and we take\n", "$ n $ draws at random without replacement, then the numbers of type\n", "$ i $ objects in the sample $ (k_{1},k_{2},\\dots,k_{c}) $\n", "has the multivariate hypergeometric distribution.\n", "\n", "Note again that $ N=\\sum_{i=1}^{c} K_{i} $ is\n", "the total number of objects in the urn and $ n=\\sum_{i=1}^{c}k_{i} $.\n", "\n", "**Notation**\n", "\n", "We use the following notation for **binomial coefficients**: $ {m \\choose q} = \\frac{m!}{(m-q)!} $.\n", "\n", "The multivariate hypergeometric distribution has the following properties:\n", "\n", "**Probability mass function**:\n", "\n", "$$\n", "\\Pr \\{X_{i}=k_{i} \\ \\forall i\\} =\n", " \\frac {\\prod _{i=1}^{c}{\\binom {K_{i}}{k_{i}}}}{\\binom {N}{n}}\n", "$$\n", "\n", "**Mean**:\n", "\n", "$$\n", "{\\displaystyle \\operatorname {E} (X_{i})=n{\\frac {K_{i}}{N}}}\n", "$$\n", "\n", "**Variances and covariances**:\n", "\n", "$$\n", "{\\displaystyle \\operatorname {Var} (X_{i})=n{\\frac {N-n}{N-1}}\\;{\\frac {K_{i}}{N}}\\left(1-{\\frac {K_{i}}{N}}\\right)}\n", "$$\n", "\n", "$$\n", "{\\displaystyle \\operatorname {Cov} (X_{i},X_{j})=-n{\\frac {N-n}{N-1}}\\;{\\frac {K_{i}}{N}}{\\frac {K_{j}}{N}}}\n", "$$\n", "\n", "To do our work for us, we’ll write an `Urn` class." ] }, { "cell_type": "code", "execution_count": null, "id": "b3e32c76", "metadata": { "hide-output": false }, "outputs": [], "source": [ "class Urn:\n", "\n", " def __init__(self, K_arr):\n", " \"\"\"\n", " Initialization given the number of each type i object in the urn.\n", "\n", " Parameters\n", " ----------\n", " K_arr: ndarray(int)\n", " number of each type i object.\n", " \"\"\"\n", "\n", " self.K_arr = np.array(K_arr)\n", " self.N = np.sum(K_arr)\n", " self.c = len(K_arr)\n", "\n", " def pmf(self, k_arr):\n", " \"\"\"\n", " Probability mass function.\n", "\n", " Parameters\n", " ----------\n", " k_arr: ndarray(int)\n", " number of observed successes of each object.\n", " \"\"\"\n", "\n", " K_arr, N = self.K_arr, self.N\n", "\n", " k_arr = np.atleast_2d(k_arr)\n", " n = np.sum(k_arr, 1)\n", "\n", " num = np.prod(comb(K_arr, k_arr), 1)\n", " denom = comb(N, n)\n", "\n", " pr = num / denom\n", "\n", " return pr\n", "\n", " def moments(self, n):\n", " \"\"\"\n", " Compute the mean and variance-covariance matrix for\n", " multivariate hypergeometric distribution.\n", "\n", " Parameters\n", " ----------\n", " n: int\n", " number of draws.\n", " \"\"\"\n", "\n", " K_arr, N, c = self.K_arr, self.N, self.c\n", "\n", " # mean\n", " μ = n * K_arr / N\n", "\n", " # variance-covariance matrix\n", " Σ = np.full((c, c), n * (N - n) / (N - 1) / N ** 2)\n", " for i in range(c-1):\n", " Σ[i, i] *= K_arr[i] * (N - K_arr[i])\n", " for j in range(i+1, c):\n", " Σ[i, j] *= - K_arr[i] * K_arr[j]\n", " Σ[j, i] = Σ[i, j]\n", "\n", " Σ[-1, -1] *= K_arr[-1] * (N - K_arr[-1])\n", "\n", " return μ, Σ\n", "\n", " def simulate(self, n, size=1, seed=None):\n", " \"\"\"\n", " Simulate a sample from multivariate hypergeometric\n", " distribution where at each draw we take n objects\n", " from the urn without replacement.\n", "\n", " Parameters\n", " ----------\n", " n: int\n", " number of objects for each draw.\n", " size: int(optional)\n", " sample size.\n", " seed: int(optional)\n", " random seed.\n", " \"\"\"\n", "\n", " K_arr = self.K_arr\n", "\n", " gen = np.random.Generator(np.random.PCG64(seed))\n", " sample = gen.multivariate_hypergeometric(K_arr, n, size=size)\n", "\n", " return sample" ] }, { "cell_type": "markdown", "id": "86dafa45", "metadata": {}, "source": [ "## Usage" ] }, { "cell_type": "markdown", "id": "959ca7ac", "metadata": {}, "source": [ "### First example\n", "\n", "Apply this to an example from\n", "[wiki](https://en.wikipedia.org/wiki/Hypergeometric_distribution#Multivariate_hypergeometric_distribution):\n", "\n", "Suppose there are 5 black, 10 white, and 15 red marbles in an urn. If\n", "six marbles are chosen without replacement, the probability that exactly\n", "two of each color are chosen is\n", "\n", "$$\n", "P(2{\\text{ black}},2{\\text{ white}},2{\\text{ red}})={{{5 \\choose 2}{10 \\choose 2}{15 \\choose 2}} \\over {30 \\choose 6}}=0.079575596816976\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "id": "ae23d37f", "metadata": { "hide-output": false }, "outputs": [], "source": [ "# construct the urn\n", "K_arr = [5, 10, 15]\n", "urn = Urn(K_arr)" ] }, { "cell_type": "markdown", "id": "0d660b6a", "metadata": {}, "source": [ "Now use the Urn Class method `pmf` to compute the probability of the outcome $ X = \\begin{pmatrix} 2 & 2 & 2 \\end{pmatrix} $" ] }, { "cell_type": "code", "execution_count": null, "id": "c6a9290e", "metadata": { "hide-output": false }, "outputs": [], "source": [ "k_arr = [2, 2, 2] # array of number of observed successes\n", "urn.pmf(k_arr)" ] }, { "cell_type": "markdown", "id": "51187174", "metadata": {}, "source": [ "We can use the code to compute probabilities of a list of possible outcomes by\n", "constructing a 2-dimensional\n", "array `k_arr` and `pmf` will return an array of probabilities for\n", "observing each case." ] }, { "cell_type": "code", "execution_count": null, "id": "e0dbae09", "metadata": { "hide-output": false }, "outputs": [], "source": [ "k_arr = [[2, 2, 2], [1, 3, 2]]\n", "urn.pmf(k_arr)" ] }, { "cell_type": "markdown", "id": "14ae8a25", "metadata": {}, "source": [ "Now let’s compute the mean vector and variance-covariance matrix." ] }, { "cell_type": "code", "execution_count": null, "id": "c30488bf", "metadata": { "hide-output": false }, "outputs": [], "source": [ "n = 6\n", "μ, Σ = urn.moments(n)" ] }, { "cell_type": "code", "execution_count": null, "id": "c88d442b", "metadata": { "hide-output": false }, "outputs": [], "source": [ "μ" ] }, { "cell_type": "code", "execution_count": null, "id": "86a0c335", "metadata": { "hide-output": false }, "outputs": [], "source": [ "Σ" ] }, { "cell_type": "markdown", "id": "807269df", "metadata": {}, "source": [ "### Back to The Administrator’s Problem\n", "\n", "Now let’s turn to the grant administrator’s problem.\n", "\n", "Here the array of\n", "numbers of $ i $ objects in the urn is\n", "$ \\left(157, 11, 46, 24\\right) $." ] }, { "cell_type": "code", "execution_count": null, "id": "51735f9f", "metadata": { "hide-output": false }, "outputs": [], "source": [ "K_arr = [157, 11, 46, 24]\n", "urn = Urn(K_arr)" ] }, { "cell_type": "markdown", "id": "5162154d", "metadata": {}, "source": [ "Let’s compute the probability of the outcome $ \\left(10, 1, 4, 0 \\right) $." ] }, { "cell_type": "code", "execution_count": null, "id": "2bbdfe0d", "metadata": { "hide-output": false }, "outputs": [], "source": [ "k_arr = [10, 1, 4, 0]\n", "urn.pmf(k_arr)" ] }, { "cell_type": "markdown", "id": "ebb0b667", "metadata": {}, "source": [ "We can compute probabilities of three possible outcomes by constructing a 3-dimensional\n", "arrays `k_arr` and utilizing the method `pmf` of the `Urn` class." ] }, { "cell_type": "code", "execution_count": null, "id": "a951d05e", "metadata": { "hide-output": false }, "outputs": [], "source": [ "k_arr = [[5, 5, 4 ,1], [10, 1, 2, 2], [13, 0, 2, 0]]\n", "urn.pmf(k_arr)" ] }, { "cell_type": "markdown", "id": "238591af", "metadata": {}, "source": [ "Now let’s compute the mean and variance-covariance matrix of $ X $ when $ n=6 $." ] }, { "cell_type": "code", "execution_count": null, "id": "f3157968", "metadata": { "hide-output": false }, "outputs": [], "source": [ "n = 6 # number of draws\n", "μ, Σ = urn.moments(n)" ] }, { "cell_type": "code", "execution_count": null, "id": "0e5f83e3", "metadata": { "hide-output": false }, "outputs": [], "source": [ "# mean\n", "μ" ] }, { "cell_type": "code", "execution_count": null, "id": "96124cd3", "metadata": { "hide-output": false }, "outputs": [], "source": [ "# variance-covariance matrix\n", "Σ" ] }, { "cell_type": "markdown", "id": "f4c56f10", "metadata": {}, "source": [ "We can simulate a large sample and verify that sample means and covariances closely approximate the population means and covariances." ] }, { "cell_type": "code", "execution_count": null, "id": "8c6146b5", "metadata": { "hide-output": false }, "outputs": [], "source": [ "size = 10_000_000\n", "sample = urn.simulate(n, size=size)" ] }, { "cell_type": "code", "execution_count": null, "id": "4820aca3", "metadata": { "hide-output": false }, "outputs": [], "source": [ "# mean\n", "np.mean(sample, 0)" ] }, { "cell_type": "code", "execution_count": null, "id": "53e1213e", "metadata": { "hide-output": false }, "outputs": [], "source": [ "# variance covariance matrix\n", "np.cov(sample.T)" ] }, { "cell_type": "markdown", "id": "a82ec2dd", "metadata": {}, "source": [ "Evidently, the sample means and covariances approximate their population counterparts well." ] }, { "cell_type": "markdown", "id": "28d7aa3b", "metadata": {}, "source": [ "### Quality of Normal Approximation\n", "\n", "To judge the quality of a multivariate normal approximation to the multivariate hypergeometric distribution, we draw a large sample from a multivariate normal distribution with the mean vector and covariance matrix for the corresponding multivariate hypergeometric distribution and compare the simulated distribution with the population multivariate hypergeometric distribution." ] }, { "cell_type": "code", "execution_count": null, "id": "22800490", "metadata": { "hide-output": false }, "outputs": [], "source": [ "sample_normal = np.random.multivariate_normal(μ, Σ, size=size)" ] }, { "cell_type": "code", "execution_count": null, "id": "2b1bec3d", "metadata": { "hide-output": false }, "outputs": [], "source": [ "def bivariate_normal(x, y, μ, Σ, i, j):\n", "\n", " μ_x, μ_y = μ[i], μ[j]\n", " σ_x, σ_y = np.sqrt(Σ[i, i]), np.sqrt(Σ[j, j])\n", " σ_xy = Σ[i, j]\n", "\n", " x_μ = x - μ_x\n", " y_μ = y - μ_y\n", "\n", " ρ = σ_xy / (σ_x * σ_y)\n", " z = x_μ**2 / σ_x**2 + y_μ**2 / σ_y**2 - 2 * ρ * x_μ * y_μ / (σ_x * σ_y)\n", " denom = 2 * np.pi * σ_x * σ_y * np.sqrt(1 - ρ**2)\n", "\n", " return np.exp(-z / (2 * (1 - ρ**2))) / denom" ] }, { "cell_type": "code", "execution_count": null, "id": "3543a0c4", "metadata": { "hide-output": false }, "outputs": [], "source": [ "@njit\n", "def count(vec1, vec2, n):\n", " size = sample.shape[0]\n", "\n", " count_mat = np.zeros((n+1, n+1))\n", " for i in prange(size):\n", " count_mat[vec1[i], vec2[i]] += 1\n", "\n", " return count_mat" ] }, { "cell_type": "code", "execution_count": null, "id": "dbe84819", "metadata": { "hide-output": false }, "outputs": [], "source": [ "c = urn.c\n", "fig, axs = plt.subplots(c, c, figsize=(14, 14))\n", "\n", "# grids for ploting the bivariate Gaussian\n", "x_grid = np.linspace(-2, n+1, 100)\n", "y_grid = np.linspace(-2, n+1, 100)\n", "X, Y = np.meshgrid(x_grid, y_grid)\n", "\n", "for i in range(c):\n", " axs[i, i].hist(sample[:, i], bins=np.arange(0, n, 1), alpha=0.5, density=True, label='hypergeom')\n", " axs[i, i].hist(sample_normal[:, i], bins=np.arange(0, n, 1), alpha=0.5, density=True, label='normal')\n", " axs[i, i].legend()\n", " axs[i, i].set_title('$k_{' +str(i+1) +'}$')\n", " for j in range(c):\n", " if i == j:\n", " continue\n", "\n", " # bivariate Gaussian density function\n", " Z = bivariate_normal(X, Y, μ, Σ, i, j)\n", " cs = axs[i, j].contour(X, Y, Z, 4, colors=\"black\", alpha=0.6)\n", " axs[i, j].clabel(cs, inline=1, fontsize=10)\n", "\n", " # empirical multivariate hypergeometric distrbution\n", " count_mat = count(sample[:, i], sample[:, j], n)\n", " axs[i, j].pcolor(count_mat.T/size, cmap='Blues')\n", " axs[i, j].set_title('$(k_{' +str(i+1) +'}, k_{' + str(j+1) + '})$')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "e246f5b3", "metadata": {}, "source": [ "The diagonal graphs plot the marginal distributions of $ k_i $ for\n", "each $ i $ using histograms.\n", "\n", "Note the substantial differences between hypergeometric distribution and the approximating normal distribution.\n", "\n", "The off-diagonal graphs plot the empirical joint distribution of\n", "$ k_i $ and $ k_j $ for each pair $ (i, j) $.\n", "\n", "The darker the blue, the more data points are contained in the corresponding cell. (Note that $ k_i $ is on the x-axis and $ k_j $ is on the y-axis).\n", "\n", "The contour maps plot the bivariate Gaussian density function of $ \\left(k_i, k_j\\right) $ with the population mean and covariance given by slices of $ \\mu $ and $ \\Sigma $ that we computed above.\n", "\n", "Let’s also test the normality for each $ k_i $ using `scipy.stats.normaltest` that implements D’Agostino and Pearson’s\n", "test that combines skew and kurtosis to form an omnibus test of normality.\n", "\n", "The null hypothesis is that the sample follows normal distribution.\n", "\n", "> `normaltest` returns an array of p-values associated with tests for each $ k_i $ sample." ] }, { "cell_type": "code", "execution_count": null, "id": "7b63327a", "metadata": { "hide-output": false }, "outputs": [], "source": [ "test_multihyper = normaltest(sample)\n", "test_multihyper.pvalue" ] }, { "cell_type": "markdown", "id": "d6fc860a", "metadata": {}, "source": [ "As we can see, all the p-values are almost $ 0 $ and the null hypothesis is soundly rejected.\n", "\n", "By contrast, the sample from normal distribution does not reject the null hypothesis." ] }, { "cell_type": "code", "execution_count": null, "id": "b55f9f34", "metadata": { "hide-output": false }, "outputs": [], "source": [ "test_normal = normaltest(sample_normal)\n", "test_normal.pvalue" ] }, { "cell_type": "markdown", "id": "023d8de7", "metadata": {}, "source": [ "The lesson to take away from this is that the normal approximation is imperfect." ] } ], "metadata": { "date": 1714442507.9113364, "filename": "multi_hyper.md", "kernelspec": { "display_name": "Python", "language": "python3", "name": "python3" }, "title": "Multivariate Hypergeometric Distribution" }, "nbformat": 4, "nbformat_minor": 5 }