{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Chapter 3: Random Variables and their Distributions\n", " \n", "This Jupyter notebook is the Python equivalent of the R code in section 3.11 R, pp. 126 - 128, [Introduction to Probability, 1st Edition](https://www.crcpress.com/Introduction-to-Probability/Blitzstein-Hwang/p/book/9781466575578), Blitzstein & Hwang.\n", "\n", "----" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Distributions in SciPy\n", "\n", "All of the named distributions that we'll encounter in this book have been implemented in R, but in this Python-based notebook we will use their equivalents in [SciPy](https://docs.scipy.org/doc/scipy/reference/index.html). In this section we'll explain how to work with the Binomial and Hypergeometric distributions in [`scipy.stats`](https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html). We will also explain in general how to generate r.v.s from any discrete distribution with a finite support. The aforementioned Statistical functions page is a handy list of the distributions in `scipy.stats`. Typing `scipy.stats.[distribution].__doc__` will display more information on the named distribution.\n", "\n", "In general, for many named discrete distributions, three functions `pmf`, `cdf`, and `rvs` will give the PMF, CDF, and random generation, respectively." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Binomial distribution\n", "\n", "The Binomial distribution [`binom`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html#scipy.stats.binom) provides the following three functions: `pmf`, `cdf`, and `rvs`. Unlike R, SciPy provides an implementation of the Bernoulli distribution in [`scipy.stats.bernoulli`](https://docs.scipy.org/doc/scipy/reference/tutorial/stats/discrete_bernoulli.html). But as an alterative for `bernoulli`, we can just use the Binomial functions with $k \\in \\{0,1\\}$ and $n = 1$." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import binom\n", "\n", "# to learn more about scipy.stats.binom, un-comment ouf the following line\n", "#print(binom.__doc__)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "* `binom.pmf` is `scipy.stats`' implementation of the Binomial PMF. It takes three inputs: the first is the value of `k` at which to evaluate the PMF, and the second and third are the parameters `n` and `p`. For example, `binom.pmf(3, 5, 0.2)` returns the probability $P(X = 3)$ where $X \\sim Bin(5, 0.2)$. In other words,\n", " \n", " \n", "\\begin{align}\n", " binom.pmf(3, 5, 0.2) &= \\binom{5}{3} (0.2)^{3} (0.8)^{2} \\\\\n", " &= 0.0512 \n", "\\end{align}" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.051200000000000016" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "k = 3\n", "n = 5\n", "p = 0.2\n", "\n", "binom.pmf(k, n, p)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "* `binom.cdf` is the Binomial CDF. It takes three inputs: the first is the value of `k` at which to evaluate the CDF, and the second and third are the parameters. $binom.cdf(3, 5, 0.2)$ is the probability $P(X \\leq 3)$ where $X \\sim Bin(5, 0.2)$. So\n", "\n", "\\begin{align}\n", " binom.cdf(3, 5, 0.2) &= \\sum_{k=0}^{3} \\binom{5}{k} (0.2)^k (0.8)^{5-k} \\\\\n", " &= 0.9933\n", "\\end{align}" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.99328" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "k = 3\n", "n = 5\n", "p = 0.2\n", "\n", "binom.cdf(k, n, p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `binom.rvs` is a function for generating Binomial random variables. For `rvs`, the first and second inputs are still the parameters `n` and `p`, and the `size` parameter is how many r.v.s we want to generate. Thus the command `binom.rvs(5, 0.2, size=7)` produces realizations of seven i.i.d. $Bin(5, 0.2)$ r.v.s. When we ran this command, we got" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 2, 1, 0, 1, 1, 1])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# seed the random number generator\n", "np.random.seed(144)\n", "\n", "n = 5\n", "p = 0.2\n", "\n", "binom.rvs(n, p, size=7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unless you change the `numpy.random.seed` parameter value in code cell [1] above, you'll get the same values above the first time this notebook is run.\n", "\n", "We can also evaluate PMFs and CDFs at an entire vector of values. For example, recall that `numpy.arange(0,n+1)` is a quick way to list the integers from $0$ to $n$. The command `binom.pmf(numpy.arange(0, 5+1), 5, 0.2)` returns 6 numbers, $P(X = 0)\\text{, } P(X = 1)\\text{, } \\cdots \\text{, } P(X = 5)$, where $X \\sim Bin(5, 0.2)$." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([3.2768e-01, 4.0960e-01, 2.0480e-01, 5.1200e-02, 6.4000e-03,\n", " 3.2000e-04])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n = 5\n", "p = 0.2\n", "\n", "binom.pmf(np.arange(0, n+1), n, p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hypergeometric distribution \n", "\n", "The Hypergeometric distribution is implemented in [`scipy.stats.hypergeom`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.hypergeom.html#scipy.stats.hypergeom), which similarly provides the following three functions: `pmf`, `cdf`, and `rvs`. As one might expect, `pmf` is the Hypergeometric PMF, `cdf` is the Hypergeometric CDF, and `rvs` generates Hypergeometric r.v.s. Since the Hypergeometric distribution has three parameters, each of these functions takes at least three inputs. For `pmf` and `cdf`, the first input is the value at which we wish to evaluate the PMF or CDF, and the remaining three inputs are the parameters of the distribution." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import hypergeom\n", "\n", "# to learn more about scipy.stats.hypergeom, un-comment ouf the following line\n", "#print(hypergeom.__doc__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The PMF is defined as\n", "\n", "\\begin{align}\n", " p(k, M, n, N) &= \\frac{\\binom{n}{k} \\, \\binom{M-n}{N-k}}{\\binom{M}{N}}\n", "\\end{align}\n", "\n", "where\n", "* $M$ = the total number of objects (black and white balls, for instance)\n", "* $n$ = the total number of objects of Type I (say we're interested in the white balls)\n", "* $N$ = the total number of objects drawn without replacement from $M$\n", "\n", "\n", "Consider the case where we have a total of $M = 17$ balls (black and white), with $n = 10$ among them being white (and 7 being black), and we draw $N = 8$ of the balls without replacement." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "n = 10 # white balls <- these are the ones we are interested in\n", "b = 7 # black balls\n", "M = n + b # total number of balls\n", "N = 8 # number of balls to draw w/out replacement" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `hypergeom.pmf(k, 17, 10, 8)` returns the probability $P(X = k)$ where $X \\sim HGeom(10, 7, 8)$" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0004113533525298236" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# k = 1\n", "hypergeom.pmf(1, M, n, N)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `hypergeom.cdf(k, 17, 10, 8)` returns $P(X \\leq k)$" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.013368983957219274" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# k = 2\n", "hypergeom.cdf(2, M, n, N)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `hypergeom.rvs(17, 10, 8, size=s)` uses the `size` parameter to specify the number of r.v.s. we want to generate. `hypergeom.rvs(17, 10, 8, size=100)` generates 100 i.i.d. $HGeom(10, 7, 8)$ r.v.s." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([6, 3, 5, 5, 5, 4, 3, 4, 7, 4, 6, 4, 6, 6, 4, 4, 5, 4, 6, 5, 5, 4,\n", " 6, 5, 5, 6, 5, 4, 4, 3, 5, 4, 5, 5, 6, 3, 5, 6, 6, 4, 5, 4, 5, 5,\n", " 4, 3, 5, 4, 5, 5, 3, 7, 5, 4, 5, 6, 4, 3, 5, 5, 3, 4, 4, 4, 5, 4,\n", " 4, 5, 6, 4, 4, 4, 5, 5, 5, 4, 5, 4, 4, 4, 7, 4, 5, 5, 5, 4, 5, 5,\n", " 3, 4, 4, 5, 4, 4, 5, 4, 6, 5, 6, 4])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(233)\n", "\n", "hypergeom.rvs(M, n, N, size=100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Discrete distributions with finite support\n", "\n", "We can generate r.v.s from any discrete distribution with finite support using the [`numpy.random.choice`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html#numpy.random.choice) function. When we first introduced the `numpy.random.choice` function, we said that it can be used in the form `numpy.random.choice(numpy.arange(1,n+1), k, replace=False)` or `numpy.random.choice(np.arange(1,n+1), k)` to sample $k$ times from the integers 1 through $n$, either without or with replacement. For example, to generate 5 independent $DUnif(1, 2, \\ldots, 100$) r.v.s, we can use the function invocation `numpy.random.choice(numpy.arange(1,100+1), 5)`.\n", "\n", "It turns out that `numpy.random.choice` is far more versatile. If we want to sample from the values $x_1, \\ldots, x_n$ with probabilities $p_1, \\ldots, p_n$, we simply create an array `x` containing all the $x_i$ and an array `p` containing all the $p_i$, then feed them into `choice`. For example, suppose we are interested in generating realizations of i.i.d. r.v.s $X_1, \\ldots, X_{100}$ whose PMF is\n", "\n", "\\begin{align}\n", " P(X_j = 0) &= 0.25 \\\\\n", " P(X_j = 1) &= 0.5 \\\\\n", " P(X_j = 5) &= 0.1 \\\\\n", " P(X_j = 10) &= 0.15\n", "\\end{align}\n", "\n", "and $P(X_j = x) = 0$ for all other values of $x$. First, we use the `numpy.array` function to create arrays with the support of the distribution and the PMF probabilities." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "x = np.array([0, 1, 5, 10])\n", "p = np.array([0.25, 0.5, 0.1, 0.15])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we use the `choice` function. Here's how to get 100 samples from the above PMF:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0, 1, 1, 5, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 5, 1,\n", " 0, 5, 1, 1, 1, 0, 1, 1, 0, 1, 0, 5, 1, 1, 1, 10, 0,\n", " 5, 0, 10, 1, 1, 1, 0, 1, 1, 5, 1, 5, 10, 10, 0, 1, 1,\n", " 1, 1, 10, 0, 1, 10, 0, 0, 1, 1, 5, 1, 10, 1, 0, 1, 1,\n", " 5, 1, 1, 10, 1, 1, 0, 0, 0, 1, 1, 10, 10, 0, 5, 1, 1,\n", " 5, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(377)\n", "\n", "np.random.choice(x, 100, p=p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The inputs are the array `x` to sample from, the number of samples to generate (100 in this case), the probabilities `p` of sampling the values in `x` (if this is omitted, the probabilities are assumed equal), and whether to sample with replacement (the default is `replace=True`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "© Blitzstein, Joseph K.; Hwang, Jessica. Introduction to Probability (Chapman & Hall/CRC Texts in Statistical Science)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }