{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 3: Random Variables and their Distributions\n",
    " \n",
    "This Jupyter notebook is the Python equivalent of the R code in section 3.11 R, pp. 126 - 128, [Introduction to Probability, 1st Edition](https://www.crcpress.com/Introduction-to-Probability/Blitzstein-Hwang/p/book/9781466575578), Blitzstein & Hwang.\n",
    "\n",
    "----"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Distributions in SciPy\n",
    "\n",
    "All of the named distributions that we'll encounter in this book have been implemented in R, but in this Python-based notebook we will use their equivalents in [SciPy](https://docs.scipy.org/doc/scipy/reference/index.html). In this section we'll explain how to work with the Binomial and Hypergeometric distributions in [`scipy.stats`](https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html). We will also explain in general how to generate r.v.s from any discrete distribution with a finite support. The aforementioned Statistical functions page is a handy list of the distributions in `scipy.stats`. Typing `scipy.stats.[distribution].__doc__` will display more information on the named distribution.\n",
    "\n",
    "In general, for many named discrete distributions, three functions `pmf`, `cdf`, and `rvs` will give the PMF, CDF, and random generation, respectively."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Binomial distribution\n",
    "\n",
    "The Binomial distribution [`binom`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html#scipy.stats.binom) provides the following three functions: `pmf`, `cdf`, and `rvs`. Unlike R, SciPy provides an implementation of the Bernoulli distribution in [`scipy.stats.bernoulli`](https://docs.scipy.org/doc/scipy/reference/tutorial/stats/discrete_bernoulli.html). But as an alterative for `bernoulli`, we can just use the Binomial functions with $k \\in \\{0,1\\}$ and $n = 1$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.stats import binom\n",
    "\n",
    "# to learn more about scipy.stats.binom, un-comment ouf the following line\n",
    "#print(binom.__doc__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "* `binom.pmf` is `scipy.stats`' implementation of the Binomial PMF. It takes three inputs: the first is the value of `k` at which to evaluate the PMF, and the second and third are the parameters `n` and `p`. For example, `binom.pmf(3, 5, 0.2)` returns the probability $P(X = 3)$ where $X \\sim Bin(5, 0.2)$. In other words,\n",
    " \n",
    " \n",
    "\\begin{align}\n",
    "  binom.pmf(3, 5, 0.2) &= \\binom{5}{3} (0.2)^{3} (0.8)^{2} \\\\\n",
    "  &= 0.0512 \n",
    "\\end{align}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.051200000000000016"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "k = 3\n",
    "n = 5\n",
    "p = 0.2\n",
    "\n",
    "binom.pmf(k, n, p)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "* `binom.cdf` is the Binomial CDF. It takes three inputs: the first is the value of `k` at which to evaluate the CDF, and the second and third are the parameters. $binom.cdf(3, 5, 0.2)$ is the probability $P(X \\leq 3)$ where $X \\sim Bin(5, 0.2)$. So\n",
    "\n",
    "\\begin{align}\n",
    "  binom.cdf(3, 5, 0.2) &= \\sum_{k=0}^{3} \\binom{5}{k} (0.2)^k (0.8)^{5-k} \\\\\n",
    "  &= 0.9933\n",
    "\\end{align}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.99328"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "k = 3\n",
    "n = 5\n",
    "p = 0.2\n",
    "\n",
    "binom.cdf(k, n, p)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* `binom.rvs` is a function for generating Binomial random variables. For `rvs`, the first and second inputs are still the parameters `n` and `p`, and the `size` parameter is how many r.v.s we want to generate. Thus the command `binom.rvs(5, 0.2, size=7)` produces realizations of seven i.i.d. $Bin(5, 0.2)$ r.v.s. When we ran this command, we got"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 2, 1, 0, 1, 1, 1])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# seed the random number generator\n",
    "np.random.seed(144)\n",
    "\n",
    "n = 5\n",
    "p = 0.2\n",
    "\n",
    "binom.rvs(n, p, size=7)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Unless you change the `numpy.random.seed` parameter value in code cell [1] above, you'll get the same values above the first time this notebook is run.\n",
    "\n",
    "We can also evaluate PMFs and CDFs at an entire vector of values. For example, recall that `numpy.arange(0,n+1)` is a quick way to list the integers from $0$ to $n$. The command `binom.pmf(numpy.arange(0, 5+1), 5, 0.2)` returns 6 numbers, $P(X = 0)\\text{, } P(X = 1)\\text{, } \\cdots \\text{, } P(X = 5)$, where $X \\sim Bin(5, 0.2)$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([3.2768e-01, 4.0960e-01, 2.0480e-01, 5.1200e-02, 6.4000e-03,\n",
       "       3.2000e-04])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "n = 5\n",
    "p = 0.2\n",
    "\n",
    "binom.pmf(np.arange(0, n+1), n, p)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hypergeometric distribution \n",
    "\n",
    "The Hypergeometric distribution is implemented in [`scipy.stats.hypergeom`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.hypergeom.html#scipy.stats.hypergeom), which similarly provides the following three functions: `pmf`, `cdf`, and `rvs`.  As one might expect, `pmf` is the Hypergeometric PMF, `cdf` is the Hypergeometric CDF, and `rvs` generates Hypergeometric r.v.s. Since the Hypergeometric distribution has three parameters, each of these functions takes at least three inputs. For `pmf` and `cdf`, the first input is the value at which we wish to evaluate the PMF or CDF, and the remaining three inputs are the parameters of the distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.stats import hypergeom\n",
    "\n",
    "# to learn more about scipy.stats.hypergeom, un-comment ouf the following line\n",
    "#print(hypergeom.__doc__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The PMF is defined as\n",
    "\n",
    "\\begin{align}\n",
    "  p(k, M, n, N) &= \\frac{\\binom{n}{k} \\, \\binom{M-n}{N-k}}{\\binom{M}{N}}\n",
    "\\end{align}\n",
    "\n",
    "where\n",
    "* $M$ = the total number of objects (black and white balls, for instance)\n",
    "* $n$ = the total number of objects of Type I (say we're interested in the white balls)\n",
    "* $N$ = the total number of objects drawn without replacement from $M$\n",
    "\n",
    "\n",
    "Consider the case where we have a total of $M = 17$ balls (black and white), with $n = 10$ among them being white (and 7 being black), and we draw $N = 8$ of the balls without replacement."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "n = 10       # white balls <- these are the ones we are interested in\n",
    "b = 7        # black balls\n",
    "M = n + b    # total number of balls\n",
    "N = 8        # number of balls to draw w/out replacement"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* `hypergeom.pmf(k, 17, 10, 8)` returns the probability $P(X = k)$ where $X \\sim HGeom(10, 7, 8)$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0004113533525298236"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# k = 1\n",
    "hypergeom.pmf(1, M, n, N)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* `hypergeom.cdf(k, 17, 10, 8)` returns $P(X \\leq k)$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.013368983957219274"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# k = 2\n",
    "hypergeom.cdf(2, M, n, N)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* `hypergeom.rvs(17, 10, 8, size=s)` uses the `size` parameter to specify the number of r.v.s. we want to generate. `hypergeom.rvs(17, 10, 8, size=100)` generates 100 i.i.d. $HGeom(10, 7, 8)$ r.v.s."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([6, 3, 5, 5, 5, 4, 3, 4, 7, 4, 6, 4, 6, 6, 4, 4, 5, 4, 6, 5, 5, 4,\n",
       "       6, 5, 5, 6, 5, 4, 4, 3, 5, 4, 5, 5, 6, 3, 5, 6, 6, 4, 5, 4, 5, 5,\n",
       "       4, 3, 5, 4, 5, 5, 3, 7, 5, 4, 5, 6, 4, 3, 5, 5, 3, 4, 4, 4, 5, 4,\n",
       "       4, 5, 6, 4, 4, 4, 5, 5, 5, 4, 5, 4, 4, 4, 7, 4, 5, 5, 5, 4, 5, 5,\n",
       "       3, 4, 4, 5, 4, 4, 5, 4, 6, 5, 6, 4])"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.random.seed(233)\n",
    "\n",
    "hypergeom.rvs(M, n, N, size=100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Discrete distributions with finite support\n",
    "\n",
    "We can generate r.v.s from any discrete distribution with finite support using the [`numpy.random.choice`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html#numpy.random.choice) function. When we first introduced the `numpy.random.choice` function, we said that it can be used in the form `numpy.random.choice(numpy.arange(1,n+1), k, replace=False)` or `numpy.random.choice(np.arange(1,n+1), k)` to sample $k$ times from the integers 1 through $n$, either without or with replacement. For example, to generate 5 independent $DUnif(1, 2, \\ldots, 100$) r.v.s, we can use the function invocation `numpy.random.choice(numpy.arange(1,100+1), 5)`.\n",
    "\n",
    "It turns out that `numpy.random.choice` is far more versatile. If we want to sample from the values $x_1, \\ldots, x_n$ with probabilities $p_1, \\ldots, p_n$, we simply create an array `x` containing all the $x_i$ and an array `p` containing all the $p_i$, then feed them into `choice`. For example, suppose we are interested in generating realizations of i.i.d. r.v.s $X_1, \\ldots, X_{100}$ whose PMF is\n",
    "\n",
    "\\begin{align}\n",
    "  P(X_j = 0) &= 0.25 \\\\\n",
    "  P(X_j = 1) &= 0.5 \\\\\n",
    "  P(X_j = 5) &= 0.1 \\\\\n",
    "  P(X_j = 10) &= 0.15\n",
    "\\end{align}\n",
    "\n",
    "and $P(X_j = x) = 0$ for all other values of $x$. First, we use the `numpy.array`  function to create arrays with the support of the distribution and the PMF probabilities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.array([0, 1, 5, 10])\n",
    "p = np.array([0.25, 0.5, 0.1, 0.15])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we use the `choice` function. Here's how to get 100 samples from the above PMF:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0,  1,  1,  5,  0,  1,  1,  0,  1,  0,  0,  1,  1,  0,  1,  5,  1,\n",
       "        0,  5,  1,  1,  1,  0,  1,  1,  0,  1,  0,  5,  1,  1,  1, 10,  0,\n",
       "        5,  0, 10,  1,  1,  1,  0,  1,  1,  5,  1,  5, 10, 10,  0,  1,  1,\n",
       "        1,  1, 10,  0,  1, 10,  0,  0,  1,  1,  5,  1, 10,  1,  0,  1,  1,\n",
       "        5,  1,  1, 10,  1,  1,  0,  0,  0,  1,  1, 10, 10,  0,  5,  1,  1,\n",
       "        5,  1,  1,  0,  0,  0,  1,  1,  1,  0,  1,  0,  1,  1,  0])"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.random.seed(377)\n",
    "\n",
    "np.random.choice(x, 100, p=p)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The inputs are the array `x` to sample from, the number of samples to generate (100 in this case), the probabilities `p` of sampling the values in `x` (if this is omitted, the probabilities are assumed equal), and whether to sample with replacement (the default is `replace=True`)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "&copy; Blitzstein, Joseph K.; Hwang, Jessica. Introduction to Probability (Chapman & Hall/CRC Texts in Statistical Science)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}