{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 1: Probability and counting\n",
    " \n",
    "This Jupyter notebook is the Python equivalent of the R code in section 1.8 R, pp. 27 - 30, [Introduction to Probability, 1st Edition](https://www.crcpress.com/Introduction-to-Probability/Blitzstein-Hwang/p/book/9781466575578), Blitzstein & Hwang.\n",
    "\n",
    "----"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Vectors\n",
    "\n",
    "Rather than using the usual Python list (array), for probability and statistics it is more convenient to use a [NumPy](https://docs.scipy.org/doc/numpy/user/basics.creation.html) array.\n",
    "\n",
    "In Python, you can import the `numpy` module using the `np` shorthand alias idiom to save on typing. Some of the examples later in this notebook rely on random number generation, so we will seed the NumPy pseudo-random number generator with an arbitrary value so that if you run a code block, you should see the exact same results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we pass in a list of integers to the [`numpy.array`](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.array.html) function to create an instance of a NumPy array, and then print the array as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[3 1 4 5 9]\n"
     ]
    }
   ],
   "source": [
    "v = np.array([3, 1, 4, 5, 9])\n",
    "print(v)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `numpy` module (shortened to the alias `np`) provides many convenient object instance methods that work on an `array`. It is possible to use the functions of the `numpy` module, or invoke these functions as object instance methods of an `array`. Here are a few examples:\n",
    "\n",
    "* [`v.sum()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html) - sum array elements; equivalent to `numpy.sum(v)`\n",
    "* [`v.max()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.max.html#numpy.ndarray.max) - return the maximum value of the array; equivalent to `numpy.max(v)`\n",
    "* [`v.min()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.min.html#numpy.ndarray.min) -  return the maximum value of the array; equivalent to `numpy.min(v)`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sum of v = 22\n",
      "maximum of v = 9\n",
      "minimum of v = 1\n"
     ]
    }
   ],
   "source": [
    "print('sum of v =', v.sum())\n",
    "\n",
    "print('maximum of v =', v.max())\n",
    "\n",
    "print('minimum of v =', v.min())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To obtain the number of elements in the `array`, you can use either:\n",
    "\n",
    "* the Python built-in [`len`](https://docs.python.org/3.6/library/functions.html#len) function, passing in the `array`\n",
    "* the [`size`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.size.html#numpy.ndarray.size) attribute of the NumPy `array`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "len(v) = 5\n",
      "v.size = 5\n"
     ]
    }
   ],
   "source": [
    "print('len(v) =', len(v))\n",
    "\n",
    "print('v.size =', v.size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To create a NumPy `array` from the sequence $(1, 2, \\dots, n)$, use the [`numpy.arange`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.arange.html) function. In general, an ascending-order sequence $(m, \\dots, n)$ can be generated with `arange(m, n+1)`. Note that the second parameter `stop` to the `arange` function is _not_ inclusive, so for an ascending-order sequence you must increment `stop` by 1.\n",
    "\n",
    "A descending-order sequence can be generated by providing -1 as the third parameter `step` to the `arange` function. However, for the reason mentioned above, you must conversely _decrement_ the second parameter `stop` if you want to include this value in the `array`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ 1  2  3  4  5  6  7  8  9 10]\n",
      "[10  9  8  7  6  5  4  3  2  1]\n"
     ]
    }
   ],
   "source": [
    "m = 1\n",
    "n = 10\n",
    "\n",
    "# example of ascending order sequence from m to n, inclusive\n",
    "v = np.arange(m, n+1)\n",
    "print(v)\n",
    "\n",
    "## example of descending order sequence\n",
    "v_reversed = np.arange(n, m-1, -1)\n",
    "print(v_reversed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Like a Python `list` (and unlike vectors in R), NumPy `array` is zero-indexed. To access the `i`<sup>th</sup> element of an `array` named `v`, use `v[i-1]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1st element (index 0) of v is 1\n",
      "10th element (index 9) of v = 10\n"
     ]
    }
   ],
   "source": [
    "print('1st element (index 0) of v is', v[0])\n",
    "\n",
    "print('10th element (index 9) of v =', v[9])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To access a subset of the `array` members, you can pass another NumPy `array` of the target indices to the indexer of your `array`.\n",
    "\n",
    "You can also use a NumPy `array` of boolean values (`True`, `False`) to index your `array`, keeping any elements where the index corresponds to `True` and filtering out elements where the corresponding index is `False`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "elements of v at indices 1, 3 and 5:  [2 4 6]\n",
      "elements of v except at indices 2, 4 and 6:  [ 1  2  4  6  8  9 10]\n"
     ]
    }
   ],
   "source": [
    "# NumPy array for the indices 1, 3, 5\n",
    "subset_ind = np.array([1, 3, 5])\n",
    "print('elements of v at indices 1, 3 and 5: ', v[subset_ind])\n",
    "\n",
    "# boolean indexer: all True indices will be included in the subset,\n",
    "# while all False indexes will be filtered out\n",
    "# keep all EXCEPT for array values at indices 2, 4, and 6\n",
    "boolean_ind = np.array([True, True, False, True, False, True, False, True, True, True])\n",
    "print('elements of v except at indices 2, 4 and 6: ', v[boolean_ind])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many operations on NumPy `array` are interpreted _component-wise_.  For example, in math the cube of a vector doesn't have a standard definition, but\n",
    "\n",
    "    v**3\n",
    "    \n",
    "simply cubes each element individually."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[   1    8   27   64  125  216  343  512  729 1000]\n"
     ]
    }
   ],
   "source": [
    "print(v**3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly, \n",
    "\n",
    "    1 / np.arange(1, 101)**2\n",
    "\n",
    "is a very compact way to get the vector $1, \\frac{1}{2^2}, \\frac{1}{3^2},  \\ldots , \\frac{1}{100^2}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1.00000000e+00, 2.50000000e-01, 1.11111111e-01, 6.25000000e-02,\n",
       "       4.00000000e-02, 2.77777778e-02, 2.04081633e-02, 1.56250000e-02,\n",
       "       1.23456790e-02, 1.00000000e-02, 8.26446281e-03, 6.94444444e-03,\n",
       "       5.91715976e-03, 5.10204082e-03, 4.44444444e-03, 3.90625000e-03,\n",
       "       3.46020761e-03, 3.08641975e-03, 2.77008310e-03, 2.50000000e-03,\n",
       "       2.26757370e-03, 2.06611570e-03, 1.89035917e-03, 1.73611111e-03,\n",
       "       1.60000000e-03, 1.47928994e-03, 1.37174211e-03, 1.27551020e-03,\n",
       "       1.18906064e-03, 1.11111111e-03, 1.04058273e-03, 9.76562500e-04,\n",
       "       9.18273646e-04, 8.65051903e-04, 8.16326531e-04, 7.71604938e-04,\n",
       "       7.30460190e-04, 6.92520776e-04, 6.57462196e-04, 6.25000000e-04,\n",
       "       5.94883998e-04, 5.66893424e-04, 5.40832883e-04, 5.16528926e-04,\n",
       "       4.93827160e-04, 4.72589792e-04, 4.52693526e-04, 4.34027778e-04,\n",
       "       4.16493128e-04, 4.00000000e-04, 3.84467512e-04, 3.69822485e-04,\n",
       "       3.55998576e-04, 3.42935528e-04, 3.30578512e-04, 3.18877551e-04,\n",
       "       3.07787011e-04, 2.97265161e-04, 2.87273772e-04, 2.77777778e-04,\n",
       "       2.68744961e-04, 2.60145682e-04, 2.51952633e-04, 2.44140625e-04,\n",
       "       2.36686391e-04, 2.29568411e-04, 2.22766763e-04, 2.16262976e-04,\n",
       "       2.10039908e-04, 2.04081633e-04, 1.98373339e-04, 1.92901235e-04,\n",
       "       1.87652468e-04, 1.82615047e-04, 1.77777778e-04, 1.73130194e-04,\n",
       "       1.68662506e-04, 1.64365549e-04, 1.60230732e-04, 1.56250000e-04,\n",
       "       1.52415790e-04, 1.48720999e-04, 1.45158949e-04, 1.41723356e-04,\n",
       "       1.38408304e-04, 1.35208221e-04, 1.32117849e-04, 1.29132231e-04,\n",
       "       1.26246686e-04, 1.23456790e-04, 1.20758363e-04, 1.18147448e-04,\n",
       "       1.15620303e-04, 1.13173382e-04, 1.10803324e-04, 1.08506944e-04,\n",
       "       1.06281220e-04, 1.04123282e-04, 1.02030405e-04, 1.00000000e-04])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "1 / np.arange(1, 101)**2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In math, $v+w$ is undefined if $v$ and $w$ are vectors of different lengths, but with a NumPy `array`, the shorter vector gets [broadcast](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html). For example, $v+3$ essentially turns 3 into a vector of 3's of the same length as `v`, thus adding 3 to each element of $v$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ 4  5  6  7  8  9 10 11 12 13]\n"
     ]
    }
   ],
   "source": [
    "print(v + 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Factorials and binomial coefficients\n",
    "\n",
    "We can compute $n!$ using [`scipy.special.factorial(n)`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.factorial.html#scipy.special.factorial) and $\\binom{n}{k}$ using [`scipy.special.comb(n, k)`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.comb.html#scipy.special.comb). As we have seen, factorials grow extremely quickly. What is the largest $n$ for which `scipy.special.factorial` returns a number? Beyond that point, `scipy.special.factorial` will return `inf` (infinity), without a warning message. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5! = 120.0\n",
      "6 choose 2 = 15.0\n",
      "200! = inf ?\n"
     ]
    }
   ],
   "source": [
    "from scipy.special import factorial, comb\n",
    "\n",
    "# to learn more about scipy.special.factorial, un-comment out the following line\n",
    "#print(factorial.__doc__)\n",
    "\n",
    "# to learn more about scipy.special.comb, un-comment out the following line\n",
    "#print(comb.__doc__)\n",
    "\n",
    "print('5! =', factorial(5))\n",
    "\n",
    "print('6 choose 2 =', comb(6, 2))\n",
    "\n",
    "print('200! =', factorial(200), '?')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But it is still possible to use [`scipy.special.gammaln`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.gammaln.html#scipy.special.gammaln) define a function to easily compute $log(n!)$. Similarly, it would then be possible to compose a function that computes $log \\binom{n}{k}$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "log(200!) = 863.2319871924054\n",
      "log 100000 choose 999 = 5591.190431187046\n"
     ]
    }
   ],
   "source": [
    "from scipy.special import gammaln\n",
    "\n",
    "# to learn more about scipy.special.gammaln, un-comment out the following line\n",
    "#print(gammaln.__doc__)\n",
    "\n",
    "def logfactorial(n):\n",
    "    return gammaln(n+1)\n",
    "\n",
    "def logchoose(n, k):\n",
    "    num = logfactorial(n)\n",
    "    denom = logfactorial(n-k) + logfactorial(k)\n",
    "    return num - denom\n",
    "\n",
    "print('log(200!) =', logfactorial(200))\n",
    "\n",
    "print('log 100000 choose 999 =', logchoose(100000, 999))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sampling and simulation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "The function [`numpy.random.choice`](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.choice.html) is a useful way of drawing random samples in NumPy. (Technically, they are pseudo-random since there is an underlying deterministic algorithm, but they \"look like\" random samples for almost all practical purposes.) For example,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 3, 10,  7,  5,  1])"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# seed the random number generator\n",
    "np.random.seed(1)\n",
    "\n",
    "# Example: sampling without replacement\n",
    "#\n",
    "# do not forget that Python arrays are zero-indexed,\n",
    "# and the 2nd argument to NumPy arange must be incremented by 1\n",
    "# if you want to include that value\n",
    "n = 10\n",
    "k = 5\n",
    "np.random.choice(np.arange(1, n+1), k, replace=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "generates a random sample of 5 of the numbers from 1 to 10, without replacement, and with equal probabilities given to each number. To sample with replacement instead, you can explicitly specify `replace=True`, or you may leave that argument out altogether since the default for `numpy.random.choice` is `replace=True`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 6,  9, 10,  6,  1])"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.random.seed(1)\n",
    "\n",
    "# Example: sampling with replacement\n",
    "np.random.choice(np.arange(1, n+1), k, replace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To obtain a random permutation of an `array` of numbers $1, 2, \\ldots, n$ we can use [`numpy.random.shuffle`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.shuffle.html). Note that this function operates on the given `array` in-place."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "v = [ 1  2  3  4  5  6  7  8  9 10]\n",
      "v, shuffled = [ 5  2  6  1  8  3  4  7 10  9]\n"
     ]
    }
   ],
   "source": [
    "np.random.seed(2)\n",
    "\n",
    "m = 1\n",
    "n = 10\n",
    "\n",
    "v = np.arange(m, n+1)\n",
    "print('v =', v)\n",
    "\n",
    "np.random.shuffle(v)\n",
    "print('v, shuffled =', v)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also use `numpy.random.choice` to draw from a non-numeric `list` or `array`. For example, the Python built-in function [`list`](https://docs.python.org/3.6/library/functions.html#func-list) can be used to transform a string of the 26 lowercase letters of the English alphabet into a list of individual letters. `numpy.random.choice` will generate a random 7-letter \"word\" by sampling from the list of lowercase alphabet chars derived from [`string.ascii_lowercase`](https://docs.python.org/3/library/string.html), without replacement. Lastly, the Python String function [`join`](https://docs.python.org/3.6/library/stdtypes.html?highlight=join#str.join) concatenates the 7 randomly selected letters into a \"word\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'srmxpqn'"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.random.seed(3)\n",
    "\n",
    "import string\n",
    "\n",
    "# split string of lower-case alphabets into an array\n",
    "alpha = list(string.ascii_lowercase)\n",
    "\n",
    "# randomly choose 7 letters, concatenate into a string\n",
    "''.join(np.random.choice(alpha, 7, replace=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`numpy.random.choice` also allows us to specify general probabilities for sampling each number. For example,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 3, 1], dtype=int64)"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.random.seed(5)\n",
    "\n",
    "# from the 4 numbers starting from 0\n",
    "# obtain a sample of size 3\n",
    "# with replacement\n",
    "# using the probabilities listed in p\n",
    "np.random.choice(4, 3, replace=True, p=[0.1, 0.2, 0.3, 0.4])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "samples 3 numbers between 0 and 3, with replacement, and with probabilities given by the parameter `p=[0.1, 0.2, 0.3, 0.4]`. If the sampling is without replacement, then at each stage the probability of any not-yet-chosen number is _proportional_ to its original probability."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Matching problem simulation\n",
    "\n",
    "Let's show by simulation that the probability of a matching card in Example 1.6.4 is approximately $1 − 1/e$ when the deck is sufficiently large. Using [`numpy.random.permutation`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.permutation.html#numpy.random.permutation) while iterating in a for-loop (see Python flow controls [`for`](https://docs.python.org/3/tutorial/controlflow.html#for-statements) and [`range`](https://docs.python.org/3/tutorial/controlflow.html#the-range-function)), we can perform the experiment a bunch of times and see how many times we encounter at least one matching card:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "simulated value: 0.62\n",
      "expected value : 0.63\n"
     ]
    }
   ],
   "source": [
    "np.random.seed(8)\n",
    "\n",
    "n = 100\n",
    "trials = 10000\n",
    "\n",
    "ordered = np.arange(1, n+1)\n",
    "\n",
    "tmp = []\n",
    "for i in range(trials):    \n",
    "    shuffled = np.random.permutation(np.arange(1, n+1))\n",
    "    m = np.sum(shuffled == ordered)\n",
    "    tmp.append(m)\n",
    "    \n",
    "results = np.array(tmp)\n",
    "ans = np.sum(results >= 1) / trials\n",
    "\n",
    "expected = 1 - 1/np.e\n",
    "\n",
    "print('simulated value: {:.2F}'.format(ans))\n",
    "print('expected value : {:.2F}'.format(expected))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we declare and assign values to variables for the size of the deck `n`, and the number of `trials` in our simulation.\n",
    "\n",
    "Next, we generate a sequence from 1 to `n` (stopping at `n+1` to include `n`) to represent our ordered deck of cards.\n",
    "\n",
    "The code then loops for `trial` number of times, where\n",
    "* a permutation of a new sequence from 1 to `n` is created\n",
    "* the number of cards (indices) that match with our `ordered` sequence are counted as `m`\n",
    "* the number of matches `m` are saved to a temporary accumulator array `tmp`\n",
    "\n",
    "After completing `trial` simulations, we create a NumPy `array` `results` from the `tmp` accumulator, which lets us count the number of simulations where there was at least 1 match.\n",
    "\n",
    "Finally, we add up the number of times where there was at least one matching card, and we divide by the number of simulations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Birthday problem calculation and simulation\n",
    "\n",
    "The following code uses the NumPy module functions [`numpy.prod`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.prod.html)  (which gives the product of the elements of the given `array`) and [`numpy.sum`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html) to calculate the probability of at least one birthday match in a group of 23 people:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5072972343239857"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "k = 23\n",
    "days_in_year = 365\n",
    "\n",
    "numer = np.arange(days_in_year, days_in_year-k, -1)\n",
    "denom = np.array([days_in_year]*k)\n",
    "\n",
    "1.0 - np.sum(np.prod(numer/denom))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Unfortunately, NumPy and SciPy do not have library functions such as [`pbirthday` and `qbirthday`](https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/birthday) for computing birthday probabilities as does R. However, you can easily compose your own functions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "50.7% chance of at least 1 birthday match among 23 people\n",
      "For an approximately 50.0% chance of at least 1 birthday match, we need 23 people\n"
     ]
    }
   ],
   "source": [
    "def pbirthday_naive(n):\n",
    "    \"\"\" Return the probability of at least 1 matching pair of birthdays out of n people.\n",
    "        Assumes 365-day year.\n",
    "    \n",
    "        This is a naive implementation that may overflow or error under certain input.\n",
    "    \"\"\"\n",
    "    days_in_year = 365\n",
    "    if n < 2:\n",
    "        return 0.0\n",
    "    elif n > days_in_year:\n",
    "        return 1.0\n",
    "    else:\n",
    "        numer = np.arange(days_in_year, days_in_year-n, -1)\n",
    "        denom = np.array([days_in_year]*n)\n",
    "        return 1.0 - np.sum(np.prod( numer/denom ))\n",
    "\n",
    "def qbirthday_naive(p):\n",
    "    \"\"\" Return an approximation of the number of people required to have at least 1 birthday\n",
    "        match with probability p.\n",
    "        \n",
    "        This naive implemention is based on the Birthday problem article on Wikipedia.\n",
    "        https://en.wikipedia.org/wiki/Birthday_problem\n",
    "    \"\"\"\n",
    "    raw = np.sqrt( 2 * 365 * np.log(1.0 / (1.0-p)) )\n",
    "    return np.ceil(raw).astype(np.int32)\n",
    "\n",
    "n = 23\n",
    "pbirthday_naive_ans = pbirthday_naive(n)\n",
    "print(('{:.1f}% chance of at least 1 birthday match '\n",
    "       'among {} people'.format(pbirthday_naive_ans*100.0, n)))\n",
    "\n",
    "p = 0.50\n",
    "qbirthday_naive_ans = qbirthday_naive(p)\n",
    "print(('For an approximately {:.1f}% chance of at least 1 birthday match, '\n",
    "       'we need {} people'.format(p*100.0, qbirthday_naive_ans)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also find the probability of having at least one _triple birthday match_, i.e., three people with the same birthday; with the `pbirthday` and `qbirthday` functions below, all we have to do is add `coincident=3` to say we're looking for triple matches. For example, `pbirthday(23, coincident=3)` returns 0.014, so 23 people give us only a 1.4% chance of a triple birthday match. `qbirthday(0.5, coincident=3)` returns 88, so we'd need 88 people to have at least a 50% chance of at least one triple birthday match."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.4% chance of triple birthday match among 23 people\n",
      "We need 88 people for a triple match with probability of 50.0%\n"
     ]
    }
   ],
   "source": [
    "def pbirthday(n, classes=365, coincident=2):\n",
    "    \"\"\" Return the probability of a coincidence of a birthday match among n people.\n",
    "    \n",
    "        Python implementation of R's pbirthday {stats}.\n",
    "    \"\"\"\n",
    "    k = coincident\n",
    "    c = classes\n",
    "    \n",
    "    if k < 2:\n",
    "        return 1.0\n",
    "    if k == 2:\n",
    "        numer = np.arange(c, c-n, -1)\n",
    "        denom = np.array([c]*n)\n",
    "        return 1.0 - np.sum(np.prod( numer/denom ))\n",
    "    if k > n: \n",
    "        return 0.0\n",
    "    if n > c * (k - 1):\n",
    "        return 1.0\n",
    "    else:\n",
    "        lhs = n * np.exp(-n/(c * k))/(1 - n/(c * (k + 1)))**(1/k)\n",
    "        lxx = k * np.log(lhs) - (k - 1) * np.log(c) - gammaln(k + 1)\n",
    "        return -np.expm1(-np.exp(lxx))\n",
    "\n",
    "def qbirthday(prob=0.5, classes=365, coincident=2):\n",
    "    \"\"\" Return the smallest number of people needed to have at least\n",
    "        the specified probability of coincidence.\n",
    "    \n",
    "        Python implementation of R's qbirthday {stats}.\n",
    "    \"\"\"\n",
    "    k = coincident\n",
    "    c = classes\n",
    "    p = prob\n",
    "    \n",
    "    if p <= 0.0:\n",
    "        return 1\n",
    "    if p >= 1.0:\n",
    "        return c * (k - 1) + 1\n",
    "    \n",
    "    N = np.exp(((k - 1) * np.log(c) + gammaln(k + 1) + np.log(-np.log1p(-p)))/k)\n",
    "    N = np.ceil(N).astype(np.int32)\n",
    "    if pbirthday(N, c, k) < p:\n",
    "        N += 1\n",
    "        while pbirthday(N, c, k) < p:\n",
    "            N += 1\n",
    "    elif pbirthday(N - 1, c, k) >= p:\n",
    "        N -= 1\n",
    "        while pbirthday(N - 1, c, k) >= p:\n",
    "            N -= 1\n",
    "    return N\n",
    "\n",
    "n = 23\n",
    "pbirthday_ans = pbirthday(n, coincident=3)\n",
    "print(('{:.1F}% chance of triple birthday match '\n",
    "       'among {} people'.format(pbirthday_ans*100.0, n)))\n",
    "\n",
    "p = 0.5\n",
    "qbirthday_ans = qbirthday(p, coincident=3)\n",
    "print(('We need {} people for a triple match '\n",
    "       'with probability of {:.1F}%'.format(qbirthday_ans, p*100.0)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To simulate the birthday problem, we can use"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(13)\n",
    "\n",
    "b = np.random.choice(np.arange(1, 365+1), size=23, replace=True)\n",
    "u, c = np.unique(b, return_counts=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "to generate random birthdays for 23 people and then use [`numpy.unique`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html), specifiying `return_counts=True`,  to tally up the counts of how many people were born on each day. We can run 10<sup>4</sup> repetitions as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "In a simulation with 10000 trials, the probability of a birthday match with 23 people is 0.5118\n"
     ]
    }
   ],
   "source": [
    "np.random.seed(21)\n",
    "\n",
    "n = 23\n",
    "trials = 10**4\n",
    "\n",
    "ordered = np.arange(1, n+1)\n",
    "\n",
    "m = 0\n",
    "for i in range(trials):    \n",
    "    b = np.random.choice(np.arange(1, 365+1), size=n, replace=True)\n",
    "    u, c = np.unique(b, return_counts=True)\n",
    "    if np.sum(c > 1):\n",
    "        m += 1\n",
    "\n",
    "p =  m / trials  \n",
    "print(('In a simulation with {} trials, ' \n",
    "       'the probability of a birthday match with {} people '\n",
    "       'is {}'.format(trials, n, p)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the probabilities of various days are not all equal, the calculation becomes much more difficult, but the simulation can easily be extended since `numpy.random.choice` allows us to specify a list `p` of the probability of each day (by default sample assigns equal probabilities, so in the above the probability is 1/365 for each day)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "&copy; Blitzstein, Joseph K.; Hwang, Jessica. Introduction to Probability (Chapman & Hall/CRC Texts in Statistical Science)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  },
  "notebook_info": {
   "author": "Joseph K. Blitzstein, Jessica Hwang",
   "chapter": "1. Probability and Counting",
   "creator": "buruzaemon",
   "section": "1.8",
   "title": "Introduction to Probability, 1st Edition"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}