{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Chapter 1: Probability and counting\n", " \n", "This Jupyter notebook is the Python equivalent of the R code in section 1.8 R, pp. 27 - 30, [Introduction to Probability, 1st Edition](https://www.crcpress.com/Introduction-to-Probability/Blitzstein-Hwang/p/book/9781466575578), Blitzstein & Hwang.\n", "\n", "----" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Vectors\n", "\n", "Rather than using the usual Python list (array), for probability and statistics it is more convenient to use a [NumPy](https://docs.scipy.org/doc/numpy/user/basics.creation.html) array.\n", "\n", "In Python, you can import the `numpy` module using the `np` shorthand alias idiom to save on typing. Some of the examples later in this notebook rely on random number generation, so we will seed the NumPy pseudo-random number generator with an arbitrary value so that if you run a code block, you should see the exact same results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we pass in a list of integers to the [`numpy.array`](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.array.html) function to create an instance of a NumPy array, and then print the array as follows:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[3 1 4 5 9]\n" ] } ], "source": [ "v = np.array([3, 1, 4, 5, 9])\n", "print(v)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `numpy` module (shortened to the alias `np`) provides many convenient object instance methods that work on an `array`. It is possible to use the functions of the `numpy` module, or invoke these functions as object instance methods of an `array`. Here are a few examples:\n", "\n", "* [`v.sum()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html) - sum array elements; equivalent to `numpy.sum(v)`\n", "* [`v.max()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.max.html#numpy.ndarray.max) - return the maximum value of the array; equivalent to `numpy.max(v)`\n", "* [`v.min()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.min.html#numpy.ndarray.min) - return the maximum value of the array; equivalent to `numpy.min(v)`" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sum of v = 22\n", "maximum of v = 9\n", "minimum of v = 1\n" ] } ], "source": [ "print('sum of v =', v.sum())\n", "\n", "print('maximum of v =', v.max())\n", "\n", "print('minimum of v =', v.min())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To obtain the number of elements in the `array`, you can use either:\n", "\n", "* the Python built-in [`len`](https://docs.python.org/3.6/library/functions.html#len) function, passing in the `array`\n", "* the [`size`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.size.html#numpy.ndarray.size) attribute of the NumPy `array`" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "len(v) = 5\n", "v.size = 5\n" ] } ], "source": [ "print('len(v) =', len(v))\n", "\n", "print('v.size =', v.size)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To create a NumPy `array` from the sequence $(1, 2, \\dots, n)$, use the [`numpy.arange`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.arange.html) function. In general, an ascending-order sequence $(m, \\dots, n)$ can be generated with `arange(m, n+1)`. Note that the second parameter `stop` to the `arange` function is _not_ inclusive, so for an ascending-order sequence you must increment `stop` by 1.\n", "\n", "A descending-order sequence can be generated by providing -1 as the third parameter `step` to the `arange` function. However, for the reason mentioned above, you must conversely _decrement_ the second parameter `stop` if you want to include this value in the `array`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 1 2 3 4 5 6 7 8 9 10]\n", "[10 9 8 7 6 5 4 3 2 1]\n" ] } ], "source": [ "m = 1\n", "n = 10\n", "\n", "# example of ascending order sequence from m to n, inclusive\n", "v = np.arange(m, n+1)\n", "print(v)\n", "\n", "## example of descending order sequence\n", "v_reversed = np.arange(n, m-1, -1)\n", "print(v_reversed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like a Python `list` (and unlike vectors in R), NumPy `array` is zero-indexed. To access the `i`th element of an `array` named `v`, use `v[i-1]`." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1st element (index 0) of v is 1\n", "10th element (index 9) of v = 10\n" ] } ], "source": [ "print('1st element (index 0) of v is', v[0])\n", "\n", "print('10th element (index 9) of v =', v[9])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To access a subset of the `array` members, you can pass another NumPy `array` of the target indices to the indexer of your `array`.\n", "\n", "You can also use a NumPy `array` of boolean values (`True`, `False`) to index your `array`, keeping any elements where the index corresponds to `True` and filtering out elements where the corresponding index is `False`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "elements of v at indices 1, 3 and 5: [2 4 6]\n", "elements of v except at indices 2, 4 and 6: [ 1 2 4 6 8 9 10]\n" ] } ], "source": [ "# NumPy array for the indices 1, 3, 5\n", "subset_ind = np.array([1, 3, 5])\n", "print('elements of v at indices 1, 3 and 5: ', v[subset_ind])\n", "\n", "# boolean indexer: all True indices will be included in the subset,\n", "# while all False indexes will be filtered out\n", "# keep all EXCEPT for array values at indices 2, 4, and 6\n", "boolean_ind = np.array([True, True, False, True, False, True, False, True, True, True])\n", "print('elements of v except at indices 2, 4 and 6: ', v[boolean_ind])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many operations on NumPy `array` are interpreted _component-wise_. For example, in math the cube of a vector doesn't have a standard definition, but\n", "\n", " v**3\n", " \n", "simply cubes each element individually." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 1 8 27 64 125 216 343 512 729 1000]\n" ] } ], "source": [ "print(v**3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, \n", "\n", " 1 / np.arange(1, 101)**2\n", "\n", "is a very compact way to get the vector $1, \\frac{1}{2^2}, \\frac{1}{3^2}, \\ldots , \\frac{1}{100^2}$" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1.00000000e+00, 2.50000000e-01, 1.11111111e-01, 6.25000000e-02,\n", " 4.00000000e-02, 2.77777778e-02, 2.04081633e-02, 1.56250000e-02,\n", " 1.23456790e-02, 1.00000000e-02, 8.26446281e-03, 6.94444444e-03,\n", " 5.91715976e-03, 5.10204082e-03, 4.44444444e-03, 3.90625000e-03,\n", " 3.46020761e-03, 3.08641975e-03, 2.77008310e-03, 2.50000000e-03,\n", " 2.26757370e-03, 2.06611570e-03, 1.89035917e-03, 1.73611111e-03,\n", " 1.60000000e-03, 1.47928994e-03, 1.37174211e-03, 1.27551020e-03,\n", " 1.18906064e-03, 1.11111111e-03, 1.04058273e-03, 9.76562500e-04,\n", " 9.18273646e-04, 8.65051903e-04, 8.16326531e-04, 7.71604938e-04,\n", " 7.30460190e-04, 6.92520776e-04, 6.57462196e-04, 6.25000000e-04,\n", " 5.94883998e-04, 5.66893424e-04, 5.40832883e-04, 5.16528926e-04,\n", " 4.93827160e-04, 4.72589792e-04, 4.52693526e-04, 4.34027778e-04,\n", " 4.16493128e-04, 4.00000000e-04, 3.84467512e-04, 3.69822485e-04,\n", " 3.55998576e-04, 3.42935528e-04, 3.30578512e-04, 3.18877551e-04,\n", " 3.07787011e-04, 2.97265161e-04, 2.87273772e-04, 2.77777778e-04,\n", " 2.68744961e-04, 2.60145682e-04, 2.51952633e-04, 2.44140625e-04,\n", " 2.36686391e-04, 2.29568411e-04, 2.22766763e-04, 2.16262976e-04,\n", " 2.10039908e-04, 2.04081633e-04, 1.98373339e-04, 1.92901235e-04,\n", " 1.87652468e-04, 1.82615047e-04, 1.77777778e-04, 1.73130194e-04,\n", " 1.68662506e-04, 1.64365549e-04, 1.60230732e-04, 1.56250000e-04,\n", " 1.52415790e-04, 1.48720999e-04, 1.45158949e-04, 1.41723356e-04,\n", " 1.38408304e-04, 1.35208221e-04, 1.32117849e-04, 1.29132231e-04,\n", " 1.26246686e-04, 1.23456790e-04, 1.20758363e-04, 1.18147448e-04,\n", " 1.15620303e-04, 1.13173382e-04, 1.10803324e-04, 1.08506944e-04,\n", " 1.06281220e-04, 1.04123282e-04, 1.02030405e-04, 1.00000000e-04])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1 / np.arange(1, 101)**2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In math, $v+w$ is undefined if $v$ and $w$ are vectors of different lengths, but with a NumPy `array`, the shorter vector gets [broadcast](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html). For example, $v+3$ essentially turns 3 into a vector of 3's of the same length as `v`, thus adding 3 to each element of $v$." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 4 5 6 7 8 9 10 11 12 13]\n" ] } ], "source": [ "print(v + 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Factorials and binomial coefficients\n", "\n", "We can compute $n!$ using [`scipy.special.factorial(n)`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.factorial.html#scipy.special.factorial) and $\\binom{n}{k}$ using [`scipy.special.comb(n, k)`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.comb.html#scipy.special.comb). As we have seen, factorials grow extremely quickly. What is the largest $n$ for which `scipy.special.factorial` returns a number? Beyond that point, `scipy.special.factorial` will return `inf` (infinity), without a warning message. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5! = 120.0\n", "6 choose 2 = 15.0\n", "200! = inf ?\n" ] } ], "source": [ "from scipy.special import factorial, comb\n", "\n", "# to learn more about scipy.special.factorial, un-comment out the following line\n", "#print(factorial.__doc__)\n", "\n", "# to learn more about scipy.special.comb, un-comment out the following line\n", "#print(comb.__doc__)\n", "\n", "print('5! =', factorial(5))\n", "\n", "print('6 choose 2 =', comb(6, 2))\n", "\n", "print('200! =', factorial(200), '?')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But it is still possible to use [`scipy.special.gammaln`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.gammaln.html#scipy.special.gammaln) define a function to easily compute $log(n!)$. Similarly, it would then be possible to compose a function that computes $log \\binom{n}{k}$." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "log(200!) = 863.2319871924054\n", "log 100000 choose 999 = 5591.190431187046\n" ] } ], "source": [ "from scipy.special import gammaln\n", "\n", "# to learn more about scipy.special.gammaln, un-comment out the following line\n", "#print(gammaln.__doc__)\n", "\n", "def logfactorial(n):\n", " return gammaln(n+1)\n", "\n", "def logchoose(n, k):\n", " num = logfactorial(n)\n", " denom = logfactorial(n-k) + logfactorial(k)\n", " return num - denom\n", "\n", "print('log(200!) =', logfactorial(200))\n", "\n", "print('log 100000 choose 999 =', logchoose(100000, 999))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sampling and simulation" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "The function [`numpy.random.choice`](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.choice.html) is a useful way of drawing random samples in NumPy. (Technically, they are pseudo-random since there is an underlying deterministic algorithm, but they \"look like\" random samples for almost all practical purposes.) For example," ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 3, 10, 7, 5, 1])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# seed the random number generator\n", "np.random.seed(1)\n", "\n", "# Example: sampling without replacement\n", "#\n", "# do not forget that Python arrays are zero-indexed,\n", "# and the 2nd argument to NumPy arange must be incremented by 1\n", "# if you want to include that value\n", "n = 10\n", "k = 5\n", "np.random.choice(np.arange(1, n+1), k, replace=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "generates a random sample of 5 of the numbers from 1 to 10, without replacement, and with equal probabilities given to each number. To sample with replacement instead, you can explicitly specify `replace=True`, or you may leave that argument out altogether since the default for `numpy.random.choice` is `replace=True`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 6, 9, 10, 6, 1])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(1)\n", "\n", "# Example: sampling with replacement\n", "np.random.choice(np.arange(1, n+1), k, replace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To obtain a random permutation of an `array` of numbers $1, 2, \\ldots, n$ we can use [`numpy.random.shuffle`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.shuffle.html). Note that this function operates on the given `array` in-place." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "v = [ 1 2 3 4 5 6 7 8 9 10]\n", "v, shuffled = [ 5 2 6 1 8 3 4 7 10 9]\n" ] } ], "source": [ "np.random.seed(2)\n", "\n", "m = 1\n", "n = 10\n", "\n", "v = np.arange(m, n+1)\n", "print('v =', v)\n", "\n", "np.random.shuffle(v)\n", "print('v, shuffled =', v)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use `numpy.random.choice` to draw from a non-numeric `list` or `array`. For example, the Python built-in function [`list`](https://docs.python.org/3.6/library/functions.html#func-list) can be used to transform a string of the 26 lowercase letters of the English alphabet into a list of individual letters. `numpy.random.choice` will generate a random 7-letter \"word\" by sampling from the list of lowercase alphabet chars derived from [`string.ascii_lowercase`](https://docs.python.org/3/library/string.html), without replacement. Lastly, the Python String function [`join`](https://docs.python.org/3.6/library/stdtypes.html?highlight=join#str.join) concatenates the 7 randomly selected letters into a \"word\"." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'srmxpqn'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(3)\n", "\n", "import string\n", "\n", "# split string of lower-case alphabets into an array\n", "alpha = list(string.ascii_lowercase)\n", "\n", "# randomly choose 7 letters, concatenate into a string\n", "''.join(np.random.choice(alpha, 7, replace=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`numpy.random.choice` also allows us to specify general probabilities for sampling each number. For example," ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 3, 1], dtype=int64)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(5)\n", "\n", "# from the 4 numbers starting from 0\n", "# obtain a sample of size 3\n", "# with replacement\n", "# using the probabilities listed in p\n", "np.random.choice(4, 3, replace=True, p=[0.1, 0.2, 0.3, 0.4])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "samples 3 numbers between 0 and 3, with replacement, and with probabilities given by the parameter `p=[0.1, 0.2, 0.3, 0.4]`. If the sampling is without replacement, then at each stage the probability of any not-yet-chosen number is _proportional_ to its original probability." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Matching problem simulation\n", "\n", "Let's show by simulation that the probability of a matching card in Example 1.6.4 is approximately $1 − 1/e$ when the deck is sufficiently large. Using [`numpy.random.permutation`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.permutation.html#numpy.random.permutation) while iterating in a for-loop (see Python flow controls [`for`](https://docs.python.org/3/tutorial/controlflow.html#for-statements) and [`range`](https://docs.python.org/3/tutorial/controlflow.html#the-range-function)), we can perform the experiment a bunch of times and see how many times we encounter at least one matching card:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "simulated value: 0.62\n", "expected value : 0.63\n" ] } ], "source": [ "np.random.seed(8)\n", "\n", "n = 100\n", "trials = 10000\n", "\n", "ordered = np.arange(1, n+1)\n", "\n", "tmp = []\n", "for i in range(trials): \n", " shuffled = np.random.permutation(np.arange(1, n+1))\n", " m = np.sum(shuffled == ordered)\n", " tmp.append(m)\n", " \n", "results = np.array(tmp)\n", "ans = np.sum(results >= 1) / trials\n", "\n", "expected = 1 - 1/np.e\n", "\n", "print('simulated value: {:.2F}'.format(ans))\n", "print('expected value : {:.2F}'.format(expected))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we declare and assign values to variables for the size of the deck `n`, and the number of `trials` in our simulation.\n", "\n", "Next, we generate a sequence from 1 to `n` (stopping at `n+1` to include `n`) to represent our ordered deck of cards.\n", "\n", "The code then loops for `trial` number of times, where\n", "* a permutation of a new sequence from 1 to `n` is created\n", "* the number of cards (indices) that match with our `ordered` sequence are counted as `m`\n", "* the number of matches `m` are saved to a temporary accumulator array `tmp`\n", "\n", "After completing `trial` simulations, we create a NumPy `array` `results` from the `tmp` accumulator, which lets us count the number of simulations where there was at least 1 match.\n", "\n", "Finally, we add up the number of times where there was at least one matching card, and we divide by the number of simulations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Birthday problem calculation and simulation\n", "\n", "The following code uses the NumPy module functions [`numpy.prod`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.prod.html) (which gives the product of the elements of the given `array`) and [`numpy.sum`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html) to calculate the probability of at least one birthday match in a group of 23 people:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5072972343239857" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "k = 23\n", "days_in_year = 365\n", "\n", "numer = np.arange(days_in_year, days_in_year-k, -1)\n", "denom = np.array([days_in_year]*k)\n", "\n", "1.0 - np.sum(np.prod(numer/denom))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unfortunately, NumPy and SciPy do not have library functions such as [`pbirthday` and `qbirthday`](https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/birthday) for computing birthday probabilities as does R. However, you can easily compose your own functions." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "50.7% chance of at least 1 birthday match among 23 people\n", "For an approximately 50.0% chance of at least 1 birthday match, we need 23 people\n" ] } ], "source": [ "def pbirthday_naive(n):\n", " \"\"\" Return the probability of at least 1 matching pair of birthdays out of n people.\n", " Assumes 365-day year.\n", " \n", " This is a naive implementation that may overflow or error under certain input.\n", " \"\"\"\n", " days_in_year = 365\n", " if n < 2:\n", " return 0.0\n", " elif n > days_in_year:\n", " return 1.0\n", " else:\n", " numer = np.arange(days_in_year, days_in_year-n, -1)\n", " denom = np.array([days_in_year]*n)\n", " return 1.0 - np.sum(np.prod( numer/denom ))\n", "\n", "def qbirthday_naive(p):\n", " \"\"\" Return an approximation of the number of people required to have at least 1 birthday\n", " match with probability p.\n", " \n", " This naive implemention is based on the Birthday problem article on Wikipedia.\n", " https://en.wikipedia.org/wiki/Birthday_problem\n", " \"\"\"\n", " raw = np.sqrt( 2 * 365 * np.log(1.0 / (1.0-p)) )\n", " return np.ceil(raw).astype(np.int32)\n", "\n", "n = 23\n", "pbirthday_naive_ans = pbirthday_naive(n)\n", "print(('{:.1f}% chance of at least 1 birthday match '\n", " 'among {} people'.format(pbirthday_naive_ans*100.0, n)))\n", "\n", "p = 0.50\n", "qbirthday_naive_ans = qbirthday_naive(p)\n", "print(('For an approximately {:.1f}% chance of at least 1 birthday match, '\n", " 'we need {} people'.format(p*100.0, qbirthday_naive_ans)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also find the probability of having at least one _triple birthday match_, i.e., three people with the same birthday; with the `pbirthday` and `qbirthday` functions below, all we have to do is add `coincident=3` to say we're looking for triple matches. For example, `pbirthday(23, coincident=3)` returns 0.014, so 23 people give us only a 1.4% chance of a triple birthday match. `qbirthday(0.5, coincident=3)` returns 88, so we'd need 88 people to have at least a 50% chance of at least one triple birthday match." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.4% chance of triple birthday match among 23 people\n", "We need 88 people for a triple match with probability of 50.0%\n" ] } ], "source": [ "def pbirthday(n, classes=365, coincident=2):\n", " \"\"\" Return the probability of a coincidence of a birthday match among n people.\n", " \n", " Python implementation of R's pbirthday {stats}.\n", " \"\"\"\n", " k = coincident\n", " c = classes\n", " \n", " if k < 2:\n", " return 1.0\n", " if k == 2:\n", " numer = np.arange(c, c-n, -1)\n", " denom = np.array([c]*n)\n", " return 1.0 - np.sum(np.prod( numer/denom ))\n", " if k > n: \n", " return 0.0\n", " if n > c * (k - 1):\n", " return 1.0\n", " else:\n", " lhs = n * np.exp(-n/(c * k))/(1 - n/(c * (k + 1)))**(1/k)\n", " lxx = k * np.log(lhs) - (k - 1) * np.log(c) - gammaln(k + 1)\n", " return -np.expm1(-np.exp(lxx))\n", "\n", "def qbirthday(prob=0.5, classes=365, coincident=2):\n", " \"\"\" Return the smallest number of people needed to have at least\n", " the specified probability of coincidence.\n", " \n", " Python implementation of R's qbirthday {stats}.\n", " \"\"\"\n", " k = coincident\n", " c = classes\n", " p = prob\n", " \n", " if p <= 0.0:\n", " return 1\n", " if p >= 1.0:\n", " return c * (k - 1) + 1\n", " \n", " N = np.exp(((k - 1) * np.log(c) + gammaln(k + 1) + np.log(-np.log1p(-p)))/k)\n", " N = np.ceil(N).astype(np.int32)\n", " if pbirthday(N, c, k) < p:\n", " N += 1\n", " while pbirthday(N, c, k) < p:\n", " N += 1\n", " elif pbirthday(N - 1, c, k) >= p:\n", " N -= 1\n", " while pbirthday(N - 1, c, k) >= p:\n", " N -= 1\n", " return N\n", "\n", "n = 23\n", "pbirthday_ans = pbirthday(n, coincident=3)\n", "print(('{:.1F}% chance of triple birthday match '\n", " 'among {} people'.format(pbirthday_ans*100.0, n)))\n", "\n", "p = 0.5\n", "qbirthday_ans = qbirthday(p, coincident=3)\n", "print(('We need {} people for a triple match '\n", " 'with probability of {:.1F}%'.format(qbirthday_ans, p*100.0)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To simulate the birthday problem, we can use" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "np.random.seed(13)\n", "\n", "b = np.random.choice(np.arange(1, 365+1), size=23, replace=True)\n", "u, c = np.unique(b, return_counts=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "to generate random birthdays for 23 people and then use [`numpy.unique`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html), specifiying `return_counts=True`, to tally up the counts of how many people were born on each day. We can run 104 repetitions as follows:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "In a simulation with 10000 trials, the probability of a birthday match with 23 people is 0.5118\n" ] } ], "source": [ "np.random.seed(21)\n", "\n", "n = 23\n", "trials = 10**4\n", "\n", "ordered = np.arange(1, n+1)\n", "\n", "m = 0\n", "for i in range(trials): \n", " b = np.random.choice(np.arange(1, 365+1), size=n, replace=True)\n", " u, c = np.unique(b, return_counts=True)\n", " if np.sum(c > 1):\n", " m += 1\n", "\n", "p = m / trials \n", "print(('In a simulation with {} trials, ' \n", " 'the probability of a birthday match with {} people '\n", " 'is {}'.format(trials, n, p)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the probabilities of various days are not all equal, the calculation becomes much more difficult, but the simulation can easily be extended since `numpy.random.choice` allows us to specify a list `p` of the probability of each day (by default sample assigns equal probabilities, so in the above the probability is 1/365 for each day)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "© Blitzstein, Joseph K.; Hwang, Jessica. Introduction to Probability (Chapman & Hall/CRC Texts in Statistical Science)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" }, "notebook_info": { "author": "Joseph K. Blitzstein, Jessica Hwang", "chapter": "1. Probability and Counting", "creator": "buruzaemon", "section": "1.8", "title": "Introduction to Probability, 1st Edition" } }, "nbformat": 4, "nbformat_minor": 2 }