{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Chapter 1: Probability and counting\n",
" \n",
"This Jupyter notebook is the Python equivalent of the R code in section 1.8 R, pp. 27 - 30, [Introduction to Probability, 1st Edition](https://www.crcpress.com/Introduction-to-Probability/Blitzstein-Hwang/p/book/9781466575578), Blitzstein & Hwang.\n",
"\n",
"----"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Vectors\n",
"\n",
"Rather than using the usual Python list (array), for probability and statistics it is more convenient to use a [NumPy](https://docs.scipy.org/doc/numpy/user/basics.creation.html) array.\n",
"\n",
"In Python, you can import the `numpy` module using the `np` shorthand alias idiom to save on typing. Some of the examples later in this notebook rely on random number generation, so we will seed the NumPy pseudo-random number generator with an arbitrary value so that if you run a code block, you should see the exact same results."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we pass in a list of integers to the [`numpy.array`](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.array.html) function to create an instance of a NumPy array, and then print the array as follows:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[3 1 4 5 9]\n"
]
}
],
"source": [
"v = np.array([3, 1, 4, 5, 9])\n",
"print(v)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `numpy` module (shortened to the alias `np`) provides many convenient object instance methods that work on an `array`. It is possible to use the functions of the `numpy` module, or invoke these functions as object instance methods of an `array`. Here are a few examples:\n",
"\n",
"* [`v.sum()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html) - sum array elements; equivalent to `numpy.sum(v)`\n",
"* [`v.max()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.max.html#numpy.ndarray.max) - return the maximum value of the array; equivalent to `numpy.max(v)`\n",
"* [`v.min()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.min.html#numpy.ndarray.min) - return the maximum value of the array; equivalent to `numpy.min(v)`"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"sum of v = 22\n",
"maximum of v = 9\n",
"minimum of v = 1\n"
]
}
],
"source": [
"print('sum of v =', v.sum())\n",
"\n",
"print('maximum of v =', v.max())\n",
"\n",
"print('minimum of v =', v.min())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To obtain the number of elements in the `array`, you can use either:\n",
"\n",
"* the Python built-in [`len`](https://docs.python.org/3.6/library/functions.html#len) function, passing in the `array`\n",
"* the [`size`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.size.html#numpy.ndarray.size) attribute of the NumPy `array`"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"len(v) = 5\n",
"v.size = 5\n"
]
}
],
"source": [
"print('len(v) =', len(v))\n",
"\n",
"print('v.size =', v.size)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To create a NumPy `array` from the sequence $(1, 2, \\dots, n)$, use the [`numpy.arange`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.arange.html) function. In general, an ascending-order sequence $(m, \\dots, n)$ can be generated with `arange(m, n+1)`. Note that the second parameter `stop` to the `arange` function is _not_ inclusive, so for an ascending-order sequence you must increment `stop` by 1.\n",
"\n",
"A descending-order sequence can be generated by providing -1 as the third parameter `step` to the `arange` function. However, for the reason mentioned above, you must conversely _decrement_ the second parameter `stop` if you want to include this value in the `array`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 1 2 3 4 5 6 7 8 9 10]\n",
"[10 9 8 7 6 5 4 3 2 1]\n"
]
}
],
"source": [
"m = 1\n",
"n = 10\n",
"\n",
"# example of ascending order sequence from m to n, inclusive\n",
"v = np.arange(m, n+1)\n",
"print(v)\n",
"\n",
"## example of descending order sequence\n",
"v_reversed = np.arange(n, m-1, -1)\n",
"print(v_reversed)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Like a Python `list` (and unlike vectors in R), NumPy `array` is zero-indexed. To access the `i`th element of an `array` named `v`, use `v[i-1]`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1st element (index 0) of v is 1\n",
"10th element (index 9) of v = 10\n"
]
}
],
"source": [
"print('1st element (index 0) of v is', v[0])\n",
"\n",
"print('10th element (index 9) of v =', v[9])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To access a subset of the `array` members, you can pass another NumPy `array` of the target indices to the indexer of your `array`.\n",
"\n",
"You can also use a NumPy `array` of boolean values (`True`, `False`) to index your `array`, keeping any elements where the index corresponds to `True` and filtering out elements where the corresponding index is `False`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"elements of v at indices 1, 3 and 5: [2 4 6]\n",
"elements of v except at indices 2, 4 and 6: [ 1 2 4 6 8 9 10]\n"
]
}
],
"source": [
"# NumPy array for the indices 1, 3, 5\n",
"subset_ind = np.array([1, 3, 5])\n",
"print('elements of v at indices 1, 3 and 5: ', v[subset_ind])\n",
"\n",
"# boolean indexer: all True indices will be included in the subset,\n",
"# while all False indexes will be filtered out\n",
"# keep all EXCEPT for array values at indices 2, 4, and 6\n",
"boolean_ind = np.array([True, True, False, True, False, True, False, True, True, True])\n",
"print('elements of v except at indices 2, 4 and 6: ', v[boolean_ind])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Many operations on NumPy `array` are interpreted _component-wise_. For example, in math the cube of a vector doesn't have a standard definition, but\n",
"\n",
" v**3\n",
" \n",
"simply cubes each element individually."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 1 8 27 64 125 216 343 512 729 1000]\n"
]
}
],
"source": [
"print(v**3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly, \n",
"\n",
" 1 / np.arange(1, 101)**2\n",
"\n",
"is a very compact way to get the vector $1, \\frac{1}{2^2}, \\frac{1}{3^2}, \\ldots , \\frac{1}{100^2}$"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1.00000000e+00, 2.50000000e-01, 1.11111111e-01, 6.25000000e-02,\n",
" 4.00000000e-02, 2.77777778e-02, 2.04081633e-02, 1.56250000e-02,\n",
" 1.23456790e-02, 1.00000000e-02, 8.26446281e-03, 6.94444444e-03,\n",
" 5.91715976e-03, 5.10204082e-03, 4.44444444e-03, 3.90625000e-03,\n",
" 3.46020761e-03, 3.08641975e-03, 2.77008310e-03, 2.50000000e-03,\n",
" 2.26757370e-03, 2.06611570e-03, 1.89035917e-03, 1.73611111e-03,\n",
" 1.60000000e-03, 1.47928994e-03, 1.37174211e-03, 1.27551020e-03,\n",
" 1.18906064e-03, 1.11111111e-03, 1.04058273e-03, 9.76562500e-04,\n",
" 9.18273646e-04, 8.65051903e-04, 8.16326531e-04, 7.71604938e-04,\n",
" 7.30460190e-04, 6.92520776e-04, 6.57462196e-04, 6.25000000e-04,\n",
" 5.94883998e-04, 5.66893424e-04, 5.40832883e-04, 5.16528926e-04,\n",
" 4.93827160e-04, 4.72589792e-04, 4.52693526e-04, 4.34027778e-04,\n",
" 4.16493128e-04, 4.00000000e-04, 3.84467512e-04, 3.69822485e-04,\n",
" 3.55998576e-04, 3.42935528e-04, 3.30578512e-04, 3.18877551e-04,\n",
" 3.07787011e-04, 2.97265161e-04, 2.87273772e-04, 2.77777778e-04,\n",
" 2.68744961e-04, 2.60145682e-04, 2.51952633e-04, 2.44140625e-04,\n",
" 2.36686391e-04, 2.29568411e-04, 2.22766763e-04, 2.16262976e-04,\n",
" 2.10039908e-04, 2.04081633e-04, 1.98373339e-04, 1.92901235e-04,\n",
" 1.87652468e-04, 1.82615047e-04, 1.77777778e-04, 1.73130194e-04,\n",
" 1.68662506e-04, 1.64365549e-04, 1.60230732e-04, 1.56250000e-04,\n",
" 1.52415790e-04, 1.48720999e-04, 1.45158949e-04, 1.41723356e-04,\n",
" 1.38408304e-04, 1.35208221e-04, 1.32117849e-04, 1.29132231e-04,\n",
" 1.26246686e-04, 1.23456790e-04, 1.20758363e-04, 1.18147448e-04,\n",
" 1.15620303e-04, 1.13173382e-04, 1.10803324e-04, 1.08506944e-04,\n",
" 1.06281220e-04, 1.04123282e-04, 1.02030405e-04, 1.00000000e-04])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"1 / np.arange(1, 101)**2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In math, $v+w$ is undefined if $v$ and $w$ are vectors of different lengths, but with a NumPy `array`, the shorter vector gets [broadcast](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html). For example, $v+3$ essentially turns 3 into a vector of 3's of the same length as `v`, thus adding 3 to each element of $v$."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 4 5 6 7 8 9 10 11 12 13]\n"
]
}
],
"source": [
"print(v + 3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Factorials and binomial coefficients\n",
"\n",
"We can compute $n!$ using [`scipy.special.factorial(n)`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.factorial.html#scipy.special.factorial) and $\\binom{n}{k}$ using [`scipy.special.comb(n, k)`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.comb.html#scipy.special.comb). As we have seen, factorials grow extremely quickly. What is the largest $n$ for which `scipy.special.factorial` returns a number? Beyond that point, `scipy.special.factorial` will return `inf` (infinity), without a warning message. "
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5! = 120.0\n",
"6 choose 2 = 15.0\n",
"200! = inf ?\n"
]
}
],
"source": [
"from scipy.special import factorial, comb\n",
"\n",
"# to learn more about scipy.special.factorial, un-comment out the following line\n",
"#print(factorial.__doc__)\n",
"\n",
"# to learn more about scipy.special.comb, un-comment out the following line\n",
"#print(comb.__doc__)\n",
"\n",
"print('5! =', factorial(5))\n",
"\n",
"print('6 choose 2 =', comb(6, 2))\n",
"\n",
"print('200! =', factorial(200), '?')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But it is still possible to use [`scipy.special.gammaln`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.gammaln.html#scipy.special.gammaln) define a function to easily compute $log(n!)$. Similarly, it would then be possible to compose a function that computes $log \\binom{n}{k}$."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"log(200!) = 863.2319871924054\n",
"log 100000 choose 999 = 5591.190431187046\n"
]
}
],
"source": [
"from scipy.special import gammaln\n",
"\n",
"# to learn more about scipy.special.gammaln, un-comment out the following line\n",
"#print(gammaln.__doc__)\n",
"\n",
"def logfactorial(n):\n",
" return gammaln(n+1)\n",
"\n",
"def logchoose(n, k):\n",
" num = logfactorial(n)\n",
" denom = logfactorial(n-k) + logfactorial(k)\n",
" return num - denom\n",
"\n",
"print('log(200!) =', logfactorial(200))\n",
"\n",
"print('log 100000 choose 999 =', logchoose(100000, 999))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sampling and simulation"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"The function [`numpy.random.choice`](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.choice.html) is a useful way of drawing random samples in NumPy. (Technically, they are pseudo-random since there is an underlying deterministic algorithm, but they \"look like\" random samples for almost all practical purposes.) For example,"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 3, 10, 7, 5, 1])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# seed the random number generator\n",
"np.random.seed(1)\n",
"\n",
"# Example: sampling without replacement\n",
"#\n",
"# do not forget that Python arrays are zero-indexed,\n",
"# and the 2nd argument to NumPy arange must be incremented by 1\n",
"# if you want to include that value\n",
"n = 10\n",
"k = 5\n",
"np.random.choice(np.arange(1, n+1), k, replace=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"generates a random sample of 5 of the numbers from 1 to 10, without replacement, and with equal probabilities given to each number. To sample with replacement instead, you can explicitly specify `replace=True`, or you may leave that argument out altogether since the default for `numpy.random.choice` is `replace=True`."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 6, 9, 10, 6, 1])"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.random.seed(1)\n",
"\n",
"# Example: sampling with replacement\n",
"np.random.choice(np.arange(1, n+1), k, replace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To obtain a random permutation of an `array` of numbers $1, 2, \\ldots, n$ we can use [`numpy.random.shuffle`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.shuffle.html). Note that this function operates on the given `array` in-place."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"v = [ 1 2 3 4 5 6 7 8 9 10]\n",
"v, shuffled = [ 5 2 6 1 8 3 4 7 10 9]\n"
]
}
],
"source": [
"np.random.seed(2)\n",
"\n",
"m = 1\n",
"n = 10\n",
"\n",
"v = np.arange(m, n+1)\n",
"print('v =', v)\n",
"\n",
"np.random.shuffle(v)\n",
"print('v, shuffled =', v)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also use `numpy.random.choice` to draw from a non-numeric `list` or `array`. For example, the Python built-in function [`list`](https://docs.python.org/3.6/library/functions.html#func-list) can be used to transform a string of the 26 lowercase letters of the English alphabet into a list of individual letters. `numpy.random.choice` will generate a random 7-letter \"word\" by sampling from the list of lowercase alphabet chars derived from [`string.ascii_lowercase`](https://docs.python.org/3/library/string.html), without replacement. Lastly, the Python String function [`join`](https://docs.python.org/3.6/library/stdtypes.html?highlight=join#str.join) concatenates the 7 randomly selected letters into a \"word\"."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'srmxpqn'"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.random.seed(3)\n",
"\n",
"import string\n",
"\n",
"# split string of lower-case alphabets into an array\n",
"alpha = list(string.ascii_lowercase)\n",
"\n",
"# randomly choose 7 letters, concatenate into a string\n",
"''.join(np.random.choice(alpha, 7, replace=False))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`numpy.random.choice` also allows us to specify general probabilities for sampling each number. For example,"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 3, 1], dtype=int64)"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.random.seed(5)\n",
"\n",
"# from the 4 numbers starting from 0\n",
"# obtain a sample of size 3\n",
"# with replacement\n",
"# using the probabilities listed in p\n",
"np.random.choice(4, 3, replace=True, p=[0.1, 0.2, 0.3, 0.4])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"samples 3 numbers between 0 and 3, with replacement, and with probabilities given by the parameter `p=[0.1, 0.2, 0.3, 0.4]`. If the sampling is without replacement, then at each stage the probability of any not-yet-chosen number is _proportional_ to its original probability."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Matching problem simulation\n",
"\n",
"Let's show by simulation that the probability of a matching card in Example 1.6.4 is approximately $1 − 1/e$ when the deck is sufficiently large. Using [`numpy.random.permutation`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.permutation.html#numpy.random.permutation) while iterating in a for-loop (see Python flow controls [`for`](https://docs.python.org/3/tutorial/controlflow.html#for-statements) and [`range`](https://docs.python.org/3/tutorial/controlflow.html#the-range-function)), we can perform the experiment a bunch of times and see how many times we encounter at least one matching card:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"simulated value: 0.62\n",
"expected value : 0.63\n"
]
}
],
"source": [
"np.random.seed(8)\n",
"\n",
"n = 100\n",
"trials = 10000\n",
"\n",
"ordered = np.arange(1, n+1)\n",
"\n",
"tmp = []\n",
"for i in range(trials): \n",
" shuffled = np.random.permutation(np.arange(1, n+1))\n",
" m = np.sum(shuffled == ordered)\n",
" tmp.append(m)\n",
" \n",
"results = np.array(tmp)\n",
"ans = np.sum(results >= 1) / trials\n",
"\n",
"expected = 1 - 1/np.e\n",
"\n",
"print('simulated value: {:.2F}'.format(ans))\n",
"print('expected value : {:.2F}'.format(expected))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we declare and assign values to variables for the size of the deck `n`, and the number of `trials` in our simulation.\n",
"\n",
"Next, we generate a sequence from 1 to `n` (stopping at `n+1` to include `n`) to represent our ordered deck of cards.\n",
"\n",
"The code then loops for `trial` number of times, where\n",
"* a permutation of a new sequence from 1 to `n` is created\n",
"* the number of cards (indices) that match with our `ordered` sequence are counted as `m`\n",
"* the number of matches `m` are saved to a temporary accumulator array `tmp`\n",
"\n",
"After completing `trial` simulations, we create a NumPy `array` `results` from the `tmp` accumulator, which lets us count the number of simulations where there was at least 1 match.\n",
"\n",
"Finally, we add up the number of times where there was at least one matching card, and we divide by the number of simulations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Birthday problem calculation and simulation\n",
"\n",
"The following code uses the NumPy module functions [`numpy.prod`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.prod.html) (which gives the product of the elements of the given `array`) and [`numpy.sum`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html) to calculate the probability of at least one birthday match in a group of 23 people:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.5072972343239857"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"k = 23\n",
"days_in_year = 365\n",
"\n",
"numer = np.arange(days_in_year, days_in_year-k, -1)\n",
"denom = np.array([days_in_year]*k)\n",
"\n",
"1.0 - np.sum(np.prod(numer/denom))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unfortunately, NumPy and SciPy do not have library functions such as [`pbirthday` and `qbirthday`](https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/birthday) for computing birthday probabilities as does R. However, you can easily compose your own functions."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"50.7% chance of at least 1 birthday match among 23 people\n",
"For an approximately 50.0% chance of at least 1 birthday match, we need 23 people\n"
]
}
],
"source": [
"def pbirthday_naive(n):\n",
" \"\"\" Return the probability of at least 1 matching pair of birthdays out of n people.\n",
" Assumes 365-day year.\n",
" \n",
" This is a naive implementation that may overflow or error under certain input.\n",
" \"\"\"\n",
" days_in_year = 365\n",
" if n < 2:\n",
" return 0.0\n",
" elif n > days_in_year:\n",
" return 1.0\n",
" else:\n",
" numer = np.arange(days_in_year, days_in_year-n, -1)\n",
" denom = np.array([days_in_year]*n)\n",
" return 1.0 - np.sum(np.prod( numer/denom ))\n",
"\n",
"def qbirthday_naive(p):\n",
" \"\"\" Return an approximation of the number of people required to have at least 1 birthday\n",
" match with probability p.\n",
" \n",
" This naive implemention is based on the Birthday problem article on Wikipedia.\n",
" https://en.wikipedia.org/wiki/Birthday_problem\n",
" \"\"\"\n",
" raw = np.sqrt( 2 * 365 * np.log(1.0 / (1.0-p)) )\n",
" return np.ceil(raw).astype(np.int32)\n",
"\n",
"n = 23\n",
"pbirthday_naive_ans = pbirthday_naive(n)\n",
"print(('{:.1f}% chance of at least 1 birthday match '\n",
" 'among {} people'.format(pbirthday_naive_ans*100.0, n)))\n",
"\n",
"p = 0.50\n",
"qbirthday_naive_ans = qbirthday_naive(p)\n",
"print(('For an approximately {:.1f}% chance of at least 1 birthday match, '\n",
" 'we need {} people'.format(p*100.0, qbirthday_naive_ans)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also find the probability of having at least one _triple birthday match_, i.e., three people with the same birthday; with the `pbirthday` and `qbirthday` functions below, all we have to do is add `coincident=3` to say we're looking for triple matches. For example, `pbirthday(23, coincident=3)` returns 0.014, so 23 people give us only a 1.4% chance of a triple birthday match. `qbirthday(0.5, coincident=3)` returns 88, so we'd need 88 people to have at least a 50% chance of at least one triple birthday match."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.4% chance of triple birthday match among 23 people\n",
"We need 88 people for a triple match with probability of 50.0%\n"
]
}
],
"source": [
"def pbirthday(n, classes=365, coincident=2):\n",
" \"\"\" Return the probability of a coincidence of a birthday match among n people.\n",
" \n",
" Python implementation of R's pbirthday {stats}.\n",
" \"\"\"\n",
" k = coincident\n",
" c = classes\n",
" \n",
" if k < 2:\n",
" return 1.0\n",
" if k == 2:\n",
" numer = np.arange(c, c-n, -1)\n",
" denom = np.array([c]*n)\n",
" return 1.0 - np.sum(np.prod( numer/denom ))\n",
" if k > n: \n",
" return 0.0\n",
" if n > c * (k - 1):\n",
" return 1.0\n",
" else:\n",
" lhs = n * np.exp(-n/(c * k))/(1 - n/(c * (k + 1)))**(1/k)\n",
" lxx = k * np.log(lhs) - (k - 1) * np.log(c) - gammaln(k + 1)\n",
" return -np.expm1(-np.exp(lxx))\n",
"\n",
"def qbirthday(prob=0.5, classes=365, coincident=2):\n",
" \"\"\" Return the smallest number of people needed to have at least\n",
" the specified probability of coincidence.\n",
" \n",
" Python implementation of R's qbirthday {stats}.\n",
" \"\"\"\n",
" k = coincident\n",
" c = classes\n",
" p = prob\n",
" \n",
" if p <= 0.0:\n",
" return 1\n",
" if p >= 1.0:\n",
" return c * (k - 1) + 1\n",
" \n",
" N = np.exp(((k - 1) * np.log(c) + gammaln(k + 1) + np.log(-np.log1p(-p)))/k)\n",
" N = np.ceil(N).astype(np.int32)\n",
" if pbirthday(N, c, k) < p:\n",
" N += 1\n",
" while pbirthday(N, c, k) < p:\n",
" N += 1\n",
" elif pbirthday(N - 1, c, k) >= p:\n",
" N -= 1\n",
" while pbirthday(N - 1, c, k) >= p:\n",
" N -= 1\n",
" return N\n",
"\n",
"n = 23\n",
"pbirthday_ans = pbirthday(n, coincident=3)\n",
"print(('{:.1F}% chance of triple birthday match '\n",
" 'among {} people'.format(pbirthday_ans*100.0, n)))\n",
"\n",
"p = 0.5\n",
"qbirthday_ans = qbirthday(p, coincident=3)\n",
"print(('We need {} people for a triple match '\n",
" 'with probability of {:.1F}%'.format(qbirthday_ans, p*100.0)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To simulate the birthday problem, we can use"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"np.random.seed(13)\n",
"\n",
"b = np.random.choice(np.arange(1, 365+1), size=23, replace=True)\n",
"u, c = np.unique(b, return_counts=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"to generate random birthdays for 23 people and then use [`numpy.unique`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html), specifiying `return_counts=True`, to tally up the counts of how many people were born on each day. We can run 104 repetitions as follows:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In a simulation with 10000 trials, the probability of a birthday match with 23 people is 0.5118\n"
]
}
],
"source": [
"np.random.seed(21)\n",
"\n",
"n = 23\n",
"trials = 10**4\n",
"\n",
"ordered = np.arange(1, n+1)\n",
"\n",
"m = 0\n",
"for i in range(trials): \n",
" b = np.random.choice(np.arange(1, 365+1), size=n, replace=True)\n",
" u, c = np.unique(b, return_counts=True)\n",
" if np.sum(c > 1):\n",
" m += 1\n",
"\n",
"p = m / trials \n",
"print(('In a simulation with {} trials, ' \n",
" 'the probability of a birthday match with {} people '\n",
" 'is {}'.format(trials, n, p)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the probabilities of various days are not all equal, the calculation becomes much more difficult, but the simulation can easily be extended since `numpy.random.choice` allows us to specify a list `p` of the probability of each day (by default sample assigns equal probabilities, so in the above the probability is 1/365 for each day)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----\n",
"\n",
"© Blitzstein, Joseph K.; Hwang, Jessica. Introduction to Probability (Chapman & Hall/CRC Texts in Statistical Science)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
},
"notebook_info": {
"author": "Joseph K. Blitzstein, Jessica Hwang",
"chapter": "1. Probability and Counting",
"creator": "buruzaemon",
"section": "1.8",
"title": "Introduction to Probability, 1st Edition"
}
},
"nbformat": 4,
"nbformat_minor": 2
}