{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Chapter 1: Probability and counting\n", " \n", "This Jupyter notebook is the Python equivalent of the R code in section 1.8 R, pp. 29 - 32, [Introduction to Probability, Second Edition](https://www.crcpress.com/Introduction-to-Probability-Second-Edition/Blitzstein-Hwang/p/book/9781138369917), Blitzstein & Hwang.\n", "\n", "----" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Vectors\n", "\n", "Rather than using the usual Python list (array), for probability and statistics it is more convenient to use a [NumPy](https://docs.scipy.org/doc/numpy/user/basics.creation.html) array.\n", "\n", "In Python, you can import the numpy module using the np shorthand alias idiom to save on typing. Some of the examples later in this notebook rely on random number generation, so we will seed the NumPy pseudo-random number generator with an arbitrary value so that if you run a code block, you should see the exact same results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we pass in a list of integers to the [numpy.array](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.array.html) function to create an instance of a NumPy array, and then print the array as follows:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[3 1 4 5 9]\n" ] } ], "source": [ "v = np.array([3, 1, 4, 5, 9])\n", "print(v)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The numpy module (shortened to the alias np) provides many convenient object instance methods that work on an array. It is possible to use the functions of the numpy module, or invoke these functions as object instance methods of an array. Here are a few examples:\n", "\n", "* [v.sum()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html) - sum array elements; equivalent to numpy.sum(v)\n", "* [v.max()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.max.html#numpy.ndarray.max) - return the maximum value of the array; equivalent to numpy.max(v)\n", "* [v.min()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.min.html#numpy.ndarray.min) - return the maximum value of the array; equivalent to numpy.min(v)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sum of v = 22\n", "maximum of v = 9\n", "minimum of v = 1\n" ] } ], "source": [ "print('sum of v =', v.sum())\n", "\n", "print('maximum of v =', v.max())\n", "\n", "print('minimum of v =', v.min())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To obtain the number of elements in the array, you can use either:\n", "\n", "* the Python built-in [len](https://docs.python.org/3.6/library/functions.html#len) function, passing in the array\n", "* the [size](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.size.html#numpy.ndarray.size) attribute of the NumPy array" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "len(v) = 5\n", "v.size = 5\n" ] } ], "source": [ "print('len(v) =', len(v))\n", "\n", "print('v.size =', v.size)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To create a NumPy array from the sequence $(1, 2, \\dots, n)$, use the [numpy.arange](https://docs.scipy.org/doc/numpy/reference/generated/numpy.arange.html) function. In general, an ascending-order sequence $(m, \\dots, n)$ can be generated with arange(m, n+1). Note that the second parameter stop to the arange function is _not_ inclusive, so for an ascending-order sequence you must increment stop by 1.\n", "\n", "A descending-order sequence can be generated by providing -1 as the third parameter step to the arange function. However, for the reason mentioned above, you must conversely _decrement_ the second parameter stop if you want to include this value in the array." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 1 2 3 4 5 6 7 8 9 10]\n", "[10 9 8 7 6 5 4 3 2 1]\n" ] } ], "source": [ "m = 1\n", "n = 10\n", "\n", "# example of ascending order sequence from m to n, inclusive\n", "v = np.arange(m, n+1)\n", "print(v)\n", "\n", "## example of descending order sequence\n", "v_reversed = np.arange(n, m-1, -1)\n", "print(v_reversed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like a Python list (and unlike vectors in R), NumPy array is zero-indexed. To access the ith element of an array named v, use v[i-1]." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1st element (index 0) of v is 1\n", "10th element (index 9) of v = 10\n" ] } ], "source": [ "print('1st element (index 0) of v is', v[0])\n", "\n", "print('10th element (index 9) of v =', v[9])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To access a subset of the array members, you can pass another NumPy array of the target indices to the indexer of your array.\n", "\n", "You can also use a NumPy array of boolean values (True, False) to index your array, keeping any elements where the index corresponds to True and filtering out elements where the corresponding index is False." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "elements of v at indices 1, 3 and 5: [2 4 6]\n", "elements of v except at indices 2, 4 and 6: [ 1 2 4 6 8 9 10]\n" ] } ], "source": [ "# NumPy array for the indices 1, 3, 5\n", "subset_ind = np.array([1, 3, 5])\n", "print('elements of v at indices 1, 3 and 5: ', v[subset_ind])\n", "\n", "# boolean indexer: all True indices will be included in the subset,\n", "# while all False indexes will be filtered out\n", "# keep all EXCEPT for array values at indices 2, 4, and 6\n", "boolean_ind = np.array([True, True, False, True, False, True, False, True, True, True])\n", "print('elements of v except at indices 2, 4 and 6: ', v[boolean_ind])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many operations on NumPy array are interpreted _component-wise_. For example, in math the cube of a vector doesn't have a standard definition, but\n", "\n", " v**3\n", " \n", "simply cubes each element individually." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 1 8 27 64 125 216 343 512 729 1000]\n" ] } ], "source": [ "print(v**3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, \n", "\n", " 1 / np.arange(1, 101)**2\n", "\n", "is a very compact way to get the vector $1, \\frac{1}{2^2}, \\frac{1}{3^2}, \\ldots , \\frac{1}{100^2}$" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1.00000000e+00, 2.50000000e-01, 1.11111111e-01, 6.25000000e-02,\n", " 4.00000000e-02, 2.77777778e-02, 2.04081633e-02, 1.56250000e-02,\n", " 1.23456790e-02, 1.00000000e-02, 8.26446281e-03, 6.94444444e-03,\n", " 5.91715976e-03, 5.10204082e-03, 4.44444444e-03, 3.90625000e-03,\n", " 3.46020761e-03, 3.08641975e-03, 2.77008310e-03, 2.50000000e-03,\n", " 2.26757370e-03, 2.06611570e-03, 1.89035917e-03, 1.73611111e-03,\n", " 1.60000000e-03, 1.47928994e-03, 1.37174211e-03, 1.27551020e-03,\n", " 1.18906064e-03, 1.11111111e-03, 1.04058273e-03, 9.76562500e-04,\n", " 9.18273646e-04, 8.65051903e-04, 8.16326531e-04, 7.71604938e-04,\n", " 7.30460190e-04, 6.92520776e-04, 6.57462196e-04, 6.25000000e-04,\n", " 5.94883998e-04, 5.66893424e-04, 5.40832883e-04, 5.16528926e-04,\n", " 4.93827160e-04, 4.72589792e-04, 4.52693526e-04, 4.34027778e-04,\n", " 4.16493128e-04, 4.00000000e-04, 3.84467512e-04, 3.69822485e-04,\n", " 3.55998576e-04, 3.42935528e-04, 3.30578512e-04, 3.18877551e-04,\n", " 3.07787011e-04, 2.97265161e-04, 2.87273772e-04, 2.77777778e-04,\n", " 2.68744961e-04, 2.60145682e-04, 2.51952633e-04, 2.44140625e-04,\n", " 2.36686391e-04, 2.29568411e-04, 2.22766763e-04, 2.16262976e-04,\n", " 2.10039908e-04, 2.04081633e-04, 1.98373339e-04, 1.92901235e-04,\n", " 1.87652468e-04, 1.82615047e-04, 1.77777778e-04, 1.73130194e-04,\n", " 1.68662506e-04, 1.64365549e-04, 1.60230732e-04, 1.56250000e-04,\n", " 1.52415790e-04, 1.48720999e-04, 1.45158949e-04, 1.41723356e-04,\n", " 1.38408304e-04, 1.35208221e-04, 1.32117849e-04, 1.29132231e-04,\n", " 1.26246686e-04, 1.23456790e-04, 1.20758363e-04, 1.18147448e-04,\n", " 1.15620303e-04, 1.13173382e-04, 1.10803324e-04, 1.08506944e-04,\n", " 1.06281220e-04, 1.04123282e-04, 1.02030405e-04, 1.00000000e-04])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1 / np.arange(1, 101)**2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In math, $v+w$ is undefined if $v$ and $w$ are vectors of different lengths, but with a NumPy array, the shorter vector gets [broadcast](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html). For example, $v+3$ essentially turns 3 into a vector of 3's of the same length as v, thus adding 3 to each element of $v$." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 4 5 6 7 8 9 10 11 12 13]\n" ] } ], "source": [ "print(v + 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Factorials and binomial coefficients\n", "\n", "We can compute $n!$ using [scipy.special.factorial(n)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.factorial.html#scipy.special.factorial) and $\\binom{n}{k}$ using [scipy.special.comb(n, k)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.comb.html#scipy.special.comb). As we have seen, factorials grow extremely quickly. What is the largest $n$ for which scipy.special.factorial returns a number? Beyond that point, scipy.special.factorial will return inf (infinity), without a warning message. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5! = 120.0\n", "6 choose 2 = 15.0\n", "200! = inf ?\n" ] } ], "source": [ "from scipy.special import factorial, comb\n", "\n", "# to learn more about scipy.special.factorial, un-comment out the following line\n", "#print(factorial.__doc__)\n", "\n", "# to learn more about scipy.special.comb, un-comment out the following line\n", "#print(comb.__doc__)\n", "\n", "print('5! =', factorial(5))\n", "\n", "print('6 choose 2 =', comb(6, 2))\n", "\n", "print('200! =', factorial(200), '?')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But it is still possible to use [scipy.special.gammaln](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.gammaln.html#scipy.special.gammaln) define a function to easily compute $log(n!)$. Similarly, it would then be possible to compose a function that computes $log \\binom{n}{k}$." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "log(200!) = 863.2319871924054\n", "log 100000 choose 999 = 5591.190431187046\n" ] } ], "source": [ "from scipy.special import gammaln\n", "\n", "# to learn more about scipy.special.gammaln, un-comment out the following line\n", "#print(gammaln.__doc__)\n", "\n", "def logfactorial(n):\n", " return gammaln(n+1)\n", "\n", "def logchoose(n, k):\n", " num = logfactorial(n)\n", " denom = logfactorial(n-k) + logfactorial(k)\n", " return num - denom\n", "\n", "print('log(200!) =', logfactorial(200))\n", "\n", "print('log 100000 choose 999 =', logchoose(100000, 999))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sampling and simulation" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "The function [numpy.random.choice](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.choice.html) is a useful way of drawing random samples in NumPy. (Technically, they are pseudo-random since there is an underlying deterministic algorithm, but they \"look like\" random samples for almost all practical purposes.) For example," ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 3, 10, 7, 5, 1])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# seed the random number generator\n", "np.random.seed(1)\n", "\n", "# Example: sampling without replacement\n", "#\n", "# do not forget that Python arrays are zero-indexed,\n", "# and the 2nd argument to NumPy arange must be incremented by 1\n", "# if you want to include that value\n", "n = 10\n", "k = 5\n", "np.random.choice(np.arange(1, n+1), k, replace=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "generates a random sample of 5 of the numbers from 1 to 10, without replacement, and with equal probabilities given to each number. To sample with replacement instead, you can explicitly specify replace=True, or you may leave that argument out altogether since the default for numpy.random.choice is replace=True." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 6, 9, 10, 6, 1])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(1)\n", "\n", "# Example: sampling with replacement\n", "np.random.choice(np.arange(1, n+1), k, replace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To obtain a random permutation of an array of numbers $1, 2, \\ldots, n$ we can use [numpy.random.shuffle](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.shuffle.html). Note that this function operates on the given array in-place." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "v = [ 1 2 3 4 5 6 7 8 9 10]\n", "v, shuffled = [ 5 2 6 1 8 3 4 7 10 9]\n" ] } ], "source": [ "np.random.seed(2)\n", "\n", "m = 1\n", "n = 10\n", "\n", "v = np.arange(m, n+1)\n", "print('v =', v)\n", "\n", "np.random.shuffle(v)\n", "print('v, shuffled =', v)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use numpy.random.choice to draw from a non-numeric list or array. For example, the Python built-in function [list](https://docs.python.org/3.6/library/functions.html#func-list) can be used to transform a string of the 26 lowercase letters of the English alphabet into a list of individual letters. numpy.random.choice will generate a random 7-letter \"word\" by sampling from the list of lowercase alphabet chars derived from [string.ascii_lowercase](https://docs.python.org/3/library/string.html), without replacement. Lastly, the Python String function [join](https://docs.python.org/3.6/library/stdtypes.html?highlight=join#str.join) concatenates the 7 randomly selected letters into a \"word\"." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'srmxpqn'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(3)\n", "\n", "import string\n", "\n", "# split string of lower-case alphabets into an array\n", "alpha = list(string.ascii_lowercase)\n", "\n", "# randomly choose 7 letters, concatenate into a string\n", "''.join(np.random.choice(alpha, 7, replace=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "numpy.random.choice also allows us to specify general probabilities for sampling each number. For example," ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 3, 1], dtype=int64)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(5)\n", "\n", "# from the 4 numbers starting from 0\n", "# obtain a sample of size 3\n", "# with replacement\n", "# using the probabilities listed in p\n", "np.random.choice(4, 3, replace=True, p=[0.1, 0.2, 0.3, 0.4])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "samples 3 numbers between 0 and 3, with replacement, and with probabilities given by the parameter p=[0.1, 0.2, 0.3, 0.4]. If the sampling is without replacement, then at each stage the probability of any not-yet-chosen number is _proportional_ to its original probability." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Matching problem simulation\n", "\n", "Let's show by simulation that the probability of a matching card in Example 1.6.4 is approximately $1 − 1/e$ when the deck is sufficiently large. Using [numpy.random.permutation](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.permutation.html#numpy.random.permutation) while iterating in a for-loop (see Python flow controls [for](https://docs.python.org/3/tutorial/controlflow.html#for-statements) and [range](https://docs.python.org/3/tutorial/controlflow.html#the-range-function)), we can perform the experiment a bunch of times and see how many times we encounter at least one matching card:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "simulated value: 0.62\n", "expected value : 0.63\n" ] } ], "source": [ "np.random.seed(8)\n", "\n", "n = 100\n", "trials = 10000\n", "\n", "ordered = np.arange(1, n+1)\n", "\n", "tmp = []\n", "for i in range(trials): \n", " shuffled = np.random.permutation(np.arange(1, n+1))\n", " m = np.sum(shuffled == ordered)\n", " tmp.append(m)\n", " \n", "results = np.array(tmp)\n", "ans = np.sum(results >= 1) / trials\n", "\n", "expected = 1 - 1/np.e\n", "\n", "print('simulated value: {:.2F}'.format(ans))\n", "print('expected value : {:.2F}'.format(expected))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we declare and assign values to variables for the size of the deck n, and the number of trials in our simulation.\n", "\n", "Next, we generate a sequence from 1 to n (stopping at n+1 to include n) to represent our ordered deck of cards.\n", "\n", "The code then loops for trial number of times, where\n", "* a permutation of a new sequence from 1 to n is created\n", "* the number of cards (indices) that match with our ordered sequence are counted as m\n", "* the number of matches m are saved to a temporary accumulator array tmp\n", "\n", "After completing trial simulations, we create a NumPy array results from the tmp accumulator, which lets us count the number of simulations where there was at least 1 match.\n", "\n", "Finally, we add up the number of times where there was at least one matching card, and we divide by the number of simulations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Birthday problem calculation and simulation\n", "\n", "The following code uses the NumPy module functions [numpy.prod](https://docs.scipy.org/doc/numpy/reference/generated/numpy.prod.html) (which gives the product of the elements of the given array) and [numpy.sum](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html) to calculate the probability of at least one birthday match in a group of 23 people:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5072972343239857" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "k = 23\n", "days_in_year = 365\n", "\n", "numer = np.arange(days_in_year, days_in_year-k, -1)\n", "denom = np.array([days_in_year]*k)\n", "\n", "1.0 - np.sum(np.prod(numer/denom))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unfortunately, NumPy and SciPy do not have library functions such as [pbirthday and qbirthday](https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/birthday) for computing birthday probabilities as does R. However, you can easily compose your own functions." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "50.7% chance of at least 1 birthday match among 23 people\n", "For an approximately 50.0% chance of at least 1 birthday match, we need 23 people\n" ] } ], "source": [ "def pbirthday_naive(n):\n", " \"\"\" Return the probability of at least 1 matching pair of birthdays out of n people.\n", " Assumes 365-day year.\n", " \n", " This is a naive implementation that may overflow or error under certain input.\n", " \"\"\"\n", " days_in_year = 365\n", " if n < 2:\n", " return 0.0\n", " elif n > days_in_year:\n", " return 1.0\n", " else:\n", " numer = np.arange(days_in_year, days_in_year-n, -1)\n", " denom = np.array([days_in_year]*n)\n", " return 1.0 - np.sum(np.prod( numer/denom ))\n", "\n", "def qbirthday_naive(p):\n", " \"\"\" Return an approximation of the number of people required to have at least 1 birthday\n", " match with probability p.\n", " \n", " This naive implemention is based on the Birthday problem article on Wikipedia.\n", " https://en.wikipedia.org/wiki/Birthday_problem\n", " \"\"\"\n", " raw = np.sqrt( 2 * 365 * np.log(1.0 / (1.0-p)) )\n", " return np.ceil(raw).astype(np.int32)\n", "\n", "n = 23\n", "pbirthday_naive_ans = pbirthday_naive(n)\n", "print(('{:.1f}% chance of at least 1 birthday match '\n", " 'among {} people'.format(pbirthday_naive_ans*100.0, n)))\n", "\n", "p = 0.50\n", "qbirthday_naive_ans = qbirthday_naive(p)\n", "print(('For an approximately {:.1f}% chance of at least 1 birthday match, '\n", " 'we need {} people'.format(p*100.0, qbirthday_naive_ans)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also find the probability of having at least one _triple birthday match_, i.e., three people with the same birthday; with the pbirthday and qbirthday functions below, all we have to do is add coincident=3 to say we're looking for triple matches. For example, pbirthday(23, coincident=3) returns 0.014, so 23 people give us only a 1.4% chance of a triple birthday match. qbirthday(0.5, coincident=3) returns 88, so we'd need 88 people to have at least a 50% chance of at least one triple birthday match." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.4% chance of triple birthday match among 23 people\n", "We need 88 people for a triple match with probability of 50.0%\n" ] } ], "source": [ "def pbirthday(n, classes=365, coincident=2):\n", " \"\"\" Return the probability of a coincidence of a birthday match among n people.\n", " \n", " Python implementation of R's pbirthday {stats}.\n", " \"\"\"\n", " k = coincident\n", " c = classes\n", " \n", " if k < 2:\n", " return 1.0\n", " if k == 2:\n", " numer = np.arange(c, c-n, -1)\n", " denom = np.array([c]*n)\n", " return 1.0 - np.sum(np.prod( numer/denom ))\n", " if k > n: \n", " return 0.0\n", " if n > c * (k - 1):\n", " return 1.0\n", " else:\n", " lhs = n * np.exp(-n/(c * k))/(1 - n/(c * (k + 1)))**(1/k)\n", " lxx = k * np.log(lhs) - (k - 1) * np.log(c) - gammaln(k + 1)\n", " return -np.expm1(-np.exp(lxx))\n", "\n", "def qbirthday(prob=0.5, classes=365, coincident=2):\n", " \"\"\" Return the smallest number of people needed to have at least\n", " the specified probability of coincidence.\n", " \n", " Python implementation of R's qbirthday {stats}.\n", " \"\"\"\n", " k = coincident\n", " c = classes\n", " p = prob\n", " \n", " if p <= 0.0:\n", " return 1\n", " if p >= 1.0:\n", " return c * (k - 1) + 1\n", " \n", " N = np.exp(((k - 1) * np.log(c) + gammaln(k + 1) + np.log(-np.log1p(-p)))/k)\n", " N = np.ceil(N).astype(np.int32)\n", " if pbirthday(N, c, k) < p:\n", " N += 1\n", " while pbirthday(N, c, k) < p:\n", " N += 1\n", " elif pbirthday(N - 1, c, k) >= p:\n", " N -= 1\n", " while pbirthday(N - 1, c, k) >= p:\n", " N -= 1\n", " return N\n", "\n", "n = 23\n", "pbirthday_ans = pbirthday(n, coincident=3)\n", "print(('{:.1F}% chance of triple birthday match '\n", " 'among {} people'.format(pbirthday_ans*100.0, n)))\n", "\n", "p = 0.5\n", "qbirthday_ans = qbirthday(p, coincident=3)\n", "print(('We need {} people for a triple match '\n", " 'with probability of {:.1F}%'.format(qbirthday_ans, p*100.0)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To simulate the birthday problem, we can use" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "np.random.seed(13)\n", "\n", "b = np.random.choice(np.arange(1, 365+1), size=23, replace=True)\n", "u, c = np.unique(b, return_counts=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "to generate random birthdays for 23 people and then use [numpy.unique](https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html), specifiying return_counts=True, to tally up the counts of how many people were born on each day. We can run 104 repetitions as follows:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "In a simulation with 10000 trials, the probability of a birthday match with 23 people is 0.5118\n" ] } ], "source": [ "np.random.seed(21)\n", "\n", "n = 23\n", "trials = 10**4\n", "\n", "ordered = np.arange(1, n+1)\n", "\n", "m = 0\n", "for i in range(trials): \n", " b = np.random.choice(np.arange(1, 365+1), size=n, replace=True)\n", " u, c = np.unique(b, return_counts=True)\n", " if np.sum(c > 1):\n", " m += 1\n", "\n", "p = m / trials \n", "print(('In a simulation with {} trials, ' \n", " 'the probability of a birthday match with {} people '\n", " 'is {}'.format(trials, n, p)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the probabilities of various days are not all equal, the calculation becomes much more difficult, but the simulation can easily be extended since numpy.random.choice allows us to specify a list p of the probability of each day (by default sample assigns equal probabilities, so in the above the probability is 1/365 for each day)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Joseph K. Blitzstein and Jessica Hwang, Harvard University and Stanford University, © 2019 by Taylor and Francis Group, LLC" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" }, "notebook_info": { "author": "Joseph K. Blitzstein, Jessica Hwang", "chapter": "1. Probability and Counting", "creator": "buruzaemon", "section": "1.8", "title": "Introduction to Probability, 1st Edition" } }, "nbformat": 4, "nbformat_minor": 2 }