{ "cells": [ { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# HIDDEN\n", "\n", "from datascience import *\n", "import numpy as np\n", "\n", "%matplotlib inline\n", "import matplotlib.pyplot as plots\n", "plots.style.use('fivethirtyeight')\n", "import math" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Inspired by Peter Norvig's [A Concrete Introduction to Probability](http://nbviewer.jupyter.org/url/norvig.com/ipython/Probability.ipynb)*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Probability is the study of the observations that are generated by sampling from a known distribution. The statistical methods we have studied so far allow us to reason about the world from data. Probability allows us to reason in the opposite direction: what observations result from known facts about the world.\n", "\n", "In practice, we rarely know precisely how random processes in the world work. Even the simplest random process such as [flipping a coin](http://statweb.stanford.edu/~susan/papers/headswithJ.pdf) is not as simple as one might assume. Nonetheless, our ability to reason about what data will result from a known random process is useful in many aspects of data science. Fundamentally, the rules of probability allow us to reason about the consequences of assumptions we make." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Probability Distributions\n", "\n", "Probability uses the following vocabulary to describe the relationship between known distributions and their observed outcomes.\n", "\n", "- **Experiment**: An occurrence with an uncertain outcome that we can observe. For example, the result of rolling two dice in sequence.\n", "- **Outcome**: The result of an experiment; one particular state of the world. For example: the first die comes up 4 and the second comes up 2. 
This outcome could be summarized as `(4, 2)`.\n", "- **Sample Space**: The set of all possible outcomes for the experiment. For example, `{(1, 1), (1, 2), (1, 3), ..., (1, 6), (2, 1), (2, 2), ... (6, 4), (6, 5), (6, 6)}` are all possible outcomes of rolling two dice in sequence. There are `6 * 6 = 36` different outcomes.\n", "- **Event**: A subset of possible outcomes that together have some property we are interested in. For example, the event \"the two dice sum to 5\" is the set of outcomes {(1, 4), (2, 3), (3, 2), (4, 1)}.\n", "- **Probability**: The proportion of experiments for which the event occurs. For example, the probability that the two dice sum to 5 is `4/36` or `1/9`.\n", "- **Distribution**: The probability of all events.\n", "\n", "The important part of this terminology is that the *sample space* is the set of all *outcomes*, and an *event* is a subset of the sample space. For an outcome that is an element of an event, we say that it is an outcome for which that event occurs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multinomial Distributions\n", "\n", "There are many kinds of distributions, but we will focus on the case where the sample space is a fixed, finite set of mutually exclusive outcomes, called a *multinomial* distribution. \n", "\n", "For example, either the two dice come up `(2, 1)` or they come up `(4, 5)` in a single roll (but both cannot occur), and there are exactly 36 outcome possibilities that each occur with equal chance." ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
First Second Chance
1 1 2.8%
1 2 2.8%
1 3 2.8%
1 4 2.8%
1 5 2.8%
1 6 2.8%
2 1 2.8%
2 2 2.8%
2 3 2.8%
2 4 2.8%
\n", "

... (26 rows omitted)\n", " \n", " \n", " Sum Chance\n", " \n", " \n", " \n", " \n", " 2 2.8% \n", " \n", " \n", " \n", " 3 5.6% \n", " \n", " \n", " \n", " 4 8.3% \n", " \n", " \n", " \n", " 5 11.1% \n", " \n", " \n", " \n", " 6 13.9% \n", " \n", " \n", " \n", " 7 16.7% \n", " \n", " \n", " \n", " 8 13.9% \n", " \n", " \n", " \n", " 9 11.1% \n", " \n", " \n", " \n", " 10 8.3% \n", " \n", " \n", " \n", " 11 5.6% \n", " \n", " \n", "\n", "

... (1 rows omitted)\n", " \n", " \n", " First Second Chance\n", " \n", " \n", " \n", " \n", " 1 4 2.8% \n", " \n", " \n", " \n", " 2 3 2.8% \n", " \n", " \n", " \n", " 3 2 2.8% \n", " \n", " \n", " \n", " 4 1 2.8% \n", " \n", " \n", "" ], "text/plain": [ "First | Second | Chance\n", "1 | 4 | 2.8%\n", "2 | 3 | 2.8%\n", "3 | 2 | 2.8%\n", "4 | 1 | 2.8%" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dice_sums = two_dice.column('First') + two_dice.column('Second')\n", "sum_of_5 = two_dice.where(dice_sums == 5)\n", "sum_of_5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The probability of the event is the sum of the resulting chances." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.1111111111111111" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sum(sum_of_5.column('Chance'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we consistently store the probabilities of individual outcomes in a column called `Chance`, then we can define the probability of any event using the `P` function." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.1111111111111111" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def P(event):\n", " return sum(event.column('Chance'))\n", "\n", "P(sum_of_5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One distribution's events can be another distribution's outcomes, as is the case with the `two_dice` and `two_dice_sums` distributions. There are 11 different possible sums in the `two_dice` table, and these are 11 different mutually exclusive events."
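, "\n", "\n", "As a quick check on these 11 events, each sum $s$ can be reached in $6 - |s - 7|$ of the 36 outcomes. Assuming, as stated earlier, that all 36 outcomes are equally likely:\n", "\n", "$$P(\\text{sum} = s) = \\frac{6 - |s - 7|}{36}, \\qquad s = 2, 3, \\ldots, 12$$\n", "\n", "For example, $P(\\text{sum} = 5) = 4/36 = 1/9$, which agrees with `P(sum_of_5)` above, and the 11 chances add up to $36/36 = 1$."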
] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
First Second Chance Sum
1 1 2.8% 2
1 2 2.8% 3
1 3 2.8% 4
1 4 2.8% 5
1 5 2.8% 6
1 6 2.8% 7
2 1 2.8% 3
2 2 2.8% 4
2 3 2.8% 5
2 4 2.8% 6
\n", "

... (26 rows omitted)\n", " \n", " \n", " Sum count\n", " \n", " \n", " \n", " \n", " 2 1 \n", " \n", " \n", " \n", " 3 2 \n", " \n", " \n", " \n", " 4 3 \n", " \n", " \n", " \n", " 5 4 \n", " \n", " \n", " \n", " 6 5 \n", " \n", " \n", " \n", " 7 6 \n", " \n", " \n", " \n", " 8 5 \n", " \n", " \n", " \n", " 9 4 \n", " \n", " \n", " \n", " 10 3 \n", " \n", " \n", " \n", " 11 2 \n", " \n", " \n", "\n", "

... (1 rows omitted)\n", " \n", " \n", " Sum Chance\n", " \n", " \n", " \n", " \n", " 2 2.8% \n", " \n", " \n", " \n", " 3 5.6% \n", " \n", " \n", " \n", " 4 8.3% \n", " \n", " \n", " \n", " 5 11.1% \n", " \n", " \n", " \n", " 6 13.9% \n", " \n", " \n", " \n", " 7 16.7% \n", " \n", " \n", " \n", " 8 13.9% \n", " \n", " \n", " \n", " 9 11.1% \n", " \n", " \n", " \n", " 10 8.3% \n", " \n", " \n", " \n", " 11 5.6% \n", " \n", " \n", "\n", "

... (1 rows omitted)\n", " \n", " \n", " Time Hour Chance\n", " \n", " \n", " \n", " \n", " 6 a.m. 6 2.9% \n", " \n", " \n", " \n", " 7 a.m. 7 4.5% \n", " \n", " \n", " \n", " 8 a.m. 8 6.3% \n", " \n", " \n", " \n", " 9 a.m. 9 5.0% \n", " \n", " \n", " \n", " 10 a.m. 10 5.0% \n", " \n", " \n", " \n", " 11 a.m. 11 5.0% \n", " \n", " \n", " \n", " Noon 12 6.0% \n", " \n", " \n", " \n", " 1 p.m. 13 5.7% \n", " \n", " \n", " \n", " 2 p.m. 14 5.1% \n", " \n", " \n", " \n", " 3 p.m. 15 4.8% \n", " \n", " \n", " \n", " 4 p.m. 16 4.9% \n", " \n", " \n", " \n", " 5 p.m. 17 5.0% \n", " \n", " \n", " \n", " 6 p.m. 18 4.5% \n", " \n", " \n", " \n", " 7 p.m. 19 4.0% \n", " \n", " \n", " \n", " 8 p.m. 20 4.0% \n", " \n", " \n", " \n", " 9 p.m. 21 3.7% \n", " \n", " \n", " \n", " 10 p.m. 22 3.5% \n", " \n", " \n", " \n", " 11 p.m. 23 3.3% \n", " \n", " \n", " \n", " Midnight 0 2.9% \n", " \n", " \n", " \n", " 1 a.m. 1 2.9% \n", " \n", " \n", " \n", " 2 a.m. 2 2.8% \n", " \n", " \n", " \n", " 3 a.m. 3 2.7% \n", " \n", " \n", " \n", " 4 a.m. 4 2.7% \n", " \n", " \n", " \n", " 5 a.m. 5 2.8% \n", " \n", " \n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "birth = Table.read_table('birth_time.csv').select(['Time', 'Hour', 'Chance'])\n", "birth.set_format('Chance', PercentFormatter(1)).show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we assume that this distribution describes the world correctly, what is the chance that a baby will be born between 8 a.m. and 6 p.m.? It turns out that more than half of all babies are born during \"business hours\", even though this time interval only covers 5/12 of the day." 
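, "\n", "\n", "The chance of a birth during business hours is the sum of the ten hourly chances from 8 a.m. through 5 p.m. Using the values displayed in the table above (rounded to one decimal place, so the total is approximate):\n", "\n", "$$0.063 + 0.050 + 0.050 + 0.050 + 0.060 + 0.057 + 0.051 + 0.048 + 0.049 + 0.050 \\approx 0.528$$\n", "\n", "By comparison, the interval's share of the day is $10/24 = 5/12 \\approx 0.417$."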
] }, { "cell_type": "code", "execution_count": 96, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Time Hour Chance
8 a.m. 8 6.3%
9 a.m. 9 5.0%
10 a.m. 10 5.0%
11 a.m. 11 5.0%
Noon 12 6.0%
1 p.m. 13 5.7%
2 p.m. 14 5.1%
3 p.m. 15 4.8%
4 p.m. 16 4.9%
5 p.m. 17 5.0%
" ], "text/plain": [ "Time | Hour | Chance\n", "8 a.m. | 8 | 6.3%\n", "9 a.m. | 9 | 5.0%\n", "10 a.m. | 10 | 5.0%\n", "11 a.m. | 11 | 5.0%\n", "Noon | 12 | 6.0%\n", "1 p.m. | 13 | 5.7%\n", "2 p.m. | 14 | 5.1%\n", "3 p.m. | 15 | 4.8%\n", "4 p.m. | 16 | 4.9%\n", "5 p.m. | 17 | 5.0%" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "business_hours = birth.where('Hour', are.between(8, 18))\n", "business_hours" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.52800000000000002" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P(business_hours)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the other hand, births late at night are uncommon. The chance of having a baby between midnight and 6 a.m. is much less than 25%, even though this time interval covers 1/4 of the day." ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.16799999999999998" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P(birth.where('Hour', are.between(0, 6)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conditional Distributions\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A common way to transform one distribution into another is to *condition* on an event. Conditioning means assuming that the event occurs. The conditional distribution *given* an event takes the chances of all outcomes for which the event occurs and scales up their chances so that they sum to 1.\n", "\n", "To scale up the chances, we divide the chance of each outcome in the event by the probability of the event. For example, if we condition on the event that the sum of two dice is above 8, we arrive at the following conditional distribution." 
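, "\n", "\n", "To illustrate the scaling with exact fractions: the sums 9, 10, 11, and 12 have chances $4/36$, $3/36$, $2/36$, and $1/36$, so the event has probability $10/36$. Dividing the chance of a sum of 9 by this probability gives\n", "\n", "$$\\frac{4/36}{10/36} = \\frac{4}{10} = 0.4$$\n", "\n", "and the other three sums scale to $0.3$, $0.2$, and $0.1$ in the same way."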
] }, { "cell_type": "code", "execution_count": 77, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sum Chance
9 40.0%
10 30.0%
11 20.0%
12 10.0%
" ], "text/plain": [ "Sum | Chance\n", "9 | 40.0%\n", "10 | 30.0%\n", "11 | 20.0%\n", "12 | 10.0%" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "above_8 = two_dice_sums.where('Sum', are.above(8))\n", "given_8 = above_8.with_column('Chance', above_8.column('Chance') / P(above_8))\n", "given_8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A conditional distribution describes the chances under the original distribution, updated to reflect a new piece of information. In this case, we see the chances under the original `two_dice_sums` distribution, *given that* the sum is above 8.\n", "\n", "The conditioning operation can be expressed by the `given` function, which takes an event table and returns the corresponding conditional distribution." ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sum Chance
9 40.0%
10 30.0%
11 20.0%
12 10.0%
" ], "text/plain": [ "Sum | Chance\n", "9 | 40.0%\n", "10 | 30.0%\n", "11 | 20.0%\n", "12 | 10.0%" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def given(event):\n", " return event.with_column('Chance', event.column('Chance') / P(event))\n", "\n", "given(two_dice_sums.where('Sum', are.above(8)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given that a baby is born in the US during business hours (between 8 a.m. and 6 p.m.), what are the chances that it is born before noon? To answer this question, we first form the conditional distribution given business hours, then compute the probability of the event that the baby is born before noon." ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Time Hour Chance
8 a.m. 8 11.9%
9 a.m. 9 9.5%
10 a.m. 10 9.5%
11 a.m. 11 9.5%
Noon 12 11.4%
1 p.m. 13 10.8%
2 p.m. 14 9.7%
3 p.m. 15 9.1%
4 p.m. 16 9.3%
5 p.m. 17 9.5%
" ], "text/plain": [ "Time | Hour | Chance\n", "8 a.m. | 8 | 11.9%\n", "9 a.m. | 9 | 9.5%\n", "10 a.m. | 10 | 9.5%\n", "11 a.m. | 11 | 9.5%\n", "Noon | 12 | 11.4%\n", "1 p.m. | 13 | 10.8%\n", "2 p.m. | 14 | 9.7%\n", "3 p.m. | 15 | 9.1%\n", "4 p.m. | 16 | 9.3%\n", "5 p.m. | 17 | 9.5%" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "given(business_hours)" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.40340909090909094" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P(given(business_hours).where('Hour', are.below(12)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This result is not the same as the chance that a baby is born between 8 a.m. and noon in general. " ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.21300000000000002" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "morning = birth.where('Hour', are.between(8, 12))\n", "P(morning)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nor is it the chance that a baby is born before noon in general." ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.45500000000000007" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P(birth.where('Hour', are.below(12)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead, it is the probability that both the event and the conditioning event are true (i.e., that the birth is in the morning), divided by the probability of the conditioning event (i.e., that the birth is during business hours)." 
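, "\n", "\n", "Plugging in the two values computed above gives a worked check:\n", "\n", "$$\\frac{P(\\text{morning})}{P(\\text{business hours})} = \\frac{0.213}{0.528} \\approx 0.403$$"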
] }, { "cell_type": "code", "execution_count": 106, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.40340909090909094" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P(morning) / P(business_hours)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `morning` event is a *joint* event that can be described as both during business hours and before noon. A joint event is just an event, but one that is described by the intersection of two other events." ] }, { "cell_type": "code", "execution_count": 108, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Time Hour Chance
8 a.m. 8 6.3%
9 a.m. 9 5.0%
10 a.m. 10 5.0%
11 a.m. 11 5.0%
" ], "text/plain": [ "Time | Hour | Chance\n", "8 a.m. | 8 | 6.3%\n", "9 a.m. | 9 | 5.0%\n", "10 a.m. | 10 | 5.0%\n", "11 a.m. | 11 | 5.0%" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "business_hours.where('Hour', are.below(12))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, the probability of an event B (e.g., before noon) conditioned on an event A (e.g., during business hours) is the probability of the joint event of A and B divided by the probability of the conditioning event A. " ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.40340909090909094" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P(business_hours.where('Hour', are.below(12))) / P(business_hours)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The standard notation for this relationship uses a vertical bar for conditioning and a comma for a joint event.\n", "\n", "$$P(B | A) = \\frac{P(A, B)}{P(A)}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Joint Distributions\n", "\n", "A joint event is one in which two different events both occur. Similarly, a joint outcome is an outcome that is composed of two different outcomes. For example, the outcomes of the `two_dice` distribution are each joint outcomes of the first and second dice rolls." ] }, { "cell_type": "code", "execution_count": 111, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
First Second Chance
1 1 2.8%
1 2 2.8%
1 3 2.8%
1 4 2.8%
1 5 2.8%
1 6 2.8%
2 1 2.8%
2 2 2.8%
2 3 2.8%
2 4 2.8%
\n", "

... (26 rows omitted)\n", "\n", "All of the proportions used to generate this chart appear in the `birth_day` table." ] }, { "cell_type": "code", "execution_count": 118, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Time Hour Weekday Weekend
6 a.m. 6 2.7% 3.6%
7 a.m. 7 4.7% 3.8%
8 a.m. 8 6.7% 4.6%
9 a.m. 9 5.1% 5.0%
10 a.m. 10 5.0% 5.0%
11 a.m. 11 5.0% 4.9%
Noon 12 6.3% 5.0%
1 p.m. 13 5.9% 4.7%
2 p.m. 14 5.3% 4.6%
3 p.m. 15 4.9% 4.6%
\n", "

... (14 rows omitted)\n", " \n", " \n", " Day Hour Chance\n", " \n", " \n", " \n", " \n", " Weekday 6 2.1% \n", " \n", " \n", " \n", " Weekday 7 3.7% \n", " \n", " \n", " \n", " Weekday 8 5.2% \n", " \n", " \n", " \n", " Weekday 9 4.0% \n", " \n", " \n", " \n", " Weekday 10 3.9% \n", " \n", " \n", " \n", " Weekday 11 3.9% \n", " \n", " \n", " \n", " Weekday 12 4.9% \n", " \n", " \n", " \n", " Weekday 13 4.6% \n", " \n", " \n", " \n", " Weekday 14 4.1% \n", " \n", " \n", " \n", " Weekday 15 3.8% \n", " \n", " \n", "\n", "

... (38 rows omitted)\n", " \n", " \n", " Day Hour Chance\n", " \n", " \n", " \n", " \n", " Weekday 5 2.0% \n", " \n", " \n", " \n", " Weekend 5 0.8% \n", " \n", " \n", "" ], "text/plain": [ "Day | Hour | Chance\n", "Weekday | 5 | 2.0%\n", "Weekend | 5 | 0.8%" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "early_morning = birth_joint.where('Hour', 5)\n", "early_morning" ] }, { "cell_type": "code", "execution_count": 139, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.28584466551063248" ] }, "execution_count": 139, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P(given(early_morning).where('Day', 'Weekend'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bayes' Rule\n", "\n", "In the final example above, we began with a distribution over weekday and weekend births, $P(A)$, along with conditional distributions over hours, $P(B|A)$. We were able to compute the probability of a weekend birth conditioned on an hour, an outcome in the distribution $P(A|B)$. Bayes' rule is the formula that expresses the relationship between all of these quantities.\n", "\n", "$$ P(A|B) = \\frac{P(A) \\times P(B|A)}{P(B)} $$\n", "\n", "The numerator on the right-hand side is just the joint probability of A and B. Bayes' rule writes this probability in its expanded form because so often we are given these two components and must form the joint distribution through multiplication. \n", "\n", "The denominator on the right-hand side is an event that can be computed from the joint probability. For example, the `early_morning` event in the example above is an event just about hours, but it includes all joint outcomes where the correct hour occurs.\n", "\n", "Each probability in this equation has a name. 
The names are derived from the most typical application of Bayes' rule, which is to update one's beliefs based on new evidence.\n", "\n", "- $P(A)$ is called the *prior*; it is the probability of event A before any evidence is observed.\n", "- $P(B|A)$ is called the *likelihood*; it is the conditional probability of the evidence event B given the event A.\n", "- $P(B)$ is called the *evidence*; it is the overall probability of the evidence event B, whether or not A occurs.\n", "- $P(A|B)$ is called the *posterior*; it is the probability of event A after evidence event B is observed.\n", "\n", "Depending on the joint distribution of A and B, observing some B can make A more or less likely." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Diagnostic Example\n", "\n", "In a population, there is a rare disease. Researchers have developed a medical test for the disease. Mostly, the test correctly identifies whether or not the tested person has the disease. But sometimes, the test is wrong. Here are the relevant proportions.\n", "\n", "- 1% of the population has the disease.\n", "- If a person has the disease, the test returns the correct result with chance 99%.\n", "- If a person does not have the disease, the test returns the correct result with chance 99.5%.\n", "\n", "**One person is picked at random from the population.** Given that the person tests positive, what is the chance that the person has the disease?\n", "\n", "We begin by partitioning the population into four categories in the tree diagram below.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By Bayes' Rule, the chance that the person has the disease given that he or she has tested positive is the chance of the top \"Test Positive\" branch relative to the total chance of the two \"Test Positive\" branches. 
The answer is\n", "$$\n", "\\frac{0.01 \\times 0.99}{0.01 \\times 0.99 ~+~ 0.99 \\times 0.005} ~=~ 0.667\n", "$$" ] }, { "cell_type": "code", "execution_count": 142, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.6666666666666666" ] }, "execution_count": 142, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The person is picked at random from the population.\n", "\n", "# By Bayes' Rule:\n", "# Chance that the person has the disease, given that test was +\n", "\n", "(0.01*0.99)/(0.01*0.99 + 0.99*0.005)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is 2/3, and it seems rather small. The test has very high accuracy, 99% or higher. Yet is our answer saying that if a patient gets tested and the test result is positive, there is only a 2/3 chance that he or she has the disease?\n", "\n", "To understand our answer, it is important to recall the chance model: our calculation is valid for **a person picked at random from the population**. Among all the people in the population, the people who test positive split into two groups: those who have the disease and test positive, and those who don't have the disease and test positive. The latter group is called the group of *false positives*. The proportion of true positives is twice as high as that of the false positives – $0.01 \\times 0.99$ compared to $0.99 \\times 0.005$ – and hence the chance of a true positive given a positive test result is $2/3$. The chance is affected both by the accuracy of the test and also by the probability that the sampled person has the disease.\n", "\n", "The same result can be computed using a table. Below, we begin with the joint distribution over true health and test results." 
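, "\n", "\n", "Each joint chance is the chance of the health status multiplied by the chance of the test result given that status. Written out with the proportions listed earlier:\n", "\n", "- $P(\\text{Diseased}, \\text{Positive}) = 0.01 \\times 0.99 = 0.0099$\n", "- $P(\\text{Diseased}, \\text{Negative}) = 0.01 \\times 0.01 = 0.0001$\n", "- $P(\\text{Not Diseased}, \\text{Positive}) = 0.99 \\times 0.005 = 0.00495$\n", "- $P(\\text{Not Diseased}, \\text{Negative}) = 0.99 \\times 0.995 = 0.98505$\n", "\n", "The four chances sum to 1, as they must for a distribution."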
] }, { "cell_type": "code", "execution_count": 147, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Health Test Chance
Diseased Positive 0.0099
Diseased Negative 0.0001
Not Diseased Positive 0.00495
Not Diseased Negative 0.98505
" ], "text/plain": [ "Health | Test | Chance\n", "Diseased | Positive | 0.0099\n", "Diseased | Negative | 0.0001\n", "Not Diseased | Positive | 0.00495\n", "Not Diseased | Negative | 0.98505" ] }, "execution_count": 147, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rare = Table(['Health', 'Test', 'Chance']).with_rows([\n", " ['Diseased', 'Positive', 0.01 * 0.99],\n", " ['Diseased', 'Negative', 0.01 * 0.01],\n", " ['Not Diseased', 'Positive', 0.99 * 0.005],\n", " ['Not Diseased', 'Negative', 0.99 * 0.995]\n", " ])\n", "rare" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The chance that a person selected at random is diseased, given that they tested positive, is computed from the following expression." ] }, { "cell_type": "code", "execution_count": 148, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.66666666666666663" ] }, "execution_count": 148, "metadata": {}, "output_type": "execute_result" } ], "source": [ "positive = rare.where('Test', 'Positive')\n", "P(given(positive).where('Health', 'Diseased'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But a patient who goes to get tested for a disease is not well modeled as a random member of the population. People get tested because they think they might have the disease, or because their doctor thinks so. In such a case, saying that their chance of having the disease is 1% is not justified; they are not picked at random from the population.\n", "\n", "So, while our answer is correct for a random member of the population, it does not answer the question for a person who has walked into a doctor's office to be tested.\n", "\n", "To answer the question for such a person, we must first ask ourselves what is the probability that the person has the disease. It is natural to think that it is larger than 1%, as the person has some reason to believe that he or she might have the disease. 
But how much larger?\n", "\n", "This cannot be decided based on relative frequencies. The probability that a particular individual has the disease has to be based on a subjective opinion, and is therefore called a *subjective probability*. Some researchers insist that all probabilities must be relative frequencies, but subjective probabilities abound. The chance that a candidate wins the next election, the chance that a big earthquake will hit the Bay Area in the next decade, the chance that a particular country wins the next soccer World Cup: none of these are based on relative frequencies or long run frequencies. Each one contains a subjective element. \n", "\n", "It is fine to work with subjective probabilities as long as you keep in mind that there will be a subjective element in your answer. Be aware also that different people can have different subjective probabilities of the same event. For example, the patient's subjective probability that he or she has the disease could be quite different from the doctor's subjective probability of the same event. Here we will work from the patient's point of view.\n", "\n", "Suppose the patient assigned a number to his/her degree of uncertainty about whether he/she had the disease, *before* seeing the test result. This number is the patient's *subjective prior probability* of having the disease.\n", "\n", "If that probability were 10%, then the probabilities on the left side of the tree diagram would change accordingly, with the 0.1 and 0.9 now interpreted as subjective probabilities:\n", "\n", "\n", "\n", "The change has a noticeable effect on the answer, as you can see by running the cell below." 
] }, { "cell_type": "code", "execution_count": 143, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.9565217391304347" ] }, "execution_count": 143, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Subjective prior probability of 10% that the person has the disease\n", "\n", "# By Bayes' Rule:\n", "# Chance that the person has the disease, given that test was +\n", "\n", "(0.1*0.99)/(0.1*0.99 + 0.9*0.005)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the patient's prior probability of having the disease is 10%, then after a positive test result the patient must update that probability to over 95%. This updated probability is called a *posterior* probability. It is calculated *after* learning the test result.\n", "\n", "If the patient's prior probability of having the disease is 50%, then the result changes yet again. \n", "\n", "" ] }, { "cell_type": "code", "execution_count": 144, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.9949748743718593" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Subjective prior probability of 50% that the person has the disease\n", "\n", "# By Bayes' Rule: \n", "# Chance that the person has the disease, given that test was +\n", "\n", "(0.5*0.99)/(0.5*0.99 + 0.5*0.005)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Starting out with a 50-50 subjective chance of having the disease, the patient must update that chance to about 99.5% after getting a positive test result. \n", "\n", "**Computational Note**. In the calculation above, the factor of 0.5 is common to all the terms and cancels out. 
Hence arithmetically it is the same as a calculation where the prior probabilities are apparently missing:" ] }, { "cell_type": "code", "execution_count": 146, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.9949748743718593" ] }, "execution_count": 146, "metadata": {}, "output_type": "execute_result" } ], "source": [ "0.99/(0.99 + 0.005)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But in fact, they are not missing. They have just canceled out. When the prior probabilities are not all equal, then they are all visible in the calculation as we have seen earlier.\n", "\n", "In machine learning applications such as spam detection, Bayes' Rule is used to update probabilities of messages being spam, based on new messages being labeled Spam or Not Spam. You will need more advanced mathematics to carry out all the calculations. But the fundamental method is the same as what you have seen in this section." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.4" } }, "nbformat": 4, "nbformat_minor": 0 }