{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The Left Handed Sister Problem" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "Think Bayes, Second Edition\n", "\n", "Copyright 2021 Allen B. Downey\n", "\n", "License: [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose you meet someone who looks like the brother of your friend Mary. \n", "You ask if he has a sister named Mary, and he says \"Yes I do, but I don't think I know you.\"\n", "\n", "You remember that Mary has a sister who is left-handed, but you don't remember her name.\n", "So you ask your new friend if he has another sister who is left-handed.\n", "\n", "If he does, how much evidence does that provide that he is the brother of your friend, rather than a random person who coincidentally has a sister named Mary and another sister who is left-handed. In other words, what is the Bayes factor of the left-handed sister?\n", "\n", "Let's assume:\n", "\n", "* Out of 100 families with children, 20 have one child, 30 have two children, 40 have three children, and 10 have four children.\n", "\n", "* All children are either boys or girls with equal probability, one girl in 10 is left-handed, and one girl in 100 is named Mary.\n", "\n", "* Name, sex, and handedness are independent, so every child has the same probability of being a girl, left-handed, or named Mary.\n", "\n", "* If the person you met had more than one sister named Mary, he would have said so, but he could have more than one sister who is left handed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Constructing the prior\n", "\n", "I'll make a Pandas `Series` that enumerates possible families with 2, 3, or 4 children." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "qs = [(2, 0),\n", " (1, 1),\n", " (0, 2),\n", " (3, 0),\n", " (2, 1),\n", " (1, 2),\n", " (0, 3),\n", " (4, 0),\n", " (3, 1),\n", " (2, 2),\n", " (1, 3),\n", " (0, 4),\n", " ]\n", "index = pd.MultiIndex.from_tuples(qs, names=['Boys', 'Girls'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To compute the proportion of each type of family, I'll use Scipy to compute the binomial distribution." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import binom\n", "\n", "boys = index.to_frame()['Boys']\n", "girls = index.to_frame()['Girls']\n", "ps = binom.pmf(girls, boys+girls, 0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And put the results into a Pandas `Series`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Prior
BoysGirls
200.2500
110.5000
020.2500
300.1250
210.3750
120.3750
030.1250
400.0625
310.2500
220.3750
130.2500
040.0625
\n", "
" ], "text/plain": [ " Prior\n", "Boys Girls \n", "2 0 0.2500\n", "1 1 0.5000\n", "0 2 0.2500\n", "3 0 0.1250\n", "2 1 0.3750\n", "1 2 0.3750\n", "0 3 0.1250\n", "4 0 0.0625\n", "3 1 0.2500\n", "2 2 0.3750\n", "1 3 0.2500\n", "0 4 0.0625" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prior1 = pd.Series(ps, index, name='Prior')\n", "pd.DataFrame(prior1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But we also have the information frequencies of these families are proportional to 30%, 40%, and 10%, so we can multiply through." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Prior
BoysGirls
207.500
1115.000
027.500
305.000
2115.000
1215.000
035.000
400.625
312.500
223.750
132.500
040.625
\n", "
" ], "text/plain": [ " Prior\n", "Boys Girls \n", "2 0 7.500\n", "1 1 15.000\n", "0 2 7.500\n", "3 0 5.000\n", "2 1 15.000\n", "1 2 15.000\n", "0 3 5.000\n", "4 0 0.625\n", "3 1 2.500\n", "2 2 3.750\n", "1 3 2.500\n", "0 4 0.625" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ps = [30, 30, 30, 40, 40, 40, 40, 10, 10, 10, 10, 10]\n", "\n", "prior1 *= ps\n", "pd.DataFrame(prior1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So that's the (unnormalized) prior.\n", "\n", "I'll use the following function to do Bayesian updates." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "def make_table(prior, likelihood):\n", " \"\"\"Make a DataFrame representing a Bayesian update.\"\"\"\n", " table = pd.DataFrame(prior)\n", " table.columns = ['Prior']\n", " table['Likelihood'] = likelihood\n", " table['Product'] = (table['Prior'] * table['Likelihood'])\n", " total = table['Product'].sum()\n", " table['Posterior'] = table['Product'] / total\n", " return table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function takes a prior and a likelihood and returns a `DataFrame`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The first update\n", "\n", "Due to [length-biased sampling](https://towardsdatascience.com/the-inspection-paradox-is-everywhere-2ef1c2e9d709), the person you met is more likely to come from family with more boys.\n", "Specifically, the likelihood of meeting someone from a family with $n$ boys is proportional to $n$." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PriorLikelihoodProductPosterior
BoysGirls
207.500215.00.136364
1115.000115.00.136364
027.50000.00.000000
305.000315.00.136364
2115.000230.00.272727
1215.000115.00.136364
035.00000.00.000000
400.62542.50.022727
312.50037.50.068182
223.75027.50.068182
132.50012.50.022727
040.62500.00.000000
\n", "
" ], "text/plain": [ " Prior Likelihood Product Posterior\n", "Boys Girls \n", "2 0 7.500 2 15.0 0.136364\n", "1 1 15.000 1 15.0 0.136364\n", "0 2 7.500 0 0.0 0.000000\n", "3 0 5.000 3 15.0 0.136364\n", "2 1 15.000 2 30.0 0.272727\n", "1 2 15.000 1 15.0 0.136364\n", "0 3 5.000 0 0.0 0.000000\n", "4 0 0.625 4 2.5 0.022727\n", "3 1 2.500 3 7.5 0.068182\n", "2 2 3.750 2 7.5 0.068182\n", "1 3 2.500 1 2.5 0.022727\n", "0 4 0.625 0 0.0 0.000000" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "likelihood1 = prior1.index.to_frame()['Boys']\n", "table1 = make_table(prior1, likelihood1)\n", "table1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So that's what we should believe about the family after the first update." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The second update\n", "\n", "The likelihood that a person has exactly one sister named Mary is given by the binomial distribution where `n` is the number of girls in the family and `p` is the probability that a girl is named Mary." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0. , 0.01 , 0.0198 , 0. , 0.01 ,\n", " 0.0198 , 0.029403 , 0. , 0.01 , 0.0198 ,\n", " 0.029403 , 0.03881196])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scipy.stats import binom\n", "\n", "ns = prior1.index.to_frame()['Girls']\n", "p = 1 / 100\n", "k = 1\n", "\n", "likelihood2 = binom.pmf(k, ns, p)\n", "likelihood2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's the second update." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PriorLikelihoodProductPosterior
BoysGirls
200.1363640.0000000.0000000.000000
110.1363640.0100000.0013640.143677
020.0000000.0198000.0000000.000000
300.1363640.0000000.0000000.000000
210.2727270.0100000.0027270.287354
120.1363640.0198000.0027000.284481
030.0000000.0294030.0000000.000000
400.0227270.0000000.0000000.000000
310.0681820.0100000.0006820.071839
220.0681820.0198000.0013500.142240
130.0227270.0294030.0006680.070409
040.0000000.0388120.0000000.000000
\n", "
" ], "text/plain": [ " Prior Likelihood Product Posterior\n", "Boys Girls \n", "2 0 0.136364 0.000000 0.000000 0.000000\n", "1 1 0.136364 0.010000 0.001364 0.143677\n", "0 2 0.000000 0.019800 0.000000 0.000000\n", "3 0 0.136364 0.000000 0.000000 0.000000\n", "2 1 0.272727 0.010000 0.002727 0.287354\n", "1 2 0.136364 0.019800 0.002700 0.284481\n", "0 3 0.000000 0.029403 0.000000 0.000000\n", "4 0 0.022727 0.000000 0.000000 0.000000\n", "3 1 0.068182 0.010000 0.000682 0.071839\n", "2 2 0.068182 0.019800 0.001350 0.142240\n", "1 3 0.022727 0.029403 0.000668 0.070409\n", "0 4 0.000000 0.038812 0.000000 0.000000" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prior2 = table1['Posterior']\n", "table2 = make_table(prior2, likelihood2)\n", "table2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the sister named Mary, we can rule out families with no girls, and families with more than one girls are more likely." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Probability of a left-handed sister\n", "\n", "Finally, we can compute the probability that he has at least one left-handed sister.\n", "The likelihood comes from the binomial distribution again, where `n` is the number of *additional* sisters, and we use the survival function to compute the probability that one or more are left-handed." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Additional sisters
BoysGirls
200
110
021
300
210
121
032
400
310
221
132
043
\n", "
" ], "text/plain": [ " Additional sisters\n", "Boys Girls \n", "2 0 0\n", "1 1 0\n", "0 2 1\n", "3 0 0\n", "2 1 0\n", "1 2 1\n", "0 3 2\n", "4 0 0\n", "3 1 0\n", "2 2 1\n", "1 3 2\n", "0 4 3" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ns = prior1.index.to_frame()['Girls'] - 1\n", "ns.name = 'Additional sisters'\n", "neg = (ns < 0)\n", "ns[neg] = 0\n", "pd.DataFrame(ns)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0. , 0. , 0.1 , 0. , 0. , 0.1 , 0.19 , 0. , 0. ,\n", " 0.1 , 0.19 , 0.271])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p = 1 / 10\n", "k = 1\n", "\n", "likelihood3 = binom.sf(k-1, ns, p)\n", "likelihood3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A convenient way to compute the total probability of an outcome is to do an update as if it happened, ignore the posterior probabilities, and compute the sum of the products. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PriorLikelihoodProductPosterior
BoysGirls
200.0000000.0000.0000000.000000
110.1436770.0000.0000000.000000
020.0000000.1000.0000000.000000
300.0000000.0000.0000000.000000
210.2873540.0000.0000000.000000
120.2844810.1000.0284480.507550
030.0000000.1900.0000000.000000
400.0000000.0000.0000000.000000
310.0718390.0000.0000000.000000
220.1422400.1000.0142240.253775
130.0704090.1900.0133780.238675
040.0000000.2710.0000000.000000
\n", "
" ], "text/plain": [ " Prior Likelihood Product Posterior\n", "Boys Girls \n", "2 0 0.000000 0.000 0.000000 0.000000\n", "1 1 0.143677 0.000 0.000000 0.000000\n", "0 2 0.000000 0.100 0.000000 0.000000\n", "3 0 0.000000 0.000 0.000000 0.000000\n", "2 1 0.287354 0.000 0.000000 0.000000\n", "1 2 0.284481 0.100 0.028448 0.507550\n", "0 3 0.000000 0.190 0.000000 0.000000\n", "4 0 0.000000 0.000 0.000000 0.000000\n", "3 1 0.071839 0.000 0.000000 0.000000\n", "2 2 0.142240 0.100 0.014224 0.253775\n", "1 3 0.070409 0.190 0.013378 0.238675\n", "0 4 0.000000 0.271 0.000000 0.000000" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prior3 = table2['Posterior']\n", "table3 = make_table(prior3, likelihood3)\n", "table3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point, there are only three family types left standing, (1,2), (2,2), and (1,3).\n", "\n", "Here's the total probability that your new friend has a left-handed sister." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0560498128605398" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p = table3['Product'].sum()\n", "p" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Bayes factor\n", "\n", "If your interlocutor is the brother of your friend, the probability is 1 that he has a left-handed sister.\n", "If he is not the brother of your friend, the probability is `p`.\n", "So the Bayes factor is the ratio of these probabilities." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "17.841272770850235" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1/p" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This might be the hardest Bayesian puzzle I've created.\n", "In fact, I got it wrong the first time, until [Aubrey Clayton convinced me](https://twitter.com/aubreyclayton/status/1420041376377475075) I needed to take into account the number of boys and girls in each family, not just the size.\n", "He solved the problem by enumerating the possible families in a giant spreadsheet!\n", "So the fact that we get the same answer gives me more confidence it is correct.\n", "\n", "Thanks to Aubrey and the other folks on Twitter who submitted answers, including \n", "[Corey Yanofsky](https://twitter.com/Corey_Yanofsky/status/1418627294256582664) and\n", "[Michal Haltuf](https://twitter.com/MichalHaltuf/status/1418685902717693952).\n", "\n", "If you like this puzzle, you might like the [new second edition of *Think Bayes*](https://thinkbayes.com)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 2 }