{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bite Size Bayes\n", "\n", "Copyright 2020 Allen B. Downey\n", "\n", "License: [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The \"Girl Named Florida\" problem\n", "\n", "In [The Drunkard's Walk](https://www.goodreads.com/book/show/2272880.The_Drunkard_s_Walk), Leonard Mlodinow presents \"The Girl Named Florida Problem\":\n", "\n", ">\"In a family with two children, what are the chances, if [at least] one of the children is a girl named Florida, that both children are girls?\"\n", "\n", "I added \"at least\" to Mlodinow's statement of the problem to avoid a subtle ambiguity (which I'll explain at the end).\n", "\n", "To avoid some real-world complications, let's assume that this question takes place in an imaginary city called Statesville where:\n", "\n", "* Every family has two children.\n", "\n", "* 50% of children are male and 50% are female.\n", "\n", "* All children are named after U.S. states, and all state names are chosen with equal probability.\n", "\n", "* Genders and names within each family are chosen independently." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To answer Mlodinow's question, I'll create a DataFrame with one row for each family in Statesville and a column for the gender and name of each child.\n", "\n", "Here's a list of genders and a [dictionary of state names](https://gist.github.com/tlancon/9794920a0c3a9990279de704f936050c):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "gender = ['B', 'G']" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "us_states = {\n", " 'Alabama': 'AL',\n", " 'Alaska': 'AK',\n", " 'Arizona': 'AZ',\n", " 'Arkansas': 'AR',\n", " 'California': 'CA',\n", " 'Colorado': 'CO',\n", " 'Connecticut': 'CT',\n", " 'Delaware': 'DE',\n", "# 'District of Columbia': 'DC',\n", " 'Florida': 'FL',\n", " 'Georgia': 'GA',\n", " 'Hawaii': 'HI',\n", " 'Idaho': 'ID',\n", " 'Illinois': 'IL',\n", " 'Indiana': 'IN',\n", " 'Iowa': 'IA',\n", " 'Kansas': 'KS',\n", " 'Kentucky': 'KY',\n", " 'Louisiana': 'LA',\n", " 'Maine': 'ME',\n", " 'Maryland': 'MD',\n", " 'Massachusetts': 'MA',\n", " 'Michigan': 'MI',\n", " 'Minnesota': 'MN',\n", " 'Mississippi': 'MS',\n", " 'Missouri': 'MO',\n", " 'Montana': 'MT',\n", " 'Nebraska': 'NE',\n", " 'Nevada': 'NV',\n", " 'New Hampshire': 'NH',\n", " 'New Jersey': 'NJ',\n", " 'New Mexico': 'NM',\n", " 'New York': 'NY',\n", " 'North Carolina': 'NC',\n", " 'North Dakota': 'ND',\n", " 'Ohio': 'OH',\n", " 'Oklahoma': 'OK',\n", " 'Oregon': 'OR',\n", " 'Pennsylvania': 'PA',\n", " 'Rhode Island': 'RI',\n", " 'South Carolina': 'SC',\n", " 'South Dakota': 'SD',\n", " 'Tennessee': 'TN',\n", " 'Texas': 'TX',\n", " 'Utah': 'UT',\n", " 'Vermont': 'VT',\n", " 'Virginia': 'VA',\n", " 'Washington': 'WA',\n", " 'West Virginia': 'WV',\n", " 'Wisconsin': 'WI',\n", " 'Wyoming': 'WY'\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To enumerate all possible combinations of genders and names, I'll use `from_product`, which makes a Pandas MultiIndex." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "names = ['gender1', 'name1', 'gender2', 'name2']\n", "\n", "index = pd.MultiIndex.from_product([gender, us_states]*2, \n", " names=names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now I'll create a DataFrame with that index:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gender1name1gender2name2
BAlabamaBAlabama
Alaska
Arizona
Arkansas
California
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: []\n", "Index: [(B, Alabama, B, Alabama), (B, Alabama, B, Alaska), (B, Alabama, B, Arizona), (B, Alabama, B, Arkansas), (B, Alabama, B, California)]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(index=index)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It will be easier to work with if I reindex it so the levels in the MultiIndex become columns." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gender1name1gender2name2
0BAlabamaBAlabama
1BAlabamaBAlaska
2BAlabamaBArizona
3BAlabamaBArkansas
4BAlabamaBCalifornia
\n", "
" ], "text/plain": [ " gender1 name1 gender2 name2\n", "0 B Alabama B Alabama\n", "1 B Alabama B Alaska\n", "2 B Alabama B Arizona\n", "3 B Alabama B Arkansas\n", "4 B Alabama B California" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.reset_index()\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This DataFrame contains one row for each family in Statesville; for example, the first row represents a family with two boys, both named Alabama.\n", "\n", "As it turns out, there are 10,000 families in Statesville:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10000" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Probabilities\n", "\n", "To compute probabilities, we'll use Boolean Series. For example, the following Series is `True` for each family where the first child is a girl:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "girl1 = (df['gender1']=='G')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function takes a Boolean Series and counts the number of `True` values, which is the probability that the condition is true." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def prob(A):\n", " \"\"\"Computes the probability of a proposition, A.\n", " \n", " A: Boolean series\n", " \n", " returns: probability\n", " \"\"\"\n", " assert isinstance(A, pd.Series)\n", " assert A.dtype == 'bool'\n", " \n", " return A.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not surprisingly, the probability is 50% that the first child is a girl." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prob(girl1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And so is the probability that the second child is a girl." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "girl2 = (df['gender2']=='G')\n", "prob(girl2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mlodinow's question is a conditional probability: given that one of the children is a girl named Florida, what is the probability that both children are girls?\n", "\n", "To compute conditional probabilities, I'll use this function, which takes two Boolean Series, `A` and `B`, and computes the conditional probability $P(A~\\mathrm{given}~B)$." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def conditional(A, B):\n", " \"\"\"Conditional probability of A given B.\n", " \n", " A: Boolean series\n", " B: Boolean series\n", " \n", " returns: probability\n", " \"\"\"\n", " return prob(A[B])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, here's the probability that the second child is a girl, given that the first child is a girl." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conditional(girl2, girl1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is 50%, which is the same as the unconditioned probability that the second child is a girl:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prob(girl2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So that confirms that the genders of the two children are independent, which is one of my assumptions.\n", "\n", "Now, Mlodinow's question asks about the probability that both children are girls, so let's compute that." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.25" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gg = (girl1 & girl2)\n", "prob(gg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In 25% of families, both children are girls. And that should be no surprise: because they are independent, the probability of the conjunction is the product of the probabilities:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.25" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prob(girl1) * prob(girl2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While we're at it, we can also compute the conditional probability of two girls, given that the first child is a girl." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conditional(gg, girl1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's what we should expect. If we know the first child is a girl, and the probability is 50% that the second child is a girl, the probability of two girls is 50%." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## At least one girl\n", "\n", "Before I answer Mlodinow's question, I'll warm up with a simpler version: given that at least one of the children is a girl, what is the probability that both are?\n", "\n", "To compute the probability of \"at least one girl\" I will use the `|` operator, which computes the logical `OR` of the two Series:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.75" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "at_least_one_girl = (girl1 | girl2)\n", "prob(at_least_one_girl)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "75% of the families in Statesville have at least one girl.\n", "\n", "Now we can compute the conditional probability of two girls, given that the family has at least one girl." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.3333333333333333" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conditional(gg, at_least_one_girl)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of the families that have at least one girl, `1/3` have two girls.\n", "\n", "If you have not thought about questions like this before, that result might surprise you. The following figure might help:\n", "\n", "\n", "\n", "In the top left, the gray square represents a family with two boys; in the lower right, the dark blue square represents a family with two girls." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The other two quadrants represent families with one girl, but note that there are two ways that can happen: the first child can be a girl or the second child can be a girl.\n", "\n", "There are an equal number of families in each quadrant.\n", "\n", "If we select families with at least one girl, we eliminate the gray square in the upper left. Of the remaining three squares, one of them has two girls.\n", "\n", "So if we know a family has at least one girl, the probability they have two girls is 33%." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What's in a name?\n", "\n", "So far, we have computed two conditional probabilities:\n", "\n", "* Given that the first child is a girl, the probability is 50% that both children are girls.\n", "\n", "* Given that at least one child is a girl, the probability is 33% that both children are girls.\n", "\n", "Now we're ready to answer Mlodinow's question:\n", "\n", "* Given that at least one child is a girl *named Florida*, what is the probability that both children are girls?\n", "\n", "If your intuition is telling you that the name of the child can't possibly matter, brace yourself.\n", "\n", "Here's the probability that the first child is a girl named Florida." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.01" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gf1 = girl1 & (df['name1']=='Florida')\n", "prob(gf1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the probability that the second child is a girl named Florida." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.01" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gf2 = girl2 & (df['name2']=='Florida')\n", "prob(gf2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To compute the probability that at least one of the children is a girl named Florida, we can use the `|` operator again. " ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0199" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "at_least_one_girl_named_florida = (gf1 | gf2)\n", "prob(at_least_one_girl_named_florida)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can double-check it by using the disjunction rule:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0199" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prob(gf1) + prob(gf2) - prob(gf1 & gf2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, the percentage of families with at least one girl named Florida is a little less than 2%.\n", "\n", "Now, finally, here is the answer to Mlodinow's question:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.49748743718592964" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conditional(gg, at_least_one_girl_named_florida)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's right, the answer is about 49.7%. To summarize:\n", "\n", "* Given that the first child is a girl, the probability is 50% that both children are girls.\n", "\n", "* Given that at least one child is a girl, the probability is 33% that both children are girls.\n", "\n", "* Given that at least one child is a girl *named Florida*, the probability is 49.7% that both children are girls.\n", "\n", "If your brain just exploded, I'm sorry." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's my best attempt to put your brain back together.\n", "\n", "For each child, there are three possibilities: boy (B), girl not named Florida (G), and girl named Florida (GF), with these probabilities:\n", "\n", "$P(B) = 1/2 $\n", "\n", "$P(G) = 1/2 - x $\n", "\n", "$P(GF) = x $\n", "\n", "where $x$ is the percentage of people who are girls named Florida. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In families with two children, here are the possible combinations and their probabilities:\n", "\n", "$P(B, B) = (1/2)(1/2)$\n", "\n", "$P(B, G) = (1/2)(1/2-x)$\n", "\n", "$P(B, GF) = (1/2)(x)$\n", "\n", "$P(G, B) = (1/2-x)(1/2)$\n", "\n", "$P(G, G) = (1/2-x)(1/2-x)$\n", "\n", "$P(G, GF) = (1/2-x)(x)$\n", "\n", "$P(GF, B) = (x)(1/2)$\n", "\n", "$P(GF, G) = (x)(1/2-x)$\n", "\n", "$P(GF, GF) = (x)(x)$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we select only the families that have at least one girl named Florida, here are their probabilities:\n", "\n", "$P(B, GF) = (1/2)(x)$\n", "\n", "$P(G, GF) = (1/2-x)(x)$\n", "\n", "$P(GF, B) = (x)(1/2)$\n", "\n", "$P(GF, G) = (x)(1/2-x)$\n", "\n", "$P(GF, GF) = (x)(x)$\n", "\n", "Of those, if we select the families with two girls, here are their probabilities:\n", "\n", "$P(G, GF) = (1/2-x)(x)$\n", "\n", "$P(GF, G) = (x)(1/2-x)$\n", "\n", "$P(GF, GF) = (x)(x)$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get the conditional probability of two girls, given at least one girl named Florida, we can add up the last 3 probabilities and divide by the sum of the previous 5 probabilities.\n", "\n", "With a little algebra, we get:\n", "\n", "$P(\\mathrm{two~girls} ~|~ \\mathrm{at~least~one~girl~named~Florida}) = (1 - x) / (2 - x)$\n", "\n", "As $x$ approaches $0$ the answer approaches $1/2$.\n", "\n", "As $x$ approaches $1/2$, the answer approaches $1/3$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's what all of that looks like graphically:\n", "\n", "\n", "\n", "Here `B` a boy, `Gx` is a girl with some property `X`, and `G` is a girl who doesn't have that property. If we select all families with at least one `Gx`, we get the five blue squares (light and dark). Of those, the families with two girls are the three dark blue squares.\n", "\n", "If property `X` is common, the ratio of dark blue to all blue approaches `1/3`. If `X` is rare, the same ratio approaches `1/2`.\n", "\n", "In the \"Girl Named Florida\" problem, `x` is 1/100, and we can compute the result: " ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.49748743718592964" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = 1/100\n", "(1-x) / (2-x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which is what we got by counting all of the families in Statesville." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Controversy\n", "\n", "[I wrote about this problem in my blog in 2011](http://allendowney.blogspot.com/2011/11/girl-named-florida-solutions.html). As you can see in the comments, my explanation was not met with universal acclaim.\n", "\n", "One of the issues that came up is the challenge of stating the question unambiguously. In this article, I rephrased Mlodinow's statement to clarify it.\n", "\n", "But since we have come all this way, let me also answer a different version of the problem.\n", "\n", ">Suppose you choose a house in Statesville at random and ring the doorbell. A girl (who lives there) opens the door and you learn that her name is Florida. What is the probability that the other child in this house is a girl?\n", "\n", "In this version of the problem, the selection process is different. Instead of selecting houses with at least one girl named Florida, you selected a house, then selected a child, and learned that her name is Florida.\n", "\n", "Since the selection of the child was arbitrary, we can say without loss of generality that the child you met is the first child in the table.\n", "\n", "In that case, the conditional probability of two girls is:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conditional(gg, gf1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which is the same as the conditional probability, given that the first child is a girl:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conditional(gg, girl1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So in this version of the problem, the girl's name is irrelevant." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 1 }