{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bite Size Bayes\n", "\n", "Copyright 2020 Allen B. Downey\n", "\n", "License: [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Elvis problem\n", "\n", "Here's a problem from [*Bayesian Data Analysis*](http://www.stat.columbia.edu/~gelman/book/):\n", "\n", "> Elvis Presley had a twin brother (who died at birth). What is the probability that Elvis was an identical twin?\n", "\n", "I will answer this question in three steps:\n", "\n", "1. First, we need some background information about the relative frequencies of identical and fraternal twins.\n", "\n", "2. Then we will use Bayes's Theorem to take into account one piece of data, which is that Elvis's twin was male.\n", "\n", "3. Finally, living up to the name of this blog, I will overthink the problem by taking into account a second piece of data, which is that Elvis's twin died at birth.\n", "\n", "For background information, I'll use data from 1935, the year Elvis was born, from the\n", "U.S. Census Bureau, [Birth, Stillbirth, and Infant Mortality Statistics for the Continental United States, the Territory of Hawaii, the Virgin Islands 1935](https://www.cdc.gov/nchs/data/vsushistorical/birthstat_1935.pdf).\n", "\n", "It includes this table:\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The table doesn't report which twins are identical or fraternal, but we can use the data to compute\n", "\n", "* $x$: The fraction of twins that are opposite sex." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.3332539588046196" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "opposite = 8397\n", "same = 8678 + 8122\n", "\n", "x = opposite / (opposite + same)\n", "x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The quantity we want is\n", "\n", "* $f$: The fraction of twins who are fraternal.\n", "\n", "So let's see how we can get from $x$ to $f$.\n", "\n", "Because identical twins have the same genes, they are almost always the same sex. Fraternal twins do not have the same genes; like other siblings, they are about equally likely to be the same or opposite sex.\n", "\n", "So we can write this relationship between $x$ and $f$\n", "\n", "$x = f/2 + 0$\n", "\n", "which says that the opposite sex twins include half of the fraternal twins and none of the identical twins.\n", "\n", "And that implies\n", "\n", "$f = 2x$" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6665079176092392" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f = 2*x\n", "f" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In 1935, about 2/3 of twins were fraternal, and 1/3 were identical.\n", "\n", "Getting back to the Elvis problem, we can use $1-f$ and $f$ as prior probabilities for the two hypotheses, `identical` and `fraternal`:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "index = ['identical', 'fraternal']\n", "prior = 1-f, f" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can take into account the data:\n", "\n", "* $D$: Elvis's twin was male.\n", "\n", "The probability of $D$ is nearly 100% if they were identical twins and about 50% if they were fraternal." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "likelihood = 1, 0.5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a function that takes the hypotheses, the priors, and the likelihoods and puts them in a Bayes table:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "def make_bayes_table(index, prior, likelihood):\n", " \"\"\"Make a Bayes table.\n", " \n", " index: sequence of hypotheses\n", " prior: prior probabilities\n", " likelihood: sequence of likelihoods\n", " \n", " returns: DataFrame\n", " \"\"\"\n", " table = pd.DataFrame(index=index)\n", " table['prior'] = prior\n", " table['likelihood'] = likelihood\n", " table['unnorm'] = table['prior'] * table['likelihood']\n", " prob_data = table['unnorm'].sum()\n", " table['posterior'] = table['unnorm'] / prob_data\n", " return table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here are the results" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
priorlikelihoodunnormposterior
identical0.3334921.00.3334920.500179
fraternal0.6665080.50.3332540.499821
\n", "
" ], "text/plain": [ " prior likelihood unnorm posterior\n", "identical 0.333492 1.0 0.333492 0.500179\n", "fraternal 0.666508 0.5 0.333254 0.499821" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table = make_bayes_table(index, prior, likelihood)\n", "table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the Bayes table I'll extract $p_i$, which is the probability that same sex twins are identical:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5001785714285714" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p_i = table['posterior']['identical']\n", "p_i" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With priors based on data from 1935, the posterior probability that Elvis was a twin is close to 50%.\n", "\n", "But there is one more piece of data to take into account; the fact that Elvis's twin died at birth.\n", "\n", "Let's assume that there are different risks for fraternal and identical twins. The quantities we want are\n", "\n", "* $r_f$: The probability that one twin is stillborn, given that they are fraternal.\n", "\n", "* $r_i$: The probability that one twin is stillborn, given that they are identical.\n", "\n", "We can't get those quantities directly from the table, but we can compute:\n", "\n", "* $y$: the probability of \"1 living\", given that the twins are opposite sex\n", "\n", "* $z$: the probability of \"1 living\", given that the twins are the same sex" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.06633321424318209" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y = (258 + 299) / opposite\n", "y" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.07255952380952381" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "z = (655 + 564) / same\n", "z" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assuming that all opposite sex twins are fraternal, we can infer that the risk for fraternal twins is $y$, the risk for opposite sex twins:\n", "\n", "$r_f = y$" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.06633321424318209" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r_f = y\n", "r_f" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And because we know the fraction of same sex twins who are identical, $p_i$, we can write the following relation\n", "\n", "$z = p_i r_i + (1-p_i) r_f$\n", "\n", "which says that the risk for same sex twins is the weighted sum of the risks for identical and fraternal twins, with the weight $p_i$.\n", "\n", "Solving for $r_i$, we get\n", "\n", "$r_i = 2z - r_f$\n", "\n", "And we have already computed $z$ and $r_f$" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.07878583337586553" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r_i = 2*z - r_f\n", "r_i" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this dataset, it looks like the probability of \"1 alive\" is a little higher for identical twins.\n", "\n", "So we can do a second update to take into account the data that Elvis's twin died at birth. The posterior probabilities from the first update become the priors for the second." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "prior2 = table['posterior']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the likelihoods:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "likelihood2 = r_i, r_f" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here are the results." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
priorlikelihoodunnormposterior
identical0.5001790.0787860.0394070.543082
fraternal0.4998210.0663330.0331550.456918
\n", "
" ], "text/plain": [ " prior likelihood unnorm posterior\n", "identical 0.500179 0.078786 0.039407 0.543082\n", "fraternal 0.499821 0.066333 0.033155 0.456918" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table2 = make_bayes_table(index, prior2, likelihood2)\n", "table2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the new data, the posterior probability that Elvis was an identical twin is about 54%." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Credit:** Thanks to Jonah Spicher, who took my Bayesian Stats class at Olin and came up with the idea to use data from 1935 and take into account the fact that Elvis's twin died at birth." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }