{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Probability Mass Functions" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "This notebook is part of [Bite Size Bayes](https://allendowney.github.io/BiteSizeBayes/), an introduction to probability and Bayesian statistics using Python.\n", "\n", "Copyright 2020 Allen B. Downey\n", "\n", "License: [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "The following cell downloads `utils.py`, which contains some utility function we'll need." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [] }, "outputs": [], "source": [ "from os.path import basename, exists\n", "\n", "def download(url):\n", " filename = basename(url)\n", " if not exists(filename):\n", " from urllib.request import urlretrieve\n", " local, _ = urlretrieve(url, filename)\n", " print('Downloaded ' + local)\n", "\n", "download('https://github.com/AllenDowney/BiteSizeBayes/raw/master/utils.py')" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "If everything we need is installed, the following cell should run with no error messages." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Review\n", "\n", "[In the previous notebook](https://colab.research.google.com/github/AllenDowney/BiteSizeBayes/blob/master/05_test.ipynb) we used a Bayes table to interpret medical tests.\n", "\n", "In this notebook we'll solve an expanded version of the cookie problem with 101 Bowls. It might seem like a silly problem, but it's not: the solution demonstrates a Bayesian way to estimate a proportion, and it applies to lots of real problems that don't involve cookies.\n", "\n", "Then I'll introduce an alternative to the Bayes table, a probability mass function (PMF), which is a useful way to represent and do computations with distributions.\n", "\n", "Here's the function, from the previous notebook, we'll use to make Bayes tables:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "def make_bayes_table(hypos, prior, likelihood):\n", " \"\"\"Make a Bayes table.\n", " \n", " hypos: sequence of hypotheses\n", " prior: prior probabilities\n", " likelihood: sequence of likelihoods\n", " \n", " returns: DataFrame\n", " \"\"\"\n", " table = pd.DataFrame(index=hypos)\n", " table['prior'] = prior\n", " table['likelihood'] = likelihood\n", " table['unnorm'] = table['prior'] * table['likelihood']\n", " prob_data = table['unnorm'].sum()\n", " table['posterior'] = table['unnorm'] / prob_data\n", " return table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 101 Bowls\n", "\n", "In [Notebook 4](https://colab.research.google.com/github/AllenDowney/BiteSizeBayes/blob/master/05_dice.ipynb), we saw that the Bayes table works with more than two hypotheses. As an example, we solved a cookie problem with five bowls.\n", "\n", "Now we'll take it even farther and solve a cookie problem with 101 bowls:\n", "\n", "* Bowl 0 contains no vanilla cookies,\n", "\n", "* Bowl 1 contains 1% vanilla cookies,\n", "\n", "* Bowl 2 contains 2% vanilla cookies,\n", "\n", "and so on, up to\n", "\n", "* Bowl 99 contains 99% vanilla cookies, and\n", "\n", "* Bowl 100 contains all vanilla cookies.\n", "\n", "As in the previous problems, there are only two kinds of cookies, vanilla and chocolate. So Bowl 0 is all chocolate cookies, Bowl 1 is 99% chocolate, and so on.\n", "\n", "Suppose we choose a bowl at random, choose a cookie at random, and it turns out to be vanilla. What is the probability that the cookie came from Bowl $x$, for each value of $x$?\n", "\n", "To solve this problem, I'll use `np.arange` to represent 101 hypotheses, numbered from 0 to 100." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "xs = np.arange(101)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The prior probability for each bowl is $1/101$. I could create a sequence with 101 identical values, but if all of the priors are equal, we only have to probide one value:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "prior = 1/101" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because of the way I numbered the bowls, the probability of a vanilla cookie from Bowl $x$ is $x/100$. So we can compute the likelihoods like this:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "likelihood = xs/100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And that's all we need; the Bayes table does the rest:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "table = make_bayes_table(xs, prior, likelihood)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a feature we have not seen before: we can give the index of the Bayes table a name, which will appear when we display the table." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "table.index.name = 'Bowl'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the first few rows:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "table.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because Bowl 0 contains no vanilla cookies, its likelihood is 0, so its posterior probability is 0. That is, the cookie cannot have come from Bowl 0.\n", "\n", "Here are the last few rows of the table." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "table.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The posterior probabilities are substantially higher for the high-numbered bowls.\n", "\n", "There is a pattern here that will be clearer if we plot the results." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "def plot_table(table):\n", " \"\"\"Plot results from the 101 Bowls problem.\n", " \n", " table: DataFrame representing a Bayes table\n", " \"\"\"\n", " table['prior'].plot()\n", " table['posterior'].plot()\n", "\n", " plt.xlabel('Bowl #')\n", " plt.ylabel('Probability')\n", " plt.legend()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "plot_table(table)\n", "plt.title('One cookie');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The prior probabilities are uniform; that is, they are the same for every bowl.\n", "\n", "The posterior probabilities increase linearly; Bowl 0 is the least likely (actually impossible), and Bowl 100 is the most likely." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Two cookies\n", "\n", "Suppose we put the first cookie back, stir the bowl thoroughly, and draw another cookie from the same bowl. and suppose it turns out to be another vanilla cookie.\n", "\n", "Now what is the probability that we are drawing from Bowl $x$?\n", "\n", "To answer this question, we can use the posterior probabilities from the previous problem as prior probabilities for a new Bayes table, and then update with the new data." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "prior2 = table['posterior']\n", "likelihood2 = likelihood\n", "\n", "table2 = make_bayes_table(xs, prior2, likelihood2)\n", "plot_table(table2)\n", "plt.title('Two cookies');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The blue line shows the posterior after one cookie, which is the prior before the second cookie.\n", "\n", "The orange line shows the posterior after two cookies, which curves upward. Having see two vanilla cookies, the high-numbered bowls are more likely; the low-numbered bowls are less likely.\n", "\n", "I bet you can guess what's coming next." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Three cookies\n", "\n", "Suppose we put the cookie back, stir, draw another cookie from the same bowl, and get a chocolate cookie.\n", "\n", "What do you think the posterior distribution looks like after these three cookies?\n", "\n", "Hint: what's the probability that the chocolate cookie came from Bowl 100?\n", "\n", "We'll use the posterior after two cookies as the prior for the third cookie:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "prior3 = table2['posterior']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, what about the likelihoods? Remember that the probability of a vanilla cookie from Bowl $x$ is $x/100$. So the probability of a chocolate cookie is $(1 - x/100)$, which we can compute like this." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "likelihood3 = 1 - xs/100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's it. Everything else is the same." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "table3 = make_bayes_table(xs, prior3, likelihood3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here are the results" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "plot_table(table3)\n", "plt.title('Three cookies');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The blue line is the posterior after two cookies; the orange line is the posterior after three cookies.\n", "\n", "Because Bowl 100 contains no chocolate cookies, the posterior probability for Bowl 100 is 0.\n", "\n", "The posterior distribution has a peak near 60%. We can use `idxmax` to find it: " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "table3['posterior'].idxmax()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The peak in the posterior distribution is at 67%.\n", "\n", "This value has a name; it is the **MAP**, which stands for \"Maximum Aposteori Probability\" (\"aposteori\" is Latin for posterior).\n", "\n", "In this example, the MAP is close to the proportion of vanilla cookies in the dataset: 2/3." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** Let's do a version of the dice problem where we roll the die more than once. Here's the statement of the problem again:\n", "\n", "> Suppose you have a 4-sided, 6-sided, 8-sided, and 12-sided die. You choose one at random, roll it and get a 1. What is the probability that the die you rolled is 4-sided? What are the posterior probabilities for the other dice?\n", "\n", "And here's a solution using a Bayes table:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "hypos = ['H4', 'H6', 'H8', 'H12']\n", "prior = 1/4\n", "likelihood = 1/4, 1/6, 1/8, 1/12\n", "\n", "table = make_bayes_table(hypos, prior, likelihood)\n", "table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now suppose you roll the same die again and get a 6. What are the posterior probabilities after the second roll?\n", "\n", "Use `idxmax` to find the MAP." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Probability Mass Functions\n", "\n", "When we do more than one update, we don't always want to keep the whole Bayes table. In this section we'll replace the Bayes table with a more compact representation, a probability mass function, or PMF.\n", "\n", "A PMF is a set of possible outcomes and their corresponding probabilities. There are many ways to represent a PMF; in this notebook I'll use a Pandas Series.\n", "\n", "Here's a function that takes a sequence of outcomes, `xs`, and a sequence of probabilities, `ps`, and returns a Pandas Series that represents a PMF." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "def make_pmf(xs, ps, **options):\n", " \"\"\"Make a Series that represents a PMF.\n", " \n", " xs: sequence of values\n", " ps: sequence of probabilities\n", " options: keyword arguments passed to Series constructor\n", " \n", " returns: Pandas Series\n", " \"\"\"\n", " pmf = pd.Series(ps, index=xs, **options)\n", " return pmf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here's a PMF that represents the prior from the 101 Bowls problem." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "xs = np.arange(101)\n", "prior = 1/101\n", "\n", "pmf = make_pmf(xs, prior)\n", "pmf.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a priod, we need to compute likelihoods.\n", "\n", "Here are the likelihoods for a vanilla cookie:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "likelihood_vanilla = xs / 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And for a chocolate cookie." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "likelihood_chocolate = 1 - xs / 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To compute posterior probabilities, I'll use the following function, which takes a PMF and a sequence of likelihoods, and updates the PMF:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "def bayes_update(pmf, likelihood):\n", " \"\"\"Do a Bayesian update.\n", " \n", " pmf: Series that represents the prior\n", " likelihood: sequence of likelihoods\n", " \"\"\"\n", " pmf *= likelihood\n", " pmf /= pmf.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The steps here are the same as in the Bayes table:\n", "\n", "1. Multiply the prior by the likelihoods.\n", "\n", "2. Add up the products to get the total probability of the data.\n", "\n", "3. Divide through to normalize the posteriors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can do the update for a vanilla cookie." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "scrolled": true }, "outputs": [], "source": [ "bayes_update(pmf, likelihood_vanilla)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's what the PMF looks like after the update." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "pmf.plot()\n", "\n", "plt.xlabel('Bowl #')\n", "plt.ylabel('Probability')\n", "plt.title('One cookie');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's consistent with what we got with the Bayes table.\n", "\n", "The advantage of using a PMF is that it is easier to do multiple updates. The following cell starts again with the uniform prior and does updates with two vanilla cookies and one chocolate cookie:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "data = 'VVC'\n", "\n", "pmf = make_pmf(xs, prior)\n", "\n", "for cookie in data:\n", " if cookie == 'V':\n", " bayes_update(pmf, likelihood_vanilla)\n", " else:\n", " bayes_update(pmf, likelihood_chocolate)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's what the results look like:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "pmf.plot()\n", "\n", "plt.xlabel('Bowl #')\n", "plt.ylabel('Probability')\n", "plt.title('Three cookies');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, that's consistent with what we got with the Bayes table.\n", "\n", "In the next section, I'll use a PMF and `bayes_update` to solve a dice problem." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The dice problem\n", "\n", "As an exercise, let's do one more version of the dice problem:\n", "\n", "> Suppose you have a 4-sided, 6-sided, 8-sided, 12-sided, and a **20-sided die**. You choose one at random, roll it and **get a 7**. What is the probability that the die you rolled is 4-sided? What are the posterior probabilities for the other dice?\n", "\n", "Notice that in this version I've added a 20-sided die and the outcome is 7, not 1.\n", "\n", "Here's a PMF that represents the prior:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "sides = np.array([4, 6, 8, 12, 20])\n", "prior = 1/5\n", "\n", "pmf = make_pmf(sides, prior)\n", "pmf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this version, the hypotheses are integers rather than strings, so we can compute the likelihoods like this:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "likelihood = 1 / sides" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But the outcome is 7, so any die with fewer than 7 sides has likelihood 0.\n", "\n", "We can adjust `likelihood` by making a Boolean Series:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "too_low = (sides < 7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And using it to set the corresponding elements of `likelihood` to 0." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "likelihood[too_low] = 0\n", "likelihood" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can do the update and display the results." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "bayes_update(pmf, likelihood)\n", "pmf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The 4-sided and 6-sided dice have been eliminated. Of the remaining dice, the 8-sided die is the most likely." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** Suppose you have the same set of 5 die. You choose a die, roll it six times, and get 6, 7, 2, 5, 1, and 2 again. Use `idxmax` to find the MAP. What is the posterior probability of the MAP?" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "In this notebook, we extended the cookie problem with more bowls and the dice problem with more dice.\n", "\n", "I defined the MAP, which is the quantity in a posterior distribution with the highest probability.\n", "\n", "Although the cookie problem is not particularly realistic or useful, the method we used to solve it applies to many problems in the real world where we want to estimate a proportion.\n", "\n", "[In the next notebook](https://colab.research.google.com/github/AllenDowney/BiteSizeBayes/blob/master/07_euro.ipynb) we'll use the same method on a related problem, and take another step toward doing Bayesian statistics." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 2 }