{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bayesian Zig Zag\n", "\n", "Developing probabilistic models using grid methods and MCMC.\n", "\n", "Thanks to Chris Fonnesback for his help with this notebook, and to Colin Carroll, who added features to pymc3 to support some of these examples.\n", "\n", "Copyright 2018 Allen Downey\n", "\n", "MIT License: https://opensource.org/licenses/MIT" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# If we're running on Colab, install empiricaldist\n", "# https://pypi.org/project/empiricaldist/\n", "\n", "import sys\n", "IN_COLAB = 'google.colab' in sys.modules\n", "\n", "if IN_COLAB:\n", " !pip install empiricaldist\n", " !pip install pymc3>=3.8" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "import numpy as np\n", "import pymc3 as pm\n", "\n", "import warnings\n", "warnings.filterwarnings(\"ignore\", category=UserWarning)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simulating hockey\n", "\n", "I'll model hockey as a Poisson process, where each team has some long-term average scoring rate, $\\lambda$, in goals per game.\n", "\n", "For the first example, I'll assume that $\\lambda$ is somehow known to be 2.4. Since regulation play (as opposed to overtime) is 60 minutes, we can compute the goal scoring rate per minute." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.04" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lam_per_game = 2.4\n", "min_per_game = 60\n", "lam_per_min = lam_per_game / min_per_game\n", "lam_per_min" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we assume that a goal is equally likely during any minute of the game, and we ignore the possibility of scoring more than one goal in the same minute, we can simulate a game by generating one random value each minute." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def simulate_game(p, n=60):\n", " goals = np.random.choice([0, 1], n, p=[1-p, p])\n", " return np.sum(goals)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And simulate 10 games." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "size = 10\n", "sample = [simulate_game(lam_per_min) for i in range(size)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we simulate 1000 games, we can see what the distribution looks like. The average of this sample should be close to `lam_per_game`." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2.35, 2.4)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "size = 1000\n", "sample_sim = [simulate_game(lam_per_min) for i in range(size)]\n", "np.mean(sample_sim), lam_per_game" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## PMFs\n", "\n", "To visualize distributions, I'll start with a probability mass function (PMF).\n", "\n", "I'll use the `Pmf` object from `empiricaldist`, which is a subtype of a Pandas `Series`.\n", "\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | probs | \n", "
---|---|
0 | \n", "0.116 | \n", "
1 | \n", "0.218 | \n", "
2 | \n", "0.228 | \n", "
3 | \n", "0.207 | \n", "
4 | \n", "0.142 | \n", "
5 | \n", "0.055 | \n", "
6 | \n", "0.029 | \n", "
7 | \n", "0.003 | \n", "
8 | \n", "0.001 | \n", "
9 | \n", "0.001 | \n", "