{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#The best of both worlds: Hierarchical Linear Regression in PyMC3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Authors: Danne Elbers, Thomas Wiecki\n", "\n", "Today's blog post is co-written by [Danne Elbers](http://www.linkedin.com/pub/danne-elbers/69/3a2/7ba) who is doing her masters thesis with me on computational psychiatry using Bayesian modeling. This post also borrows heavily from a [Notebook](http://nbviewer.ipython.org/github/fonnesbeck/multilevel_modeling/blob/master/multilevel_modeling.ipynb?create=1) by [Chris Fonnesbeck](http://biostat.mc.vanderbilt.edu/wiki/Main/ChrisFonnesbeck).\n", "\n", "The power of Bayesian modelling really clicked for me when I was first introduced to hierarchical modelling. In this blog post we will:\n", " \n", " * provide and intuitive explanation of hierarchical/multi-level Bayesian modeling;\n", " * show how this type of model can easily be built and estimated in [PyMC3](https://github.com/pymc-devs/pymc);\n", " * demonstrate the advantage of using hierarchical Bayesian modelling as opposed to non-hierarchical Bayesian modelling by comparing the two;\n", " * visualize the \"shrinkage effect\" (explained below); and\n", " * highlight connections to the frequentist version of this model.\n", "\n", "Having multiple sets of related measurements comes up all the time. In mathematical psychology, for example, you test multiple subjects on the same task. We then want to estimate a computational/mathematical model that describes the behavior on the task by a set of parameters. We could thus fit a model to each subject individually, assuming they share no similarities; or, pool all the data and estimate one model assuming all subjects are identical. Hierarchical modeling allows the best of both worlds by modeling subjects' similarities but also allowing estimiation of individual parameters. As an aside, software from our lab, [HDDM](http://ski.cog.brown.edu/hddm_docs/), allows hierarchical Bayesian estimation of a widely used decision making model in psychology. In this blog post, however, we will use a more classical example of [hierarchical linear regression](http://en.wikipedia.org/wiki/Hierarchical_linear_modeling) to predict radon levels in houses.\n", "\n", "This is the 3rd blog post on the topic of Bayesian modeling in PyMC3, see here for the previous two:\n", "\n", " * [The Inference Button: Bayesian GLMs made easy with PyMC3](http://twiecki.github.io/blog/2013/08/12/bayesian-glms-1/)\n", " * [This world is far from Normal(ly distributed): Bayesian Robust Regression in PyMC3](http://twiecki.github.io/blog/2013/08/27/bayesian-glms-2/) \n", "\n", "## The data set\n", "Gelman et al.'s (2007) radon dataset is a classic for hierarchical modeling. In this dataset the amount of the radioactive gas radon has been measured among different households in all counties of several states. Radon gas is known to be the highest cause of lung cancer in non-smokers. It is believed to be more strongly present in households containing a basement and to differ in amount present among types of soil.\n", "Here we'll investigate this differences and try to make predictions of radonlevels in different counties based on the county itself and the presence of a basement. In this example we'll look at Minnesota, a state that contains 85 counties in which different measurements are taken, ranging from 2 to 116 measurements per county. \n", "\n", "![radon](http://www.fix-your-radon.com/images/how_radon_enters.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we'll load the data: " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pymc3 as pm \n", "import pandas as pd\n", "\n", "data = pd.read_csv('data/radon.csv')\n", "\n", "county_names = data.county.unique()\n", "county_idx = data['county_code'].values\n", "n_counties = len(data.county.unique())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The relevant part of the data we will model looks as follows:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", " | county | \n", "log_radon | \n", "floor | \n", "
---|---|---|---|
0 | \n", "AITKIN | \n", "0.832909 | \n", "1 | \n", "
1 | \n", "AITKIN | \n", "0.832909 | \n", "0 | \n", "
2 | \n", "AITKIN | \n", "1.098612 | \n", "0 | \n", "
3 | \n", "AITKIN | \n", "0.095310 | \n", "0 | \n", "
4 | \n", "ANOKA | \n", "1.163151 | \n", "0 | \n", "