{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dependent density regression\n", "Author: [Austin Rochford](https://github.com/AustinRochford/)\n", "\n", "In another [example](dp_mix.ipynb), we showed how to use Dirichlet processes to perform Bayesian nonparametric density estimation. This example expands on the previous one, illustrating dependent density regression.\n", "\n", "Just as Dirichlet process mixtures can be thought of as infinite mixture models that select the number of active components as part of inference, dependent density regression can be thought of as infinite [mixtures of experts](https://en.wikipedia.org/wiki/Committee_machine) that select the active experts as part of inference. The flexibility and modularity of these models make them powerful tools for nonparametric Bayesian data analysis." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "from IPython.display import HTML" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from matplotlib import animation as ani, pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import pymc3 as pm\n", "import seaborn as sns\n", "from theano import shared, tensor as tt" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "plt.rc('animation', writer='avconv')\n", "blue, *_ = sns.color_palette()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "SEED = 972915 # from random.org; for reproducibility\n", "np.random.seed(SEED)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use the LIDAR data set from Larry Wasserman's excellent book, [_All of Nonparametric Statistics_](http://www.stat.cmu.edu/~larry/all-of-nonpar/).
We standardize the data set so that the sampler converges more quickly." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "DATA_URI = 'http://www.stat.cmu.edu/~larry/all-of-nonpar/=data/lidar.dat'\n", "\n", "def standardize(x):\n", "    return (x - x.mean()) / x.std()\n", "\n", "# Split on runs of whitespace; the pattern ' *' can match the empty string,\n", "# which triggers a FutureWarning on Python 3.6 and splits incorrectly on 3.7+\n", "df = (pd.read_csv(DATA_URI, sep=r'\\s+', engine='python')\n", "        .assign(std_range=lambda df: standardize(df.range),\n", "                std_logratio=lambda df: standardize(df.logratio)))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | range | logratio | std_logratio | std_range |
|---|---|---|---|---|
| 0 | 390 | -0.050356 | 0.852467 | -1.717725 |
| 1 | 391 | -0.060097 | 0.817981 | -1.707299 |
| 2 | 393 | -0.041901 | 0.882398 | -1.686447 |
| 3 | 394 | -0.050985 | 0.850240 | -1.676020 |
| 4 | 396 | -0.059913 | 0.818631 | -1.655168 |