{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Batch Training\n", "\n", "Running algorithms which require the full data set for each update\n", "can be expensive when the data is large. In order to scale inferences,\n", "we can do _batch training_. This trains the model using\n", "only a subsample of data at a time.\n", "\n", "In this tutorial, we extend the\n", "[supervised learning tutorial](http://edwardlib.org/tutorials/supervised-regression), \n", "where the task is to infer hidden structure from\n", "labeled examples $\\{(x_n, y_n)\\}$.\n", "A webpage version is available at\n", "http://edwardlib.org/tutorials/batch-training." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from __future__ import absolute_import\n", "from __future__ import division\n", "from __future__ import print_function\n", "\n", "import edward as ed\n", "import numpy as np\n", "import tensorflow as tf\n", "\n", "from edward.models import Normal" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "\n", "Simulate $N$ training examples and a fixed number of test examples.\n", "Each example is a pair of inputs $\\mathbf{x}_n\\in\\mathbb{R}^{10}$ and\n", "outputs $y_n\\in\\mathbb{R}$. They have a linear dependence with\n", "normally distributed noise.\n", "\n", "We also define a helper function to select the next batch of data\n", "points from the full set of examples. It keeps track of the current\n", "batch index and returns the next batch using the function \n", "``next()``. We will generate batches from `data` during inference." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def build_toy_dataset(N, w):\n", " D = len(w)\n", " x = np.random.normal(0.0, 2.0, size=(N, D))\n", " y = np.dot(x, w) + np.random.normal(0.0, 0.05, size=N)\n", " return x, y" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def generator(arrays, batch_size):\n", " \"\"\"Generate batches, one with respect to each array's first axis.\"\"\"\n", " starts = [0] * len(arrays) # pointers to where we are in iteration\n", " while True:\n", " batches = []\n", " for i, array in enumerate(arrays):\n", " start = starts[i]\n", " stop = start + batch_size\n", " diff = stop - array.shape[0]\n", " if diff <= 0:\n", " batch = array[start:stop]\n", " starts[i] += batch_size\n", " else:\n", " batch = np.concatenate((array[start:], array[:diff]))\n", " starts[i] = diff\n", " batches.append(batch)\n", " yield batches" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "ed.set_seed(42)\n", "\n", "N = 10000 # size of training data\n", "M = 128 # batch size during training\n", "D = 10 # number of features\n", "\n", "w_true = np.ones(D) * 5\n", "X_train, y_train = build_toy_dataset(N, w_true)\n", "X_test, y_test = build_toy_dataset(235, w_true)\n", "\n", "data = generator([X_train, y_train], M)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model\n", "\n", "Posit the model as Bayesian linear regression (Murphy, 2012).\n", "For a set of $N$ data points $(\\mathbf{X},\\mathbf{y})=\\{(\\mathbf{x}_n, y_n)\\}$,\n", "the model posits the following distributions:\n", "\n", "\\begin{align*}\n", " p(\\mathbf{w})\n", " &=\n", " \\text{Normal}(\\mathbf{w} \\mid \\mathbf{0}, \\sigma_w^2\\mathbf{I}),\n", " \\\\[1.5ex]\n", " p(b)\n", " &=\n", " 
\\text{Normal}(b \\mid 0, \\sigma_b^2),\n", " \\\\\n", " p(\\mathbf{y} \\mid \\mathbf{w}, b, \\mathbf{X})\n", " &=\n", " \\prod_{n=1}^N\n", " \\text{Normal}(y_n \\mid \\mathbf{x}_n^\\top\\mathbf{w} + b, \\sigma_y^2).\n", "\\end{align*}\n", "\n", "The latent variables are the linear model's weights $\\mathbf{w}$ and\n", "intercept $b$, also known as the bias.\n", "Assume $\\sigma_w^2,\\sigma_b^2$ are known prior variances and $\\sigma_y^2$ is a\n", "known likelihood variance. The mean of the likelihood is given by a\n", "linear transformation of the inputs $\\mathbf{x}_n$.\n", "\n", "Let's build the model in Edward, fixing $\\sigma_w,\\sigma_b,\\sigma_y=1$. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X = tf.placeholder(tf.float32, [None, D])\n", "y_ph = tf.placeholder(tf.float32, [None])\n", "\n", "w = Normal(loc=tf.zeros(D), scale=tf.ones(D))\n", "b = Normal(loc=tf.zeros(1), scale=tf.ones(1))\n", "y = Normal(loc=ed.dot(X, w) + b, scale=1.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we define a placeholder `X`. During inference, we pass in\n", "the value for this placeholder according to batches of data.\n", "To enable training with batches of varying size, \n", "we don't fix the number of rows for `X` and `y`. (Alternatively,\n", "we could fix it to be the batch size if training and testing \n", "with a fixed size.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inference\n", "\n", "We now turn to inferring the posterior using variational inference.\n", "Define the variational model to be a fully factorized normal across\n", "the weights." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "qw = Normal(loc=tf.Variable(tf.random_normal([D])),\n", " scale=tf.nn.softplus(tf.Variable(tf.random_normal([D]))))\n", "qb = Normal(loc=tf.Variable(tf.random_normal([1])),\n", " scale=tf.nn.softplus(tf.Variable(tf.random_normal([1]))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run variational inference with the Kullback-Leibler divergence.\n", "We use $5$ latent variable samples for computing\n", "black box stochastic gradients in the algorithm.\n", "(For more details, see the\n", "[$\\text{KL}(q\\|p)$ tutorial](http://edwardlib.org/tutorials/klqp).)\n", "\n", "For batch training, we will iterate over the number of batches and\n", "feed them to the respective placeholder. We set the number of\n", "iterations to be equal to the number of batches times the number of\n", "epochs (full passes over the data set)." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "390/390 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 10365.830\n" ] } ], "source": [ "n_batch = int(N / M)\n", "n_epoch = 5\n", "\n", "inference = ed.KLqp({w: qw, b: qb}, data={y: y_ph})\n", "inference.initialize(n_iter=n_batch * n_epoch, n_samples=5, scale={y: N / M})\n", "tf.global_variables_initializer().run()\n", "\n", "for _ in range(inference.n_iter):\n", " X_batch, y_batch = next(data)\n", " info_dict = inference.update({X: X_batch, y_ph: y_batch})\n", " inference.print_progress(info_dict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When initializing inference, note we scale $y$ by $N/M$, so it is as if the\n", "algorithm had seen $N/M$ as many data points per iteration.\n", "Algorithmically, this will scale all computation regarding $y$ by\n", "$N/M$ such as scaling the log-likelihood in a variational method's\n", "objective. (Statistically, this avoids inference being dominated by the prior.)\n", "\n", "The loop construction makes training very flexible. For example, we\n", "can also try running many updates for each batch." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "770/780 [ 98%] █████████████████████████████ ETA: 0s | Loss: 9589.711" ] } ], "source": [ "n_batch = int(N / M)\n", "n_epoch = 1\n", "\n", "inference = ed.KLqp({w: qw, b: qb}, data={y: y_ph})\n", "inference.initialize(\n", " n_iter=n_batch * n_epoch * 10, n_samples=5, scale={y: N / M})\n", "tf.global_variables_initializer().run()\n", "\n", "for _ in range(inference.n_iter // 10):\n", " X_batch, y_batch = next(data)\n", " for _ in range(10):\n", " info_dict = inference.update({X: X_batch, y_ph: y_batch})\n", "\n", " inference.print_progress(info_dict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, make sure that the total number of training iterations is \n", "specified correctly when initializing `inference`. Otherwise an incorrect\n", "number of training iterations can have unintended consequences; for example,\n", "`ed.KLqp` uses an internal counter to appropriately decay its optimizer's \n", "learning rate step size.\n", "\n", "Note also that the reported `loss` value as we run the\n", "algorithm corresponds to the computed objective given the current\n", "batch and not the total data set. We can instead have it report\n", "the loss over the total data set by summing `info_dict['loss']`\n", "for each epoch." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Criticism\n", "\n", "A standard evaluation for regression is to compare prediction accuracy on\n", "held-out \"testing\" data. We do this by first forming the posterior predictive\n", "distribution." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "y_post = ed.copy(y, {w: qw, b: qb})\n", "# This is equivalent to\n", "# y_post = Normal(loc=ed.dot(X, qw) + qb, scale=tf.ones(N))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this we can evaluate various quantities using predictions from\n", "the model (posterior predictive)." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean squared error on test data:\n", "0.00525331\n", "Mean absolute error on test data:\n", "0.0523023\n" ] } ], "source": [ "print(\"Mean squared error on test data:\")\n", "print(ed.evaluate('mean_squared_error', data={X: X_test, y_post: y_test}))\n", "\n", "print(\"Mean absolute error on test data:\")\n", "print(ed.evaluate('mean_absolute_error', data={X: X_test, y_post: y_test}))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Footnotes\n", "\n", "Only certain algorithms support batch training such as\n", "`MAP`, `KLqp`, and `SGLD`. Also, above we\n", "illustrated batch training for models with only global latent variables,\n", "which are variables are shared across all data points.\n", "For more complex strategies, see the\n", "[inference data subsampling API](http://edwardlib.org/api/inference-data-subsampling)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 1 }