{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Logistic Regression in Theano\n",
    "\n",
    "Credits: Forked from [summerschool2015](https://github.com/mila-udem/summerschool2015) by mila-udem\n",
    "\n",
    "This notebook is inspired from [the tutorial on logistic regression](http://deeplearning.net/tutorial/logreg.html) on [deeplearning.net](http://deeplearning.net).\n",
    "\n",
    "In this notebook, we show how Theano can be used to implement the most basic classifier: the **logistic regression**. We start off with a quick primer of the model, which serves both as a refresher but also to anchor the notation and show how mathematical expressions are mapped onto Theano graphs.\n",
    "\n",
    "In the deepest of machine learning traditions, this tutorial will tackle the exciting problem of MNIST digit classification."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get the data\n",
    "\n",
    "In the mean time, let's just download a pre-packaged version of MNIST, and load each split of the dataset as NumPy ndarrays."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import requests\n",
    "import gzip\n",
    "import six\n",
    "from six.moves import cPickle\n",
    "\n",
    "if not os.path.exists('mnist.pkl.gz'):\n",
    "    r = requests.get('http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz')\n",
    "    with open('mnist.pkl.gz', 'wb') as data_file:\n",
    "        data_file.write(r.content)\n",
    "\n",
    "with gzip.open('mnist.pkl.gz', 'rb') as data_file:\n",
    "    if six.PY3:\n",
    "        train_set, valid_set, test_set = cPickle.load(data_file, encoding='latin1')\n",
    "    else:\n",
    "        train_set, valid_set, test_set = cPickle.load(data_file)\n",
    "\n",
    "train_set_x, train_set_y = train_set\n",
    "valid_set_x, valid_set_y = valid_set\n",
    "test_set_x, test_set_y = test_set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The model\n",
    "Logistic regression is a probabilistic, linear classifier. It is parametrized\n",
    "by a weight matrix $W$ and a bias vector $b$. Classification is\n",
    "done by projecting an input vector onto a set of hyperplanes, each of which\n",
    "corresponds to a class. The distance from the input to a hyperplane reflects\n",
    "the probability that the input is a member of the corresponding class.\n",
    "\n",
    "Mathematically, the probability that an input vector $x$ is a member of a\n",
    "class $i$, a value of a stochastic variable $Y$, can be written as:\n",
    "\n",
    "$$P(Y=i|x, W,b) = softmax_i(W x + b) = \\frac {e^{W_i x + b_i}} {\\sum_j e^{W_j x + b_j}}$$\n",
    "\n",
    "The model's prediction $y_{pred}$ is the class whose probability is maximal, specifically:\n",
    "\n",
    "$$  y_{pred} = {\\rm argmax}_i P(Y=i|x,W,b)$$\n",
    "\n",
    "Now, let us define our input variables. First, we need to define the dimension of our tensors:\n",
    "- `n_in` is the length of each training vector,\n",
    "- `n_out` is the number of classes.\n",
    "\n",
    "Our variables will be:\n",
    "- `x` is a matrix, where each row contains a different example of the dataset. Its shape is `(batch_size, n_in)`, but `batch_size` does not have to be specified in advance, and can change during training.\n",
    "- `W` is a shared matrix, of shape `(n_in, n_out)`, initialized with zeros. Column `k` of `W` represents the separation hyperplane for class `k`.\n",
    "- `b` is a shared vector, of length `n_out`, initialized with zeros. Element `k` of `b` represents the free parameter of hyperplane `k`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy\n",
    "import theano\n",
    "from theano import tensor\n",
    "\n",
    "# Size of the data\n",
    "n_in = 28 * 28\n",
    "# Number of classes\n",
    "n_out = 10\n",
    "\n",
    "x = tensor.matrix('x')\n",
    "W = theano.shared(value=numpy.zeros((n_in, n_out), dtype=theano.config.floatX),\n",
    "                  name='W',\n",
    "                  borrow=True)\n",
    "b = theano.shared(value=numpy.zeros((n_out,), dtype=theano.config.floatX),\n",
    "                  name='b',\n",
    "                  borrow=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we can build a symbolic expression for the matrix of class-membership probability (`p_y_given_x`), and for the class whose probability is maximal (`y_pred`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "p_y_given_x = tensor.nnet.softmax(tensor.dot(x, W) + b)\n",
    "y_pred = tensor.argmax(p_y_given_x, axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining a loss function\n",
    "Learning optimal model parameters involves minimizing a loss function. In the\n",
    "case of multi-class logistic regression, it is very common to use the negative\n",
    "log-likelihood as the loss. This is equivalent to maximizing the likelihood of the\n",
    "data set $\\cal{D}$ under the model parameterized by $\\theta$. Let\n",
    "us first start by defining the likelihood $\\cal{L}$ and loss\n",
    "$\\ell$:\n",
    "\n",
    "$$\\mathcal{L} (\\theta=\\{W,b\\}, \\mathcal{D}) =\n",
    "     \\sum_{i=0}^{|\\mathcal{D}|} \\log(P(Y=y^{(i)}|x^{(i)}, W,b)) \\\\\n",
    "   \\ell (\\theta=\\{W,b\\}, \\mathcal{D}) = - \\mathcal{L} (\\theta=\\{W,b\\}, \\mathcal{D})\n",
    "$$\n",
    "\n",
    "Again, we will express those expressions using Theano. We have one additional input, the actual target class `y`:\n",
    "- `y` is an input vector of integers, of length `batch_size` (which will have to match the length of `x` at runtime). The length of `y` can be symbolically expressed by `y.shape[0]`.\n",
    "- `log_prob` is a `(batch_size, n_out)` matrix containing the log probabilities of class membership for each example.\n",
    "- `arange(y.shape[0])` is a symbolic vector which will contain `[0,1,2,... batch_size-1]`\n",
    "- `log_likelihood` is a vector containing the log probability of the target, for each example.\n",
    "- `loss` is the mean of the negative `log_likelihood` over the examples in the minibatch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "y = tensor.lvector('y')\n",
    "log_prob = tensor.log(p_y_given_x)\n",
    "log_likelihood = log_prob[tensor.arange(y.shape[0]), y]\n",
    "loss = - log_likelihood.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training procedure\n",
    "This notebook will use the method of stochastic gradient descent with mini-batches (MSGD) to find values of `W` and `b` that minimize the loss.\n",
    "\n",
    "We can let Theano compute symbolic expressions for the gradient of the loss wrt `W` and `b`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "g_W, g_b = theano.grad(cost=loss, wrt=[W, b])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`g_W` and `g_b` are symbolic variables, which can be used as part of a computation graph. In particular, let us define the expressions for one step of gradient descent for `W` and `b`, for a fixed learning rate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "learning_rate = numpy.float32(0.13)\n",
    "new_W = W - learning_rate * g_W\n",
    "new_b = b - learning_rate * g_b"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can then define **update expressions**, or pairs of (shared variable, expression for its update), that we will use when compiling the Theano function. The updates will be performed each time the function gets called.\n",
    "\n",
    "The following function, `train_model`, returns the loss on the current minibatch, then changes the values of the shared variables according to the update rules. It needs to be passed `x` and `y` as inputs, but not the shared variables, which are implicit inputs.\n",
    "\n",
    "The entire learning algorithm thus consists in looping over all examples in the dataset, considering all the examples in one minibatch at a time, and repeatedly calling the `train_model` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "train_model = theano.function(inputs=[x, y],\n",
    "                              outputs=loss,\n",
    "                              updates=[(W, new_W),\n",
    "                                       (b, new_b)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Testing the model\n",
    "When testing the model, we are interested in the number of misclassified examples (and not only in the likelihood). Here, we build a symbolic expression for retrieving the number of misclassified examples in a minibatch.\n",
    "\n",
    "This will also be useful to apply on the validation and testing sets, in order to monitor the progress of the model during training, and to do early stopping."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "misclass_nb = tensor.neq(y_pred, y)\n",
    "misclass_rate = misclass_nb.mean()\n",
    "\n",
    "test_model = theano.function(inputs=[x, y],\n",
    "                             outputs=misclass_rate)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the model\n",
    "Here is the main training loop of the algorithm:\n",
    "- For each *epoch*, or pass through the training set\n",
    "  - split the training set in minibatches, and call `train_model` on each minibatch\n",
    "  - split the validation set in minibatches, and call `test_model` on each minibatch to measure the misclassification rate\n",
    "  - if the misclassification rate has not improved in a while, stop training\n",
    "- Measure performance on the test set\n",
    "\n",
    "The **early stopping procedure** is what decide whether the performance has improved enough. There are many variants, and we will not go into the details of this one here.\n",
    "\n",
    "We first need to define a few parameters for the training loop and the early stopping procedure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## Define a couple of helper variables and functions for the optimization\n",
    "batch_size = 500\n",
    "# compute number of minibatches for training, validation and testing\n",
    "n_train_batches = train_set_x.shape[0] // batch_size\n",
    "n_valid_batches = valid_set_x.shape[0] // batch_size\n",
    "n_test_batches = test_set_x.shape[0] // batch_size\n",
    "\n",
    "def get_minibatch(i, dataset_x, dataset_y):\n",
    "    start_idx = i * batch_size\n",
    "    end_idx = (i + 1) * batch_size\n",
    "    batch_x = dataset_x[start_idx:end_idx]\n",
    "    batch_y = dataset_y[start_idx:end_idx]\n",
    "    return (batch_x, batch_y)\n",
    "\n",
    "## early-stopping parameters\n",
    "# maximum number of epochs\n",
    "n_epochs = 1000\n",
    "# look as this many examples regardless\n",
    "patience = 5000\n",
    "# wait this much longer when a new best is found\n",
    "patience_increase = 2\n",
    "# a relative improvement of this much is considered significant\n",
    "improvement_threshold = 0.995\n",
    "\n",
    "# go through this many minibatches before checking the network on the validation set;\n",
    "# in this case we check every epoch\n",
    "validation_frequency = min(n_train_batches, patience / 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import timeit\n",
    "from six.moves import xrange\n",
    "\n",
    "best_validation_loss = numpy.inf\n",
    "test_score = 0.\n",
    "start_time = timeit.default_timer()\n",
    "\n",
    "done_looping = False\n",
    "epoch = 0\n",
    "while (epoch < n_epochs) and (not done_looping):\n",
    "    epoch = epoch + 1\n",
    "    for minibatch_index in xrange(n_train_batches):\n",
    "        minibatch_x, minibatch_y = get_minibatch(minibatch_index, train_set_x, train_set_y)\n",
    "        minibatch_avg_cost = train_model(minibatch_x, minibatch_y)\n",
    "\n",
    "        # iteration number\n",
    "        iter = (epoch - 1) * n_train_batches + minibatch_index\n",
    "        if (iter + 1) % validation_frequency == 0:\n",
    "            # compute zero-one loss on validation set\n",
    "            validation_losses = []\n",
    "            for i in xrange(n_valid_batches):\n",
    "                valid_xi, valid_yi = get_minibatch(i, valid_set_x, valid_set_y)\n",
    "                validation_losses.append(test_model(valid_xi, valid_yi))\n",
    "            this_validation_loss = numpy.mean(validation_losses)\n",
    "            print('epoch %i, minibatch %i/%i, validation error %f %%' %\n",
    "                  (epoch,\n",
    "                   minibatch_index + 1,\n",
    "                   n_train_batches,\n",
    "                   this_validation_loss * 100.))\n",
    "\n",
    "            # if we got the best validation score until now\n",
    "            if this_validation_loss < best_validation_loss:\n",
    "                # improve patience if loss improvement is good enough\n",
    "                if this_validation_loss < best_validation_loss * improvement_threshold:\n",
    "                    patience = max(patience, iter * patience_increase)\n",
    "\n",
    "                best_validation_loss = this_validation_loss\n",
    "\n",
    "                # test it on the test set\n",
    "                test_losses = []\n",
    "                for i in xrange(n_test_batches):\n",
    "                    test_xi, test_yi = get_minibatch(i, test_set_x, test_set_y)\n",
    "                    test_losses.append(test_model(test_xi, test_yi))\n",
    "\n",
    "                test_score = numpy.mean(test_losses)\n",
    "                print('     epoch %i, minibatch %i/%i, test error of  best model %f %%' %\n",
    "                      (epoch,\n",
    "                       minibatch_index + 1,\n",
    "                       n_train_batches,\n",
    "                       test_score * 100.))\n",
    "\n",
    "                # save the best parameters\n",
    "                numpy.savez('best_model.npz', W=W.get_value(), b=b.get_value())\n",
    "\n",
    "        if patience <= iter:\n",
    "            done_looping = True\n",
    "            break\n",
    "\n",
    "end_time = timeit.default_timer()\n",
    "print('Optimization complete with best validation score of %f %%, '\n",
    "      'with test performance %f %%' %\n",
    "      (best_validation_loss * 100., test_score * 100.))\n",
    "\n",
    "print('The code ran for %d epochs, with %f epochs/sec' %\n",
    "      (epoch, 1. * epoch / (end_time - start_time)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualization\n",
    "You can visualize the columns of `W`, which correspond to the separation hyperplanes for each class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "from utils import tile_raster_images\n",
    "\n",
    "plt.clf()\n",
    "\n",
    "# Increase the size of the figure\n",
    "plt.gcf().set_size_inches(15, 10)\n",
    "\n",
    "plot_data = tile_raster_images(W.get_value(borrow=True).T,\n",
    "                               img_shape=(28, 28), tile_shape=(2, 5), tile_spacing=(1, 1))\n",
    "plt.imshow(plot_data, cmap='Greys', interpolation='none')\n",
    "plt.axis('off')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}