{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Multilayer Perceptron in Theano\n",
    "\n",
    "Credits: Forked from [summerschool2015](https://github.com/mila-udem/summerschool2015) by mila-udem\n",
    "\n",
    "This notebook describes how to implement the building blocks for a multilayer perceptron in Theano, in particular how to define and combine layers.\n",
    "\n",
    "We will continue using the MNIST digits classification dataset, still using Fuel.\n",
    "\n",
    "## The Model\n",
    "We will focus on fully-connected layers, with an elementwise non-linearity on each hidden layer, and a softmax layer (similar to the logistic regression model) for classification on the top layer.\n",
    "\n",
    "### A class for hidden layers\n",
    "This class does all its work in its constructor:\n",
    "- Create and initialize shared variables for its parameters (`W` and `b`), unless there are explicitly provided. Note that the initialization scheme for `W` is the one described in [Glorot & Bengio (2010)](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf).\n",
    "- Build the Theano expression for the value of the output units, given a variable for the input.\n",
    "- Store the input, output, and shared parameters as members."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy\n",
    "import theano\n",
    "from theano import tensor\n",
    "\n",
    "# Set lower precision float, otherwise the notebook will take too long to run\n",
    "theano.config.floatX = 'float32'\n",
    "\n",
    "\n",
    "class HiddenLayer(object):\n",
    "    def __init__(self, rng, input, n_in, n_out, W=None, b=None,\n",
    "                 activation=tensor.tanh):\n",
    "        \"\"\"\n",
    "        Typical hidden layer of a MLP: units are fully-connected and have\n",
    "        sigmoidal activation function. Weight matrix W is of shape (n_in,n_out)\n",
    "        and the bias vector b is of shape (n_out,).\n",
    "\n",
    "        NOTE : The nonlinearity used here is tanh\n",
    "\n",
    "        Hidden unit activation is given by: tanh(dot(input,W) + b)\n",
    "\n",
    "        :type rng: numpy.random.RandomState\n",
    "        :param rng: a random number generator used to initialize weights\n",
    "\n",
    "        :type input: theano.tensor.dmatrix\n",
    "        :param input: a symbolic tensor of shape (n_examples, n_in)\n",
    "\n",
    "        :type n_in: int\n",
    "        :param n_in: dimensionality of input\n",
    "\n",
    "        :type n_out: int\n",
    "        :param n_out: number of hidden units\n",
    "\n",
    "        :type activation: theano.Op or function\n",
    "        :param activation: Non linearity to be applied in the hidden layer\n",
    "        \"\"\"\n",
    "        self.input = input\n",
    "\n",
    "        # `W` is initialized with `W_values` which is uniformely sampled\n",
    "        # from sqrt(-6./(n_in+n_hidden)) and sqrt(6./(n_in+n_hidden))\n",
    "        # for tanh activation function\n",
    "        # the output of uniform if converted using asarray to dtype\n",
    "        # theano.config.floatX so that the code is runable on GPU\n",
    "        # Note : optimal initialization of weights is dependent on the\n",
    "        #        activation function used (among other things).\n",
    "        #        For example, results presented in Glorot & Bengio (2010)\n",
    "        #        suggest that you should use 4 times larger initial weights\n",
    "        #        for sigmoid compared to tanh\n",
    "        if W is None:\n",
    "            W_values = numpy.asarray(\n",
    "                rng.uniform(\n",
    "                    low=-numpy.sqrt(6. / (n_in + n_out)),\n",
    "                    high=numpy.sqrt(6. / (n_in + n_out)),\n",
    "                    size=(n_in, n_out)\n",
    "                ),\n",
    "                dtype=theano.config.floatX\n",
    "            )\n",
    "            if activation == tensor.nnet.sigmoid:\n",
    "                W_values *= 4\n",
    "\n",
    "            W = theano.shared(value=W_values, name='W', borrow=True)\n",
    "\n",
    "        if b is None:\n",
    "            b_values = numpy.zeros((n_out,), dtype=theano.config.floatX)\n",
    "            b = theano.shared(value=b_values, name='b', borrow=True)\n",
    "\n",
    "        self.W = W\n",
    "        self.b = b\n",
    "\n",
    "        lin_output = tensor.dot(input, self.W) + self.b\n",
    "        self.output = (\n",
    "            lin_output if activation is None\n",
    "            else activation(lin_output)\n",
    "        )\n",
    "        # parameters of the model\n",
    "        self.params = [self.W, self.b]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A softmax class for the output\n",
    "This class performs computations similar to what was performed in the [logistic regression tutorial](../intro_theano/logistic_regression.ipynb).\n",
    "\n",
    "Here as well, the expression for the output is built in the class constructor, which takes the input as argument. We also add the target, `y`, and store it as an argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "class LogisticRegression(object):\n",
    "    \"\"\"Multi-class Logistic Regression Class\n",
    "\n",
    "    The logistic regression is fully described by a weight matrix :math:`W`\n",
    "    and bias vector :math:`b`. Classification is done by projecting data\n",
    "    points onto a set of hyperplanes, the distance to which is used to\n",
    "    determine a class membership probability.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, input, target, n_in, n_out):\n",
    "        \"\"\" Initialize the parameters of the logistic regression\n",
    "\n",
    "        :type input: theano.tensor.TensorType\n",
    "        :param input: symbolic variable that describes the input of the\n",
    "                      architecture (one minibatch)\n",
    "        \n",
    "        :type target: theano.tensor.TensorType\n",
    "        :type target: column tensor that describes the target for training\n",
    "\n",
    "        :type n_in: int\n",
    "        :param n_in: number of input units, the dimension of the space in\n",
    "                     which the datapoints lie\n",
    "\n",
    "        :type n_out: int\n",
    "        :param n_out: number of output units, the dimension of the space in\n",
    "                      which the labels lie\n",
    "\n",
    "        \"\"\"\n",
    "        # keep track of model input and target.\n",
    "        # We store a flattened (vector) version of target as y, which is easier to handle\n",
    "        self.input = input\n",
    "        self.target = target\n",
    "        self.y = target.flatten()\n",
    "\n",
    "        self.W = theano.shared(value=numpy.zeros((n_in, n_out), dtype=theano.config.floatX),\n",
    "                               name='W',\n",
    "                               borrow=True)\n",
    "        self.b = theano.shared(value=numpy.zeros((n_out,), dtype=theano.config.floatX),\n",
    "                               name='b',\n",
    "                               borrow=True)\n",
    "    \n",
    "        # class-membership probabilities\n",
    "        self.p_y_given_x = tensor.nnet.softmax(tensor.dot(input, self.W) + self.b)\n",
    "\n",
    "        # class whose probability is maximal\n",
    "        self.y_pred = tensor.argmax(self.p_y_given_x, axis=1)\n",
    "\n",
    "        # parameters of the model\n",
    "        self.params = [self.W, self.b]\n",
    "        \n",
    "\n",
    "    def negative_log_likelihood(self):\n",
    "        \"\"\"Return the mean of the negative log-likelihood of the prediction\n",
    "        of this model under a given target distribution.\n",
    "\n",
    "        Note: we use the mean instead of the sum so that\n",
    "              the learning rate is less dependent on the batch size\n",
    "        \"\"\"\n",
    "        log_prob = tensor.log(self.p_y_given_x)\n",
    "        log_likelihood = log_prob[tensor.arange(self.y.shape[0]), self.y]\n",
    "        loss = - log_likelihood.mean()\n",
    "        return loss\n",
    "\n",
    "    def errors(self):\n",
    "        \"\"\"Return a float representing the number of errors in the minibatch\n",
    "        over the total number of examples of the minibatch\n",
    "        \"\"\"\n",
    "        misclass_nb = tensor.neq(self.y_pred, self.y)\n",
    "        misclass_rate = misclass_nb.mean()\n",
    "        return misclass_rate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The MLP class\n",
    "That class brings together the different parts of the model.\n",
    "\n",
    "It also adds additional controls on the training of the full network, for instance an expression for L1 or L2 regularization (weight decay).\n",
    "\n",
    "We can specify an arbitrary number of hidden layers, providing an empty one will reproduce the logistic regression model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "class MLP(object):\n",
    "    \"\"\"Multi-Layer Perceptron Class\n",
    "\n",
    "    A multilayer perceptron is a feedforward artificial neural network model\n",
    "    that has one layer or more of hidden units and nonlinear activations.\n",
    "    Intermediate layers usually have as activation function tanh or the\n",
    "    sigmoid function (defined here by a ``HiddenLayer`` class)  while the\n",
    "    top layer is a softmax layer (defined here by a ``LogisticRegression``\n",
    "    class).\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, rng, input, target, n_in, n_hidden, n_out, activation=tensor.tanh):\n",
    "        \"\"\"Initialize the parameters for the multilayer perceptron\n",
    "\n",
    "        :type rng: numpy.random.RandomState\n",
    "        :param rng: a random number generator used to initialize weights\n",
    "\n",
    "        :type input: theano.tensor.TensorType\n",
    "        :param input: symbolic variable that describes the input of the\n",
    "        architecture (one minibatch)\n",
    "        \n",
    "        :type target: theano.tensor.TensorType\n",
    "        :type target: column tensor that describes the target for training\n",
    "\n",
    "        :type n_in: int\n",
    "        :param n_in: number of input units, the dimension of the space in\n",
    "        which the datapoints lie\n",
    "\n",
    "        :type n_hidden: list of int\n",
    "        :param n_hidden: number of hidden units in each hidden layer\n",
    "\n",
    "        :type n_out: int\n",
    "        :param n_out: number of output units, the dimension of the space in\n",
    "        which the labels lie\n",
    "        \n",
    "        :type activation: theano.Op or function\n",
    "        :param activation: Non linearity to be applied in all hidden layers\n",
    "        \"\"\"\n",
    "        # keep track of model input and target.\n",
    "        # We store a flattened (vector) version of target as y, which is easier to handle\n",
    "        self.input = input\n",
    "        self.target = target\n",
    "        self.y = target.flatten()\n",
    "\n",
    "        # Build all necessary hidden layers and chain them\n",
    "        self.hidden_layers = []\n",
    "        layer_input = input\n",
    "        layer_n_in = n_in\n",
    "\n",
    "        for nh in n_hidden:\n",
    "            hidden_layer = HiddenLayer(\n",
    "                rng=rng,\n",
    "                input=layer_input,\n",
    "                n_in=layer_n_in,\n",
    "                n_out=nh,\n",
    "                activation=activation)\n",
    "            self.hidden_layers.append(hidden_layer)\n",
    "\n",
    "            # prepare variables for next layer\n",
    "            layer_input = hidden_layer.output\n",
    "            layer_n_in = nh\n",
    "\n",
    "        # The logistic regression layer gets as input the hidden units of the hidden layer,\n",
    "        # and the target\n",
    "        self.log_reg_layer = LogisticRegression(\n",
    "            input=layer_input,\n",
    "            target=target,\n",
    "            n_in=layer_n_in,\n",
    "            n_out=n_out)\n",
    "        \n",
    "        # self.params has all the parameters of the model,\n",
    "        # self.weights contains only the `W` variables.\n",
    "        # We also give unique name to the parameters, this will be useful to save them.\n",
    "        self.params = []\n",
    "        self.weights = []\n",
    "        layer_idx = 0\n",
    "        for hl in self.hidden_layers:\n",
    "            self.params.extend(hl.params)\n",
    "            self.weights.append(hl.W)\n",
    "            for hlp in hl.params:\n",
    "                prev_name = hlp.name\n",
    "                hlp.name = 'layer' + str(layer_idx) + '.' + prev_name\n",
    "            layer_idx += 1\n",
    "        self.params.extend(self.log_reg_layer.params)\n",
    "        self.weights.append(self.log_reg_layer.W)\n",
    "        for lrp in self.log_reg_layer.params:\n",
    "            prev_name = lrp.name\n",
    "            lrp.name = 'layer' + str(layer_idx) + '.' + prev_name\n",
    "\n",
    "        # L1 norm ; one regularization option is to enforce L1 norm to be small\n",
    "        self.L1 = sum(abs(W).sum() for W in self.weights)\n",
    "\n",
    "        # square of L2 norm ; one regularization option is to enforce square of L2 norm to be small\n",
    "        self.L2_sqr = sum((W ** 2).sum() for W in self.weights)\n",
    "    \n",
    "    def negative_log_likelihood(self):\n",
    "        # negative log likelihood of the MLP is given by the negative\n",
    "        # log likelihood of the output of the model, computed in the\n",
    "        # logistic regression layer\n",
    "        return self.log_reg_layer.negative_log_likelihood()\n",
    "\n",
    "    def errors(self):\n",
    "        # same holds for the function computing the number of errors\n",
    "        return self.log_reg_layer.errors()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training Procedure\n",
    "We will re-use the same training algorithm: stochastic gradient descent with mini-batches, and the same early-stopping criterion. Here, the number of parameters to train is variable, and we have to wait until the MLP model is actually instantiated to have an expression for the cost and the updates.\n",
    "\n",
    "### Gradient and Updates\n",
    "Let us define helper functions for getting expressions for the gradient of the cost wrt the parameters, and the parameter updates. The following ones are simple, but many variations can exist, for instance:\n",
    "- regularized costs, including L1 or L2 regularization\n",
    "- more complex learning rules, such as momentum, RMSProp, ADAM, ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def nll_grad(mlp_model):\n",
    "    loss = mlp_model.negative_log_likelihood()\n",
    "    params = mlp_model.params\n",
    "    grads = theano.grad(loss, wrt=params)\n",
    "    # Return (param, grad) pairs\n",
    "    return zip(params, grads)\n",
    "\n",
    "def sgd_updates(params_and_grads, learning_rate):\n",
    "    return [(param, param - learning_rate * grad)\n",
    "            for param, grad in params_and_grads]\n",
    "\n",
    "def get_simple_training_fn(mlp_model, learning_rate):\n",
    "    inputs = [mlp_model.input, mlp_model.target]\n",
    "    params_and_grads = nll_grad(mlp_model)\n",
    "    updates = sgd_updates(params_and_grads, learning_rate=lr)\n",
    "    \n",
    "    return theano.function(inputs=inputs, outputs=[], updates=updates)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def regularized_cost_grad(mlp_model, L1_reg, L2_reg):\n",
    "    loss = (mlp_model.negative_log_likelihood() +\n",
    "            L1_reg * mlp_model.L1 + \n",
    "            L2_reg * mlp_model.L2_sqr)\n",
    "    params = mlp_model.params\n",
    "    grads = theano.grad(loss, wrt=params)\n",
    "    # Return (param, grad) pairs\n",
    "    return zip(params, grads)\n",
    "\n",
    "def get_regularized_training_fn(mlp_model, L1_reg, L2_reg, learning_rate):\n",
    "    inputs = [mlp_model.input, mlp_model.target]\n",
    "    params_and_grads = regularized_cost_grad(mlp_model, L1_reg, L2_reg)\n",
    "    updates = sgd_updates(params_and_grads, learning_rate=lr)\n",
    "    return theano.function(inputs, updates=updates)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Testing function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def get_test_fn(mlp_model):\n",
    "    return theano.function([mlp_model.input, mlp_model.target], mlp_model.errors())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the Model\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training procedure\n",
    "We first need to define a few parameters for the training loop and the early stopping procedure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import timeit\n",
    "from fuel.streams import DataStream\n",
    "from fuel.schemes import SequentialScheme\n",
    "from fuel.transformers import Flatten\n",
    "\n",
    "## early-stopping parameters tuned for 1-2 min runtime\n",
    "def sgd_training(train_model, test_model, train_set, valid_set, test_set, model_name='mlp_model',\n",
    "                 # maximum number of epochs\n",
    "                 n_epochs=20,\n",
    "                 # look at this many examples regardless\n",
    "                 patience=5000,\n",
    "                 # wait this much longer when a new best is found\n",
    "                 patience_increase=2,\n",
    "                 # a relative improvement of this much is considered significant\n",
    "                 improvement_threshold=0.995,\n",
    "                 batch_size=20):\n",
    "\n",
    "    n_train_batches = train_set.num_examples // batch_size\n",
    "\n",
    "    # Create data streams to iterate through the data.\n",
    "    train_stream = Flatten(DataStream.default_stream(\n",
    "        train_set, iteration_scheme=SequentialScheme(train_set.num_examples, batch_size)))\n",
    "    valid_stream = Flatten(DataStream.default_stream(\n",
    "        valid_set, iteration_scheme=SequentialScheme(valid_set.num_examples, batch_size)))\n",
    "    test_stream = Flatten(DataStream.default_stream(\n",
    "        test_set, iteration_scheme=SequentialScheme(test_set.num_examples, batch_size)))\n",
    "\n",
    "    # go through this many minibatches before checking the network on the validation set;\n",
    "    # in this case we check every epoch\n",
    "    validation_frequency = min(n_train_batches, patience / 2)\n",
    "\n",
    "    best_validation_loss = numpy.inf\n",
    "    test_score = 0.\n",
    "    start_time = timeit.default_timer()\n",
    "\n",
    "    done_looping = False\n",
    "    epoch = 0\n",
    "    while (epoch < n_epochs) and (not done_looping):\n",
    "        epoch = epoch + 1\n",
    "        minibatch_index = 0\n",
    "        for minibatch_x, minibatch_y in train_stream.get_epoch_iterator():\n",
    "            train_model(minibatch_x, minibatch_y)\n",
    "\n",
    "            # iteration number\n",
    "            iter = (epoch - 1) * n_train_batches + minibatch_index\n",
    "            if (iter + 1) % validation_frequency == 0:\n",
    "                # compute zero-one loss on validation set\n",
    "                validation_losses = []\n",
    "                for valid_xi, valid_yi in valid_stream.get_epoch_iterator():\n",
    "                    validation_losses.append(test_model(valid_xi, valid_yi))\n",
    "                this_validation_loss = numpy.mean(validation_losses)\n",
    "                print('epoch %i, minibatch %i/%i, validation error %f %%' %\n",
    "                      (epoch,\n",
    "                       minibatch_index + 1,\n",
    "                       n_train_batches,\n",
    "                       this_validation_loss * 100.))\n",
    "\n",
    "                # if we got the best validation score until now\n",
    "                if this_validation_loss < best_validation_loss:\n",
    "                    # improve patience if loss improvement is good enough\n",
    "                    if this_validation_loss < best_validation_loss * improvement_threshold:\n",
    "                        patience = max(patience, iter * patience_increase)\n",
    "\n",
    "                    best_validation_loss = this_validation_loss\n",
    "\n",
    "                    # test it on the test set\n",
    "                    test_losses = []\n",
    "                    for test_xi, test_yi in test_stream.get_epoch_iterator():\n",
    "                        test_losses.append(test_model(test_xi, test_yi))\n",
    "\n",
    "                    test_score = numpy.mean(test_losses)\n",
    "                    print('     epoch %i, minibatch %i/%i, test error of best model %f %%' %\n",
    "                          (epoch,\n",
    "                           minibatch_index + 1,\n",
    "                           n_train_batches,\n",
    "                           test_score * 100.))\n",
    "\n",
    "                    # save the best parameters\n",
    "                    # build a name -> value dictionary\n",
    "                    best = {param.name: param.get_value() for param in mlp_model.params}\n",
    "                    numpy.savez('best_{}.npz'.format(model_name), **best)\n",
    "\n",
    "            minibatch_index += 1\n",
    "            if patience <= iter:\n",
    "                done_looping = True\n",
    "                break\n",
    "\n",
    "    end_time = timeit.default_timer()\n",
    "    print('Optimization complete with best validation score of %f %%, '\n",
    "          'with test performance %f %%' %\n",
    "          (best_validation_loss * 100., test_score * 100.))\n",
    "\n",
    "    print('The code ran for %d epochs, with %f epochs/sec (%.2fm total time)' %\n",
    "          (epoch, 1. * epoch / (end_time - start_time), (end_time - start_time) / 60.))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We then load our data set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from fuel.datasets import MNIST\n",
    "\n",
    "# the full set is usually (0, 50000) for train, (50000, 60000) for valid and no slice for test.\n",
    "# We only selected a subset to go faster.\n",
    "train_set = MNIST(which_sets=('train',), sources=('features', 'targets'), subset=slice(0, 20000))\n",
    "valid_set = MNIST(which_sets=('train',), sources=('features', 'targets'), subset=slice(20000, 24000))\n",
    "test_set = MNIST(which_sets=('test',), sources=('features', 'targets'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Build the Model\n",
    "Now is the time to specify and build a particular instance of the MLP. Let's start with one with a single hidden layer of 500 hidden units, and a tanh non-linearity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "rng = numpy.random.RandomState(1234)\n",
    "x = tensor.matrix('x')\n",
    "# The labels coming from Fuel are in a \"column\" format\n",
    "y = tensor.icol('y')\n",
    "\n",
    "n_in = 28 * 28\n",
    "n_out = 10\n",
    "\n",
    "mlp_model = MLP(\n",
    "    rng=rng,\n",
    "    input=x,\n",
    "    target=y,\n",
    "    n_in=n_in,\n",
    "    n_hidden=[500],\n",
    "    n_out=n_out,\n",
    "    activation=tensor.tanh)\n",
    "\n",
    "lr = numpy.float32(0.1)\n",
    "L1_reg = numpy.float32(0)\n",
    "L2_reg = numpy.float32(0.0001)\n",
    "\n",
    "train_model = get_regularized_training_fn(mlp_model, L1_reg, L2_reg, lr)\n",
    "test_model = get_test_fn(mlp_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Launch the training phase"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "sgd_training(train_model, test_model, train_set, valid_set, test_set)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How can we make it better?\n",
    "\n",
    "- Max-column normalization\n",
    "- Dropout\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ReLU activation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def relu(x):\n",
    "    return x * (x > 0)\n",
    "\n",
    "rng = numpy.random.RandomState(1234)\n",
    "\n",
    "mlp_relu = MLP(\n",
    "    rng=rng,\n",
    "    input=x,\n",
    "    target=y,\n",
    "    n_in=n_in,\n",
    "    n_hidden=[500],\n",
    "    n_out=n_out,\n",
    "    activation=relu)\n",
    "\n",
    "lr = numpy.float32(0.1)\n",
    "L1_reg = numpy.float32(0)\n",
    "L2_reg = numpy.float32(0.0001)\n",
    "\n",
    "train_relu = get_regularized_training_fn(mlp_relu, L1_reg, L2_reg, lr)\n",
    "test_relu = get_test_fn(mlp_relu)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "sgd_training(train_relu, test_relu, train_set, valid_set, test_set, model_name='mlp_relu')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Momentum training (Adadelta, RMSProp, ...)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# This implements simple momentum\n",
    "def get_momentum_updates(params_and_grads, lr, rho):\n",
    "    res = []\n",
    "\n",
    "    # numpy will promote (1 - rho) to float64 otherwise\n",
    "    one = numpy.float32(1.)\n",
    "    \n",
    "    for p, g in params_and_grads:\n",
    "        up = theano.shared(p.get_value() * 0)\n",
    "        res.append((p, p - lr * up))\n",
    "        res.append((up, rho * up + (one - rho) * g))\n",
    "\n",
    "    return res\n",
    "\n",
    "\n",
    "# This implements the parameter updates for Adadelta\n",
    "def get_adadelta_updates(params_and_grads, rho):\n",
    "    up2 = [theano.shared(p.get_value() * 0, name=\"up2 for \" + p.name) for p, g in params_and_grads]\n",
    "    grads2 = [theano.shared(p.get_value() * 0, name=\"grads2 for \" + p.name) for p, g in params_and_grads]\n",
    "\n",
    "    # This is dumb but numpy will promote (1 - rho) to float64 otherwise\n",
    "    one = numpy.float32(1.)\n",
    "    \n",
    "    rg2up = [(rg2, rho * rg2 + (one - rho) * (g ** 2))\n",
    "             for rg2, (p, g) in zip(grads2, params_and_grads)]\n",
    "\n",
    "    updir = [-(tensor.sqrt(ru2 + 1e-6) / tensor.sqrt(rg2 + 1e-6)) * g\n",
    "             for (p, g), ru2, rg2 in zip(params_and_grads, up2, grads2)]\n",
    "\n",
    "    ru2up = [(ru2, rho * ru2 + (one - rho) * (ud ** 2))\n",
    "             for ru2, ud in zip(up2, updir)]\n",
    "\n",
    "    param_up = [(p, p + ud) for (p, g), ud in zip(params_and_grads, updir)]\n",
    "    \n",
    "    return rg2up + ru2up + param_up\n",
    "\n",
    "# You can try to write an RMSProp function and train the model with it.\n",
    "\n",
    "def get_momentum_training_fn(mlp_model, L1_reg, L2_reg, lr, rho):\n",
    "    inputs = [mlp_model.input, mlp_model.target]\n",
    "    params_and_grads = regularized_cost_grad(mlp_model, L1_reg, L2_reg)\n",
    "    updates = get_momentum_updates(params_and_grads, lr=lr, rho=rho)\n",
    "    return theano.function(inputs, updates=updates)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "rng = numpy.random.RandomState(1234)\n",
    "x = tensor.matrix('x')\n",
    "# The labels coming from Fuel are in a \"column\" format\n",
    "y = tensor.icol('y')\n",
    "\n",
    "n_in = 28 * 28\n",
    "n_out = 10\n",
    "\n",
    "mlp_model = MLP(\n",
    "    rng=rng,\n",
    "    input=x,\n",
    "    target=y,\n",
    "    n_in=n_in,\n",
    "    n_hidden=[500],\n",
    "    n_out=n_out,\n",
    "    activation=tensor.tanh)\n",
    "\n",
    "lr = numpy.float32(0.1)\n",
    "L1_reg = numpy.float32(0)\n",
    "L2_reg = numpy.float32(0.0001)\n",
    "rho = numpy.float32(0.95)\n",
    "\n",
    "momentum_train = get_momentum_training_fn(mlp_model, L1_reg, L2_reg, lr=lr, rho=rho)\n",
    "test_fn = get_test_fn(mlp_model)\n",
    "\n",
    "sgd_training(momentum_train, test_fn, train_set, valid_set, test_set, n_epochs=20, model_name='mlp_momentum')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}