{ "metadata": { "name": "multilayer_perceptron" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# pylearn2 tutorial: Multilayer Perceptron\n", "by [Ian Goodfellow](http://www-etud.iro.umontreal.ca/~goodfeli)\n", "\n", "## Introduction\n", "This ipython notebook will teach you the basics of how multilayer perceptrons work, and show you how to use multilayer perceptrons in pylearn2.\n", "\n", "To do this, we will go over several concepts:\n", "\n", "Part 1: What pylearn2 is doing for you in this example\n", "\n", " - Review of softmax regression, and how MLPs are similar\n", "\n", " - The multilayer perceptron model\n", "\n", " - Some beneficial properties of MLPs\n", "\n", " - Some detrimental properties of MLPs\n", "\n", "Part 2: How to use pylearn2 to train an MLP\n", "\n", "Part 3: A deeper MLP, and pylearn2 polymorphism\n", "\n", "Part 4: Regularization, and pylearn2 costs\n", "\n", "\n", "Note that this won't explain in detail how the individual classes are implemented. The classes\n", "follow pretty good naming conventions and have pretty good docstrings, but if you have trouble\n", "understanding them, write to me and I might add a part 3 explaining how some of the parts work\n", "under the hood.\n", "\n", "Please write to pylearn-dev@googlegroups.com if you encounter any problem with this tutorial." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Requirements\n", "\n", "Before running this notebook, you must have installed pylearn2.\n", "Follow the [download and installation instructions](http://deeplearning.net/software/pylearn2/#download-and-installation)\n", "if you have not yet done so.\n", "\n", "This tutorial also assumes you already know about softmax regression, and know how to train and evaluate a softmax regression model in pylearn2. 
If not, work through softmax_regression.ipynb before starting this tutorial.\n", "\n", "It is also strongly recommended that you run this notebook with THEANO_FLAGS=\"device=gpu\". This is a processing-intensive example and the GPU will make it run a lot faster, if you have one available. Execute the next cell to verify that you are using the GPU.\n", "\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import theano\n", "print theano.config.device" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "gpu\n" ] }, { "output_type": "stream", "stream": "stderr", "text": [ "Using gpu device 0: GeForce GTX 285\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: What pylearn2 is doing for you in this example\n", "\n", "In this part, we won't get into any specifics of pylearn2 yet. We'll just discuss how to train a multilayer perceptron (MLP). If you already know about MLPs, feel free to skip straight to part 2, where we show how to do all of this in pylearn2.\n", "\n", "\n", "### Review of softmax regression, and how MLPs are similar\n", "\n", "In softmax_regression.ipynb, we saw how softmax regression is a classification model that learns to map an input vector $x$ to a probability distribution $p(y\\mid x)$ where $y$ is a categorical value with $k$ different values. We then described how a dataset $\\mathcal{D}$ of $(x, y)$ tuples could be used to train a softmax regression model by maximizing the log likelihood,\n", "\n", "$$\\sum_{x,y \\in \\mathcal{D} } \\log p(y \\mid x).$$\n", "\n", "A multilayer perceptron is a very general machine learning model. In many cases, we can think of it as mapping $x$ to $p(y\\mid x)$, and train it by maximizing the log likelihood. We'll start with that basic perspective, because of its similarity to softmax regression. 
(It is, however, possible to interpret the output of a multilayer perceptron non-probabilistically, to use it for regression rather than classification, and to train it by optimizing functions other than the log likelihood.)\n", "\n", "Everything we described above is still relevant to the MLP. However, there is one more fact about softmax regression that does not apply to the MLP. Specifically, softmax regression assumes that\n", "\n", "$$p(y \\mid x) = \\frac { \\exp( x^T W + b ) } { \\sum_i \\exp(x^T W + b)_i } = \\text{softmax}( x^T W + b).$$\n", "\n", "The MLP makes a different assumption about the functional form of $p(y \\mid x)$.\n", "\n", "### The multilayer perceptron model\n", "\n", "The multilayer perceptron model assumption is very weak. Essentially, the assumption is that the relationship between inputs and outputs can be represented by the composition of several simpler functions. Each function being composed can be thought of as another \"layer\" or stage of processing. The number of compositions determines the \"depth\" of the model.\n", "\n", "Suppose we have a sequence of functions implementing the layers, $g_1, g_2, \\dots, g_L$. Then the output of our MLP is\n", "\n", "$$f(x) = g_L(g_{L-1}( \\dots g_2( g_1 ( x )) \\dots )).$$\n", "\n", "In the first example for this tutorial, we will use just two layers. The final layer will be\n", "\n", "$g_2(g_1) = \\text{softmax}( g_1^T W^{(2)} + b^{(2)}),$\n", "\n", "so we can think of this model as using $g_1$ to transform $x$ into a different space, then doing softmax regression in that space.\n", "\n", "For the first layer, we will use an affine transform followed by elementwise application of the logistic sigmoid function, $\\sigma(z) = \\frac {1 } { 1 + \\exp(-z) }.$ This is a very commonly used type of layer in multilayer perceptrons. 
Putting it all together, we get\n", "\n", "$g_1(x) = \\sigma ( x^T W^{(1)} + b^{(1)} ).$\n", "\n", "The full model is thus\n", "\n", "$$f(x) = \\text{softmax}( \\sigma ( x^T W^{(1)} + b^{(1)} )^T W^{(2)} + b^{(2)}).$$\n", "\n", "If we interpret $f(x)$ as defining $p(y \\mid x)$, it makes sense to train the parameters $W^{(1)}$, $W^{(2)}$, $b^{(1)}$, and $b^{(2)}$ by maximizing the log likelihood of the training data. \n", "\n", "\n", "### Some beneficial properties of MLPs\n", "\n", "An obvious problem with softmax regression and other linear classifiers is that linear functions are very simple. They cannot solve even very simple classification problems, such as recognizing the 2-bit patterns whose XOR is true. XOR is true when $x=[1,0]$ or $x=[0,1]$ but not when $x=[0,0]$ or $x=[1,1]$. Suppose we draw a line that separates $[0,0]$ from $[0,1]$. Then it must pass through some point $[0,p]$ with $0 < p < 1$. We require that this line also pass through some point $[q,1]$ with $0 < q < 1$ in order to separate $[0,1]$ from $[1,1]$. But this means its slope must be positive and its $x$-intercept must be negative. Since a line has only one $x$-intercept, it does not pass between $[0,0]$ and $[1,0]$. Those two points belong to different classes, so any linear classifier must fail.\n", "\n", "An MLP solves this problem by introducing extra stages of processing. In our two-layer example, suppose the dimensionality of the first layer is 2. We call the outputs of this layer \"hidden units\" because they are neither inputs nor outputs of the system; they are unobserved variables that the network must decide what to do with. The MLP can set one of these hidden units to be active when the sum of the two input variables is less than 1. It can set the other to be active when the sum of the two input variables is greater than 1. 
It can then set the output unit to be active by default, and to deactivate when either of the two hidden variables is active.\n", "\n", "More generally, an MLP with one sufficiently large hidden layer can approximate any continuous function to arbitrary accuracy. This result is known as the \"universal approximation theorem.\"\n", "\n", "Another advantage of MLPs is that they can be made deeper and deeper, rather than just wider and wider. Many functions can be represented more efficiently (using fewer parameters) with a deep architecture than with a wide one. Using fewer parameters is beneficial both because the MLP takes less memory to represent and because the parameters may be estimated more accurately from a smaller amount of data.\n", "\n", "### Some detrimental properties of MLPs\n", "\n", "Unfortunately, just because an MLP can represent any function does not mean that it will learn to represent the right function. The problem of overfitting can still make the MLP perform badly on the test set even if it classifies the training set perfectly. While larger MLPs are capable of fitting more complicated training sets, they are also likely to overfit worse than smaller MLPs.\n", "\n", "A related issue with MLPs is that they have many configuration options. The model itself imposes design decisions such as what type of function to use for each layer and the dimensionality of each layer. Also, the log likelihood is no longer generally concave, so the choice of optimization procedure matters more than it did with softmax regression. These configuration options are known as \"hyperparameters.\" Choosing the right hyperparameters is an open and exciting research problem. \n", "\n", "Most of the hyperparameters in this tutorial were not chosen particularly carefully. Feel free to play with all of the settings in this notebook. If you find better ones, write to me and I'll put your settings and your name in the tutorial!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: How to use pylearn2 to train an MLP\n", "\n", "Now that we've described the theory of what we're going to do, it's time to do it! This part describes\n", "how to use pylearn2 to run the algorithms described above.\n", "\n", "As in the softmax regression tutorial, we will use the MLP to do optical character recognition on the MNIST dataset.\n", "The yaml string we construct is similar to the one we used before. The main difference is that the MLP model class\n", "takes a \"layers\" argument describing the various layers of the model.\n", "\n", "Note that for each layer, we need to specify what class to load. The identity of this class determines what type of layer\n", "appears at each position in the network. Here, we use a sigmoid hidden layer followed by a softmax output layer.\n", "\n", "Every layer of the MLP needs a unique name. Here we name the first hidden layer 'h0' and the output layer, which represents the\n", "prediction of the class $y$, 'y'. These layer names are used to generate monitor channel names later so that we can track properties of each layer separately.\n", "\n", "The hidden layer needs some configuration that is pretty similar to the configuration for the output layer. Much as we need to tell the output layer its size (10 classes), we also need to tell the hidden layer its dimension, or the number of hidden units to go in that layer. In this case we use 500. We also need to tell it how to initialize its weights. The Sigmoid class supports the irange argument that we demonstrated for Softmax in the softmax regression tutorial, and we could use that here. Instead, we demonstrate a different argument, sparse_init. When sparse_init is specified, each unit gets exactly sparse_init non-zero weights initially. 
These weights are drawn from $N(0,1)$, so they are quite large compared to how weights are usually initialized.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import os\n", "import pylearn2\n", "path = os.path.join(pylearn2.__path__[0], 'scripts', 'tutorials', 'multilayer_perceptron', 'mlp_tutorial_part_2.yaml')\n", "with open(path, 'r') as f:\n", " train = f.read()\n", "hyper_params = {'train_stop' : 50000,\n", " 'valid_stop' : 60000,\n", " 'dim_h0' : 500,\n", " 'max_epochs' : 10000,\n", " 'save_path' : '.'}\n", "train = train % (hyper_params)\n", "print train" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "!obj:pylearn2.train.Train {\n", " dataset: &train !obj:pylearn2.datasets.mnist.MNIST {\n", " which_set: 'train',\n", " start: 0,\n", " stop: 50000\n", " },\n", " model: !obj:pylearn2.models.mlp.MLP {\n", " layers: [\n", " !obj:pylearn2.models.mlp.Sigmoid {\n", " layer_name: 'h0',\n", " dim: 500,\n", " sparse_init: 15,\n", " }, !obj:pylearn2.models.mlp.Softmax {\n", " layer_name: 'y',\n", " n_classes: 10,\n", " irange: 0.\n", " }\n", " ],\n", " nvis: 784,\n", " },\n", " algorithm: !obj:pylearn2.training_algorithms.bgd.BGD {\n", " batch_size: 10000,\n", " line_search_mode: 'exhaustive',\n", " conjugate: 1,\n", " updates_per_batch: 10,\n", " monitoring_dataset:\n", " {\n", " 'train' : *train,\n", " 'valid' : !obj:pylearn2.datasets.mnist.MNIST {\n", " which_set: 'train',\n", " start: 50000,\n", " stop: 60000\n", " },\n", " 'test' : !obj:pylearn2.datasets.mnist.MNIST {\n", " which_set: 'test',\n", " }\n", " },\n", " termination_criterion: !obj:pylearn2.termination_criteria.And {\n", " criteria: [\n", " !obj:pylearn2.termination_criteria.MonitorBased {\n", " channel_name: \"valid_y_misclass\"\n", " },\n", " !obj:pylearn2.termination_criteria.EpochCounter {\n", " max_epochs: 10000\n", " }\n", " ]\n", " }\n", " },\n", " extensions: [\n", " 
!obj:pylearn2.train_extensions.best_params.MonitorBasedSaveBest {\n", " channel_name: 'valid_y_misclass',\n", " save_path: \"mlp_best.pkl\"\n", " },\n", " ]\n", "}\n" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we still do not specify a cost to be minimized. In the case of LogisticRegression, the model requested the negative log likelihood by default. In the case of the MLP, it is up to the final layer of the MLP to specify the default cost if the user does not provide one. In this case, since the final layer is a Softmax layer, we still have the same objective function as in the SoftmaxRegression tutorial.\n", "\n", "Now, we use pylearn2's yaml_parse.load to construct the Train object, and run its main loop. The same thing could be accomplished by running pylearn2's train.py script on a file containing the yaml string.\n", "\n", "Execute the next cell to train the model. This will take several minutes and possibly as much as a few hours depending on how fast your computer is. As the model trained, it should have printed out progress messages. Most of these are the values of the various channels being monitored throughout training. By running it on "mlp_best.pkl", we can see the performance of the model at the point where it did the best on the validation set. This is a big improvement over softmax regression. Here we use the show_weights script to visualize $W$: We're going to take the MLP example above and change it in three major ways:

- Instead of training just a two-layer MLP, we'll train a three-layer MLP. We can do this just by putting one more layer in the "layers" list. We don't need to change the training algorithm or the main MLP model.

- Instead of using the Sigmoid Layer class, we'll use a different kind of layer, called a rectified linear layer. The rectified linear layer uses the usual affine function $z = x^T W + b$ to compute the presynaptic inputs, then passes each element of $z$ through the function $g(z) = \mathbb{I}_{z > 0} z$. In other words, values greater than 0 are left unchanged, while negative values are replaced with zeros. In pylearn2, we can do this just by loading a different class in the layers list. We don't need to change the training algorithm or the main MLP model.

- Instead of optimizing the log likelihood using the nonlinear conjugate gradient descent algorithm, we will optimize it using a minibatch version of stochastic gradient descent. We can do this just by passing in a different TrainingAlgorithm object. No changes to the model or the code for the cost are needed.
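
To make the first two changes concrete, here is a minimal NumPy sketch of a rectified linear layer and of a forward pass through a list of layers. It is not pylearn2 code: the function names (`rectified_linear`, `softmax`, `forward`), the toy layer sizes, and the weight values are all assumptions chosen for illustration.

```python
import numpy as np

def rectified_linear(x, W, b):
    """Affine transform followed by elementwise rectification."""
    z = x.dot(W) + b              # presynaptic inputs, z = x^T W + b
    return np.maximum(z, 0.0)     # same result as I[z > 0] * z

def softmax(x, W, b):
    """Affine transform followed by a softmax over the outputs."""
    z = x.dot(W) + b
    e = np.exp(z - z.max())       # subtract the max for numerical stability
    return e / e.sum()

def forward(x, layers):
    """Apply each (function, W, b) triple in order, like a 'layers' list."""
    for fn, W, b in layers:
        x = fn(x, W, b)
    return x

# Hypothetical toy values: negative presynaptic inputs become zero.
x = np.array([1.0, -2.0])
W = np.array([[1.0, -1.0, 0.5],
              [0.5, 1.0, 0.25]])
b = np.array([0.5, 0.0, 1.0])
h = rectified_linear(x, W, b)     # z = [0.5, -3.0, 1.0]

# A three-layer model: two rectified linear layers, then softmax.
rng = np.random.RandomState(0)
layers = [
    (rectified_linear, 0.1 * rng.randn(4, 5), np.zeros(5)),
    (rectified_linear, 0.1 * rng.randn(5, 5), np.zeros(5)),
    (softmax, 0.1 * rng.randn(5, 3), np.zeros(3)),
]
p = forward(rng.randn(4), layers)  # a probability vector over 3 classes
```

Going from two layers to three only required adding one more entry to `layers`; `forward` itself did not change, which mirrors the point made in the list above.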

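The third change, minibatch stochastic gradient descent, can be sketched in plain NumPy as follows. This is only an illustration under stated assumptions, not pylearn2's SGD class: the helper names (`minibatch_sgd`, `squared_error_grad`), the noise-free linear regression problem, and the learning rate and batch size are all hypothetical, and the sketch omits momentum.

```python
import numpy as np

def minibatch_sgd(X, y, grad_fn, w, learning_rate=0.1,
                  batch_size=10, epochs=50, seed=0):
    """Update w using gradients computed on small random batches."""
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)            # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w = w - learning_rate * grad_fn(w, X[idx], y[idx])
    return w

def squared_error_grad(w, Xb, yb):
    """Gradient of 0.5 * mean((Xb w - yb)^2) with respect to w."""
    return Xb.T.dot(Xb.dot(w) - yb) / len(yb)

# Hypothetical toy problem: recover known weights from noise-free data.
rng = np.random.RandomState(1)
X = rng.randn(200, 3)
true_w = np.array([1.0, -2.0, 0.5])
y = X.dot(true_w)
w = minibatch_sgd(X, y, squared_error_grad, np.zeros(3))
```

The optimizer only sees the model through `grad_fn`, so swapping the optimization procedure does not require touching the model or the cost, which is the separation the list above describes.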
Here is the updated YAML description of the experiment: [ "\tvalid_h0_row_norms_min: 0.123597666621\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tvalid_h1_col_norms_max: 5.99608755112\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tvalid_h1_col_norms_mean: 3.85341596603\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tvalid_h1_col_norms_min: 1.72634136677\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tvalid_h1_row_norms_max: 8.52042388916\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tvalid_h1_row_norms_mean: 5.47620201111\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tvalid_h1_row_norms_min: 3.27072739601\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tvalid_objective: 0.140435069799\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tvalid_y_col_norms_max: 5.69501256943\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tvalid_y_col_norms_mean: 5.31268548965\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tvalid_y_col_norms_min: 4.74868249893\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tvalid_y_max_max_class: 0.999999344349\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tvalid_y_mean_max_class: 0.990495383739\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tvalid_y_min_max_class: 0.633842229843\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tvalid_y_misclass: 0.0265999827534\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tvalid_y_nll: 0.140435069799\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tvalid_y_row_norms_max: 1.60322284698\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tvalid_y_row_norms_mean: 0.500980615616\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tvalid_y_row_norms_min: 0.0168750006706\n" ] }, { 
"output_type": "stream", "stream": "stdout", "text": [ "Time this epoch: 3.211907 seconds\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Monitoring step:\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tEpochs seen: 17\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tBatches seen: 8500\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tExamples seen: 850000\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tlearning_rate: 0.00999999046326\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\tmomentum: 0.989998817444\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_h0_col_norms_max: 6.34764194489\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_h0_col_norms_mean: 4.21877479553\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_h0_col_norms_min: 2.23619961739\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_h0_row_norms_max: 6.5714468956\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_h0_row_norms_mean: 3.30228757858\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_h0_row_norms_min: 0.13643656671\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_h1_col_norms_max: 5.99594020844\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_h1_col_norms_mean: 3.85699319839\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_h1_col_norms_min: 1.72638630867\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_h1_row_norms_max: 8.61135101318\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_h1_row_norms_mean: 5.48117828369\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_h1_row_norms_min: 3.27077460289\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_objective: 0.152983635664\n" ] }, { 
"output_type": "stream", "stream": "stdout", "text": [ "\ttest_y_col_norms_max: 5.81860494614\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_y_col_norms_mean: 5.40938711166\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_y_col_norms_min: 4.81085681915\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_y_max_max_class: 0.999999344349\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_y_mean_max_class: 0.990412473679\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_y_min_max_class: 0.641472399235\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_y_misclass: 0.0277999881655\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_y_nll: 0.152983635664\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_y_row_norms_max: 1.66027259827\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\ttest_y_row_norms_mean: 0.509944438934\n" ] }, { "output_type": 