{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning Exercise 4 - Neural Networks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook covers a Python-based solution for the fourth programming exercise of the machine learning class on Coursera. Please refer to the [exercise text](https://github.com/jdwittenauer/ipython-notebooks/blob/master/exercises/ML/ex4.pdf) for detailed descriptions and equations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this exercise we'll again tackle the hand-written digits data set, this time using a feed-forward neural network with backpropagation. We'll implement un-regularized and regularized versions of the neural network cost function and gradient computation via the backpropagation algorithm. We'll also implement random weight initialization and a method to use the network to make predictions.\n", "\n", "Since the data set is the same one we used in exercise 3, we'll re-use the code to load the data." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'X': array([[ 0., 0., 0., ..., 0., 0., 0.],\n", " [ 0., 0., 0., ..., 0., 0., 0.],\n", " [ 0., 0., 0., ..., 0., 0., 0.],\n", " ..., \n", " [ 0., 0., 0., ..., 0., 0., 0.],\n", " [ 0., 0., 0., ..., 0., 0., 0.],\n", " [ 0., 0., 0., ..., 0., 0., 0.]]),\n", " '__globals__': [],\n", " '__header__': 'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Oct 16 13:09:09 2011',\n", " '__version__': '1.0',\n", " 'y': array([[10],\n", " [10],\n", " [10],\n", " ..., \n", " [ 9],\n", " [ 9],\n", " [ 9]], dtype=uint8)}" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from scipy.io import loadmat\n", "%matplotlib inline\n", "\n", "data = loadmat('data/ex3data1.mat')\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we're going to need these later (and will use them often), let's create some useful variables up-front." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "((5000L, 400L), (5000L, 1L))" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X = data['X']\n", "y = data['y']\n", "\n", "X.shape, y.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're also going to need to one-hot encode our y labels. One-hot encoding turns a class label n (out of k classes) into a vector of length k where index n is \"hot\" (1) while the rest are zero. Scikit-learn has a built in utility we can use for this." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(5000L, 10L)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import OneHotEncoder\n", "encoder = OneHotEncoder(sparse=False)\n", "y_onehot = encoder.fit_transform(y)\n", "y_onehot.shape" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(array([10], dtype=uint8),\n", " array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]))" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y[0], y_onehot[0,:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The neural network we're going to build for this exercise has an input layer matching the size of our instance data (400 + the bias unit), a hidden layer with 25 units (26 with the bias unit), and an output layer with 10 units corresponding to our one-hot encoding for the class labels. For additional details and an image of the network architecture, please refer to the PDF in the \"exercises\" folder.\n", "\n", "The first piece we need to implement is a cost function to evaluate the loss for a given set of network parameters. The source mathematical function is in the exercise text (and looks pretty intimidating). Here are the functions required to compute the cost." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def sigmoid(z):\n", " return 1 / (1 + np.exp(-z))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def forward_propagate(X, theta1, theta2):\n", " m = X.shape[0]\n", " \n", " a1 = np.insert(X, 0, values=np.ones(m), axis=1)\n", " z2 = a1 * theta1.T\n", " a2 = np.insert(sigmoid(z2), 0, values=np.ones(m), axis=1)\n", " z3 = a2 * theta2.T\n", " h = sigmoid(z3)\n", " \n", " return a1, z2, a2, z3, h" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def cost(params, input_size, hidden_size, num_labels, X, y, learning_rate):\n", " m = X.shape[0]\n", " X = np.matrix(X)\n", " y = np.matrix(y)\n", " \n", " # reshape the parameter array into parameter matrices for each layer\n", " theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))\n", " theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))\n", " \n", " # run the feed-forward pass\n", " a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)\n", " \n", " # compute the cost\n", " J = 0\n", " for i in range(m):\n", " first_term = np.multiply(-y[i,:], np.log(h[i,:]))\n", " second_term = np.multiply((1 - y[i,:]), np.log(1 - h[i,:]))\n", " J += np.sum(first_term - second_term)\n", " \n", " J = J / m\n", " \n", " return J" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've used the sigmoid function before so that's not new. The forward-propagate function computes the hypothesis for each training instance given the current parameters. It's output shape should match the same of our one-hot encoding for y. We can test this real quick to convince ourselves that it's working as expected (the intermediate steps are also returned as these will be useful later)." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "((25L, 401L), (10L, 26L))" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# initial setup\n", "input_size = 400\n", "hidden_size = 25\n", "num_labels = 10\n", "learning_rate = 1\n", "\n", "# randomly initialize a parameter array of the size of the full network's parameters\n", "params = (np.random.random(size=hidden_size * (input_size + 1) + num_labels * (hidden_size + 1)) - 0.5) * 0.25\n", "\n", "m = X.shape[0]\n", "X = np.matrix(X)\n", "y = np.matrix(y)\n", "\n", "# unravel the parameter array into parameter matrices for each layer\n", "theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))\n", "theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))\n", "\n", "theta1.shape, theta2.shape" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "((5000L, 401L), (5000L, 25L), (5000L, 26L), (5000L, 10L), (5000L, 10L))" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)\n", "a1.shape, z2.shape, a2.shape, z3.shape, h.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cost function, after computing the hypothesis matrix h, applies the cost equation to compute the total error between y and h." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "6.8228086634127862" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cost(params, input_size, hidden_size, num_labels, X, y_onehot, learning_rate)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our next step is to add regularization to the cost function. If you're following along in the exercise text and thought the last equation looked ugly, this one looks REALLY ugly. It's actually not as complicated as it looks though - in fact, the regularization term is simply an addition to the cost we already computed. Here's the revised cost function." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def cost(params, input_size, hidden_size, num_labels, X, y, learning_rate):\n", " m = X.shape[0]\n", " X = np.matrix(X)\n", " y = np.matrix(y)\n", " \n", " # reshape the parameter array into parameter matrices for each layer\n", " theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))\n", " theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))\n", " \n", " # run the feed-forward pass\n", " a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)\n", " \n", " # compute the cost\n", " J = 0\n", " for i in range(m):\n", " first_term = np.multiply(-y[i,:], np.log(h[i,:]))\n", " second_term = np.multiply((1 - y[i,:]), np.log(1 - h[i,:]))\n", " J += np.sum(first_term - second_term)\n", " \n", " J = J / m\n", " \n", " # add the cost regularization term\n", " J += (float(learning_rate) / (2 * m)) * (np.sum(np.power(theta1[:,1:], 2)) + np.sum(np.power(theta2[:,1:], 2)))\n", " \n", " return J" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "6.8281541822949299" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cost(params, input_size, hidden_size, num_labels, X, y_onehot, learning_rate)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next up is the backpropagation algorithm. Backpropagation computes the parameter updates that will reduce the error of the network on the training data. The first thing we need is a function that computes the gradient of the sigmoid function we created earlier." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def sigmoid_gradient(z):\n", " return np.multiply(sigmoid(z), (1 - sigmoid(z)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we're ready to implement backpropagation to compute the gradients. Since the computations required for backpropagation are a superset of those required in the cost function, we're actually going to extend the cost function to also perform backpropagation and return both the cost and the gradients." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def backprop(params, input_size, hidden_size, num_labels, X, y, learning_rate):\n", " m = X.shape[0]\n", " X = np.matrix(X)\n", " y = np.matrix(y)\n", " \n", " # reshape the parameter array into parameter matrices for each layer\n", " theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))\n", " theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))\n", " \n", " # run the feed-forward pass\n", " a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)\n", " \n", " # initializations\n", " J = 0\n", " delta1 = np.zeros(theta1.shape) # (25, 401)\n", " delta2 = np.zeros(theta2.shape) # (10, 26)\n", " \n", " # compute the cost\n", " for i in range(m):\n", " first_term = np.multiply(-y[i,:], np.log(h[i,:]))\n", " second_term = np.multiply((1 - y[i,:]), np.log(1 - h[i,:]))\n", " J += np.sum(first_term - second_term)\n", " \n", " J = J / m\n", " \n", " # add the cost regularization term\n", " J += (float(learning_rate) / (2 * m)) * (np.sum(np.power(theta1[:,1:], 2)) + np.sum(np.power(theta2[:,1:], 2)))\n", " \n", " # perform backpropagation\n", " for t in range(m):\n", " a1t = a1[t,:] # (1, 401)\n", " z2t = z2[t,:] # (1, 25)\n", " a2t = a2[t,:] # (1, 26)\n", " ht = h[t,:] # (1, 10)\n", " yt = y[t,:] # (1, 10)\n", " \n", " d3t = ht - yt # (1, 10)\n", " \n", " z2t = np.insert(z2t, 0, values=np.ones(1)) # (1, 26)\n", " d2t = np.multiply((theta2.T * d3t.T).T, sigmoid_gradient(z2t)) # (1, 26)\n", " \n", " delta1 = delta1 + (d2t[:,1:]).T * a1t\n", " delta2 = delta2 + d3t.T * a2t\n", " \n", " delta1 = delta1 / m\n", " delta2 = delta2 / m\n", " \n", " # unravel the gradient matrices into a single array\n", " grad = np.concatenate((np.ravel(delta1), np.ravel(delta2)))\n", " \n", " return J, grad" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The hardest part of the backprop computation (other than understanding WHY we're doing all these calculations) is getting the matrix dimensions right. By the way, if you find it confusing when to use A * B vs. np.multiply(A, B), you're not alone. Basically the former is a matrix multiplication and the latter is an element-wise multiplication (unless A or B is a scalar value, in which case it doesn't matter). I wish there was a more concise syntax for this (maybe there is and I'm just not aware of it).\n", "\n", "Anyway, let's test it out to make sure the function returns what we're expecting it to." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(6.8281541822949299, (10285L,))" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "J, grad = backprop(params, input_size, hidden_size, num_labels, X, y_onehot, learning_rate)\n", "J, grad.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We still have one more modification to make to the backprop function - adding regularization to the gradient calculations. The final regularized version is below." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def backprop(params, input_size, hidden_size, num_labels, X, y, learning_rate):\n", " m = X.shape[0]\n", " X = np.matrix(X)\n", " y = np.matrix(y)\n", " \n", " # reshape the parameter array into parameter matrices for each layer\n", " theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))\n", " theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))\n", " \n", " # run the feed-forward pass\n", " a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)\n", " \n", " # initializations\n", " J = 0\n", " delta1 = np.zeros(theta1.shape) # (25, 401)\n", " delta2 = np.zeros(theta2.shape) # (10, 26)\n", " \n", " # compute the cost\n", " for i in range(m):\n", " first_term = np.multiply(-y[i,:], np.log(h[i,:]))\n", " second_term = np.multiply((1 - y[i,:]), np.log(1 - h[i,:]))\n", " J += np.sum(first_term - second_term)\n", " \n", " J = J / m\n", " \n", " # add the cost regularization term\n", " J += (float(learning_rate) / (2 * m)) * (np.sum(np.power(theta1[:,1:], 2)) + np.sum(np.power(theta2[:,1:], 2)))\n", " \n", " # perform backpropagation\n", " for t in range(m):\n", " a1t = a1[t,:] # (1, 401)\n", " z2t = z2[t,:] # (1, 25)\n", " a2t = a2[t,:] # (1, 26)\n", " ht = h[t,:] # (1, 10)\n", " yt = y[t,:] # (1, 10)\n", " \n", " d3t = ht - yt # (1, 10)\n", " \n", " z2t = np.insert(z2t, 0, values=np.ones(1)) # (1, 26)\n", " d2t = np.multiply((theta2.T * d3t.T).T, sigmoid_gradient(z2t)) # (1, 26)\n", " \n", " delta1 = delta1 + (d2t[:,1:]).T * a1t\n", " delta2 = delta2 + d3t.T * a2t\n", " \n", " delta1 = delta1 / m\n", " delta2 = delta2 / m\n", " \n", " # add the gradient regularization term\n", " delta1[:,1:] = delta1[:,1:] + (theta1[:,1:] * learning_rate) / m\n", " delta2[:,1:] = delta2[:,1:] + (theta2[:,1:] * learning_rate) / m\n", " \n", " # unravel the gradient matrices into a single array\n", " grad = np.concatenate((np.ravel(delta1), np.ravel(delta2)))\n", " \n", " return J, grad" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(6.8281541822949299, (10285L,))" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "J, grad = backprop(params, input_size, hidden_size, num_labels, X, y_onehot, learning_rate)\n", "J, grad.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're finally ready to train our network and use it to make predictions. This is roughly similar to the previous exercise with multi-class logistic regression." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " status: 3\n", " success: False\n", " nfev: 250\n", " fun: 0.33900736818312283\n", " x: array([ -8.85740564e-01, 2.57420350e-04, -4.09396202e-04, ...,\n", " 1.44634791e+00, 1.68974302e+00, 7.10121593e-01])\n", " message: 'Max. 
number of function evaluations reach'\n", " jac: array([ -5.11463703e-04, 5.14840700e-08, -8.18792403e-08, ...,\n", " -2.48297749e-04, -3.17870911e-04, -3.31404592e-04])\n", " nit: 21" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scipy.optimize import minimize\n", "\n", "# minimize the objective function\n", "fmin = minimize(fun=backprop, x0=params, args=(input_size, hidden_size, num_labels, X, y_onehot, learning_rate), \n", " method='TNC', jac=True, options={'maxiter': 250})\n", "fmin" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We put a bound on the number of iterations since the objective function is unlikely to converge completely. Our total cost has dropped below 0.5, though, which is a good indicator that the algorithm is working. Let's use the parameters it found and forward-propagate them through the network to get some predictions." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[10],\n", " [10],\n", " [10],\n", " ..., \n", " [ 9],\n", " [ 9],\n", " [ 9]], dtype=int64)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X = np.matrix(X)\n", "theta1 = np.matrix(np.reshape(fmin.x[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))\n", "theta2 = np.matrix(np.reshape(fmin.x[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))\n", "\n", "a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)\n", "y_pred = np.array(np.argmax(h, axis=1) + 1)\n", "y_pred" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can compute the accuracy to see how well our trained network is doing. Note that we're evaluating on the same data we trained on, so this is training accuracy rather than an estimate of performance on unseen data." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "accuracy = 99.22%\n" ] } ], "source": [ "correct = [1 if a == b else 0 for (a, b) in zip(y_pred, y)]\n", "accuracy = (sum(map(int, correct)) / float(len(correct)))\n", "print 'accuracy = {0}%'.format(accuracy * 100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we're done! We've successfully implemented a rudimentary feed-forward neural network with backpropagation and used it to classify images of handwritten digits. In the next exercise we'll look at another powerful supervised learning algorithm, support vector machines." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.9" } }, "nbformat": 4, "nbformat_minor": 0 }