{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Programming Exercise 3\n",
    "# Multi-class Classification and Neural Networks\n",
    "\n",
    "## Introduction\n",
    "\n",
    "\n",
    "In this exercise, you will implement one-vs-all logistic regression and neural networks to recognize handwritten digits. Before starting the programming exercise, we strongly recommend watching the video lectures and completing the review questions for the associated topics. \n",
    "\n",
    "All the information you need for solving this assignment is in this notebook, and all the code you will be implementing will take place within this notebook. The assignment can be promptly submitted to the coursera grader directly from this notebook (code and instructions are included below).\n",
    "\n",
    "Before we begin with the exercises, we need to import all libraries required for this programming exercise. Throughout the course, we will be using [`numpy`](http://www.numpy.org/) for all arrays and matrix operations, [`matplotlib`](https://matplotlib.org/) for plotting, and [`scipy`](https://docs.scipy.org/doc/scipy/reference/) for scientific and numerical computation functions and tools. You can find instructions on how to install required libraries in the README file in the [github repository](https://github.com/dibgerge/ml-coursera-python-assignments)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# used for manipulating directory paths\n",
    "import os\n",
    "\n",
    "# Scientific and vector computation for python\n",
    "import numpy as np\n",
    "\n",
    "# Plotting library\n",
    "from matplotlib import pyplot\n",
    "\n",
    "# Optimization module in scipy\n",
    "from scipy import optimize\n",
    "\n",
    "# will be used to load MATLAB mat datafile format\n",
    "from scipy.io import loadmat\n",
    "\n",
    "# library written for this exercise providing additional functions for assignment submission, and others\n",
    "import utils\n",
    "\n",
    "# define the submission/grader object for this exercise\n",
    "grader = utils.Grader()\n",
    "\n",
    "# tells matplotlib to embed plots within the notebook\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Submission and Grading\n",
    "\n",
    "\n",
    "After completing each part of the assignment, be sure to submit your solutions to the grader. The following is a breakdown of how each part of this exercise is scored.\n",
    "\n",
    "\n",
    "| Section | Part                                 | Submission function                   |  Points \n",
    "| :-      |:-                                    | :-                                    |  :-:    \n",
    "| 1       | [Regularized Logistic Regression](#section1)     | [`lrCostFunction`](#lrCostFunction)   | 30     \n",
    "| 2       | [One-vs-all classifier training](#section2)       | [`oneVsAll`](#oneVsAll)               | 20     \n",
    "| 3       | [One-vs-all classifier prediction](#section3)     | [`predictOneVsAll`](#predictOneVsAll) | 20     \n",
    "| 4       | [Neural Network Prediction Function](#section4)   | [`predict`](#predict)           | 30\n",
    "|         | Total Points                         |                                 | 100    \n",
    "\n",
    "\n",
    "You are allowed to submit your solutions multiple times, and we will take only the highest score into consideration.\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "At the end of each section in this notebook, we have a cell which contains code for submitting the solutions thus far to the grader. Execute the cell to see your score up to the current section. For all your work to be submitted properly, you must execute those cells at least once. They must also be re-executed everytime the submitted function is updated.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1 Multi-class Classification\n",
    "\n",
    "For this exercise, you will use logistic regression and neural networks to recognize handwritten digits (from 0 to 9). Automated handwritten digit recognition is widely used today - from recognizing zip codes (postal codes)\n",
    "on mail envelopes to recognizing amounts written on bank checks. This exercise will show you how the methods you have learned can be used for this classification task.\n",
    "\n",
    "In the first part of the exercise, you will extend your previous implementation of logistic regression and apply it to one-vs-all classification.\n",
    "\n",
    "### 1.1 Dataset\n",
    "\n",
    "You are given a data set in `ex3data1.mat` that contains 5000 training examples of handwritten digits (This is a subset of the [MNIST](http://yann.lecun.com/exdb/mnist) handwritten digit dataset). The `.mat` format means that that the data has been saved in a native Octave/MATLAB matrix format, instead of a text (ASCII) format like a csv-file. We use the `.mat` format here because this is the dataset provided in the MATLAB version of this assignment. Fortunately, python provides mechanisms to load MATLAB native format using the `loadmat` function within the `scipy.io` module. This function returns a python dictionary with keys containing the variable names within the `.mat` file. \n",
    "\n",
    "There are 5000 training examples in `ex3data1.mat`, where each training example is a 20 pixel by 20 pixel grayscale image of the digit. Each pixel is represented by a floating point number indicating the grayscale intensity at that location. The 20 by 20 grid of pixels is “unrolled” into a 400-dimensional vector. Each of these training examples becomes a single row in our data matrix `X`. This gives us a 5000 by 400 matrix `X` where every row is a training example for a handwritten digit image.\n",
    "\n",
    "$$ X = \\begin{bmatrix} - \\: (x^{(1)})^T \\: - \\\\ -\\: (x^{(2)})^T \\:- \\\\ \\vdots \\\\ - \\: (x^{(m)})^T \\:-  \\end{bmatrix} $$\n",
    "\n",
    "The second part of the training set is a 5000-dimensional vector `y` that contains labels for the training set. \n",
    "We start the exercise by first loading the dataset. Execute the cell below, you do not need to write any code here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 20x20 Input Images of Digits\n",
    "input_layer_size  = 400\n",
    "\n",
    "# 10 labels, from 1 to 10 (note that we have mapped \"0\" to label 10)\n",
    "num_labels = 10\n",
    "\n",
    "#  training data stored in arrays X, y\n",
    "data = loadmat(os.path.join('Data', 'ex3data1.mat'))\n",
    "X, y = data['X'], data['y'].ravel()\n",
    "\n",
    "# set the zero digit to 0, rather than its mapped 10 in this dataset\n",
    "# This is an artifact due to the fact that this dataset was used in \n",
    "# MATLAB where there is no index 0\n",
    "y[y == 10] = 0\n",
    "\n",
    "m = y.size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2 Visualizing the data\n",
    "\n",
    "You will begin by visualizing a subset of the training set. In the following cell, the code randomly selects selects 100 rows from `X` and passes those rows to the `displayData` function. This function maps each row to a 20 pixel by 20 pixel grayscale image and displays the images together. We have provided the `displayData` function in the file `utils.py`. You are encouraged to examine the code to see how it works. Run the following cell to visualize the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Randomly select 100 data points to display\n",
    "rand_indices = np.random.choice(m, 100, replace=False)\n",
    "sel = X[rand_indices, :]\n",
    "\n",
    "utils.displayData(sel)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### 1.3 Vectorizing Logistic Regression\n",
    "\n",
    "You will be using multiple one-vs-all logistic regression models to build a multi-class classifier. Since there are 10 classes, you will need to train 10 separate logistic regression classifiers. To make this training efficient, it is important to ensure that your code is well vectorized. In this section, you will implement a vectorized version of logistic regression that does not employ any `for` loops. You can use your code in the previous exercise as a starting point for this exercise. \n",
    "\n",
    "To test your vectorized logistic regression, we will use custom data as defined in the following cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# test values for the parameters theta\n",
    "theta_t = np.array([-2, -1, 1, 2], dtype=float)\n",
    "\n",
    "# test values for the inputs\n",
    "X_t = np.concatenate([np.ones((5, 1)), np.arange(1, 16).reshape(5, 3, order='F')/10.0], axis=1)\n",
    "\n",
    "# test values for the labels\n",
    "y_t = np.array([1, 0, 1, 0, 1])\n",
    "\n",
    "# test value for the regularization parameter\n",
    "lambda_t = 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"section1\"></a>\n",
    "#### 1.3.1 Vectorizing the cost function \n",
    "\n",
    "We will begin by writing a vectorized version of the cost function. Recall that in (unregularized) logistic regression, the cost function is\n",
    "\n",
    "$$ J(\\theta) = \\frac{1}{m} \\sum_{i=1}^m \\left[ -y^{(i)} \\log \\left( h_\\theta\\left( x^{(i)} \\right) \\right) - \\left(1 - y^{(i)} \\right) \\log \\left(1 - h_\\theta \\left( x^{(i)} \\right) \\right) \\right] $$\n",
    "\n",
    "To compute each element in the summation, we have to compute $h_\\theta(x^{(i)})$ for every example $i$, where $h_\\theta(x^{(i)}) = g(\\theta^T x^{(i)})$ and $g(z) = \\frac{1}{1+e^{-z}}$ is the sigmoid function. It turns out that we can compute this quickly for all our examples by using matrix multiplication. Let us define $X$ and $\\theta$ as\n",
    "\n",
    "$$ X = \\begin{bmatrix} - \\left( x^{(1)} \\right)^T - \\\\ - \\left( x^{(2)} \\right)^T - \\\\ \\vdots \\\\ - \\left( x^{(m)} \\right)^T - \\end{bmatrix} \\qquad \\text{and} \\qquad \\theta = \\begin{bmatrix} \\theta_0 \\\\ \\theta_1 \\\\ \\vdots \\\\ \\theta_n \\end{bmatrix} $$\n",
    "\n",
    "Then, by computing the matrix product $X\\theta$, we have: \n",
    "\n",
    "$$ X\\theta = \\begin{bmatrix} - \\left( x^{(1)} \\right)^T\\theta - \\\\ - \\left( x^{(2)} \\right)^T\\theta - \\\\ \\vdots \\\\ - \\left( x^{(m)} \\right)^T\\theta - \\end{bmatrix} = \\begin{bmatrix} - \\theta^T x^{(1)}  - \\\\ - \\theta^T x^{(2)} - \\\\ \\vdots \\\\ - \\theta^T x^{(m)}  - \\end{bmatrix} $$\n",
    "\n",
    "In the last equality, we used the fact that $a^Tb = b^Ta$ if $a$ and $b$ are vectors. This allows us to compute the products $\\theta^T x^{(i)}$ for all our examples $i$ in one line of code.\n",
    "\n",
    "#### 1.3.2 Vectorizing the gradient\n",
    "\n",
    "Recall that the gradient of the (unregularized) logistic regression cost is a vector where the $j^{th}$ element is defined as\n",
    "\n",
    "$$ \\frac{\\partial J }{\\partial \\theta_j} = \\frac{1}{m} \\sum_{i=1}^m \\left( \\left( h_\\theta\\left(x^{(i)}\\right) - y^{(i)} \\right)x_j^{(i)} \\right) $$\n",
    "\n",
    "To vectorize this operation over the dataset, we start by writing out all the partial derivatives explicitly for all $\\theta_j$,\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "\\begin{bmatrix} \n",
    "\\frac{\\partial J}{\\partial \\theta_0} \\\\\n",
    "\\frac{\\partial J}{\\partial \\theta_1} \\\\\n",
    "\\frac{\\partial J}{\\partial \\theta_2} \\\\\n",
    "\\vdots \\\\\n",
    "\\frac{\\partial J}{\\partial \\theta_n}\n",
    "\\end{bmatrix} = &\n",
    "\\frac{1}{m} \\begin{bmatrix}\n",
    "\\sum_{i=1}^m \\left( \\left(h_\\theta\\left(x^{(i)}\\right) - y^{(i)} \\right)x_0^{(i)}\\right) \\\\\n",
    "\\sum_{i=1}^m \\left( \\left(h_\\theta\\left(x^{(i)}\\right) - y^{(i)} \\right)x_1^{(i)}\\right) \\\\\n",
    "\\sum_{i=1}^m \\left( \\left(h_\\theta\\left(x^{(i)}\\right) - y^{(i)} \\right)x_2^{(i)}\\right) \\\\\n",
    "\\vdots \\\\\n",
    "\\sum_{i=1}^m \\left( \\left(h_\\theta\\left(x^{(i)}\\right) - y^{(i)} \\right)x_n^{(i)}\\right) \\\\\n",
    "\\end{bmatrix} \\\\\n",
    "= & \\frac{1}{m} \\sum_{i=1}^m \\left( \\left(h_\\theta\\left(x^{(i)}\\right) - y^{(i)} \\right)x^{(i)}\\right) \\\\\n",
    "= & \\frac{1}{m} X^T \\left( h_\\theta(x) - y\\right)\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "where\n",
    "\n",
    "$$  h_\\theta(x) - y = \n",
    "\\begin{bmatrix}\n",
    "h_\\theta\\left(x^{(1)}\\right) - y^{(1)} \\\\\n",
    "h_\\theta\\left(x^{(2)}\\right) - y^{(2)} \\\\\n",
    "\\vdots \\\\\n",
    "h_\\theta\\left(x^{(m)}\\right) - y^{(m)} \n",
    "\\end{bmatrix} $$\n",
    "\n",
    "Note that $x^{(i)}$ is a vector, while $h_\\theta\\left(x^{(i)}\\right) - y^{(i)}$  is a scalar (single number).\n",
    "To understand the last step of the derivation, let $\\beta_i = (h_\\theta\\left(x^{(m)}\\right) - y^{(m)})$ and\n",
    "observe that:\n",
    "\n",
    "$$ \\sum_i \\beta_ix^{(i)} = \\begin{bmatrix} \n",
    "| & | & & | \\\\\n",
    "x^{(1)} & x^{(2)} & \\cdots & x^{(m)} \\\\\n",
    "| & | & & | \n",
    "\\end{bmatrix}\n",
    "\\begin{bmatrix}\n",
    "\\beta_1 \\\\\n",
    "\\beta_2 \\\\\n",
    "\\vdots \\\\\n",
    "\\beta_m\n",
    "\\end{bmatrix} = x^T \\beta\n",
    "$$\n",
    "\n",
    "where the values $\\beta_i = \\left( h_\\theta(x^{(i)} - y^{(i)} \\right)$.\n",
    "\n",
    "The expression above allows us to compute all the partial derivatives\n",
    "without any loops. If you are comfortable with linear algebra, we encourage you to work through the matrix multiplications above to convince yourself that the vectorized version does the same computations. \n",
    "\n",
    "Your job is to write the unregularized cost function `lrCostFunction` which returns both the cost function $J(\\theta)$ and its gradient $\\frac{\\partial J}{\\partial \\theta}$. Your implementation should use the strategy we presented above to calculate $\\theta^T x^{(i)}$. You should also use a vectorized approach for the rest of the cost function. A fully vectorized version of `lrCostFunction` should not contain any loops.\n",
    "\n",
    "<div class=\"alert alert-box alert-warning\">\n",
    "**Debugging Tip:** Vectorizing code can sometimes be tricky. One common strategy for debugging is to print out the sizes of the matrices you are working with using the `shape` property of `numpy` arrays. For example, given a data matrix $X$ of size $100 \\times 20$ (100 examples, 20 features) and $\\theta$, a vector with size $20$, you can observe that `np.dot(X, theta)` is a valid multiplication operation, while `np.dot(theta, X)` is not. Furthermore, if you have a non-vectorized version of your code, you can compare the output of your vectorized code and non-vectorized code to make sure that they produce the same outputs.\n",
    "</div>\n",
    "<a id=\"lrCostFunction\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def lrCostFunction(theta, X, y, lambda_):\n",
    "    \"\"\"\n",
    "    Computes the cost of using theta as the parameter for regularized\n",
    "    logistic regression and the gradient of the cost w.r.t. to the parameters.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    theta : array_like\n",
    "        Logistic regression parameters. A vector with shape (n, ). n is \n",
    "        the number of features including any intercept.  \n",
    "    \n",
    "    X : array_like\n",
    "        The data set with shape (m x n). m is the number of examples, and\n",
    "        n is the number of features (including intercept).\n",
    "    \n",
    "    y : array_like\n",
    "        The data labels. A vector with shape (m, ).\n",
    "    \n",
    "    lambda_ : float\n",
    "        The regularization parameter. \n",
    "    \n",
    "    Returns\n",
    "    -------\n",
    "    J : float\n",
    "        The computed value for the regularized cost function. \n",
    "    \n",
    "    grad : array_like\n",
    "        A vector of shape (n, ) which is the gradient of the cost\n",
    "        function with respect to theta, at the current values of theta.\n",
    "    \n",
    "    Instructions\n",
    "    ------------\n",
    "    Compute the cost of a particular choice of theta. You should set J to the cost.\n",
    "    Compute the partial derivatives and set grad to the partial\n",
    "    derivatives of the cost w.r.t. each parameter in theta\n",
    "    \n",
    "    Hint 1\n",
    "    ------\n",
    "    The computation of the cost function and gradients can be efficiently\n",
    "    vectorized. For example, consider the computation\n",
    "    \n",
    "        sigmoid(X * theta)\n",
    "    \n",
    "    Each row of the resulting matrix will contain the value of the prediction\n",
    "    for that example. You can make use of this to vectorize the cost function\n",
    "    and gradient computations. \n",
    "    \n",
    "    Hint 2\n",
    "    ------\n",
    "    When computing the gradient of the regularized cost function, there are\n",
    "    many possible vectorized solutions, but one solution looks like:\n",
    "    \n",
    "        grad = (unregularized gradient for logistic regression)\n",
    "        temp = theta \n",
    "        temp[0] = 0   # because we don't add anything for j = 0\n",
    "        grad = grad + YOUR_CODE_HERE (using the temp variable)\n",
    "    \n",
    "    Hint 3\n",
    "    ------\n",
    "    We have provided the implementatation of the sigmoid function within \n",
    "    the file `utils.py`. At the start of the notebook, we imported this file\n",
    "    as a module. Thus to access the sigmoid function within that file, you can\n",
    "    do the following: `utils.sigmoid(z)`.\n",
    "    \n",
    "    \"\"\"\n",
    "    #Initialize some useful values\n",
    "    m = y.size\n",
    "    \n",
    "    # convert labels to ints if their type is bool\n",
    "    if y.dtype == bool:\n",
    "        y = y.astype(int)\n",
    "    \n",
    "    # You need to return the following variables correctly\n",
    "    J = 0\n",
    "    grad = np.zeros(theta.shape)\n",
    "    \n",
    "    # ====================== YOUR CODE HERE ======================\n",
    "\n",
    "\n",
    "        \n",
    "    # =============================================================\n",
    "    return J, grad"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.3.3 Vectorizing regularized logistic regression\n",
    "\n",
    "After you have implemented vectorization for logistic regression, you will now\n",
    "add regularization to the cost function. Recall that for regularized logistic\n",
    "regression, the cost function is defined as\n",
    "\n",
    "$$ J(\\theta) = \\frac{1}{m} \\sum_{i=1}^m \\left[ -y^{(i)} \\log \\left(h_\\theta\\left(x^{(i)} \\right)\\right) - \\left( 1 - y^{(i)} \\right) \\log\\left(1 - h_\\theta \\left(x^{(i)} \\right) \\right) \\right] + \\frac{\\lambda}{2m} \\sum_{j=1}^n \\theta_j^2 $$\n",
    "\n",
    "Note that you should not be regularizing $\\theta_0$ which is used for the bias term.\n",
    "Correspondingly, the partial derivative of regularized logistic regression cost for $\\theta_j$ is defined as\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "& \\frac{\\partial J(\\theta)}{\\partial \\theta_0} = \\frac{1}{m} \\sum_{i=1}^m \\left( h_\\theta\\left( x^{(i)} \\right) - y^{(i)} \\right) x_j^{(i)}  & \\text{for } j = 0 \\\\\n",
    "& \\frac{\\partial J(\\theta)}{\\partial \\theta_0} = \\left( \\frac{1}{m} \\sum_{i=1}^m \\left( h_\\theta\\left( x^{(i)} \\right) - y^{(i)} \\right) x_j^{(i)} \\right) + \\frac{\\lambda}{m} \\theta_j & \\text{for } j  \\ge 1\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "Now modify your code in lrCostFunction in the [**previous cell**](#lrCostFunction) to account for regularization. Once again, you should not put any loops into your code.\n",
    "\n",
    "<div class=\"alert alert-box alert-warning\">\n",
    "**python/numpy Tip:** When implementing the vectorization for regularized logistic regression, you might often want to only sum and update certain elements of $\\theta$. In `numpy`, you can index into the matrices to access and update only certain elements. For example, A[:, 3:5]\n",
    "= B[:, 1:3] will replaces the columns with index 3 to 5 of A with the columns with index 1 to 3 from B. To select columns (or rows) until the end of the matrix, you can leave the right hand side of the colon blank. For example, A[:, 2:] will only return elements from the $3^{rd}$ to last columns of $A$. If you leave the left hand size of the colon blank, you will select elements from the beginning of the matrix. For example, A[:, :2] selects the first two columns, and is equivalent to A[:, 0:2]. In addition, you can use negative indices to index arrays from the end. Thus, A[:, :-1] selects all columns of A except the last column, and A[:, -5:] selects the $5^{th}$ column from the end to the last column. Thus, you could use this together with the sum and power ($^{**}$) operations to compute the sum of only the elements you are interested in (e.g., `np.sum(z[1:]**2)`). In the starter code, `lrCostFunction`, we have also provided hints on yet another possible method computing the regularized gradient.\n",
    "</div>\n",
    "\n",
    "Once you finished your implementation, you can call the function `lrCostFunction` to test your solution using the following cell:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "J, grad = lrCostFunction(theta_t, X_t, y_t, lambda_t)\n",
    "\n",
    "print('Cost         : {:.6f}'.format(J))\n",
    "print('Expected cost: 2.534819')\n",
    "print('-----------------------')\n",
    "print('Gradients:')\n",
    "print(' [{:.6f}, {:.6f}, {:.6f}, {:.6f}]'.format(*grad))\n",
    "print('Expected gradients:')\n",
    "print(' [0.146561, -0.548558, 0.724722, 1.398003]');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After completing a part of the exercise, you can submit your solutions for grading by first adding the function you modified to the submission object, and then sending your function to Coursera for grading. \n",
    "\n",
    "The submission script will prompt you for your login e-mail and submission token. You can obtain a submission token from the web page for the assignment. You are allowed to submit your solutions multiple times, and we will take only the highest score into consideration.\n",
    "\n",
    "*Execute the following cell to grade your solution to the first part of this exercise.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# appends the implemented function in part 1 to the grader object\n",
    "grader[1] = lrCostFunction\n",
    "\n",
    "# send the added functions to coursera grader for getting a grade on this part\n",
    "grader.grade()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"section2\"></a>\n",
    "### 1.4 One-vs-all Classification\n",
    "\n",
    "In this part of the exercise, you will implement one-vs-all classification by training multiple regularized logistic regression classifiers, one for each of the $K$ classes in our dataset. In the handwritten digits dataset, $K = 10$, but your code should work for any value of $K$. \n",
    "\n",
    "You should now complete the code for the function `oneVsAll` below, to train one classifier for each class. In particular, your code should return all the classifier parameters in a matrix $\\theta \\in \\mathbb{R}^{K \\times (N +1)}$, where each row of $\\theta$ corresponds to the learned logistic regression parameters for one class. You can do this with a “for”-loop from $0$ to $K-1$, training each classifier independently.\n",
    "\n",
    "Note that the `y` argument to this function is a vector of labels from 0 to 9. When training the classifier for class $k \\in \\{0, ..., K-1\\}$, you will want a K-dimensional vector of labels $y$, where $y_j \\in 0, 1$ indicates whether the $j^{th}$ training instance belongs to class $k$ $(y_j = 1)$, or if it belongs to a different\n",
    "class $(y_j = 0)$. You may find logical arrays helpful for this task. \n",
    "\n",
    "Furthermore, you will be using scipy's `optimize.minimize` for this exercise. \n",
    "<a id=\"oneVsAll\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def oneVsAll(X, y, num_labels, lambda_):\n",
    "    \"\"\"\n",
    "    Trains num_labels logistic regression classifiers and returns\n",
    "    each of these classifiers in a matrix all_theta, where the i-th\n",
    "    row of all_theta corresponds to the classifier for label i.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    X : array_like\n",
    "        The input dataset of shape (m x n). m is the number of \n",
    "        data points, and n is the number of features. Note that we \n",
    "        do not assume that the intercept term (or bias) is in X, however\n",
    "        we provide the code below to add the bias term to X. \n",
    "    \n",
    "    y : array_like\n",
    "        The data labels. A vector of shape (m, ).\n",
    "    \n",
    "    num_labels : int\n",
    "        Number of possible labels.\n",
    "    \n",
    "    lambda_ : float\n",
    "        The logistic regularization parameter.\n",
    "    \n",
    "    Returns\n",
    "    -------\n",
    "    all_theta : array_like\n",
    "        The trained parameters for logistic regression for each class.\n",
    "        This is a matrix of shape (K x n+1) where K is number of classes\n",
    "        (ie. `numlabels`) and n is number of features without the bias.\n",
    "    \n",
    "    Instructions\n",
    "    ------------\n",
    "    You should complete the following code to train `num_labels`\n",
    "    logistic regression classifiers with regularization parameter `lambda_`. \n",
    "    \n",
    "    Hint\n",
    "    ----\n",
    "    You can use y == c to obtain a vector of 1's and 0's that tell you\n",
    "    whether the ground truth is true/false for this class.\n",
    "    \n",
    "    Note\n",
    "    ----\n",
    "    For this assignment, we recommend using `scipy.optimize.minimize(method='CG')`\n",
    "    to optimize the cost function. It is okay to use a for-loop \n",
    "    (`for c in range(num_labels):`) to loop over the different classes.\n",
    "    \n",
    "    Example Code\n",
    "    ------------\n",
    "    \n",
    "        # Set Initial theta\n",
    "        initial_theta = np.zeros(n + 1)\n",
    "      \n",
    "        # Set options for minimize\n",
    "        options = {'maxiter': 50}\n",
    "    \n",
    "        # Run minimize to obtain the optimal theta. This function will \n",
    "        # return a class object where theta is in `res.x` and cost in `res.fun`\n",
    "        res = optimize.minimize(lrCostFunction, \n",
    "                                initial_theta, \n",
    "                                (X, (y == c), lambda_), \n",
    "                                jac=True, \n",
    "                                method='TNC',\n",
    "                                options=options) \n",
    "    \"\"\"\n",
    "    # Some useful variables\n",
    "    m, n = X.shape\n",
    "    \n",
    "    # You need to return the following variables correctly \n",
    "    all_theta = np.zeros((num_labels, n + 1))\n",
    "\n",
    "    # Add ones to the X data matrix\n",
    "    X = np.concatenate([np.ones((m, 1)), X], axis=1)\n",
    "\n",
    "    # ====================== YOUR CODE HERE ======================\n",
    "   \n",
    "\n",
    "\n",
    "    # ============================================================\n",
    "    return all_theta"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After you have completed the code for `oneVsAll`, the following cell will use your implementation to train a multi-class classifier. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lambda_ = 0.1\n",
    "all_theta = oneVsAll(X, y, num_labels, lambda_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*You should now submit your solutions.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grader[2] = oneVsAll\n",
    "grader.grade()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"section3\"></a>\n",
    "#### 1.4.1 One-vs-all Prediction\n",
    "\n",
    "After training your one-vs-all classifier, you can now use it to predict the digit contained in a given image. For each input, you should compute the “probability” that it belongs to each class using the trained logistic regression classifiers. Your one-vs-all prediction function will pick the class for which the corresponding logistic regression classifier outputs the highest probability and return the class label (0, 1, ..., K-1) as the prediction for the input example. You should now complete the code in the function `predictOneVsAll` to use the one-vs-all classifier for making predictions. \n",
    "<a id=\"predictOneVsAll\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def predictOneVsAll(all_theta, X):\n",
    "    \"\"\"\n",
    "    Return a vector of predictions for each example in the matrix X. \n",
    "    Note that X contains the examples in rows. all_theta is a matrix where\n",
    "    the i-th row is a trained logistic regression theta vector for the \n",
    "    i-th class. You should set p to a vector of values from 0..K-1 \n",
    "    (e.g., p = [0, 2, 0, 1] predicts classes 0, 2, 0, 1 for 4 examples) .\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    all_theta : array_like\n",
    "        The trained parameters for logistic regression for each class.\n",
    "        This is a matrix of shape (K x n+1) where K is number of classes\n",
    "        and n is number of features without the bias.\n",
    "    \n",
    "    X : array_like\n",
    "        Data points to predict their labels. This is a matrix of shape \n",
    "        (m x n) where m is number of data points to predict, and n is number \n",
    "        of features without the bias term. Note we add the bias term for X in \n",
    "        this function. \n",
    "    \n",
    "    Returns\n",
    "    -------\n",
    "    p : array_like\n",
    "        The predictions for each data point in X. This is a vector of shape (m, ).\n",
    "    \n",
    "    Instructions\n",
    "    ------------\n",
    "    Complete the following code to make predictions using your learned logistic\n",
    "    regression parameters (one-vs-all). You should set p to a vector of predictions\n",
    "    (from 0 to num_labels-1).\n",
    "    \n",
    "    Hint\n",
    "    ----\n",
    "    This code can be done all vectorized using the numpy argmax function.\n",
    "    In particular, the argmax function returns the index of the max element,\n",
    "    for more information see '?np.argmax' or search online. If your examples\n",
    "    are in rows, then, you can use np.argmax(A, axis=1) to obtain the index \n",
    "    of the max for each row.\n",
    "    \"\"\"\n",
    "    m = X.shape[0];\n",
    "    num_labels = all_theta.shape[0]\n",
    "\n",
    "    # You need to return the following variables correctly \n",
    "    p = np.zeros(m)\n",
    "\n",
    "    # Add ones to the X data matrix\n",
    "    X = np.concatenate([np.ones((m, 1)), X], axis=1)\n",
    "\n",
    "    # ====================== YOUR CODE HERE ======================\n",
    "\n",
    "\n",
    "    \n",
    "    # ============================================================\n",
    "    return p"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once you are done, call your `predictOneVsAll` function using the learned value of $\\theta$. You should see that the training set accuracy is about 95.1% (i.e., it classifies 95.1% of the examples in the training set correctly)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pred = predictOneVsAll(all_theta, X)\n",
    "print('Training Set Accuracy: {:.2f}%'.format(np.mean(pred == y) * 100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*You should now submit your solutions.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grader[3] = predictOneVsAll\n",
    "grader.grade()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2 Neural Networks\n",
    "\n",
    "In the previous part of this exercise, you implemented multi-class logistic regression to recognize handwritten digits. However, logistic regression cannot form more complex hypotheses as it is only a linear classifier (You could add more features - such as polynomial features - to logistic regression, but that can be very expensive to train).\n",
    "\n",
    "In this part of the exercise, you will implement a neural network to recognize handwritten digits using the same training set as before. The neural network will be able to represent complex models that form non-linear hypotheses. For this week, you will be using parameters from a neural network that we have already trained. Your goal is to implement the feedforward propagation algorithm to use our weights for prediction. In next week’s exercise, you will write the backpropagation algorithm for learning the neural network parameters. \n",
    "\n",
    "We start by first reloading and visualizing the dataset which contains the MNIST handwritten digits (this is the same as we did in the first part of this exercise, we reload it here to ensure the variables have not been modified). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#  training data stored in arrays X, y\n",
    "data = loadmat(os.path.join('Data', 'ex3data1.mat'))\n",
    "X, y = data['X'], data['y'].ravel()\n",
    "\n",
    "# set the zero digit to 0, rather than its mapped 10 in this dataset\n",
    "# This is an artifact due to the fact that this dataset was used in \n",
    "# MATLAB where there is no index 0\n",
    "y[y == 10] = 0\n",
    "\n",
    "# get number of examples in dataset\n",
    "m = y.size\n",
    "\n",
    "# randomly permute examples, to be used for visualizing one \n",
    "# picture at a time\n",
    "indices = np.random.permutation(m)\n",
    "\n",
    "# Randomly select 100 data points to display\n",
    "rand_indices = np.random.choice(m, 100, replace=False)\n",
    "sel = X[rand_indices, :]\n",
    "\n",
    "utils.displayData(sel)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### 2.1 Model representation \n",
    "\n",
    "Our neural network is shown in the following figure.\n",
    "\n",
    "![Neural network](Figures/neuralnetwork.png)\n",
    "\n",
    "It has 3 layers: an input layer, a hidden layer and an output layer. Recall that our inputs are pixel values of digit images. Since the images are of size 20×20, this gives us 400 input layer units (excluding the extra bias unit which always outputs +1). As before, the training data will be loaded into the variables X and y. \n",
    "\n",
    "You have been provided with a set of network parameters ($\\Theta^{(1)}$, $\\Theta^{(2)}$) already trained by us. These are stored in `ex3weights.mat`. The following cell loads those parameters into  `Theta1` and `Theta2`. The parameters have dimensions that are sized for a neural network with 25 units in the second layer and 10 output units (corresponding to the 10 digit classes)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Setup the parameters you will use for this exercise\n",
    "input_layer_size  = 400  # 20x20 Input Images of Digits\n",
    "hidden_layer_size = 25   # 25 hidden units\n",
    "num_labels = 10          # 10 labels, from 0 to 9\n",
    "\n",
    "# Load the .mat file, which returns a dictionary \n",
    "weights = loadmat(os.path.join('Data', 'ex3weights.mat'))\n",
    "\n",
    "# get the model weights from the dictionary\n",
    "# Theta1 has size 25 x 401\n",
    "# Theta2 has size 10 x 26\n",
    "Theta1, Theta2 = weights['Theta1'], weights['Theta2']\n",
    "\n",
    "# swap first and last columns of Theta2, due to legacy from MATLAB indexing, \n",
    "# since the weight file ex3weights.mat was saved based on MATLAB indexing\n",
    "Theta2 = np.roll(Theta2, 1, axis=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"section4\"></a>\n",
    "### 2.2 Feedforward Propagation and Prediction\n",
    "\n",
    "Now you will implement feedforward propagation for the neural network. You will need to complete the code in the function `predict` to return the neural network’s prediction. You should implement the feedforward computation that computes $h_\\theta(x^{(i)})$ for every example $i$ and returns the associated predictions. Similar to the one-vs-all classification strategy, the prediction from the neural network will be the label that has the largest output $\\left( h_\\theta(x) \\right)_k$.\n",
    "\n",
    "<div class=\"alert alert-box alert-warning\">\n",
    "**Implementation Note:** The matrix $X$ contains the examples in rows. When you complete the code in the function `predict`, you will need to add the column of 1’s to the matrix. The matrices `Theta1` and `Theta2` contain the parameters for each unit in rows. Specifically, the first row of `Theta1` corresponds to the first hidden unit in the second layer. In `numpy`, when you compute $z^{(2)} = \\theta^{(1)}a^{(1)}$, be sure that you index (and if necessary, transpose) $X$ correctly so that you get $a^{(l)}$ as a 1-D vector.\n",
    "</div>\n",
    "<a id=\"predict\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def predict(Theta1, Theta2, X):\n",
    "    \"\"\"\n",
    "    Predict the label of an input given a trained neural network.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    Theta1 : array_like\n",
    "        Weights for the first layer in the neural network.\n",
    "        It has shape (2nd hidden layer size x input size)\n",
    "    \n",
    "    Theta2: array_like\n",
    "        Weights for the second layer in the neural network. \n",
    "        It has shape (output layer size x 2nd hidden layer size)\n",
    "    \n",
    "    X : array_like\n",
    "        The image inputs having shape (number of examples x image dimensions).\n",
    "    \n",
    "    Return \n",
    "    ------\n",
    "    p : array_like\n",
    "        Predictions vector containing the predicted label for each example.\n",
    "        It has a length equal to the number of examples.\n",
    "    \n",
    "    Instructions\n",
    "    ------------\n",
    "    Complete the following code to make predictions using your learned neural\n",
    "    network. You should set p to a vector containing labels \n",
    "    between 0 to (num_labels-1).\n",
    "     \n",
    "    Hint\n",
    "    ----\n",
    "    This code can be done all vectorized using the numpy argmax function.\n",
    "    In particular, the argmax function returns the index of the  max element,\n",
    "    for more information see '?np.argmax' or search online. If your examples\n",
    "    are in rows, then, you can use np.argmax(A, axis=1) to obtain the index\n",
    "    of the max for each row.\n",
    "    \n",
    "    Note\n",
    "    ----\n",
    "    Remember, we have supplied the `sigmoid` function in the `utils.py` file. \n",
    "    You can use this function by calling `utils.sigmoid(z)`, where you can \n",
    "    replace `z` by the required input variable to sigmoid.\n",
    "    \"\"\"\n",
    "    # Make sure the input has two dimensions\n",
    "    if X.ndim == 1:\n",
    "        X = X[None]  # promote to 2-dimensions\n",
    "    \n",
    "    # useful variables\n",
    "    m = X.shape[0]\n",
    "    num_labels = Theta2.shape[0]\n",
    "\n",
    "    # You need to return the following variables correctly \n",
    "    p = np.zeros(X.shape[0])\n",
    "\n",
    "    # ====================== YOUR CODE HERE ======================\n",
    "\n",
    "\n",
    "\n",
    "    # =============================================================\n",
    "    return p"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once you are done, call your predict function using the loaded set of parameters for `Theta1` and `Theta2`. You should see that the accuracy is about 97.5%."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pred = predict(Theta1, Theta2, X)\n",
    "print('Training Set Accuracy: {:.1f}%'.format(np.mean(pred == y) * 100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After that, we will display images from the training set one at a time, while at the same time printing out the predicted label for the displayed image. \n",
    "\n",
    "Run the following cell to display a single image the the neural network's prediction. You can run the cell multiple time to see predictions for different images."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if indices.size > 0:\n",
    "    i, indices = indices[0], indices[1:]\n",
    "    utils.displayData(X[i, :], figsize=(4, 4))\n",
    "    pred = predict(Theta1, Theta2, X[i, :])\n",
    "    print('Neural Network Prediction: {}'.format(*pred))\n",
    "else:\n",
    "    print('No more images to display!')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*You should now submit your solutions.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grader[4] = predict\n",
    "grader.grade()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}