{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fonnesbeck/Bios8366/blob/master/notebooks/Section7_1-Neural-Networks.ipynb)\n", "\n", "# Supervised Learning: Neural Networks" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "!pip install torchvision" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "sns.set(style='ticks')\n", "\n", "import torch\n", "import torchvision\n", "from scipy import optimize\n", "from ipywidgets import interact\n", "from IPython.display import SVG\n", "from sklearn import datasets\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import scale" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## McCulloch and Pitts Neuron\n", "\n", "In 1943, McCulloch and Pitts introduced a mathematical model of a neuron. It consisted of three components:\n", "\n", "1. A set of **weights** $w_i$ corresponding to synapses (inputs)\n", "2. An **adder** for summing input signals; analogous to cell membrane that collects charge\n", "3. An **activation function** for determining when the neuron fires, based on accumulated input\n", "\n", "The neuron model is shown schematically below. On the left are input nodes $\\{x_i\\}$, usually expressed as a vector. The strength with which the inputs are able to deliver the signal along the synapse is determined by their corresponding weights $\\{w_i\\}$. 
The adder then sums the inputs from all the synapses:\n", "\n", "$$h = \\sum_i w_i x_i$$\n", "\n", "The parameter $\\theta$ determines whether or not the neuron fires given a weighted input of $h$. If it fires, it returns a value $y=1$, otherwise $y=0$. For example, a simple **activation function** uses $\\theta$ as a fixed threshold:\n", "\n", "$$y = g(h) = \\left\\{ \\begin{array}{l}\n", "1, \\text{if } h \\gt \\theta \\\\\n", "0, \\text{if } h \\le \\theta\n", "\\end{array} \\right.$$\n", "\n", "More generally, the activation function may take any of several forms, such as a logistic function.\n", "\n", "![neuron](http://d.pr/i/9AMK+)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A single neuron is not interesting, nor useful, from a learning perspective. It cannot learn; it simply receives inputs and either fires or not. Only when neurons are joined as a **network** can they perform useful work.\n", "\n", "Learning takes place by changing the weights of the connections in a neural network, and by changing the parameters of the activation functions of neurons.\n", "\n", "## Perceptron\n", "\n", "A collection of McCulloch and Pitts neurons, along with a set of input nodes connected to the neurons via weighted edges, is a perceptron, the simplest neural network.\n", "\n", "Each neuron is independent of the others in the perceptron, in the sense that its behavior and performance depend only on its own weights and threshold value, and not on those of the other neurons. Though they share inputs, they operate independently.\n", "\n", "The numbers of inputs and outputs are determined by the data. 
Weights are stored as an `N x K` matrix, with N inputs and K neurons, where $w_{ij}$ specifies the weight connecting the *i*th input to the *j*th neuron.\n", "\n", "![perceptron](http://d.pr/i/4IWA+)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to use the perceptron for statistical learning, we compare the outputs $y_j$ from each neuron to the observation target $t_j$, and adjust the input weights when they do not correspond (*e.g.* if a neuron fires when it should not have).\n", "\n", "$$t_j - y_j$$\n", "\n", "We use this difference to update the weight $w_{ij}$, based on the input and a desired **learning rate**. This results in an update rule:\n", "\n", "$$w_{ij} \\leftarrow w_{ij} + \\eta (t_j - y_j) x_i$$\n", "\n", "After an incremental improvement, the perceptron is shown the training data again, resulting in another update. This is repeated until the performance no longer improves. Having a learning rate less than one results in more stable learning, though this stability is traded off against having to expose the network to the data multiple times. Typical learning rates are in the 0.1-0.4 range.\n", "\n", "An additional input node is typically added to the perceptron model, which is a constant value (usually -1, 0, or 1) that acts analogously to an intercept in a regression model. This establishes a baseline input for the case when all inputs are zero.\n", "\n", "![bias](http://d.pr/i/105b5+)\n", "\n", "## Learning with Perceptrons\n", "\n", "1. Initialize weights $w_{ij}$ to small, random numbers.\n", "2. 
For each of $T$ iterations:\n", "    * compute activation for each neuron *j* connected to each input vector *i*\n", "    $$y_j = g\\left( h=\\sum_i w_{ij} x_i \\right) = \\left\\{ \\begin{array}{l} 1, \\text{if } h \\gt 0 \\\\ 0, \\text{if } h \\le 0 \\end{array} \\right.$$\n", "    * update weights\n", "    $$w_{ij} \\leftarrow w_{ij} + \\eta (t_j - y_j) x_i$$\n", "\n", "This algorithm is $\\mathcal{O}(Tmn)$, for $T$ iterations, $m$ inputs, and $n$ neurons.\n", "\n", "### Example: Logical functions\n", "\n", "Let's see how the perceptron learns by training it on a couple of logical functions, AND and OR. For two variables `x1` and `x2`, the AND function returns 1 if both are true, or zero otherwise; the OR function returns 1 if either variable is true, or both. These functions can be expressed as simple lookup tables." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "AND = pd.DataFrame({'x1': (0,0,1,1), 'x2': (0,1,0,1), 'y': (0,0,0,1)})\n", "AND" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to initialize weights to small, random values (can be positive and negative)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "w = np.random.randn(3)*1e-4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, a simple activation function for calculating $g(h)$:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "g = lambda inputs, weights: np.where(np.dot(inputs, weights)>0, 1, 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, a training function that iterates the learning algorithm, returning the adapted weights." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def train(inputs, targets, weights, eta, n_iterations):\n", "\n", "    # Append the constant bias input (-1) to each observation\n", "    inputs = np.c_[inputs, -np.ones((len(inputs), 1))]\n", "\n", "    for n in range(n_iterations):\n", "\n", "        activations = g(inputs, weights)\n", "        weights -= eta*np.dot(np.transpose(inputs), activations - targets)\n", "\n", "    return weights" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's test it first on the AND function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "inputs = AND[['x1','x2']]\n", "target = AND['y']\n", "\n", "w = train(inputs, target, w, 0.25, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Checking the performance:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "g(np.c_[inputs, -np.ones((len(inputs), 1))], w)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thus, it has learned the function perfectly. Now for OR:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "OR = pd.DataFrame({'x1': (0,0,1,1), 'x2': (0,1,0,1), 'y': (0,1,1,1)})\n", "OR" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "w = np.random.randn(3)*1e-4" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "inputs = OR[['x1','x2']]\n", "target = OR['y']\n", "\n", "w = train(inputs, target, w, 0.25, 20)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "g(np.c_[inputs, -np.ones((len(inputs), 1))], w)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also 100% correct.\n", "\n", "### Exercise: XOR\n", "\n", "Now try running the model on the XOR function, where a one is returned for either `x1` or `x2` being true, but *not* both. 
What happens here?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write your answer here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's explore the problem graphically:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "AND.plot(kind='scatter', x='x1', y='x2', c='y', s=50, colormap='winter')\n", "plt.plot(np.linspace(0,1.4), 1.5 - 1*np.linspace(0,1.4), 'k--');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "OR.plot(kind='scatter', x='x1', y='x2', c='y', s=50, colormap='winter')\n", "plt.plot(np.linspace(-.4,1), .5 - 1*np.linspace(-.4,1), 'k--');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "XOR = pd.DataFrame({'x1': (0,0,1,1), 'x2': (0,1,0,1), 'y': (0,1,1,0)})\n", "\n", "XOR.plot(kind='scatter', x='x1', y='x2', c='y', s=50, colormap='winter');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The perceptron tries to find a separating hyperplane for the two response classes. Namely, a set of weights such that, for any two points $\\mathbf{x}_1$ and $\\mathbf{x}_2$ lying on the decision boundary:\n", "\n", "$$\\mathbf{x}_1\\mathbf{w}^T=0$$\n", "\n", "and:\n", "\n", "$$\\mathbf{x}_2\\mathbf{w}^T=0$$\n", "\n", "Hence,\n", "\n", "$$\\begin{aligned}\n", "\\mathbf{x}_1\\mathbf{w}^T &= \\mathbf{x}_2\\mathbf{w}^T \\\\\n", "\\Rightarrow (\\mathbf{x}_1 - \\mathbf{x}_2) \\mathbf{w}^T &= 0\n", "\\end{aligned}$$\n", "\n", "This means that either the norms of $\\mathbf{x}_1 - \\mathbf{x}_2$ or $\\mathbf{w}$ are zero, or the cosine of the angle between them is equal to zero, due to the identity:\n", "\n", "$$\\mathbf{a} \\cdot \\mathbf{b} = \\|\\mathbf{a}\\| \\|\\mathbf{b}\\| \\cos \\theta$$\n", "\n", "Since there is no reason for the norms to be zero in general, we need the two vectors to be at right angles to one another. 
So, we need a weight vector that is perpendicular to the decision boundary.\n", "\n", "Clearly, for the XOR function, the output classes are not linearly separable. So, the algorithm does not converge on an answer, but simply cycles through two incorrect solutions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multi-layer Perceptron\n", "\n", "The solution to fitting more complex (*i.e.* non-linear) models with neural networks is to use a more complex network that consists of more than just a single perceptron. The take-home message from the perceptron is that all of the learning happens by adapting the synapse weights until prediction is satisfactory. Hence, a reasonable guess at how to make a perceptron more complex is to simply **add more weights**.\n", "\n", "There are two ways to add complexity:\n", "\n", "1. Add backward connections, so that output neurons feed back to input nodes, resulting in a **recurrent network**\n", "2. Add neurons between the input nodes and the outputs, creating an additional (\"hidden\") layer to the network, resulting in a **multi-layer perceptron**\n", "\n", "The latter approach is more common in applications of neural networks.\n", "\n", "![multilayer](http://d.pr/i/14BS1+)\n", "\n", "How to train a multilayer network is not intuitive. Propagating the inputs forward over two layers is straightforward, since the outputs from the hidden layer can be used as inputs for the output layer. However, the process for updating the weights based on the prediction error is less clear, since it is difficult to know whether to change the weights on the input layer or on the hidden layer in order to improve the prediction.\n", "\n", "Updating a multi-layer perceptron (MLP) is a matter of: \n", "\n", "1. moving forward through the network, calculating outputs given inputs and current weight estimates\n", "2. moving backward updating weights according to the resulting error from forward propagation. 
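\n", "\n", "As a rough sketch of these two phases (not the `MLP` implementation used later in this section; the variable names are illustrative, and a logistic activation plus the constant $-1$ bias input are assumed), one forward/backward cycle for a single hidden layer might look like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "rng = np.random.default_rng(1)\n", "sigmoid = lambda h: 1. / (1. + np.exp(-h))\n", "\n", "X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])\n", "t = np.array([[0.], [1.], [1.], [0.]])  # XOR targets\n", "\n", "# Append the constant bias input (-1), as in the perceptron above\n", "Xb = np.c_[X, -np.ones((len(X), 1))]\n", "\n", "eta = 0.5\n", "V = rng.normal(scale=0.5, size=(3, 4))  # input -> hidden weights\n", "W = rng.normal(scale=0.5, size=(5, 1))  # hidden (+ bias) -> output weights\n", "\n", "for _ in range(10000):\n", "    # Forward pass: hidden activations become inputs to the output layer\n", "    a = sigmoid(Xb @ V)\n", "    ab = np.c_[a, -np.ones((len(a), 1))]\n", "    y = sigmoid(ab @ W)\n", "\n", "    # Backward pass: output error, then error propagated to the hidden layer\n", "    delta_o = (y - t) * y * (1 - y)\n", "    delta_h = (delta_o @ W[:-1].T) * a * (1 - a)\n", "\n", "    W -= eta * ab.T @ delta_o\n", "    V -= eta * Xb.T @ delta_h\n", "\n", "np.round(y, 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "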
\n", "\n", "In this sense, it is similar to a single-layer perceptron, except it has to be done twice, once for each layer (in principle, we can add additional hidden layers, but without sacrificing generality, I will keep it simple).\n", "\n", "### Error back-propagation\n", "\n", "We update the weights in an MLP using **back-propagation** of the prediction errors, which is essentially a form of gradient descent, as we have used previously for optimization.\n", "\n", "First, for the multi-layer perceptron we need to modify the error function, which in the single-layer case was a simple difference between the predicted and observed outputs. Because we will be summing errors, we have to avoid having errors in different directions cancelling each other out, so a sum of squares error is more appropriate:\n", "\n", "$$E(t,y) = \\frac{1}{2} \\sum_i (t_i - y_i)^2$$\n", "\n", "It is on this function that we will perform gradient descent, since the goal is to minimize the error. Specifically, we will differentiate with respect to the weights, since it is the weights that we are manipulating in order to get better predictions.\n", "\n", "Recall that the error is a function of the threshold function:\n", "\n", "$$E(\\mathbf{w}) = \\frac{1}{2} \\sum_i (t_i - y_i)^2 = \\frac{1}{2} \\sum_i \\left(t_i - g\\left[ \\sum_j w_{ij} a_j \\right]\\right)^2$$\n", "\n", "So, we will also need to differentiate that. However, the threshold function we used in the single-layer perceptron was discontinuous, making it non-differentiable. Thus, we need to modify it as well. 
An alternative is to employ some type of sigmoid function, such as the logistic, which can be parameterized to resemble a threshold function, but varies smoothly across its range.\n", "\n", "$$g(h) = \\frac{1}{1 + \\exp(-\\beta h)}$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "logistic = lambda h, beta: 1./(1 + np.exp(-beta * h))\n", "\n", "@interact(beta=(-1, 25))\n", "def logistic_plot(beta=5):\n", "    hvals = np.linspace(-2, 2)\n", "    plt.plot(hvals, logistic(hvals, beta))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This has the advantage of having a simple derivative:\n", "\n", "$$\\frac{dg}{dh} = \\beta g(h)(1 - g(h))$$\n", "\n", "Alternatively, the hyperbolic tangent function is also sigmoidal:\n", "\n", "$$g(h) = \\tanh(h) = \\frac{\\exp(h) - \\exp(-h)}{\\exp(h) + \\exp(-h)}$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hyperbolic_tangent = lambda h: (np.exp(h) - np.exp(-h)) / (np.exp(h) + np.exp(-h))\n", "\n", "@interact(theta=(-1, 25))\n", "def tanh_plot(theta=5):\n", "    hvals = np.linspace(-2, 2)\n", "    h = hvals*theta\n", "    plt.plot(hvals, hyperbolic_tangent(h))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the hyperbolic tangent function asymptotes at -1 and 1, rather than 0 and 1, which is sometimes beneficial, and its derivative is simple:\n", "\n", "$$\\frac{d \\tanh(x)}{dx} = 1 - \\tanh^2(x)$$\n", "\n", "Performing gradient descent will allow us to change the weights in the direction that optimally reduces the error. The next trick will be to employ the **chain rule** to decompose how the error changes as a function of the input weights into the change in error as a function of changes in the inputs to the weights, multiplied by the changes in input values as a function of changes in the weights. 
\n", "\n", "$$\\frac{\\partial E}{\\partial w} = \\frac{\\partial E}{\\partial h}\\frac{\\partial h}{\\partial w}$$\n", "\n", "This will allow us to write a function describing the activations of the output weights as a function of the activations of the hidden layer nodes and the output weights, which will allow us to propagate error backwards through the network.\n", "\n", "The second term in the chain rule simplifies to:\n", "\n", "$$\\begin{align}\n", "\\frac{\\partial h_k}{\\partial w_{jk}} &= \\frac{\\partial \\sum_l w_{lk} a_l}{\\partial w_{jk}} \\\\\n", "&= \\sum_l \\frac{\\partial w_{lk} a_l}{\\partial w_{jk}} \\\\\n", "& = a_j\n", "\\end{align}$$\n", "\n", "where $a_j$ is the activation of the *j*th hidden layer neuron.\n", "\n", "For the first term in the chain rule above, we decompose it as well:\n", "\n", "$$\\frac{\\partial E}{\\partial h_k} = \\frac{\\partial E}{\\partial y_k}\\frac{\\partial y_k}{\\partial h_k} = \\frac{\\partial E}{\\partial g(h_k)}\\frac{\\partial g(h_k)}{\\partial h_k}$$\n", "\n", "The second term of this chain rule is just the derivative of the activation function, which we have chosen to have a convenient form, while the first term simplifies to:\n", "\n", "$$\\frac{\\partial E}{\\partial g(h_k)} = \\frac{\\partial}{\\partial g(h_k)}\\left[\\frac{1}{2} \\sum_k (t_k - y_k)^2 \\right] = y_k - t_k$$\n", "\n", "Combining these, and assuming (for illustration) a logistic activation function, we have the gradient:\n", "\n", "$$\\frac{\\partial E}{\\partial w} = (y_k - t_k) y_k (1-y_k) a_j$$\n", "\n", "This ends up getting plugged into the weight update formula that we saw in the single-layer perceptron:\n", "\n", "$$w_{jk} \\leftarrow w_{jk} - \\eta (y_k - t_k) y_k (1-y_k) a_j$$\n", "\n", "Note that here we *subtract* the gradient, since we are doing gradient *descent*; because $-\\eta(y_k - t_k) = \\eta(t_k - y_k)$, this matches the form of the single-layer update rule.\n", "\n", "We can now outline the MLP learning algorithm:\n", "\n", "1. Initialize all $w_{jk}$ to small random values\n", "2. 
For each input vector, conduct forward propagation:\n", "    * compute activation of each neuron $j$ in the hidden layer (here, sigmoid):\n", "    $$h_j = \\sum_i x_i v_{ij}$$\n", "    $$a_j = g(h_j) = \\frac{1}{1 + \\exp(-\\beta h_j)}$$\n", "    * when the output layer is reached, calculate outputs similarly:\n", "    $$h_k = \\sum_j a_j w_{jk}$$\n", "    $$y_k = g(h_k) = \\frac{1}{1 + \\exp(-\\beta h_k)}$$\n", "3. Calculate loss for resulting predictions:\n", "    * compute error at output:\n", "    $$\\delta_k = (y_k - t_k) y_k (1-y_k)$$\n", "4. Conduct backpropagation to get partial derivatives of cost with respect to weights, and use these to update weights:\n", "    * compute error of the hidden layers:\n", "    $$\\delta_{hj} = \\left[\\sum_k w_{jk} \\delta_k \\right] a_j(1-a_j)$$\n", "    * update output layer weights:\n", "    $$w_{jk} \\leftarrow w_{jk} - \\eta \\delta_k a_j$$\n", "    * update hidden layer weights:\n", "    $$v_{ij} \\leftarrow v_{ij} - \\eta \\delta_{hj} x_i$$\n", "\n", "Return to (2) and iterate until learning completes. Best practice is to shuffle input vectors to avoid training in the same order.\n", "\n", "It's important to be aware that because gradient descent is a hill-climbing (or descending) algorithm, it is liable to be caught in local minima with respect to starting values. Therefore, it is worthwhile training several networks using a range of starting values for the weights, so that you have a better chance of discovering a globally-competitive solution.\n", "\n", "One useful performance enhancement for the MLP learning algorithm is the addition of **momentum** to the weight updates. This is just a coefficient on the previous weight update that increases the correlation between the current weight and the weight after the next update. This is particularly useful for complex models, where falling into local minima is an issue; adding momentum will give some weight to the previous direction, making the resulting weights essentially a weighted average of the two directions. 
Adding momentum, along with a smaller learning rate, usually results in a more stable algorithm with quicker convergence. Strictly speaking, the convergence analysis of plain gradient descent no longer applies once momentum is added, but this is generally seen as a small price to pay for the improvement momentum usually gives.\n", "\n", "A weight update with momentum looks like this:\n", "\n", "$$w_{jk} \\leftarrow w_{jk} - \\eta \\delta_k a_j + \\alpha \\Delta w_{jk}^{t-1}$$\n", "\n", "where $\\alpha$ is the momentum (regularization) parameter and $\\Delta w_{jk}^{t-1}$ the update from the previous iteration.\n", "\n", "The multi-layer perceptron is implemented below in the `MLP` class. The implementation uses the scikit-learn interface, so it is used in the same way as other supervised learning algorithms in that package." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Row-normalize; applied to exponentiated scores, this completes the softmax\n", "softmax = lambda a: a / np.sum(a, axis=1, keepdims=True)\n", "\n", "class MLP:\n", "\n", "    def __init__(self, alpha=0.01, eta=0.01, n_hidden_dim=25):\n", "\n", "        self.alpha = alpha\n", "        self.eta = eta\n", "        self.n_hidden_dim = n_hidden_dim\n", "\n", "    # Helper function to evaluate the total loss on the dataset\n", "    def calculate_loss(self, X, y):\n", "\n", "        num_examples = X.shape[0]\n", "\n", "        # Forward propagation to calculate our predictions\n", "        z1 = X.dot(self.w1) + self.b1\n", "        a1 = np.tanh(z1)\n", "        z2 = a1.dot(self.w2) + self.b2\n", "        exp_scores = np.exp(z2)\n", "        probs = softmax(exp_scores)\n", "\n", "        # Calculating the loss\n", "        data_loss = -np.log(probs[range(num_examples), y]).sum()\n", "\n", "        # Add regularization term to loss (optional)\n", "        data_loss += self.alpha/2 * (np.square(self.w1).sum() + np.square(self.w2).sum())\n", "\n", "        return 1./num_examples * data_loss\n", "\n", "    # Helper function to predict an output (0 or 1)\n", "    def predict(self, x):\n", "\n", "        # Forward propagation\n", "        z1 = x.dot(self.w1) + self.b1\n", "        a1 = np.tanh(z1)\n", "        z2 = a1.dot(self.w2) + self.b2\n", "        exp_scores = np.exp(z2)\n", "        probs = 
softmax(exp_scores)\n", "\n", "        return np.argmax(probs, axis=1)\n", "\n", "\n", "    def fit(self, X, y, num_passes=20000, print_loss=False, seed=42):\n", "\n", "        num_examples, nn_input_dim = X.shape\n", "        nn_output_dim = len(set(y))\n", "\n", "        # Initialize the parameters to random values. We need to learn these.\n", "        np.random.seed(seed)\n", "        self.w1 = np.random.randn(nn_input_dim, self.n_hidden_dim) / np.sqrt(nn_input_dim)\n", "        self.b1 = np.zeros((1, self.n_hidden_dim))\n", "        self.w2 = np.random.randn(self.n_hidden_dim, nn_output_dim) / np.sqrt(self.n_hidden_dim)\n", "        self.b2 = np.zeros((1, nn_output_dim))\n", "\n", "        # Gradient descent. For each batch...\n", "        for i in range(num_passes):\n", "\n", "            # Forward propagation\n", "            z1 = X.dot(self.w1) + self.b1\n", "            a1 = np.tanh(z1)\n", "            z2 = a1.dot(self.w2) + self.b2\n", "            exp_scores = np.exp(z2)\n", "\n", "            # Backpropagation\n", "            delta3 = softmax(exp_scores)\n", "            delta3[range(num_examples), y] -= 1\n", "            dw2 = (a1.T).dot(delta3)\n", "            db2 = np.sum(delta3, axis=0, keepdims=True)\n", "            delta2 = delta3.dot(self.w2.T) * (1 - np.power(a1, 2))\n", "            dw1 = np.dot(X.T, delta2)\n", "            db1 = np.sum(delta2, axis=0)\n", "\n", "            # Add regularization terms (b1 and b2 don't have regularization terms)\n", "            dw2 += self.alpha * self.w2\n", "            dw1 += self.alpha * self.w1\n", "\n", "            # Gradient descent parameter update\n", "            self.w1 += -self.eta * dw1\n", "            self.b1 += -self.eta * db1\n", "            self.w2 += -self.eta * dw2\n", "            self.b2 += -self.eta * db2\n", "\n", "            # Optionally print the loss.\n", "            # This is expensive because it uses the whole dataset, so we don't want to do it too often.\n", "            if print_loss and i % 1000 == 0:\n", "                print(\"Loss after iteration %i: %f\" % (i, self.calculate_loss(X, y)))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's initialize an MLP classifier with its default settings." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mlp = MLP()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can confirm that it solves a non-linear classification problem, using the simple XOR example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = XOR[['x1','x2']].values\n", "y = XOR['y'].values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mlp.fit(X, y, num_passes=100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mlp.predict(X).reshape([2,2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a somewhat more sophisticated example, we can use scikit-learn to simulate some data with a non-linear boundary." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Generate a dataset and plot it\n", "np.random.seed(0)\n", "X, y = datasets.make_moons(200, noise=0.20)\n", "X = X.astype(np.float32)\n", "y = y.astype(np.int32)\n", "plt.scatter(X[:,0], X[:,1], s=40, c=y, cmap=plt.cm.Set1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf = MLP(n_hidden_dim=3)\n", "clf.fit(X, y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_decision_boundary(pred_func, X=X, y=y):\n", "\n", "    # Set min and max values and give it some padding\n", "    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5\n", "    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5\n", "    h = 0.01\n", "\n", "    # Generate a grid of points with distance h between them\n", "    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))\n", "\n", "    # Predict the function value for the whole grid\n", "    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])\n", "    Z = Z.reshape(xx.shape)\n", "\n", "    # Plot the contour and training 
examples\n", "    plt.contourf(xx, yy, Z, cmap=plt.cm.summer_r)\n", "    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Greens)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_decision_boundary(lambda x: clf.predict(x))\n", "plt.title(\"Decision Boundary for hidden layer size 3\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we used the scikit-learn interface, it's easy to take advantage of the `metrics` module to evaluate the MLP's performance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X, y = datasets.make_moons(50, noise=0.20)\n", "X = X.astype(np.float32)\n", "y = y.astype(np.int32)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score, confusion_matrix\n", "\n", "accuracy_score(y, clf.predict(X))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "confusion_matrix(y, clf.predict(X))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Varying the hidden layer size\n", "\n", "In the example above we picked a hidden layer size of 3. Let's now get a sense of how varying the hidden layer size affects the result." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@interact(n_hidden_dim=[1, 2, 3, 4, 5, 20, 50])\n", "def hidden_layer_size(n_hidden_dim=5):\n", "    model = MLP(n_hidden_dim=n_hidden_dim)\n", "    model.fit(X, y)\n", "    plot_decision_boundary(lambda x: model.predict(x))\n", "    plt.title('Hidden Layer size %d' % n_hidden_dim)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Neural network specification\n", "\n", "The MLP implemented above uses a single hidden layer, though it allows a user-specified number of hidden layer nodes (defaults to 25). 
It is worth considering whether it is useful to have **multiple hidden layers**, and whether more hidden nodes are desirable.\n", "\n", "Unfortunately, there is no theory to guide the choice of hidden node number. As a result, we are left to experiment with this parameter, perhaps in some systematic fashion such as cross-validation.\n", "\n", "Adding additional layers presents only additional \"bookkeeping\" overhead to the user, with the weight updating becoming more complicated as layers are added. So, we don't want to add more hidden layers if it does not pay off in performance. It turns out that two or three layers (including the output layer) can be shown to approximate almost any smooth function. Combining three sigmoid functions allows local responses to be approximated with arbitrary accuracy. This is sufficient for determining any decision boundary.\n", "\n", "### Neural network validation\n", "\n", "Just as with other supervised learning algorithms, neural networks can be under- or over-fit to a training dataset. The degree to which a network is trained to a particular dataset depends on how long we train it on that dataset. Every time we run the MLP learning algorithm over a dataset (an **epoch**), it reduces the prediction error for that dataset. Thus, the number of epochs should be tuned as a hyperparameter, stopping when the testing-training error gap begins to widen.\n", "\n", "Note that we can also use cross-validation to tune the number of hidden nodes or layers, since an overly large network can likewise overfit the training data.\n", "\n", "### Exercise: Epoch tuning for MLPs\n", "\n", "The dataset `pima-indians-diabetes.data.txt` in your data folder contains eight measurements taken from a group of Pima Native Americans in Arizona, along with an indicator of the onset of diabetes. Use the MLP class to fit a neural network classifier to this dataset, and use cross-validation to examine the optimal number of epochs to use in training.\n", "\n", "1. 
Number of times pregnant\n", "2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test\n", "3. Diastolic blood pressure (mm Hg)\n", "4. Triceps skin fold thickness (mm)\n", "5. 2-Hour serum insulin (mu U/ml)\n", "6. Body mass index (weight in kg/(height in m)^2)\n", "7. Diabetes pedigree function\n", "8. Age (years)\n", "9. Class variable (0 or 1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pima = pd.read_csv('../data/pima-indians-diabetes.data.txt', header=None)\n", "pima.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write your answer here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## References\n", "\n", "- [The TensorFlow Playground](https://playground.tensorflow.org/)\n", "\n", "- T. Hastie, R. Tibshirani and J. Friedman. (2009) [Elements of Statistical Learning: Data Mining, Inference, and Prediction](http://statweb.stanford.edu/~tibs/ElemStatLearn/), second edition. Springer.\n", "\n", "- X. Glorot, A. Bordes and Y. Bengio. (2011) [Deep sparse rectifier neural networks (PDF)](http://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf). AISTATS.\n", "\n", "- S. Marsland. (2009) [Machine Learning: An Algorithmic Perspective](http://seat.massey.ac.nz/personal/s.r.marsland/MLBook.html). CRC Press.\n", "\n", "- D. Rodriguez. (2013) [Basic [1 hidden layer] neural network on Python](http://danielfrg.com/blog/2013/07/03/basic-neural-network-python/).\n", "\n", "- D. Britz. 
(2015) [Implementing a Neural Network from Scratch](http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.9" } }, "nbformat": 4, "nbformat_minor": 2 }