{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Building your first Deep Neural Network: Introduction to Backpropagation\n", "\n", "In this Chapter, we will:\n", "- Introrduce the Streetlight Problem.\n", "- Study matrices and the matrix relationship.\n", "- Implement full, batch, and stochastic Gradient Descent.\n", "- Show that neural networks learn correlation.\n", "- Explain overfitting.\n", "- Create our own correlation.\n", "- Study backpropagation: Long-distance error attribution.\n", "- Study linear versus non-lienar propagation.\n", "- Implement our first deep network.\n", "- Implement Backpropagation: bringing it all together.\n", "\n", "> [Douglas Adams] \"O Deep Thought Computer,\" he Said, \"The task we have designed you to perform is this. We want you to tell us...\" he paused. \"The Answer\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Streetlight Problem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The streetlight problem is a toy problem that considers how a network can learn an entire dataset. \n", "\n", "Imagine that you are approaching a street corner in a foreign country, as you approach, you look up and realize that the street light is unfamiliar. In this case, how can you know when it's safe to cross the street? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", "
\n", "\n", "To solve this problem, you might sit at a street corner for a few minutes observing the correlation between each light combination and whether people around you choose to stop or walk. After a few minutes, you realize that there is a perfect correlation between the middle light and whether it's safe to walk or not. \n", "\n", "You learned this pattern by observing all individual data points and **searching for correlation**. This is what we're going to train a neural network to do!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing the Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have two datasets; on the one hand, we have six streetlight states, on the other hand, we have size observations of whether people walked or not.\n", "\n", "Neural networks do not read streetlights. As a result, we want to prepare this data for processing. First thing to do is split it into two groups:\n", "- What we know.\n", "- What we want to know." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Matrices & the Matrix Relationship" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to translate the streetlight data into numerical values:\n", "\n", "
\n", " \n", " \n", "
\n", "\n", "The goal after that is to teach a neural network to translate a streetlight pattern into the correct stop/walk pattern. What we really want to do is to **transform** the input information we have into the correct stop/walk target signal.\n", "\n", "In data matrices, a common convention is to give each recorded example a single row and give each feature/column/thing being recorded a single column. This makes the matrix easier to be read and processed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Good Data Matrices perfectly mimic the outside world" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data matrix does not have to be all 1s and 0s. It was the case for the previous example since we are dealing with binary information. The matrix itself should mirror the patterns that exist in the real world.\n", "\n", "The underlying pattern is not the same as the matrix:\n", "- It is a property of the matrix.\n", "- The pattern is what the matrix is expressing.\n", "- The same pattern can exist in other matrices that describe the same real-world phenomena.\n", "\n", "The resulting matrix is called a **lossless representation** because we can perfectly convert back and forth between the stop/walk notes and the matrix." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a Matrix or Two in Python\n", "### Import the Matrices into Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's Create the Streetlight pattern matrix:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(6, 3)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "streetlights = np.array([[1,0,1], [0,1,1], [0,0,1], [1,1,1], [0,1,1], [1,0,1]])\n", "streetlights.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Numpy` is really just a fancy wrapper for an array of arrays that provides special, matrix-oriented functions." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(6, 1)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "walk_vs_stop = np.array([[0], [1], [0], [1], [1], [0]])\n", "walk_vs_stop.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building a Neural Network" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Error: 0.46497 | Prediction: 0.68189\n", "Error: 0.06026 | Prediction: 0.24548\n", "Error: 0.00781 | Prediction: 0.08837\n", "Error: 0.00101 | Prediction: 0.03181\n", "Error: 0.00013 | Prediction: 0.01145\n", "Error: 2e-05 | Prediction: 0.00412\n", "Error: 0.0 | Prediction: 0.00148\n", "Error: 0.0 | Prediction: 0.00053\n", "Error: 0.0 | Prediction: 0.00019\n", "Error: 0.0 | Prediction: 7e-05\n" ] } ], "source": [ "# Personal Implementation\n", "ws = np.random.rand(streetlights.shape[1])\n", "x_i = streetlights[0]\n", "y_i = walk_vs_stop[0]\n", "lr = .1\n", "\n", "for iteration in range(20):\n", " # predict.\n", " prediction = x_i.dot(ws)\n", " # MSE error.\n", " error = (prediction - y_i) ** 2\n", " # update weights.\n", " for j in range(len(ws)):\n", " gradient = 2 * x_i[j] * (prediction - y_i)\n", " ws[j] -= lr * gradient\n", " \n", " if iteration % 2 == 0:\n", " print(f\"Error: {round(error[0], 5)} | Prediction: {round(prediction, 5)}\")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Error: 0.04 | Prediction: -0.2\n", "Error: 0.0256 | Prediction: -0.16\n", "Error: 0.01638 | Prediction: -0.128\n", "Error: 0.01049 | Prediction: -0.1024\n", "Error: 0.00671 | Prediction: -0.08192\n", "Error: 0.00429 | Prediction: -0.06554\n", "Error: 0.00275 | Prediction: -0.05243\n", "Error: 0.00176 | Prediction: -0.04194\n", "Error: 0.00113 | Prediction: -0.03355\n", "Error: 0.00072 | Prediction: -0.02684\n", "Error: 0.00046 | Prediction: -0.02147\n", "Error: 0.0003 | Prediction: -0.01718\n", "Error: 0.00019 | Prediction: -0.01374\n", "Error: 0.00012 | Prediction: -0.011\n", "Error: 8e-05 | Prediction: -0.0088\n", "Error: 5e-05 | Prediction: -0.00704\n", "Error: 3e-05 | Prediction: -0.00563\n", "Error: 2e-05 | Prediction: -0.0045\n", "Error: 1e-05 | Prediction: -0.0036\n", "Error: 1e-05 | Prediction: -0.00288\n" ] } ], "source": [ "# Book's Implementation.\n", "weights = np.array([.5, .48, -.7])\n", "alpha = .1\n", "\n", "input = streetlights[0]\n", "goal_prediction = walk_vs_stop[0]\n", "\n", "for iteration in range(20):\n", " prediction = input.dot(weights)\n", " error = (goal_prediction - prediction) ** 2\n", " delta = prediction - goal_prediction\n", " weights = weights - (alpha * (input * delta))\n", " print(f\"Error: {round(error[0], 5)} | Prediction: {round(prediction, 5)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning the Whole Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The neural network has been learning only one streetlight. 
Next, let's implement a training loop to learn from all of the data points we have:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Prediction: 1.39 | Reality: 0 | Error: 1.94297\n", "Prediction: 0.91 | Reality: 1 | Error: 0.0076\n", "Prediction: 0.19 | Reality: 0 | Error: 0.03563\n", "Prediction: 1.57 | Reality: 1 | Error: 0.33057\n", "Prediction: 0.68 | Reality: 1 | Error: 0.10242\n", "Prediction: 0.65 | Reality: 0 | Error: 0.42256\n", "Prediction: 0.39 | Reality: 0 | Error: 0.15212\n", "Prediction: 0.6 | Reality: 1 | Error: 0.16003\n", "Prediction: -0.03 | Reality: 0 | Error: 0.00078\n", "Prediction: 1.11 | Reality: 1 | Error: 0.01157\n", "Prediction: 0.72 | Reality: 1 | Error: 0.07698\n", "Prediction: 0.33 | Reality: 0 | Error: 0.11028\n", "Prediction: 0.2 | Reality: 0 | Error: 0.0397\n", "Prediction: 0.73 | Reality: 1 | Error: 0.07439\n", "Prediction: -0.04 | Reality: 0 | Error: 0.00161\n", "Prediction: 1.06 | Reality: 1 | Error: 0.00343\n", "Prediction: 0.82 | Reality: 1 | Error: 0.03206\n", "Prediction: 0.19 | Reality: 0 | Error: 0.03783\n", "Prediction: 0.12 | Reality: 0 | Error: 0.01362\n", "Prediction: 0.83 | Reality: 1 | Error: 0.02879\n", "Prediction: -0.04 | Reality: 0 | Error: 0.00132\n", "Prediction: 1.05 | Reality: 1 | Error: 0.00209\n", "Prediction: 0.89 | Reality: 1 | Error: 0.01273\n", "Prediction: 0.12 | Reality: 0 | Error: 0.01334\n", "Prediction: 0.07 | Reality: 0 | Error: 0.0048\n", "Prediction: 0.9 | Reality: 1 | Error: 0.01095\n", "Prediction: -0.03 | Reality: 0 | Error: 0.001\n", "Prediction: 1.04 | Reality: 1 | Error: 0.00142\n", "Prediction: 0.93 | Reality: 1 | Error: 0.00512\n", "Prediction: 0.07 | Reality: 0 | Error: 0.00463\n", "Prediction: 0.04 | Reality: 0 | Error: 0.00167\n", "Prediction: 0.94 | Reality: 1 | Error: 0.00419\n", "Prediction: -0.03 | Reality: 0 | Error: 0.00075\n", "Prediction: 1.03 | Reality: 1 | Error: 0.00099\n", "Prediction: 0.95 | Reality: 1 | Error: 0.00211\n", "Prediction: 0.04 | Reality: 0 | Error: 0.00156\n", "Prediction: 0.02 | Reality: 0 | Error: 0.00056\n", "Prediction: 0.96 | Reality: 1 | Error: 0.00162\n", "Prediction: -0.02 | Reality: 0 | Error: 0.00056\n", "Prediction: 1.03 | Reality: 1 | Error: 0.0007\n", "Prediction: 0.97 | Reality: 1 | Error: 0.0009\n", "Prediction: 0.02 | Reality: 0 | Error: 0.0005\n" ] } ], "source": [ "# let's generalize the algorithm.\n", "ws = np.random.rand(streetlights.shape[1])\n", "lr = .1\n", "epochs = 7\n", "\n", "for iteration in range(epochs):\n", "    for i in range(len(streetlights)):\n", "        # predict.\n", "        prediction = streetlights[i].dot(ws)\n", "        # MSE error.\n", "        error = (prediction - walk_vs_stop[i]) ** 2\n", "        # update weights.\n", "        for j in range(len(ws)):\n", "            gradient = 2 * streetlights[i][j] * (prediction - walk_vs_stop[i])\n", "            ws[j] -= lr * gradient\n", "        print(f\"Prediction: {round(prediction, 2)} | Reality: {walk_vs_stop[i][0]} | Error: {round(error[0], 5)}\")" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "([0.0, 1.0, -0.0, 1.0, 1.0, 0.0],\n", " array([[1, 0, 1],\n", "        [0, 1, 1],\n", "        [0, 0, 1],\n", "        [1, 1, 1],\n", "        [0, 1, 1],\n", "        [1, 0, 1]]),\n", " array([[0],\n", "        [1],\n", "        [0],\n", "        [1],\n", "        [1],\n", "        [0]]))" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# our final weights predictions. 
Compared with ground truths\n", "[round(ws.dot(streetlight), 1) for streetlight in streetlights], streetlights, walk_vs_stop" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Error: 2.65612\n", "Error: 0.96287\n", "Error: 0.55092\n", "Error: 0.36446\n", "Error: 0.25168\n", "Error: 0.17798\n", "Error: 0.12864\n", "Error: 0.09511\n", "Error: 0.07195\n", "Error: 0.05565\n", "Error: 0.04395\n", "Error: 0.03536\n", "Error: 0.02891\n", "Error: 0.02395\n", "Error: 0.02006\n", "Error: 0.01695\n", "Error: 0.01442\n", "Error: 0.01233\n", "Error: 0.01059\n", "Error: 0.00912\n" ] } ], "source": [ "# Book's Implementation.\n", "weights = np.array([.5, .48, -.7])\n", "alpha = .1\n", "\n", "for iteration in range(20):\n", "    error_for_all_lights = 0\n", "    for row_index in range(len(walk_vs_stop)):\n", "        input = streetlights[row_index]\n", "        goal_prediction = walk_vs_stop[row_index]\n", "        \n", "        prediction = input.dot(weights)\n", "        \n", "        error = (goal_prediction - prediction) ** 2\n", "        error_for_all_lights += error\n", "        \n", "        delta = prediction - goal_prediction\n", "        weights = weights - (alpha * (input * delta))\n", "        \n", "    print(f\"Error: {round(error_for_all_lights[0], 5)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Full, Batch, and Stochastic Gradient Descent\n", "### Stochastic Gradient Descent updates weights one example at a time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The idea of learning one example at a time is called **stochastic gradient descent**. It performs a prediction and a weight update for each training example separately, iterating through the entire dataset many times until it finds a weight configuration that works well for all of the examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (Full) Gradient Descent updates weights one dataset at a time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of updating the weights once for each training example, the network computes the average weight update over the entire dataset, changing the weights only after the full average is calculated." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Batch Gradient Descent updates weights taking in n examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of updating the weights using one example or the entire dataset, we choose a batch size (typically between 8 and 1024) of examples, after which the weights are updated." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Neural Networks Learn Correlation\n", "### What did the last neural network learn?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Correlation is found wherever the weights are set to high numbers. Inversely, randomness with respect to the input is found wherever the weights converge to `0`.\n", "\n", "So a valid question to ask is: how did the network identify correlation in the last example? The answer comes from the data: during gradient descent, each training example exerts either up pressure or down pressure on the weights. On average, there was more `up` pressure for the middle weight and more `down` pressure for the other two.\n", "\n", "Next, we want to ask the following questions:\n", "- Where does the pressure come from?\n", "- Why is it different for different weights?"
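, "\n", "\n", "Before answering, here is a minimal sketch (reusing `streetlights` and `walk_vs_stop`, with a fresh random `ws`) that tallies the direction each training example pushes each weight on a single pass:\n", "\n", "```python\n", "ws = np.random.rand(3)\n", "for x, y in zip(streetlights, walk_vs_stop):\n", "    delta = x.dot(ws) - y[0]\n", "    pressure = x * delta  # each weight's slice of the shared error\n", "    print(x, [\"down\" if p > 0 else \"up\" if p < 0 else \"none\" for p in pressure])\n", "```\n", "\n", "Since the update subtracts the gradient, a positive `input * delta` product pushes that weight down, and a negative one pushes it up."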
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Up and Down Pressue\n", "### It comes from the Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each node is individually trying to correctly predict the output given the input. For the most partr, each node ignores all other nodes when attempting to do so, the only cross communication that occurs is that all 3 weights must share the same error measure. The weight update is nthing more than taking this shared error measure and multiplying it by each respective error. \n", "\n", "
\n", " \n", "
\n", "\n", "A key part of why neural networks learn is **error attribution**, which means given a shared error, the network needs to figure out which weights contributed (so they can be adjusted) and which weights did not contribute (so they can be left alone). On average, this causes the network to find the correlation between the middle weight and the output to be the dominant predictive force while enhancing the predictive accuracy of the network.\n", "\n", "To summarize:\n", "- The prediction is a weighted sum of the inputs.\n", "- The learning algorithm rewards inputs that correlate with the output with upward weight pressure (in the case of 1), and penalize inputs that decorrelate with the output with downward pressure (in the case of 0).\n", "- The weighted sum of the inputs find perfect correlation between the input and output by weighting decorrelated inputs to 0." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Edge Case: Overfitting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes correlation happens accidentally. If a particular configuration of weights accidentally creates perfect correlation between the predictions and the output dataset, in that case, the neural network will stop learning.\n", "\n", "In essence, the neural network memorized the two training examples instead of finding the correlation that will generalize to any possible streetlight configuration. The greatest challenge we will face with deep learning is pushing our neural network to **generalize** instead of just **memorize**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Edge Case: Conflicting Pressure\n", "### Sometimes correlation fights itself" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As nodes learn, they absorb some of the error; in other words, they absorb part of the correlation. This causes the network to predict with moderate. correlative power, which reduces the error.\n", "\n", "After that, the other weights only try to adjust their values to correctly predict what's left. **Regularization** forces weights with conflicting pressure to mve toward 0 and aims to say that only weights with really strong correlation can stay on.\n", "\n", "
\n", " \n", "
\n", "\n", "In the case of one input & output layers, each weight learn for itself and finds correlation between the associated column and the output. In the case when the correlation is indirect, meaning that when a linear combination of the inputs is correlated with the output and not distinct columns, in that case, we use the **multi-layer perceptron** architecture." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning Indirect Correlation\n", "### If your Data doesn't have correlation, create intermediate data that does!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Neural networks search for correlation between their input and output **layers**. Because sometimes the input dataset doesn't directly correlate with the output dataset, we'll use the input dataset to create an intermediate dataset that does have correlation with the output.\n", "\n", "This will lead us to what is called **representation learning**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Correlation\n", "\n", "
\n", " \n", "
\n", "\n", "The middle layer represents the intermediate dataset. The resulting network is still just a function. It has a bunch of weights that are collected together in a particular way.\n", "\n", "Gradient Descent still works because we can calculate how much each weight contributes to the error and adjust it to reduce the error to 0." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stacking Neural Networks: A Review\n", "\n", "If we look at the stacked neural network architecture and ignore the lower weights & only consider their output to be the dataset, the the top half of the neural network is just like the networks trained in the preceding chapter. We can use the same learning logic to help the network efficiently learn.\n", "\n", "The part that we don't yet understand is how to update the weights of the first layer. Previously, we used *delta* as a cached/normalized error measure. Now we want to figure out how to know the *delta* values at the first layer so they can help the second layer make accurate predictions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Backpropagation: Long-distance Error Attribution\n", "### The Weighted average error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", "
\n", "\n", "The way of using *delta* at layer 2 to figure out the *delta* at layer 1 is to multiply it by each of the respective weights for layer 1. It's like the prediction logic but in reverse. This process of moving *delta* back is called **backpropagation**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Backpropagation: Why does this work?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Backpropagation lets us say: \"If we want this node to be x amount higher, then each of these previous four nodes needs to be `x * weights_1_2` amount higher/lower\". Because these weights were amplifying the prediction by `weights_1_2` times. \n", "\n", "Once we know this, we can update each weight matrix as you did before. For each weight, we multiply its output delta by its input value and adjust the weight by that much (or we can scale it by the learning rate)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear vs. Nonlinear\n", "### This is probably the **hardest** Concept in the Book, Let's Take it slowly" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As it turns out, we need **one more piece** to make this neural network train. The Problem lies in the following statement:\n", "\n", "> All linear mappings of linear mappings produce linear mappings.\n", "\n", "Meaning, no matter how many stacked layers you add to our neural network, we can find an equivalent NN with only one layer that processes the input in the same way.\n", "\n", "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why the Neural Network still doesn't work\n", "### If you trained the three layer network as it is now, it wouldn't converge." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The middle nodes don't get to add anything to the conversation. they don't get to have correlation of their own. As a result, they're more or less correlated to various input nodes. Because we know that in the new dataset there is no correlation between any of the inputs/outputs, how can the middle layer help? We present the following reasons of why it doesn't:\n", "- It mixes up a bunch of correlation that're already useless.\n", "- **What we really need is for the middle layer to be able to selectively correlate with the input**.\n", "\n", "In essence, we want the middle layer to sometimes correlate with an input, and sometimes not correlate. This will give it a correlation of its own and the opportunity to not just always be `x%` correlated with one input and `y%` correlated with another input. Instead, it can be `x%` correlated with one input only when it wants to be.\n", "\n", "This is called **conditional correlation** or **sometimes correlation**, but let's just call it **non-linearity**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Secret to Sometimes Correlation\n", "### Turn off the node when the value is below 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If a Node's value dropped below 0, normally the node would still have the same correlation to the input as always (it would just happen to be negative in value). But if we turn off the node when it would be negative, then it has zero correlation to any inputs whenever it's negative.\n", "\n", "This means that **the Node can now pick & choose when it wants to be correlated to something**. This allows it to say something like: \"Make me perfectly correlated to the left input, but only when the right input is turned off\". Now the node can be conditional (or speak for itself!).\n", "\n", "The fancy term for this \"if the node would be negative, set it to 0\" logic is **nonlinearity**. Without this tweak, the neural network is linear. There are many kinds of nonlinearities, but the one discussed here is, in many cases, the best one to use. It's also the simplest, ReLU:\n", "\n", "$$ReLU(x) = max(x, 0)$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Our First Deep Neural Network" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "np.random.seed(1) # set the seed to a number if you are interested in having producing the same results between runs." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "def ReLU(x):\n", " return (x > 0) * x" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "lr = .1\n", "hidden_size = 4" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "X = np.array([[1, 0, 1], \n", " [0, 1, 1], \n", " [0, 0, 1], \n", " [1, 1, 1]])\n", "y = np.array([[1, 1, 0, 0]]).T" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((3, 4), (4, 1))" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# weights to connect the 3 layers.\n", "ws_0_1 = (2 * np.random.random((3, hidden_size))) - 1\n", "ws_1_2 = (2 * np.random.random((hidden_size, 1))) - 1\n", "ws_0_1.shape, ws_1_2.shape" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "layer_0 = X[0]\n", "layer_1 = ReLU(np.dot(layer_0, ws_0_1))\n", "layer_2 = np.dot(layer_1, ws_1_2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Backpropagating in Code\n", "### You can learn the amount that each weight contributes to the final error" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "def ReLU_grad(x):\n", " \"\"\"Derivative of ReLU.\"\"\"\n", " return (x > 0) * 1" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ERROR: 0.0\n", "ERROR: 0.07710452429266079\n", "ERROR: 0.03764561420363532\n", "ERROR: 0.002405811235989023\n", "ERROR: 9.222743383310113e-06\n", "ERROR: 0.0\n", "ERROR: 0.0\n", "ERROR: 0.0\n", "ERROR: 0.0\n", "ERROR: 0.0\n" ] }, { "data": { "text/plain": [ "array([[-0.5910955 ],\n", " [ 1.13962134],\n", " [-0.94522481],\n", " [ 1.11202675]])" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "for epoch in range(100):\n", " for i in range(len(X)):\n", " # get input X[i] & target y[i]\n", " x_i, y_i = X[i], y[i]\n", " \n", " # calculate prediction\n", " hs = ReLU(np.dot(x_i, ws_0_1))\n", " prediction = np.dot(hs, ws_1_2)\n", " \n", " # calculate error, pure error.\n", " error = (prediction - y_i) ** 2\n", " delta = prediction - y_i\n", " \n", " # calculate gradients of 1st layer.\n", " grad_0_1 = np.zeros(ws_0_1.shape)\n", " for line_i in range(len(ws_0_1)):\n", " for col_i in range(len(ws_0_1[0])):\n", " grad_0_1[line_i][col_i] = 2 * delta * x_i[line_i] * ws_1_2[col_i] * ReLU_grad(hs[col_i])\n", " \n", " # update weights of 1st layer.\n", " ws_0_1 -= lr * grad_0_1\n", " \n", " # calculate gradients of 2nd layer.\n", " grad_1_2 = np.zeros(ws_1_2.shape)\n", " for line_i in range(len(ws_1_2)):\n", " grad_1_2[line_i]= 2 * delta * hs[line_i]\n", " \n", " # update weights of 2nd layer.\n", " ws_1_2 -= lr * grad_1_2\n", " if (epoch % 10 == 0):\n", " print(f\"ERROR: {round(error[0], 30)}\")\n", "ws_1_2" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ERROR: 3.629241238667749e-11\n", "ERROR: 8.651118420460012e-12\n", "ERROR: 2.080468952941911e-12\n", "ERROR: 5.00404140651715e-13\n", "ERROR: 
1.2035981037474324e-13\n", "ERROR: 2.894955274935546e-14\n", "ERROR: 6.963091350021569e-15\n", "ERROR: 1.6747973437405323e-15\n", "ERROR: 4.0283054359352306e-16\n", "ERROR: 9.689079599258301e-17\n" ] } ], "source": [ "# Book Implementation.\n", "for iteration in range(100):\n", "    layer_2_error = 0\n", "    for i in range(len(X)):\n", "        layer_0 = X[i:i+1]\n", "        layer_1 = ReLU(np.dot(layer_0, ws_0_1))\n", "        layer_2 = np.dot(layer_1, ws_1_2)\n", "        \n", "        layer_2_error += np.sum((layer_2 - y[i:i+1]) ** 2)\n", "        layer_2_delta = (layer_2 - y[i:i+1])\n", "        layer_1_delta = layer_2_delta.dot(ws_1_2.T) * ReLU_grad(layer_1)\n", "        \n", "        ws_1_2 -= lr * layer_1.T.dot(layer_2_delta)\n", "        ws_0_1 -= lr * layer_0.T.dot(layer_1_delta)\n", "    if (iteration % 10 == 0):\n", "        print(f\"ERROR: {round(layer_2_error, 30)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Proper error attribution is the goal of backpropagation. It's about figuring out how much each weight contributed to the overall error. Now that we know how much the final prediction should move up or down, we need to figure out how much each middle node should move up or down. We call these the \"intermediate predictions\". Once we have the delta at layer 1, we can use the same process as before for calculating a weight update.\n", "\n", "Backpropagation is about calculating *deltas* for intermediate layers so we can perform gradient descent." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why do Deep Networks Matter?\n", "### What's the point of creating \"intermediate datasets\" that have correlation? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two-layer network might have a problem classifying cat vs. non-cat pictures because no individual pixel correlates with whether there's a cat in the picture; only different configurations of pixels correlate with whether there's a cat. \n", "\n", "Deep Learning is all about creating intermediate layers (representations) wherein each node in an intermediate layer represents the presence or absence of a different configuration of inputs. Because the intermediate layers detect various pixel configurations, they give the final layer the information it needs to correctly predict the presence or absence of a cat.\n", "\n", "Some neural networks have hundreds of layers!\n", "\n", "The rest of this book will be dedicated to studying different phenomena within these layers in an effort to explore the full power of deep neural networks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Challenge: Build a 3-layer Neural Network from Memory!" 
] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "import numpy as np " ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((4, 3), (4, 1))" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X = np.array([[1, 0, 1], \n", " [0, 1, 1], \n", " [0, 0, 1], \n", " [1, 1, 1]])\n", "y = np.array([[1, 1, 0, 0]]).T\n", "\n", "epochs = 10000\n", "lr = 0.1\n", "\n", "X.shape, y.shape" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((3, 4), (4, 1))" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# init weights.\n", "ws_1 = np.random.rand(X.shape[1], 4)\n", "ws_2 = np.random.rand(4, y.shape[1])\n", "\n", "ws_1.shape, ws_2.shape" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "def relu(x):\n", " return (x > 0) * x\n", "\n", "def grad_relu(x):\n", " return x > 0" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ERROR: 1.8831273134174278\n", "ERROR: 0.0001422217424704162\n", "ERROR: 7.932455952075163e-12\n", "ERROR: 6.184971668330622e-19\n", "ERROR: 3.58738933991847e-26\n", "ERROR: 1.232595164407831e-32\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n", "ERROR: 1.1093356479670479e-31\n" ] } ], "source": [ "for epoch in range(epochs):\n", " for i in range(len(X)):\n", " # get input/output\n", " layer_in = X[i:i+1]\n", " \n", " # calculate prediction\n", " layer_1 = relu(layer_in.dot(ws_1))\n", " layer_out = layer_1.dot(ws_2).reshape(1, 1)\n", " \n", " # calculate delta 2\n", " delta_2 = layer_out - y[i:i+1]\n", " \n", " # calc error for logs\n", " error = delta_2 ** 2\n", " \n", " # calculate delta 1\n", " # delta_2.dot(ws_2.T) -> (1, 4)\n", " # grad_relu(hs) -> (4,)\n", " # * : element wise multiplication.\n", " delta_1 = delta_2.dot(ws_2.T)*grad_relu(layer_1)\n", " \n", " # update weights\n", " 
ws_2 -= lr * (layer_1.T.reshape(4,1).dot(delta_2))\n", " ws_1 -= lr * (layer_in.T.reshape(3,1).dot(delta_1))\n", " if epoch % 200 == 0:\n", " print(f\"ERROR: {error[0][0]}\")" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "y: [1] y_hat: 0.9999999999999996\n", "y: [1] y_hat: 0.9999999999999997\n", "y: [0] y_hat: 3.531122842112766e-16\n", "y: [0] y_hat: 1.1102230246251565e-16\n" ] } ], "source": [ "# test weights.\n", "for i in range(len(X)):\n", " x_i, y_i = X[i], y[i]\n", " print('y: ', y_i, ' y_hat: ', relu(x_i.dot(ws_1)).dot(ws_2).squeeze())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Sketches" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
\n", " * Error Note: last layer doesn't have an activation function
\n", " \n", " \n", " \n", " \n", "
\n", "\n", "---" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" } }, "nbformat": 4, "nbformat_minor": 4 }