{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "CDA3H4Ibhoxw"
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "from typing import Tuple, List\n",
    "\n",
    "import numpy as np\n",
    "from sklearn.linear_model import LinearRegression, LogisticRegression\n",
    "from sklearn.datasets import make_blobs, make_moons\n",
    "from matplotlib import pyplot as plt\n",
    "from jax import grad, value_and_grad\n",
    "from jax import numpy as jpy\n",
    "\n",
    "plt.style.use(\"fivethirtyeight\")\n",
    "\n",
    "np.random.seed(42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "PWpJ4mcmhox7"
   },
   "source": [
    "# Fundamentals of Deep Learning\n",
    "\n",
    "In this notebook, we'll cover the fundamentals of deep learning by implementing our own relatively low level building blocks of neural networks using Python and NumPy. Before diving deep into it, note that I've tried my best to include a few good practies with Python coding that are desirable for when you come to do your own modeling; for example, in my function definitions, I add type hinting to remind you and I (mostly me) what kind of inputs the functions are expecting and I will use [f-strings](https://www.python.org/dev/peps/pep-0498/) instead of the older formatting statements. Because of various nuances with Python such as [duck typing](https://en.wikipedia.org/wiki/Duck_typing) and later PyTorch, I will try and point out specific things that I'm doing that will hopefully benefit future you.\n",
    "\n",
    "The idea behind this notebook is to set up the core parts that will be repeatedly used and re-used in virtually all deep learning applications. We won't actually be using these functions in later notebooks as we move to higher level libraries which abstract most of it away; the idea behind doing this is purely pedagogical and hopefully gives you a feel for how things are done. As you get better and better at deep learning, it's definitely advisable to start using one of the bigger open source libraries such as PyTorch, TensorFlow, Cafe, etc. not only because they've set up most of the low level stuff, but more than likely a lot more efficiently than most of us could do in a mere few hours.\n",
    "\n",
    "To kick things off, we'll look at this humble equation:\n",
    "\n",
    "$$ y = mx + b $$\n",
    "\n",
    "which is your typical linear regression, with slope $m$ and offset $b$. The power of this expression is its simplicity: we are able to map inputs $x$ more or less directly (with two parameters) onto an output value $y$. When we're dealing with multidimensional data, the same rule applies: we just make $x$ a vector, and do multivariate linear regression. The advantage of such simplicity is in its ability to be explained; we interpret each parameter with a specific purpose—for example, $m$ is the slope, and tells us how quickly $y$ changes with $x$, and so on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "-Xmhr7lshox9"
   },
   "outputs": [],
   "source": [
    "x = np.linspace(0., 10., 100)\n",
    "y = 2.31531 * x + 0.23412 + np.random.normal(size=(x.size))\n",
    "\n",
    "# The array slicing here is actually a useful trick to create another dimension\n",
    "# scikit-learn modules expect 2D arrays for most of their inputs\n",
    "model = LinearRegression(fit_intercept=True)\n",
    "model.fit(x[:,None], y[:,None])\n",
    "\n",
    "y_pred = model.predict(x[:,None])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "oqL_fAu1hoyE",
    "outputId": "77972f34-cde0-444c-d66c-e26e4389b855"
   },
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(4,4))\n",
    "\n",
    "ax.scatter(x, y, label=\"Truth\")\n",
    "ax.plot(x, y_pred.flatten(), label=\"Fit\", color=\"k\", ls=\"--\")\n",
    "\n",
    "ax.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "JXBmzNl7hoyL"
   },
   "source": [
    "However, the power and weakness of linear regression is its simplicity. While it's readily interpretable, it doesn't have much _expressive power_: the number of things you can do with linear regression is big, but it's also very strictly useful for a very specific purpose. One of the problems with science and statistics up to now the fascination with linear regression, just because it's easy, and consequently our tendency to jump through hoops to fit linear trends to non-linear data.\n",
    "\n",
    "Take for example the datasets below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Eal6ler1hoyN"
   },
   "outputs": [],
   "source": [
    "blob_data, blob_labels = make_blobs(200, centers=[[0.,1], [0., -1]], random_state=0)\n",
    "# unpack the data into \"X\" and \"Y\" dimensions\n",
    "blob_x, blob_y = blob_data[:,0], blob_data[:,1]\n",
    "\n",
    "moon_data, moon_labels = make_moons(200, noise=0.1, random_state=0)\n",
    "# unpack the data into \"X\" and \"Y\" dimensions\n",
    "moon_x, moon_y = moon_data[:,0], moon_data[:,1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "-VrGPeIthoyW",
    "outputId": "ebc493eb-8464-4d55-8617-33eaa05be637"
   },
   "outputs": [],
   "source": [
    "fig, axarray = plt.subplots(1, 2, figsize=(6,3))\n",
    "\n",
    "axarray[0].scatter(blob_x, blob_y, c=blob_labels, cmap=\"Spectral\")\n",
    "axarray[1].scatter(moon_x, moon_y, c=moon_labels, cmap=\"Spectral\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "xw2wO1tlhoyc"
   },
   "source": [
    "If our task was to sort data points depending on their origin (i.e. a classification problem), linear regression might be able to do reasonably well on the left panel (blobs), where the decision boundary separating the two could be closely approximated by a straight line. In the right panel (moons), the dividing boundary does not appear to be a linear function! In the conventional way of thinking, we could probably find some projection of the data where it is linear, but to what end? You will likely lose the interpretability of model, and you would have to think hard about how to perform the transformations with sufficient confidence that will capture the underlying problem enough to be predictive.\n",
    "\n",
    "---\n",
    "\n",
    "# Enter Neural Networks\n",
    "\n",
    "If our data is sufficiently complex, then it may be a better option to let the data speak for itself, rather than imposing what _we think_ the data should look like. Instead of applying crazy transformations, why not transform our model? This is where neural networks come into play.\n",
    "\n",
    "The idea behind neural networks is simple, and in the simplest case can be written out as a simple linear algebra equation:\n",
    "\n",
    "$$ Y = w^TX + b $$\n",
    "\n",
    "Note that here I'm using vector notation: $X$ and $Y$ are no longer restricted to scalar quantities, but the equation more or less looks the same as our linear regression equation. The main difference is $m$ used to be our slope, is now replaced by a vector $w$ which represents \"weights\". You now have an added level of complexity/expressive power in your model by having multiple free parameters rather than a single value. In general, $w$ is a rank two tensor, or a matrix (as we'll see later)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "VgUyT8_yhoyd"
   },
   "outputs": [],
   "source": [
    "def linear_layer(X: np.ndarray, w: np.ndarray, b: np.ndarray):\n",
    "    \"\"\"\n",
    "    NumPy implementation of a \"fully connected layer\". \n",
    "    Dot product of transposed weights and X, plus a bias term.\n",
    "    \"\"\"\n",
    "    return w.T @ X + b"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "vd1auqqvhoyj"
   },
   "source": [
    "So how can we use this?\n",
    "\n",
    "Let's take a simple example of a two layer model, which is the simplest type of neural network you can have. \n",
    "\n",
    "![simple](https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/280px-Colored_neural_network.svg.png)\n",
    "\n",
    "It's convenient to define some nomenclature here:\n",
    "\n",
    "| Term | Definition |\n",
    "|---|---|\n",
    "| Unit | One of the nodes on our graph. Also referred to as a neuron. |\n",
    "| Layer | One of the columns on our graph |\n",
    "| Hidden Layer | Intermediate layers between input/output. Hidden because you don't typically see the actual computation/values. |\n",
    "| Fully-connected | Every node is connected to every node in the previous layer. This is important for defining the flow of your graph. |\n",
    "| Deep | Neural network with many layers deep; somewhat arbitrary (more than four layers?). |\n",
    "\n",
    "There are two layers, as the output counts as a layer of the network, but the inputs do not. Here, we have an intermediate—or \"hidden\"—layer in between the inputs and outputs that performs our transformation. This kind of network is also called a perceptron, and in our two-layer case is the special case of a [linear perceptron](https://en.wikipedia.org/wiki/Perceptron). To compute the  first \"hidden\" part, we simply have to take our weights $w$, compute the dot product with our inputs.\n",
    "\n",
    "The shape of these matrices are worth bringing up: there's quite a bit of nuance because of the way these equations are set up. While we would like to think of $w$ as a vector, it is actually a matrix that connects the inputs (length 3) with the hidden (length 4) layer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "X31-Nrnghoyk"
   },
   "outputs": [],
   "source": [
    "X = np.random.randn(3, 1)\n",
    "# Shape 3,4 — input length, hidden dimension\n",
    "w = np.random.randn(3, 4)\n",
    "# bias is a single term for each neuron\n",
    "b = np.random.randn(4, 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "TBS3pERShoyp",
    "outputId": "4556637f-55ed-4e5b-a478-c62c50d08c52"
   },
   "outputs": [],
   "source": [
    "h = linear_layer(X, w, b)\n",
    "print(f\"Hidden layer: \\n{h}\")\n",
    "print(f\"Shape of hidden layer: \\t{h.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "jB2LzXmDhoyv"
   },
   "source": [
    "If we wanted to make our model more expressive, we could increase our hidden dimension—replace 4 with some large number. We'll get to that later, but for now we can finish this model by computing the next step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "KkBh9-Aohoyw"
   },
   "outputs": [],
   "source": [
    "w = np.random.randn(4, 2)\n",
    "# Not using bias on the output layer\n",
    "b = np.zeros((2, 1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "X0teHkh1hoy2",
    "outputId": "790f3a1c-a881-4c09-a5c6-3013cde2d60d"
   },
   "outputs": [],
   "source": [
    "Y = linear_layer(h, w, b)\n",
    "print(f\"Output layer: \\n{Y}\")\n",
    "print(f\"Shape of output: \\t{Y.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "OgORRNeKhozB"
   },
   "source": [
    "And that's it! That's actually all you need to implement a simple neural network. Of course, the complication is in many of the other nuanced factors: our code isn't very general, nor is it a very practical way to use neural networks. We're still a far way away from treating the blob and moon problems, but in principle this is all our computation requires.\n",
    "\n",
    "For the sake of abstraction, let's play around with making a class that will set up a multilayer perceptron: a fully connected, $n$ hidden layer neural network. If you are smart enough to do it recursively, do it!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Z_sfMBxuhozD"
   },
   "outputs": [],
   "source": [
    "class LinearModel:\n",
    "    def __init__(self, X_dim: int, Y_dim: int, hidden_dims: Tuple[int]):\n",
    "        self.X_dim = X_dim\n",
    "        self.Y_dim = Y_dim\n",
    "        self.hidden_dims = hidden_dims\n",
    "        self.w = list()\n",
    "        self.b = list()\n",
    "        # Number of layers is hidden + output\n",
    "        self.n_layers = len(hidden_dims) + 1\n",
    "        self.init_parameters()\n",
    "        \n",
    "    def init_parameters(self):\n",
    "        \"\"\"\n",
    "        Initializes weights to random values, biases to 0\n",
    "        \"\"\"\n",
    "\n",
    "        #Arrange all layer sizes (including input) into a list in order.\n",
    "        #Input layer -> hidden layer 1 --> hidden layer 2 --> ... --> output layer\n",
    "        #The length of this list will be self.n_layers+1 (since input doesn't count in n_layers)\n",
    "        self.layer_lens = [self.X_dim]\n",
    "        for dim in self.hidden_dims:\n",
    "          self.layer_lens.append(dim)\n",
    "        self.layer_lens.append(self.Y_dim)\n",
    "\n",
    "        self.w = list()\n",
    "        self.b = list()\n",
    "        #initialize weight and bias matrices, sizing according to the current layer and next layer\n",
    "        for i in range(self.n_layers):\n",
    "            self.w.append(np.random.normal(size=(self.layer_lens[i],self.layer_lens[i+1])))\n",
    "            self.b.append(np.zeros((self.layer_lens[i+1],1)))\n",
    "    \n",
    "    def forward(self, X: np.ndarray):\n",
    "        \"\"\"\n",
    "        Same notation used for later PyTorch models; we implement a \"forward\" class method\n",
    "        that will perform a forward pass of our graph. This function recursively computes\n",
    "        our `linear_layer` function using the previous iteration's input. \n",
    "        \"\"\"\n",
    "        for w, b in zip(self.w, self.b):\n",
    "            X = linear_layer(X, w, b)\n",
    "        return X\n",
    "    \n",
    "    def __call__(self, X: np.ndarray):\n",
    "        \"\"\"\n",
    "        Another PyTorch-style method that lets provides a functional interface\n",
    "        to the model.\n",
    "        \"\"\"\n",
    "        return self.forward(X)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "KZabhQihhozJ"
   },
   "outputs": [],
   "source": [
    "X_DIM = 3\n",
    "Y_DIM = 2\n",
    "HIDDEN_DIMS = (4,)\n",
    "\n",
    "model = LinearModel(X_DIM, Y_DIM, HIDDEN_DIMS)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "hOAqz5DthozO",
    "outputId": "8fcbd11e-c2f0-441d-ec1c-3135015dc8cb"
   },
   "outputs": [],
   "source": [
    "model.forward(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "_8DOWGishozT",
    "outputId": "b87dca6b-d043-4737-bfe4-03422a4a0257"
   },
   "outputs": [],
   "source": [
    "# Nice functional interface to models; get used to this and .forward() calls!\n",
    "model(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "66bdJ5ZUhozY"
   },
   "source": [
    "Try making a bigger network:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "10vTU41HhozZ"
   },
   "outputs": [],
   "source": [
    "X_DIM = 3\n",
    "Y_DIM = 2\n",
    "HIDDEN_DIMS = (4,10,20,50)\n",
    "\n",
    "model = LinearModel(X_DIM, Y_DIM, HIDDEN_DIMS)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "bhy2_F0Jhozd",
    "outputId": "3faeb5fc-58f2-4327-fa67-bb2b228c5367"
   },
   "outputs": [],
   "source": [
    "model.forward(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "q3DFNWLUhozl"
   },
   "source": [
    "Notice something about the output values? With more layers, the values get larger and larger! This becomes more and more of a problem for deep neural networks, but the advantage of building deeper networks comes from the ability to express: if you think about each of these layers as sequential linear transformations, even though locally (within a layer or two) the behavior is linear, you can actually build up representations of even highly non-linear functions. Another way of thinking about this is in terms of basis functions: our neurons simply act as an expansion of our function and in the same way as in the limit of a complete basis set, we could in theory build an neural network with infinite expressive power by building infinitely deep and wide networks. For most of our tasks however, we don't really need that level of complexity—there's a case for building simpler models, if not for interpretability than for use in deployment/production/science. The more complex your model is, the less likely people will use it, and the more difficult it is to maintain.\n",
    "\n",
    "# Activation Functions\n",
    "\n",
    "This brings us to how neural networks are like omni-tools: the basic mechanics have been implemented in our Python class, and the computation with NumPy matrix operations. These two components alone are enough, if all we wanted to do was regression! If our task is something like classification, then we need some way to be able to express probabilities. This is where activation functions are useful: we modify our neural network model to include a function that transforms the outputs of the model into something more fitting for the task at hand. In many instances, you'll also see activation functions being referred to as \"non-linearities\".\n",
    "\n",
    "$$ Z = g(w^TX + b) $$\n",
    "\n",
    "where $Z$ is the activation output and $g$ is our activation function. $Z$ is not specifically $Y$ as before, because we can act on intermediate outputs in addition to the final output. Remember how large the output values were? From a learning perspective as well as for your task, you might not want stupidly large numbers—what you're trying to model may be much more subtle, and occurs on a smaller scale.\n",
    "\n",
    "So now we'll look at a few example functions that $g$ can take on, and what sort of tasks they suit.\n",
    "\n",
    "## Sigmoid\n",
    "\n",
    "This is the classic activation function to use for classification. This function simply compresses your output into the range of [0, 1], which we can easily interpret as a probability of something occurring."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ftkifxQshozm"
   },
   "outputs": [],
   "source": [
    "def sigmoid(X: np.ndarray):\n",
    "    return 1 / (1 + np.exp(-X))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "oEMh4eCQhozr"
   },
   "source": [
    "To demonstrate, let's take a simply two-layer model that maps single values of X to a single value of Y, and use the sigmoid function to operate on the output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "KkzFIwsShozt"
   },
   "outputs": [],
   "source": [
    "X_DIM = 1\n",
    "Y_DIM = 1\n",
    "HIDDEN_DIMS = (4,)\n",
    "\n",
    "model = LinearModel(X_DIM, Y_DIM, HIDDEN_DIMS)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "khaYu2nqhozz"
   },
   "outputs": [],
   "source": [
    "X = np.random.randn(1, 100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "8esR4mhthoz4"
   },
   "outputs": [],
   "source": [
    "y = model(X)\n",
    "\n",
    "G = sigmoid(y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "R3WHLfQkhoz9",
    "outputId": "1ef653cc-19da-4299-f4e2-5ada3bddf0cb"
   },
   "outputs": [],
   "source": [
    "fig, axarray = plt.subplots(1, 2, figsize=(6,3))\n",
    "\n",
    "axarray[0].set_title(\"Model\")\n",
    "axarray[0].set_ylabel(\"Y\")\n",
    "axarray[0].set_xlabel(\"X\")\n",
    "axarray[0].scatter(X[0,:], y[0,:], c=y[0,:], cmap=\"Spectral\", edgecolor=\"k\", lw=0.3)\n",
    "axarray[1].set_title(\"Sigmoid\")\n",
    "axarray[1].scatter(X[0,:], G[0,:], c=y[0,:], cmap=\"Spectral\", edgecolor=\"k\", lw=0.3)\n",
    "axarray[1].set_ylabel(\"G\")\n",
    "axarray[1].set_xlabel(\"X\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ukv6Xgr9ho0C"
   },
   "source": [
    "The scatter points are colormapped to the value of $Y$, so that there is a one-to-one correspondance between the two graphs. As you see, the larger values of $Y$ asymptote to 1 when a sigmoid function is applied, conversely very small/negative values of $Y$ go to zero. So in terms of probabilities, the sigmoid function is pretty useful. The trouble is, however, if we were to try and apply this to large values, like if we were to stack up our neural network model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "19LebgIPho0D"
   },
   "outputs": [],
   "source": [
    "X_DIM = 1\n",
    "Y_DIM = 1\n",
    "HIDDEN_DIMS = (4,10,30,10,4)\n",
    "\n",
    "model = LinearModel(X_DIM, Y_DIM, HIDDEN_DIMS)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "bEFiQSQDho0J"
   },
   "outputs": [],
   "source": [
    "Y = model(X)\n",
    "\n",
    "G = sigmoid(Y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "x6QTdN0fho0V",
    "outputId": "46a67fba-b548-4655-f39d-c3c345f656a7"
   },
   "outputs": [],
   "source": [
    "fig, axarray = plt.subplots(1, 2, figsize=(6,3))\n",
    "\n",
    "axarray[0].set_title(\"Model\")\n",
    "axarray[0].set_ylabel(\"Y\")\n",
    "axarray[0].set_xlabel(\"X\")\n",
    "axarray[0].scatter(X[0,:], Y[0,:], c=Y[0,:], cmap=\"Spectral\", edgecolor=\"k\", lw=0.3)\n",
    "axarray[1].set_title(\"Sigmoid\")\n",
    "axarray[1].scatter(X[0,:], G[0,:], c=Y[0,:], cmap=\"Spectral\", edgecolor=\"k\", lw=0.3)\n",
    "axarray[1].set_ylabel(\"G\")\n",
    "axarray[1].set_xlabel(\"X\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "7I6xbAUFho0a"
   },
   "source": [
    "See how the values of our sigmoid becomes basically one and zero? This is not very useful, as it basically means our neural network just outputs a step function. However, if we were to modify our neural network to include sigmoids between layers, the idea is that the numbers stay relatively small between each successive layer.\n",
    "\n",
    "Below this, we're going to use class inheritance to reduce the need for repetitive code. Instead of copy-pasting our previous class, we're just going to inherit the common features of the model, including the `init_parameters` function and all of the other boilerplate stuff, and just modify the `forward` method to include activations. This is way of model development is object-oriented and very well supported in PyTorch, so it's advantageous to get used to this style of development."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "it4ScaXMho0b"
   },
   "outputs": [],
   "source": [
    "class LinearModelWithActivation(LinearModel):\n",
    "    \"\"\"\n",
    "    Added support for an activation function. The `activation` keyword argument provides a\n",
    "    way for the user to provide a function that operates on X\n",
    "    \"\"\"\n",
    "    def __init__(self, X_dim: int, Y_dim: int, hidden_dims: Tuple[int], activation=None):\n",
    "        super().__init__(X_dim, Y_dim, hidden_dims)\n",
    "        self.init_parameters()\n",
    "        self.activation = activation\n",
    "    \n",
    "    def forward(self, X: np.ndarray):\n",
    "        \"\"\"\n",
    "        Modified forward method to include an activation as well.\n",
    "        \"\"\"\n",
    "        for w, b in zip(self.w, self.b):\n",
    "            X = linear_layer(X, w, b)\n",
    "            if self.activation:\n",
    "                X = self.activation(X)\n",
    "        return X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "8pL3MCFHho0f"
   },
   "outputs": [],
   "source": [
    "X_DIM = 1\n",
    "Y_DIM = 1\n",
    "HIDDEN_DIMS = (4,10,30,10,4)\n",
    "\n",
    "model = LinearModelWithActivation(X_DIM, Y_DIM, HIDDEN_DIMS, activation=sigmoid)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "eWLTtre2ho0j"
   },
   "outputs": [],
   "source": [
    "Y = model(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "yHvhzDrzho0n",
    "outputId": "374f2910-2c8a-4f5b-ed4d-62f023fe06d2"
   },
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "\n",
    "ax.set_title(\"Model with sigmoid activation\")\n",
    "ax.set_ylabel(\"Y\")\n",
    "ax.set_xlabel(\"X\")\n",
    "ax.scatter(X[0,:], Y[0,:], c=Y[0,:], cmap=\"Spectral\", edgecolor=\"k\", lw=0.3)\n",
    "# ax.set_ylim([0.172, 0.18])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "SAPBOSUCho0u"
   },
   "source": [
    "Notice how our model is now non-linear? With these simple sigmoid functions between hidden layers, we've induced non-linearities despite the fact that our model is just built with linear transformations! Not only that, the absolute values of $Y$ are now much smaller than the model without activations. There are a range of activation functions that can allow your model to grasp certain responses, and the figure below gives a brief overview.\n",
    "\n",
    "Below we're also going to test out a few commonly used activation functions, and see effect they have on our model.\n",
    "\n",
    "## ReLU\n",
    "\n",
    "The rectified linear unit (ReLU) function is extremely common, and in this implementation I'm taking the vanilla ReLU function which basically takes $\\mathrm{max}(X, 0)$; the larger value of either $X$, or zero. This function forces values to be non-negative, or at least zero."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "SC49pHFZho0v"
   },
   "outputs": [],
   "source": [
    "def relu(X: np.ndarray):\n",
    "    return np.maximum(X, np.zeros(X.shape))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "rvmvnBzTho00"
   },
   "source": [
    "## Softmax\n",
    "\n",
    "This is a very similar function to the sigmoid—instead of making a single number range from [0,1], it forces a vector to sum up to one, such that each individual element can be thought of as a probability. This is the most commonly used activation for multilabel classification tasks.\n",
    "\n",
    "$$ \\mathrm{softmax} = \\frac{\\exp(X)}{\\sum \\exp(X)} $$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "9HcqDPI5ho01"
   },
   "outputs": [],
   "source": [
    "def softmax(X: np.ndarray):\n",
    "    return np.exp(X) / np.exp(X).sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "TE6YNpbUho09",
    "outputId": "3f482099-8fb7-4264-ff33-589803c7f74a"
   },
   "outputs": [],
   "source": [
    "X_DIM = 1\n",
    "Y_DIM = 1\n",
    "HIDDEN_DIMS = (4,10,30,10,4)\n",
    "\n",
    "activation_funcs = [sigmoid, softmax, np.tanh, np.cos, relu]\n",
    "labels = [\"Sigmoid\", \"Softmax\", \"tanh\", \"cos\", \"ReLU\",]\n",
    "\n",
    "fig, axarray = plt.subplots(1, len(activation_funcs), figsize=(14, 3))\n",
    "\n",
    "for index, activation_func in enumerate(activation_funcs):\n",
    "    model = LinearModelWithActivation(X_DIM, Y_DIM, HIDDEN_DIMS, activation=activation_func)\n",
    "    Y = model(X)\n",
    "    axarray[index].scatter(\n",
    "        X[0,:], Y[0,:], c=y[0,:], cmap=\"Spectral\", edgecolor=\"k\", lw=0.3\n",
    "    )\n",
    "    axarray[index].set_title(labels[index])\n",
    "    axarray[index].set_xlabel(\"X\")\n",
    "    axarray[index].set_ylabel(\"Y\")\n",
    "# This command spreads out the figures to not overlap\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "KWKk1s0Hho1D"
   },
   "source": [
    "As you can see, the choice of function can change the output values quite dramatically, even if it's nominally the same input values $X$. You can also appreciate that, our model is only six layers deep (5 hidden layers), and we're already getting to the point where getting an intuition for what the output values will look like is kind of difficult. For example, $\\tanh$ is a commonly used function that is very similar to sigmoid, which acts to compress values between [-1,1]. While it seemed like our sigmoid activations behaved well this time, $\\tanh$ just seemed to output a step function. So, depending on how you stack up layers and activations, you can get very different behaviours and for this reason, deep learning (even without the learning part yet!) is very much an experimental field. Over time you're supposed to build up an intuition for what might work well, but it's very hard to dive into a new problem and say with confidence your model will work without having tried something in the first place."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "RPv5mKV3ho1E"
   },
   "source": [
    "# The \"Learning\" in Deep Learning\n",
    "\n",
    "So far we've only implemented the core functioning parts of a neural network—we haven't done anything yet to actually solve problems. The \"learning\" in deep learning refers to finding parameters (which I'll refer to now as $\\theta$, which broadly speaking includes weights $w$ and biases $b$). To \"learn\" parameters, we have to be able to compute how these parameters should change. You can therefore think of the learning part as broken into two parts:\n",
    "\n",
    "1. How do we evaluate performance? (Cost, $J$)\n",
    "2. How do we use performance to update our model (Optimizer)\n",
    "\n",
    "## Cost Functions\n",
    "\n",
    "So, much like any other conventional approach like linear regression, we have to have some way to measure how our model is doing with respect to our target data: we have to be able to define a loss or cost function, $J$, that accurately measures how well our model is doing, as to be able to subsequently update our parameters to do better in the next time. Let's look at that moon problem again, and try and build a neural network that will classify each point color: because there's only two clumps, we could just do a binary classification, which is either 0 or 1—perfect for our sigmoid activation.\n",
    "\n",
    "In this model, we're going to take two dimensional data points which we called $X$ and $Y$, and try and predict whether it's blue or red (0 or 1)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "vskJtS5Who1F"
   },
   "outputs": [],
   "source": [
    "X_DIM = 2\n",
    "Y_DIM = 1\n",
    "HIDDEN_DIMS = (4, 10, 4)\n",
    "\n",
    "classification_model = LinearModelWithActivation(X_DIM, Y_DIM, HIDDEN_DIMS, sigmoid)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "1LkCDOL4ho1N"
   },
   "source": [
    "To prepare for evaluation, we have to do some array manipulation to get the data into the right shape. Here, `X` is expected by the model to be a 2D array; in this case, two rows corresponding to \"X\" and \"Y\", and in our moon data example, we have 200 samples. So the shape of `X` should be `(2, 200)`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "dyKrAzt0ho1O"
   },
   "outputs": [],
   "source": [
    "moon_data = np.vstack([moon_x, moon_y])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "w2z-3NW3ho1S"
   },
   "outputs": [],
   "source": [
    "# sanity check here\n",
    "assert len(moon_data.T) == len(moon_labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "CqCtDRFSho1X",
    "outputId": "62a421c5-7198-4cd8-8e1c-225d2cd0c2f5"
   },
   "outputs": [],
   "source": [
    "print(moon_data.shape)\n",
    "print(moon_labels.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "tzNDN3iXho1b"
   },
   "source": [
    "So now, we just need to come up with a way to measure how far off our model is. One of the most relatable is the mean-squared-error, which just computes the squared distance from the labels are from the predictions:\n",
    "\n",
    "$$ J = \\frac{1}{N} \\sum_n^N (\\hat{y}_n - y_n)^2 $$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "WMubSCkUho1c"
   },
   "outputs": [],
   "source": [
    "def mean_squared_error(y_pred: np.ndarray, y: np.ndarray):\n",
    "    return np.mean(np.square(y_pred - y))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "p7Q_aHVuho1i"
   },
   "source": [
    "We can compute how badly our initialized model performs simply by calling it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "8aNdSYMQho1j"
   },
   "outputs": [],
   "source": [
    "J = mean_squared_error(\n",
    "    classification_model(moon_data).flatten(), moon_labels\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "AfW_jexmho1n",
    "outputId": "4b2af183-8585-4a7f-b6c5-2e7256bf881e"
   },
   "outputs": [],
   "source": [
    "print(f\"Cost: {J:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Z0kHAgxyho1r"
   },
   "source": [
    "Actually not too bad, but we could do better right?\n",
    "\n",
    "## Parameter Update\n",
    "\n",
    "Now that we have our cost function, we need a way to guide our model to learn this classification problem. In order to do this, we need to compute gradients: we have our cost, and we have our parameter set, we just need to see how changing parameters changes the cost function.\n",
    "\n",
    "$$ \\nabla = \\frac{\\partial J}{\\partial \\theta}$$\n",
    "\n",
    "And the simplest way to update our parameters is to use good old gradient descent:\n",
    "\n",
    "$$ \\theta_{i+1} = \\theta_{i} - \\alpha \\nabla_i$$\n",
    "\n",
    "where $i$, $\\alpha$, and $\\nabla$ are the iteration indexes, a \"learning rate\" parameter, and the gradient respectively. In other words, the latest set of parameters are given by the last iteration's set of paramters, with some small value modifying the gradients. Here's our problem: we have to compute the gradient for _every single parameter_ with respect to $J$. What's more is that it's actually not as straightforward as it seems; since our the way our model works is through a forward pass: to compute $\\nabla$, we have to do the reverse operation, which is referred to __back-propagation__.\n",
    "\n",
    "Fortunately, there are a lot of smarter and harder working people than me: `jax` is a relatively [new library](https://jax.readthedocs.io/en/latest/index.html) by developers at Google that implement a bunch of nifty features, one of which is _automatic differentiation_. By using such a high level function, we can abstract away most of these problems without ever having to think about them.\n",
    "\n",
    "We do, however, need to make some modifications to our code. Once again, we're going to inherit from `LinearModelWithActivation` to create a `LinearModelWithGrad` class, which will implement an additional method that will compute the loss with respect to a set of parameters. This function will then be used be `jax.grad` to evaluate gradients for every single parameter in our model. Because of the way `jax` is written, we can't just use pure `numpy` functions and so some of the functions have been specially replaced with `jax.numpy` or `jpy` analogs, namely our loss function and our activation function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "EqpAPHBNho1s"
   },
   "outputs": [],
   "source": [
    "class LinearModelWithGrad(LinearModelWithActivation):\n",
    "    def __init__(self, X_dim: int, Y_dim: int, hidden_dims: Tuple[int], activation=None):\n",
    "        super().__init__(X_dim, Y_dim, hidden_dims, activation)\n",
    "        self.grads = list()\n",
    "    \n",
    "    def compute_loss_grad(self, params: List[List[np.ndarray]], X: np.ndarray, Y: np.ndarray):\n",
    "        # Function to evaluate the loss/cost function by providing a set of parameters\n",
    "        # the X data, and target values Y. The NumPy functions to evaluate the loss\n",
    "        # are actually replaced with `jax` analogs\n",
    "        self.w, self.b = params\n",
    "        J = jpy.mean(jpy.square(self.forward(X) - Y))\n",
    "        return J\n",
    "\n",
    "\n",
    "def jax_sigmoid(X: np.ndarray):\n",
    "    # Use the `jax` version of `exp` instead of NumPy\n",
    "    return 1 / (1 + jpy.exp(-X))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Sf0Gr6Ufho1x"
   },
   "source": [
    "## Training our model\n",
    "\n",
    "Putting it all together, we're going to train our neural network using gradient descent to learn to classify our two moons. To elaborate on the workflow, we're going to ask our model what it thinks based on $X$, or `moon_data`, and compute the cost $J$ to see how far off it was. Simultaneously, we're going to be using `jax` for its autodifferentiation, and compute the gradient of every $J$, with respect to every single parameter we have in the model. These gradients are then going to be used with gradient descent to update the parameters.\n",
    "\n",
    "We're going to keep doing this for a set number of iterations, which we'll call _epochs_. This is just to get familiar with the terminology that we'll encounter a lot more when we start diving deeper into the models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "nNY0BKQ4ho1y"
   },
   "outputs": [],
   "source": [
    "def gradient_descent_optimization(model, X, Y, epochs=10, lr=0.3):\n",
    "    for epoch in range(1, epochs + 1):\n",
    "        if epoch == 1:\n",
    "            params = (model.w, model.b)\n",
    "        # setup the automatic differentiation\n",
    "        gradient_function = value_and_grad(model.compute_loss_grad)\n",
    "        J, gradients = gradient_function(params, X, Y)\n",
    "        if epoch % 100 == 0:\n",
    "            print(f\"Loss for epoch {epoch}: {J:.4f}\")\n",
    "        # Separate into weights and biases\n",
    "        w_grads, b_grads = gradients\n",
    "        for w_i, w_grad in enumerate(w_grads):\n",
    "            # convert the jax arrays back into NumPy ones. Inefficient, but\n",
    "            # it works\n",
    "            params[0][w_i] = np.array(params[0][w_i]) - lr * w_grad\n",
    "        for b_i, b_grad in enumerate(b_grads):\n",
    "            params[1][b_i] = np.array(params[1][b_i]) - lr * b_grad\n",
    "    (model.w, model.b) = params"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "_ONH2Xosho13"
   },
   "outputs": [],
   "source": [
    "X_DIM = 2\n",
    "Y_DIM = 1\n",
    "HIDDEN_DIMS = (4, 10, 4)\n",
    "\n",
    "classification_model = LinearModelWithGrad(X_DIM, Y_DIM, HIDDEN_DIMS, jax_sigmoid)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "17kAMt7-ho17",
    "outputId": "21d1403e-0ac7-446c-d761-79b8901bc02e"
   },
   "outputs": [],
   "source": [
    "# Run our gradient descent optimization routine\n",
    "gradient_descent_optimization(classification_model, moon_data, moon_labels, epochs=1000, lr=1.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "iKOcsnO7ho2A"
   },
   "source": [
    "Based on the loss decreasing progressively, we can see that the model is actually improving with each epoch. You can actually play around with the settings above to see how much of an effect the number of epochs, and the learning rate `lr` affects the performance of the model.\n",
    "\n",
    "Finally, we're going to check our answer against the ground truth. We're going to run `moon_data` through our model again, and compare the predictions in two ways: first without rounding it to the nearest integer, and second with rounding. The first case gives a colormap of the exact probabilities: if we interpret the value as between [0,1], then 0.5 is basically the model saying it's unsure, and can go other way. By rounding the values, we make the plot much more binary and easily interpretable, at the cost of losing a dimension of our problem. While learning about machine learning early on, it's important to be cognizant of various diagnostics your model tries to tell you, typically through model uncertainty."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "TR7_Qoxhho2B"
   },
   "outputs": [],
   "source": [
    "predictions = np.array(classification_model(moon_data))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "qfeHzloAho2H",
    "outputId": "0656292e-e178-4468-fbaa-3606b27b9cb2"
   },
   "outputs": [],
   "source": [
    "fig, axarray = plt.subplots(1, 3, figsize=(9,3))\n",
    "\n",
    "axarray[0].scatter(moon_data[0,:], moon_data[1,:], c=moon_labels, cmap=\"Spectral\", lw=0.3,)\n",
    "axarray[0].set_title(\"True Labels\")\n",
    "\n",
    "axarray[1].scatter(moon_data[0,:], moon_data[1,:], c=predictions.flatten(), cmap=\"Spectral\", lw=0.3,)\n",
    "axarray[1].set_title(\"Predicted\")\n",
    "\n",
    "axarray[2].scatter(moon_data[0,:], moon_data[1,:], c=predictions.round().flatten(), cmap=\"Spectral\", lw=0.3,)\n",
    "axarray[2].set_title(\"Rounded Predictions\")\n",
    "\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "baeIUw4Yho2L"
   },
   "source": [
    "---\n",
    "\n",
    "# Summary\n",
    "\n",
    "This notebook has come a pretty far way: we started off by looking at why we would want to bother with neural network or deep learning models in the first place. Part of the intrigue is to let the data \"speak for itself\" by applying a blackbox (hopefully not so blackbox now), unsupervised algorithm to learn your problem. This way, you don't have to think about how to change your dataset to force certain aspects to conform to what your idea/hypothesis is, and the solution—if it exists—should come somewhat organically.\n",
    "\n",
    "With this in mind, in the first section we implemented our own low-level version of a neural network, the fully-connected/multilayer perceptron model using _only_ NumPy. We saw that at its core, these building blocks are nothing more than simple linear algebra, and as we progress to more and more difficult problems, we simply have to implement more complex solutions that are hopefully going to be abstracted away.\n",
    "\n",
    "Subsequently, we saw a brief primer on \"activation functions\", that transform the outputs of our neural networks, such that they can be used for various tasks outside of regression. We spent some time looking at how a small sampling platter of these functions can make the outputs look very different, and indeed make deep neural networks behave in extremely complicated fashions—despite given the same inputs—that are not easily predicted _a priori_.\n",
    "\n",
    "Having set up all of the core mechanics of neural networks, we finally looked at how to actually adapt them to a problem. In our toy task of classifying the two moons, we set ourselves up to find a way to evaluate our model performance, and based on this information, how to update our model appropriately. Here, we just used some of the simplest cost functions and optimization algorithms, the mean-squared loss and simple gradient descent. Ultimately, we showed that the model was able to learn appropriately, and provide some reasonable predictions."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "name": "01-fundamentals.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}