{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "34cd05b0-a7a9-4053-853c-2bc7f43ec0af",
   "metadata": {},
   "source": [
    "# Notebook 8: Binary Classifier Exercise [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mattsankner/micrograd/blob/main/mg8_binary_classifer.ipynb) [![View in nbviewer](https://img.shields.io/badge/view-nbviewer-orange)](https://nbviewer.jupyter.org/github/mattsankner/micrograd/blob/main/mg8_binary_classifier.ipynb)\n",
    "\n",
    "## Let's build a binary classifier with our micrograd engine! This will be able to tell whether random points on a graph are red or blue.\n",
    "\n",
    "Below, I take Andrej's example and expound on it with my notes about the code. You won't understand all of it just from the series of lectures, but it runs on what we already went through, although the code is slightly different, you should be able to understand what is going on, especially if you test with different outputs along the way."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "a13b84ac-cf6a-48da-b67d-a36f0bcdde58",
   "metadata": {},
   "outputs": [],
   "source": [
    "# This code is adapted from [Andrei Karpathy]\n",
    "# Copyright (c) [2020] [Andrei Karpathy]\n",
    "\n",
    "import random\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plot\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "0ccaba28-e1c0-4921-a800-cf7ca415cf10",
   "metadata": {},
   "outputs": [],
   "source": [
    "#import the micrograd we made\n",
    "from micrograd.engine import Value \n",
    "from micrograd.mlp import Neuron, Layer, MLP"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "ee213fe0-e4b6-4247-81f1-359dd8d26483",
   "metadata": {},
   "outputs": [],
   "source": [
    "#set random seeds for reproducibility. Numpy and random both have seeds set.\n",
    "#every time the code is run, it will generate the same random numbers, \n",
    "#which means that the dataset and any other random operations produce the same results\n",
    "np.random.seed(1337)\n",
    "random.seed(1337)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "668fa78c-f60a-4b75-9b03-9631db52c113",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.collections.PathCollection at 0x12ab56ed0>"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "",
      "text/plain": [
       "<Figure size 500x500 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from sklearn.datasets import make_moons, make_blobs\n",
    "\n",
    "# create dataset using 'make_moons' from sklearn\n",
    "# generates synthetic dataset with two interleaving half circles shape\n",
    "X, y = make_moons(n_samples=100, noise=0.1) #adds gaussian noise to the data\n",
    "\n",
    "# make y axis to be -1 to 1 instead of 0 to 1 (default)\n",
    "y = y * 2 - 1 \n",
    "\n",
    "# print dataset in 2D\n",
    "plot.figure(figsize=(5,5)) #figure width/height in inches\n",
    "\n",
    "# creates scatter plot of data with x and y coordinates. c=y sets color based on labels; s is size of the points, cmap=jet is colormap\n",
    "plot.scatter(X[:,0], X[:,1], c=y, s=20, cmap='jet')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87f068df-c06e-4dc7-aaa6-25684e84479e",
   "metadata": {},
   "source": [
    "### ReLU"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a64da87-28c4-4417-91f7-92e0deb5454a",
   "metadata": {},
   "source": [
    "Below, we initialize the neural network, which will be made with the micrograd I created in this repository. It is made with a ```ReLU``` function instead of a ```tanh()``` activation function, as Andrei coded his this way in his final product. \n",
    "\n",
    "Recall from the notebooks that without an activation functions to introduce non-linearity into the network, the network would not be able to learn complex patterns and functions; it would simply be equivalent to a single-layer lienar model, which would limit its modeling power. Activation functions enable stacking of multiple layers, where each layer learns different levels of abstraction. Different activation functions can control the range of the output values. \n",
    "\n",
    "After some research, here is the theory behind why we implemented ```ReLU()``` in micrograd and are using it here:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "373e1584-ec70-4973-8795-4552ff60f3ef",
   "metadata": {},
   "source": [
    "### ```tanh(): Hyperbolic Tangent```\n",
    "\n",
    "$$tanh(x)={\\frac {\\sinh x}{\\cosh x}}={\\frac {e^{x}-e^{-x}}{e^{x}+e^{-x}}}={\\frac {e^{2x}-1}{e^{2x}+1}}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40920a02-9b7c-4b47-a7c4-5c4f1e68a676",
   "metadata": {},
   "source": [
    "### ```ReLU(): Rectified Linear Unit```\n",
    "\n",
    "$$ReLU(x)={max(0, x)}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "daadd844-9230-4a1b-96cc-2797ff93898b",
   "metadata": {},
   "source": [
    "- While ```tanh()``` outputs between -1 and 1, ```ReLU``` outputs $0$ for any negative input and the input itself for any positive input\n",
    "\n",
    "- In the backward pass, ```tanh()```'s derivativce is $1-tanh(x)^2$, which can lead to small gradients for large values of $x$, potentially causing ```vanishing gradients```. The ```ReLU```'s derivative is $1$ for positive outputs and $0$ for negative inputs, making gradient calculation straightforward.\n",
    "\n",
    "- ```vanishing gradients```: when the gradients of the loss function with respect to the model parameters become very small, effectively approaching zero. This can severely slow down or even stop the training process becasue the updates to the model parameters become insignificant.\n",
    "  \n",
    "- For ```tanh()```, when the input values are are very large or very small, ```tanh(x)``` saturates to $-1$ or $1$, making the gradient close to $0$. Since these gradients get mutliplied through each layer, it results in exponentially smaller gradients continually, which makes it tough for the model to learn.\n",
    "\n",
    "- For ```ReLU()```, it doesn't squash the gradients into a small range, so it doesn't have this problem. It suffers from a different problem, however, known as the ```dying ReLU``` porblem, where neurons get stuck with a gradient of $0$ if they always output $0$.\n",
    "\n",
    "- Note: another common activation function is ```sigmoid(x)```, which outputs values between $0$ and $1$, and is useful for binary classification, although they can also suffer from the vanishing gradient problem. Its function is \n",
    "$${\\frac {1}{1 + {e^{-x}}}}$$"
   ]
  },
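  {
   "cell_type": "markdown",
   "id": "9d2f4a1e-6b3c-4e7a-8f10-2a5c7e9b1d01",
   "metadata": {},
   "source": [
    "To put rough numbers on the saturation described above, here is a small worked example. The inputs $x = 0$ and $x = 3$ and the depth of three layers are illustrative choices, not values taken from this notebook, and the weights that would also appear in the chain rule are ignored.\n",
    "\n",
    "$$\\frac{d}{dx}\\tanh(x) = 1 - \\tanh^2(x)$$\n",
    "\n",
    "$$x = 0:\\quad 1 - \\tanh^2(0) = 1 - 0 = 1$$\n",
    "\n",
    "$$x = 3:\\quad 1 - \\tanh^2(3) \\approx 1 - (0.99505)^2 \\approx 0.0099$$\n",
    "\n",
    "If three saturated ```tanh()``` units in a row each contribute a local derivative of about $0.0099$, their product is roughly $10^{-6}$, so almost no gradient reaches the earlier layers. For ```ReLU```, each active unit contributes a factor of exactly $1$, so the same product stays at $1$."
   ]
  },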
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "6b517b85-43be-427d-b8a4-9f7e5c64c0c9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "parameter count:  337\n"
     ]
    }
   ],
   "source": [
    "model = MLP(2, [16, 16, 1]) # 2-layer neural network: 2 inputs, 2 hidden layers with 16 \n",
    "\n",
    "print(\"parameter count: \", len(model.parameters()))"
   ]
  },
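  {
   "cell_type": "markdown",
   "id": "4c8a7e25-1d9b-4f36-a2e0-7b3d6c5f8a02",
   "metadata": {},
   "source": [
    "The parameter count printed above can be checked by hand. This sketch assumes each neuron holds one weight per input plus one bias; if your ```Neuron``` class is built differently, the total will differ too.\n",
    "\n",
    "$$\\underbrace{16 \\cdot (2 + 1)}_{\\text{hidden layer 1}} + \\underbrace{16 \\cdot (16 + 1)}_{\\text{hidden layer 2}} + \\underbrace{1 \\cdot (16 + 1)}_{\\text{output layer}} = 48 + 272 + 17 = 337$$\n",
    "\n",
    "The result matches the ```parameter count``` reported by ```len(model.parameters())```."
   ]
  },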
  {
   "cell_type": "markdown",
   "id": "fc916bfb-ea8d-4ff2-8ab9-b631f23bf759",
   "metadata": {},
   "source": [
    "## Observe the loss function below. Use comments and explanations below the code block to aod your understanding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "76595edd-91ca-4d47-875d-86a9cfe6fbc9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total loss:  0.8958441028683222 , Accuracy:  50.0 %\n"
     ]
    }
   ],
   "source": [
    "# loss function\n",
    "def loss(batch_size=None): #compare predictions to labels. Xb = data; yb = labels\n",
    "    \n",
    "    #batch selection\n",
    "    if batch_size is None:\n",
    "        Xb, yb = X, y # Use the entire dataset\n",
    "    else:\n",
    "        ri = np.random.permutation(X.shape[0])[:batch_size] # randomly select a batch of batch_size, if specified\n",
    "        Xb, yb = X[ri], y[ri] #randomly select a subset of the data\n",
    "        \n",
    "    inputs = [list(map(Value, xrow)) for xrow in Xb] # Convert input rows to Value objects for automatic differentiation\n",
    "    \n",
    "    #passing the inputs through the model to compute the scores (predictions)\n",
    "    #map function applies a given function (model; an mlp instance) to each item of an iterable(inputs; list of input vectors) and returns an iterator (object)\n",
    "    scores = list(map(model, inputs))\n",
    "    \n",
    "    # calculates \"max-margin\" loss for each prediction and applies ReLU(). \n",
    "    #for each y and score for the given x input to mlp, calculate loss\n",
    "    losses = [(1 + -yi*scorei).relu() for yi, scorei in zip(yb, scores)]\n",
    "    data_loss = sum(losses) * (1.0 / len(losses)) #dataloss = average of these individual losses\n",
    "    \n",
    "    #adds an L2 regularization (weight decay) term to the loss to prevent overfitting\n",
    "    alpha = 1e-4\n",
    "    reg_loss = alpha * sum((p*p for p in model.parameters())) #regularization term: alpha * sum of p^2 for all params p in model\n",
    "    total_loss = data_loss + reg_loss #total loss = sum of data loss and regularization loss\n",
    "    \n",
    "    #calculate accuracy... measures proportion of correct predictions, determines if predictions are correct\n",
    "    #checks if sign of true label yi and predictions scorei are both positive -> then the prediction is correct\n",
    "    #results in a list of boolean values for correct and incorrect predictions inside accuracy\n",
    "    accuracy = [(yi > 0) == (scorei.data > 0) for yi, scorei in zip(yb, scores)] \n",
    "\n",
    "    #sum(accuracy) = number of true values in accuracy list (num of correct predictions)\n",
    "    #len(accuracy) = total number of predictions\n",
    "    #sum/len = accuracy as a fraction of correct predictions\n",
    "    return total_loss, sum(accuracy) / len(accuracy) #return loss and accuracy\n",
    "\n",
    "total_loss, acc = loss() #compute total loss and accuracy for current model state\n",
    "print(\"Total loss: \", total_loss.data, \", Accuracy: \", acc*100, \"%\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e41bd0cb-a9f4-4bb6-83b6-8d6548a98375",
   "metadata": {},
   "source": [
    "As you can see, the initial loss is high accuracy is low."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35f97082-5027-443e-b8fd-11e5388c2814",
   "metadata": {},
   "source": [
    "## Explanation:\n",
    "This is the code that gives us the losses using ```ReLU``` for each pair of ```yi``` and ```scorei```:\n",
    "``` python\n",
    "losses = [(1 + -yi*scorei).relu() for yi, scorei in zip(yb, scores)]\n",
    "```\n",
    "You could write its main function as:\n",
    "```python\n",
    "loss = ReLU(1 + -y[i] * score[i]\n",
    "```\n",
    "\n",
    "### ReLU():\n",
    "```ReLU(x) = max(0,x)```\n",
    "- it outputs x if x is positive, and 0 if x is negative. This non-linearity allows the net to learn complex functions.\n",
    "\n",
    "\n",
    "### Hinge Loss:\n",
    "The hinge loss function is commonly used for ```max-margin``` classification, such as in support vector machines(svm's). The hinge loss for a single prediction is defined as:\n",
    "\n",
    "```loss = max(0,1 - y[i] * score[i]```\n",
    "\n",
    "Here, ```y[i]``` is the true label, either $1$ or $-1$, and ```score[i]``` is the predicted score. The hinge loss penalizes the predictions if it does not exceed the margin (set to $1$ here) in the correct direction. \n",
    "\n",
    "### Margin-Based Loss:\n",
    "In the code, the ```ReLU``` function is used as a part of caluclating the hinge loss to get the ```margin-based loss```. The ```ReLU``` is taking the ```max()``` of $0$ and ```1 - y[i] * score[i]```. \n",
    "\n",
    "This ensures that the hinge loss will be $0$ or $>0$, because it's taking the ```max()``` of $0$ and the ```hinge loss value```. \n",
    "\n",
    "The hinge loss penalizes predictions that are correct but not confident enough (within the margin) or incorrect. If the product of ```y[i] * score[i] < 1```, the prediction is either incorrect or not within a safe margin. For ```y[i]>0```, the score should be greater than $1$, \n",
    "and for ```y[i]<0```, the score should be less than $-1$. \n",
    "\n",
    "### True Labels ($y[i]$)):\n",
    "\n",
    "For binary classification, the true labels are typically either $+1$ or $-1$. In this example, $( yb[0] = 1 ) and ( yb[1] = -1 )$.\n",
    "\n",
    "### Predicted Scores ($score[i]$)):\n",
    "\n",
    "These are the outputs of the model. In the example, we have $score[0] = 0.8$ and $score[1] = -0.6$.\n",
    "\n",
    "Adding $1$ to this product shifts the hinge loss so that it is zero or positive:\n",
    "\n",
    "- For correctly classified examples with a sufficient margin,$1- -y[i] * score[i]$ will be less than $0$, and ReLU will make it $0$.\n",
    "- For incorrectly classified examples or those within the margin, $1 - -y[i] * score[i]$ will be positive, and ReLU will keep the positive value.\n",
    "\n",
    "\n",
    "### Examples:\n",
    "\n",
    "Prediction is correct but not confident enough (within margin), resulting in small loss.\n",
    "- ```y[i] = 1; score[i] = 0.8``` -> $1 - y[i] * score[i] = 1 - 1 * 0.8 = 0.2$   -> ```ReLU(0.2) = 0.2)```\n",
    "  \n",
    "Prediction is incorrect, resulting in a significant loss:\n",
    "- ```y[i] = 1; score[i] = -0.5``` -> $1 - y[i] * score[i] = 1 - 1 * (-0.5) = 1 + 0.5 = 1.5$   ->   ```ReLU(1.5) = 1.5```\n",
    "\n",
    "Prediction is incorrect:\n",
    "- ```y[i] = -1; score[i] = 0.3```  -> $1 - y[i] * score[i] = 1- (-1) * 0.3 = 1 + 0.3 = 1.3$    -> ```ReLU(1.3) = 1.3```\n",
    "\n",
    "Prediction is correct but not condient enough:\n",
    "- ```y[i] = -1; score[i] = 0.7```   -> $1 -y[i] * score[i] = 1 - (-1) * (-0.7) = 1 - 0.7 = 0.3$   ->   ```ReLU(0.3) = 0.3```"
   ]
  },
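  {
   "cell_type": "markdown",
   "id": "e5b07c13-8a2d-4d69-b4f7-3c1a9e6d2f03",
   "metadata": {},
   "source": [
    "Because the engine backpropagates through this loss, it can help to see its derivative with respect to the score. The piecewise form below follows from differentiating $\\max(0, 1 - y_i s_i)$, where $s_i$ stands for ```score[i]```, $y_i$ for ```y[i]```, and $\\ell_i$ for the loss of one example (notation introduced here for brevity). Exactly at the kink $1 - y_i s_i = 0$ the derivative is not defined; this sketch assumes the common convention of using $0$ there.\n",
    "\n",
    "$$\\frac{\\partial \\ell_i}{\\partial s_i} = \\begin{cases} -y_i & \\text{if } 1 - y_i s_i > 0 \\\\ 0 & \\text{otherwise} \\end{cases}$$\n",
    "\n",
    "For ```y[i] = 1; score[i] = 0.8```, we have $1 - y_i s_i = 0.2 > 0$, so the derivative is $-1$: raising the score lowers the loss. A point with ```y[i] = 1; score[i] = 1.5``` has zero loss and zero derivative, so it no longer influences the update."
   ]
  },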
  {
   "cell_type": "markdown",
   "id": "6c72e8ce-17de-4e34-8e47-99f1dfe75c72",
   "metadata": {},
   "source": [
    "### L2 Regularization:\n",
    "\n",
    "```python\n",
    "alpha = 1e-4 #a hyperparameter; the regularization strength set to a small value\n",
    "reg_loss = alpha * sum((p*p for p in model.parameters())) #scales the sum of squared parameters by the regularization strength\n",
    "total_loss = data_loss + reg_loss \n",
    "#total loss = sum of data loss(loss computed from predictions and true labels using hinge loss and relu) \n",
    "#reg_loss = regularization loss (l2)\n",
    "```\n",
    "\n",
    "Penalizes large weights in the model to prevent overfitting. This encourages the model to keep the weigths smaller, which often leads to better generalization. \n",
    "\n",
    "Combining the two losses ```data_loss``` and ```reg_loss``` gives the total loss used for backpropogation. This ensures that the optimization process considers both the fit to the data and complexity of the model through regularization."
   ]
  },
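  {
   "cell_type": "markdown",
   "id": "2f6d9b84-7c1e-4a05-9e38-5d0b4a7c6e04",
   "metadata": {},
   "source": [
    "Putting the two terms together, the quantity that ```loss()``` returns as ```total_loss``` can be written in one line. The symbols $N$ (number of examples in the batch), $s_i$ (the model's score for example $i$), and $\\theta$ (the list returned by ```model.parameters()```) are notation chosen here, not names from the code.\n",
    "\n",
    "$$L(\\theta) = \\frac{1}{N}\\sum_{i=1}^{N}\\max(0,\\, 1 - y_i s_i) + \\alpha \\sum_{p \\in \\theta} p^2, \\qquad \\alpha = 10^{-4}$$\n",
    "\n",
    "Differentiating the regularization term with respect to a single parameter $p$ gives $2\\alpha p$, so every gradient step also pulls each parameter toward zero by an amount proportional to its size. That is why this term is also called weight decay."
   ]
  },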
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "b038d110-2424-4e15-9715-09684428c6b1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "step 1 loss 0.8958441028683222, accuracy 50.0%\n",
      "step 2 loss 1.723590533697202, accuracy 81.0%\n",
      "step 3 loss 0.7429006313851131, accuracy 77.0%\n",
      "step 4 loss 0.7705641260584201, accuracy 82.0%\n",
      "step 5 loss 0.3692793385976537, accuracy 84.0%\n",
      "step 6 loss 0.313545481918522, accuracy 86.0%\n",
      "step 7 loss 0.28142343497724337, accuracy 89.0%\n",
      "step 8 loss 0.26888733313983904, accuracy 91.0%\n",
      "step 9 loss 0.2567147286057417, accuracy 91.0%\n",
      "step 10 loss 0.27048625516379227, accuracy 91.0%\n",
      "step 11 loss 0.24507023853658041, accuracy 91.0%\n",
      "step 12 loss 0.25099055297915024, accuracy 92.0%\n",
      "step 13 loss 0.21560951851922952, accuracy 91.0%\n",
      "step 14 loss 0.2309037844640272, accuracy 93.0%\n",
      "step 15 loss 0.20152151227899456, accuracy 92.0%\n",
      "step 16 loss 0.2257450627928221, accuracy 93.0%\n",
      "step 17 loss 0.19447987596204114, accuracy 92.0%\n",
      "step 18 loss 0.21089496199246363, accuracy 93.0%\n",
      "step 19 loss 0.15983077356303607, accuracy 94.0%\n",
      "step 20 loss 0.1845374874688391, accuracy 93.0%\n",
      "step 21 loss 0.18977522856087642, accuracy 91.0%\n",
      "step 22 loss 0.19072704042579647, accuracy 93.0%\n",
      "step 23 loss 0.11733695088756486, accuracy 97.0%\n",
      "step 24 loss 0.12173524408232453, accuracy 95.0%\n",
      "step 25 loss 0.12615712612770455, accuracy 95.0%\n",
      "step 26 loss 0.1604909778080166, accuracy 95.0%\n",
      "step 27 loss 0.18747197705245813, accuracy 92.0%\n",
      "step 28 loss 0.16741837891059413, accuracy 95.0%\n",
      "step 29 loss 0.0958658349145541, accuracy 97.0%\n",
      "step 30 loss 0.08778783707420913, accuracy 96.0%\n",
      "step 31 loss 0.11731297569011852, accuracy 95.0%\n",
      "step 32 loss 0.09340146460619832, accuracy 97.0%\n",
      "step 33 loss 0.12454454903103455, accuracy 95.0%\n",
      "step 34 loss 0.07984002652777279, accuracy 97.0%\n",
      "step 35 loss 0.07727519232921685, accuracy 97.0%\n",
      "step 36 loss 0.07661250143094467, accuracy 98.0%\n",
      "step 37 loss 0.10610492379198366, accuracy 96.0%\n",
      "step 38 loss 0.09062808429265971, accuracy 99.0%\n",
      "step 39 loss 0.10671887043036933, accuracy 95.0%\n",
      "step 40 loss 0.052256599219758566, accuracy 98.0%\n",
      "step 41 loss 0.060160098952344594, accuracy 100.0%\n",
      "step 42 loss 0.08596724533333948, accuracy 96.0%\n",
      "step 43 loss 0.051121079431796, accuracy 99.0%\n",
      "step 44 loss 0.05240142401642842, accuracy 97.0%\n",
      "step 45 loss 0.04530684179001555, accuracy 100.0%\n",
      "step 46 loss 0.07211073370655112, accuracy 97.0%\n",
      "step 47 loss 0.03334238651310243, accuracy 99.0%\n",
      "step 48 loss 0.03143222795751121, accuracy 100.0%\n",
      "step 49 loss 0.036585367471115224, accuracy 99.0%\n",
      "step 50 loss 0.048291393823902774, accuracy 99.0%\n",
      "step 51 loss 0.09875114765619629, accuracy 96.0%\n",
      "step 52 loss 0.05449063965875469, accuracy 99.0%\n",
      "step 53 loss 0.033926794357082984, accuracy 100.0%\n",
      "step 54 loss 0.0526151726356845, accuracy 97.0%\n",
      "step 55 loss 0.03250295251424938, accuracy 99.0%\n",
      "step 56 loss 0.028883273872078112, accuracy 100.0%\n",
      "step 57 loss 0.041391511040272486, accuracy 98.0%\n",
      "step 58 loss 0.0189874074261285, accuracy 100.0%\n",
      "step 59 loss 0.02523833523883749, accuracy 100.0%\n",
      "step 60 loss 0.02079656521341884, accuracy 100.0%\n",
      "step 61 loss 0.03259711157810239, accuracy 99.0%\n",
      "step 62 loss 0.017863351693480307, accuracy 100.0%\n",
      "step 63 loss 0.023008717832211728, accuracy 100.0%\n",
      "step 64 loss 0.02207932546358146, accuracy 100.0%\n",
      "step 65 loss 0.029432917853529764, accuracy 99.0%\n",
      "step 66 loss 0.016251514644091865, accuracy 100.0%\n",
      "step 67 loss 0.028468534483264557, accuracy 99.0%\n",
      "step 68 loss 0.01399436554620875, accuracy 100.0%\n",
      "step 69 loss 0.015552344843651311, accuracy 100.0%\n",
      "step 70 loss 0.0338911994616018, accuracy 99.0%\n",
      "step 71 loss 0.01422987006592695, accuracy 100.0%\n",
      "step 72 loss 0.013255281583285525, accuracy 100.0%\n",
      "step 73 loss 0.012300277590022099, accuracy 100.0%\n",
      "step 74 loss 0.01267605249835594, accuracy 100.0%\n",
      "step 75 loss 0.02059381195595479, accuracy 100.0%\n",
      "step 76 loss 0.011845398205364475, accuracy 100.0%\n",
      "step 77 loss 0.016012697472883017, accuracy 100.0%\n",
      "step 78 loss 0.02545836023922218, accuracy 100.0%\n",
      "step 79 loss 0.014382930289661939, accuracy 100.0%\n",
      "step 80 loss 0.011698962425818023, accuracy 100.0%\n",
      "step 81 loss 0.012318500800515733, accuracy 100.0%\n",
      "step 82 loss 0.014121117031464264, accuracy 100.0%\n",
      "step 83 loss 0.011664591962446257, accuracy 100.0%\n",
      "step 84 loss 0.011589314549188712, accuracy 100.0%\n",
      "step 85 loss 0.010990299347735225, accuracy 100.0%\n",
      "step 86 loss 0.01098922672069161, accuracy 100.0%\n",
      "step 87 loss 0.010988193757655071, accuracy 100.0%\n",
      "step 88 loss 0.010987200447388702, accuracy 100.0%\n",
      "step 89 loss 0.010986246779084923, accuracy 100.0%\n",
      "step 90 loss 0.010985332742365272, accuracy 100.0%\n",
      "step 91 loss 0.010984458327280172, accuracy 100.0%\n",
      "step 92 loss 0.010983623524308863, accuracy 100.0%\n",
      "step 93 loss 0.010982828324359071, accuracy 100.0%\n",
      "step 94 loss 0.010982072718767003, accuracy 100.0%\n",
      "step 95 loss 0.010981356699297043, accuracy 100.0%\n",
      "step 96 loss 0.010980680258141725, accuracy 100.0%\n",
      "step 97 loss 0.010980043387921506, accuracy 100.0%\n",
      "step 98 loss 0.010979446081684675, accuracy 100.0%\n",
      "step 99 loss 0.010978888332907227, accuracy 100.0%\n",
      "step 100 loss 0.010978370135492717, accuracy 100.0%\n"
     ]
    }
   ],
   "source": [
    "# optimization\n",
    "for k in range(100):\n",
    "    \n",
    "    # forward, compute total loss and accuracy\n",
    "    total_loss, acc = loss()\n",
    "    \n",
    "    # backward\n",
    "    model.zero_grad() #reset grads\n",
    "    total_loss.backward() \n",
    "    \n",
    "    # update params of model (sgd)\n",
    "    learning_rate = 1.0 - 0.9*k/100 #learning rate goes from 1.0 - .9 *0/100 -> 1.0 - .9 * 99/100 = 1.0 -> .109\n",
    "    for p in model.parameters():\n",
    "        p.data -= learning_rate * p.grad #updates data of each p by subtracting product of learning rate and gradient\n",
    "        #moves p in direction of reduced loss\n",
    "    \n",
    "    if k % 1 == 0:\n",
    "        print(f\"step {k + 1} loss {total_loss.data}, accuracy {acc*100}%\")"
   ]
  },
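  {
   "cell_type": "markdown",
   "id": "7a3e5d60-0b4f-4c92-8d17-6e9c2b1f4a05",
   "metadata": {},
   "source": [
    "The update rule and the decaying learning rate in the loop above can be written out explicitly. Here $\\eta_k$ denotes ```learning_rate``` at step ```k``` and $L$ denotes ```total_loss``` (both symbols are introduced here); the sample values are plain arithmetic on the formula in the code.\n",
    "\n",
    "$$p \\leftarrow p - \\eta_k \\frac{\\partial L}{\\partial p}, \\qquad \\eta_k = 1 - 0.9\\,\\frac{k}{100}$$\n",
    "\n",
    "$$\\eta_0 = 1.0, \\qquad \\eta_{50} = 0.55, \\qquad \\eta_{99} = 0.109$$\n",
    "\n",
    "A plausible reading of the printed log is that the large early steps move the parameters quickly (and sometimes overshoot, as the jump in loss at step 2 suggests), while the smaller late steps let the loss settle."
   ]
  },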
  {
   "cell_type": "markdown",
   "id": "66ea69ba-33c4-4b54-82b8-be5e9b27e01a",
   "metadata": {},
   "source": [
    "The model improves significantly, achieving $100%$ ```accuracy``` with very low ```loss.```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f3044c64-2cf9-4dad-a825-9d802607688c",
   "metadata": {},
   "source": [
    "### Below we visualize the decision boundary of the trained model. \n",
    "\n",
    "It creates a grid of points over the input space, predicts the class for each point using the model, and then plots the decision boundary along with the original data points. \n",
    "\n",
    "The ```h``` parameter controls the resolution of the grid, and the use of ```X[:, 0]``` and ```X[:, 1]``` extracts the ```x``` and ```y``` coordinates of the data points. The mesh grid creation and prediction steps ensure that the decision boundary is accurately plotted based on the model's predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "312e23f3-a082-4324-9a3f-decb40ac2260",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(-1.548639298268643, 1.951360701731357)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "",
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# visualize decision boundary\n",
    "\n",
    "h = 0.25 #step size for mesh grid. Defines resolution of grid for plotting decision boundary. Smaller vals = finer grid\n",
    "\n",
    "#x_min, x_max; y_min, y_max are the minimum and maximum values of the x and y-coordinates of the data points, \n",
    "#extended by 1 unit on each side for better visualization.\n",
    "x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1\n",
    "y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1\n",
    "\n",
    "#np.arrange() generates  values from x_min to x_max and y_min to y_max with a step size of h.\n",
    "#np.meshgrid takes these ranges and creates two 2D arrays xx and yy representing the grid coordinates. \n",
    "#xx contains the x-coordinates and yy contains the y-coordinates of the grid point\n",
    "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
    "                     np.arange(y_min, y_max, h))\n",
    "\n",
    "#xx.ravel() and yy.ravel() flatten the 2D arrays xx and yy into 1D arrays\n",
    "#np.c_[...] combines these 1D arrays column-wise to form a 2D array Xmesh where each row is a point on the grid.\n",
    "Xmesh = np.c_[xx.ravel(), yy.ravel()]\n",
    "\n",
    "#each row of Xmesh is converted into a list of Value objects (used for automatic differentiation by the micrograd library)\n",
    "#this step is necessary for the model to process the inputs and make predictions\n",
    "inputs = [list(map(Value, xrow)) for xrow in Xmesh]\n",
    "\n",
    "#map(model, inputs) applies the model to each point in inputs, producing a list of scores (predictions)\n",
    "scores = list(map(model, inputs))\n",
    "\n",
    "#s.data > 0 checks if the score is positive, indicating that the model predicts the positive class for that point\n",
    "#np.array([...]) converts the list of boolean values into a NumPy array.\n",
    "Z = np.array([s.data > 0 for s in scores])\n",
    "\n",
    "#reshapes this 1D array back into the 2D shape of the grid, matching xx and yy.\n",
    "Z = Z.reshape(xx.shape)\n",
    "\n",
    "fig = plot.figure() #new figure for plotting\n",
    "plot.contourf(xx, yy, Z, cmap=plot.cm.Spectral, alpha=0.8) #plot decision boundary\n",
    "plot.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plot.cm.Spectral) #plot original data points\n",
    "plot.xlim(xx.min(), xx.max()) #sets the limits of the x-axis to the range of the grid\n",
    "plot.ylim(yy.min(), yy.max()) #sets the limits of the y-axis to the range of the grid"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c0f29f3-6750-4cc4-a219-b6751d9723d0",
   "metadata": {},
   "source": [
    "### Above, our neural network has  learned to separate the two classes well, as indicated by the clear boundary between the orange and green regions. The red and blue points correspond to the two classes, and they are on the correct side of the boundary.\n",
    "\n",
    "# Summary:\n",
    "\n",
    "- The model starts with an initial ```accuracy = 50%``` and ```loss = 0.896```.\n",
    "- Through 100 iterations of training with Stochastic Gradient Descent, the model improves significantly, achieving final ```accuracy = 100%``` and a very low ```loss=0.011```.\n",
    "- The decision boundary visualization confirms that the model has learned to separate the two classes effectively\n",
    "- This process demonstrates the effectiveness of the neural netwrok in learning a non-linear decision boundary to classify the ```make_moons``` dataset accurately."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}