{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kpE4QC7IZRsL"
   },
   "source": [
    "# Lecture 10.3: Deep Neural Network Optimization and Regularization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ntxFpMejb3Ek"
   },
   "source": [
    "\n",
    "## **1. Stochastic Gradient Descent (SGD)**\n",
    "### **1.1 Introduction**\n",
    "\n",
    "Stochastic Gradient Descent (SGD) is an iterative method for optimizing objective functions and is especially well suited to large datasets. Standard (batch) gradient descent computes the gradient over all the training data before every parameter update, which is slow when the dataset is large. SGD instead uses a single sample (or a small batch of samples, known as mini-batch SGD) per iteration to compute the gradient and update the parameters, so each update is far cheaper.\n",
    "\n",
    "* Difference between Standard Gradient Descent and Stochastic Gradient Descent\n",
    "    - **Standard Gradient Descent:** Uses all training data to compute the gradient and update the model parameters.\n",
    "    - **Stochastic Gradient Descent:** Uses one sample (or a mini-batch of samples) per iteration to compute the gradient and update parameters.\n",
    "\n",
    "* Advantage\n",
    "    - Each update needs the gradient of only one sample (or one mini-batch), so the cost per iteration is much lower than for a full pass over a large dataset.\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "id": "FrC0hCbqD0Mn"
   },
   "outputs": [],
   "source": [
    "import torch.nn as nn\n",
    "import torch.optim as optim\n",
    "import torchvision\n",
    "import torchvision.transforms as transforms\n",
    "from torch.utils.data import DataLoader\n",
    "import torch\n",
    "import torch.nn.functional as F"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "KR05-tQ6EFvP"
   },
   "source": [
    "#### **Mathematical Formulation**\n",
    "\n",
    "Stochastic Gradient Descent (SGD) is an optimization technique used to minimize an objective function (or, after negating it, to maximize one) that can be written as a sum of differentiable functions. The basic update rule is:\n",
    "\n",
    "$$  \\theta^{(t+1)} = \\theta^{(t)} - \\alpha^{(t)} \\nabla f(\\theta^{(t)}) .$$\n",
    "\n",
    "Here,\n",
    "- $\\theta^{(t+1)}$ represents the updated parameter at the next iteration,\n",
    "- $ \\theta^{(t)} $ is the current parameter,\n",
    "- $ \\alpha^{(t)} $ is the learning rate, a hyperparameter that determines the step size at each iteration while moving towards a minimum of the objective function,\n",
    "- $ \\nabla f(\\theta^{(t)}) $ is the gradient of the objective function $ f $ with respect to the parameters, evaluated at $ \\theta^{(t)} $.\n",
    "\n",
    "#### **Difference between SGD and Gradient Descent**\n",
    "- **Gradient Descent (Batch Gradient Descent):**\n",
    "$$  \\theta^{(t+1)} = \\theta^{(t)} - \\alpha^{(t)} \\nabla_{\\theta^{(t)}}\\frac{1}{M}\\sum_{i=1}^M\\ell (f(x^{(i)};\\theta^{(t)}),y^{(i)}). $$\n",
    "Here, $ (x^{(i)},y^{(i)}) $ is the $ i $-th training sample and $ M $ is the total number of training samples. In the formulas of this subsection, $ f(x^{(i)};\\theta^{(t)}) $ denotes the model's prediction (not the objective function above) and $ \\ell $ is the per-sample loss. This method averages the gradient over all training samples to update the parameters.\n",
    "\n",
    "- **Stochastic Gradient Descent (SGD):**\n",
    "$$  \\theta^{(t+1)} = \\theta^{(t)} - \\alpha^{(t)} \\nabla_{\\theta^{(t)}} \\ell (f(x^{(i)};\\theta^{(t)}),y^{(i)}). $$\n",
    "Here, $ i $ is the index of a single randomly selected training sample, so each update uses the gradient of that one sample only.\n",
    "\n",
    "- **Mini-batch Stochastic Gradient Descent:**\n",
    "$$ \\theta^{(t+1)} = \\theta^{(t)} - \\alpha^{(t)}  \\nabla_{\\theta^{(t)}} \\frac{1}{B}\\sum_{i=1}^{B} \\ell (f(x^{(i)};\\theta^{(t)}),y^{(i)}). $$\n",
    "Here, $ B $ represents the batch size. This method uses the average gradient calculated using a mini-batch of training samples to update the parameters.\n",
    "\n",
    "\n",
    "\n",
    "#### **Summary**\n",
    "\n",
    "In SGD, we update our parameters using the gradient of the objective function with respect to the parameters, based on a single training example at each iteration, unlike gradient descent that uses the whole training dataset to compute the gradient. This leads to faster iterations and is especially useful when dealing with large datasets.\n",
    "\n"
   ]
  },
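  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### **Worked Example: One SGD Step**\n",
    "\n",
    "To make the single-sample update rule concrete, here is one step computed by hand. The setup is an assumption chosen purely for illustration and does not come from the lecture data: a scalar linear model $f(x;\\theta)=\\theta x$, the squared-error loss $\\ell(\\hat{y},y)=\\frac{1}{2}(\\hat{y}-y)^2$, one sample $(x^{(i)},y^{(i)})=(2,3)$, a starting value $\\theta^{(0)}=1$ and a learning rate $\\alpha^{(0)}=0.1$.\n",
    "\n",
    "$$ \\nabla_{\\theta^{(0)}}\\ell (f(x^{(i)};\\theta^{(0)}),y^{(i)}) = (\\theta^{(0)}x^{(i)}-y^{(i)})\\,x^{(i)} = (1\\cdot 2-3)\\cdot 2 = -2, $$\n",
    "$$ \\theta^{(1)} = \\theta^{(0)} - \\alpha^{(0)}\\cdot(-2) = 1 + 0.2 = 1.2. $$\n",
    "\n",
    "The loss on this sample drops from $\\frac{1}{2}(2-3)^2=0.5$ to $\\frac{1}{2}(2.4-3)^2=0.18$, which is what a descent step with a suitably small learning rate should do."
   ]
  },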
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tZji5KjPzSRH"
   },
   "source": [
    "### **1.2 Implementation**\n",
    "\n",
    "**Goal:** Implement a Manual SGD optimizer class to optimize the parameters of a given model.\n",
    "\n",
    "To do so, we will define a `ManualSGD` class, which contains the parameters to be optimized, the method to perform a step of optimization, and the method to zero out the gradients.\n",
    "\n",
    "In this class:\n",
    "- The constructor `__init__(self, params, lr)` initializes the optimizer with `params` (the parameters to be optimized) and `lr` (the learning rate).\n",
    "- The `step` method updates the parameters using their gradient and the learning rate.\n",
    "- The `zero_grad` method zeros out the gradients of the parameters to prepare for the next optimization step.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "id": "ewQUQMHaD5hc"
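   },
   "outputs": [],
   "source": [
    "class ManualSGD:\n",
    "    \"\"\"Plain SGD: theta <- theta - lr * grad, with no momentum or weight decay.\"\"\"\n",
    "\n",
    "    def __init__(self, params, lr):\n",
    "        # Materialize the iterator so the parameters can be visited on every step.\n",
    "        self.params = list(params)\n",
    "        self.lr = lr\n",
    "\n",
    "    def step(self):\n",
    "        # Update in place without recording the operation in the autograd graph.\n",
    "        with torch.no_grad():\n",
    "            for param in self.params:\n",
    "                if param.grad is not None:\n",
    "                    param -= self.lr * param.grad\n",
    "\n",
    "    def zero_grad(self):\n",
    "        # Gradients accumulate across backward() calls, so reset them before each one.\n",
    "        for param in self.params:\n",
    "            if param.grad is not None:\n",
    "                param.grad.zero_()"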
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6hL7iN_oCj7e"
   },
   "source": [
    "## **2. Adaptive Moment Estimation (Adam)**\n",
    "### **2.1 Introduction**\n",
    "The optimization algorithm plays a crucial role in any deep-learning project. SGD has been used in many successful deep-learning projects and research experiments.\\\n",
    "But SGD has its limitations: in particular, its hyperparameters, such as the learning rate, often need extensive tuning. The Adam optimization algorithm has become a popular alternative.\n",
    "#### **Mathematical Formulation**\n",
    "Adam is an adaptive learning rate method: it computes individual learning rates for different parameters. Its name is derived from adaptive moment estimation, because it uses estimates of the first and second moments of the gradient to adapt the learning rate $(\\alpha)$ for each weight of the neural network. The moment variables are initialized as $s^{(0)}=0$ and $r^{(0)}=0$. The gradient estimate is computed as follows:\n",
    "$$\\hat{g}= \\nabla_{\\theta^{(t)}}\\ell (f(x^{(i)};\\theta^{(t)}),y^{(i)}). $$\n",
    "Here, $ (x^{(i)},y^{(i)}) $ represents one training sample and $ i $ is the index of a randomly selected sample; in practice the gradient is averaged over a mini-batch, as in mini-batch SGD.\\\n",
    "The gradient of the cost function of the neural network can be considered a random variable, since it is usually evaluated on a small random batch of data. To estimate its moments, Adam uses exponential moving averages of the gradient evaluated on the current mini-batch:\n",
    "$$ s^{(t+1)}=\\rho_{1}s^{(t)}+(1-\\rho_{1})\\hat{g} $$\n",
    "$$ r^{(t+1)}=\\rho_{2}r^{(t)}+(1-\\rho_{2})\\hat{g}\\odot\\hat{g}, $$\n",
    "where $s$ and $r$ are the moving averages of the gradient and of its element-wise square, and $\\rho_1,\\rho_2$ are decay-rate hyperparameters of the algorithm, with commonly used default values of $0.9$ and $0.999$ respectively.\\\n",
    "Because $s$ and $r$ are initialized at zero, these moving averages are biased toward zero during the first iterations. We therefore correct the estimators so that their expected values are the ones we want; this step is usually referred to as bias correction. The final formulas for our estimators are:\n",
    "$$ \\hat s=\\frac{s^{(t+1)}}{1-\\rho^{t+1}_{1}} $$\n",
    "$$ \\hat r=\\frac{r^{(t+1)}}{1-\\rho^{t+1}_{2}}. $$\n",
    "The only thing left to do is to use those moving averages to scale the learning rate individually for each parameter. The weights are updated as follows:\n",
    "$$ \\theta^{(t+1)}=\\theta^{(t)}-\\alpha\\frac{\\hat s}{\\sqrt{\\hat r}+\\delta }, $$\n",
    "where $ \\theta^{(t)} $ is the current weight, $\\theta^{(t+1)}$ is the updated weight at the next iteration, $\\alpha$ is the global learning rate, and the square root and division are applied element-wise. The small constant $\\delta$ prevents division by zero; a typical choice is $\\delta =10^{-8}$, a good trade-off between numerical stability and fidelity."
   ]
  },
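  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### **Worked Example: The First Adam Step**\n",
    "As a sanity check on the formulas above, the first update can be worked out by hand. This is a short derivation under the stated initialization $s^{(0)}=0$, $r^{(0)}=0$ and the default values $\\rho_1=0.9$, $\\rho_2=0.999$; it is meant as an illustration, not as an additional result from the lecture. At $t=0$ the moving averages are\n",
    "$$ s^{(1)}=(1-\\rho_{1})\\hat{g}=0.1\\,\\hat{g}, \\qquad r^{(1)}=(1-\\rho_{2})\\hat{g}\\odot\\hat{g}=0.001\\,\\hat{g}\\odot\\hat{g}, $$\n",
    "and bias correction divides them by $1-\\rho_1^{1}=0.1$ and $1-\\rho_2^{1}=0.001$:\n",
    "$$ \\hat s=\\hat{g}, \\qquad \\hat r=\\hat{g}\\odot\\hat{g}. $$\n",
    "The update therefore becomes\n",
    "$$ \\theta^{(1)}=\\theta^{(0)}-\\alpha\\frac{\\hat{g}}{|\\hat{g}|+\\delta}\\approx\\theta^{(0)}-\\alpha\\,\\mathrm{sign}(\\hat{g}) \\quad \\text{when } |\\hat{g}|\\gg\\delta, $$\n",
    "so every coordinate with a non-negligible gradient moves by roughly $\\alpha$ on the first step, whatever the scale of its gradient. Without bias correction, $s^{(1)}$ would understate the gradient by a factor of $10$ and $r^{(1)}$ would understate its square by a factor of $1000$."
   ]
  },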
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CtL_RULIFiuq"
   },
   "source": [
    "\n",
    "### **2.2 Implementation**\n",
    "\n",
    "**Goal:** Implement a Manual Adam optimizer class to optimize the parameters of a given model.\\\n",
    "To do so, we will define a `ManualAdam` class, which uses the moment estimates above to give each parameter its own effective learning rate.\n",
    "\n",
    "In this class:\n",
    "- The constructor `__init__(self, params, lr, beta1, beta2, epsilon)` initializes the optimizer with `params` (the parameters to be optimized), `lr` (the learning rate $\\alpha$), `beta1` and `beta2` (the decay rates of the first and second moment estimates, $\\rho_1$ and $\\rho_2$ in the formulas above, with defaults $0.9$ and $0.999$) and `epsilon` (the small constant $\\delta$ that prevents division by zero, with default $10^{-8}$). The buffers `m` and `v` hold the moment estimates $s$ and $r$.\n",
    "- The `step` method updates the moment estimates from the current gradients, applies bias correction, and then updates each parameter.\n",
    "- The `zero_grad` method zeros the gradients of the model parameters so that gradients from the previous iteration do not accumulate into the next one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "id": "VM-pwtKMdgSI"
   },
   "outputs": [],
   "source": [
    "class ManualAdam:\n",
    "    def __init__(self, params, lr=1e-3, beta1=0.9, beta2=0.999, epsilon=1e-8):\n",
    "        self.params = list(params)\n",
    "        self.lr = lr\n",
    "        self.beta1 = beta1\n",
    "        self.beta2 = beta2\n",
    "        self.epsilon = epsilon\n",
    "        # First (m) and second (v) moment estimates, one buffer per parameter.\n",
    "        self.m = [torch.zeros_like(p) for p in self.params]\n",
    "        self.v = [torch.zeros_like(p) for p in self.params]\n",
    "        self.t = 0  # number of steps taken, used for bias correction\n",
    "\n",
    "    def step(self):\n",
    "        self.t += 1\n",
    "        # Update in place without recording the operations in the autograd graph.\n",
    "        with torch.no_grad():\n",
    "            for i, param in enumerate(self.params):\n",
    "                if param.grad is None:\n",
    "                    continue\n",
    "                # Exponential moving averages of the gradient and of its square.\n",
    "                self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * param.grad\n",
    "                self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * param.grad ** 2\n",
    "\n",
    "                # Bias correction: both averages start at zero.\n",
    "                m_hat = self.m[i] / (1 - self.beta1 ** self.t)\n",
    "                v_hat = self.v[i] / (1 - self.beta2 ** self.t)\n",
    "\n",
    "                # Element-wise scaled update of the parameter.\n",
    "                param -= self.lr * m_hat / (torch.sqrt(v_hat) + self.epsilon)\n",
    "\n",
    "    def zero_grad(self):\n",
    "        for param in self.params:\n",
    "            if param.grad is not None:\n",
    "                param.grad.zero_()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "at55FNccEbYc"
   },
   "source": [
    "## **3. Comparisons with PyTorch SGD and Adam Optimizer**\n",
    "### **3.1 Model Architecture and Dataset**\n",
    "#### **Model Architecture**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "id": "SxK5L7cZEbFK"
   },
   "outputs": [],
   "source": [
    "class CNN(nn.Module):\n",
    "    def __init__(self):\n",
    "        super(CNN, self).__init__()\n",
    "        self.conv1 = nn.Conv2d(3, 6, 5)\n",
    "        self.pool = nn.MaxPool2d(2, 2)\n",
    "        self.conv2 = nn.Conv2d(6, 16, 5)\n",
    "        self.fc1 = nn.Linear(16 * 5 * 5, 120)\n",
    "        self.fc2 = nn.Linear(120, 84)\n",
    "        self.fc3 = nn.Linear(84, 10)\n",
    "\n",
    "    def forward(self, x):\n",
    "        x = self.pool(F.relu(self.conv1(x)))\n",
    "        x = self.pool(F.relu(self.conv2(x)))\n",
    "        x = x.view(-1, 16 * 5 * 5)\n",
    "        x = F.relu(self.fc1(x))\n",
    "        x = F.relu(self.fc2(x))\n",
    "        x = self.fc3(x)\n",
    "        return x"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "id": "-BcZZIz9EhsS"
   },
   "outputs": [],
   "source": [
    "def evaluate_model(model, testloader):\n",
    "    correct = 0\n",
    "    total = 0\n",
    "    model.eval()\n",
    "    with torch.no_grad():\n",
    "        for data in testloader:\n",
    "            images, labels = data\n",
    "            outputs = model(images)\n",
    "            _, predicted = torch.max(outputs.data, 1)\n",
    "            total += labels.size(0)\n",
    "            correct += (predicted == labels).sum().item()\n",
    "    return 100 * correct / total"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wf9cFgqGEkkp"
   },
   "source": [
    "#### **Dataset**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "OdfO-2vaEkrP",
    "outputId": "af3bc5dd-0dff-4a79-dba4-7d17048836d9"
   },
   "outputs": [],
   "source": [
    "transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])\n",
    "trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)\n",
    "trainloader = DataLoader(trainset, batch_size=4, shuffle=True)\n",
    "testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)\n",
    "testloader = DataLoader(testset, batch_size=4, shuffle=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "YigQLtIEEyO2"
   },
   "source": [
    "### **3.2 Comparison with PyTorch SGD**\n",
    "#### **Result of Manual SGD**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Kn6dWZXAEyUT",
    "outputId": "7821b4ae-f960-4142-fc11-1cdbdcbcf5f2"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Manual SGD, 1,  2000] loss: 2.303\n",
      "[Manual SGD, 1,  4000] loss: 2.299\n",
      "[Manual SGD, 1,  6000] loss: 2.295\n",
      "[Manual SGD, 1,  8000] loss: 2.280\n",
      "[Manual SGD, 1, 10000] loss: 2.215\n",
      "[Manual SGD, 1, 12000] loss: 2.124\n",
      "[Manual SGD, 2,  2000] loss: 2.032\n",
      "[Manual SGD, 2,  4000] loss: 1.970\n",
      "[Manual SGD, 2,  6000] loss: 1.925\n",
      "[Manual SGD, 2,  8000] loss: 1.880\n",
      "[Manual SGD, 2, 10000] loss: 1.825\n",
      "[Manual SGD, 2, 12000] loss: 1.782\n",
      "Finished Training with Manual SGD\n",
      "Accuracy of the network on the 10000 test images using Manual SGD: 37 %\n"
     ]
    }
   ],
   "source": [
    "model_manual_sgd = CNN()\n",
    "criterion = nn.CrossEntropyLoss()\n",
    "optimizer_manual = ManualSGD(model_manual_sgd.parameters(), lr=0.001)\n",
    "\n",
    "for epoch in range(2):\n",
    "    running_loss = 0.0\n",
    "    for i, data in enumerate(trainloader, 0):\n",
    "        inputs, labels = data\n",
    "        optimizer_manual.zero_grad()\n",
    "        outputs = model_manual_sgd(inputs)\n",
    "        loss = criterion(outputs, labels)\n",
    "        loss.backward()\n",
    "        optimizer_manual.step()\n",
    "\n",
    "        running_loss += loss.item()\n",
    "        if i % 2000 == 1999:\n",
    "            print('[Manual SGD, %d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 2000))\n",
    "            running_loss = 0.0\n",
    "\n",
    "print('Finished Training with Manual SGD')\n",
    "model_manual_sgd.eval()\n",
    "accuracy_manual_sgd = evaluate_model(model_manual_sgd, testloader)\n",
    "print('Accuracy of the network on the 10000 test images using Manual SGD: %d %%' % accuracy_manual_sgd)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "F1qlqKc8FWKC"
   },
   "source": [
    "#### **Result of PyTorch SGD**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "9WZJYJBqFWpn",
    "outputId": "e6d35f1e-0e8f-46af-cad4-d73fffff254e"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Torch SGD, 1,  2000] loss: 2.303\n",
      "[Torch SGD, 1,  4000] loss: 2.300\n",
      "[Torch SGD, 1,  6000] loss: 2.294\n",
      "[Torch SGD, 1,  8000] loss: 2.271\n",
      "[Torch SGD, 1, 10000] loss: 2.216\n",
      "[Torch SGD, 1, 12000] loss: 2.142\n",
      "[Torch SGD, 2,  2000] loss: 2.066\n",
      "[Torch SGD, 2,  4000] loss: 2.010\n",
      "[Torch SGD, 2,  6000] loss: 1.948\n",
      "[Torch SGD, 2,  8000] loss: 1.912\n",
      "[Torch SGD, 2, 10000] loss: 1.868\n",
      "[Torch SGD, 2, 12000] loss: 1.824\n",
      "Finished Training with PyTorch SGD\n",
      "Accuracy of the network on the 10000 test images using PyTorch SGD: 35 %\n"
     ]
    }
   ],
   "source": [
    "model_torch_sgd = CNN()\n",
    "criterion = nn.CrossEntropyLoss()\n",
    "optimizer_torch = optim.SGD(model_torch_sgd.parameters(), lr=0.001)\n",
    "\n",
    "for epoch in range(2):\n",
    "    running_loss = 0.0\n",
    "    for i, data in enumerate(trainloader, 0):\n",
    "        inputs, labels = data\n",
    "        optimizer_torch.zero_grad()\n",
    "        outputs = model_torch_sgd(inputs)\n",
    "        loss = criterion(outputs, labels)\n",
    "        loss.backward()\n",
    "        optimizer_torch.step()\n",
    "\n",
    "        running_loss += loss.item()\n",
    "        if i % 2000 == 1999:\n",
    "            print('[Torch SGD, %d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 2000))\n",
    "            running_loss = 0.0\n",
    "\n",
    "print('Finished Training with PyTorch SGD')\n",
    "model_torch_sgd.eval()\n",
    "accuracy_torch_sgd = evaluate_model(model_torch_sgd, testloader)\n",
    "print('Accuracy of the network on the 10000 test images using PyTorch SGD: %d %%' % accuracy_torch_sgd)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "NYV0663-dm8y"
   },
   "source": [
    "### **3.3 Comparison with PyTorch Adam**\n",
    "#### **Result of Manual Adam**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "jL51HLq3dnCO",
    "outputId": "582d4fef-5697-4099-edb9-24fe0a349db2"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Manual Adam, 1,  2000] loss: 1.896\n",
      "[Manual Adam, 1,  4000] loss: 1.631\n",
      "[Manual Adam, 1,  6000] loss: 1.527\n",
      "[Manual Adam, 1,  8000] loss: 1.490\n",
      "[Manual Adam, 1, 10000] loss: 1.426\n",
      "[Manual Adam, 1, 12000] loss: 1.389\n",
      "[Manual Adam, 2,  2000] loss: 1.311\n",
      "[Manual Adam, 2,  4000] loss: 1.305\n",
      "[Manual Adam, 2,  6000] loss: 1.293\n",
      "[Manual Adam, 2,  8000] loss: 1.278\n",
      "[Manual Adam, 2, 10000] loss: 1.268\n",
      "[Manual Adam, 2, 12000] loss: 1.234\n",
      "Finished Training with Manual Adam\n",
      "Accuracy of the network on the 10000 test images using Manual Adam: 56 %\n"
     ]
    }
   ],
   "source": [
    "model_manual_adam = CNN()\n",
    "criterion = nn.CrossEntropyLoss()\n",
    "optimizer_manual_adam = ManualAdam(model_manual_adam.parameters(), lr=0.001)\n",
    "\n",
    "\n",
    "for epoch in range(2):\n",
    "    running_loss = 0.0\n",
    "    for i, data in enumerate(trainloader, 0):\n",
    "        inputs, labels = data\n",
    "        optimizer_manual_adam.zero_grad()\n",
    "        outputs = model_manual_adam(inputs)\n",
    "        loss = criterion(outputs, labels)\n",
    "        loss.backward()\n",
    "        optimizer_manual_adam.step()\n",
    "\n",
    "        running_loss += loss.item()\n",
    "        if i % 2000 == 1999:\n",
    "            print('[Manual Adam, %d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 2000))\n",
    "            running_loss = 0.0\n",
    "\n",
    "print('Finished Training with Manual Adam')\n",
    "model_manual_adam.eval()\n",
    "accuracy_manual_adam = evaluate_model(model_manual_adam, testloader)\n",
    "print('Accuracy of the network on the 10000 test images using Manual Adam: %d %%' % accuracy_manual_adam)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qxbHZlwLeLd1"
   },
   "source": [
    "#### **Result of PyTorch Adam**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "gdHdmSI3eLjL",
    "outputId": "7bcb1010-5f66-4335-f0e6-63de2eab515f"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Torch Adam, 1,  2000] loss: 1.889\n",
      "[Torch Adam, 1,  4000] loss: 1.622\n",
      "[Torch Adam, 1,  6000] loss: 1.536\n",
      "[Torch Adam, 1,  8000] loss: 1.488\n",
      "[Torch Adam, 1, 10000] loss: 1.449\n",
      "[Torch Adam, 1, 12000] loss: 1.397\n",
      "[Torch Adam, 2,  2000] loss: 1.327\n",
      "[Torch Adam, 2,  4000] loss: 1.324\n",
      "[Torch Adam, 2,  6000] loss: 1.314\n",
      "[Torch Adam, 2,  8000] loss: 1.285\n",
      "[Torch Adam, 2, 10000] loss: 1.264\n",
      "[Torch Adam, 2, 12000] loss: 1.241\n",
      "Finished Training with PyTorch Adam\n",
      "Accuracy of the network on the 10000 test images using PyTorch Adam: 55 %\n"
     ]
    }
    ],
   "source": [
    "model_torch_adam = CNN()\n",
    "criterion = nn.CrossEntropyLoss()\n",
    "optimizer_torch = optim.Adam(model_torch_adam.parameters(), lr=0.001)\n",
    "\n",
    "for epoch in range(2):\n",
    "    running_loss = 0.0\n",
    "    for i, data in enumerate(trainloader, 0):\n",
    "        inputs, labels = data\n",
    "        optimizer_torch.zero_grad()\n",
    "        outputs = model_torch_adam(inputs)\n",
    "        loss = criterion(outputs, labels)\n",
    "        loss.backward()\n",
    "        optimizer_torch.step()\n",
    "\n",
    "        running_loss += loss.item()\n",
    "        if i % 2000 == 1999:\n",
    "            print('[Torch Adam, %d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 2000))\n",
    "            running_loss = 0.0\n",
    "\n",
    "print('Finished Training with PyTorch Adam')\n",
    "model_torch_adam.eval()\n",
    "accuracy_torch_adam = evaluate_model(model_torch_adam, testloader)\n",
    "print('Accuracy of the network on the 10000 test images using PyTorch Adam: %d %%' % accuracy_torch_adam)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "AkDN7SkDilrk"
   },
   "source": [
    "## **4. Batch Normalization, Weight Initialization and Learning Rate Schedule**\n",
    "### **4.1 Batch Normalization**\n",
    "**Batch normalization**, often abbreviated as \"batchnorm\", normalizes the inputs to a layer using the mean and variance computed over the current mini-batch and then applies a learnable scale and shift. It offers an elegant way to reparameterize nearly any deep neural network, and it substantially alleviates the challenge of coordinating updates across many layers."
   ]
  },
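  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### **The Batch Normalization Transform**\n",
    "To make the description above concrete, the transform is usually written as follows. This is the standard textbook form rather than something derived in this lecture, and the symbols $\\mu_B$, $\\sigma_B^2$, $\\gamma$, $\\beta$ and $\\epsilon$ are introduced here only for illustration. For the values $x_1,\\dots,x_B$ of one feature over a mini-batch (for `nn.BatchNorm2d`, the values of one channel over all samples and spatial positions):\n",
    "$$ \\mu_B=\\frac{1}{B}\\sum_{i=1}^{B}x_i, \\qquad \\sigma_B^2=\\frac{1}{B}\\sum_{i=1}^{B}(x_i-\\mu_B)^2, $$\n",
    "$$ \\hat{x}_i=\\frac{x_i-\\mu_B}{\\sqrt{\\sigma_B^2+\\epsilon}}, \\qquad y_i=\\gamma\\hat{x}_i+\\beta, $$\n",
    "where $\\gamma$ and $\\beta$ are learnable parameters and $\\epsilon$ is a small constant for numerical stability. In evaluation mode (`model.eval()`), PyTorch's batch-normalization layers use running averages of $\\mu_B$ and $\\sigma_B^2$ collected during training instead of the statistics of the current batch. With a batch size as small as the $4$ used in this notebook, the batch statistics in the fully connected layers are estimated from only four values per feature and may be noisy."
   ]
  },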
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "13di6zUkjOu1"
   },
   "source": [
    "#### **Model Architecture and Model Training**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "IuE5hlZeil1J",
    "outputId": "632e92dc-cd37-44f6-f2cb-6aeedb55b488"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Torch Adam, 1,  2000] loss: 2.026\n",
      "[Torch Adam, 1,  4000] loss: 1.878\n",
      "[Torch Adam, 1,  6000] loss: 1.805\n",
      "[Torch Adam, 1,  8000] loss: 1.745\n",
      "[Torch Adam, 1, 10000] loss: 1.714\n",
      "[Torch Adam, 1, 12000] loss: 1.697\n",
      "[Torch Adam, 2,  2000] loss: 1.668\n",
      "[Torch Adam, 2,  4000] loss: 1.680\n",
      "[Torch Adam, 2,  6000] loss: 1.652\n",
      "[Torch Adam, 2,  8000] loss: 1.610\n",
      "[Torch Adam, 2, 10000] loss: 1.619\n",
      "[Torch Adam, 2, 12000] loss: 1.579\n",
      "Finished Training with BN\n",
      "Accuracy of the network on the 10000 test images using BN: 52 %\n"
     ]
    }
   ],
   "source": [
    "class CNNWithBN(nn.Module):\n",
    "    def __init__(self):\n",
    "        super(CNNWithBN, self).__init__()\n",
    "\n",
    "        # Convolution Layer 1\n",
    "        self.conv1 = nn.Conv2d(3, 6, 5)\n",
    "        self.bn1 = nn.BatchNorm2d(6)\n",
    "        self.pool = nn.MaxPool2d(2, 2)\n",
    "\n",
    "        # Convolution Layer 2\n",
    "        self.conv2 = nn.Conv2d(6, 16, 5)\n",
    "        self.bn2 = nn.BatchNorm2d(16)\n",
    "\n",
    "        # Fully Connected Layer 1\n",
    "        self.fc1 = nn.Linear(16 * 5 * 5, 120)\n",
    "        self.bn3 = nn.BatchNorm1d(120)\n",
    "\n",
    "        # Fully Connected Layer 2\n",
    "        self.fc2 = nn.Linear(120, 84)\n",
    "        self.bn4 = nn.BatchNorm1d(84)\n",
    "\n",
    "        # Output Layer\n",
    "        self.fc3 = nn.Linear(84, 10)\n",
    "\n",
    "    def forward(self, x):\n",
    "        x = self.pool(F.relu(self.bn1(self.conv1(x))))\n",
    "        x = self.pool(F.relu(self.bn2(self.conv2(x))))\n",
    "        x = x.view(-1, 16 * 5 * 5)\n",
    "        x = F.relu(self.bn3(self.fc1(x)))\n",
    "        x = F.relu(self.bn4(self.fc2(x)))\n",
    "        x = self.fc3(x)\n",
    "        return x\n",
    "\n",
    "model_torch_adam = CNNWithBN()\n",
    "criterion = nn.CrossEntropyLoss()\n",
    "optimizer_torch = optim.Adam(model_torch_adam.parameters(),lr=0.001)\n",
    "\n",
    "for epoch in range(2):\n",
    "    running_loss = 0.0\n",
    "    for i, data in enumerate(trainloader, 0):\n",
    "        inputs, labels = data\n",
    "        optimizer_torch.zero_grad()\n",
    "        outputs = model_torch_adam(inputs)\n",
    "        loss = criterion(outputs, labels)\n",
    "        loss.backward()\n",
    "        optimizer_torch.step()\n",
    "\n",
    "        running_loss += loss.item()\n",
    "        if i % 2000 == 1999:\n",
    "            print('[Torch Adam, %d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 2000))\n",
    "            running_loss = 0.0\n",
    "\n",
    "print('Finished Training with BN')\n",
    "model_torch_adam.eval()\n",
    "accuracy_torch_adam = evaluate_model(model_torch_adam, testloader)\n",
    "print('Accuracy of the network on the 10000 test images using BN: %d %%' % accuracy_torch_adam)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6cu3elYjlYMo"
   },
   "source": [
    "### **4.2 Weight Initialization**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "w1q4oFO3nKTH"
   },
   "source": [
    "#### **Kaiming Normal Initialization**\n",
    "**Kaiming Normal**, also known as He Normal initialization, is a method for initializing the weights of neural network layers. It draws the weights from a zero-mean normal distribution whose variance is scaled by the layer's fan-in, so that activation magnitudes stay roughly constant from layer to layer when ReLU-type activations are used. It is widely used in convolutional neural networks (CNNs) and deep feedforward neural networks."
   ]
  },
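  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### **Worked Example: Kaiming Normal Scale**\n",
    "To give a sense of the numbers involved, the standard deviation used by `nn.init.kaiming_normal_` can be computed by hand. This sketch assumes the function's default arguments (`mode='fan_in'` and the default nonlinearity settings, which give a gain of $\\sqrt{2}$); other arguments would change the result. The weights are drawn from $\\mathcal{N}(0,\\sigma^2)$ with\n",
    "$$ \\sigma=\\sqrt{\\frac{2}{n_{\\mathrm{in}}}}, $$\n",
    "where $n_{\\mathrm{in}}$ is the fan-in of the layer, that is, the number of inputs feeding each output unit. For `fc1 = nn.Linear(16 * 5 * 5, 120)` the fan-in is $400$, so $\\sigma=\\sqrt{2/400}\\approx 0.071$. For `conv1 = nn.Conv2d(3, 6, 5)` the fan-in is $3\\cdot 5\\cdot 5=75$, so $\\sigma=\\sqrt{2/75}\\approx 0.163$."
   ]
  },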
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "JinTN80blYUM",
    "outputId": "7aa6ca1c-8089-4c62-90eb-ea7eb5506c11"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[kaiming_normal, 1,  2000] loss: 2.070\n",
      "[kaiming_normal, 1,  4000] loss: 1.904\n",
      "[kaiming_normal, 1,  6000] loss: 1.820\n",
      "[kaiming_normal, 1,  8000] loss: 1.787\n",
      "[kaiming_normal, 1, 10000] loss: 1.740\n",
      "[kaiming_normal, 1, 12000] loss: 1.738\n",
      "[kaiming_normal, 2,  2000] loss: 1.676\n",
      "[kaiming_normal, 2,  4000] loss: 1.669\n",
      "[kaiming_normal, 2,  6000] loss: 1.630\n",
      "[kaiming_normal, 2,  8000] loss: 1.651\n",
      "[kaiming_normal, 2, 10000] loss: 1.621\n",
      "[kaiming_normal, 2, 12000] loss: 1.601\n",
      "Finished Training with kaiming_normal\n",
      "Accuracy of the network on the 10000 test images using kaiming_normal: 52 %\n"
     ]
    }
    ],
   "source": [
    "def init_weights_kaiming(m):\n",
    "    # Kaiming (He) normal weights and a small constant bias for conv/linear layers.\n",
    "    if isinstance(m, (nn.Conv2d, nn.Linear)):\n",
    "        nn.init.kaiming_normal_(m.weight)\n",
    "        nn.init.constant_(m.bias, 0.01)\n",
    "\n",
    "\n",
    "def init_weights_xavier(m):\n",
    "    # Xavier (Glorot) uniform weights and a small constant bias for conv/linear layers.\n",
    "    if isinstance(m, (nn.Conv2d, nn.Linear)):\n",
    "        nn.init.xavier_uniform_(m.weight)\n",
    "        nn.init.constant_(m.bias, 0.01)\n",
    "\n",
    "\n",
    "model_kaiming = CNNWithBN()\n",
    "model_kaiming.apply(init_weights_kaiming)  # apply() visits every submodule\n",
    "criterion = nn.CrossEntropyLoss()\n",
    "optimizer_kaiming = optim.Adam(model_kaiming.parameters(), lr=0.001)\n",
    "\n",
    "\n",
    "\n",
    "for epoch in range(2):\n",
    "    running_loss = 0.0\n",
    "    for i, data in enumerate(trainloader, 0):\n",
    "        inputs, labels = data\n",
    "        optimizer_kaiming.zero_grad()\n",
    "        outputs = model_kaiming(inputs)\n",
    "        loss = criterion(outputs, labels)\n",
    "        loss.backward()\n",
    "        optimizer_kaiming.step()\n",
    "\n",
    "        running_loss += loss.item()\n",
    "        if i % 2000 == 1999:\n",
    "            print('[kaiming_normal, %d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 2000))\n",
    "            running_loss = 0.0\n",
    "\n",
    "print('Finished Training with kaiming_normal')\n",
    "model_kaiming.eval()\n",
    "accuracy_torch_adam = evaluate_model(model_kaiming, testloader)\n",
    "print('Accuracy of the network on the 10000 test images using kaiming_normal: %d %%' % accuracy_torch_adam)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "R7l772UPmnDG"
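   },
   "source": [
    "#### **Xavier Uniform Initialization**\n",
    "**Xavier Uniform Initialization**, also known as Glorot Initialization, is a technique for initializing the weights of neural networks. It draws the weights from a zero-mean uniform distribution whose range depends on the layer's fan-in and fan-out, with the aim of keeping the variance of activations and gradients roughly constant across layers. It is named after Xavier Glorot, one of its creators."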
   ]
  },
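  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### **Worked Example: Xavier Uniform Range**\n",
    "The range used by `nn.init.xavier_uniform_` can likewise be computed by hand. This sketch assumes the function's default `gain=1`; a different gain would scale the bound proportionally. The weights are drawn from the uniform distribution $\\mathcal{U}(-a,a)$ with\n",
    "$$ a=\\sqrt{\\frac{6}{n_{\\mathrm{in}}+n_{\\mathrm{out}}}}, $$\n",
    "where $n_{\\mathrm{in}}$ and $n_{\\mathrm{out}}$ are the fan-in and fan-out of the layer. For `fc1 = nn.Linear(16 * 5 * 5, 120)` we have $n_{\\mathrm{in}}=400$ and $n_{\\mathrm{out}}=120$, so $a=\\sqrt{6/520}\\approx 0.107$. For `conv1 = nn.Conv2d(3, 6, 5)` we have $n_{\\mathrm{in}}=3\\cdot 25=75$ and $n_{\\mathrm{out}}=6\\cdot 25=150$, so $a=\\sqrt{6/225}\\approx 0.163$."
   ]
  },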
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "L53s8HrWmnJp",
    "outputId": "28a3193d-1214-4857-a200-c39d20f7e6ab"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[xavier_uniform, 1,  2000] loss: 2.060\n",
      "[xavier_uniform, 1,  4000] loss: 1.899\n",
      "[xavier_uniform, 1,  6000] loss: 1.830\n",
      "[xavier_uniform, 1,  8000] loss: 1.809\n",
      "[xavier_uniform, 1, 10000] loss: 1.760\n",
      "[xavier_uniform, 1, 12000] loss: 1.728\n",
      "[xavier_uniform, 2,  2000] loss: 1.672\n",
      "[xavier_uniform, 2,  4000] loss: 1.672\n",
      "[xavier_uniform, 2,  6000] loss: 1.651\n",
      "[xavier_uniform, 2,  8000] loss: 1.606\n",
      "[xavier_uniform, 2, 10000] loss: 1.607\n",
      "[xavier_uniform, 2, 12000] loss: 1.593\n",
      "Finished Training with xavier_uniform\n",
      "Accuracy of the network on the 10000 test images using xavier_uniform: 52 %\n"
     ]
    }
   ],
   "source": [
    "model_xavier = CNNWithBN()\n",
    "model_xavier.apply(init_weights_xavier)\n",
    "criterion = nn.CrossEntropyLoss()\n",
    "optimizer_xavier = optim.Adam(model_xavier.parameters(), lr=0.001)\n",
    "\n",
    "\n",
    "\n",
    "for epoch in range(2):\n",
    "    running_loss = 0.0\n",
    "    for i, data in enumerate(trainloader, 0):\n",
    "        inputs, labels = data\n",
    "        optimizer_xavier.zero_grad()\n",
    "        outputs = model_xavier(inputs)\n",
    "        loss = criterion(outputs, labels)\n",
    "        loss.backward()\n",
    "        optimizer_xavier.step()\n",
    "\n",
    "        running_loss += loss.item()\n",
    "        if i % 2000 == 1999:\n",
    "            print('[xavier_uniform, %d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 2000))\n",
    "            running_loss = 0.0\n",
    "\n",
    "print('Finished Training with xavier_uniform')\n",
    "model_xavier.eval()\n",
    "accuracy_xavier = evaluate_model(model_xavier, testloader)\n",
    "print('Accuracy of the network on the 10000 test images using xavier_uniform: %d %%' % accuracy_xavier)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "N5TOmkVegP-0"
   },
   "source": [
    "### **4.3 Learning Rate Schedule**\n",
    "Learning rate scheduling adjusts the learning rate during training according to a predefined rule instead of keeping it fixed. A common pattern is to start with a larger rate for fast initial progress and lower it later so that training settles more stably.\n",
    "\n",
    "#### **StepLR**\n",
    "\n",
    "The **StepLR** scheduler multiplies the learning rate by a factor `gamma` every `step_size` epochs. The schedule only advances when `scheduler.step()` is called, normally once per epoch."
   ]
  },
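  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the step rule concrete, here is a short worked sketch. It assumes `scheduler.step()` is called once per epoch, and the symbols $\\eta_0$ (initial rate), $\\gamma$ (`gamma`), $s$ (`step_size`), and $t$ (number of completed epochs) are notation introduced here for illustration.\n",
    "\n",
    "$$\\eta_t = \\eta_0 \\cdot \\gamma^{\\lfloor t / s \\rfloor}$$\n",
    "\n",
    "With the values used in the next cell ($\\eta_0 = 0.001$, $\\gamma = 0.95$, $s = 30$), the rate stays at $0.001$ for $t = 0, \\dots, 29$ and first drops to $0.00095$ at $t = 30$. A run of only 2 epochs therefore never reaches the first decay step."
   ]
  },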
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "oG86MwnbaCH8",
    "outputId": "af6b3d2c-fb8c-4736-dde6-7aebcb471cf6"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[StepLR, 1,  2000] loss: 1.985\n",
      "[StepLR, 1,  4000] loss: 1.850\n",
      "[StepLR, 1,  6000] loss: 1.784\n",
      "[StepLR, 1,  8000] loss: 1.732\n",
      "[StepLR, 1, 10000] loss: 1.686\n",
      "[StepLR, 1, 12000] loss: 1.675\n",
      "[StepLR, 2,  2000] loss: 1.615\n",
      "[StepLR, 2,  4000] loss: 1.601\n",
      "[StepLR, 2,  6000] loss: 1.580\n",
      "[StepLR, 2,  8000] loss: 1.604\n",
      "[StepLR, 2, 10000] loss: 1.574\n",
      "[StepLR, 2, 12000] loss: 1.539\n",
      "Finished Training with StepLR schedule\n",
      "Accuracy of the network on the 10000 test images using StepLR: 53 %\n"
     ]
    }
   ],
   "source": [
    "from torch.optim import lr_scheduler\n",
    "model_step_lr = CNNWithBN()\n",
    "optimizer_step = optim.Adam(model_step_lr.parameters(), lr=0.001)\n",
    "criterion = nn.CrossEntropyLoss()\n",
    "# Multiply the learning rate by gamma once every step_size epochs.\n",
    "scheduler = lr_scheduler.StepLR(optimizer_step, step_size=30, gamma=0.95)\n",
    "\n",
    "for epoch in range(2):\n",
    "    running_loss = 0.0\n",
    "    for i, data in enumerate(trainloader, 0):\n",
    "        inputs, labels = data\n",
    "        optimizer_step.zero_grad()\n",
    "        outputs = model_step_lr(inputs)\n",
    "        loss = criterion(outputs, labels)\n",
    "        loss.backward()\n",
    "        optimizer_step.step()\n",
    "\n",
    "        running_loss += loss.item()\n",
    "        if i % 2000 == 1999:\n",
    "            print('[StepLR, %d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 2000))\n",
    "            running_loss = 0.0\n",
    "\n",
    "    # Advance the schedule once per epoch. With step_size=30, the rate is unchanged during these 2 epochs.\n",
    "    scheduler.step()\n",
    "\n",
    "print('Finished Training with StepLR schedule')\n",
    "model_step_lr.eval()\n",
    "accuracy_step_lr = evaluate_model(model_step_lr, testloader)\n",
    "print('Accuracy of the network on the 10000 test images using StepLR: %d %%' % accuracy_step_lr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "mirGE_WJorBs"
   },
   "source": [
    "#### **ExponentialLR**\n",
    "\n",
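    "The **ExponentialLR** scheduler multiplies the learning rate by a fixed factor `gamma` every epoch, so the rate decays exponentially with the number of epochs. As with other schedulers, the decay is applied only when `step()` is called on the scheduler."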
   ]
  },
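  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A short worked sketch of the exponential rule, assuming the scheduler's `step()` is called once per epoch. The symbols $\\eta_0$ (initial rate), $\\gamma$ (`gamma`), and $t$ (number of completed epochs) are notation introduced here for illustration.\n",
    "\n",
    "$$\\eta_t = \\eta_0 \\cdot \\gamma^{t}$$\n",
    "\n",
    "With the values used in the next cell ($\\eta_0 = 0.001$, $\\gamma = 0.95$), the rate would be $0.001$ during the first epoch, $0.00095$ during the second, and $0.0009025$ after two decay steps. Unlike StepLR with a large step size, the decay is visible from the very first epoch boundary."
   ]
  },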
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "rAvU91VdorIq",
    "outputId": "a2dbb78b-6da6-483c-b745-a9ea3a847fad"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ExponentialLR, 1,  2000] loss: 1.994\n",
      "[ExponentialLR, 1,  4000] loss: 1.857\n",
      "[ExponentialLR, 1,  6000] loss: 1.783\n",
      "[ExponentialLR, 1,  8000] loss: 1.747\n",
      "[ExponentialLR, 1, 10000] loss: 1.694\n",
      "[ExponentialLR, 1, 12000] loss: 1.683\n",
      "[ExponentialLR, 2,  2000] loss: 1.630\n",
      "[ExponentialLR, 2,  4000] loss: 1.601\n",
      "[ExponentialLR, 2,  6000] loss: 1.606\n",
      "[ExponentialLR, 2,  8000] loss: 1.572\n",
      "[ExponentialLR, 2, 10000] loss: 1.563\n",
      "[ExponentialLR, 2, 12000] loss: 1.569\n",
      "Finished Training with ExponentialLR schedule\n",
      "Accuracy of the network on the 10000 test images using ExponentialLR: 54 %\n"
     ]
    }
   ],
   "source": [
    "from torch.optim import lr_scheduler\n",
    "model_exp_lr = CNNWithBN()\n",
    "optimizer_exp = optim.Adam(model_exp_lr.parameters(), lr=0.001)\n",
    "criterion = nn.CrossEntropyLoss()\n",
    "scheduler_exp = lr_scheduler.ExponentialLR(optimizer_exp, gamma=0.95)\n",
    "\n",
    "for epoch in range(2):\n",
    "    running_loss = 0.0\n",
    "    for i, data in enumerate(trainloader, 0):\n",
    "        inputs, labels = data\n",
    "        optimizer_exp.zero_grad()\n",
    "        outputs = model_exp_lr(inputs)\n",
    "        loss = criterion(outputs, labels)\n",
    "        loss.backward()\n",
    "        optimizer_exp.step()\n",
    "\n",
    "        running_loss += loss.item()\n",
    "        if i % 2000 == 1999:\n",
    "            print('[ExponentialLR, %d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 2000))\n",
    "            running_loss = 0.0\n",
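    "\n",
    "    # Advance the schedule once per epoch; without this call the learning rate never decays.\n",
    "    scheduler_exp.step()\n",
    "\n",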
    "print('Finished Training with ExponentialLR schedule')\n",
    "model_exp_lr.eval()\n",
    "accuracy_exp_lr = evaluate_model(model_exp_lr, testloader)\n",
    "print('Accuracy of the network on the 10000 test images using ExponentialLR: %d %%' % accuracy_exp_lr)"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "provenance": []
  },
  "gpuClass": "standard",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}