{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "kpE4QC7IZRsL" }, "source": [ "# Lecture 10.3: Deep Neural Network Optimization and Regularization" ] }, { "cell_type": "markdown", "metadata": { "id": "ntxFpMejb3Ek" }, "source": [ "\n", "## **1. Stochastic Gradient Descent (SGD)**\n", "### **1.1 Introduction**\n", "\n", "Stochastic Gradient Descent is an iterative method used for optimizing objective functions, especially suitable for large-scale datasets. SGD differs from the standard gradient descent method. In standard gradient descent, we use all the training data to compute the gradient and update the model parameters, which can be very time-consuming when dealing with large datasets. In contrast, SGD uses only one sample (or a small batch of samples, known as mini-batch SGD) per iteration to compute the gradient and update parameters, greatly accelerating the optimization process.\n", "\n", "* Difference between Standard Gradient Descent and Stochastic Gradient Descent\n", " - **Standard Gradient Descent:** Uses all training data to compute the gradient and update the model parameters.\n", " - **Stochastic Gradient Descent:** Uses one sample (or a mini-batch of samples) per iteration to compute the gradient and update parameters.\n", "\n", "* Advantage\n", " - It is useful and efficient when dealing with large-scale datasets due to its one-sample-per-iteration approach (or a mini-batch of samples), reducing the computational cost significantly.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "FrC0hCbqD0Mn" }, "outputs": [], "source": [ "import torch.nn as nn\n", "import torch.optim as optim\n", "import torchvision\n", "import torchvision.transforms as transforms\n", "from torch.utils.data import DataLoader\n", "import torch\n", "import torch.nn.functional as F" ] }, { "cell_type": "markdown", "metadata": { "id": "KR05-tQ6EFvP" }, "source": [ "#### **Mathematical Formulation**\n", "\n", "Stochastic Gradient 
Descent (SGD) is an optimization technique used to minimize (or maximize) an objective function that is a sum of differentiable functions. The fundamental equation of SGD is as follows:\n", "\n", "$$ \\theta^{(t+1)} = \\theta^{(t)} - \\alpha^{(t)} \\nabla f(\\theta^{(t)}) .$$\n", "\n", "Here,\n", "- $ \\theta^{(t+1)} $ represents the updated parameter at the next iteration,\n", "- $ \\theta^{(t)} $ is the current parameter,\n", "- $ \\alpha^{(t)} $ is the learning rate, a hyperparameter that determines the step size at each iteration while moving towards a minimum of the objective function,\n", "- $ \\nabla f(\\theta^{(t)}) $ represents the gradient of the objective function $ f $ with respect to the parameter $ \\theta^{(t)} $.\n", "\n", "#### **Difference between SGD and Gradient Descent**\n", "- **Gradient Descent (Batch Gradient Descent):**\n", "$$ \\theta^{(t+1)} = \\theta^{(t)} - \\alpha^{(t)} \\nabla_{\\theta^{(t)}}\\frac{1}{M}\\sum_{i=1}^M\\ell (f(x^{(i)};\\theta^{(t)}),y^{(i)}). $$\n", "Here, $ (x^{(i)},y^{(i)}) $ is the $ i $-th training sample, $ M $ is the number of training samples, $ \\ell $ is the per-sample loss, and $ f(\\cdot;\\theta) $ now denotes the model rather than the objective. This method calculates the average gradient using all training samples to update the parameters.\n", "\n", "- **Stochastic Gradient Descent (SGD):**\n", "$$ \\theta^{(t+1)} = \\theta^{(t)} - \\alpha^{(t)} \\nabla_{\\theta^{(t)}} \\ell (f(x^{(i)};\\theta^{(t)}),y^{(i)}). $$\n", "Here, $ i $ is the index of a single randomly selected sample. This method uses the gradient calculated on that one training sample to update the parameters.\n", "\n", "- **Mini-batch Stochastic Gradient Descent:**\n", "$$ \\theta^{(t+1)} = \\theta^{(t)} - \\alpha^{(t)} \\nabla_{\\theta^{(t)}} \\frac{1}{B}\\sum_{i=1}^{B} \\ell (f(x^{(i)};\\theta^{(t)}),y^{(i)}). $$\n", "Here, $ B $ represents the batch size. 
This method uses the average gradient calculated using a mini-batch of training samples to update the parameters.\n", "\n", "\n", "\n", "#### **Summary**\n", "\n", "In SGD, we update our parameters using the gradient of the objective function with respect to the parameters, based on a single training example at each iteration, unlike gradient descent that uses the whole training dataset to compute the gradient. This leads to faster iterations and is especially useful when dealing with large datasets.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "tZji5KjPzSRH" }, "source": [ "### **1.2 Implementation**\n", "\n", "**Goal:** Implement a Manual SGD optimizer class to optimize the parameters of a given model.\n", "\n", "To do so, we will define a `ManualSGD` class, which contains the parameters to be optimized, the method to perform a step of optimization, and the method to zero out the gradients.\n", "\n", "In this class:\n", "- The constructor `__init__(self, params, lr)` initializes the optimizer with `params` (the parameters to be optimized) and `lr` (the learning rate).\n", "- The `step` method updates the parameters using their gradient and the learning rate.\n", "- The `zero_grad` method zeros out the gradients of the parameters to prepare for the next optimization step.\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "ewQUQMHaD5hc" }, "outputs": [], "source": [ "class ManualSGD:\n", " def __init__(self, params, lr):\n", " self.params = list(params)\n", " self.lr = lr\n", "\n", " def step(self):\n", " for param in self.params:\n", " if param.grad is not None:\n", " param.data -= self.lr * param.grad\n", "\n", " def zero_grad(self):\n", " for param in self.params:\n", " if param.grad is not None:\n", " param.grad.zero_()" ] }, { "cell_type": "markdown", "metadata": { "id": "6hL7iN_oCj7e" }, "source": [ "## **2. 
Adaptive Moment Estimation (Adam)**\n", "### **2.1 Introduction**\n", "For a successful deep-learning project, the optimization algorithm plays a crucial role. SGD has played a major role in many successful deep-learning projects and research experiments.\\\n", "But SGD has its own limitations as well. The need for extensive hyperparameter tuning is one of them. Recently, the Adam optimization algorithm has gained a lot of popularity.\n", "#### **Mathematical Formulation**\n", "Adam is an adaptive learning rate method, which means that it computes individual learning rates for different parameters. Its name is derived from adaptive moment estimation: it uses estimates of the first and second moments of the gradient to adapt the learning rate $(\\alpha)$ for each weight of the neural network. Moment variables are initialized as $s^{(0)}=0$ and $r^{(0)}=0$. The formula for calculating the gradient estimate is as follows:\n", "$$\\hat{g}= \\nabla_{\\theta^{(t)}}\\ell (f(x^{(i)};\\theta^{(t)}),y^{(i)}). $$\n", "Here, $ (x^{(i)},y^{(i)}) $ represents one training sample. $ i $ represents the index of a single randomly selected sample.\\\n", "The gradient of the cost function of the neural network can be considered a random variable, since it is usually evaluated on some small random batch of data. To estimate the moments, Adam utilizes exponential moving averages, computed on the gradient evaluated on the current mini-batch:\n", "$$ s^{(t+1)}=\\rho_{1}s^{(t)}+(1-\\rho_{1})\\hat{g} $$\\\n", "$$ r^{(t+1)}=\\rho_{2}r^{(t)}+(1-\\rho_{2})\\hat{g}\\odot\\hat{g}, $$\n", "where $s$ and $r$ are moving averages of the gradient and of the element-wise squared gradient, and $\\rho_1,\\rho_2$ are newly introduced hyper-parameters of the algorithm. Their commonly used default values are $0.9$ and $0.999$, respectively.\\\n", "Because $s$ and $r$ are initialized at zero, these averages are biased toward zero in early iterations, so we need to correct the estimators so that their expected values are the ones we want. This step is usually referred to as bias correction. 
The final formulas for our estimators will be as follows:\n", "$$ \\hat s=\\frac{s^{(t+1)}}{1-\\rho^{t+1}_{1}} $$\\\n", "$$ \\hat r=\\frac{r^{(t+1)}}{1-\\rho^{t+1}_{2}}. $$\n", "The only thing left to do is to use those moving averages to scale the learning rate individually for each parameter. To update the weight, we do the following:\n", "$$ \\theta^{(t+1)}=\\theta^{(t)}-\\alpha\\frac{\\hat s}{\\sqrt{\\hat r}+\\delta }, $$\n", "where $ \\theta^{(t)} $ is the current weight, $ \\theta^{(t+1)} $ represents the updated weight at the next iteration, and $\\alpha$ is the global learning rate; the square root and the division are applied element-wise. Typically we pick $\\delta =10^{-8}$ for a good trade-off between numerical stability and fidelity." ] }, { "cell_type": "markdown", "metadata": { "id": "CtL_RULIFiuq" }, "source": [ "\n", "### **2.2 Implementation**\n", "\n", "**Goal:** Implement a Manual Adam optimizer class to optimize the parameters of a given model.\\\n", "To do so, we will define a `ManualAdam` class, which uses an adaptive learning-rate method to find an individual learning rate for each parameter.\n", "\n", "In this class:\n", "- The constructor `__init__(self, params, lr, beta1, beta2, epsilon)` initializes the optimizer with `params` (the parameters to be optimized), `lr` (the learning rate), `beta1` and `beta2` (the decay rates of the first and second moment estimates, corresponding to $\\rho_1$ and $\\rho_2$ above, which are set by default to $0.9$ and $0.999$, respectively) and `epsilon` (a small constant used to prevent division by zero, corresponding to $\\delta$ above, with the default value of $10^{-8}$).\n", "- The `step` method updates the moment estimates from the current gradients, applies bias correction, and then updates each parameter with its individually scaled learning rate.\n", "- The `zero_grad` method zeros the gradients of the model parameters so that gradients from previous iterations do not accumulate into the next one." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "VM-pwtKMdgSI" }, "outputs": [], "source": [ "class ManualAdam:\n", " def __init__(self, params, lr=1e-3, beta1=0.9, beta2=0.999, epsilon=1e-8):\n", " self.params = list(params)\n", " self.lr = lr\n", " self.beta1 = beta1\n", " self.beta2 = beta2\n", " self.epsilon = epsilon\n", " self.m = [torch.zeros_like(p) for p in self.params]\n", " self.v = [torch.zeros_like(p) for p in self.params]\n", " self.t = 0\n", "\n", " def step(self):\n", " self.t += 1\n", " for i, param in enumerate(self.params):\n", " if param.grad is not None:\n", " self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * param.grad\n", " self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * param.grad ** 2\n", "\n", " m_hat = self.m[i] / (1 - self.beta1 ** self.t)\n", " v_hat = self.v[i] / (1 - self.beta2 ** self.t)\n", "\n", " param.data -= self.lr * m_hat / (torch.sqrt(v_hat) + self.epsilon)\n", "\n", " def zero_grad(self):\n", " for param in self.params:\n", " if param.grad is not None:\n", " param.grad.zero_()" ] }, { "cell_type": "markdown", "metadata": { "id": "at55FNccEbYc" }, "source": [ "## **3. 
Comparisons with PyTorch SGD and Adam Optimizer**\n", "### **3.1 Model Architecture and Dataset**\n", "#### **Model Architecture**" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "SxK5L7cZEbFK" }, "outputs": [], "source": [ "class CNN(nn.Module):\n", " def __init__(self):\n", " super(CNN, self).__init__()\n", " self.conv1 = nn.Conv2d(3, 6, 5)\n", " self.pool = nn.MaxPool2d(2, 2)\n", " self.conv2 = nn.Conv2d(6, 16, 5)\n", " self.fc1 = nn.Linear(16 * 5 * 5, 120)\n", " self.fc2 = nn.Linear(120, 84)\n", " self.fc3 = nn.Linear(84, 10)\n", "\n", " def forward(self, x):\n", " x = self.pool(F.relu(self.conv1(x)))\n", " x = self.pool(F.relu(self.conv2(x)))\n", " x = x.view(-1, 16 * 5 * 5)\n", " x = F.relu(self.fc1(x))\n", " x = F.relu(self.fc2(x))\n", " x = self.fc3(x)\n", " return x" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "-BcZZIz9EhsS" }, "outputs": [], "source": [ "def evaluate_model(model, testloader):\n", " correct = 0\n", " total = 0\n", " model.eval()\n", " with torch.no_grad():\n", " for data in testloader:\n", " images, labels = data\n", " outputs = model(images)\n", " _, predicted = torch.max(outputs.data, 1)\n", " total += labels.size(0)\n", " correct += (predicted == labels).sum().item()\n", " return 100 * correct / total" ] }, { "cell_type": "markdown", "metadata": { "id": "wf9cFgqGEkkp" }, "source": [ "#### **Dataset**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "OdfO-2vaEkrP", "outputId": "af3bc5dd-0dff-4a79-dba4-7d17048836d9" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data\\cifar-10-python.tar.gz\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "48a11838430846e79e2d005dcf0f96bf", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/170498071 [00:00