{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Linear Layer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will implement a **Linear Layer**, a fundamental operation in deep learning (DL). The objective of a linear layer is to map a fixed number of inputs to a desired number of outputs (whether for a regression or a classification task)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Forward Pass\n", "\n", "A neural network architecture consists of 2 main layers: first layer (**input**) and last layer (**output**).\n", "\n", "**Node** or **neuron** is the simplest unit of the neural network. Each neuron holds a numerical value that is passed (in the forward direction in this case) to the next neuron by a mapping. For the sake of simplicity, we will only discuss the **linear neural network** and linear mapping in this lesson.\n", "\n", "Let's consider a simple connection between 2 layers, each with 1 neuron,\n", "\n", "\n", " \n", "We can map the input neuron $x$ to the output neuron $y$ by a linear equation,\n", "\n", "$$y = wx + \\beta$$\n", "\n", "where $w$ is called the **weight** and $\\beta$ is called the **bias term**.\n", "\n", "If we have $n$ input neurons ($n>1$) then the output neuron is the linear combination,\n", "\n", "\n", "\n", "\n", "$$\n", "\\hat{y}=\\beta + x_1w_1+x_2w_2+ \\cdots +x_{n}w_n\n", "$$\n", "\n", "where the $w_i$ are the weights corresponding to each map (or arrow). 
\n", "\n", "Similarly, if there are $m$ output neurons then the output is a system of linear equations,\n", "\n", "\n", "\n", "\n", "$$\\hat{y_1}=\\beta_1 + x_1 w_{1,1}+x_2 w_{1,2}+ \\cdots +x_nw_{1,n}$$\n", "\n", "$$\\hat{y_2}=\\beta_2 + x_1 w_{2,1}+x_2 w_{2,2}+ \\cdots +x_nw_{2,n}$$\n", "$$\\vdots$$\n", "$$\\hat{y_m}=\\beta_m + x_1 w_{m,1}+x_2 w_{m,2}+ \\cdots +x_nw_{m,n}$$\n", "\n", "Compactly, it can be written in matrix form\n", "$$\n", "\\hat{Y} \n", "= \\left(\\begin{array}{c} \\hat{y}_{1} \\\\ \\hat{y}_{2} \\\\ \\vdots \\\\ \\hat{y}_{m} \\end{array}\\right)\n", "= \\left(\\begin{array}{ccccc} \n", "\\beta_1 & w_{1,1} & w_{1,2} & \\cdots & w_{1,n} \\\\\n", "\\beta_2 & w_{2,1} & w_{2,2} & \\cdots & w_{2,n} \\\\\n", "\\vdots & \\vdots & \\vdots & \\vdots & \\vdots \\\\\n", "\\beta_m & w_{m,1} & w_{m,2} & \\cdots & w_{m,n}\n", "\\end{array}\\right)\n", "\\cdot \\left(\\begin{array}{c} 1 \\\\ x_{1} \\\\ \\vdots \\\\ x_n \\end{array}\\right)\n", "= W \\cdot X\n", "$$\n", "\n", "This logic extends naturally as we add more layers.\n", "\n", "\n", "\n", "Any layer between the input and output layers is called a **hidden layer**. The number of hidden layers is usually decided by the complexity of the problem.\n", "\n", "**Fact:**\n", "\n", "* If the weight $w_i\\neq 0$ for all $i$, then we have a **fully connected neural network.**\n", "\n", "* The number of neurons in each layer can be different. Moreover, they tend to decrease sequentially. Ex: \n", " $$500 \\text{ neurons} \\rightarrow 100 \\text{ neurons} \\rightarrow 20 \\text{ neurons}$$\n", "\n", "* Most practical neural networks are non-linear. This is achieved by applying a non-linear function on top of the linear combination. This function is called the **activation function**." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Backward Pass\n", "\n", "Now that we know how to implement the forward pass, we must next work out how to backpropagate through our linear operation. \n", "\n", "Keep in mind that backpropagation is simply the gradient of our latest forward operation (call it $o$) w.r.t. our weight parameters $w$, which, if many intermediate operations have been performed, we obtain by the chain rule\n", "\n", "$$\n", "\\hat{y} = 1w_0+x_1w_1+x_2w_2+x_3w_3\\\\z = \\sigma(\\hat{y}) \\\\\n", "o = L(z,y) \n", "$$\n", "\n", "$$\n", "\\frac{\\partial o}{\\partial w} = \\frac{\\partial o}{\\partial z}*\\frac{\\partial z}{\\partial \\hat{y}}*\\frac{\\partial \\hat{y}}{\\partial w}\n", "$$\n", "\n", "Now, notice that during the backward pass, partial gradients can be classified in two ways:\n", "\n", "1. An **Intermediate operation** ($\\frac{\\partial o}{\\partial z},\\frac{\\partial z}{\\partial \\hat{y}}$) or \n", "2. A **\"Receiver\" operation** ($\\frac{\\partial \\hat{y}}{\\partial w}$)\n", "\n", "Notice that the intermediates have to be calculated to get to our \"Receiver\" operation, which receives a \"step\" operation once its gradient has been calculated.\n", "\n", "In the above example, none of our intermediate operations introduced any new parameters to our model. However, what if they did? Consider the following\n", "\n", "$$\n", "\\hat{y} = 1w_0+x_1w_1+x_2w_2+x_3w_3\\\\z = \\sigma(\\hat{y})\\\\l = z*w_4 \\\\o = L(l,y) \n", "$$\n", "\n", "$$\n", "\\frac{\\partial o}{\\partial w_{0:3}} = \\frac{\\partial o}{\\partial l}*\\frac{\\partial l}{\\partial z}*\\frac{\\partial z}{\\partial \\hat{y}}*\\frac{\\partial \\hat{y}}{\\partial w_{0:3}} \\\\\\frac{\\partial o}{\\partial w_{4}} = \\frac{\\partial o}{\\partial l} * \\frac{\\partial l}{\\partial w_4}\n", "$$\n", "\n", "Given that we now have two operations that introduce parameters to our model, we need to make two backward calculations. 
More importantly, however, notice that their \"paths\" differ in the way that they take the gradient of $l$ w.r.t. either its parameter $w_4$ or its input $z$\n", "\n", "Clearly, these operations are not equivalent\n", "\n", "$$\n", "\\frac{\\partial l}{\\partial z} \\not= \\frac{\\partial l}{\\partial w_4}\n", "$$\n", "\n", "despite both originating from the same forward linear operation. \n", "\n", "Hence, this demonstrates that for any forward operation with weights, such as our Linear Layer, we need to implement two different backward operations: the intermediate pass (which takes the gradient w.r.t. the input) and the \"Receiver\" pass (which takes the gradient w.r.t. the operation's parameters). For either of these operations, we must combine the incoming gradient ($\\frac{\\partial z}{\\partial \\hat{y}},\\frac{\\partial o}{\\partial l}$) with our Linear Layer gradient ($\\frac{\\partial \\hat{y}}{\\partial w_{0:3}},\\frac{\\partial l}{\\partial w_4}$)\n", "\n", "Having defined the two types of backward operations, we will now define the general method to compute both calculations on our Linear Layer.\n", "\n", "Assume we have the forward operation below, with input $x = (x_0, x_1, x_2, x_3) = (1, 2, 3, 4)$\n", "\n", "$$\n", "y=1w_0+2w_1+3w_2+4w_3\n", "$$\n", "\n", "Then, for the backward phase, we need to take the partial derivative w.r.t. each weight coefficient\n", "\n", "$$\n", "\\frac{\\partial y}{\\partial w} = \\left(\\frac{\\partial y}{\\partial w_0}, \\frac{\\partial y}{\\partial w_1}, \\frac{\\partial y}{\\partial w_2}, \\frac{\\partial y}{\\partial w_3}\\right)=(1, 2, 3, 4)\n", "$$\n", "\n", "\n", "What about the partial w.r.t. its input?\n", "\n", "$$\n", "\\frac{\\partial y}{\\partial x} = \\left(\\frac{\\partial y}{\\partial x_0}, \\frac{\\partial y}{\\partial x_1}, \\frac{\\partial y}{\\partial x_2}, \\frac{\\partial y}{\\partial x_3}\\right)=(w_0, w_1, w_2, w_3)\n", "$$\n", "\n", "\n", "Easy, right? 
We find that the \"Receiver\" gradient (w.r.t. the weights) equals the input, while the intermediate gradient (w.r.t. the input) equals the weight parameters. \n", "\n", "As a last step, to really be able to generalize these operations to any kind of differentiable architecture, we will show the general procedure to combine the incoming gradient with our Linear gradient\n", "\n", "**Gradient Generalization w.r.t weights and input**\n", "\n", "\n", "$$\n", "input: \\text{n x f}\n", "$$\n", "\n", "$$\n", "weights: \\text{f x h}\n", "$$\n", "\n", "$$\n", "y: \\text{n x h}\n", "$$\n", "\n", "$$\n", "incoming\\_grad: \\text{n x h}\n", "$$\n", "\n", "$$\n", "grad\\_y\\_wrt\\_weights: \\text{(incoming_grad'*input)' = (h x n * n x f)' = f x h}\n", "$$\n", "\n", "$$\n", "grad\\_y\\_wrt\\_input: \\text{(incoming_grad*weights') = (n x h * h x f) = n x f}\n", "$$\n", "\n", "\n", "Now that we know how to generalize a linear layer, let's implement the above concepts in PyTorch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Linear Layer with PyTorch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we will implement our own Linear Layer in PyTorch using the concepts we defined above. \n", "\n", "**However**, before we begin, we will take a different approach in how we will define our bias\n", "\n", "Initially, we defined a bias column as below:\n", "\n", "$$\n", "\\begin{pmatrix}1 & x_{11} & x_{12} & x_{13} \\\\1 & x_{21} & x_{22} & x_{23} \\\\1 & x_{31} & x_{32} & x_{33} \\\\\\end{pmatrix}\n", "$$\n", "\n", "However, this formulation has some practical problems. For every forward input that we receive, we will have to ***manually add a bias column***. This column addition is an extra bookkeeping step outside the layer itself, and hence it does not fit cleanly with the DL methodology of composing differentiable operations. 
\n", "\n", "Therefore, we will re-formulate the bias as an addition operation on our linear output\n", "\n", "$$\n", "\\begin{pmatrix}1 & x_{11} & x_{12} & x_{13} \\\\1 & x_{21} & x_{22} & x_{23} \\\\1 & x_{31} & x_{32} & x_{33} \\\\\\end{pmatrix}\\begin{pmatrix}w_0 \\\\w_1 \\\\w_2 \\\\w_3\\end{pmatrix} = \n", "\\begin{pmatrix}y_0 \\\\y_1 \\\\y_2 \\end{pmatrix} = \n", "\\begin{pmatrix} x_{11} & x_{12} & x_{13} \\\\ x_{21} & x_{22} & x_{23} \\\\ x_{31} & x_{32} & x_{33} \\\\\\end{pmatrix}\n", "\\begin{pmatrix}w_1 \\\\w_2 \\\\w_3\\end{pmatrix} + \n", "\\begin{pmatrix}w_0 \\\\w_0 \\\\w_0\\end{pmatrix}\n", "$$\n", "\n", "In this sense, our Linear Layer will now be a two-step operation if the bias is included. \n", "\n", "As for the backward pass, the derivative of a simple addition w.r.t. each of its operands is always 1. Hence, our forward and backward pass for the bias become two simple operations. \n", "\n", "Now, to reduce boilerplate code, we will subclass our Linear operation under PyTorch's torch.autograd.Function. 
This enables us to do three things:\n", "\n", "i) define and generalize the forward and backward pass \n", "\n", "ii) use PyTorch's context object (ctx), which allows us to save tensors from the forward pass for use in the backward pass and lets us know which forward inputs need gradients (which tells us whether we need to apply an Intermediate or \"Receiver\" operation during the backward phase)\n", "\n", "iii) store backward's gradient output in our defined weight parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Uncomment this line to install torch library\n", "#!pip install torch" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor([[ 0.6623, 0.8345],\n", " [-0.1770, 0.7527]], device='cuda:0')" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import torch\n", "import torch.nn as nn\n", "\n", "#No Nvidia graphic card\n", "torch.rand((2,2))\n", "\n", "# Nvidia graphic card\n", "torch.randn((2,2)).cuda()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What does the code above do?\n", "The import command will load the torch library into your notebook. \n", "torch.rand((m,n)) will create a matrix of size m x n filled with random values in the range [0,1), while torch.randn((m,n)) draws the values from a standard normal distribution.\n", "\n", "> Note: You will see the output has a type called Tensor, which is a multi-dimensional array (a generalization of a matrix) used for storing numbers.\n", "\n", "If your computer/laptop does not have an Nvidia graphics card, torch.randn((m,n)).cuda() will yield an error. \n", "\n", "> Note: Having a graphics card with a CUDA interface enables parallel computing when building deep learning models, which can drastically decrease training time. 
However, our model can still be trained without it.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# keep in mind that @staticmethod lets us call a method on the class itself, without instantiating it\n", "# Remember that each gradient has the same dimensions as the tensor it belongs to (e.g. the weight parameters)\n", "\n", "\n", "class Linear_Layer(torch.autograd.Function):\n", "    \"\"\"\n", "    Define a Linear Layer operation\n", "    \"\"\"\n", "    @staticmethod\n", "    def forward(ctx, input, weights, bias=None):\n", "        \"\"\"\n", "        In the forward pass, we feed this class all necessary objects to\n", "        compute a linear layer (input, weights, and bias)\n", "        \"\"\"\n", "        # input.shape = (B, in_dim)\n", "        # weights.shape = (in_dim, out_dim)\n", "\n", "        # given that the grad(output) wrt weight parameters equals the input,\n", "        # we will save it to use for backpropagation\n", "        ctx.save_for_backward(input, weights, bias)\n", "\n", "        # linear transformation\n", "        # (B, out_dim) = (B, in_dim) * (in_dim, out_dim)\n", "        output = torch.mm(input, weights)\n", "\n", "        if bias is not None:\n", "            # bias.shape = (out_dim)\n", "\n", "            # expanded_bias.shape = (B, out_dim), repeats bias B times\n", "            expanded_bias = bias.unsqueeze(0).expand_as(output)\n", "\n", "            # element-wise addition\n", "            output += expanded_bias\n", "\n", "        return output\n", "\n", "    # incoming_grad represents the incoming gradient that we defined in the \"Backward Pass\" section\n", "    # incoming_grad.shape == output.shape == (B, out_dim)\n", "\n", "    @staticmethod\n", "    def backward(ctx, incoming_grad):\n", "        \"\"\"\n", "        In the backward pass we receive a Tensor (incoming_grad) containing the\n", "        gradient of the loss with respect to our f(x) output,\n", "        and we now need to compute the gradient of the loss\n", "        with respect to each input of our defined function.\n", "        \"\"\"\n", "        # incoming_grad.shape = (B, out_dim)\n", "\n", "        # extract inputs from forward pass\n", "        input, weights, bias = 
ctx.saved_tensors\n", "\n", "        # assume none of the inputs need gradients\n", "        grad_input = grad_weight = grad_bias = None\n", "\n", "        # we will figure out which forward inputs need grads\n", "        # with ctx.needs_input_grad, which stores True/False\n", "        # values in the order that the forward inputs came\n", "\n", "        # backward must return one gradient (or None) per forward input,\n", "        # in the same order as the forward inputs\n", "\n", "        # if input requires grad\n", "        if ctx.needs_input_grad[0]:\n", "            # (B, in_dim) = (B, out_dim) * (out_dim, in_dim)\n", "            grad_input = incoming_grad.mm(weights.t())\n", "\n", "        # if weights require grad\n", "        if ctx.needs_input_grad[1]:\n", "            # (out_dim, in_dim) = (out_dim, B) * (B, in_dim)\n", "            # transpose the result to (in_dim, out_dim) to match the layout of the weight parameter\n", "            grad_weight = incoming_grad.t().mm(input).t()\n", "\n", "        # if bias requires grad\n", "        if bias is not None and ctx.needs_input_grad[2]:\n", "            # below operation is equivalent to doing it the \"long\" way\n", "            # given that bias grads = 1,\n", "            # torch.ones((1,B)).mm(incoming_grad)\n", "            # (out_dim) = (1,B)*(B,out_dim)\n", "            grad_bias = incoming_grad.sum(0)\n", "\n", "        # below, any grad left as None is simply ignored by autograd\n", "        return grad_input, grad_weight, grad_bias" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "forward output: \n", "tensor([[0.7532, 0.5865, 0.9564]], grad_fn=<Linear_LayerBackward>)\n", "----------------------------------------------------------------------\n", "forward output shape: \n", "torch.Size([1, 3])\n" ] } ], "source": [ "# test forward method\n", "\n", "# input_dim (rows of the input) & output_dim (columns of the input) can be any dimensions (you choose)\n", "input_dim = 1\n", "output_dim = 2\n", "dummy_input= torch.ones((input_dim, output_dim)) # input that will be fed to model\n", "\n", "# create a random set of weights that matches the 
dimensions of input to perform matrix multiplication\n", "final_output_dim = 3 # can be set to any integer > 0\n", "dummy_weight = nn.Parameter(torch.randn((output_dim, final_output_dim))) # nn.Parameter marks the tensor as a trainable parameter (requires_grad=True)\n", "\n", "# feed input and weight tensors to our Linear Layer operation\n", "output = Linear_Layer.apply(dummy_input, dummy_weight)\n", "print(f\"forward output: \\n{output}\")\n", "print('-'*70)\n", "print(f\"forward output shape: \\n{output.shape}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Code explanation\n", "\n", "We first create a 2D Tensor of shape (1, 2) filled with ones, dummy_input = tensor([[1., 1.]]).\n", "We then wrap a tensor of random values with dimensions (2,3) under nn.Parameter; it represents the weights of our Linear Layer operation. \n", "\n", "> NOTE: We wrap our weights under nn.Parameter because, when we use our Linear Layer inside an nn.Module, assigning an nn.Parameter as an attribute automagically registers the weight tensor as a model parameter, which makes for easy extraction by just calling model.parameters(). Without it, the model will not be able to differentiate parameters from inputs.\n", "\n", "After that, we obtain the output of forward propagation using the apply method, providing the input and the weight. The apply function calls the forward function defined in the class Linear_Layer and returns the result of forward propagation.\n", "\n", "We then check the result and the shape of our output to make sure the calculation is done correctly.\n", "At this point, if we check the gradient of dummy_weight, we will see None, since we need to propagate backward to obtain the gradient of the weight. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Weight's gradient {dummy_weight.grad}\")" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# test backward pass\n", "\n", "## calculate gradient of subsequent operation w.r.t. defined weight parameters\n", "incoming_grad = torch.ones((1,3)) # shape equals output dims\n", "output.backward(incoming_grad) # calculate parameter gradients" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor([[1., 1., 1.],\n", " [1., 1., 1.]])" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# extract calculated gradient \n", "dummy_weight.grad " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our forward and backward method defined, let us define some important concepts. \n", "\n", "By nature, Tensors that require gradients (such as parameters) automatically \"record\" a history of all the operations that have been applied to them. 
\n", "\n", "For example, our above forward output contains the attribute `grad_fn=<Linear_LayerBackward>`, which tells us that our output is the result of our defined Linear Layer operation, whose history began with dummy_weight.\n", "\n", "As such, once we call output.backward(incoming_grad), PyTorch automatically, from the last operation to the first, calls each backward method in order to compute the chain-rule gradient that corresponds to our parameters.\n", "\n", "To truly understand what is going on and how PyTorch simplifies the backward phase, we will show a more extensive example where we manually compute the gradient of our parameters with our own defined backward() methods" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "class Linear_Layer_():\n", "    def __init__(self):\n", "        # inputs are stored by forward() for later use in backward()\n", "        self.input = self.weights = self.bias = None\n", "\n", "    def forward(self, input, weights, bias=None):\n", "        self.input = input\n", "        self.weights = weights\n", "        self.bias = bias\n", "\n", "        output = torch.mm(input, weights)\n", "\n", "        if bias is not None:\n", "            # bias.shape = (out_dim)\n", "\n", "            # expanded_bias.shape = (B, out_dim), repeats bias B times\n", "            expanded_bias = bias.unsqueeze(0).expand_as(output)\n", "\n", "            # element-wise addition\n", "            output += expanded_bias\n", "\n", "        return output\n", "\n", "    def backward(self, incoming_grad):\n", "\n", "        # extract inputs from forward pass\n", "        input = self.input\n", "        weights = self.weights\n", "        bias = self.bias\n", "\n", "        grad_input = grad_weight = grad_bias = None\n", "\n", "        # if input requires grad\n", "        if input.requires_grad:\n", "            grad_input = incoming_grad.mm(weights.t())\n", "\n", "        # if weights require grad\n", "        if weights.requires_grad:\n", "            # transpose to (in_dim, out_dim) to match the layout of the weights\n", "            grad_weight = incoming_grad.t().mm(input).t()\n", "\n", "        # if bias exists and requires grad\n", "        if bias is not None and bias.requires_grad:\n", "            grad_bias = incoming_grad.sum(0)\n", "\n", "        return grad_input, grad_weight, grad_bias" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { 
"name": "stdout", "output_type": "stream", "text": [ "outpu1.shape: torch.Size([1, 3])\n", "--------------------------------------------------\n", "outpu2.shape: torch.Size([1, 5])\n", "--------------------------------------------------\n", "outpu3.shape: torch.Size([1, 1])\n" ] } ], "source": [ "# manual forward pass\n", "\n", "input= torch.ones((1,2)) # input \n", "\n", "# define weights for linear layers\n", "weight1 = nn.Parameter(torch.randn((2,3))) \n", "weight2 = nn.Parameter(torch.randn((3,5))) \n", "weight3 = nn.Parameter(torch.randn((5,1))) \n", "\n", "# define bias for Linear layers\n", "bias1 = nn.Parameter(torch.randn((3))) \n", "bias2 = nn.Parameter(torch.randn((5))) \n", "bias3 = nn.Parameter(torch.randn((1))) \n", "\n", "# define Linear Layers\n", "linear1 = Linear_Layer_()\n", "linear2 = Linear_Layer_()\n", "linear3 = Linear_Layer_()\n", "\n", "\n", "# define forward pass\n", "output1 = linear1.forward(input, weight1,bias1)\n", "output2 = linear2.forward(output1, weight2,bias2)\n", "output3 = linear3.forward(output2, weight3,bias3)\n", "\n", "print(f\"outpu1.shape: {output1.shape}\")\n", "print('-'*50)\n", "print(f\"outpu2.shape: {output2.shape}\")\n", "print('-'*50)\n", "print(f\"outpu3.shape: {output3.shape}\")" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "input_grad1.shape: torch.Size([1, 5])\n", "--------------------------------------------------\n", "weight_grad1.shape: torch.Size([5, 1])\n", "--------------------------------------------------\n", "bias_grad1.shape: torch.Size([1])\n" ] } ], "source": [ "# manual backward pass\n", "\n", "# compute intermediate and receiver backward pass\n", "input_grad1, weight_grad1, bias_grad1 = linear3.backward(torch.tensor([[1.]]))\n", "\n", "print(f\"input_grad1.shape: {input_grad1.shape}\")\n", "print('-'*50)\n", "print(f\"weight_grad1.shape: {weight_grad1.shape}\")\n", "print('-'*50)\n", 
"print(f\"bias_grad1.shape: {bias_grad1.shape}\")" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "input_grad2.shape: torch.Size([1, 3])\n", "--------------------------------------------------\n", "weight_grad2.shape: torch.Size([3, 5])\n", "--------------------------------------------------\n", "bias_grad2.shape: torch.Size([5])\n" ] } ], "source": [ "# compute intermediate and receiver backward pass\n", "input_grad2, weight_grad2, bias_grad2 = linear2.backward(input_grad1)\n", "\n", "print(f\"input_grad2.shape: {input_grad2.shape}\")\n", "print('-'*50)\n", "print(f\"weight_grad2.shape: {weight_grad2.shape}\")\n", "print('-'*50)\n", "print(f\"bias_grad2.shape: {bias_grad2.shape}\")" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "input_grad3: None\n", "--------------------------------------------------\n", "weight_grad3.shape: torch.Size([2, 3])\n", "--------------------------------------------------\n", "bias_grad3.shape: torch.Size([3])\n" ] } ], "source": [ "# compute receiver backward pass\n", "input_grad3, weight_grad3, bias_grad3 = linear1.backward(input_grad2)\n", "\n", "print(f\"input_grad3: {input_grad3}\")\n", "print('-'*50)\n", "print(f\"weight_grad3.shape: {weight_grad3.shape}\")\n", "print('-'*50)\n", "print(f\"bias_grad3.shape: {bias_grad3.shape}\")" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [], "source": [ "# now, add gradients to the corresponding parameters\n", "weight1.grad = weight_grad3\n", "weight2.grad = weight_grad2\n", "weight3.grad = weight_grad1\n", "\n", "bias1.grad = bias_grad3\n", "bias2.grad = bias_grad2\n", "bias3.grad = bias_grad1" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "weight1.grad = \n", "tensor([[-0.9869, 0.0548, 0.3107],\n", 
" [-0.9869, 0.0548, 0.3107]], grad_fn=)\n", "----------------------------------------------------------------------\n", "weight2.grad = \n", "tensor([[ 2.3822, 0.9312, 2.2510, -1.0365, 3.1596],\n", " [ 1.3770, 0.5383, 1.3011, -0.5992, 1.8263],\n", " [-1.3396, -0.5237, -1.2658, 0.5829, -1.7767]], grad_fn=)\n", "----------------------------------------------------------------------\n", "weight3.grad = \n", "tensor([[-6.3651],\n", " [-3.5532],\n", " [-5.9865],\n", " [ 0.7347],\n", " [ 5.3876]], grad_fn=)\n", "----------------------------------------------------------------------\n", "bias1.grad = \n", "tensor([-0.9869, 0.0548, 0.3107], grad_fn=)\n", "----------------------------------------------------------------------\n", "bias2.grad = \n", "tensor([ 0.6981, 0.2729, 0.6597, -0.3038, 0.9260], grad_fn=)\n", "----------------------------------------------------------------------\n", "bias3.grad = \n", "tensor([1.])\n" ] } ], "source": [ "# inspect manual calculated gradients\n", "\n", "print(f\"weight1.grad = \\n{weight1.grad}\")\n", "print('-'*70)\n", "print(f\"weight2.grad = \\n{weight2.grad}\")\n", "print('-'*70)\n", "print(f\"weight3.grad = \\n{weight3.grad}\")\n", "print('-'*70)\n", "\n", "print(f\"bias1.grad = \\n{bias1.grad}\") \n", "print('-'*70)\n", "print(f\"bias2.grad = \\n{bias2.grad}\")\n", "print('-'*70)\n", "print(f\"bias3.grad = \\n{bias3.grad}\")" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor([0.])" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# now, we take our \"step\"\n", "lr = .01\n", "\n", "# perform \"step\" on weight parameters\n", "weight1.data.add_(weight1.grad, alpha = -lr) # ==weight1.data+weight1.grad*-lr\n", "weight2.data.add_(weight2.grad, alpha = -lr)\n", "weight2.data.add_(weight2.grad, alpha = -lr)\n", "\n", "# perform \"step\" on bias parameters\n", "bias1.data.add_(bias1.grad, alpha = -lr)\n", 
"bias2.data.add_(bias2.grad, alpha = -lr)\n", "bias2.data.add_(bias2.grad, alpha = -lr)\n", "\n", "# now that the step has been performed, zero out gradient values\n", "weight1.grad.zero_()\n", "weight2.grad.zero_()\n", "weight3.grad.zero_()\n", "\n", "bias1.grad.zero_()\n", "bias2.grad.zero_()\n", "bias3.grad.zero_()\n", "\n", "# get ready for the next forward pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Phew! We have now officially performed a \"step\" update! Let's review what we did:\n", "\n", "**1. Defined all needed forward and backward operations**\n", "\n", "**2. Created a 3-layer model**\n", "\n", "**3. Calculated forward pass**\n", "\n", "**4. Calculated backward pass for all parameters**\n", "\n", "**5. Performed step**\n", "\n", "**6. zero-out gradients**\n", "\n", "Of coarse, we could have simplified the code by creating a list like structure and loop all needed operations. \n", "\n", "However, for sake of clarity and understanding, we layed out all the steps in a logical manner. 
\n", "\n", "Now, how can the **equivalent of the forward and backward operations be performed in PyTorch?**" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "outpu1.shape: torch.Size([1, 3])\n", "--------------------------------------------------\n", "outpu2.shape: torch.Size([1, 5])\n", "--------------------------------------------------\n", "outpu3.shape: torch.Size([1, 1])\n" ] } ], "source": [ "# PyTorch forward pass\n", "\n", "input= torch.ones((1,2)) # input \n", "\n", "# define weights for linear layers\n", "weight1 = nn.Parameter(torch.randn((2,3))) \n", "weight2 = nn.Parameter(torch.randn((3,5))) \n", "weight3 = nn.Parameter(torch.randn((5,1))) \n", "\n", "# define bias for Linear layers\n", "bias1 = nn.Parameter(torch.randn((3))) \n", "bias2 = nn.Parameter(torch.randn((5))) \n", "bias3 = nn.Parameter(torch.randn((1))) \n", "\n", "# define Linear Layers\n", "output1 = Linear_Layer.apply(input,weight1,bias1)\n", "output2 = Linear_Layer.apply(output1, weight2, bias2)\n", "output3 = Linear_Layer.apply(output2, weight3, bias3)\n", "\n", "\n", "\n", "print(f\"outpu1.shape: {output1.shape}\")\n", "print('-'*50)\n", "print(f\"outpu2.shape: {output2.shape}\")\n", "print('-'*50)\n", "print(f\"outpu3.shape: {output3.shape}\")" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [], "source": [ "# calculate all gradients with PyTorch's \"operation history\"\n", "# it essentially just calls our defined backward methods in \n", "# the order of applied operations (such as we did above)\n", "output3.backward()" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "weight1.grad = \n", "tensor([[ 0.2195, -3.4776, 3.3395],\n", " [ 0.2195, -3.4776, 3.3395]])\n", "----------------------------------------------------------------------\n", "weight2.grad = \n", "tensor([[ 2.6869, 
-0.6504, 1.1048, -1.9001, 3.5497],\n", " [ 1.7754, -0.4298, 0.7300, -1.2555, 2.3455],\n", " [ 1.1182, -0.2707, 0.4598, -0.7908, 1.4773]])\n", "----------------------------------------------------------------------\n", "weight3.grad = \n", "tensor([[ 0.0630],\n", " [ 1.2594],\n", " [-3.3520],\n", " [-1.9508],\n", " [-0.3700]])\n", "----------------------------------------------------------------------\n", "bias1.grad = \n", "tensor([ 0.2195, -3.4776, 3.3395])\n", "----------------------------------------------------------------------\n", "bias2.grad = \n", "tensor([ 1.3815, -0.3344, 0.5681, -0.9770, 1.8251])\n", "----------------------------------------------------------------------\n", "bias3.grad = \n", "tensor([1.])\n" ] } ], "source": [ "# inspect PyTorch calculated gradients\n", "\n", "print(f\"weight1.grad = \\n{weight1.grad}\")\n", "print('-'*70)\n", "print(f\"weight2.grad = \\n{weight2.grad}\")\n", "print('-'*70)\n", "print(f\"weight3.grad = \\n{weight3.grad}\")\n", "print('-'*70)\n", "\n", "print(f\"bias1.grad = \\n{bias1.grad}\") \n", "print('-'*70)\n", "print(f\"bias2.grad = \\n{bias2.grad}\")\n", "print('-'*70)\n", "print(f\"bias3.grad = \\n{bias3.grad}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, instead of having to define a weight and a bias parameter each time we need a Linear_Layer, we will wrap our operation in PyTorch's nn.Module, which allows us to:\n", "\n", "i) define all parameters (weight and bias) in a single object and \n", "\n", "ii) create an easy-to-use interface to create any Linear transformation of any shape (as long as it fits in your memory)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "class Linear(nn.Module):\n", "    def __init__(self, in_dim, out_dim, bias = True):\n", "        super().__init__()\n", "        self.in_dim = in_dim\n", "        self.out_dim = out_dim\n", "\n", "        # define parameters\n", "\n", "        # weight parameter\n", "        self.weight = 
nn.Parameter(torch.randn((in_dim, out_dim)))\n", "\n", "        # bias parameter\n", "        if bias:\n", "            self.bias = nn.Parameter(torch.randn((out_dim)))\n", "        else:\n", "            # register parameter as None if not initialized\n", "            self.register_parameter('bias',None)\n", "\n", "    def forward(self, input):\n", "        output = Linear_Layer.apply(input, self.weight, self.bias)\n", "        return output" ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Parameter containing:\n", " tensor([[-1.7011]], requires_grad=True),\n", " Parameter containing:\n", " tensor([-0.0320], requires_grad=True)]" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# initialize model and extract all model parameters\n", "m = Linear(1,1, bias = True)\n", "param = list(m.parameters()) \n", "param" ] }, { "cell_type": "code", "execution_count": 195, "metadata": {}, "outputs": [], "source": [ "# once gradients have been computed and a step has been taken, \n", "# we can zero-out all gradient values in parameters with below\n", "m.zero_grad()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# MNIST\n", "\n", "We will use our Linear Layer operation to classify digits on the MNIST dataset. \n", "\n", "This data is often used as an introduction to DL as it has two desired properties:\n", "\n", "1. 60000 records of observations\n", "\n", "2. Small grayscale input, 28 x 28 pixels per image (dramatically reduces complexity)\n", "\n", "Given the volume of data, it may not be very feasible to load all 60000 images at once and feed them to our model. 
Hence, we will parse our data into batches of 128 to alleviate I/O.\n", "\n", "We will import this data using torchvision and feed it to our DataLoader that enables us to parse our data into batches" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "torch.Size([60000, 28, 28])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# import trainingMNIST dataset\n", "\n", "import torchvision\n", "from torchvision import transforms\n", "import numpy as np\n", "from torchvision.utils import make_grid \n", "import matplotlib.pyplot as plt\n", "from torch.utils.data import DataLoader\n", "\n", "root = r'C:\\Users\\erick\\PycharmProjects\\untitled\\3D_2D_GAN\\MNIST_experimentation'\n", "train_mnist = torchvision.datasets.MNIST(root = root, \n", " train = True, \n", " transform = transforms.ToTensor(),\n", " download = False, \n", " )\n", "\n", "train_mnist.data.shape" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "torch.Size([10000, 28, 28])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# import testing MNIST dataset\n", "\n", "eval_mnist = torchvision.datasets.MNIST(root = root, \n", " train = False,\n", " transform = transforms.ToTensor(),\n", " download = False, \n", " )\n", "eval_mnist.data.shape" ] }, { "cell_type": "code", "execution_count": 166, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 166, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "