{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "$$\n", "\\newcommand{\\mat}[1]{\\boldsymbol {#1}}\n", "\\newcommand{\\mattr}[1]{\\boldsymbol {#1}^\\top}\n", "\\newcommand{\\matinv}[1]{\\boldsymbol {#1}^{-1}}\n", "\\newcommand{\\vec}[1]{\\boldsymbol {#1}}\n", "\\newcommand{\\vectr}[1]{\\boldsymbol {#1}^\\top}\n", "\\newcommand{\\rvar}[1]{\\mathrm {#1}}\n", "\\newcommand{\\rvec}[1]{\\boldsymbol{\\mathrm{#1}}}\n", "\\newcommand{\\diag}{\\mathop{\\mathrm {diag}}}\n", "\\newcommand{\\set}[1]{\\mathbb {#1}}\n", "\\newcommand{\\norm}[1]{\\left\\lVert#1\\right\\rVert}\n", "\\newcommand{\\pderiv}[2]{\\frac{\\partial #1}{\\partial #2}}\n", "\\newcommand{\\bb}[1]{\\boldsymbol{#1}}\n", "$$\n", "# CS236605: Deep Learning\n", "# Tutorial 3: Multilayer Perceptron" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Introduction\n", "\n", "In this tutorial, we will cover:\n", "\n", "* Linear (fully connected) layers\n", "* Activation functions\n", "* 2-Layer MLP implementation from scratch\n", "* Back-propagation\n", "* N-layer MLP with PyTorch's `autograd` and `optim` modules" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Setup\n", "%matplotlib inline\n", "import os\n", "import numpy as np\n", "import sklearn\n", "import torch\n", "import matplotlib.pyplot as plt\n", "plt.rcParams['font.size'] = 20" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Reminder: The perceptron hypothesis class" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The following hypothesis class\n", "$$\n", "\\mathcal{H} =\n", "\\left\\{ h: \\mathcal{X}\\rightarrow\\mathcal{Y}\n", "~\\vert~\n", "h(\\vec{x}) = \\varphi(\\vectr{w}\\vec{x}+b); \\vec{w}\\in\\set{R}^D,~b\\in\\set{R}\\right\\}\n", "$$\n", "where $\\varphi(\\cdot)$ is some nonlinear function, is composed of functions representing the **perceptron** model." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Schematic of a single perceptron and its inspiration, a biological neuron." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Last tutorial: we trained a **logistic regression** model by using\n", "$$\\varphi(\\vec{z})=\\sigma(\\vec{z})=\\frac{1}{1+\\exp(-\\vec{z})}\\in[0,1].$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Limitation**: logistic regression is still a linear classifier. In what sense is it linear though?\n", "\n", "$$\\hat{y} = \\sigma(\\vectr{w}\\vec{x}+b)$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Generalized linear model: the output depends on the input only through a linear combination of weights and inputs.\n", "\n", "Decision boundaries are therefore straight lines (hyperplanes in higher dimensions)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Multilayer Perceptron (MLP)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Model\n", "\n", "Composed of $L$ hidden **layers**, each layer $l$ with $n_l$ **perceptron** (\"neuron\") units." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Each layer $l$ operates on the output of the previous layer ($\\vec{y}_{l-1}$) and calculates:\n", "$$\n", "\\vec{y}_l = \\varphi\\left( \\mat{W}_l \\vec{y}_{l-1} + \\vec{b}_l \\right),~\n", "\\mat{W}_l\\in\\set{R}^{n_{l}\\times n_{l-1}},~ \\vec{b}_l\\in\\set{R}^{n_l}.\n", "$$\n", "\n", "- Note that both input and output are **vectors**. 
We can think of the above equation as describing a layer of **multiple perceptrons**.\n", "- We'll henceforth refer to such layers as **fully-connected** or FC layers.\n", "- The first layer accepts the input of the model, i.e. $\\vec{y}_0=\\vec{x}\\in\\set{R}^{D}$, so that $n_0=D$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Given an input sample $\\vec{x}^i$, the computed function of an $L$-layer MLP is:\n", "$$\n", "\\vec{y}_L^i= \\varphi \\left(\n", "\\mat{W}_L \\varphi \\left( \\cdots\n", "\\varphi \\left( \\mat{W}_1 \\vec{x}^i + \\vec{b}_1 \\right)\n", "\\cdots \\right)\n", "+ \\vec{b}_L \\right)\n", "$$\n", "\n", "- Universal approximation theorem: an MLP with $L>1$ and a sigmoidal activation can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden units (Cybenko, 1989).\n", "- This expression is fully differentiable w.r.t. 
parameters using the Chain Rule.\n", "\n", "- Since it has a non-linear dependency on its weights and inputs, non-linear decision boundaries are possible.\n", "    - Figure: MLPs with 1, 2 and 4 hidden layers, 3 neurons each." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Activation functions " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An **activation function** is the non-linear elementwise function $\\varphi(\\cdot)$ which operates on the affine part of the perceptron model.\n", "\n", "But why do we even need non-linearities?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Without them, the MLP model would be equivalent to a single affine transform, since a composition of affine maps is itself affine." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Common choices for the activation functions are:\n", "\n", "- The logistic function (sigmoid)\n", "  $$ \\varphi(t) = \\sigma(t) = \\frac{1}{1+e^{-t}} \\in [0,1] $$\n", "- The hyperbolic tangent (a shifted and scaled sigmoid)\n", "  $$ \\varphi(t) = \\mathrm{tanh}(t) = \\frac{e^t - e^{-t}}{e^t +e^{-t}} \\in [-1,1]$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- ReLU, rectified linear unit\n", "  $$ \\varphi(t) = \\max\\{t,0\\} $$\n", "  Note that\n", "  $$ \\pderiv{\\varphi}{t} = \\begin{cases} 1, & t>0 \\\\ 0, & t<0 \\end{cases} $$\n", "  (the derivative is undefined at $t=0$; the code below uses 0 there)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Plot some activation functions and their gradients\n", "# Activation functions\n", "relu = lambda x: np.maximum(0, x)\n", "sigmoid = lambda x: 1 / (1 + np.exp(-x))\n", "tanh = lambda x: (np.exp(x)-np.exp(-x)) / (np.exp(x)+np.exp(-x))\n", "# Their gradients\n", "# (the builtin float replaces np.float, which was removed from recent NumPy versions)\n", "g_relu = lambda x: np.array(relu(x) > 0, dtype=float)\n", "g_sigmoid = lambda x: 
sigmoid(x) * (1-sigmoid(x))\n", "g_tanh = lambda x: (1 - tanh(x) ** 2)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "x = np.linspace(-4, 4, num=1024)\n", "_, axes = plt.subplots(nrows=1, ncols=2, figsize=(16,5))\n", "# Left: the activation functions; right: their gradients\n", "axes[0].plot(x, relu(x), x, sigmoid(x), x, tanh(x))\n", "axes[1].plot(x, g_relu(x), x, g_sigmoid(x), x, g_tanh(x))\n", "legend_entries = (r'\\mathrm{ReLU}(x)', r'\\sigma(x)', r'\\mathrm{tanh}(x)')\n", "for ax, legend_prefix in zip(axes, ('', r'\\frac{\\partial}{\\partial x}')):\n", "    ax.grid(True)\n", "    ax.legend(tuple(f'${legend_prefix}{legend_entry}$' for legend_entry in legend_entries))\n", "    ax.set_ylim((-1.1,1.1))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Some reasons for using ReLU are:\n", "\n", "- Mitigates vanishing gradients in deep networks: its gradient is exactly 1 for positive inputs (even though it is still zero for negative inputs).\n", "- Promotes sparse activations: units that output zero (\"dead neurons\") arguably cause sparsity in the input of the next layer.\n", "- Much faster to compute than sigmoid and tanh." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 2-Layer MLP from scratch" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Let's solve a simple **regression** problem with a 2-layer MLP (one hidden layer, one output layer)." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We're trying to learn a continuous and perhaps non-deterministic function $y=f(\\vec{x})$.\n", "\n", "- Domain: $\\vec{x}^i \\in \\set{R}^{D_{\\text{in}}}$\n", "- Target: $y^i \\in \\set{R}^{D_{\\text{out}}}$\n", "- Model: $\\hat{y} = \\varphi(\\mat{X}\\mat{W}_1 + \\vec{b}_1)\\mat{W}_2 + \\vec{b}_2$ (2-layer MLP), where:\n", "    - $\\mat{X}$ is the $(N,D_{\\text{in}})$ matrix with samples in its rows\n", "    - $\\mat{W}_1\\in\\set{R}^{D_{\\text{in}}\\times H},\\ \\vec{b}_1\\in\\set{R}^{H}$\n", "    - $\\mat{W}_2\\in\\set{R}^{H\\times D_{\\text{out}}},\\ \\vec{b}_2\\in\\set{R}^{D_{\\text{out}}}$\n", "    - $\\varphi(\\cdot) = \\mathrm{ReLU}(\\cdot) = \\max\\{\\cdot,0\\}$\n", "    - $H$ is the hidden dimension\n", "    - We'll set $D_{\\text{out}}=1$, so the output $\\hat{\\vec{y}}$ is a vector of length $N$\n", "- MSE loss with L2 regularization:\n", "    $$\n", "    L_{\\mathcal{D}}(h) =\n", "    \\frac{1}{N}\\norm{\\hat{\\vec{y}} - \\vec{y}}_2^2 + \\frac{\\lambda}{2}\\left(\\norm{\\mat{W}_1}_F^2 + \\norm{\\mat{W}_2}_F^2 \\right)\n", "    $$\n", "- Optimization scheme: Vanilla SGD" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Computing the loss gradients with backpropagation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Ignoring regularization, for brevity we define $\\delta X \\triangleq \\pderiv{L(h)}{X}$ for any quantity $X$ in the model." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let $\\mat{X}_b$ and $\\mat{A}_b$ denote $\\mat{X}$ and $\\mat{A}$ with a column of ones prepended (so that the biases are absorbed into the first rows of $\\mat{W}_1$ and $\\mat{W}_2$), and let $\\mat{Z}=\\mat{X}_b\\mat{W}_1$, $\\mat{A}=\\varphi(\\mat{Z})$ and $\\hat{\\vec{y}}=\\mat{A}_b\\mat{W}_2$.\n", "\n", "We can now apply the chain rule and write\n", "$$\n", "\\begin{align}\n", "\\delta \\hat{\\vec{y}} &= \\pderiv{L}{\\hat{\\vec{y}}} = \\frac{2}{N}\\left(\\hat{\\vec{y}} - \\vec{y}\\right) \\\\\n", "\\delta \\mat{W}_2 &= \\delta\\hat{\\vec{y}} \\pderiv{\\hat{\\vec{y}}}{\\mat{W}_2} = \\mattr{A}_b \\delta\\hat{\\vec{y}}\\\\\n", "\\delta\\mat{A}_b &= \\delta\\hat{\\vec{y}} \\pderiv{\\hat{\\vec{y}}}{\\mat{A}_b} = \\delta\\hat{\\vec{y}} \\mattr{W}_2 \\\\\n", "\\delta\\mat{Z} &= \\delta\\mat{A}\\pderiv{\\mat{A}}{\\mat{Z}} = \\delta\\mat{A}\\odot\\mathbb{1}(\\mat{Z}>0) \\\\\n", "\\delta\\mat{W}_1 &= \\delta\\mat{Z}\\pderiv{\\mat{Z}}{\\mat{W}_1} = \\mattr{X}_b \\delta\\mat{Z}\n", "\\end{align}\n", "$$\n", "\n", "Here $\\delta\\mat{A}$ is $\\delta\\mat{A}_b$ without its first (bias) column." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The final gradients for the weight update, including regularization, are $\\delta\\mat{W}_i + \\lambda\\mat{W}_i$." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# A simple MLP with one hidden layer\n", "\n", "# N: batch size\n", "# D_in: number of features\n", "N, D_in = 64, 10\n", "# H: hidden-layer\n", "# D_out: output dimension\n", "H, D_out = 100, 1\n", "\n", "# Random input data\n", "X = np.random.randn(N, D_in)\n", "y = np.random.randn(N, D_out)\n", "\n", "# Model weights (note: bias included)\n", "W1 = np.random.randn(D_in+1, H)\n", "W2 = np.random.randn(H+1, D_out)\n", "\n", "reg_lambda = 0.5\n", "learning_rate = 1e-3" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ".........................................................................................................................................................................................................................................................." 
] } ], "source": [ "losses = []\n", "for epoch in range(250):\n", "    # Forward pass, hidden layer: A = relu(Xb W1), Shape: (N, H)\n", "    Xb = np.hstack((np.ones((N,1)), X))\n", "    Z = Xb.dot(W1)\n", "    A = np.maximum(Z, 0)\n", "    # Forward pass, output layer: Y_hat = Ab W2, Shape: (N, D_out)\n", "    Ab = np.hstack((np.ones((N,1)), A))\n", "    Y_hat = Ab.dot(W2)\n", "\n", "    # Loss calculation (MSE)\n", "    loss = np.mean((Y_hat - y) ** 2)\n", "    losses.append(loss)\n", "\n", "    # Backward pass: Output layer\n", "    d_Y_hat = (2./N) * (Y_hat - y)\n", "    d_W2 = Ab.transpose().dot(d_Y_hat)\n", "    d_Ab = d_Y_hat.dot(W2.transpose())\n", "    # Backward pass: Hidden layer\n", "    d_A = d_Ab[:,1:] # remove bias col\n", "    d_Z = d_A * np.array(Z > 0, dtype=float)\n", "    d_W1 = Xb.transpose().dot(d_Z)\n", "    # Backward pass: Regularization term\n", "    d_W2 += reg_lambda * W2\n", "    d_W1 += reg_lambda * W1\n", "\n", "    # Gradient descent step\n", "    W2 -= d_W2 * learning_rate\n", "    W1 -= d_W1 * learning_rate\n", "\n", "    print('.', end='')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "_, ax = plt.subplots(figsize=(10,5))\n", "ax.plot(losses)\n", "ax.set_ylabel('MSE loss'); ax.set_xlabel('Epoch');" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Note that this implementation is not ideal, as it's:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Non-modular (hard to switch components)\n", "- Hard to extend (e.g. to add layers)\n", "- Error-prone (hard-coded manual calculations)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "But it works!\n", "- In HW2, you'll implement a from-scratch MLP that addresses these concerns.\n", "- And now, we'll see how to address these issues using PyTorch's API." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## N-Layer MLP using PyTorch" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Let's create all our usual components:\n", "- Dataset\n", "- Model\n", "- Loss function\n", "- Optimizer\n", "\n", "But this time we'll create a modular implementation where each of these components is separate and can be changed independently of the others." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Dataset" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "As in the previous tutorial, we'll tackle an image classification task: the MNIST database of handwritten digits." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x0: torch.Size([1, 28, 28]), y0: 5\n" ] } ], "source": [ "import torch\n", "import torch.utils.data\n", "import torchvision\n", "import torchvision.transforms as tvtf\n", "\n", "# Define the transforms that should be applied to each image in the dataset before returning it\n", "tf_ds = tvtf.Compose([\n", "    tvtf.ToTensor(), # Convert PIL image to pytorch Tensor\n", "    tvtf.Normalize(mean=(0.1307,), std=(0.3081,)) # normalize to zero mean and unit std\n", "])\n", "\n", "batch_size = 512\n", "train_size = batch_size * 10\n", "test_size = batch_size * 2\n", "\n", "# Datasets and loaders (only the first train_size / test_size samples are used)\n", "root_dir = os.path.expanduser('~/.pytorch-datasets/mnist/')\n", "ds_train = torchvision.datasets.MNIST(root=root_dir, download=True, train=True, transform=tf_ds)\n", "dl_train = torch.utils.data.DataLoader(ds_train, batch_size,\n", "    sampler=torch.utils.data.SubsetRandomSampler(range(0,train_size)))\n", "ds_test = torchvision.datasets.MNIST(root=root_dir, download=True, train=False, transform=tf_ds)\n", "dl_test = torch.utils.data.DataLoader(ds_test, batch_size,\n", "    sampler=torch.utils.data.SubsetRandomSampler(range(0,test_size)))\n", "\n", "x0, y0 = ds_train[0]\n", "n_features = torch.numel(x0)\n", "n_classes = 10\n", "\n", "print(f'x0: {x0.shape}, y0: {y0}')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Model Implementation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The `torch.nn` module contains building blocks such as neural network layers,\n", "  loss functions, activations and more.\n", "\n", "- We'll implement our model as a subclass of `nn.Module`, which means:\n", "    - Any `nn.Parameter` or sub-module we set as an attribute will be registered as part of the model's parameters.\n", "    - We can nest `nn.Modules` 
and get all model parameters from the top-level `nn.Module`.\n", "    - The model can be called like a function once we implement the `forward()` method." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import torch.nn as nn\n", "\n", "class MLP(torch.nn.Module):\n", "    def __init__(self, D_in: int, hidden_dims: list, D_out: int):\n", "        super().__init__()\n", "\n", "        all_dims = [D_in, *hidden_dims, D_out]\n", "        layers = []\n", "\n", "        # Note: a ReLU follows every linear layer, including the last one\n", "        for in_dim, out_dim in zip(all_dims[:-1], all_dims[1:]):\n", "            layers += [\n", "                nn.Linear(in_dim, out_dim, bias=True),\n", "                nn.ReLU()\n", "            ]\n", "\n", "        self.fc = nn.Sequential(*layers)\n", "        self.log_softmax = nn.LogSoftmax(dim=1)\n", "\n", "    def forward(self, x):\n", "        # Flatten each sample to a vector: (N, ...) -> (N, D_in)\n", "        x = torch.reshape(x, (x.shape[0], -1))\n", "        z = self.fc(x)\n", "        y_pred = self.log_softmax(z)\n", "        return y_pred" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MLP(\n", "  (fc): Sequential(\n", "    (0): Linear(in_features=784, out_features=32, bias=True)\n", "    (1): ReLU()\n", "    (2): Linear(in_features=32, out_features=64, bias=True)\n", "    (3): ReLU()\n", "    (4): Linear(in_features=64, out_features=128, bias=True)\n", "    (5): ReLU()\n", "    (6): Linear(in_features=128, out_features=64, bias=True)\n", "    (7): ReLU()\n", "    (8): Linear(in_features=64, out_features=10, bias=True)\n", "    (9): ReLU()\n", "  )\n", "  (log_softmax): LogSoftmax()\n", ")\n" ] } ], "source": [ "# Create an instance of the model (5-layer MLP)\n", "mlp5 = MLP(D_in=n_features, hidden_dims=[32, 64, 128, 64], D_out=n_classes)\n", "print(mlp5)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "number of parameter tensors: 10\n" ] } ], "source": [ "print(f'number of parameter tensors: {len(list(mlp5.parameters()))}')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "number of parameters: 44458\n" ] } ], "source": [ "print(f'number of parameters: {np.sum([torch.numel(p) for p in mlp5.parameters()])}')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "y_hat0=tensor([[-2.3167, -2.3167, -2.2767, -2.2198, -2.3167, -2.3167, -2.3167, -2.3167,\n", "         -2.3167, -2.3167]], grad_fn=<LogSoftmaxBackward>),\n", "shape=torch.Size([1, 10])\n" ] } ], "source": [ "# Test a forward pass\n", "y_hat0 = mlp5(x0)\n", "\n", "print(f'y_hat0={y_hat0},\\nshape={y_hat0.shape}')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Loss and Optimizer" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- For the loss, we'll use PyTorch's built-in negative log-likelihood loss (`NLLLoss`), which, applied to the model's `LogSoftmax` output, is equivalent to the cross-entropy loss.\n", "- We won't need to calculate the loss gradient this time, as we'll use `autograd` for automatic differentiation.\n", "- As for the optimization scheme, we'll use a built-in SGD optimizer from the `torch.optim` module." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import torch.optim\n", "\n", "# Model\n", "model = MLP(D_in=n_features, hidden_dims=[32, 32, 32], D_out=n_classes)\n", "\n", "# Loss:\n", "# Note: NLLLoss assumes log-probabilities (given by our LogSoftmax layer)\n", "loss_fn = nn.NLLLoss()\n", "\n", "# Optimizer\n", "optimizer = torch.optim.SGD(params=model.parameters(), lr=1e-2, weight_decay=0.1, momentum=0.9)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Training loop" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This time we'll train over lazy-loaded batches from our data loader.\n", "\n", "Notice that, apart from our model's `__init__()` and `forward()` methods, we're using PyTorch facilities for the entire training implementation." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch #1: Avg. loss=2.3054928541183473\n", "Epoch #2: Avg. loss=2.3014203310012817\n", "Epoch #3: Avg. loss=2.2982772827148437\n", "Epoch #4: Avg. loss=2.2965104579925537\n", "Epoch #5: Avg. loss=2.2955925703048705\n", "Epoch #6: Avg. loss=2.2952797651290893\n", "Epoch #7: Avg. loss=2.2953588247299193\n", "Epoch #8: Avg. loss=2.2957545042037966\n", "Epoch #9: Avg. loss=2.29633150100708\n", "Epoch #10: Avg. 
loss=2.296979832649231\n" ] } ], "source": [ "num_epochs = 10\n", "for epoch_idx in range(num_epochs):\n", "    total_loss = 0\n", "    for batch_idx, (X, y) in enumerate(dl_train):\n", "        # Forward pass\n", "        y_pred = model(X)\n", "\n", "        # Compute loss\n", "        loss = loss_fn(y_pred, y)\n", "        total_loss += loss.item()\n", "\n", "        # Backward pass\n", "        optimizer.zero_grad() # Zero gradients of all parameters\n", "        loss.backward()\n", "\n", "        # Weight update\n", "        optimizer.step()\n", "\n", "    # Average of the per-batch losses over this epoch\n", "    print(f'Epoch #{epoch_idx+1}: Avg. loss={total_loss/len(dl_train)}')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Image credits\n", "- MartinThoma [CC0], via Wikimedia Commons https://commons.wikimedia.org/wiki/File:Perceptron-unit.svg\n", "- Sebastian Raschka https://sebastianraschka.com/Articles/2015_singlelayer_neurons.html\n", "- Favio Vázquez https://towardsdatascience.com/a-conversation-about-deep-learning-9a915983107\n", "- Fundamentals of Deep Learning, Nikhil Buduma, O'Reilly, 2017" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }