{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "$$\\newcommand{\\mat}[1]{\\boldsymbol {#1}}\n", "\\newcommand{\\mattr}[1]{\\boldsymbol {#1}^\\top}\n", "\\newcommand{\\matinv}[1]{\\boldsymbol {#1}^{-1}}\n", "\\newcommand{\\vec}[1]{\\boldsymbol {#1}}\n", "\\newcommand{\\vectr}[1]{\\boldsymbol {#1}^\\top}\n", "\\newcommand{\\rvar}[1]{\\mathrm {#1}}\n", "\\newcommand{\\rvec}[1]{\\boldsymbol{\\mathrm{#1}}}\n", "\\newcommand{\\diag}{\\mathop{\\mathrm {diag}}}\n", "\\newcommand{\\set}[1]{\\mathbb {#1}}\n", "\\newcommand{\\norm}[1]{\\left\\lVert#1\\right\\rVert}\n", "\\newcommand{\\pderiv}[2]{\\frac{\\partial #1}{\\partial #2}}\n", "\\newcommand{\\bb}[1]{\\boldsymbol{#1}}$$\n", "\n", "# CS236781: Deep Learning\n", "# Tutorial 2: Multilayer Perceptron" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Introduction\n", "\n", "In this tutorial, we will cover:\n", "\n", "* Linear (fully connected) layers\n", "* Activation functions\n", "* 2-Layer MLP implementation from scratch (self study)\n", "* N-layer MLP with PyTorch's `autograd` and `optim` modules" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:49.016538Z", "iopub.status.busy": "2021-05-06T06:05:49.016034Z", "iopub.status.idle": "2021-05-06T06:05:50.095393Z", "shell.execute_reply": "2021-05-06T06:05:50.095954Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "# Setup\n", "%matplotlib inline\n", "import os\n", "import numpy as np\n", "import sklearn\n", "import torch\n", "import matplotlib.pyplot as plt\n", "from typing import Sequence" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:50.099377Z", "iopub.status.busy": "2021-05-06T06:05:50.098809Z", "iopub.status.idle": "2021-05-06T06:05:50.126349Z", "shell.execute_reply": 
"2021-05-06T06:05:50.126947Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "plt.rcParams['font.size'] = 20" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Reminder: Perceptrons and linear models" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The following hypothesis class\n", "$$\n", "\\mathcal{H} =\n", "\\left\\{ h: \\mathcal{X}\\rightarrow\\mathcal{Y}\n", "~\\vert~\n", "h(\\vec{x}) = \\varphi(\\vectr{w}\\vec{x}+b); \\vec{w}\\in\\set{R}^D,~b\\in\\set{R}\\right\\}\n", "$$\n", "where $\\varphi(\\cdot)$ is some nonlinear function, is composed of functions representing the **perceptron** model.\n", "\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Last tutorial: we trained a **logistic regression** model by using\n", "$$\\varphi(\\vec{z})=\\sigma(\\vec{z})=\\frac{1}{1+\\exp(-\\vec{z})}\\in[0,1].$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**Limitation**: logistic regression is still a linear classifier. In what sense is it linear though?\n", "\n", "$$\\hat{y} = \\sigma(\\vectr{w}\\vec{x}+b)$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Linear** in the sense that output depends only on a linear combination of weights and inputs.\n", "\n", "Decision boundaries are therefore straight lines:\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "If we define $y(\\vec{x})=\\vectr{w}\\vec{x}+b$, then the **decision surface** of the classifier is $y(\\vec{x})=\\text{const}$ (usually zero). Note that $b=w_0$ in the diagram below." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "1. For any $\\vec{x_1}, \\vec{x_2}$ on the decision boundary, $\\vectr{w}(\\vec{x_1}-\\vec{x_2})=0$ $\\Rightarrow \\vec{w}$ orthogonal to every vector in the surface.\n", "1. For any $\\vec{x_1}$ on the decision boundary, $\\vectr{w}\\vec{x_1}=-b$. \n", "Since $\\frac{\\vectr{w}}{\\norm{\\vec{w}}}\\vec{x_1}$ is the length of the projection of $\\vec{x_1}$ onto $\\vec{w}$, then the distance of the boundary from origin is given by\n", "$\\frac{-b}{\\norm{\\vec{w}}}$.\n", "1. For any point $\\vec{x}$, $\\frac{y(\\vec{x})}{\\norm{\\vec{w}}}$ is the **signed perpendicular distance** from the decision boundary." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "What if our data is not linearly separable? Can we still use e.g. logistic regression?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "What if we first apply a **fixed** non-linear transformation to the data?\n", "\n", "For some $\\psi: \\set{R}^d \\rightarrow \\set{R}^D$, what does the following classifier do?\n", "\n", "$$\\hat{y} = \\varphi(\\vectr{w}\\psi(\\vec{x})+b).$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We still get a linear decision boundary, but relative to the new features $\\psi(\\vec{x})\\in\\set{R}^D$.\n", "Projecting this boundary back to $\\set{R}^d$, we can get nonlinear decision boundaries with respect to $\\vec{x}\\in\\set{R}^d$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "But how can we choose a nonlinear transformation? \n", "Traditional ML: Craft it painstakingly based on domain-knowledge.\n", "\n", "But what if we want to **learn** it?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Multilayer Perceptron (MLP)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Model\n", "\n", "
\n", "\n", "Composed of $L$ **layers**, each layer $l$ with $n_l$ **perceptron** (\"neuron\") units." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Each layer $l$ operates on the output of the previous layer ($\\vec{y}_{l-1}$) and calculates:\n", "\n", "$$\n", "\\vec{y}_l = \\varphi\\left( \\mat{W}_l \\vec{y}_{l-1} + \\vec{b}_l \\right),~\n", "\\mat{W}_l\\in\\set{R}^{n_{l}\\times n_{l-1}},~ \\vec{b}_l\\in\\set{R}^{n_l},~ l \\in \\{1,2,\\dots,L\\}.\n", "$$\n", "\n", "- We'll refer to such layers as **fully-connected** or FC layers.\n", "- The first layer accepts the input, i.e. $\\vec{y}_0=\\vec{x}\\in\\set{R}^d$.\n", "- The output of the last layer, $\\vec{y}_L$, is the output of the model.\n", "- The layers $1, 2, \\dots, L-1$ are referred to as hidden layers." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "How powerful is an MLP model? I.e. which functions can it approximate?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Universal approximator theorem**\n", "\n", "Given enough parameters, an MLP with $L>1$ and a suitable non-linear activation function (e.g. a sigmoid) can approximate any continuous function on a compact domain up to any specified precision (Cybenko, 1989).\n", "\n", "The MLP is therefore a **potent hypothesis class** (recall approximation error).\n", "\n", "See [here](http://neuralnetworksanddeeplearning.com/chap4.html) for an intuitive explanation of the UAT." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Given an input sample $\\vec{x}^i$, the computed function of an $L$-layer MLP is:\n", "$$\n", "\\vec{y}_L^i= \\varphi \\left(\n", "\\mat{W}_L \\varphi \\left( \\cdots\n", "\\varphi \\left( \\mat{W}_1 \\vec{x}^i + \\vec{b}_1 \\right)\n", "\\cdots \\right)\n", "+ \\vec{b}_L \\right)\n", "$$\n", "\n", "This expression is fully differentiable w.r.t. 
parameters using the Chain Rule.\n", "\n", "And notice that $\\vec{y}_L^i = \\varphi(\\mat{W}_L \\vec{y}_{L-1}^i + \\vec{b}_L)$ is just the linear model we started with.\n", "\n", "So, intuitively, we can think of $\\vec{y}_{L-1}$ as **learned** non-linear features of the input! In other words, $\\vec{y}_{L-1} = \\psi_{\\vec{\\Theta}}(\\vec{x})$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Since an MLP has a non-linear dependency on its inputs through a learned transformation, non-linear decision boundaries are possible.\n", "\n", "For example, an MLP with 1, 2 and 4 hidden layers, 3 neurons each:\n", " \n", "
\"overfit1\"
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Activation functions " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "An **activation function** is the non-linear elementwise function $\\varphi(\\cdot)$ which operates on the affine part of the perceptron model.\n", "\n", "Why do we even need non-linearities in the first place? Isn't the depth enough?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Without them, the MLP model would be equivalent to a single affine transform." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Common choices for the activation functions are:\n", "\n", "- The logistic function (sigmoid)\n", " $$ \\varphi(t) = \\sigma(t) = \\frac{1}{1+e^{-t}} \\in [0,1] $$\n", "- The hyperbolic tangent (a shifted and scaled sigmoid)\n", " $$ \\varphi(t) = \\mathrm{tanh}(t) = \\frac{e^t - e^{-t}}{e^t +e^{-t}} \\in [-1,1]$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- ReLU, rectified linear unit\n", " $$ \\varphi(t) = \\max\\{t,0\\} $$\n", "Note that ReLU is not strictly differentiable. However, sub-gradients exist. 
We will define its gradient as:\n", " $$ \\pderiv{\\varphi}{t} = \\begin{cases} 1, & t\\geq0 \\\\ 0, & t<0 \\end{cases} $$" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:50.133600Z", "iopub.status.busy": "2021-05-06T06:05:50.133072Z", "iopub.status.idle": "2021-05-06T06:05:50.160650Z", "shell.execute_reply": "2021-05-06T06:05:50.161177Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Plot some activation functions and their gradients\n", "\n", "# Activation functions\n", "relu = lambda x: np.maximum(0, x)\n", "sigmoid = lambda x: 1 / (1 + np.exp(-x))\n", "tanh = lambda x: (np.exp(x)-np.exp(-x)) / (np.exp(x)+np.exp(-x))\n", "\n", "# Their gradients\n", "g_relu = lambda x: np.array(relu(x) > 0, float)\n", "g_sigmoid = lambda x: sigmoid(x) * (1-sigmoid(x))\n", "g_tanh = lambda x: (1 - tanh(x) ** 2)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:50.165974Z", "iopub.status.busy": "2021-05-06T06:05:50.165449Z", "iopub.status.idle": "2021-05-06T06:05:50.760248Z", "shell.execute_reply": "2021-05-06T06:05:50.760759Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "x = np.linspace(-4, 4, num=1024)\n", "_, axes = plt.subplots(nrows=1, ncols=2, figsize=(16,5))\n", "axes[0].plot(x, relu(x), x, sigmoid(x), x, tanh(x))\n", "axes[1].plot(x, g_relu(x), x, g_sigmoid(x), x, g_tanh(x))\n", "legend_entries = (r'\\mathrm{ReLU}(x)', r'\\sigma(x)', r'\\mathrm{tanh}(x)')\n", "for ax, legend_prefix in zip(axes, ('', r'\\frac{\\partial}{\\partial x}')):\n", " ax.grid(True)\n", " ax.legend(tuple(f'${legend_prefix}{legend_entry}$' for legend_entry in legend_entries))\n", " ax.set_ylim((-1.1,1.1))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Some reasons people cite for using ReLU are:\n", "\n", "- Does not suffer from vanishing gradient when $x$ is far from zero (though gradient can still be zero).\n", "\n", "- Much faster to compute than sigmoid and tanh.\n", "\n", "- Promotes sparse weight vectors: \"dead neurons\" arguably cause sparsity in the next layer." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Visualization of hand-crafted features vs. MLP" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Visualization of the effect of features and MLP layers: http://playground.tensorflow.org/\n", "\n", "Try to explain what you see for the circles dataset with:\n", "- No hidden layers\n", "- One hidden layer, two neurons\n", "- One hiddlen layer, three neurons" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Part 1: Two-layer MLP from scratch" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Let's solve a simple **regression** problem with a 2-layer MLP (one hidden layer, one output layer)." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We're trying to learn a continuous and perhaps non-deterministic function $y=f(\\vec{x})$.\n", "\n", "- Domain: $\\vec{x}^i \\in \\set{R}^{D_{\\text{in}}}$\n", "\n", "- Target: $y^i \\in \\set{R}^{D_{\\text{out}}}$\n", "\n", "- Model: $\\hat{y} =\\mat{W}_2~ \\varphi(\\mat{W}_1 \\vec{x}+ \\vec{b}_1) + \\vec{b}_2$
\n", " i.e. a 2-layer MLP, where:\n", " - $\\vec{x}\\in\\set{R}^{D_{\\text{in}}}$ is a sample (feature vector)\n", " - $\\mat{W}_1\\in\\set{R}^{H\\times D_{\\text{in}}},\\ \\vec{b}_1\\in\\set{R}^{H}$\n", " - $\\mat{W}_2\\in\\set{R}^{D_{\\text{out}}\\times H},\\ \\vec{b}_2\\in\\set{R}^{D_{\\text{out}}}$\n", " - $\\varphi(\\cdot) = \\mathrm{ReLU}(\\cdot) = \\max\\{\\cdot,0\\}$\n", " - $H$ is the hidden dimension\n", " - We'll set $D_{\\text{out}}=1$ so the output is a scalar\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ " \n", "- MSE loss with L2 regularization:\n", " $$\n", " \\begin{align}\n", " \\ell(\\vec{y},\\vec{\\hat y}) &= \\frac{1}{2}\\norm{\\vec{\\hat y} - \\vec{y}}^2 \\\\\n", " L_{\\mathcal{S}} &= \\frac{1}{N}\\sum_{i=1}^{N}\\ell(\\vec{y}^i,\\vec{\\hat y}^i) + \\frac{\\lambda}{2}\\left(\\norm{\\mat{W}_1}_F^2 + \\norm{\\mat{W}_2}_F^2 \\right)\n", " \\end{align}\n", " $$\n", "- Optimization scheme: Vanilla SGD" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Computing the loss gradients with backpropagation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Let's write our model as \n", "$$\n", "\\hat{y} =\\mat{W}_2~ \\underbrace{\\mathrm{relu}(\\overbrace{\\mat{W}_1 \\vec{x}+ \\vec{b}_1}^{\\vec{z}})}_{\\vec{a}} + \\vec{b}_2,\n", "$$\n", "\n", "and manually derive the gradient of the point-wise loss $\\ell(\\vec{y},\\vec{\\hat y})$ using the **chain rule**.\n", "Remember that to use SGD, we need the gradient of the loss w.r.t. our parameter tensors." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "$$\n", "\\begin{align}\n", "&\\pderiv{\\ell}{\\vec{\\hat y}}=2\\cdot\\frac{1}{2}(\\vec{\\hat y}-\\vec{y}) = (\\vec{\\hat y}-\\vec{y})\\\\\n", "(\\ast)~~&\\pderiv{\\ell}{\\mat{W}_2}= \\pderiv{\\ell}{\\vec{\\hat y}}\\pderiv{\\vec{\\hat y}}{\\mat{W}_2}\n", "=(\\vec{\\hat y}-\\vec{y})\\vectr{a}\n", "\\end{align}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "How do we continue into the nonlinearity? Recall that we defined\n", "\n", "$$ \\pderiv{\\mathrm{relu}(x)}{x} = \\begin{cases} 1, & x\\geq0 \\\\ 0, & x<0 \\end{cases} $$\n", "\n", "and that we apply the non-linearity elementwise on input tensors.\n", "\n", "Also remember that the gradient of a vector w.r.t. another vector is the Jacobian, a matrix of mixed derivatives:\n", "$\\pderiv{a_i}{z_j}$.\n", "\n", "We have $\\vec{a}=\\mathrm{relu}(\\vec{z})$. Thus,\n", "\n", "$$\n", "\\pderiv{\\vec{a}}{\\vec{z}}=\\mathrm{diag}(\\mathbb{1}[\\vec{z}>0]).\n", "$$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "And so,\n", "$$\n", "\\begin{align}\n", "&\\pderiv{\\ell}{\\vec{a}}=\\pderiv{\\vec{\\hat y}}{\\vec{a}}\\pderiv{\\ell}{\\vec{\\hat y}}\n", "=\\mattr{W}_2(\\vec{\\hat y}-\\vec{y})\\\\\n", "&\\pderiv{\\ell}{\\vec{z}}=\\pderiv{\\vec{a}}{\\vec{z}}\\pderiv{\\ell}{\\vec{a}}\n", "=\\mathrm{diag}(\\mathbb{1}[\\vec{z}>0])\\mattr{W}_2(\\vec{\\hat y}-\\vec{y})\\\\\n", "(\\ast)~~&\\pderiv{\\ell}{\\mat{W}_1}=\\pderiv{\\ell}{\\vec{z}}\\pderiv{\\vec{z}}{\\mat{W}_1}\n", "=\\mathrm{diag}(\\mathbb{1}[\\vec{z}>0])\\mattr{W}_2(\\vec{\\hat y}-\\vec{y})\\vectr{x}\\\\\n", "\\end{align}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "For the biases, we can easily see that:\n", "$$\n", "\\begin{align}\n", 
"(\\ast)~~&\\pderiv{\\ell}{\\vec{b}_2}=\\pderiv{\\vec{\\hat y}}{\\vec{b}_2}\\pderiv{\\ell}{\\vec{\\hat y}}\n", "=I_{D_{\\text{out}}} \\pderiv{\\ell}{\\vec{\\hat y}} \\\\\n", "(\\ast)~~&\\pderiv{\\ell}{\\vec{b}_1}=\\pderiv{\\vec{z}}{\\vec{b}_1}\\pderiv{\\ell}{\\vec{z}}\n", "=I_{H}\\pderiv{\\ell}{\\vec{z}}\\\\\n", "\\end{align}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The final gradients for weight update, including regularization will be\n", "\n", "$$\n", "\\nabla_{\\mat{W}_j}L_{\\mathcal{S}}=\\frac{1}{N}\\sum_{i=1}^{N} \\pderiv{\\ell_i}{\\mat{W}_j} + \\lambda\\mat{W}_j, \\ j=1,2.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let's implement it from scratch using just `numpy`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:50.765583Z", "iopub.status.busy": "2021-05-06T06:05:50.765098Z", "iopub.status.idle": "2021-05-06T06:05:50.793785Z", "shell.execute_reply": "2021-05-06T06:05:50.794381Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# A simple MLP with one hidden layer\n", "\n", "# N: batch size\n", "# D_in: number of features\n", "N, D_in = 64, 10\n", "# H: hidden-layer\n", "# D_out: output dimension\n", "H, D_out = 100, 1\n", "\n", "# Random input data\n", "X = np.random.randn(N, D_in)\n", "y = np.random.randn(N, D_out)\n", "\n", "# Model weights and biases\n", "wstd = 0.01\n", "W1 = np.random.randn(H, D_in)*wstd\n", "b1 = np.random.randn(H,)*wstd + 0.1\n", "W2 = np.random.randn(D_out, H)*wstd\n", "b2 = np.random.randn(D_out,)*wstd + 0.1\n", "\n", "reg_lambda = 0.5\n", "learning_rate = 1e-3" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:50.799749Z", "iopub.status.busy": "2021-05-06T06:05:50.799193Z", "iopub.status.idle": "2021-05-06T06:05:50.925423Z", 
"shell.execute_reply": "2021-05-06T06:05:50.925994Z" }, "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ".........................................................................................................................................................................................................................................................." ] } ], "source": [ "losses = []\n", "for epoch in range(250):\n", " # Forward pass, hidden layer: A = relu(X W1 + b1), Shape: (N, H)\n", " Z = X.dot(W1.T) + b1\n", " A = np.maximum(Z, 0)\n", " \n", " # Forward pass, output layer: Y_hat = A W2 + b2, Shape: (N, D_out)\n", " Y_hat = A.dot(W2.T) + b2\n", " \n", " # Loss calculation (MSE)\n", " loss = np.mean((Y_hat - y) ** 2); losses.append(loss) # (N, D_out)\n", " \n", " # Backward pass: Output layer\n", " d_Y_hat = (1./N) * (Y_hat - y) # (N, D_out)\n", " d_W2 = d_Y_hat.T.dot(A) # (D_out, H)\n", " d_A = d_Y_hat.dot(W2) # (N, H)\n", " d_b2 = np.sum(d_Y_hat, axis=0) # (D_out,)\n", " \n", " # Backward pass: Hidden layer\n", " d_Z = d_A * np.array(Z > 0, dtype=float) # (N, H)\n", " d_W1 = d_Z.T.dot(X) # (H, D_in)\n", " d_b1 = np.sum(d_Z, axis=0) # (H,)\n", " \n", " # Backward pass: Regularization term\n", " d_W2 += reg_lambda * W2\n", " d_W1 += reg_lambda * W1\n", " \n", " # Gradient descent step\n", " W2 -= d_W2 * learning_rate; b2 -= d_b2 * learning_rate\n", " W1 -= d_W1 * learning_rate; b1 -= d_b1 * learning_rate\n", " print('.', end='')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:50.929259Z", "iopub.status.busy": "2021-05-06T06:05:50.928652Z", "iopub.status.idle": "2021-05-06T06:05:51.053249Z", "shell.execute_reply": "2021-05-06T06:05:51.053843Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot losses\n", "_, ax = plt.subplots(figsize=(10,5))\n", "ax.plot(losses)\n", "ax.set_ylabel('MSE loss'); ax.set_xlabel('Epoch');" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Note that this implementation is not ideal, as it's:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Non modular (hard to switch components)\n", "- Hard to extend (e.g. to add layers)\n", "- Error prone (hard-coded manual calculations)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "But it works!\n", "- In HW2, you'll implement a from scratch MLP that addresses these concerns.\n", "- And now, we'll see how to address these issues using PyTorch's API." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Part 2: N-Layer MLP using PyTorch" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Let's create all our usual components:\n", "- Dataset\n", "- Model\n", "- Loss function\n", "- Optimizer\n", "\n", "But this time we'll create a modular implementation where each of these components is separate and can be changed independently of the others." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Dataset" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "As in the previous tutorial we'll tackle an image classification task, the MNIST database of handwritten digits." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:51.057759Z", "iopub.status.busy": "2021-05-06T06:05:51.057036Z", "iopub.status.idle": "2021-05-06T06:05:51.122188Z", "shell.execute_reply": "2021-05-06T06:05:51.122511Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import torch\n", "import torch.utils.data\n", "import torchvision\n", "import torchvision.transforms\n", "\n", "root_dir = os.path.expanduser('~/.pytorch-datasets/')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:51.128526Z", "iopub.status.busy": "2021-05-06T06:05:51.128053Z", "iopub.status.idle": "2021-05-06T06:05:51.184958Z", "shell.execute_reply": "2021-05-06T06:05:51.185529Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x0: torch.Size([1, 28, 28]), y0: 5\n" ] } ], "source": [ "tf_ds = torchvision.transforms.ToTensor()\n", "\n", "batch_size = 512\n", "train_size = batch_size * 10\n", "test_size = batch_size * 2\n", "\n", "# Datasets and loaders\n", "ds_train = torchvision.datasets.MNIST(root=root_dir, download=True, train=True, transform=tf_ds)\n", "dl_train = torch.utils.data.DataLoader(ds_train, batch_size,\n", " sampler=torch.utils.data.SubsetRandomSampler(range(0,train_size)))\n", "ds_test = torchvision.datasets.MNIST(root=root_dir, download=True, train=False, transform=tf_ds)\n", "dl_test = torch.utils.data.DataLoader(ds_test, batch_size,\n", " sampler=torch.utils.data.SubsetRandomSampler(range(0,test_size)))\n", "\n", "x0, y0 = ds_train[0]\n", "n_features = torch.numel(x0)\n", "n_classes = 10\n", "\n", "print(f'x0: {x0.shape}, y0: {y0}')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:51.188929Z", "iopub.status.busy": "2021-05-06T06:05:51.188421Z", "iopub.status.idle": 
"2021-05-06T06:05:51.488462Z", "shell.execute_reply": "2021-05-06T06:05:51.488953Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x0: torch.Size([1, 28, 28]), y0: 5\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import sys\n", "sys.path.append('.')\n", "import plot_utils as plot_utils\n", "# Show first few samples\n", "print(f'x0: {x0.shape}, y0: {y0}')\n", "plot_utils.dataset_first_n(ds_train, 10, cmap='gray', show_classes=True);" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Model Implementation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The `torch.nn` module contains building blocks such as neural network layers,\n", " loss functions, activations and more.\n", "- In this section, we'll see various parts of the `torch.nn` API.\n", "- We'll use `nn.Linear` which implements a single MLP layer.\n", "- We'll implement our model as a subclass of `nn.Module`, which means:\n", " - Any tensors we set as properties will be registered as model parameters.\n", " - We can nest `nn.Modules` and get all model parameters from the top-level `nn.Module`.\n", " - Can be used as a function if we implement the `forward()` method." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "To understand `nn.Module`, lets look at a very basic one: the **fully-connected** layer." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:51.492641Z", "iopub.status.busy": "2021-05-06T06:05:51.492155Z", "iopub.status.idle": "2021-05-06T06:05:51.520279Z", "shell.execute_reply": "2021-05-06T06:05:51.520839Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "tensor([[-0.5117, 1.4815, -0.3080, 0.2384, 0.6938]],\n", " grad_fn=)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import torch.nn as nn\n", "\n", "fc = nn.Linear(in_features=3, out_features=5, bias=True)\n", "\n", "# Input tensor with 1 sample of 3 features\n", "t = torch.randn(1, 3)\n", "\n", "# Forward pass, notice that grad_fn exists\n", "fc(t)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "`nn.Module`s have registered **parameters**, which are tensors with `requires_grad=True`." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:51.523614Z", "iopub.status.busy": "2021-05-06T06:05:51.523140Z", "iopub.status.idle": "2021-05-06T06:05:51.550562Z", "shell.execute_reply": "2021-05-06T06:05:51.551115Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parameter #0 of shape torch.Size([5, 3]):\n", "tensor([[-0.0539, 0.0425, 0.3778],\n", " [-0.4717, 0.1436, -0.4493],\n", " [ 0.2278, -0.2985, -0.2353],\n", " [ 0.0305, -0.1864, -0.5171],\n", " [-0.1947, -0.1781, -0.3550]])\n", "\n", "Parameter #1 of shape torch.Size([5]):\n", "tensor([-0.2540, 0.5406, -0.1456, -0.0992, 0.2658])\n", "\n" ] } ], "source": [ "# Note parameter shapes\n", "for i, param in enumerate(fc.parameters()):\n", "    print(f\"Parameter #{i} of shape {param.shape}:\\n{param.data}\\n\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can 
create custom `nn.Module`s with arbitrary logic. \n", "\n", "Let's recreate a fully-connected layer ourselves:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:51.555214Z", "iopub.status.busy": "2021-05-06T06:05:51.554725Z", "iopub.status.idle": "2021-05-06T06:05:51.581317Z", "shell.execute_reply": "2021-05-06T06:05:51.581882Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class MyFullyConnectedLayer(nn.Module):\n", "\n", "    def __init__(self, in_features, out_features):\n", "        super().__init__() # don't forget this!\n", "\n", "        # nn.Parameter registers W, b as model parameters;\n", "        # it sets requires_grad=True by default, whatever the wrapped tensor's flag\n", "        self.W = nn.Parameter(torch.randn(out_features, in_features))\n", "        self.b = nn.Parameter(torch.randn(out_features))\n", "\n", "    def forward(self, x):\n", "        # x assumed to be (N, in_features)\n", "        z = torch.matmul(x, self.W.transpose(0, 1)) + self.b\n", "\n", "        # Our custom FC layer multiplies all outputs by 3 for no good reason\n", "        return 3 * z\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:51.584813Z", "iopub.status.busy": "2021-05-06T06:05:51.584336Z", "iopub.status.idle": "2021-05-06T06:05:51.610648Z", "shell.execute_reply": "2021-05-06T06:05:51.611204Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "myfc = MyFullyConnectedLayer(in_features=3, out_features=5)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:51.614053Z", "iopub.status.busy": "2021-05-06T06:05:51.613580Z", "iopub.status.idle": "2021-05-06T06:05:51.641014Z", "shell.execute_reply": "2021-05-06T06:05:51.641566Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "tensor([[-3.8027, 0.4861, 1.4844, 2.2856, 5.4585]], grad_fn=)" ] }, 
"execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myfc(t)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:51.644410Z", "iopub.status.busy": "2021-05-06T06:05:51.643920Z", "iopub.status.idle": "2021-05-06T06:05:51.671299Z", "shell.execute_reply": "2021-05-06T06:05:51.671807Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[Parameter containing:\n", " tensor([[ 0.6184, -1.3998, -0.3757],\n", " [ 0.8692, 0.6723, 0.7671],\n", " [-0.5638, -1.0807, -0.6356],\n", " [-0.7414, -0.9982, 0.6884],\n", " [-0.1062, 1.3147, -0.4994]], requires_grad=True),\n", " Parameter containing:\n", " tensor([-0.3234, 1.4104, -0.1391, 1.0728, 0.6665], requires_grad=True)]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(myfc.parameters())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Quick visualization of our custom module's **computation graph**:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:51.674956Z", "iopub.status.busy": "2021-05-06T06:05:51.674474Z", "iopub.status.idle": "2021-05-06T06:05:51.793273Z", "shell.execute_reply": "2021-05-06T06:05:51.793925Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "11249740992\n", "\n", " (1, 5)\n", "\n", "\n", "\n", "11377792528\n", "\n", "MulBackward0\n", "\n", "\n", "\n", "11377792528->11249740992\n", "\n", "\n", "\n", "\n", "\n", "11377792624\n", "\n", "AddBackward0\n", "\n", "\n", "\n", "11377792624->11377792528\n", "\n", "\n", "\n", "\n", "\n", "11377792336\n", "\n", "MmBackward0\n", "\n", "\n", "\n", "11377792336->11377792624\n", "\n", "\n", "\n", "\n", "\n", "11377792720\n", "\n", "TransposeBackward0\n", "\n", 
"\n", "\n", "11377792720->11377792336\n", "\n", "\n", "\n", "\n", "\n", "11377792816\n", "\n", "AccumulateGrad\n", "\n", "\n", "\n", "11377792816->11377792720\n", "\n", "\n", "\n", "\n", "\n", "11373582112\n", "\n", "W\n", " (5, 3)\n", "\n", "\n", "\n", "11373582112->11377792816\n", "\n", "\n", "\n", "\n", "\n", "11377792384\n", "\n", "AccumulateGrad\n", "\n", "\n", "\n", "11377792384->11377792624\n", "\n", "\n", "\n", "\n", "\n", "11373582672\n", "\n", "b\n", " (5)\n", "\n", "\n", "\n", "11373582672->11377792384\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import torchviz\n", "torchviz.make_dot(myfc(t), params=dict(W=myfc.W, b=myfc.b))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Now that we know about `nn.Module`s, lets create an fairly-general MLP for multiclass classification." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:51.801400Z", "iopub.status.busy": "2021-05-06T06:05:51.800819Z", "iopub.status.idle": "2021-05-06T06:05:51.833955Z", "shell.execute_reply": "2021-05-06T06:05:51.834336Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class MLP(torch.nn.Module):\n", " NLS = {'relu': torch.nn.ReLU, 'tanh': nn.Tanh, 'sigmoid': nn.Sigmoid, 'softmax': nn.Softmax, 'logsoftmax': nn.LogSoftmax}\n", "\n", " def __init__(self, D_in: int, hidden_dims: Sequence[int], D_out: int, nonlin='relu'):\n", " super().__init__()\n", " \n", " all_dims = [D_in, *hidden_dims, D_out]\n", " non_linearity = MLP.NLS[nonlin]\n", " layers = []\n", " \n", " for in_dim, out_dim in zip(all_dims[:-1], all_dims[1:]):\n", " layers += [\n", " nn.Linear(in_dim, out_dim, bias=True),\n", " non_linearity()\n", " ]\n", " \n", " # Sequential is a container for layers\n", " self.fc_layers = nn.Sequential(*layers[:-1])\n", " \n", " # Output 
non-linearity\n", " self.log_softmax = nn.LogSoftmax(dim=1)\n", "\n", " def forward(self, x):\n", " x = torch.reshape(x, (x.shape[0], -1))\n", " z = self.fc_layers(x)\n", " y_pred = self.log_softmax(z)\n", " # Output is always log-probability\n", " return y_pred" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:51.837660Z", "iopub.status.busy": "2021-05-06T06:05:51.837141Z", "iopub.status.idle": "2021-05-06T06:05:51.864715Z", "shell.execute_reply": "2021-05-06T06:05:51.865228Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MLP(\n", " (fc_layers): Sequential(\n", " (0): Linear(in_features=784, out_features=32, bias=True)\n", " (1): Tanh()\n", " (2): Linear(in_features=32, out_features=64, bias=True)\n", " (3): Tanh()\n", " (4): Linear(in_features=64, out_features=128, bias=True)\n", " (5): Tanh()\n", " (6): Linear(in_features=128, out_features=64, bias=True)\n", " (7): Tanh()\n", " (8): Linear(in_features=64, out_features=10, bias=True)\n", " )\n", " (log_softmax): LogSoftmax(dim=1)\n", ")\n" ] } ], "source": [ "# Create an instance of the model: 5-layer MLP\n", "mlp5 = MLP(D_in=n_features, hidden_dims=[32, 64, 128, 64], D_out=n_classes, nonlin='tanh')\n", "\n", "print(mlp5)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:51.868232Z", "iopub.status.busy": "2021-05-06T06:05:51.867753Z", "iopub.status.idle": "2021-05-06T06:05:51.895960Z", "shell.execute_reply": "2021-05-06T06:05:51.896670Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "number of parameter tensors: 10\n" ] } ], "source": [ "# Parameter tensors in nested nn.Modules are automatically discovered.\n", "print(f'number of parameter tensors: {len(list(mlp5.parameters()))}')" ] }, { "cell_type": "code", "execution_count": 21, 
"metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:51.900699Z", "iopub.status.busy": "2021-05-06T06:05:51.900139Z", "iopub.status.idle": "2021-05-06T06:05:51.928199Z", "shell.execute_reply": "2021-05-06T06:05:51.928711Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "number of parameters: 44458\n" ] } ], "source": [ "print(f'number of parameters: {np.sum([torch.numel(p) for p in mlp5.parameters()])}')" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:51.932129Z", "iopub.status.busy": "2021-05-06T06:05:51.931582Z", "iopub.status.idle": "2021-05-06T06:05:51.961558Z", "shell.execute_reply": "2021-05-06T06:05:51.962101Z" }, "scrolled": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x0.shape=torch.Size([1, 28, 28])\n", "\n", "y_hat0.shape=torch.Size([1, 10])\n", "\n", "y_hat0=tensor([[-2.2258, -2.3713, -2.4023, -2.2198, -2.3206, -2.2607, -2.3160, -2.3575,\n", " -2.3782, -2.1983]], grad_fn=)\n" ] } ], "source": [ "# Test a forward pass\n", "y_hat0 = mlp5(x0)\n", "\n", "print(f'{x0.shape=}\\n')\n", "print(f'{y_hat0.shape=}\\n')\n", "print(f'{y_hat0=}')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Quick visualization of our full MLP's **computation graph**:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:51.965171Z", "iopub.status.busy": "2021-05-06T06:05:51.964693Z", "iopub.status.idle": "2021-05-06T06:05:52.025504Z", "shell.execute_reply": "2021-05-06T06:05:52.026023Z" }, "scrolled": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "11377955600\n", "\n", " (1, 10)\n", "\n", "\n", "\n", "11377793536\n", "\n", 
"LogSoftmaxBackward0\n", "\n", "\n", "\n", "11377793536->11377955600\n", "\n", "\n", "\n", "\n", "\n", "11377792384\n", "\n", "AddmmBackward0\n", "\n", "\n", "\n", "11377792384->11377793536\n", "\n", "\n", "\n", "\n", "\n", "11377794256\n", "\n", "AccumulateGrad\n", "\n", "\n", "\n", "11377794256->11377792384\n", "\n", "\n", "\n", "\n", "\n", "11377957520\n", "\n", "fc_layers.8.bias\n", " (10)\n", "\n", "\n", "\n", "11377957520->11377794256\n", "\n", "\n", "\n", "\n", "\n", "11373551040\n", "\n", "TanhBackward0\n", "\n", "\n", "\n", "11373551040->11377792384\n", "\n", "\n", "\n", "\n", "\n", "11373551088\n", "\n", "AddmmBackward0\n", "\n", "\n", "\n", "11373551088->11373551040\n", "\n", "\n", "\n", "\n", "\n", "11373550416\n", "\n", "AccumulateGrad\n", "\n", "\n", "\n", "11373550416->11373551088\n", "\n", "\n", "\n", "\n", "\n", "11377956880\n", "\n", "fc_layers.6.bias\n", " (64)\n", "\n", "\n", "\n", "11377956880->11373550416\n", "\n", "\n", "\n", "\n", "\n", "11377948656\n", "\n", "TanhBackward0\n", "\n", "\n", "\n", "11377948656->11373551088\n", "\n", "\n", "\n", "\n", "\n", "11377946736\n", "\n", "AddmmBackward0\n", "\n", "\n", "\n", "11377946736->11377948656\n", "\n", "\n", "\n", "\n", "\n", "11377948176\n", "\n", "AccumulateGrad\n", "\n", "\n", "\n", "11377948176->11377946736\n", "\n", "\n", "\n", "\n", "\n", "11377955680\n", "\n", "fc_layers.4.bias\n", " (128)\n", "\n", "\n", "\n", "11377955680->11377948176\n", "\n", "\n", "\n", "\n", "\n", "11377947024\n", "\n", "TanhBackward0\n", "\n", "\n", "\n", "11377947024->11377946736\n", "\n", "\n", "\n", "\n", "\n", "11377949376\n", "\n", "AddmmBackward0\n", "\n", "\n", "\n", "11377949376->11377947024\n", "\n", "\n", "\n", "\n", "\n", "11377949280\n", "\n", "AccumulateGrad\n", "\n", "\n", "\n", "11377949280->11377949376\n", "\n", "\n", "\n", "\n", "\n", "11377955920\n", "\n", "fc_layers.2.bias\n", " (64)\n", "\n", "\n", "\n", "11377955920->11377949280\n", "\n", "\n", "\n", "\n", "\n", "11377949424\n", "\n", 
"TanhBackward0\n", "\n", "\n", "\n", "11377949424->11377949376\n", "\n", "\n", "\n", "\n", "\n", "11377948416\n", "\n", "AddmmBackward0\n", "\n", "\n", "\n", "11377948416->11377949424\n", "\n", "\n", "\n", "\n", "\n", "11377947408\n", "\n", "AccumulateGrad\n", "\n", "\n", "\n", "11377947408->11377948416\n", "\n", "\n", "\n", "\n", "\n", "11377473712\n", "\n", "fc_layers.0.bias\n", " (32)\n", "\n", "\n", "\n", "11377473712->11377947408\n", "\n", "\n", "\n", "\n", "\n", "11377948944\n", "\n", "TBackward0\n", "\n", "\n", "\n", "11377948944->11377948416\n", "\n", "\n", "\n", "\n", "\n", "11377948128\n", "\n", "AccumulateGrad\n", "\n", "\n", "\n", "11377948128->11377948944\n", "\n", "\n", "\n", "\n", "\n", "11377954880\n", "\n", "fc_layers.0.weight\n", " (32, 784)\n", "\n", "\n", "\n", "11377954880->11377948128\n", "\n", "\n", "\n", "\n", "\n", "11377949472\n", "\n", "TBackward0\n", "\n", "\n", "\n", "11377949472->11377949376\n", "\n", "\n", "\n", "\n", "\n", "11377947264\n", "\n", "AccumulateGrad\n", "\n", "\n", "\n", "11377947264->11377949472\n", "\n", "\n", "\n", "\n", "\n", "11377956000\n", "\n", "fc_layers.2.weight\n", " (64, 32)\n", "\n", "\n", "\n", "11377956000->11377947264\n", "\n", "\n", "\n", "\n", "\n", "11377949136\n", "\n", "TBackward0\n", "\n", "\n", "\n", "11377949136->11377946736\n", "\n", "\n", "\n", "\n", "\n", "11377949712\n", "\n", "AccumulateGrad\n", "\n", "\n", "\n", "11377949712->11377949136\n", "\n", "\n", "\n", "\n", "\n", "11377955760\n", "\n", "fc_layers.4.weight\n", " (128, 64)\n", "\n", "\n", "\n", "11377955760->11377949712\n", "\n", "\n", "\n", "\n", "\n", "11377949088\n", "\n", "TBackward0\n", "\n", "\n", "\n", "11377949088->11373551088\n", "\n", "\n", "\n", "\n", "\n", "11377949520\n", "\n", "AccumulateGrad\n", "\n", "\n", "\n", "11377949520->11377949088\n", "\n", "\n", "\n", "\n", "\n", "11377956640\n", "\n", "fc_layers.6.weight\n", " (64, 128)\n", "\n", "\n", "\n", "11377956640->11377949520\n", "\n", "\n", "\n", "\n", "\n", 
"11373548880\n", "\n", "TBackward0\n", "\n", "\n", "\n", "11373548880->11377792384\n", "\n", "\n", "\n", "\n", "\n", "11373550512\n", "\n", "AccumulateGrad\n", "\n", "\n", "\n", "11373550512->11373548880\n", "\n", "\n", "\n", "\n", "\n", "11377956960\n", "\n", "fc_layers.8.weight\n", " (10, 64)\n", "\n", "\n", "\n", "11377956960->11373550512\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "torchviz.make_dot(mlp5(x0), params=dict(mlp5.named_parameters()))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Loss and Optimizer" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "For the loss function, we'll use PyTorch's built in negative log-likelihood loss since our model outputs probabilities." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:52.029870Z", "iopub.status.busy": "2021-05-06T06:05:52.029381Z", "iopub.status.idle": "2021-05-06T06:05:52.059506Z", "shell.execute_reply": "2021-05-06T06:05:52.060056Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "tensor(2.3782, grad_fn=)" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import torch.optim\n", "\n", "# Loss:\n", "# Note: NLLLoss assumes *log*-probabilities (given by our LogSoftmax layer)\n", "loss_fn = nn.NLLLoss()\n", "\n", "# Fake ground-truth labels\n", "# Notice that we don't need to 1-hot encode them!\n", "yt = torch.randint(low=0, high=n_classes, size=(y_hat0.shape[0],))\n", "\n", "# Try out the loss\n", "loss_fn(y_hat0, yt) " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Note that `nn.NLLLoss(y_hat, y)` assumes that 
$\\hat{\\vec{y}}^{(i)}=\\log\\left(\\mathrm{softmax}\\left(\\vec{z}^{(i)}\\right)\\right)$, where $\\vec{z}^{(i)}\\in\\set{R}^C$ contains the raw scores for input $\\vec{x}^{(i)}$, and therefore, with its default `reduction='mean'`, it simply computes:\n", "\n", "$$\n", "L_{\\mathrm{NLL}}\n", "= \\frac{1}{N}\\sum_{i=1}^{N} -\\left[\\hat{\\vec{y}}^{(i)}\\right]_{y^{(i)}}\n", "= \\frac{1}{N}\\sum_{i=1}^{N} -\\left[ \\log\\left(\\mathrm{softmax}\\left(\\vec{z}^{(i)}\\right)\\right) \\right]_{y^{(i)}}\n", "$$\n", "\n", "where $y^{(i)}\\in\\left\\{0,1,\\dots,C-1\\right\\}$ is the ground-truth class label for sample $i$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "As for the optimization scheme, we'll use a built-in SGD optimizer from the `torch.optim` module.\n", "We will see that the semantics of using it are similar to the simple optimizer we implemented last tutorial.\n", "\n", "We won't need to calculate the loss gradient this time, as we'll use `autograd` for automatic differentiation." 
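, "\n", "\n", "As a rough sketch of what a single optimizer step does, assuming the default (non-Nesterov, zero-dampening) behaviour of `torch.optim.SGD`: the symbols $\\vec{\\theta}$, $\\vec{v}$, $\\eta$, $\\mu$ and $\\lambda$ are notation introduced here for the parameters, the momentum buffer, and the `lr`, `momentum` and `weight_decay` options respectively. Starting from $\\vec{v}_0=\\vec{0}$, each step applies:\n", "\n", "$$\n", "\\begin{aligned}\n", "\\vec{g}_t &= \\nabla_{\\vec{\\theta}} L(\\vec{\\theta}_{t-1}) + \\lambda\\,\\vec{\\theta}_{t-1} \\\\\n", "\\vec{v}_t &= \\mu\\,\\vec{v}_{t-1} + \\vec{g}_t \\\\\n", "\\vec{\\theta}_t &= \\vec{\\theta}_{t-1} - \\eta\\,\\vec{v}_t\n", "\\end{aligned}\n", "$$\n", "\n", "With $\\mu=0$ and $\\lambda=0$ this reduces to plain gradient descent, $\\vec{\\theta}_t = \\vec{\\theta}_{t-1} - \\eta\\,\\nabla_{\\vec{\\theta}} L(\\vec{\\theta}_{t-1})$."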
] }, { "cell_type": "code", "execution_count": 25, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:52.063675Z", "iopub.status.busy": "2021-05-06T06:05:52.063193Z", "iopub.status.idle": "2021-05-06T06:05:52.097596Z", "shell.execute_reply": "2021-05-06T06:05:52.096943Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "torch.manual_seed(42)\n", "\n", "# Model for training\n", "model = MLP(D_in=n_features, hidden_dims=[32, 32, 32], D_out=n_classes, nonlin='relu')\n", "\n", "# Optimizer over our model's parameters\n", "optimizer = torch.optim.SGD(params=model.parameters(), lr=5e-2, weight_decay=0.01, momentum=0.9)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Training loop" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This time we'll train over lazy-loaded batches from our data loader.\n", "\n", "Notice that except for our model's `__init__()` and `forward()`, we're using PyTorch facilities for the entire training implementation." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "execution": { "iopub.execute_input": "2021-05-06T06:05:52.102057Z", "iopub.status.busy": "2021-05-06T06:05:52.101567Z", "iopub.status.idle": "2021-05-06T06:05:56.017546Z", "shell.execute_reply": "2021-05-06T06:05:56.018062Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch #1: Avg. loss=2.305134725570679\n", "Epoch #2: Avg. loss=2.286450743675232\n", "Epoch #3: Avg. loss=2.256749749183655\n", "Epoch #4: Avg. loss=2.176987957954407\n", "Epoch #5: Avg. loss=1.9311090946197509\n", "Epoch #6: Avg. loss=1.4161057472229004\n", "Epoch #7: Avg. loss=0.9191414415836334\n", "Epoch #8: Avg. loss=0.6763368904590606\n", "Epoch #9: Avg. loss=0.5613920748233795\n", "Epoch #10: Avg. 
loss=0.5014606505632401\n" ] } ], "source": [ "num_epochs = 10\n", "for epoch_idx in range(num_epochs):\n", "    total_loss = 0\n", "\n", "    for batch_idx, (X, y) in enumerate(dl_train):\n", "        # Forward pass\n", "        y_pred = model(X)\n", "\n", "        # Compute loss\n", "        loss = loss_fn(y_pred, y)\n", "        total_loss += loss.item()\n", "\n", "        # Backward pass\n", "        optimizer.zero_grad() # Zero gradients of all parameters\n", "        loss.backward()       # Run backpropagation to calculate gradients\n", "\n", "        # Optimization step\n", "        optimizer.step()      # Use gradients to update model parameters\n", "\n", "    print(f'Epoch #{epoch_idx+1}: Avg. loss={total_loss/len(dl_train)}')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Using the basic PyTorch building blocks, we have arrived at a much more robust implementation:\n", "- Easy to change architecture: layers and activation functions\n", "- Easy to change loss\n", "- Easy to change optimization method\n", "- No need for manual gradient derivations" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Thanks!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "**Credits**\n", "\n", "This tutorial was written by [Aviv A. Rosenberg](https://avivr.net), modified by [Moshe Kimhi](https://mkimhi.github.io/).
 \n", "To re-use, please provide attribution and link to the original.\n", "\n", "Some images in this tutorial were taken and/or adapted from the following sources:\n", "\n", "- MartinThoma [CC0], via Wikimedia Commons https://commons.wikimedia.org/wiki/File:Perceptron-unit.svg\n", "- Pattern Recognition and Machine Learning, C. M. Bishop, Springer, 2006\n", "- Sebastian Raschka https://sebastianraschka.com/Articles/2015_singlelayer_neurons.html\n", "- Favio Vázquez https://towardsdatascience.com/a-conversation-about-deep-learning-9a915983107\n", "- Fundamentals of Deep Learning, Nikhil Buduma, O'Reilly, 2017" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 4 }