{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "$$\n", "\\newcommand{\\mat}[1]{\\boldsymbol {#1}}\n", "\\newcommand{\\mattr}[1]{\\boldsymbol {#1}^\\top}\n", "\\newcommand{\\matinv}[1]{\\boldsymbol {#1}^{-1}}\n", "\\newcommand{\\vec}[1]{\\boldsymbol {#1}}\n", "\\newcommand{\\vectr}[1]{\\boldsymbol {#1}^\\top}\n", "\\newcommand{\\rvar}[1]{\\mathrm {#1}}\n", "\\newcommand{\\rvec}[1]{\\boldsymbol{\\mathrm{#1}}}\n", "\\newcommand{\\diag}{\\mathop{\\mathrm {diag}}}\n", "\\newcommand{\\set}[1]{\\mathbb {#1}}\n", "\\newcommand{\\norm}[1]{\\left\\lVert#1\\right\\rVert}\n", "\\newcommand{\\pderiv}[2]{\\frac{\\partial #1}{\\partial #2}}\n", "\\newcommand{\\bb}[1]{\\boldsymbol{#1}}\n", "$$\n", "\n", "\n", "# CS236781: Deep Learning\n", "# Tutorial 3: Convolutional Neural Networks" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Introduction\n", "\n", "In this tutorial, we will cover:\n", "\n", "- Convolutional layers\n", "- Pooling layers\n", "- Network architecture\n", "- Spatial classification with fully-convolutional nets\n", "- Residual nets" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2022-03-24T07:23:40.304945Z", "iopub.status.busy": "2022-03-24T07:23:40.300065Z", "iopub.status.idle": "2022-03-24T07:23:41.606026Z", "shell.execute_reply": "2022-03-24T07:23:41.605724Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Setup\n", "%matplotlib inline\n", "import os\n", "import sys\n", "import torch\n", "import torchvision\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2022-03-24T07:23:41.608025Z", "iopub.status.busy": "2022-03-24T07:23:41.607915Z", "iopub.status.idle": "2022-03-24T07:23:41.621298Z", "shell.execute_reply": "2022-03-24T07:23:41.621040Z" }, "slideshow": { 
"slide_type": "fragment" } }, "outputs": [], "source": [ "plt.rcParams['font.size'] = 20\n", "data_dir = os.path.expanduser('~/.pytorch-datasets')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Theory Reminders" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Multilayer Perceptron (MLP)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "#### Model\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Composed of multiple **layers**.\n", "\n", "Each layer $j$ consists of $n_j$ regular perceptrons (\"neurons\") which calculate:\n", "$$\n", "\\vec{y}_j = \\varphi\\left( \\mat{W}_j \\vec{y}_{j-1} + \\vec{b}_j \\right),~\n", "\\mat{W}_j\\in\\set{R}^{n_{j}\\times n_{j-1}},~ \\vec{b}_j\\in\\set{R}^{n_j}.\n", "$$\n", "\n", "- Note that both input and output are **vectors**. We can think of the above equation as describing a layer of **multiple perceptrons**.\n", "- We'll henceforth refer to such layers as **fully-connected** or FC layers.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Given an input sample $\\vec{x}^i$, the computed function of an $L$-layer MLP is:\n", "$$\n", "\\vec{y}_L^i= \\varphi \\left(\n", "\\mat{W}_L \\varphi \\left( \\cdots\n", "\\varphi \\left( \\mat{W}_1 \\vec{x}^i + \\vec{b}_1 \\right)\n", "\\cdots \\right)\n", "+ \\vec{b}_L \\right)\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**Potent hypothesis class**: An MLP with $L>1$, can approximate virtually any continuous function given enough parameters (Cybenko, 1989)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Limitations of MLPs for image classification" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Number of parameters increases quadratically with image size due to connectivity.\n", "- 28x28 MNIST image: 784 weights per neuron in the first layer\n", "- 1000x1000x3 color image: 3M weights **per neuron**\n", " \n", "
\"scale\"
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Not enough compute\n", "\n", "* Overfitting\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Fully-connected layers are highly sensitivity to translation, while image features are inherently translation-invariant.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Despite all these limitations we still want to use deep neural nets because they allow us to learn hierarchical,\n", "non-linear transformations of the input." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Convolutional Layers" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We'll explain how convolutional layers work in using three different \"views\", from the most non-formal to the most formal." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Structural view" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Just for intuition, a convolutional layer **can be viewed** as a composition of neurons (as in an FC layer) but with three important distinctions." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "1. The neurons can be thought of as stacked in a **3D** grid (insead of 1D).\n", "1. Neurons that are at the same depth in the grid **share the same weights** (parameters $\\mat{W},~\\vec{b}$) (represented by color).\n", "1. Each neuron is only **connected to a small region** of the previous layer's output (represented by location).\n", "\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Crucially, each neuron is spatially local, but operates on the **full depth** dimension of its input layer.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Filter-based view" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Since each neuron in a given depth-slice of operates on a small region of the input layer, we can think of the combined **output of that depth-slice** as a **filtered version of the input volume**.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Imagine sliding the filter along the input and computing an inner product at each point.\n", "\n", "
\n", "\n", "Since we have multiple depth-slices per convolutional layer, the layer computes multiple convolutions of the same input with different kernels (filters).\n", "\n", "Each 2D slice of an input and output volume is known as **feature map** or a **channel**." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Formal definitions" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Given an input tensor $\\vec{x}$ of shape $(C_{\\text{in}}, H_{\\text{in}}, W_{\\text{in}})$,\n", "a convolutional layer produces an output tensor $\\vec{y}$ of shape $(C_{\\text{out}}, H_{\\text{out}}, W_{\\text{out}})$,\n", "such that:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "$$\n", "\\vec{y}^j = \\sum_{i=1}^{C_\\text{in}} \\vec{w}^{ij}\\ast\\vec{x}^i+b^j;\\ j=1,2,\\dots,C_\\text{out}\n", "$$\n", "is the $j$-th feature map (or channel) of the output tensor $\\vec{y}$, the $\\ast$ denotes convolution, and $x^i$ is the $i$-th input feature map." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Recall the definition of the convolution operator:\n", "$$\n", "\\left\\{\\vec{g}\\ast\\vec{f}\\right\\}_j = \\sum_{i} g_{j-i} f_{i}.\n", "$$\n", "\n", "
\n", "\n", "Note that in practice, correlation is used instead of convolution, as there's no need to \"flip\" a learned filter." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Convolution is a **linear** and **shift-equivariant** operator." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Linear means it can be represented simply as a matrix multiplication.\n", "\n", "Shift-equivariance means that a shifted input will result in an output shifted by the same amount.\n", "Due to this property, the matrix representing a convolution is always a **Toeplitz** matrix.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Hyperparameters & dimentions\n", "\n", "Assume an input volume of shape $(C_{\\mathrm{in}}, H_{\\mathrm{in}}, W_{\\mathrm{in}})$, i.e. channels, height, width.\n", "Define,\n", "\n", "1. Number of kernels, $K \\geq 1$.\n", "2. Spatial extent (size) of each kernel, $F \\geq 1$. \n", "3. Stride $S\\geq 1$: spatial distance between consecutive applications of a kernel.\n", "4. Padding $P\\geq 0$: Number of \"pixels\" to zero-pad around each input feature map.\n", "5. Dilation $D \\geq 1$: Spacing between kernel elements when applying to input." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "In the following animations, **blue** maps are inputs,\n", "**green** maps are outputs and\n", "the **shaded** area is the kernel with $F=3$.\n", "\n", "| $P=0,~S=1,~D=1$ | $P=1,~S=1,~D=1$ | $P=1,~S=2,~D=1$ | $P=0,~S=1,~D=2$ |\n", "|-----------------|-----------------|-----------------| --------------- |\n", "|| | | |\n", "\n", "\n", "We can see that the second combination, $F=3,~P=1,~S=1,~D=1$, leads to identical sizes of input and output feature maps." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "A 3D view\n", "\n", "| $P=0,~S=1,~D=1$ | $P=1,~S=1,~D=1$ | $P=1,~S=2,~D=1$ |\n", "|-----------------|-----------------|-----------------|\n", "|| | | \n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Then, given a set of hyperparameters,\n", "\n", "- Each convolution kernel will (usually) be a tensor of shape $(C_{\\mathrm{in}}, F, F)$.\n", "- The ouput volume dimensions will be:\n", "\n", " $$\\begin{align}\n", " H_{\\mathrm{out}} &= \\left\\lfloor \\frac{H_{\\mathrm{in}} + 2P - D\\cdot(F-1) -1}{S} \\right\\rfloor + 1\\\\\n", " W_{\\mathrm{out}} &= \\left\\lfloor \\frac{W_{\\mathrm{in}} + 2P - D\\cdot(F-1) -1}{S} \\right\\rfloor + 1\\\\\n", " C_{\\mathrm{out}} &= K\\\\\n", " \\end{align}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- The number of parameters in a convolutional **layer** will be:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "$$\n", "\\underbrace{K}_{\\mathrm{kernels}} \\cdot \\left(\n", "\\underbrace{C_{\\mathrm{in}} \\cdot F^2}_{\\mathrm{kernel\\ parameters}} + \\underbrace{1}_{\\mathrm{bias\\ term}}\n", "\\right)\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**Example**: Input image is 1000x1000x3, and the first conv layer has $10$ kernels of size 5x5.\n", "The number of parameters in the first layer will be: $ 10 \\cdot 3 \\cdot 5^2 + 10 = 760 $.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Pytorch `Conv2d` layer example" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2022-03-24T07:23:41.624486Z", "iopub.status.busy": "2022-03-24T07:23:41.624382Z", "iopub.status.idle": 
"2022-03-24T07:23:42.462133Z", "shell.execute_reply": "2022-03-24T07:23:42.461799Z" }, "scrolled": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Files already downloaded and verified\n" ] } ], "source": [ "import torchvision.transforms as tvtf\n", "\n", "tf = tvtf.Compose([tvtf.ToTensor()])\n", "ds_cifar10 = torchvision.datasets.CIFAR10(data_dir, download=True, train=True, transform=tf)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2022-03-24T07:23:42.464224Z", "iopub.status.busy": "2022-03-24T07:23:42.464117Z", "iopub.status.idle": "2022-03-24T07:23:42.479634Z", "shell.execute_reply": "2022-03-24T07:23:42.479339Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x0 shape with batch dim: torch.Size([1, 3, 32, 32])\n" ] } ], "source": [ "# Load first CIFAR10 image\n", "x0,y0 = ds_cifar10[0]\n", "\n", "# add batch dimension\n", "x0 = x0.unsqueeze(0)\n", "\n", "# Note: channels come before spatial extent\n", "print('x0 shape with batch dim:', x0.shape)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2022-03-24T07:23:42.481747Z", "iopub.status.busy": "2022-03-24T07:23:42.481629Z", "iopub.status.idle": "2022-03-24T07:23:42.494972Z", "shell.execute_reply": "2022-03-24T07:23:42.494679Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# A function to count the number of parameters in an nn.Module.\n", "def num_params(layer):\n", " return sum([p.numel() for p in layer.parameters()])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let's create our first conv layer with pytorch:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2022-03-24T07:23:42.496857Z", "iopub.status.busy": 
"2022-03-24T07:23:42.496774Z", "iopub.status.idle": "2022-03-24T07:23:42.511150Z", "shell.execute_reply": "2022-03-24T07:23:42.510875Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "conv1: 760 parameters\n" ] } ], "source": [ "import torch.nn as nn\n", "\n", "# First conv layer: works on input image volume\n", "conv1 = nn.Conv2d(in_channels=x0.shape[1], out_channels=10, padding=1, kernel_size=5, stride=1,dialation=1)\n", "\n", "print(f'conv1: {num_params(conv1)} parameters')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Number of parameters: $10\\cdot(3\\cdot3^2+1)=280$" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2022-03-24T07:23:42.512891Z", "iopub.status.busy": "2022-03-24T07:23:42.512788Z", "iopub.status.idle": "2022-03-24T07:23:42.532553Z", "shell.execute_reply": "2022-03-24T07:23:42.532226Z" }, "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Input image shape: torch.Size([1, 3, 32, 32])\n", "After first conv layer: torch.Size([1, 10, 30, 30])\n" ] } ], "source": [ "# Apply the layer to an input\n", "print(f'{\"Input image shape:\":25s}{x0.shape}')\n", "\n", "y1 = conv1(x0)\n", "print(f'{\"After first conv layer:\":25s}{y1.shape}')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2022-03-24T07:23:42.534579Z", "iopub.status.busy": "2022-03-24T07:23:42.534466Z", "iopub.status.idle": "2022-03-24T07:23:42.553679Z", "shell.execute_reply": "2022-03-24T07:23:42.553246Z" }, "scrolled": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "conv2: 9820 parameters\n", "After second conv layer: torch.Size([1, 20, 12, 12])\n" ] } ], "source": [ "# Second conv layer: works on output 
volume of first layer\n", "conv2 = nn.Conv2d(in_channels=10, out_channels=20, padding=0, kernel_size=7, stride=2)\n", "print(f'conv2: {num_params(conv2)} parameters')\n", "\n", "y2 = conv2(conv1(x0))\n", "print(f'{\"After second conv layer:\":25s}{y2.shape}')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "New spatial extent (with $D=1$, and $H_{\\mathrm{in}}=30$ being the output size of the first layer):\n", "\n", "$$\n", "H_{\\mathrm{out}} = \\left\\lfloor \\frac{H_{\\mathrm{in}} + 2P -F}{S} \\right\\rfloor + 1\n", "=\n", "\\left\\lfloor \\frac{30 + 2\\cdot 0 -7}{2} \\right\\rfloor + 1\n", "=\n", "12\n", "$$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**Note**: observe that the width and height dimensions of the input image were never specified!\n", "More on the significance of that later." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Pooling layers" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "In addition to strides, another way to reduce the size of feature maps between the convolutional layers\n", "is to add **pooling** layers." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "A pooling layer has the following hyperparameters (but **no trainable parameters**):\n", "\n", "1. Spatial extent (size) of each pooling kernel, $F \\geq 2$. \n", "1. Stride $S\\geq 2$: spatial distance between consecutive applications.\n", "1. Operation (e.g. max, average, $p$-norm)\n", "\n", "**Example**: $\\max$-pooling with $F=2,~S=2$ performing a factor-2 downsample:\n", "\n", "
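The same factor-2 downsample can be reproduced by hand. The sketch below is plain Python for a single feature map with no padding. The helper name `max_pool2d` is hypothetical and only mirrors the idea behind `nn.MaxPool2d`, not its implementation.

```python
def max_pool2d(fmap, size=2, stride=2):
    # Take the maximum over each size x size window,
    # moving the window by `stride` between applications.
    h, w = len(fmap), len(fmap[0])
    out = []
    for r in range(0, h - size + 1, stride):
        row = []
        for c in range(0, w - size + 1, stride):
            window = [fmap[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(max(window))
        out.append(row)
    return out


fmap = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

pooled = max_pool2d(fmap)
print(pooled)  # [[6, 8], [3, 4]]
```

Note that the function has nothing to learn: its behaviour is fully determined by the hyperparameters.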
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Why pool feature maps after convolutions?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "One reason is to more rapidly increase the **receptive field** of each layer.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- Receptive field size increases more rapidly if we add pooling, strides or dilation.\n", "- We want successive conv layers to be affected by increasingly larger parts of the input image.\n", "- This allows us to learn a hierarchy of visual features.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Another reason is to add **invariance** to changes in the input.\n", "\n", "- Pooling within feature maps: introduces invariance to small translations\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "- Pooling across feature maps: introduces invariance to learned transformations\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### PyTorch `Pool2d` layer example" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2022-03-24T07:23:42.556622Z", "iopub.status.busy": "2022-03-24T07:23:42.556478Z", "iopub.status.idle": "2022-03-24T07:23:42.574020Z", "shell.execute_reply": "2022-03-24T07:23:42.573654Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "After second conv layer: torch.Size([1, 20, 12, 12])\n", "After max-pool: torch.Size([1, 20, 6, 6])\n" ] } ], "source": [ "pool = nn.MaxPool2d(kernel_size=2, stride=2)\n", "\n", "print(f'{\"After second conv layer:\":25s}{conv2(conv1(x0)).shape}')\n", "print(f'{\"After max-pool:\":25s}{pool(conv2(conv1(x0))).shape}')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Network Architecture" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The basic way to build an architecture of a deep convolutional neural net, is to repeat groups of **conv-relu** layers, optionally add **pooling** in between and end with an **FC-softmax** combination.\n", "\n", "\n", "Why does such a scheme make sense, e.g. for image classification?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "In the above image,\n", "\n", "- all the **conv** blocks shown are actually **conv-relu** (or some other nonlinearity).\n", "- The repeating **conv-conv-...-pool** blocks are learned, non-linear feature extractors: they learn to detect specific features in an image (e.g. lines at different orientations).\n", "- The pooling controls the receptive field increase, so that more high-level features can be generated by each conv group (e.g. shapes composed from multiple simple lines).\n", "- The **FC-softmax** at the end is just an MLP that uses the extracted features for classification.\n", "- Training end-to-end learns the classifier together with the features!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The rightmost architecture is called VGG, and used to be a relevant architecture for ImageNet classification.\n", "- Other types of layers, such as normalization layers are usually also added." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "There are many other things to consider as part of the architecture:\n", "- Size of conv kernels\n", "- Number of consecutive convolutions\n", "- Use of batch normalization to speed up training\n", "- Dropout for improved generalization\n", "- Not using FC layers (we'll see later)\n", "- Skip connections (we'll see later)\n", "\n", "All of these could be hyperparameters to cross-validate over!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Many different network architectures exist, made famous mainly by repeated improvements on the ImageNet classification challenge since 2012.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Notable ImageNet-winning architectures:\n", "\n", "- AlexNet, 5 layers (2012): Based on LeNet, deeper, with ReLU, trained with GPUs\n", "- Inception/GoogLeNet, 22 layers (2014): Multiple (small) kernel sizes at same depth\n", "- ResNet, 152 (!) layers (2015): Skip connections" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### What filters are deep CNNs learning?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "CNNs capture hierarchical features, with deeper layers capturing higher-level, class-specific features\n", "(Zeiler & Fergus, 2013).\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "This visualization shows patterns which maximally-activate kernels at various layers of a conv net." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### PyTorch network architecture example" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Let's implement **LeNet**, arguably the first successful CNN model for MNIST (LeCun, 1998).\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2022-03-24T07:23:42.576853Z", "iopub.status.busy": "2022-03-24T07:23:42.576761Z", "iopub.status.idle": "2022-03-24T07:23:42.593373Z", "shell.execute_reply": "2022-03-24T07:23:42.593067Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class LeNet(nn.Module):\n", " def __init__(self, in_channels=3):\n", " super().__init__()\n", " self.feature_extractor = nn.Sequential(\n", " nn.Conv2d(in_channels, out_channels=6, kernel_size=5),\n", " nn.ReLU(),\n", " nn.MaxPool2d(2),\n", " nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5),\n", " nn.ReLU(),\n", " nn.MaxPool2d(2),\n", " )\n", " self.classifier = nn.Sequential(\n", " nn.Linear(16*5*5, 120), # Why 16*5*5 ?\n", " nn.ReLU(), \n", " nn.Linear(120, 84), # (N, 120) -> (N, 84)\n", " nn.ReLU(),\n", " nn.Linear(84, 10) # (N, 84) -> (N, 10)\n", " )\n", " def forward(self, x):\n", " features = self.feature_extractor(x)\n", " features = features.view(features.size(0), -1)\n", " class_scores = self.classifier(features)\n", " return class_scores" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2022-03-24T07:23:42.595148Z", "iopub.status.busy": "2022-03-24T07:23:42.595066Z", "iopub.status.idle": "2022-03-24T07:23:42.608738Z", "shell.execute_reply": "2022-03-24T07:23:42.608473Z" }, "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LeNet(\n", " (feature_extractor): Sequential(\n", " (0): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1))\n", " (1): ReLU()\n", " (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)\n", " (3): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))\n", " (4): ReLU()\n", " (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)\n", " )\n", " (classifier): Sequential(\n", " (0): 
Linear(in_features=400, out_features=120, bias=True)\n", " (1): ReLU()\n", " (2): Linear(in_features=120, out_features=84, bias=True)\n", " (3): ReLU()\n", " (4): Linear(in_features=84, out_features=10, bias=True)\n", " )\n", ")\n" ] } ], "source": [ "net = LeNet()\n", "print(net)\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2022-03-24T07:23:42.610411Z", "iopub.status.busy": "2022-03-24T07:23:42.610309Z", "iopub.status.idle": "2022-03-24T07:23:42.661448Z", "shell.execute_reply": "2022-03-24T07:23:42.661183Z" }, "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x0 shape= torch.Size([1, 3, 32, 32])\n", "\n", "LeNet(x0)= tensor([[-0.0388, 0.0337, -0.0120, -0.0205, -0.0326, 0.0651, -0.0826, -0.0595,\n", " 0.0831, 0.1228]], grad_fn=)\n", "\n", "shape= torch.Size([1, 10])\n" ] } ], "source": [ "# Test forward pass\n", "print('x0 shape=', x0.shape, end='\\n\\n')\n", "print('LeNet(x0)=', net(x0), end='\\n\\n')\n", "print('shape=', net(x0).shape)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Fully-convolutional Networks\n", "you can read at home, not for the homework" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Notice how we never actually specified the input image size when implementing the network.\n", "\n", "**Does this mean we can use the network on images of any size**?" 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "**No**, because of the FC layers at the end.\n", "\n", "Here, let's try:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2022-03-24T07:23:42.663315Z", "iopub.status.busy": "2022-03-24T07:23:42.663210Z", "iopub.status.idle": "2022-03-24T07:23:42.698988Z", "shell.execute_reply": "2022-03-24T07:23:42.698684Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "mat1 and mat2 shapes cannot be multiplied (1x2704 and 400x120)\n" ] } ], "source": [ "large_image = torch.randn(1,3,32*2,32*2)\n", "try:\n", " net(large_image)\n", "except RuntimeError as e:\n", " print(e, file=sys.stderr)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "However: Only the FC layers at the end require actual knowledge of exact image sizes." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We can replace them with... More convolutions, of course" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "What would we get from:\n", "\n", "- Kernels of size 1x1?\n", "- Kernels of size HxW (full spatial extent)?\n", "\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Lets create a fully-convolutional LeNet:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2022-03-24T07:23:42.700836Z", "iopub.status.busy": "2022-03-24T07:23:42.700733Z", "iopub.status.idle": "2022-03-24T07:23:42.715096Z", "shell.execute_reply": "2022-03-24T07:23:42.714801Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "class LeNetFullyConv(LeNet):\n", " def __init__(self):\n", " super().__init__()\n", " # Remember: the last feature map volume has shape (16,5,5) for the original image size\n", " # Override the classifier with 5x5 then 1x1 convolutions\n", " # Try to figure out the output shape after each of the following convolutions:\n", " self.classifier = nn.Sequential(\n", " nn.Conv2d(in_channels=16, out_channels=120, kernel_size=5), # no padding or strides!\n", " nn.ReLU(),\n", " nn.Conv2d(in_channels=120, out_channels=84, kernel_size=1), # 1x1 conv\n", " nn.ReLU(),\n", " nn.Conv2d(in_channels=84, out_channels=10, kernel_size=1), # 1x1 conv\n", " )\n", " \n", " def forward(self, x):\n", " # Using feature extractor block from the base model\n", " features = self.feature_extractor(x)\n", " # note: no need to reshape the features now\n", " class_scores = self.classifier(features)\n", " return class_scores" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2022-03-24T07:23:42.716758Z", "iopub.status.busy": "2022-03-24T07:23:42.716679Z", "iopub.status.idle": "2022-03-24T07:23:42.731167Z", "shell.execute_reply": "2022-03-24T07:23:42.730846Z" }, "scrolled": true, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LeNetFullyConv(\n", " (feature_extractor): Sequential(\n", " (0): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1))\n", " (1): ReLU()\n", " (2): 
MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)\n", " (3): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))\n", " (4): ReLU()\n", " (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)\n", " )\n", " (classifier): Sequential(\n", " (0): Conv2d(16, 120, kernel_size=(5, 5), stride=(1, 1))\n", " (1): ReLU()\n", " (2): Conv2d(120, 84, kernel_size=(1, 1), stride=(1, 1))\n", " (3): ReLU()\n", " (4): Conv2d(84, 10, kernel_size=(1, 1), stride=(1, 1))\n", " )\n", ")\n" ] } ], "source": [ "net_fully_conv = LeNetFullyConv()\n", "print(net_fully_conv)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Let's forward the original-sized image and the larger image through the network and observe the output shapes:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "execution": { "iopub.execute_input": "2022-03-24T07:23:42.733071Z", "iopub.status.busy": "2022-03-24T07:23:42.732968Z", "iopub.status.idle": "2022-03-24T07:23:42.763894Z", "shell.execute_reply": "2022-03-24T07:23:42.763569Z" }, "scrolled": true, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "regular image output shape: torch.Size([1, 10, 1, 1])\n", "large image output shape: torch.Size([1, 10, 9, 9])\n" ] } ], "source": [ "print('regular image output shape:', net_fully_conv(x0).shape)\n", "print('large image output shape:', net_fully_conv(large_image).shape)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "**What's the meaning of the output after conversion to fully convolutional?**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "It's now a **spatial classification map**.\n", "\n", "
\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Residual Networks" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", "\n", "For image-related tasks it seems that **deeper is better**: deeper networks can learn more complex features.\n", "\n", "How deep can we go? Should more depth always improve results?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "In theory, adding an additional layer should provide **at least** the same accuracy as before." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Extra layers could always just be identity maps.\n", "\n", "In practice, there are two major problems with adding depth:\n", "1. More difficult convergence: vanishing gradients\n", "1. More difficult optimization: parameter space increases" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", "\n", "I.e., even if an equally good (or better) solution exists, SGD-based optimization may fail to find it: the **optimization error** increases with depth." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "ResNets attempt to address these issues by building a network architecture composed of convolutional blocks with added **shortcut-connections**:\n", "\n", "
\n", "\n", "(Left: basic block; right: bottleneck block).\n", "\n", "Here the weight layers are `3x3` or `1x1` convolutions followed by batch-normalization.\n", "\n", "**Why do these shortcut-connections help?**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "These shortcuts create two key advantages:\n", "- Allow gradients to \"flow\" freely backwards\n", "- Each block only learns the \"residual mapping\", i.e. some delta from the identity map which is easier to optimize." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Implementation: In the homeworks :)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Thanks!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "**Credits**\n", "\n", "This tutorial was written by [Aviv A. Rosenberg](https://avivr.net) and [Moshe Kimhi](https://mkimhi.github.io/).
\n", "To re-use, please provide attribution and link to the original.\n", "\n", "Some images in this tutorial were taken and/or adapted from the following sources:\n", "\n", "- Sebastian Raschka, https://sebastianraschka.com/\n", "- Deep Learning, Goodfellow, Bengio and Courville, MIT Press, 2016\n", "- Fundamentals of Deep Learning, Nikhil Buduma, O'Reilly, 2017\n", "- Deep Learning with Python, François Chollet, Manning, 2018\n", "- Stanford CS231n course notes by Andrej Karpathy\n", "- https://github.com/vdumoulin/conv_arithmetic\n", "- Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition.\n", "- Canziani, A., Paszke, A., & Culurciello, E. (2016). An analysis of deep neural network models for practical applications.\n", "- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition.\n", "- A Comprehensive Introduction to Different Types of Convolutions in Deep Learning, Kunlun Bai\n", "- https://animatedai.github.io/" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "rise": { "scroll": true } }, "nbformat": 4, "nbformat_minor": 4 }