{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "$$\n", "\\newcommand{\\mat}[1]{\\boldsymbol {#1}}\n", "\\newcommand{\\mattr}[1]{\\boldsymbol {#1}^\\top}\n", "\\newcommand{\\matinv}[1]{\\boldsymbol {#1}^{-1}}\n", "\\newcommand{\\vec}[1]{\\boldsymbol {#1}}\n", "\\newcommand{\\vectr}[1]{\\boldsymbol {#1}^\\top}\n", "\\newcommand{\\rvar}[1]{\\mathrm {#1}}\n", "\\newcommand{\\rvec}[1]{\\boldsymbol{\\mathrm{#1}}}\n", "\\newcommand{\\diag}{\\mathop{\\mathrm {diag}}}\n", "\\newcommand{\\set}[1]{\\mathbb {#1}}\n", "\\newcommand{\\norm}[1]{\\left\\lVert#1\\right\\rVert}\n", "\\newcommand{\\pderiv}[2]{\\frac{\\partial #1}{\\partial #2}}\n", "\\newcommand{\\bb}[1]{\\boldsymbol{#1}}\n", "\\newcommand{\\ip}[3]{\\left<#1,#2\\right>_{#3}}\n", "\\newcommand{\\E}[2][]{\\mathbb{E}_{#1}\\left[#2\\right]}\n", "$$\n", "\n", "# CS236781: Deep Learning\n", "# Tutorial 5: Optimization" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Introduction\n", "\n", "In this tutorial, we will cover:\n", "\n", "- Descent-based optimization\n", "- Back-propagation\n", "- Automatic differentiation\n", "- PyTorch backward functions\n", "- Bi-level differentiable optimization\n", "- Time-series prediction with CNNs" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2020-11-24T17:30:31.338632Z", "iopub.status.busy": "2020-11-24T17:30:31.337989Z", "iopub.status.idle": "2020-11-24T17:30:31.962948Z", "shell.execute_reply": "2020-11-24T17:30:31.963542Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Setup\n", "%matplotlib inline\n", "import os\n", "import sys\n", "import time\n", "import torch\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2020-11-24T17:30:31.966723Z", "iopub.status.busy": 
"2020-11-24T17:30:31.966239Z", "iopub.status.idle": "2020-11-24T17:30:31.985956Z", "shell.execute_reply": "2020-11-24T17:30:31.986488Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "plt.rcParams['font.size'] = 14\n", "data_dir = os.path.expanduser('~/.pytorch-datasets')\n", "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Theory Reminders" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Descent-based optimization" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "As we have seen, training deep neural networks is performed iteratively using descent-based optimization." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The general scheme is:\n", "\n", "1. Initialize parameters to some $\\vec{\\Theta}^0 \\in \\set{R}^P$, and set $k\\leftarrow 0$.\n", "2. While not converged:\n", "    1. Choose a direction $\\vec{d}^k\\in\\set{R}^P$\n", "    2. Choose a step size $\\eta_k\\in\\set{R}$\n", "    3. Update: $\\vec{\\Theta}^{k+1} \\leftarrow \\vec{\\Theta}^k + \\eta_k \\vec{d}^k$\n", "    4. $k\\leftarrow k+1$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which descent direction should we choose?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The one which maximally decreases the loss function $L(\\vec{\\Theta})$:\n", "\n", "$$\n", "\\vec{d} =\\arg\\min_{\\vec{d'}} L(\\vec{\\Theta}+\\vec{d'})-L(\\vec{\\Theta})\n", "\\approx\n", "\\arg\\min_{\\vec{d'}}\\nabla L(\\vec{\\Theta})^\\top\\vec{d'}, \\\n", "\\mathrm{s.t.}\\ \\norm{\\vec{d'}}_p=1\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Choice of norm determines $\\vec{d}$. 
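As a quick sketch (using a hypothetical 2-D quadratic loss, not part of this tutorial's code), both constrained directions can be read off the gradient:

```python
import torch

# Hypothetical quadratic loss L(theta) = theta^T A theta (illustration only)
A = torch.tensor([[3.0, 0.0],
                  [0.0, 1.0]])
theta = torch.tensor([1.0, 1.0], requires_grad=True)

loss = theta @ A @ theta
loss.backward()
g = theta.grad  # here: [6., 2.]

# p=2: gradient descent -- step against the full gradient (up to normalization)
d_gd = -g

# p=1: coordinate descent -- unit step along the largest-magnitude gradient component
d_cd = torch.zeros_like(g)
i = torch.argmax(g.abs())
d_cd[i] = -torch.sign(g[i])
```

Only the norm in the constraint changed, yet the resulting step is very different: one moves all coordinates at once, the other a single coordinate.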
For example,\n", "- $p=1$: Coordinate descent: a unit step along the single coordinate with the largest-magnitude gradient component.\n", "- $p=2$: Gradient descent: $\\vec{d}=-\\nabla L(\\vec{\\Theta})$.\n", "\n", "*(Figures omitted: descent trajectories under the $p=1$ and $p=2$ constraints.)*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Drawbacks and mitigations?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Susceptible to initialization**\n", "\n", "Initializing near a poor local minimum can cause the optimization to converge there, preventing it from finding better minima.\n", "\n", "
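To see this concretely, here is a minimal sketch (a hypothetical 1-D double-well loss, not part of this tutorial's code) in which the same gradient-descent loop reaches different minima depending only on the starting point:

```python
import torch

def loss(theta):
    # Hypothetical double-well loss: minima near theta = -1 and theta = +1,
    # with the 0.3*theta tilt making the left minimum the lower one
    return (theta**2 - 1.0)**2 + 0.3 * theta

def gradient_descent(theta0, lr=0.05, steps=200):
    theta = torch.tensor(theta0, requires_grad=True)
    for _ in range(steps):
        l = loss(theta)
        l.backward()
        with torch.no_grad():
            theta -= lr * theta.grad  # update: theta <- theta - lr * grad
        theta.grad.zero_()
    return theta.item()

# Same algorithm and step size, different initializations -> different minima
left = gradient_descent(-2.0)   # converges near the (better) left minimum
right = gradient_descent(2.0)   # stuck near the worse right minimum
```

Starting on the right, the iterates never cross the barrier between the wells, so the better minimum on the left is simply never found.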