{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "$\\LaTeX \\text{ commands here}\n", "\\newcommand{\\R}{\\mathbb{R}}\n", "\\newcommand{\\im}{\\text{im}\\,}\n", "\\newcommand{\\norm}[1]{||#1||}\n", "\\newcommand{\\inner}[1]{\\langle #1 \\rangle}\n", "\\newcommand{\\span}{\\mathrm{span}}\n", "\\newcommand{\\proj}{\\mathrm{proj}}\n", "\\newcommand{\\OPT}{\\mathrm{OPT}}\n", "\\newcommand{\\grad}{\\nabla}\n", "\\newcommand{\\eps}{\\varepsilon}\n", "$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", "\n", "**Georgia Tech, CS 4540**\n", "\n", "# L12: SGD and Online-to-Batch\n", "\n", "Jake Abernethy\n", "\n", "Quiz password: sgd\n", "\n", "*Tuesday, October 1, 2019*" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Introduction to Online Convex Optimization\n", "\n", "It became clear in the early 2000s that a lot of the tools that were being developed in sequential (online) learning problems were part of a generic framework. Researchers realized that you could generalize all of these ideas to a master setting. The basic setup is that you're given a convex and bounded decision set $K \\subset \\mathbb{R}^d$, and you will execute the following protocol.\n", "\n", "* For $t=1, \\ldots, T$\n", " + Alg selects $x_t \\in K$\n", " + Alg observes convex loss function $f_t : K \\to \\mathbb{R}$\n", "\n", "*Objective*: Minimize the regret, given by\n", "$$\n", "\\text{Regret}_T := \\sum_{t=1}^T f_t(x_t) - \\min_{x \\in K} \\sum_{t=1}^T f_t(x)\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Online Gradient Descent (OGD)\n", "\n", "You've now read about the OGD algorithm:\n", "* Initialize $x_1 \\in K$ to be any point\n", "* For $t=1, \\ldots, T$:\n", " + Observe $f_t(\\cdot)$\n", " + Update $x_{t+1} = \\Pi_K(x_t - \\eta_t \\nabla f_t(x_t))$\n", "\n", "**Theorem**: One can show that, as long as the functions $f_t$ are $G$-lipschitz, and the set $K$ has diameter $D$, then\n", "$$\\text{Regret}_T(\\text{OGD}) \\leq \\frac 3 2 GD\\sqrt{T}$$\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Perceptron Algorithm as OGD\n", "\n", "Recall the Perceptron algorithm. We have a sample of data $\\{(x_1, y_1), \\ldots, (x_T, y_T)\\}$ where $x_t \\in \\mathbb{R}^d$ and $y_t = \\pm 1$. We assume there exists some $w^* \\in \\mathbb{R}^d$ such that $y_t(w^*\\cdot x_t) > \\gamma$.\n", "\n", "- Initialize: $w_1 = \\vec{0}$\n", "- For $t=1,2,\\ldots$:\n", " + If $y_t(w_t\\cdot x_t) > 0$ then $w_{t+1} = w_t$\n", " + Otherwise, $w_{t+1} = w_t + y_t x_t$\n", " \n", "#### Problem\n", "\n", "Show that the Perceptron algorithm is a special case of Online Gradient Descent" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Convergence bound for Gradient Descent\n", "\n", "You will notice that the OGD algorithm looks very similar to the standard GD algorithm for minimizing a fixed convex lipschitz function $\\Phi(\\cdot)$. But keep in mind the differences here: OGD uses a sequence of functions, whereas GD operates on only a single function.\n", "\n", "#### Problem\n", "\n", "How can we used the regret bound for OGD to get a convergence bound for (iterate-averaged) gradient descent? *Hint*: You can use $\\Phi(\\cdot)$ more than once! And you might find Jensen useful." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Solution\n", "\n", "We can start with the regret bound for OGD:\n", "$$\\sum_{t = 1}^T f_t(x_t) - min_{x \\in K} \\sum_{t = 1}^T f_t(x) \\leq \\frac{3}{2}GD\\sqrt{T}$$\n", "\n", "Our function $\\phi(x_t)$ is convex, so we may plug it into this regret bound:\n", "\n", "$$\\sum_{t = 1}^T \\phi(x_t) - min_{x \\in K} \\sum_{t = 1}^T \\phi(x) \\leq \\frac{3}{2}GD\\sqrt{T}$$\n", "\n", "Note that $\\phi(x_t)$, as a function does not change with time, t. Since we are trying to find a bound for iterate-averaged GD, we should find the average of the function's output by dividing by the total time T.\n", "\n", "$$ \\frac{1}{T}\\sum_{t = 1}^T \\phi(x_t) - \\frac{1}{T}min_{x \\in K} \\sum_{t = 1}^T \\phi(x) \\leq \\frac{3GD}{2\\sqrt{T}} $$\n", "\n", "We can use Jensen's inequality, \"the average of a function of a value is greater than the function of the average\":\n", "\n", "$$\\phi\\left(\\frac{1}{T}\\sum_{t = 1}^T x_t\\right) - \\frac{1}{T}min_{x \\in K} \\sum_{t = 1}^T \\phi(x) \\leq \\frac{3GD}{2\\sqrt{T}} $$\n", "\n", "We also note that the second term, having the sum of T constant terms divided by T can be reduced to just the min function:\n", "\n", "$$\\phi\\left(\\frac{1}{T}\\sum_{t = 1}^T x_t\\right) - min_{x \\in K} \\phi(x) \\leq \\frac{3GD}{2\\sqrt{T}} $$\n", "\n", "This final statement puts a bound on the difference between the function at the average value and the function at the minimizing value. The bound is $\\frac{3GD}{2\\sqrt{T}}$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Probability Theory" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### What is a random variable? Discrete case.\n", "\n", "Technically speaking, a random variable $X : \\Omega \\to \\R$ is just a map from some *sample space* $\\Omega$ to the real numbers. We assume we have a *probability measure* $\\Omega$.\n", "\n", "$$\\def\\E{\\mathbb{E}}$$\n", "\n", "When $\\Omega$ is a finite or countably-infinite set, then we can assume our measure is just a probability $p(\\omega)$ assigned to every $\\omega \\in \\Omega$, where $\\sum_{\\omega \\in \\Omega} p_\\omega = 1$.\n", "* We often will write $P(X = a)$ which is precisely $\\sum_{\\omega \\in \\Omega : X(\\omega) = a} p(\\omega)$.\n", "* The expectation of a random variable $\\E[X]$ is defined as $\\sum_{\\omega \\in \\Omega} X(\\omega) p(\\omega)$.\n", "* Similarly, the expectation of a function $\\E[g(X)] = \\sum_{\\omega \\in \\Omega} g(X(\\omega)) p(\\omega)$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### What is a random variable? Continuous case.\n", "\n", "When the sample space $\\Omega$ is uncountably infinite, we usually need to go to full measure theory to talk about random variables in general. We won't do today, but consider a measure theory course!\n", "\n", "Countinuous random variables are the most common in Machine Learning. The best way to think about random variables is through their *probability density function* and *cumulative distribution function*\n", "* For random variable $X$, the CDF is $F(x) := P(X \\leq x)$\n", "* Also, the PDF of $X$ is the derivative of the CDF, $f(x) := F'(x)$\n", "* WARNING: not every random variable has a PDF! But it always has a CDF! (But CDF may not be diff'bl)\n", "* When a r.v. has a PDF $f(\\cdot)$, we can write\n", "$$ \\E[X] = \\int x f(x)\\, dx $$\n", "* When $\\mu$ is the mean of random variable $X$, then the variance $\\text{Var}(X)$ is $\\E[(X-\\mu)^2]$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### PDF vs CDF\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Independence\n", "\n", "* Strictly speaking, two random variables $X$ and $Y$ are independent if for any (measureable) sets $A, B \\subset \\R$ we have $P(X \\in A \\text{ and } Y \\in B) = P(X \\in A) P( Y \\in B)$.\n", "* Perhaps the most useful fact of independence of $X$ and $Y$ is that $\\E[XY] = \\E[X]\\E[Y]$\n", "
\n", "Problem: Show the following
\n", "Part A: Let $X$ be any random variable. Show that $\\text{Var}(X) = \\E[X^2] - (\\E[X])^2$
\n", "Part B: Let $X$ be some random variable. Then $\\text{Var}(X + X + X) = 9 \\text{Var}(X)$
\n", "Part C: Let $X_1, \\ldots, X_n$ be *independent* random variables. Then $\\text{Var}(X_1 + \\ldots + X_n) = \\text{Var}(X_1) + \\ldots + \\text{Var}(X_n)$
\n", "
\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Random functions\n", "\n", "In many fields, especially Machine Learning, we deal with functions that have randomized inputs. For example, assume we have some distribution $D$, we often consider functions of the form\n", "$$\n", "F(\\theta) := \\E_{\\xi \\sim D} [f(\\theta;\\xi)].\n", "$$\n", "Here, $\\theta$ are the parameters we want to optimize, and $\\xi$ is some random input, such as a datapoint. But we'd like to optimize $\\theta$ \"on average\", i.e. in expectation over a random sample of $\\xi \\sim D$. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Stochastic Gradient Descent\n", "\n", "We can optimize functions, defined through expectations above, using *randomized gradients*. \n", "\n", "- Initialize $\\theta_1$\n", "- For $t=1,2, \\ldots, T$:\n", " + Sample $\\xi_t \\sim D$\n", " + $\\theta_{t+1} = \\Pi_K(\\theta_t - \\eta_t \\nabla f(\\theta_t; \\xi_t))$\n", "- Output: $\\bar \\theta_T := \\frac 1 T \\sum_{t=1}^T \\theta_t$\n", "\n", "**Theorem** (Hazan book): As long as $f(\\cdot;\\xi)$ is convex, $\\|\\nabla f(\\cdot; \\cdot) \\| \\leq G$ and $\\text{diam}(K) \\leq D$, then we have\n", "$$\\E[F(\\bar \\theta_T)] - \\min_{\\theta \\in K} F(\\theta) \\leq \\frac{3GD}{2\\sqrt{T}}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## SGD versus standard GD\n", "\n", "Let $F(\\theta) := \\frac 1 n \\sum_{i=1}^n f(\\theta; \\xi_i)$ be some objective function, defined as an average over samples $\\xi_1, \\ldots \\xi_n$. Consider two algorithms for minimizing $F$:\n", "\n", "1. Run (average-iterate) Gradient Descent on $F$, for $n$ steps\n", "2. Run SGD on $F$, by sampling a random $\\xi_i$ on each round, for $n$ steps\n", "\n", "#### Problem\n", "\n", "1. What is the convergence guarantee for each algorithm? (I'm only interested in the $T$-dependence!)\n", "1. What is the computational complexity of each? Which is better?\n" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }