{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$\\LaTeX \\text{ commands here}\n",
    "\\newcommand{\\R}{\\mathbb{R}}\n",
    "\\newcommand{\\im}{\\text{im}\\,}\n",
    "\\newcommand{\\norm}[1]{||#1||}\n",
    "\\newcommand{\\inner}[1]{\\langle #1 \\rangle}\n",
    "\\newcommand{\\span}{\\mathrm{span}}\n",
    "\\newcommand{\\proj}{\\mathrm{proj}}\n",
    "\\newcommand{\\OPT}{\\mathrm{OPT}}\n",
    "\\newcommand{\\grad}{\\nabla}\n",
    "\\newcommand{\\eps}{\\varepsilon}\n",
    "$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<hr style=\"border: 5px solid black\">\n",
    "\n",
    "**Georgia Tech, CS 4540**\n",
    "\n",
    "# L12: SGD and Online-to-Batch\n",
    "\n",
    "Jake Abernethy\n",
    "\n",
    "Quiz password: sgd\n",
    "\n",
    "*Tuesday, October 1, 2019*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Introduction to Online Convex Optimization\n",
    "\n",
    "It became clear in the early 2000s that a lot of the tools that were being developed in sequential (online) learning problems were part of a generic framework. Researchers realized that you could generalize all of these ideas to a master setting. The basic setup is that you're given a convex and bounded decision set $K \\subset \\mathbb{R}^d$, and you will execute the following protocol.\n",
    "\n",
    "* For $t=1, \\ldots, T$\n",
    "    + Alg selects $x_t \\in K$\n",
    "    + Alg observes convex loss function $f_t : K \\to \\mathbb{R}$\n",
    "\n",
    "*Objective*: Minimize the regret, given by\n",
    "$$\n",
    "\\text{Regret}_T := \\sum_{t=1}^T f_t(x_t) - \\min_{x \\in K} \\sum_{t=1}^T f_t(x)\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Online Gradient Descent (OGD)\n",
    "\n",
    "You've now read about the OGD algorithm:\n",
    "* Initialize $x_1 \\in K$ to be any point\n",
    "* For $t=1, \\ldots, T$:\n",
    "    + Observe $f_t(\\cdot)$\n",
    "    + Update $x_{t+1} = \\Pi_K(x_t - \\eta_t \\nabla f_t(x_t))$\n",
    "\n",
    "**Theorem**: One can show that, as long as the functions $f_t$ are $G$-lipschitz, and the set $K$ has diameter $D$, then\n",
    "$$\\text{Regret}_T(\\text{OGD}) \\leq \\frac 3 2 GD\\sqrt{T}$$\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Perceptron Algorithm as OGD\n",
    "\n",
    "Recall the Perceptron algorithm. We have a sample of data $\\{(x_1, y_1), \\ldots, (x_T, y_T)\\}$ where $x_t \\in \\mathbb{R}^d$ and $y_t = \\pm 1$. We assume there exists some $w^* \\in \\mathbb{R}^d$ such that $y_t(w^*\\cdot x_t) > \\gamma$.\n",
    "\n",
    "- Initialize: $w_1 = \\vec{0}$\n",
    "- For $t=1,2,\\ldots$:\n",
    "    + If $y_t(w_t\\cdot x_t) > 0$ then $w_{t+1} = w_t$\n",
    "    + Otherwise, $w_{t+1} = w_t + y_t x_t$\n",
    "    \n",
    "#### Problem\n",
    "\n",
    "Show that the Perceptron algorithm is a special case of Online Gradient Descent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Convergence bound for Gradient Descent\n",
    "\n",
    "You will notice that the OGD algorithm looks very similar to the standard GD algorithm for minimizing a fixed convex lipschitz function $\\Phi(\\cdot)$. But keep in mind the differences here: OGD uses a sequence of functions, whereas GD operates on only a single function.\n",
    "\n",
    "#### Problem\n",
    "\n",
    "How can we used the regret bound for OGD to get a convergence bound for (iterate-averaged) gradient descent? *Hint*: You can use $\\Phi(\\cdot)$ more than once! And you might find Jensen useful."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "#### Solution\n",
    "\n",
    "We can start with the regret bound for OGD:\n",
    "$$\\sum_{t = 1}^T f_t(x_t) - min_{x \\in K} \\sum_{t = 1}^T f_t(x) \\leq \\frac{3}{2}GD\\sqrt{T}$$\n",
    "\n",
    "Our function $\\phi(x_t)$ is convex, so we may plug it into this regret bound:\n",
    "\n",
    "$$\\sum_{t = 1}^T \\phi(x_t) - min_{x \\in K} \\sum_{t = 1}^T \\phi(x) \\leq \\frac{3}{2}GD\\sqrt{T}$$\n",
    "\n",
    "Note that $\\phi(x_t)$, as a function does not change with time, t. Since we are trying to find a bound for iterate-averaged GD, we should find the average of the function's output by dividing by the total time T.\n",
    "\n",
    "$$ \\frac{1}{T}\\sum_{t = 1}^T \\phi(x_t) - \\frac{1}{T}min_{x \\in K} \\sum_{t = 1}^T \\phi(x) \\leq \\frac{3GD}{2\\sqrt{T}} $$\n",
    "\n",
    "We can use Jensen's inequality, \"the average of a function of a value is greater than the function of the average\":\n",
    "\n",
    "$$\\phi\\left(\\frac{1}{T}\\sum_{t = 1}^T x_t\\right) - \\frac{1}{T}min_{x \\in K} \\sum_{t = 1}^T \\phi(x) \\leq \\frac{3GD}{2\\sqrt{T}} $$\n",
    "\n",
    "We also note that the second term, having the sum of T constant terms divided by T can be reduced to just the min function:\n",
    "\n",
    "$$\\phi\\left(\\frac{1}{T}\\sum_{t = 1}^T x_t\\right) - min_{x \\in K} \\phi(x) \\leq \\frac{3GD}{2\\sqrt{T}} $$\n",
    "\n",
    "This final statement puts a bound on the difference between the function at the average value and the function at the minimizing value. The bound is $\\frac{3GD}{2\\sqrt{T}}$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Probability Theory"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### What is a random variable? Discrete case.\n",
    "\n",
    "Technically speaking, a random variable $X : \\Omega \\to \\R$ is just a map from some *sample space* $\\Omega$ to the real numbers. We assume we have a *probability measure* $\\Omega$.\n",
    "\n",
    "$$\\def\\E{\\mathbb{E}}$$\n",
    "\n",
    "When $\\Omega$ is a finite or countably-infinite set, then we can assume our measure is just a probability $p(\\omega)$ assigned to every $\\omega \\in \\Omega$, where $\\sum_{\\omega \\in \\Omega} p_\\omega = 1$.\n",
    "* We often will write $P(X = a)$ which is precisely $\\sum_{\\omega \\in \\Omega : X(\\omega) = a} p(\\omega)$.\n",
    "* The expectation of a random variable $\\E[X]$ is defined as $\\sum_{\\omega \\in \\Omega} X(\\omega) p(\\omega)$.\n",
    "* Similarly, the expectation of a function $\\E[g(X)] = \\sum_{\\omega \\in \\Omega} g(X(\\omega)) p(\\omega)$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### What is a random variable? Continuous case.\n",
    "\n",
    "When the sample space $\\Omega$ is uncountably infinite, we usually need to go to full measure theory to talk about random variables in general. We won't do today, but consider a measure theory course!\n",
    "\n",
    "Countinuous random variables are the most common in Machine Learning. The best way to think about random variables is through their *probability density function* and *cumulative distribution function*\n",
    "* For random variable $X$, the CDF is $F(x) := P(X \\leq x)$\n",
    "* Also, the PDF of $X$ is the derivative of the CDF, $f(x) := F'(x)$\n",
    "* WARNING: not every random variable has a PDF! But it always has a CDF! (But CDF may not be diff'bl)\n",
    "* When a r.v. has a PDF $f(\\cdot)$, we can write\n",
    "$$ \\E[X] = \\int x f(x)\\, dx $$\n",
    "* When $\\mu$ is the mean of random variable $X$, then the variance $\\text{Var}(X)$ is $\\E[(X-\\mu)^2]$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### PDF vs CDF\n",
    "\n",
    "<img src=\"images/pdf_cdf.gif\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Independence\n",
    "\n",
    "* Strictly speaking, two random variables $X$ and $Y$ are independent if for any (measureable) sets $A, B \\subset \\R$ we have $P(X \\in A \\text{ and } Y \\in B) = P(X \\in A) P( Y \\in B)$.\n",
    "* Perhaps the most useful fact of independence of $X$ and $Y$ is that $\\E[XY] = \\E[X]\\E[Y]$\n",
    "<div style=\"padding:10px; margin:10px; border: 1px solid black\">\n",
    "<b>Problem:</b> Show the following<br>\n",
    "<b>Part A:</b> Let $X$ be any random variable. Show that $\\text{Var}(X) = \\E[X^2] - (\\E[X])^2$<br>\n",
    "<b>Part B:</b> Let $X$ be some random variable. Then $\\text{Var}(X + X + X) = 9 \\text{Var}(X)$<br>\n",
    "<b>Part C:</b> Let $X_1, \\ldots, X_n$ be *independent* random variables. Then $\\text{Var}(X_1 + \\ldots + X_n) = \\text{Var}(X_1) + \\ldots + \\text{Var}(X_n)$<br>\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Random functions\n",
    "\n",
    "In many fields, especially Machine Learning, we deal with functions that have randomized inputs. For example, assume we have some distribution $D$, we often consider functions of the form\n",
    "$$\n",
    "F(\\theta) := \\E_{\\xi \\sim D} [f(\\theta;\\xi)].\n",
    "$$\n",
    "Here, $\\theta$ are the parameters we want to optimize, and $\\xi$ is some random input, such as a datapoint. But we'd like to optimize $\\theta$ \"on average\", i.e. in expectation over a random sample of $\\xi \\sim D$. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Stochastic Gradient Descent\n",
    "\n",
    "We can optimize functions, defined through expectations above, using *randomized gradients*. \n",
    "\n",
    "- Initialize $\\theta_1$\n",
    "- For $t=1,2, \\ldots, T$:\n",
    "    + Sample $\\xi_t \\sim D$\n",
    "    + $\\theta_{t+1} = \\Pi_K(\\theta_t - \\eta_t \\nabla f(\\theta_t; \\xi_t))$\n",
    "- Output: $\\bar \\theta_T := \\frac 1 T \\sum_{t=1}^T \\theta_t$\n",
    "\n",
    "**Theorem** (Hazan book): As long as $f(\\cdot;\\xi)$ is convex, $\\|\\nabla f(\\cdot; \\cdot) \\| \\leq G$ and $\\text{diam}(K) \\leq D$, then we have\n",
    "$$\\E[F(\\bar \\theta_T)] - \\min_{\\theta \\in K} F(\\theta) \\leq \\frac{3GD}{2\\sqrt{T}}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## SGD versus standard GD\n",
    "\n",
    "Let $F(\\theta) := \\frac 1 n \\sum_{i=1}^n f(\\theta; \\xi_i)$ be some objective function, defined as an average over samples $\\xi_1, \\ldots \\xi_n$. Consider two algorithms for minimizing $F$:\n",
    "\n",
    "1. Run (average-iterate) Gradient Descent on $F$, for $n$ steps\n",
    "2. Run SGD on $F$, by sampling a random $\\xi_i$ on each round, for $n$ steps\n",
    "\n",
    "#### Problem\n",
    "\n",
    "1. What is the convergence guarantee for each algorithm? (I'm only interested in the $T$-dependence!)\n",
    "1. What is the computational complexity of each? Which is better?\n"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}