{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$\\LaTeX \\text{ commands here}\n",
    "\\newcommand{\\R}{\\mathbb{R}}\n",
    "\\newcommand{\\im}{\\text{im}\\,}\n",
    "\\newcommand{\\norm}[1]{||#1||}\n",
    "\\newcommand{\\inner}[1]{\\langle #1 \\rangle}\n",
    "\\newcommand{\\span}{\\mathrm{span}}\n",
    "\\newcommand{\\proj}{\\mathrm{proj}}\n",
    "\\newcommand{\\OPT}{\\mathrm{OPT}}\n",
    "\\newcommand{\\grad}{\\nabla}\n",
    "\\newcommand{\\eps}{\\varepsilon}\n",
    "$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<hr style=\"border: 5px solid black\">\n",
    "\n",
    "**Georgia Tech, CS 4540**\n",
    "\n",
    "# L8: Least Squares, Support Vector Machines, and Logistic Regression\n",
    "\n",
    "Jake Abernethy and Benjamin Bray\n",
    "\n",
    "Quiz password: regress\n",
    "\n",
    "*Tuesday, September 17, 2019*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Review:  Objective Functions\n",
    "\n",
    "General Objective Function for *supervised* machine learning problems:\n",
    "$$\n",
    "J(\\theta; x, y) = \\frac{1}{n} \\sum_{n=1}^N \\ell(y_n, h_\\theta(x_n))\n",
    "$$\n",
    "Goal: find a $\\theta$ that minimizes this objective! (*Note*: we often use $w$ for the vector of parameters)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Support Vector Machines\n",
    "\n",
    "In the SVM problem, we have $n$ *datapoints* $(x_1, y_1), \\ldots, (x_n, y_n)$, where $x_i \\in \\mathbb{R}^d$, and $y_i \\in \\{-1,1\\}$. We want to find a linear classifier $\\theta \\in \\mathbb{R}^d$ which classifies the datapoints with a *large margin* but allowing for some *slack*. The soft margin SVM is formulated as follows:\n",
    "\\begin{align*}\n",
    "\\mathop{\\text{minimize }}_{\\theta \\in \\mathbb{R}^d, \\xi_1,\\ldots, \\xi_n \\geq 0} \\quad & \\frac 1 2 \\|\\theta\\|^2 + C \\sum_{i=1}^n \\xi_i \\\\\n",
    "\\text{subject to } \\quad & y_i (\\theta^\\top x_i) \\geq 1 - \\xi_i \\quad \\text{ for } i=1, \\ldots, n\n",
    "\\end{align*}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Problem\n",
    "\n",
    "The above formulation, with $n + d$ variables, is actually equivalent to an unconstrained problem with only $d$ variables below. Why is that?\n",
    "$$\n",
    "\\mathop{\\text{minimize }}_{\\theta \\in \\mathbb{R}^d} \\quad \\frac 1 2 \\|\\theta\\|^2 + C\\sum_{i=1}^n \\max\\{0, 1 - y_i(\\theta^\\top x_i)\\}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## SVM with only support vectors\n",
    "\n",
    "Let us assume we have found the $\\theta^*$ that solves the SVM objective:\n",
    "$$\n",
    "\\mathop{\\text{minimize }}_{\\theta \\in \\mathbb{R}^d} \\quad \\frac 1 2 \\|\\theta\\|^2 + C\\sum_{i=1}^n \\max\\{0, 1 - y_i(\\theta^\\top x_i)\\}.\n",
    "$$\n",
    "Now consider the set of so-called *support vectors* $S := \\{ i : y_i(\\theta^{*\\top} x_i) \\leq 1 \\}$.\n",
    "\n",
    "#### Problem\n",
    "Is $\\theta^*$ also a solution of the following modified objective? Why or why not?\n",
    "$$\n",
    "\\mathop{\\text{minimize }}_{\\theta \\in \\mathbb{R}^d} \\quad \\frac 1 2 \\|\\theta\\|^2 + C\\sum_{i \\in S} \\max\\{0, 1 - y_i(\\theta^\\top x_i)\\}.\n",
    "$$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Least Squares\n",
    "\n",
    "##### Objective\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "J(\\theta; X, y)\n",
    "&= \\frac{1}{2} \\sum_{n=1}^N (y_n - \\theta^T x_n)^2 \\\\\n",
    "&= \\frac{1}{2} || y - \\theta^T X ||^2\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "* Input dataset  $X \\in \\mathbb{R}^{D \\times N}$ with columns $x_1,\\dots,x_N \\in \\R^D$\n",
    "* Output $y \\in \\mathbb{R}^N$ with entries $y_1,\\dots,y_n \\in \\mathbb{R}$\n",
    "* Hypothesis $h_\\theta : \\mathbb{R}^D \\rightarrow \\mathbb{R}$ defined as $h_\\theta(x) = \\theta^T x$\n",
    "* Squared loss $\\ell(y_1,y_2) = (y_1 - y_2)^2$\n",
    "* Weights / Parameters $\\theta \\in \\mathbb{R}^D$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Gradient Descent for Least Squares, convergence\n",
    "\n",
    "As shown in the reading, the (batch) gradient descent update for the least squares objective is:\n",
    "$$\n",
    "\\theta_{t+1}\n",
    "= \\theta_{t} + \\alpha \\sum_{n=1}^N (y_n - h_\\theta(x_n)) x_n\n",
    "$$\n",
    "\n",
    "#### Problem:  Convergence Rate\n",
    "\n",
    "Using the convergence theory from last lecture, what can you say about the convergence rate of gradient descent for least squares? Is \"iterate averaging\" required?\n",
    "\n",
    "(*Hint:* From the reading, you know the objective is convex. Are any stronger properties true?)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Problem:  Regularization\n",
    "\n",
    "Suppose we add a 2-norm regularizer to the objective function for least-squares linear regression.  How does the convergence rate change?\n",
    "$$\n",
    "J_{new}(\\theta) = J_{old}(\\theta) + \\lambda || \\theta ||^2\n",
    "$$\n",
    "(here, $\\lambda > 0$ is a parameter which controls the relative importance of the regularizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Logistic Regression\n",
    "\n",
    "In the notes, you saw an objective function for computing maximum likelihood estimates of the parameters to a logistic regression model.  The probabilistic interpretation is nice, but we can also view it as a variant of our \"formula\" above, where we choose to use the **binary cross-entropy loss**.\n",
    "\n",
    "##### Objective\n",
    "\n",
    "$$\n",
    "J(\\theta)\n",
    "= NLL(\\theta)\n",
    "= - \\sum_{n=1}^N \\bigg[ y_n \\log h_\\theta(x_n) + (1-y_n) \\log (1-h_\\theta(x_n)) \\bigg]\n",
    "$$\n",
    "\n",
    "\n",
    "\n",
    "- Input dataset  $X \\in \\mathbb{R}^{D \\times N}$ with columns $x_1,\\dots,x_N \\in \\R^D$\n",
    "- Binary Output $y \\in \\{ 0,1 \\}^N$ with entries $y_1,\\dots,y_n \\in \\{0,1\\}$\n",
    "- Hypothesis $h_\\theta : \\mathbb{R}^D \\rightarrow \\mathbb{R}$ defined as $h_\\theta(x) = \\sigma(\\theta^T x) = \\frac{1}{1 + \\exp(-\\theta^T x)}$\n",
    "- Logistic function $\\sigma(z) = \\frac{1}{1+\\exp(-z)}$ with the special property that $\\sigma'(z) = \\sigma(z) (1 - \\sigma(z))$\n",
    "- Cross-entropy loss $\\ell(y,z) = -y \\log z - (1-y) \\log (1-z)$\n",
    "- Weights / Parameters $\\theta \\in \\mathbb{R}^D$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Gradient Descent for Logistic Regression\n",
    "\n",
    "As shown in the notes, the (batch) gradient descent update rule is\n",
    "$$\n",
    "\\theta_{t+1} = \\theta_t + \\alpha \\sum_{n=1}^N (y_n - h_\\theta(x_n)) x_n\n",
    "$$\n",
    "\n",
    "#### Problem: Convergence Rate\n",
    "\n",
    "Using the convergence theory from last lecture, what can you say about the convergence rate of gradient descent for least squares?\n",
    "\n",
    "(*Hint:* From the reading, you know the objective is convex.  is the objective strongly convex?  strongly smooth?)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Problem: Regularizing Logistic Regression\n",
    "\n",
    "Suppose we add a 2-norm regularizer to the objective function for logistic regression.  How does the convergence rate change?\n",
    "$$\n",
    "J_{new}(\\theta) = J_{old}(\\theta) + \\lambda || \\theta ||^2\n",
    "$$\n",
    "(here, $\\lambda > 0$ is a parameter which controls the relative importance of the regularizer)"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}