{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$\\LaTeX \\text{ commands here}\n",
    "\\newcommand{\\R}{\\mathbb{R}}\n",
    "\\newcommand{\\im}{\\text{im}\\,}\n",
    "\\newcommand{\\norm}[1]{||#1||}\n",
    "\\newcommand{\\inner}[1]{\\langle #1 \\rangle}\n",
    "\\newcommand{\\span}{\\mathrm{span}}\n",
    "\\newcommand{\\proj}{\\mathrm{proj}}\n",
    "\\newcommand{\\OPT}{\\mathrm{OPT}}\n",
    "$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<hr style=\"border: 5px solid black\">\n",
    "\n",
    "**Georgia Tech, CS 4540**\n",
    "\n",
    "# L17: Boosting + AdaBoost\n",
    "\n",
    "Jake Abernethy\n",
    "\n",
    "*Tuesday, October 29, 2019*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Recall: The Minimax Theorem\n",
    "\n",
    "Proved by von Neumann in the 1920s!\n",
    "\n",
    "**Theorem**: For any matrix $M \\in \\R^{n \\times m}$ we have\n",
    "$$\\min_{p \\in \\Delta_n} \\max_{j \\in [m]} p^\\top M_{:j} = \\max_{q \\in \\Delta_m} \\min_{i \\in [n]} M_{i:} q$$\n",
    "Equivalently:\n",
    "$$\\min_{p \\in \\Delta_n} \\max_{q \\in \\Delta_m} p^\\top M q = \\max_{q \\in \\Delta_m} \\min_{p \\in \\Delta_n} p^\\top M q$$\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Base/Weak Learners\n",
    "\n",
    "- Let $(\\mathbf{x}_1, y_1), ..., (\\mathbf{x}_n, y_n)$ be the training data.\n",
    "- Let $\\mathscr{F}$ be a fixed set of classifiers called the base class.\n",
    "- A base learner for $\\mathscr{F}$ is a rule that takes as input a set of weights $\\mathbf{w} = (w_1, ..., w_n)$ such that $w_i \\geq 0, \\sum w_i = 1$, and outputs a classifier $f \\in \\mathscr{F}$ such that the weighted empirical risk $$e_w(f) = \\sum \\limits_{i = 1}^n w_i \\mathbb{1}_{\\{f(\\mathbf{x}_i) \\neq y_i\\}}$$ is (approximately) minimized."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Boosting (2 Class Scenario)\n",
    "\n",
    "- Assume class labels are -1 and +1.\n",
    "- The final classifier then has the form: \n",
    " - $h_T(\\mathbf{x}) = \\text{sgn}\\left(\\sum \\limits_{t = 1}^T \\alpha_t f_t(\\mathbf{x})\\right)$ where $f_1, ..., f_T$ are called base (or weak) classifiers and $\\alpha_1, ..., \\alpha_T > 0$ reflect the confidence of the various base classifiers."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Examples of Base (Weak) Learners\n",
    "\n",
    "- Decision Stumps, i.e., decision trees with depth 1\n",
    "- Decision Trees\n",
    "- Polynomial thresholds, i.e., $$f(\\vec{x}) = \\pm \\text{sign}((\\vec{w}^\\top \\vec{x})^2 - b)$$ where $b \\in \\mathbb{R}$ and $\\vec{w} \\in \\mathbb{R}^d$ is a radial kernel."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### AdaBoost (Adaptive Boosting)\n",
    "\n",
    "- The first concrete algorithm to successfully realize the boosting principle.\n",
    "\n",
    "<img src=\"./images/adaboost.gif\" width=55%>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### AdaBoost Algorithm\n",
    "\n",
    "An *iterative* algorithm for \"ensembling\" base learners\n",
    "\n",
    "- Input: $\\{(\\mathbf{x}_i, y_i)\\}_{i = 1}^n, T, \\mathscr{F}$, base learner\n",
    "- Initialize: $\\mathbf{w}^{1} = (\\frac{1}{n}, ..., \\frac{1}{n})$\n",
    "- For $t = 1, ..., T$\n",
    " - $\\mathbf{w}^{t} \\rightarrow \\boxed{\\text{base learner}} \\rightarrow f_t$\n",
    " - $\\alpha_t = \\frac{1}{2}\\text{ln}\\left(\\frac{1 - r_t}{r_t}\\right)$\n",
    "     - where $r_t := e_{\\mathbf{w}^t}(f_t) = \\frac 1 n \\sum \\limits_{i = 1}^n \\mathbf{w}_i \\mathbf{1}_{\\{f(\\mathbf{x}_i) \\neq y_i\\}} $\n",
    " - $w_i^{t + 1} = \\frac{\\mathbf{w}_i^t \\exp \\left(- \\alpha_ty_if_t(\\mathbf{x}_i)\\right)}{z_t}$ where $z_t$ normalizes.\n",
    "- Output: $h_T(\\mathbf{x}) = \\text{sign}\\left(\\sum \\limits_{t = 1}^T \\alpha_t f_t(\\mathbf{x})\\right)$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Weak Learning Assumption\n",
    "\n",
    "For Boosting to work, one needs the *weak learning hypothesis*: there is a $\\gamma > 0$ such that given any vector of weights $\\mathbf{w} \\in \\Delta_n$ that forms a distribution over your training set, you can always find a base hypothesis $f$ such that\n",
    "$$e_{\\mathbf{w}^t}(f) :=  \\sum \\limits_{i = 1}^n \\mathbf{w}_i \\mathbf{1}[f(\\mathbf{x}_i) = y_i] \\leq \\frac 1 2 - \\gamma$$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Strong Learning Assumption\n",
    "\n",
    "Weak learning isn't good enough. What I want is to be able to combine the weak learners in a way that correctly predicts every example. That is, you want to find $\\alpha_1, \\ldots, \\alpha_T$ and base learners $f_1, \\ldots, f_T$ so that for every $i = 1, \\ldots, n$:\n",
    "$$\n",
    "h_T(\\mathbf{x}_i) := \\text{sign}\\left(\\sum \\limits_{t = 1}^T \\alpha_t f_t(\\mathbf{x}_i)\\right) = y_i\n",
    "$$\n",
    "Equivalently, assuming $f_i$ outputs $\\pm 1$, if there are $m$ base learners:\n",
    "$$\n",
    "\\exists \\vec{\\alpha} \\in \\mathbf{R}^m_+ \\; \\forall i \\in [n]: \\quad y_i \\sum_{j = 1}^m \\alpha_j f_j(\\mathbf{x}_i) > 0\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## This two are primal/dual statements!\n",
    "\n",
    "It turns out that we can formulate this as a game. Let $M \\in \\{+1, -1\\}^{n \\times m}$ defined by:\n",
    "$$ M_{i,j} := y_i f_j(x_i)$$\n",
    "\n",
    "#### Problem\n",
    "How does the minimax value and the maximin value of the game relate to the Strong Learning and Weak Learning hypotheses? What does the minimax theorem tell us?\n",
    "\\begin{align*}\n",
    "\\text{Weak Learning}: &\\quad  \\forall \\mathbf{w} \\in \\Delta_n \\exists j: \\quad \\sum \\limits_{i = 1}^n \\mathbf{w}_i \\mathbf{1}[f_j(\\mathbf{x}_i) \\ne y_i] \\leq \\frac 1 2 - \\gamma\n",
    "\\\\\n",
    "\\text{Strong Learning}:&\\quad \\exists \\vec{\\alpha} \\in \\mathbf{R}^m_+ \\forall i \\in [n]: \\quad y_i \\sum_{j = 1}^m \\alpha_j f_j(\\mathbf{x}_i) > 0\n",
    "\\end{align*}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Adaboost through Coordinate Descent\n",
    "\n",
    "It is often said that we can view Adaboost as \"Coordinate Descent\" on the exponential loss function. Coordinate descent is a simple algorithm: to minimize $L(\\vec \\alpha)$, first pick some appropriate coordiante  $i$, then optimize *only this coordinate*. Repeat.\n",
    "\n",
    "#### Problem\n",
    "\n",
    "Show that the AdaBoost algorithm can be viewed as coordinate descent on *exponential loss function* \n",
    "$$\\text{Loss}(\\vec{\\alpha}) = \\sum_{i=1}^n \\exp \\left( - y_i \\left(\\sum_{j=1}^m \\alpha_j h_j(\\vec{x}_i)\\right)\\right)$$\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}