{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "$\\LaTeX \\text{ commands here}\n", "\\newcommand{\\R}{\\mathbb{R}}\n", "\\newcommand{\\im}{\\text{im}\\,}\n", "\\newcommand{\\norm}[1]{||#1||}\n", "\\newcommand{\\inner}[1]{\\langle #1 \\rangle}\n", "\\newcommand{\\span}{\\mathrm{span}}\n", "\\newcommand{\\proj}{\\mathrm{proj}}\n", "\\newcommand{\\OPT}{\\mathrm{OPT}}\n", "$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", "\n", "**Georgia Tech, CS 4540**\n", "\n", "# L17: Boosting + AdaBoost\n", "\n", "Jake Abernethy\n", "\n", "*Tuesday, October 29, 2019*" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Recall: The Minimax Theorem\n", "\n", "Proved by von Neumann in the 1920s!\n", "\n", "**Theorem**: For any matrix $M \\in \\R^{n \\times m}$ we have\n", "$$\\min_{p \\in \\Delta_n} \\max_{j \\in [m]} p^\\top M_{:j} = \\max_{q \\in \\Delta_m} \\min_{i \\in [n]} M_{i:} q$$\n", "Equivalently:\n", "$$\\min_{p \\in \\Delta_n} \\max_{q \\in \\Delta_m} p^\\top M q = \\max_{q \\in \\Delta_m} \\min_{p \\in \\Delta_n} p^\\top M q$$\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Base/Weak Learners\n", "\n", "- Let $(\\mathbf{x}_1, y_1), ..., (\\mathbf{x}_n, y_n)$ be the training data.\n", "- Let $\\mathscr{F}$ be a fixed set of classifiers called the base class.\n", "- A base learner for $\\mathscr{F}$ is a rule that takes as input a set of weights $\\mathbf{w} = (w_1, ..., w_n)$ such that $w_i \\geq 0, \\sum w_i = 1$, and outputs a classifier $f \\in \\mathscr{F}$ such that the weighted empirical risk $$e_w(f) = \\sum \\limits_{i = 1}^n w_i \\mathbb{1}_{\\{f(\\mathbf{x}_i) \\neq y_i\\}}$$ is (approximately) minimized." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Boosting (2 Class Scenario)\n", "\n", "- Assume class labels are -1 and +1.\n", "- The final classifier then has the form: \n", " - $h_T(\\mathbf{x}) = \\text{sgn}\\left(\\sum \\limits_{t = 1}^T \\alpha_t f_t(\\mathbf{x})\\right)$ where $f_1, ..., f_T$ are called base (or weak) classifiers and $\\alpha_1, ..., \\alpha_T > 0$ reflect the confidence of the various base classifiers." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Examples of Base (Weak) Learners\n", "\n", "- Decision Stumps, i.e., decision trees with depth 1\n", "- Decision Trees\n", "- Polynomial thresholds, i.e., $$f(\\vec{x}) = \\pm \\text{sign}((\\vec{w}^\\top \\vec{x})^2 - b)$$ where $b \\in \\mathbb{R}$ and $\\vec{w} \\in \\mathbb{R}^d$ is a radial kernel." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### AdaBoost (Adaptive Boosting)\n", "\n", "- The first concrete algorithm to successfully realize the boosting principle.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### AdaBoost Algorithm\n", "\n", "An *iterative* algorithm for \"ensembling\" base learners\n", "\n", "- Input: $\\{(\\mathbf{x}_i, y_i)\\}_{i = 1}^n, T, \\mathscr{F}$, base learner\n", "- Initialize: $\\mathbf{w}^{1} = (\\frac{1}{n}, ..., \\frac{1}{n})$\n", "- For $t = 1, ..., T$\n", " - $\\mathbf{w}^{t} \\rightarrow \\boxed{\\text{base learner}} \\rightarrow f_t$\n", " - $\\alpha_t = \\frac{1}{2}\\text{ln}\\left(\\frac{1 - r_t}{r_t}\\right)$\n", " - where $r_t := e_{\\mathbf{w}^t}(f_t) = \\frac 1 n \\sum \\limits_{i = 1}^n \\mathbf{w}_i \\mathbf{1}_{\\{f(\\mathbf{x}_i) \\neq y_i\\}} $\n", " - $w_i^{t + 1} = \\frac{\\mathbf{w}_i^t \\exp \\left(- \\alpha_ty_if_t(\\mathbf{x}_i)\\right)}{z_t}$ where $z_t$ normalizes.\n", "- Output: $h_T(\\mathbf{x}) = \\text{sign}\\left(\\sum \\limits_{t = 1}^T \\alpha_t f_t(\\mathbf{x})\\right)$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Weak Learning Assumption\n", "\n", "For Boosting to work, one needs the *weak learning hypothesis*: there is a $\\gamma > 0$ such that given any vector of weights $\\mathbf{w} \\in \\Delta_n$ that forms a distribution over your training set, you can always find a base hypothesis $f$ such that\n", "$$e_{\\mathbf{w}^t}(f) := \\sum \\limits_{i = 1}^n \\mathbf{w}_i \\mathbf{1}[f(\\mathbf{x}_i) = y_i] \\leq \\frac 1 2 - \\gamma$$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Strong Learning Assumption\n", "\n", "Weak learning isn't good enough. What I want is to be able to combine the weak learners in a way that correctly predicts every example. That is, you want to find $\\alpha_1, \\ldots, \\alpha_T$ and base learners $f_1, \\ldots, f_T$ so that for every $i = 1, \\ldots, n$:\n", "$$\n", "h_T(\\mathbf{x}_i) := \\text{sign}\\left(\\sum \\limits_{t = 1}^T \\alpha_t f_t(\\mathbf{x}_i)\\right) = y_i\n", "$$\n", "Equivalently, assuming $f_i$ outputs $\\pm 1$, if there are $m$ base learners:\n", "$$\n", "\\exists \\vec{\\alpha} \\in \\mathbf{R}^m_+ \\; \\forall i \\in [n]: \\quad y_i \\sum_{j = 1}^m \\alpha_j f_j(\\mathbf{x}_i) > 0\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## This two are primal/dual statements!\n", "\n", "It turns out that we can formulate this as a game. Let $M \\in \\{+1, -1\\}^{n \\times m}$ defined by:\n", "$$ M_{i,j} := y_i f_j(x_i)$$\n", "\n", "#### Problem\n", "How does the minimax value and the maximin value of the game relate to the Strong Learning and Weak Learning hypotheses? What does the minimax theorem tell us?\n", "\\begin{align*}\n", "\\text{Weak Learning}: &\\quad \\forall \\mathbf{w} \\in \\Delta_n \\exists j: \\quad \\sum \\limits_{i = 1}^n \\mathbf{w}_i \\mathbf{1}[f_j(\\mathbf{x}_i) \\ne y_i] \\leq \\frac 1 2 - \\gamma\n", "\\\\\n", "\\text{Strong Learning}:&\\quad \\exists \\vec{\\alpha} \\in \\mathbf{R}^m_+ \\forall i \\in [n]: \\quad y_i \\sum_{j = 1}^m \\alpha_j f_j(\\mathbf{x}_i) > 0\n", "\\end{align*}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Adaboost through Coordinate Descent\n", "\n", "It is often said that we can view Adaboost as \"Coordinate Descent\" on the exponential loss function. Coordinate descent is a simple algorithm: to minimize $L(\\vec \\alpha)$, first pick some appropriate coordiante $i$, then optimize *only this coordinate*. Repeat.\n", "\n", "#### Problem\n", "\n", "Show that the AdaBoost algorithm can be viewed as coordinate descent on *exponential loss function* \n", "$$\\text{Loss}(\\vec{\\alpha}) = \\sum_{i=1}^n \\exp \\left( - y_i \\left(\\sum_{j=1}^m \\alpha_j h_j(\\vec{x}_i)\\right)\\right)$$\n", "\n" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }