{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# <img src=\"https://img.icons8.com/dusk/64/000000/mind-map.png\" style=\"height:50px;display:inline\"> CS 236756 - Technion - Intro to Machine Learning\n",
    "---\n",
    "#### Tal Daniel\n",
    "\n",
    "\n",
    "## Tutorial 14 - PAC Learning & VC Dimension\n",
    "---\n",
    "\n",
    "<img src=\"./assets/tut_13_pacman.png\" style=\"height:200px\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### <img src=\"https://img.icons8.com/bubbles/50/000000/checklist.png\" style=\"height:50px;display:inline\"> Agenda\n",
    "---\n",
    "* [The PAC (**P**robably **A**pproximately **C**orrect) Learning Framework](#-The-PAC-Learning-Framework)\n",
    "    * [Empirical Risk Minimization (ERM)](#-Empirical-Risk-Minimization-(ERM))\n",
    "    * [The Fundamental Theorem of Statistical Learning](#-The-Fundamental-Theorem-of-Statistical-Learning)\n",
    "* [The VC Dimension](#-VC-Dimension)\n",
    "    * [Theory](#-VC-Dimension---Formal-Definition)\n",
    "    * [Examples](#-VC-Dimension---Examples)\n",
    "* [Recommended Videos](#-Recommended-Videos)\n",
    "* [Credits](#-Credits)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## <img src=\"https://img.icons8.com/cotton/64/000000/pacman.png\" style=\"height:50px;display:inline\"> The PAC Learning Framework\n",
    "---\n",
    "PAC stands for \"probably approximately correct\", which is a framework and set of assumptions under which numerous results on learning theory were proven."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### <img src=\"https://img.icons8.com/dusk/64/000000/class.png\" style=\"height:50px;display:inline\"> Classification Learning Problem\n",
    "---\n",
    "* The learner's *input*:\n",
    "    * **Domain Set - $\\mathcal{X}$**: the set of objects we wish to label.\n",
    "    * **Label Set - $\\mathcal{Y}$**: possible outcomes of an experiment.\n",
    "    * **Training Data - $S=\\{(x^{(i)}, y^{(i)}); i=1,...,m\\}$**: a finite sequence of pairs in $\\mathcal{X} \\times \\mathcal{Y}$ \n",
    "        * Drawn iid from some probability distribution $\\mathcal{D}$\n",
    "* The learner's *output*:\n",
    "    * **Prediction Rule - hypothesis** - $h: \\mathcal{X} \\to \\mathcal{Y}$: a function that must predict a label for new domain points.\n",
    "        * The function is also called: predictor, hypothesis or classifier.\n",
    "* Sample generating model\n",
    "    * We assume the instances are generated by an **unknown** probability distribution over $\\mathcal{X}$ denoted $\\mathcal{D}$.\n",
    "    * **i.i.d.**: each $x^{(i)}$ is sampled independently from $\\mathcal{D}$.\n",
    "    * **Realizability**: we also assume: $\\exists f, f: \\mathcal{X} \\to \\mathcal{Y}$ such that $y^{(i)} = f(x^{(i)}), \\forall i$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* Measures of success\n",
    "    * **Training Error** (also called the **empirical risk** or **empirical error**): $$ \\hat{\\epsilon}(h) = \\hat{L}(h) = \\frac{1}{m} \\sum_{i=1}^m \\mathbb{1} \\{h(x^{(i)}) \\neq y^{(i)} \\}$$\n",
    "    * **Classifier Error** (also called the **generalization error**, the **risk** or the **true error**): the error of $h$ is the probability to draw a random sample $(x, y) \\sim \\mathcal{D}$ such that $h(x) \\neq y$: $$ \\epsilon(h) = L(h) = P_{(x,y) \\sim \\mathcal{D}}(h(x) \\neq y)$$\n",
    "        * This is the probability that, if we now draw a new example $(x,y)$ from $\\mathcal{D}$, $h$ will misclassify it.\n",
    "        * We assume that the training data was drawn from the *same* distribution $\\mathcal{D}$ with which we are going to evaluate our hypothesis (the assumption of training and testing on the same distribution is part of the **PAC assumptions**)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### <img src=\"https://img.icons8.com/bubbles/100/000000/classroom.png\" style=\"height:30px;display:inline\"> Classifier Error Example\n",
    "---\n",
    "\n",
    "* Assume binary features of *papayas* (the fruit...)\n",
    "\n",
    "| Softness | Color | $Pr(x) $ (Probability)|  $h(x)$ | $f(x)$ |\n",
    "|------|------|------|------|------|\n",
    "| Soft | Green | 0.1 | Tasty | Not-Tasty|\n",
    "| Hard | Green | 0.1 | Not-Tasty | Not-Tasty|\n",
    "| Soft | Orange | 0.7 | Tasty | Tasty|\n",
    "| Hard | Orange | 0.1 | Tasty | Not-Tasty|\n",
    "\n",
    "* $\\hat{L}(h) = \\hat{\\epsilon}(h) = 0.5$\n",
    "* $L(h) = \\epsilon(h) = 0.2$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* What is $L_D(h)$?\n",
    "    * We can only approximate it with some probability.\n",
    "* Why can it only be **approximately** correct?\n",
    "    * **Claim**: we can't hope to find $h \\in \\mathcal{H}, \\text{s.t. } L_{D,f}(h) = 0$\n",
    "    * **Proof**:\n",
    "        * For every $\\epsilon \\in (0,1)$ take $X = \\{x_1, x_2\\}, P(x_1) = 1 - \\epsilon, P(x_2) = \\epsilon$\n",
    "        * The probability not to see $x_2$ at all among $m$ i.i.d. examples is $(1-\\epsilon)^m \\approx e^{-\\epsilon m}$\n",
    "        * So, if $\\epsilon << \\frac{1}{m}$ we are likely not to see $x_2$ at all, but then we can't know its label!\n",
    "    * **Relaxation**: we would be happy with $L_{D,f}(h) \\leq \\epsilon$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* Why can it only be **probably** correct?\n",
    "    * Recall that the input to the learner is *randomly generated*.\n",
    "    * There is always a (very small) chance to see the same example again and again.\n",
    "    * **Claim**: no algorithm can guarantee $L_{D,f}(h) \\leq \\epsilon$ for sure, that is, with absolute certainty ($P=1$)\n",
    "    * **Relaxation**: we would allow the algorithm to fail with probability $\\delta$ where $\\delta \\in (0,1)$ is *user-specified*."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### <img src=\"https://img.icons8.com/ultraviolet/80/000000/approximately-equal-2.png\" style=\"height:50px;display:inline\"> Probably Approximately Correct (PAC) Learning\n",
    "---\n",
    "* The learner doesn't know $\\mathcal{D}$ and $f$.\n",
    "* The learner receives 2 parameters:\n",
    "    1. $\\epsilon$ - *accuracy* parameter.\n",
    "    2. $\\delta$ - *confidence* parameter.\n",
    "* The learner can ask for training data, $S$ containing <a style=\"color:red\"> $m(\\epsilon, \\delta)$</a> examples.\n",
    "* The learner should output a hypothesis $h$ such that with probability of **at least** $1-\\delta$ it holds that $L_{D,f} \\leq \\epsilon$.\n",
    "    * That is, the learner should be **P**robably (with probability at least $1-\\delta$) **A**pproximately (up to accuracy $\\epsilon$) **C**orrect.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### <img src=\"https://img.icons8.com/dusk/64/000000/thin-test-tube.png\" style=\"height:50px;display:inline\"> Empirical Risk Minimization (ERM)\n",
    "---\n",
    "* Consider the setting of *linear classification* and let $h_{\\theta}(x) = \\mathbb{1}\\{\\theta^Tx \\geq 0\\}$.\n",
    "* Algorithm goal:\n",
    "    * Find a hypothesis $h_s$ that minimizes the error (risk) with respect to $\\mathcal{D}$ and $f$.\n",
    "    * But $\\mathcal{D}$ and $f$ are **unknown**!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* An alternative goal and a reasonable way to fit the parameters $\\theta$ would be to try and minimize the training error: $$ \\hat{L}(h) = L_s(h) = \\frac{|\\{ i \\in [m]: h(x^{(i)}) \\neq y^{(i)} \\}|}{m}, [m]=\\{1,...,m\\} $$ and pick $$ \\hat{\\theta} = \\underset{\\theta}{\\mathrm{argmin}} \\hat{\\epsilon}(h_{\\theta}) = \\underset{\\theta}{\\mathrm{argmin}} \\hat{L}(h_{\\theta}) $$\n",
    "    * This process is called **empirical risk minimization** (ERM).\n",
    "    * The resulting hypothesis output by the algorithm is $\\hat{h} = h_{\\hat{\\theta}}$.\n",
    "    * ERM can be thought of as the most basic learning algorithm.\n",
    "        * Algorithms like Logistic Regression can also be viewed as approximations to ERM."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* We will leave out the specific parameterization of the hypothesis $\\theta$ and will define the **hypothesis class** $\\mathcal{H}$ used by the learning algorithm to be the set of all classifiers considered by it.\n",
    "* ERM can now be thought of as a **minimization over the class of functions** $\\mathcal{H}$, in which the learning algorithm picks the hypothesis: $$ \\hat{h} = \\underset{h \\in \\mathcal{H}}{\\mathrm{argmin}} \\hat{\\epsilon}(h) $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* **Overfitting**:\n",
    "    * ERM may result in overfitting for the obvious reasons.\n",
    "    * Assuming the following distribution: <img src=\"./assets/tut_13_overfitting.jpg\" style=\"height:200px\">\n",
    "    * We may build a trivial estimator with 0 (empirical) error: $$ h_s(x) = \\begin{cases}y^{(i)}, \\text{if } \\exists i \\in [m] \\text{ s.t. } x^{(i)} = x \\\\ 0, \\text{ otherwise} \\end{cases} $$\n",
    "    * In order to avoid overfitting, we induce bias."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* **ERM with Inductive Bias**:\n",
    "    * A common solution to overfitting is to restrict the hypothesis search space.\n",
    "    * The learner chooses in advance a set of predictors (the hypothesis class $\\mathcal{H}$).\n",
    "    * The choice of $\\mathcal{H}$ imposes an *inductive* bias (prior knowledge).\n",
    "    * In the following we will assume **realizability**: $$ \\exists h^{*} \\in \\mathcal{H}, \\text{ s.t. } L_{D,f}(h^{*})=\\epsilon(h^{*}) = 0$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### <img src=\"https://img.icons8.com/color/96/000000/master.png\" style=\"height:50px;display:inline\"> The Fundamental Theorem of Statistical Learning\n",
    "---\n",
    "* Let $\\mathcal{H}$ denote a hypothesis class of binary classifiers.\n",
    "* Then, there are absolute **constants** $C_1, C_2$ such that the *sample complexity* (how many samples to draw, roughly) of PAC learning $\\mathcal{H}$ is: $$ C_1 \\frac{d(\\mathcal{H}) + \\log(\\frac{1}{\\delta})}{\\epsilon} \\leq m_{\\mathcal{H}}(\\epsilon, \\delta) \\leq C_2 \\frac{d(\\mathcal{H})\\log(\\frac{1}{\\epsilon}) + \\log(\\frac{1}{\\delta})}{\\epsilon} $$\n",
    "    * $d(\\mathcal{H})$ - the *VC Dimension* (which will be introduced shortly) of hypotheses class $\\mathcal{H}$.\n",
    "* Furthermore, this sample complexity is achieved by the ERM learning rule"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### <img src=\"https://img.icons8.com/dusk/64/000000/question-mark.png\" style=\"height:50px;display:inline\"> What Is Learnable and How to Learn?\n",
    "---\n",
    "* From the fundamental theorem of statistical learning:\n",
    "    * The sample complexity is characterized by the **VC Dimension**.\n",
    "    * The ERM learning rule is generic (near) optimal learner."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## <img src=\"https://img.icons8.com/dusk/64/000000/v.png\" style=\"height:40px;display:inline\"> VC Dimension\n",
    "---\n",
    "\n",
    "### <img src=\"https://img.icons8.com/cotton/64/000000/party-baloons.png\" style=\"height:50px;display:inline\"> Motivation\n",
    "---\n",
    "* **Complexity of a learner** - representational power, the ability to generalize.\n",
    "    * The usual **trade-off**:\n",
    "        * More power - represent more complex systems $\\to$ may lead to **overfitting**.\n",
    "        * Less power - won't overfit, but may not find the \"best\" learner.\n",
    "    * How to quantify the representational power? Not easily...\n",
    "        * One solution is the **VC Dimension**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* **No Free Lunch**\n",
    "    * Suppose that $|\\mathcal{X}| = \\infty$\n",
    "    * For any finite subset $\\mathcal{C} \\subset \\mathcal{X}$ take $\\mathcal{D}$ to be *uniform* distribution over $\\mathcal{C}$\n",
    "    * If the number of training examples is $m \\leq \\frac{\\mathcal{C}}{2}$, then the learner has no knowledge on at least half the elements in $\\mathcal{C}$\n",
    "    * Formally: **No Free Lunch Theorem**\n",
    "        * Fix $\\delta \\in (0,1), \\epsilon < \\frac{1}{2}$. For every learner $\\mathcal{A}$ and training set size $m$, there exists $\\mathcal{D}, f$ such that with probability of at least $\\delta$ over the generation of training data $S$ of $m$ examples, it holds that $$ L_{\\mathcal{D}, f}(A(S)) \\geq \\epsilon $$\n",
    "            * For a *random guess*, $ L_{\\mathcal{D}, f} = \\frac{1}{2}$, so the theorem states that you can't be better than a random guess."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* Suppose we got a **training** set $S=\\{(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})\\}$, and we choose classifiers or hypotheses from a hypotheses class $\\mathcal{H}$.\n",
    "    * We try to explain the labels using a hypothesis from $\\mathcal{H}$\n",
    "    * It turned out that the labels we received were *incorrect* and now we get the same instances with different labels: $S' = \\{(x^{(1)}, y'^{(1)}), ..., (x^{(m)}, y'^{(m)})\\}$\n",
    "    * We try again to explain the labels using a hypothesis from $\\mathcal{H}$\n",
    "    * If we succeed in doing so (that is, find a hypothesis that explains these labels), then something is fishy...\n",
    "    * Conclusion: if the classifier is able to explain everything, then it is useless...\n",
    "    * Formally, if $\\mathcal{H}$ allows all functions over some set $\\mathcal{C}$ of size $m$, then based on the **No Free Lunch** theorem, we can't learn from a subset of size $\\frac{m}{2}$, for example."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### <img src=\"https://img.icons8.com/cotton/64/000000/tie.png\" style=\"height:50px;display:inline\"> VC Dimension - Formal Definition\n",
    "---\n",
    "* Let $\\mathcal{C} = \\{x_1, ..., x_{|C|} \\} \\subset \\mathcal{X}$\n",
    "* Let $\\mathcal{H}_C$ be the restriction of $\\mathcal{H}$ to $\\mathcal{C}$, namely, $\\mathcal{H}_C = \\{h_C: h \\in \\mathcal{H} \\}$ where $h_C: \\mathcal{C} \\to \\{0,1\\}$ or $\\{-1,+1\\} $ is s.t. $h_C(x_i) = h(x_i)$ for every $x_i \\in C$\n",
    "* Observation: we can represent each $h_c$ as the vector: $$ \\begin{bmatrix} h(x_1) \\\\ \\vdots \\\\ h(x_{|C|}) \\end{bmatrix} \\in \\{ \\pm 1\\}^{|C|} $$\n",
    "* Therfore: $\\mathcal{H}_C \\leq 2^{|C|}$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* We say that $\\mathcal{H}$ <a style=\"color:red\">**shatters** </a> $\\mathcal{C}$ if $|\\mathcal{H}_C| = 2^{|C|}$\n",
    "    * That is, $\\mathcal{H}$ can realize any labeling on $\\mathcal{C}$, i.e., if for *any* set of labels $\\{y^{(1)}, ..., y^{(m)} \\}$ there exists some $h \\in \\mathcal{H}$ so that $h(x^{(i)}) = y^{(i)}$ for **all** $i = 1,..., m$ \n",
    "* $VCdim(\\mathcal{H})= sup\\{|C| : \\mathcal{H} \\text{ shatters } \\mathcal{C} \\}$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* The VC dimension  is the maximal size of a set $\\mathcal{C}$ such that $\\mathcal{H}$ gives no prior knowledge w.r.t. $\\mathcal{C}$, or, the size of the largest set that is shattered by $\\mathcal{H}$.\n",
    "* In other words, the VC dimension is the maximum number of points that can be arranged such that $h \\in \\mathcal{H}$ can shatter them.\n",
    "* **Dichotomy**: a possible seperation of the sample space into sub-samples.\n",
    "    * For example: $\\{(x_1, 1), (x_2, 0), (x_3, 1)\\}$ is a dichotomy, and also $\\{(x_1, 0), (x_2, 0), (x_3, 1)\\}$ (a total of $2^3$ for this example)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* **Theorem**: Let $\\mathcal{H}$ be given, and let $d = VCdim(\\mathcal{H})$. Then with probability at least $1-\\delta$, we have that for all $h \\in \\mathcal{H}$: $$ |\\epsilon(h) - \\hat{\\epsilon}(h)| \\leq O(\\sqrt{\\frac{d}{m}\\log\\frac{m}{d} + \\frac{1}{m}\\log\\frac{1}{\\delta}}) $$\n",
    "Thus, with probability at least $1-\\delta$ we also have that: $$ \\epsilon(\\hat{h}) \\leq \\epsilon(h^{*}) + O(\\sqrt{\\frac{d}{m}\\log\\frac{m}{d} + \\frac{1}{m}\\log\\frac{1}{\\delta}}) $$\n",
    "    * $\\epsilon(h)$ is the real (test) error and $\\hat{\\epsilon}(h)$ is the training error (empirical risk).\n",
    "    * In other words, if a hypothesis class has finite VC dimension, then uniform convergence occurs as $m$ becomes large.\n",
    "    * **This is a very strong result because we can make a statement on data we have not seen!**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "###  <img src=\"https://img.icons8.com/cute-clipart/50/000000/controller.png\" style=\"height:50px;display:inline\"> Finding VC Dimension\n",
    "---\n",
    "* To show that $VCdim(\\mathcal{H}) = d$ we need to show that:\n",
    "    1. There **exists** a set $\\mathcal{C}$ of size $d$ which is shattered by $\\mathcal{H}$\n",
    "        * That is, show that for some ordering of the points, **any** kind of labeling can be attained by hypothesis from $\\mathcal{H}$\n",
    "    2. **Every** set $\\mathcal{C}$ of size $d + 1$ is not shattered by  $\\mathcal{H}$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* Can be thought of as a **2-player game**:\n",
    "    * Fix the definition of $h_{\\theta} = f(x;\\theta)$ (the hypotheses class, e.g. linear classifiers)\n",
    "    * **Player 1**: choose locations $x^{(1)},...,x^{(d)}$\n",
    "    * *Player 2*: choose target labels $y^{(1)},...,y^{(d)}$\n",
    "    * **Player 1**: choose a hypothesis $h \\in \\mathcal{H}$, e.g., choose $\\theta$ in the linear classifier\n",
    "    * If $f(x;\\theta)$ can reproduce the target labeles, **Player 1** wins.\n",
    "        * $\\exists \\{ x^{(1)}, ..., x^{(d)}\\} \\text{ s.t. } \\forall  \\{ y^{(1)}, ..., y^{(d)}\\} \\exists \\theta \\text{ s.t. } \\forall i, f(x^{(i)}) = y^{(i)}$\n",
    "    * The VC dimension would be the value $d$ if *Player 2* covered all the possibles labels and **Player 1** won every game."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### <img src=\"https://img.icons8.com/doodle/96/000000/children-pyramid.png\" style=\"height:50px;display:inline\"> VC Dimension - Examples\n",
    "---\n",
    "#### <img src=\"https://img.icons8.com/dusk/64/000000/1.png\" style=\"height:30px;display:inline\"> Example 1 - Toy Example\n",
    "---\n",
    "Consider 9 samples, and 8 hypotheses as follows:\n",
    "\n",
    "|  | $x_1$ |$x_2$| $x_3$ | $x_4$ |$x_5$ |$x_6$ | $x_7$ | $x_8$ |$x_9$ |\n",
    "|------|------|------|------|------|------|------|------|------|------|\n",
    "| $h_1$ | 0 | 0 | 1 | 0|0|0|1|0|0|\n",
    "| $h_2$ | 0 | 1 | 0 | 0|0|1|0|0|0|\n",
    "| $h_3$ | 1 | 0 | 0 | 0|1|1|0|0|0|\n",
    "| $h_4$ | 0 | 0 | 0 | 1|1|0|0|0|1|\n",
    "| $h_5$ | 0 | 0 | 1 | 0|0|0|0|1|0|\n",
    "| $h_6$ | 0 | 1 | 0 | 0|0|0|1|0|0|\n",
    "| $h_7$ | 1 | 0 | 0 | 0|0|1|0|0|0|\n",
    "| $h_8$ | 0 | 0 | 0 | 0|0|0|0|0|0|\n",
    "\n",
    "* The first thing to notice is that the whole sample set (1-9) cannot be shattered as we don't have enough hypotheses. In order to shatter the whole set we would need at least $2^9$ hypotheses."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* **Excercise**: Are the following sets shattered?\n",
    "    * $\\{x_1\\}$\n",
    "    * $\\{x_5, x_6\\}$\n",
    "    * $\\{x_1, x_2\\}$\n",
    "    * $\\{x_5, x_6, x_7\\}$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* **Solution**:\n",
    "    * $\\{x_1\\}$ - **yes**, by $\\{h_2, h_3\\}$\n",
    "    * $\\{x_5, x_6\\}$ - **yes**, by $\\{h_1, h_2, h_3, h_4\\}$\n",
    "    * $\\{x_1, x_2\\}$ - **no**, can't get the classification: $x_1 = 1$ and $x_2 = 1$\n",
    "    * $\\{x_5, x_6, x_7\\}$ - **no**, can't get the classification: $x_5=x_6=x_7=1$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* **Excercise**: What is the VC dimension of $\\mathcal{X}$?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* **Solution**:\n",
    "    * The only 3 points with the dichotomy $\\{1, 1, 1\\}$ are $\\{x_1, x_5, x_6 \\}$\n",
    "    * But the dichotomy $\\{1,0,0\\}$ isn't achievable.\n",
    "        * $\\to$ No 3 points can be shattered\n",
    "    * $\\to VCdim(\\mathcal{H}) = 2 $"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### <img src=\"https://img.icons8.com/dusk/64/000000/2.png\" style=\"height:30px;display:inline\"> Example 2 -Threshold Functions\n",
    "---\n",
    "* Threshold functions - $f \\in \\mathcal{H}$ is a single-parametric threshold classifier on real numbers, i.e., for a certain threshold $\\theta$, the classifier $f_{\\theta}$ returns 1 if the input number is larger than $\\theta$ and 0 otherwise. Formally: $$ \\mathcal{X} = \\mathbb{R}, \\mathcal{H} = \\{ x \\to sign(x-\\theta): \\theta \\in \\mathbb{R} \\} $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "\n",
    "* Let's \"prove\" that $VCdim(\\mathcal{H}) = 1$:\n",
    "    1. One ($n=1$) point can be shattered because for every point $x$, a classifier $f_{\\theta}(x)$ labels it as 0 if $\\theta > x$ and 1 if $\\theta < x$. For example, for $(x=0, label=0), \\theta= 1$ and for $(x=0, label=1), \\theta= -1$.\n",
    "    2. No two ($n+1=2$) points can be shattered - because for every set of 2 points, if the smaller is labeled 1, then the larger must also be labeled 1, so not all labelings are possible.\n",
    "    \n",
    "<img src=\"./assets/tut_13_threshold.jpg\" style=\"height:100px\">\n",
    " \n",
    " <a href=\"http://work.caltech.edu/lectures.html\">Image Source</a> (CalTech's free machine Learning online course by Yaser Abu-Mostafa)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### <img src=\"https://img.icons8.com/plasticine/100/000000/3.png\" style=\"height:30px;display:inline\"> Example 3 - Intervals Functions\n",
    "---\n",
    "* Intervals functions - $f \\in \\mathcal{H}$ is a single-parametric interval classifier on real numbers, i.e, for a certain parameter $\\theta$, the classifier $f_{\\theta}$ returns 1 if the input number is in the interval $[\\theta, \\theta+4]$ and 0 otherwise."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* Let's \"prove\" that $VCdim(\\mathcal{H}) = 2$:\n",
    "    1. Two ($n=2$) points can be shattered because for every set $\\{x, x+2\\}$, a classifier $f_{\\theta}(x)$ labels it as:\n",
    "        * $(0,0)$ -  if $\\theta < x - 4$ or if $\\theta > x + 2$.\n",
    "        * $(1,0)$ -  if $\\theta \\in [x-4, x-2)$.\n",
    "        * $(1,1)$ - if $\\theta \\in [x-2, x]$.\n",
    "        * $(0,1)$ - if $\\theta \\in (x, x+2]$.\n",
    "    2. No three ($n+1=3$) points can be shattered - because for every set of three numbers, if the smallest and the largest are labeled 1, then the middle one must also be labeled 1, so not all labelings are possible."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* This result can be generalized for a two-parametric interval classifier $h_{a,b}$: $$ \\mathcal{X} = \\mathbb{R}, \\mathcal{H} = \\{ h_{a,b}: a < b \\in \\mathbb{R} \\} $$ where $$ h_{a,b}(x) = 1 \\iff x \\in [a,b] $$\n",
    "\n",
    "<img src=\"./assets/tut_13_intervals.jpg\" style=\"height:100px\">\n",
    "\n",
    "<a href=\"http://work.caltech.edu/lectures.html\">Image Source</a> (CalTech's free machine Learning online course by Yaser Abu-Mostafa)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### <img src=\"https://img.icons8.com/plasticine/100/000000/4.png\" style=\"height:30px;display:inline\"> Example 4 - Axis Aligned Rectangles\n",
    "---\n",
    "* Axis aligned rectangles: $$ \\mathcal{X} = \\mathbb{R}^2, \\mathcal{H} = \\{ h_{a_1,a_2,b_1, b_2}: a_1 < a_2 \\text{ and } b_1 < b_2 \\} $$, where $$ h_{a_1,a_2,b_1, b_2}(x_1, x_2) = 1 \\iff x_1 \\in [a_1, a_2] \\text{ and } x_2 \\in [b_1, b_2] $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* Let's \"prove\" that $VCdim(\\mathcal{H}) = 4$:\n",
    "    \n",
    "1.Four ($n=4$) points can be shattered as seen in the following arrangement: <img src=\"./assets/tut_13_rectangles.jpg\" style=\"height:200px\">\n",
    "\n",
    "Image from Princeton's COS 511: Theoretical Machine Learning, Lecture on <a href=\"https://www.cs.princeton.edu/courses/archive/spring14/cos511/scribe_notes/0220.pdf\"> VC-Dimension</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "2.No five ($n+1=5$) can be shattered - for any 5-point set, we can construct a data assignment in this way: pick the topmost, bottommost, leftmost and rightmost points and give them the label “+”. Because there are 5 points, there must be at least one point left to which we assign “−”. Any rectangle that contains all the “+” points must contain the “−” point, which is a case where shattering is not possible.\n",
    "\n",
    "<img src=\"./assets/tut_13_rectangles_2.jpg\" style=\"height:200px\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### <img src=\"https://img.icons8.com/plasticine/100/000000/5.png\" style=\"height:30px;display:inline\"> Example 5 - Halfspaces\n",
    "---\n",
    "* Halfspaces (linear classifiers): $$ \\mathcal{X} = \\mathbb{R}^2, \\mathcal{H} = \\{ x \\to sign(\\langle w, x \\rangle) \\}: w \\in \\mathbb{R}^2 $$\n",
    "    * For example: $h(x) = \\mathbb{1}\\{ \\theta_1 x_1 + \\theta_2 x_2 \\geq 0\\}$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* Let's \"prove\" that $VCdim(\\mathcal{H}) = 3$:\n",
    "    \n",
    "1.Three ($n=3$) points can be shattered as seen in the following arrangement: <img src=\"./assets/tut_13_linear.jpg\" style=\"height:200px\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "2.No four ($n+1=4$) can be shattered - We consider two cases:\n",
    "   1. The four points form a convex region, i.e., lie on the convex hull defined by the 4 points. \n",
    "   2. Three of the 4 points define the convex hull and the 4th point is internal. \n",
    "    \n",
    "* In the first case, the labeling which is positive for one diagonal pair and negative to the other pair cannot be realized by a separating line. \n",
    "* In the second case, a labeling which is positive for the three hull points and negative for the interior point cannot be realized.\n",
    "    \n",
    "<img src=\"./assets/tut_13_halfspaces.jpg\" style=\"height:100px\">\n",
    "\n",
    "* The results is generalized for hyperplanes: VC dimension of hyperplanes in $\\mathbb{R}^d$ is $d+1$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### <img src=\"https://img.icons8.com/bubbles/50/000000/star.png\" style=\"height:50px;display:inline\"> VC Dimension - Special Cases\n",
    "---\n",
    "* $VCdim(\\mathcal{H}) = 0$ - When is the VC dimension equals to zero? Assume $\\mathcal{X} = \\mathbb{R}^2$. Let $\\mathcal{H}$ contain a **single** hypothesis $h_1$. Thus, the VC dimension of $\\mathcal{H}$ is **always** 0! A single hypothesis can impose only one classification, can only assign one labeling to a set of points.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* $VCdim(\\mathcal{H}) = \\infty$ - When does the VC dimension go to infinity? Assume $\\mathcal{X} = \\mathbb{R}^2$. Let $\\mathcal{A}$ be the **set of all convex polygons** in $\\mathcal{X}$. Define $\\mathcal{H}$ as the class of all hypotheses $h_p(x), p \\in \\mathcal{A}$: $$ h_p(x) = \\begin{cases} 1, \\text{ if } x \\text{ is contained within polygon } p \\\\ 0, \\text{ otherwise} \\end{cases} $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Let's see why $VCdim(\\mathcal{H}) = \\infty$: for any positive integer $n$, take $n$ points from $\\mathcal{X}$. Place the $n$ points **uniformly spaced** on the **unit circle**. For each $2^n$ subset of this data, there is a convex polygon with vertices at these $n$ points. For each subset, the convex polygon contains the set and excludes its complement.\n",
    "<img src=\"./assets/tut_13_convex.jpg\" style=\"height:200px\">\n",
    "Image from <a href=\"http://www.mathematik.uni-muenchen.de/~deckert/teaching/SS17/ATML/media/VC_dimension.pdf\"> Learnability and VC Dimension</a> at LMU Munchen"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### <img src=\"https://img.icons8.com/bubbles/50/000000/video-playlist.png\" style=\"height:50px;display:inline\"> Recommended Videos\n",
    "---\n",
    "#### <img src=\"https://img.icons8.com/cute-clipart/64/000000/warning-shield.png\" style=\"height:30px;display:inline\"> Warning!\n",
    "* These videos do not replace the lectures and tutorials.\n",
    "* Please use these to get a better understanding of the material, and not as an alternative to the written material.\n",
    "\n",
    "#### Video By Subject\n",
    "\n",
    "* VC Dimension - <a href=\"https://www.youtube.com/watch?v=puDzy2XmR5c\">VC Dimension - Alexander Ihler</a>\n",
    "* Learning Theory by Andrew Ng (Stanford)\n",
    "    * <a href=\"https://www.youtube.com/watch?v=tojaGtMPo5U\">Lecture 9 | Machine Learning (Stanford)</a>\n",
    "    * <a href=\"https://www.youtube.com/watch?v=0kWZoyNRxTY\">Lecture 10 | Machine Learning (Stanford)</a>\n",
    "* Learning Theory Lectures By Shai Ben-David\n",
    "    * <a href=\"https://www.youtube.com/watch?v=aILazXK059Y\"> Lecture 2</a>\n",
    "    * <a href=\"https://www.youtube.com/watch?v=PflkE9JmNLc\"> Lecture 3</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## <img src=\"https://img.icons8.com/dusk/64/000000/prize.png\" style=\"height:50px;display:inline\"> Credits\n",
    "---\n",
    "* Based on <a href=\"https://www.cs.huji.ac.il/~shais/Lectures2014/lecture2.pdf\">slides by Shai Shalev-Schwarz</a>\n",
    "* Great (!) Reading Resource - <a href=\"http://cs229.stanford.edu/notes-spring2019/cs229-notes4.pdf\">CS229 - Stanford - Machine Learning - Learning Theory</a>\n",
    "    * It covers everything and goes into much more details\n",
    "* Icons from <a href=\"https://icons8.com/\">Icon8.com</a> - https://icons8.com"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}