{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "meta",
     "toc_en"
    ]
   },
   "source": [
    "# Introduction to Machine Learning\n",
    "\n",
    "<img src=\"logo.jpg\" style=\"float: left; width: 15%\" />\n",
    "\n",
    "[CSE204-2018](https://moodle.polytechnique.fr/course/view.php?id=6784) Lab session #02\n",
    "\n",
    "Jérémie DECOCK"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/jeremiedecock/polytechnique-cse204-2018/blob/master/lab_session_02_correction.ipynb\"><img align=\"left\" src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\" title=\"Open and Execute in Google Colaboratory\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://mybinder.org/v2/gh/jeremiedecock/polytechnique-cse204-2018/master?filepath=lab_session_02_correction.ipynb\"><img align=\"left\" src=\"https://mybinder.org/badge.svg\" alt=\"Open in Binder\" title=\"Open and Execute in Binder\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Objectives\n",
    "\n",
    "- Implement the *k-Nearest Neighbors* algorithm\n",
    "- Use it to solve classification and regression problems\n",
    "- Define the decision boundaries\n",
    "- Explain the weaknesses of this algorithm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports and tool functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "from matplotlib.colors import ListedColormap\n",
    "\n",
    "import pandas as pd\n",
    "import sklearn.neighbors\n",
    "from sklearn.utils import shuffle\n",
    "\n",
    "from scipy.spatial import Voronoi, voronoi_plot_2d"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def gen_2d_classification_samples(n_samples = 20):\n",
    "\n",
    "    cov = np.diag([2., 2.])\n",
    "\n",
    "    x1 = np.random.multivariate_normal(mean=[0., 0.], cov=cov, size=n_samples)\n",
    "    y1 = np.full(n_samples, 1, dtype=np.int)\n",
    "\n",
    "    x2 = np.random.multivariate_normal(mean=[4., 0.], cov=cov, size=n_samples)\n",
    "    y2 = np.full(n_samples, 2, dtype=np.int)\n",
    "\n",
    "    x3 = np.random.multivariate_normal(mean=[2., 4.], cov=cov, size=n_samples)\n",
    "    y3 = np.full(n_samples, 3, dtype=np.int)\n",
    "\n",
    "    X = np.concatenate([x1, x2, x3])\n",
    "    y = np.concatenate([y1, y2, y3])\n",
    "\n",
    "    df = pd.DataFrame(X, columns=['x1', 'x2'])\n",
    "    df['y'] = y\n",
    "\n",
    "    df = shuffle(df).reset_index(drop=True)\n",
    "    \n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])\n",
    "cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])\n",
    "\n",
    "def plot_2d_classification_samples(dataframe, model=None, voronoi=False):\n",
    "    plt.figure(figsize=(8, 8))\n",
    "    \n",
    "    df = dataframe  # make an alias\n",
    "    \n",
    "    ERROR_MSG1 = \"The `dataframe` parameter should be a Pandas DataFrame having the following columns: ['x1', 'x2', 'y']\"\n",
    "    assert df.columns.values.tolist() == ['x1', 'x2', 'y'], ERROR_MSG1\n",
    "    \n",
    "    ERROR_MSG2 = \"The `dataframe` parameter should be a Pandas DataFrame having the following labels (in column 'y'): [1, 2, 3]\"\n",
    "    labels = pd.unique(df.y).tolist()\n",
    "    labels.sort()\n",
    "    assert labels == [1, 2, 3] or labels == [1, 3], ERROR_MSG2\n",
    "    \n",
    "    if model is not None:\n",
    "        if voronoi:\n",
    "            # Compute the Voronoi cells\n",
    "            \n",
    "            vor = Voronoi(df[['x1', 'x2']])\n",
    "\n",
    "            # Plot the Voronoi diagram\n",
    "            \n",
    "            fig = voronoi_plot_2d(vor, show_vertices=False, show_points=False)\n",
    "            fig.set_size_inches(8, 8)\n",
    "        \n",
    "        # Compute the model's decision boundaries\n",
    "        \n",
    "        h = .02  # step size in the mesh\n",
    "\n",
    "        x_min, x_max = df.x1.min() - 1, df.x1.max() + 1\n",
    "        y_min, y_max = df.x2.min() - 1, df.x2.max() + 1\n",
    "\n",
    "        xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
    "                             np.arange(y_min, y_max, h))\n",
    "\n",
    "        Z = model.predict(np.c_[xx.ravel(), yy.ravel()])\n",
    "        Z = Z.reshape(xx.shape)\n",
    "        \n",
    "        # Plot the model's decision boundaries\n",
    "\n",
    "        plt.pcolormesh(xx, yy, Z, cmap=cmap_light, alpha=0.5)\n",
    "\n",
    "    # Plot also the training points\n",
    "    plt.scatter(df.x1, df.x2, c=df.y, cmap=cmap_bold, edgecolor='k', s=30)\n",
    "    plt.xlabel(r\"$x_1$\", fontsize=16)\n",
    "    plt.ylabel(r\"$x_2$\", fontsize=16)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Nearest Neighbor algorithm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Today you will implement one of the simplest (but quite powerful) machine learning algorithm: the *Nearest Neighbor* algorithm and its extension the *k-Nearest Neighbors* algorithm (or *kNN*).\n",
    "Both can be used for classification and regression tasks.\n",
    "\n",
    "Considering a dataset $\\mathcal{D}=\\{(\\mathcal{x}_i, y_i)_{i=1,\\dots,n}\\}$ of $n$ labeled examples, the *Nearest Neighbor* model assigns an input vector $\\boldsymbol{x}$ to the label $y_{\\arg\\min_{i=1,\\dots, n}d(x, x_i)}$ of its closest neighbor in $\\mathcal{D}$.\n",
    "\n",
    "The distance function $d$ is used to define what is the closest neighbor. This can be any metric measure but the *Minkowski distance* (especially the classical Euclidian distance $d_2$) is the most common choice. It is defined as follow:\n",
    "\n",
    "$$d_q: \\mathbb{R}^p \\times \\mathbb{R}^p \\to \\mathbb{R}$$\n",
    "\n",
    "$$\\boldsymbol{u}, \\boldsymbol{v} \\mapsto ||\\boldsymbol{u} - \\boldsymbol{v}||_q = \\left( \\sum_{j=1}^p |u_j - v_j|^q \\right)^{1/q}$$\n",
    "\n",
    "When $q=2$, $d_q$ is the *Euclidian distance*\n",
    "\n",
    "$$d_2(\\boldsymbol{u}, \\boldsymbol{v}) = \\sqrt{\\sum_{j=1}^{p} (u_j - v_j)^2}$$\n",
    "\n",
    "When $q=1$, $d_q$ is the *Manhattan distance*\n",
    "\n",
    "$$d_1(\\boldsymbol{u}, \\boldsymbol{v}) = \\sum_{j=1}^{p} |u_j - v_j|$$\n",
    "\n",
    "When $q=\\infty$, $d_q$ is the  *Tchebychev distance*\n",
    "\n",
    "$$d_{\\infty}(\\boldsymbol{u}, \\boldsymbol{v}) = \\max_{j=1,\\dots,p} |u_j - v_j|$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We consider the following dataset (where 'x1' and 'x2' are examples features and where 'y' is the examples label):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = [[0, 0, 1],\n",
    "        [0, 1, 1],\n",
    "        [1, 1, 2],\n",
    "        [1, 0, 3]]\n",
    "\n",
    "df = pd.DataFrame(data, columns=['x1', 'x2', 'y'])\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_2d_classification_samples(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What label will be predicted by the Nearest Neighbor algorithm for the point $x = \\pmatrix{0 \\\\ 0.5}$ ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you have $n$ examples in $p$ dimensions in the dataset, what it the training error of the Nearest Neighbor algorithm (in classification) ? Why ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Correction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "'1' (the figure for the \"red\" class)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "0 because the nearest neighbor of each example $x_i$ in the training dataset is the example $x_i$ itself. Thus there is no classification error for these examples."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Consider this new dataset (where 'volume (mL)' and 'caffeine (g)' are the examples features and where 'drink' is the label):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = [[250, 0.025, 'tea'],\n",
    "        [100, 0.01,  'tea'],\n",
    "        [125, 0.05,  'coffe'],\n",
    "        [250, 0.1,   'coffe']]\n",
    "\n",
    "df = pd.DataFrame(data, columns=['volume (mL)', 'caffeine (g)', 'drink'])\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the Nearest Neighbor method to predict the label of a 125mL drink having 0.015g of caffeine."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What is wrong with this prediction ? How to solve this problem ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Correction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The nearest example is $(125 ; 0.050)$.\n",
    "\n",
    "Thus the label of a 125mL drink having 0.015g of caffeine is \"coffe\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Actually, the predicted drink is more probably a \"tea\" (considering the caffeine feature).\n",
    "\n",
    "The prediction is made on the volume and the caffeine is not considered because these two variables are not normalized!\n",
    "\n",
    "Thus example's value of these two variables have to be normalized to be comparable (i.e. transformed to be contained in the same range)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Nearest Neighbor method with Scikit Learn\n",
    "\n",
    "Let's play with the Scikit Learn implementation of the Nearest Neighbor algorithm.\n",
    "The official documentation is there: https://scikit-learn.org/stable/modules/neighbors.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We begin with a \"toy\" **classification problem**.\n",
    "\n",
    "Use the `gen_2d_classification_samples()` function (defined above) to generate a dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = gen_2d_classification_samples(n_samples=20)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, examples are defined in $\\mathbb{R}^2$ (features are stored in columns 'x1' and 'x2').\n",
    "Example's labels are defined in the 'y' column.\n",
    "\n",
    "The 'y' column contains three possible labels: \"1\", \"2\" and \"3\" respectively represented by the red, green and blue colors in the following figure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_2d_classification_samples(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Thus this toy problem is a multiclass classification problem."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the dataset is ready, let's make the classifier and train it with the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.fit(X=df[['x1', 'x2']], y=df['y'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the `model.predict()` function to guess the class of the following points:\n",
    "\n",
    "$$x_{p1} = \\pmatrix{-2 \\\\ 2}, x_{p2} = \\pmatrix{2 \\\\ 6}, x_{p3} = \\pmatrix{6 \\\\ 0}$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Is the training step (`model.fit()` function) longer to execute than the prediction step (`model.predict()` function) ? Why ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next cell shows the decision boundary of the model. Explain what is a decision boundary in classification."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_2d_classification_samples(df, model=model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next cell generates the *Voronoï diagram* of the dataset.\n",
    "What does this figure illustrates about the Nearest Neighbor method?\n",
    "\n",
    "The Voronoï diagram makes a partition of the feature space $\\mathcal{X}$.\n",
    "Each partition is a *cell*. What do cells represent?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_2d_classification_samples(df, model=model, voronoi=True);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Correction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.predict([[-2., 2.],\n",
    "               [2., 6.],\n",
    "               [6., 0.]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "i.e. $y_{p1} = 1$ (red), $y_{p2} = 3$ (blue), $y_{p3} = 2$ (green)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "No. `model.fit()` does nothing in Nearest Neighbor models as they have nothing to \"learn\" (i.e. these models have no parameter to optimize).\n",
    "\n",
    "The `model.predict()` function does all the job: compute distances with examples in the dataset and find the closest example to the predicted one. This function may be long to execute if there are many examples in the dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a subspace of the feature space $\\mathcal{X}$ that partitions it into sets that defines the affiliations to each class:\n",
    "- all $\\boldsymbol{x}$ point in red areas will be labeled \"1\" (red) by the model\n",
    "- all $\\boldsymbol{x}$ point in green areas will be labeled \"2\" (green) by the model\n",
    "- all $\\boldsymbol{x}$ point in blue areas will be labeled \"3\" (blue) by the model\n",
    "\n",
    "The decision boundaries are the boundaries of these areas."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is one \"cell\" per example of $\\mathcal{D}$. Each cell defines the area of influence of these examples (called *seeds*): any predicted point $\\boldsymbol{x}$ in a cell will receive the same label than its seed example."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the \"toy\" classification problem, let's work on a toy **regression problem**.\n",
    "\n",
    "The next cell generates a dataset (where 'x' is the feature and 'y' the label to predict)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "N_SAMPLES = 40\n",
    "x = np.random.uniform(low=-10., high=10., size=N_SAMPLES)\n",
    "y = 2. * x + 3. + np.random.normal(scale=3., size=x.shape)\n",
    "\n",
    "df = pd.DataFrame(np.array([x, y]).T, columns=['x', 'y'])\n",
    "#df.plot(x='x', y='y', style='o-')\n",
    "df.plot.scatter(x='x', y='y');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the dataset is ready, let's make the regressor and train it with the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.fit(df[['x']], df['y'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the `model.predict()` function to guess the class of the following points:\n",
    "\n",
    "$$x_{p1} = \\pmatrix{-2}, x_{p2} = \\pmatrix{2}, x_{p3} = \\pmatrix{6}$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Correction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.predict([[-2.],\n",
    "               [2.],\n",
    "               [6.]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A random noise is added to examples label thus you will have different (but close) results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plot the model's decision function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_pred = np.arange(-10, 10, 0.1).reshape(-1, 1)\n",
    "y_pred = model.predict(x_pred)\n",
    "\n",
    "df_pred = pd.DataFrame(np.array([x_pred.flatten(), y_pred.flatten()]).T, columns=['x', 'y'])\n",
    "\n",
    "ax = df.plot.scatter(x='x', y='y')\n",
    "df_pred.plot(x='x', y='y', style='r--', ax=ax);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 5\n",
    "\n",
    "Do you think this model has good performances in *generalization* ? Why ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Correction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This model learn the trend of the data (this is desired) but it also learn the noise contained in the dataset (this is **not** desired).\n",
    "\n",
    "Thus it will have poor results with new observations. In other words, this model does not generalize well on new data.\n",
    "\n",
    "It won't make good prediction on unknown data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## k-Nearest Neighbors algorithm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The *Nearest Neighbor* method is very sensitive to noise: if an example in $\\mathcal{D}$ is wrongly labeled or positioned, all points in its Voronoï cell will be wrong too. The *k Neareast Neighbor* fix this weakness by considering for each prediction the label of several neighbors instead of just one.\n",
    "\n",
    "Considering a dataset $\\mathcal{D}=\\{(\\mathcal{x}_i, y_i)_{i=1,\\dots,n}\\}$ of $n$ labeled examples and a meta parameter $k \\in \\mathbb{N}*$, the *$k$ Nearest Neighbors* model assigns an input vector $\\boldsymbol{x}$ to the label $y_{\\arg\\min_{i=1,\\dots, n}d(x, x_i)}$ of its $k$ closest neighbor in $\\mathcal{D}$.\n",
    "Let's write $\\mathcal{N}_k(\\boldsymbol{x})$ the set of the $k$ nearest neighbors of $\\boldsymbol{x}$ in $\\mathcal{D}$.\n",
    "\n",
    "- For classification problems, the label assigned to $\\boldsymbol{x}$ is the **most represented label** among the nearest neighbors (vote)\n",
    "$$f(\\boldsymbol{x}) = \\arg\\max_c \\sum_{i: x_i \\in \\mathcal{N}_k(\\boldsymbol{x})} \\delta(y_i, c)$$\n",
    "\n",
    "- For regression problems, the label assigned to $\\boldsymbol{x}$ is computed based on the **mean** of the labels of its nearest neighbors $\\mathcal{N}_k(\\boldsymbol{x})$\n",
    "$$f(\\boldsymbol{x}) = \\frac{1}{k} \\sum_{i: x_i \\in \\mathcal{N}_k(\\boldsymbol{x})} y_i$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 6"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We consider the following dataset (where `x1` and `x2` are the example features and where `y` is the example label):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = [[1, 2, '+'],\n",
    "        [2, 1, '+'],\n",
    "        [2, 2, '-'],\n",
    "        [2, 3, '+'],\n",
    "        [3, 1, '-'],\n",
    "        [3, 2, '+']]\n",
    "\n",
    "df = pd.DataFrame(data, columns=['x1', 'x2', 'y'])\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Draw this dataset on a sheet of paper."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Draw the decision boundary of a Nearest Neighbor model (i.e. 1NN)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Draw the decision boundary of a 3 Nearest Neighbor model (i.e. 3NN)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many errors these two classifiers make on the training dataset ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What label these two classifiers predict for the point $x = \\pmatrix{4 \\\\ 0.5}$ ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Correction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = df.loc[df.y == '+'].plot.scatter(x='x1', y='x2', color='red', label=\"+\", figsize=(8,8))\n",
    "df.loc[df.y == '-'].plot.scatter(x='x1', y='x2', color='blue', label=\"-\", ax=ax);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc[df.y == '-', \"y\"] = 3\n",
    "df.loc[df.y == '+', \"y\"] = 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1)\n",
    "\n",
    "model.fit(df[['x1', 'x2']], df['y'])\n",
    "\n",
    "plot_2d_classification_samples(df, model=model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=3)\n",
    "\n",
    "model.fit(df[['x1', 'x2']], df['y'])\n",
    "\n",
    "plot_2d_classification_samples(df, model=model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- with $k = 1$: no error\n",
    "- with $k = 3$: 4 errors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The nearest point is $\\pmatrix{4 \\\\ 0.5}$. Thus for $k = 1$, the predicted label is \"-\" (blue).\n",
    "\n",
    "The next two nearest points are $\\pmatrix{2 \\\\ 1}$ and $\\pmatrix{3 \\\\ 2}$. Thus for $k = 3$, the predicted label is \"+\" (red)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## k-Nearest Neighbor (kNN) with Scikit Learn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we make the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = gen_2d_classification_samples()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_2d_classification_samples(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we make the classifier, train it and plot the decision boundaries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=5)\n",
    "\n",
    "model.fit(df[['x1', 'x2']], df['y'])\n",
    "\n",
    "plot_2d_classification_samples(df, model=model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 7"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `n_neighbors` parameter provided to the model's constructor `KNeighborsClassifier` sets the number of neighbors to consider for each prediction (i.e. `n_neighbors` this is the '$k$' of kNN)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Change the value of this parameter and observe what happen.\n",
    "What is the influence of the number of neighbors on the boundaries ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When you face a very noised dataset (wrong labels, misplaced points, ...), should you increase or decrease $k$ ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Is the Voronoi diagram useful for the kNN case (i.e. when $k>1$) ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot the decision boundary with $k=2$ and describe what append in case of equal vote?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Add the `weights = \"distance\"` parameter in `KNeighborsClassifier`'s constructor. What changes can you observe on the decision boundary? Explain how labels are computed with this new parameter."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Correction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Large $k$ produce \"smoother\" boundaries. In general, a model with e.g. $k=5$ is less influenced by the noise contained in the dataset than a model with $k < 5$ i.e. it generalize better.\n",
    "\n",
    "If $k \\geq n$ (with $n$ the number of elements in the dataset $\\mathcal{D}$) then all predicted points have the label of the most represented class (see question 4 in case of equals representations)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's generally a good idea yes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "No."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An arbitrary choice is made.\n",
    "\n",
    "In Scikit Learn:\n",
    "- the first class (red) is always selected in case of equal vote with the other two\n",
    "- the second class (green) is always selected in case of equal vote with the third one (blue)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With this parameter enabled, the influence of the k nearest neighbors is weighted by their proximity to the point $\\boldsymbol{x}$ to predict.\n",
    "\n",
    "It have the disadvantage to make models more sensitive to the noise contained in $\\mathcal{D}$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we make the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "N_SAMPLES = 40\n",
    "x = np.random.uniform(low=-10., high=10., size=N_SAMPLES)\n",
    "y = 2. * x + 3. + np.random.normal(scale=3., size=x.shape)\n",
    "\n",
    "df = pd.DataFrame(np.array([x, y]).T, columns=['x', 'y'])\n",
    "#df.plot(x='x', y='y', style='o-')\n",
    "df.plot.scatter(x='x', y='y');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we make the classifier, train it and plot the decision boundaries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_neighbors = 10\n",
    "\n",
    "model = sklearn.neighbors.KNeighborsRegressor(n_neighbors)\n",
    "\n",
    "model.fit(df[['x']], df['y'])\n",
    "\n",
    "x_pred = np.arange(-10, 10, 1).reshape(-1, 1)\n",
    "y_pred = model.predict(x_pred)\n",
    "\n",
    "df_pred = pd.DataFrame(np.array([x_pred.flatten(), y_pred.flatten()]).T, columns=['x', 'y'])\n",
    "\n",
    "ax = df.plot.scatter(x='x', y='y')\n",
    "df_pred.plot(x='x', y='y', style='r--', ax=ax);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 8"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `n_neighbors` parameter provided to the model's constructor `KNeighborsClassifier` sets the number of neighbors to consider for each prediction (i.e. `n_neighbors` this is the '$k$' of kNN)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Change the value of this parameter and observe what happen.\n",
    "What is the influence of the number of neighbors on the decision function ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When you face a very noised dataset (wrong labels, misplaced points, ...), should you increase or decrease $k$ ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Correction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In one hand, a model with a small $k$ (e.g. $k=2$) is more influenced by the noise contained in the dataset i.e. it has poorer generalization performance (i.e. poor results on unknown data): this is an **overfitted** model.\n",
    "\n",
    "In the other hand, if $k$ is too large compared to the number of available examples in $\\mathcal{D}$ (e.g. if $k=20$ here) it won't be capable to fit local trends and will have poor results too: this is an **underfitted** model.\n",
    "\n",
    "Here, $k=10$ seems to be a good compromise."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is a tradeoff to find.\n",
    "But a model with a small $k$ is more influenced by the noise contained in the dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bonus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 9\n",
    "\n",
    "Solve the Titanic problem with the k Nearest Neighbors method. Reuse the code of the first lab session."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 10\n",
    "\n",
    "Write your own implementation for the k Nearest Neighbor algorithm.\n",
    "Write a `knn()` function that takes two arguments:\n",
    "- *data*: the observed dataset (examples with their labels)\n",
    "- *xpred*: a list of examples to predict\n",
    "\n",
    "This function should return the sequence of predicted labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  },
  "toc-autonumbering": true
 },
 "nbformat": 4,
 "nbformat_minor": 2
}