{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Week 2+3. Regularization, Linear Classification, towards Logistic regression\n", "\n", "Augustin Cosse\n", "\n", "__Material covered:__ Ridge and LASSO regression, linear classification through Multiple Discriminant, OLS and Normal Equations, one-vs-rest and one-vs-one classifiers. Use of the meshgrid function to display the classification boundaries. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example 1. Binary discriminant through OLS and Normal Equations\n", "\n", "In this first demo, we will solve the classification problem as follows.\n", "\n", "- We first assign 0/1 labels to each of the points (recall that we are in the classification framework, so the targets now take a finite set of values). That is, we define our targets as 1 and 0 depending on whether our points are from the yellow or purple class. This gives a vector \n", "$$\\mathbf{t} = [t^{(1)}, t^{(2)}, \\ldots, t^{(N)}]$$\n", "\n", "if we have $N$ points\n", "\n", "- We then store the sample points in a matrix $\\mathbf{X}$ as we did for regression. In this case the points are 2D, so we have \n", "\n", "$$\\mathbf{X} = \\left[\\begin{array}{c}\n", "(\\mathbf{x}^{(1)})^T\\\\\n", "(\\mathbf{x}^{(2)})^T\\\\\n", "\\vdots\\\\\n", "(\\mathbf{x}^{(N)})^T\n", "\\end{array}\\right]$$\n", "\n", "where each $\\mathbf{x}^{(i)} = \\left[\\begin{array}{c}\n", "x^{(i)}_1\\\\\n", "x^{(i)}_2\n", "\\end{array}\\right]$ now encodes the two coordinates of the corresponding point in the dataset below. We want to learn a model of the form \n", "\n", "$y(\\mathbf{x}) = \\beta_0 + \\beta_1x_1 + \\beta_2x_2$ \n", "\n", "that outputs a prediction $y(\\mathbf{x}^{(i)})$ that is as close as possible to the target $t^{(i)}$ of that point. 
We will encode this model by adding an additional column of $1$'s to the matrix $\\mathbf{X}$ above to get\n", "\n", "$$\\mathbf{X} = \\left[\\begin{array}{cc}\n", "1 & (\\mathbf{x}^{(1)})^T\\\\\n", "1& (\\mathbf{x}^{(2)})^T\\\\\n", "\\vdots & \\vdots\\\\\n", "1& (\\mathbf{x}^{(N)})^T\n", "\\end{array}\\right]$$\n", "We can then write the model as $\\mathbf{y} = \\mathbf{X}\\mathbf{\\beta}$, and we want $\\mathbf{y}$ to be as close as possible to $\\mathbf{t}$ (given that what we can achieve is limited by the linearity of the model).\n", "\n", "A natural approach, given what we learned so far, is thus to minimize the OLS criterion,\n", "\n", "$$\\min_{\\beta_0, \\beta_1, \\beta_2} \\sum_{i=1}^N \\frac{1}{N}|t^{(i)} - (\\beta_0 + \\beta_1x^{(i)}_1 + \\beta_2x^{(i)}_2)|^2$$\n", "\n", "As we saw in regression, this objective can be written in matrix form by defining the residual vector \n", "\n", "$$\\mathbf{v} = \\mathbf{X}\\mathbf{\\beta} - \\mathbf{t}$$\n", "\n", "so that the problem becomes \n", "\n", "$$\\min_{\\mathbf{\\beta}}\\frac{1}{N}\\mathbf{v}^T\\mathbf{v} = \\min_{\\mathbf{\\beta}} \\frac{1}{N}\\left(\\mathbf{X}\\mathbf{\\beta} - \\mathbf{t}\\right)^T(\\mathbf{X}\\mathbf{\\beta} - \\mathbf{t})$$\n", "\n", "Instead of using gradient descent, we can set the gradient of the loss with respect to the weights $\\mathbf{\\beta}$ to zero and solve the resulting equations. In this case (you can verify it by computing the derivatives with respect to each of the $\\beta_j$ and setting them to zero), this gives the normal equations\n", "\n", "$$\\mathbf{X}^T\\left(\\mathbf{X}\\mathbf{\\beta} - \\mathbf{t}\\right) = 0 \\Leftrightarrow \\mathbf{X}^T\\mathbf{X}\\mathbf{\\beta} = \\mathbf{X}^T\\mathbf{t}$$\n", "\n", "which we can solve, when $\\mathbf{X}^T\\mathbf{X}$ is invertible, as $\\mathbf{\\beta} = \\left(\\mathbf{X}^T\\mathbf{X}\\right)^{-1}\\mathbf{X}^T\\mathbf{t}$\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "image/png": text/plain