{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "$$ \\LaTeX \\text{ command declarations here.}\n", "\\newcommand{\\R}{\\mathbb{R}}\n", "\\renewcommand{\\vec}[1]{\\mathbf{#1}}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# EECS 545: Machine Learning\n", "## Lecture 04: Linear Regression I\n", "* Instructor: **Jacob Abernethy, Benjamin Bray, Jia Deng and Chansoo Lee**\n", "* Date: 9/21/2016" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Notation\n", "\n", "- In this lecture, we will use\n", " - Let vector $\\vec{x}_n \\in \\R^D$ denote the $n\\text{th}$ data. $D$ denotes number of attributes in dataset.\n", " - Let vector $\\phi(\\vec{x}_n) \\in \\R^M$ denote features for data $\\vec{x}_n$. $\\phi_j(\\vec{x}_n)$ denotes the $j\\text{th}$ feature for data $x_n$.\n", " - Feature $\\phi(\\vec{x}_n)$ is the *artificial* features which represents the preprocessing step. $\\phi(\\vec{x}_n)$ is usually some combination of transformations of $\\vec{x}_n$. For example, $\\phi(\\vec{x})$ could be vector constructed by $[\\vec{x}_n^\\top, \\cos(\\vec{x}_n)^\\top, \\exp(\\vec{x}_n)^\\top]^\\top$. If we do nothing to $\\vec{x}_n$, then $\\phi(\\vec{x}_n)=\\vec{x}_n$.\n", " - Continuous-valued label vector $t \\in \\R^D$ (target values). $t_n \\in \\R$ denotes the target value for $i\\text{th}$ data." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Linear Regression " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Linear Regression (General Case)\n", "- The function $y(\\vec{x}_n, \\vec{w})$ is linear in parameters $\\vec{w}$.\n", " - **Goal:** Find the best value for the weights $\\vec{w}$.\n", " - For simplicity, add a **bias term** $\\phi_0(\\vec{x}_n) = 1$.\n", "$$\n", "\\begin{align}\n", "y(\\vec{x}_n, \\vec{w})\n", "&= w_0 \\phi_0(\\vec{x}_n)+w_1 \\phi_1(\\vec{x}_n)+ w_2 \\phi_2(\\vec{x}_n)+\\dots +w_{M-1} \\phi_{M-1}(\\vec{x}_n) \\\\\n", "&= \\sum_{j=0}^{M-1} w_j \\phi_j(\\vec{x}_n) \\\\\n", "&= \\vec{w}^\\top \\phi(\\vec{x}_n)\n", "\\end{align}\n", "$$\n", "of which $\\phi(\\vec{x}_n) = [\\phi_0(\\vec{x}_n),\\phi_1(\\vec{x}_n),\\phi_2(\\vec{x}_n), \\dots, \\phi_{M-1}(\\vec{x}_n)]^\\top$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Method I: Batch Gradient Descent\n", "- To minimize the objective function, take derivative w.r.t coefficient vector $\\vec{w}$ and descend: initialize $\\vec{w}^0$ to be any vector, and at each step $s$,\n", "$$\n", "\\vec{w}^{s+1} \\gets \\vec{w}^{s} - \\nabla_{\\vec{w}}E(\\vec{w}^s)\n", "$$\n", "\n", "Exercise: Compute the partial derivative\n", "$$\n", "(\\nabla_{\\vec{w}}E)_j = \\frac{\\partial E}{\\partial w_j}\n", "$$\n", "where\n", "$$\n", "E(\\vec{w}) = \\frac{1}{2} \\sum_{n=1}^N \\sum_{i=1}^{M} \\left( w_i \\phi_i(\\vec{x}_n) - t_n \\right)^2\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Solution\n", "In the summation over $i$, only $i = j$ term is a function of $w_j$. So, \n", "$$\n", "\\frac{\\partial E}{\\partial w_j}\n", "= \\frac{1}{2} \\sum_{n=1}^N \\frac{\\partial}{\\partial w_j} \\left( w_j \\phi_j(\\vec{x}_n) - t_n \\right)^2\n", "= \\sum_{n=1}^{N} (w_j \\phi_j(\\vec{x}_n) - t_n)\n", "$$\n", "\n", "*Tip*: If you find subscript notations confusing, just plug in $j = 1$, differentiate, and get $\\sum_{n=1}^{N} (w_1 \\phi_1(\\vec{x}_n) - t_n)$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Linear Regression: Matrix Notations\n", "The matrix $\\Phi \\in \\R^{N \\times M}$ is called **design matrix**. Each row represents one sample. Each column represents one feature\n", "$$\\Phi = \\begin{bmatrix}\n", "\\phi(\\vec{x}_1)^\\top\\\\ \n", "\\phi(\\vec{x}_2)^\\top\\\\ \n", "\\vdots\\\\\n", "\\phi(\\vec{x}_N)^\\top\n", "\\end{bmatrix}\n", "= \\begin{bmatrix}\n", "\\phi_0(\\vec{x}_1) & \\phi_1(\\vec{x}_1) & \\cdots & \\phi_{M-1}(\\vec{x}_1) \\\\\n", "\\phi_0(\\vec{x}_2) & \\phi_1(\\vec{x}_2) & \\cdots & \\phi_{M-1}(\\vec{x}_2) \\\\\n", "\\vdots & \\vdots & \\ddots & \\vdots \\\\\n", "\\phi_0(\\vec{x}_N) & \\phi_1(\\vec{x}_N) & \\cdots & \\phi_{M-1}(\\vec{x}_N) \\\\\n", "\\end{bmatrix}\n", "$$\n", "\n", "Target value vector is $\\vec{t} \\in \\mathbb{R}^M$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "$$\n", "E(\\vec{w}) \n", "= \\frac{1}{2} \\sum_{n=1}^N (y(\\vec{x}_n, \\vec{w}) - t_n)^2\n", "= \\frac{1}{2} \\sum_{n=1}^N \\left( \\sum_{j=0}^{M-1} w_j\\phi_j(\\vec{x}_n) - t_n \\right)^2\n", "= \\frac{1}{2} \\sum_{n=1}^N \\left( \\vec{w}^\\top \\phi(\\vec{x}_n) - t_n \\right)^2\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Batch Gradient Descent with Matrix Calculus\n", "Write the objective function in matrix-vector form:\n", "$\n", "\\begin{align*}\n", "E(\\vec{w}) &= \\frac{1}{2} \\sum_{n=1}^N \\sum_{i=1}^{M} \\left( w_i \\phi_i(\\vec{x}_n) - t_n \\right)^2 \\\\ \n", "&= \\frac{1}{2} \\sum_{n=1}^N \\left( \\phi(\\vec{x}_n)^\\top \\vec{w} - t_n \\right)^2 = \\frac{1}{2} \\|\\Phi \\vec{w} - \\vec{t}\\|_2^2\n", "\\end{align*}\n", "$\n", "\n", "Rewrite $E$ as a sum of three matrix-vector products. Hints:\n", "* $ \\vec{x}^\\top \\vec{x} = (x_1,\\ldots,x_M)^\\top (x_1,\\ldots,x_M) = x_1^2 + \\cdots + x_M^2 = \\left(\\sqrt{x_1^2 + \\cdots + x_M^2}\\right)^2 = \\|\\vec{x}\\|_2^2$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Distributive law: $(\\vec{a} + \\vec{b})^\\top(\\vec{c} + \\vec{d}) = \\vec{a}^\\top \\vec{c} + \\vec{a}^\\top \\vec{d} + \\vec{b}^\\top \\vec{c} + \\vec{b}^\\top\\vec{d}$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Transpose of a product: $(AB)^\\top = B^\\top A^\\top$ for matrix-vector multiplication." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Batch Gradient Descent with Matrix Calculus\n", "Treat $\\Phi \\vec{w}$ as a vector and we get from the distributive law\n", "$$\n", "E(\\vec{w}) = \\frac{1}{2} \\|\\Phi \\vec{w} - \\vec{t}\\|_2^2 = \\frac{1}{2} \\left(\\vec{w}^\\top \\Phi^\\top \\Phi \\vec{w} - \\vec{w}^\\top \\Phi^\\top \\vec{t} - \\vec{t}^\\top \\Phi \\vec{w} + \\vec{t}^\\top \\vec{t}\\right).\n", "$$\n", "\n", "Note that $\\vec{w}^\\top (\\Phi^\\top \\vec{t}) = (\\Phi^\\top\\vec{t})^\\top \\vec{w} = \\vec{t}^\\top \\Phi \\vec{w}$. So, the above simplifies to\n", "$$ \\frac{1}{2} \\left(\\vec{w}^\\top \\Phi^\\top \\Phi \\vec{w} - 2\\vec{w}^\\top \\Phi^\\top \\vec{t} + \\vec{t}^\\top \\vec{t}\\right).$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Batch Gradient Descent with Matrix Calculus\n", "Write the objective function in matrix-vector form:\n", "$$\n", "E(\\vec{w}) = \\frac{1}{2} \\left(\\vec{w}^\\top \\Phi^\\top \\Phi \\vec{w} - 2\\vec{w}^\\top \\Phi^\\top \\vec{t} + \\vec{t}^\\top \\vec{t}\\right).\n", "$$\n", "\n", "Compute the gradient $\\nabla_\\vec{w} E(\\vec{w})$ with matrix calculus. Hints:\n", "* $\\nabla_\\vec{x} (\\vec{x}^\\top A \\vec{x}) = (A + A^\\top) \\vec{x}$ (Challenge: prove this!) \n", "* $\\nabla_\\vec{x} (\\vec{x}^\\top \\vec{y}) = \\nabla_\\vec{x} (\\vec{y}^\\top\\vec{x}) = \\vec{y}$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* $\\Phi^\\top \\Phi$ has a special property.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Batch Gradient Descent with Matrix Calculus\n", "\n", "\n", "* Since $\\Phi^\\top\\Phi$ is symmetric, $\\nabla_{\\vec{w}} \\vec{w}^\\top(\\Phi^\\top\\Phi)\\vec{w} = 2(\\Phi^\\top\\Phi)\\vec{w}$.\n", "* Treating $\\Phi^\\top t$ as a vector, $\\nabla_{\\vec{w}} \\vec{w}^\\top(\\Phi^\\top t) = \\Phi^\\top t$.\n", "* Finally, $t^\\top t$ is constant with respect to $\\vec{w}$.\n", "\n", "So,\n", "$\\nabla_{\\vec{w}} E(\\vec{w}) = (\\Phi^\\top\\Phi)\\vec{w} - \\Phi^\\top t$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Method I-2: Gradient Descent—Stochastic Gradient Descent\n", "**Main Idea:** Instead of computing batch gradient (over entire training data), just compute gradient for individual (or a small subset of) training sample and update." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " Exercise : How do you implement the update rule for a minibatch gradient descent (of size, let's say, 5% of the whole dataset)?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "You randomly choose 5% of indices between 0 and $M$. Take the corresponding rows of $\\Phi$ and $t$. Compute the gradient on this subset of data and desend along." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Method II: Closed-Form solution, invertible case\n", "\n", "**Main Idea**, also **Exercise:** Solve $\\nabla_\\vec{w} E(\\vec{w}) = 0$, assuming $\\Phi^\\top\\Phi$ is invertible. Discuss why it is sufficent to solve this equation to find optimal $\\vec{w}$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "*Answer*: It is sufficent to find a local minimum because $E(\\vec{w})$ is convex. The solution is $\\vec{w} = (\\Phi^\\top\\Phi)^{-1}\\Phi^\\top t$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " Exercise : Show that $\\Phi^\\top \\Phi$ is invertible if $\\Phi$ has *linearly independent columns*. Interpret its implications about our features." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "*Answer*: It implies that our features are linearly dependent" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " Challenge: Similarly, we can show $\\Phi\\Phi^\\top$ is invertible if $\\Phi$ has linearly independent rows. Why do we care/not care about this case?\n", "\n", " Challenge: Show that $\\vec{b}$ is in the column space of $A$ if and only if there exists a vector $\\vec{x}$ such that $A\\vec{x} = \\vec{b}$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "#### Digression: Moore-Penrose Pseudoinverse\n", "- When we have a matrix $A$ that is non-invertible or *not even square*, we might want to invert anyway\n", "- For these situations we use $A^\\dagger$, the **Moore-Penrose Pseudoinverse** of $A$\n", "- In general, we can get $A^\\dagger$ by SVD: if we write $A \\in \\R^{m \\times n} = U_{m \\times m} \\Sigma_{m \\times n} V_{n \\times n}^\\top$ then $A^\\dagger \\in \\R^{n \\times m} = V \\Sigma^\\dagger U^\\top$, where $\\Sigma^\\dagger \\in \\R^{n \\times m}$ is obtained by taking reciprocals of *non-zero entries* of $\\Sigma^\\top$.\n", "- Particularly, when $A$ has linearly independent columns then $A^\\dagger = (A^\\top A)^{-1} A^\\top$. When $A$ is invertible, then $A^\\dagger = A^{-1}$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "** Exercise **: One property of Psuedoinverse is that $A A^\\dagger A = A$. \n", "Show that $$(A^{\\top} A)^{-1}A^\\top$$ satisfies this property (assuming linearly independent columns of $A$)\n", "\n", "*Challenge: * Show that $$\\hat{\\vec{w}} = (\\Phi^\\top\\Phi)^\\dagger \\Phi^\\top \\vec{t} = \\Phi^\\dagger \\vec{t}$$\n", "satisfies $\\nabla_\\vec{w} E(\\vec{w}) = \\Phi^\\top\\Phi \\vec{w} - \\Phi^\\top \\vec{t} = 0$.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "** Discuss **: What are the advantages and disadvtanges of each method we learned today (stochastic gradient descent, batch gradient descent, and closed-form solution)?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "*Answer*: There are no right answers, but you can say things like: matrix inversion is a cubic-time operation (technically $O(n^{2.37...})$). Performing better on your training data $\\Phi$ doesn't necessarily mean performing better on unseen test data." ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }