{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "$$ \\LaTeX \\text{ command declarations here.}\n",
    "\\newcommand{\\R}{\\mathbb{R}}\n",
    "\\renewcommand{\\vec}[1]{\\mathbf{#1}}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# EECS 545:  Machine Learning\n",
    "## Lecture 04:  Linear Regression I\n",
    "* Instructor:  **Jacob Abernethy, Benjamin Bray, Jia Deng and Chansoo Lee**\n",
    "* Date: 9/21/2016"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Notation\n",
    "\n",
    "- In this lecture, we will use\n",
    "    - Let vector $\\vec{x}_n \\in \\R^D$ denote the $n\\text{th}$ data. $D$ denotes number of attributes in dataset.\n",
    "    - Let vector $\\phi(\\vec{x}_n) \\in \\R^M$ denote features for data $\\vec{x}_n$. $\\phi_j(\\vec{x}_n)$ denotes the $j\\text{th}$ feature for data $x_n$.\n",
    "    - Feature $\\phi(\\vec{x}_n)$ is the *artificial* features which represents the preprocessing step. $\\phi(\\vec{x}_n)$ is usually some combination of transformations of $\\vec{x}_n$. For example, $\\phi(\\vec{x})$ could be vector constructed by $[\\vec{x}_n^\\top, \\cos(\\vec{x}_n)^\\top, \\exp(\\vec{x}_n)^\\top]^\\top$. If we do nothing to $\\vec{x}_n$, then $\\phi(\\vec{x}_n)=\\vec{x}_n$.\n",
    "    - Continuous-valued label vector $t \\in \\R^D$ (target values). $t_n \\in \\R$ denotes the target value for $i\\text{th}$ data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Linear Regression "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Linear Regression (General Case)\n",
    "- The function $y(\\vec{x}_n, \\vec{w})$ is linear in parameters $\\vec{w}$.\n",
    "    - **Goal:** Find the best value for the weights $\\vec{w}$.\n",
    "    - For simplicity, add a **bias term** $\\phi_0(\\vec{x}_n) = 1$.\n",
    "$$\n",
    "\\begin{align}\n",
    "y(\\vec{x}_n, \\vec{w})\n",
    "&= w_0 \\phi_0(\\vec{x}_n)+w_1 \\phi_1(\\vec{x}_n)+ w_2 \\phi_2(\\vec{x}_n)+\\dots +w_{M-1} \\phi_{M-1}(\\vec{x}_n) \\\\\n",
    "&= \\sum_{j=0}^{M-1} w_j \\phi_j(\\vec{x}_n) \\\\\n",
    "&= \\vec{w}^\\top \\phi(\\vec{x}_n)\n",
    "\\end{align}\n",
    "$$\n",
    "of which $\\phi(\\vec{x}_n) = [\\phi_0(\\vec{x}_n),\\phi_1(\\vec{x}_n),\\phi_2(\\vec{x}_n), \\dots, \\phi_{M-1}(\\vec{x}_n)]^\\top$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Method I: Batch Gradient Descent\n",
    "- To minimize the objective function, take derivative w.r.t coefficient vector $\\vec{w}$ and descend: initialize $\\vec{w}^0$ to be any vector, and at each step $s$,\n",
    "$$\n",
    "\\vec{w}^{s+1} \\gets \\vec{w}^{s} - \\nabla_{\\vec{w}}E(\\vec{w}^s)\n",
    "$$\n",
    "\n",
    "<b>Exercise</b>: Compute the partial derivative\n",
    "$$\n",
    "(\\nabla_{\\vec{w}}E)_j = \\frac{\\partial E}{\\partial w_j}\n",
    "$$\n",
    "where\n",
    "$$\n",
    "E(\\vec{w}) = \\frac{1}{2} \\sum_{n=1}^N \\sum_{i=1}^{M} \\left( w_i \\phi_i(\\vec{x}_n) - t_n \\right)^2\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Solution\n",
    "In the summation over $i$, only $i = j$ term is a function of $w_j$. So, \n",
    "$$\n",
    "\\frac{\\partial E}{\\partial w_j}\n",
    "= \\frac{1}{2} \\sum_{n=1}^N \\frac{\\partial}{\\partial w_j} \\left( w_j \\phi_j(\\vec{x}_n) - t_n \\right)^2\n",
    "= \\sum_{n=1}^{N} (w_j \\phi_j(\\vec{x}_n) - t_n)\n",
    "$$\n",
    "\n",
    "*Tip*: If you find subscript notations confusing, just plug in $j = 1$, differentiate, and get $\\sum_{n=1}^{N} (w_1 \\phi_1(\\vec{x}_n) - t_n)$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Linear Regression: Matrix Notations\n",
    "The matrix $\\Phi \\in \\R^{N \\times M}$ is called **design matrix**. Each row represents one sample. Each column represents one feature\n",
    "$$\\Phi = \\begin{bmatrix}\n",
    "\\phi(\\vec{x}_1)^\\top\\\\ \n",
    "\\phi(\\vec{x}_2)^\\top\\\\ \n",
    "\\vdots\\\\\n",
    "\\phi(\\vec{x}_N)^\\top\n",
    "\\end{bmatrix}\n",
    "= \\begin{bmatrix}\n",
    "\\phi_0(\\vec{x}_1) & \\phi_1(\\vec{x}_1) & \\cdots & \\phi_{M-1}(\\vec{x}_1) \\\\\n",
    "\\phi_0(\\vec{x}_2) & \\phi_1(\\vec{x}_2) & \\cdots & \\phi_{M-1}(\\vec{x}_2) \\\\\n",
    "\\vdots & \\vdots & \\ddots & \\vdots \\\\\n",
    "\\phi_0(\\vec{x}_N) & \\phi_1(\\vec{x}_N) & \\cdots & \\phi_{M-1}(\\vec{x}_N) \\\\\n",
    "\\end{bmatrix}\n",
    "$$\n",
    "\n",
    "Target value vector is $\\vec{t} \\in \\mathbb{R}^M$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "$$\n",
    "E(\\vec{w}) \n",
    "= \\frac{1}{2} \\sum_{n=1}^N (y(\\vec{x}_n, \\vec{w}) - t_n)^2\n",
    "= \\frac{1}{2} \\sum_{n=1}^N \\left( \\sum_{j=0}^{M-1} w_j\\phi_j(\\vec{x}_n) - t_n \\right)^2\n",
    "= \\frac{1}{2} \\sum_{n=1}^N \\left( \\vec{w}^\\top \\phi(\\vec{x}_n) - t_n \\right)^2\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Batch Gradient Descent with Matrix Calculus\n",
    "Write the objective function in matrix-vector form:\n",
    "$\n",
    "\\begin{align*}\n",
    "E(\\vec{w}) &=  \\frac{1}{2} \\sum_{n=1}^N \\sum_{i=1}^{M} \\left( w_i \\phi_i(\\vec{x}_n) - t_n \\right)^2 \\\\ \n",
    "&= \\frac{1}{2} \\sum_{n=1}^N \\left(  \\phi(\\vec{x}_n)^\\top \\vec{w} - t_n \\right)^2 = \\frac{1}{2} \\|\\Phi \\vec{w} - \\vec{t}\\|_2^2\n",
    "\\end{align*}\n",
    "$\n",
    "\n",
    "Rewrite $E$ as a sum of three matrix-vector products. <i>Hints:</i>\n",
    "* $ \\vec{x}^\\top \\vec{x} = (x_1,\\ldots,x_M)^\\top (x_1,\\ldots,x_M) =  x_1^2 + \\cdots + x_M^2 = \\left(\\sqrt{x_1^2 + \\cdots + x_M^2}\\right)^2 = \\|\\vec{x}\\|_2^2$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "* Distributive law: $(\\vec{a} + \\vec{b})^\\top(\\vec{c} + \\vec{d}) = \\vec{a}^\\top \\vec{c} + \\vec{a}^\\top \\vec{d} + \\vec{b}^\\top \\vec{c} + \\vec{b}^\\top\\vec{d}$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "* Transpose of a product: $(AB)^\\top = B^\\top A^\\top$ for matrix-vector multiplication."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Batch Gradient Descent with Matrix Calculus\n",
    "Treat $\\Phi \\vec{w}$ as a vector and we get from the distributive law\n",
    "$$\n",
    "E(\\vec{w}) = \\frac{1}{2} \\|\\Phi \\vec{w} - \\vec{t}\\|_2^2 = \\frac{1}{2} \\left(\\vec{w}^\\top \\Phi^\\top \\Phi \\vec{w} - \\vec{w}^\\top \\Phi^\\top \\vec{t} - \\vec{t}^\\top \\Phi \\vec{w} + \\vec{t}^\\top \\vec{t}\\right).\n",
    "$$\n",
    "\n",
    "Note that $\\vec{w}^\\top (\\Phi^\\top \\vec{t}) = (\\Phi^\\top\\vec{t})^\\top \\vec{w} = \\vec{t}^\\top \\Phi \\vec{w}$. So, the above simplifies to\n",
    "$$ \\frac{1}{2} \\left(\\vec{w}^\\top \\Phi^\\top \\Phi \\vec{w} - 2\\vec{w}^\\top \\Phi^\\top \\vec{t} + \\vec{t}^\\top \\vec{t}\\right).$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Batch Gradient Descent with Matrix Calculus\n",
    "Write the objective function in matrix-vector form:\n",
    "$$\n",
    "E(\\vec{w}) = \\frac{1}{2} \\left(\\vec{w}^\\top \\Phi^\\top \\Phi \\vec{w} - 2\\vec{w}^\\top \\Phi^\\top \\vec{t} + \\vec{t}^\\top \\vec{t}\\right).\n",
    "$$\n",
    "\n",
    "Compute the gradient $\\nabla_\\vec{w} E(\\vec{w})$ with matrix calculus. <i>Hints</i>:\n",
    "* $\\nabla_\\vec{x} (\\vec{x}^\\top A \\vec{x}) = (A + A^\\top) \\vec{x}$  <i> (Challenge: prove this!) </i>\n",
    "* $\\nabla_\\vec{x} (\\vec{x}^\\top \\vec{y}) = \\nabla_\\vec{x} (\\vec{y}^\\top\\vec{x}) = \\vec{y}$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "* $\\Phi^\\top \\Phi$ has a special property.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Batch Gradient Descent with Matrix Calculus\n",
    "\n",
    "\n",
    "* Since $\\Phi^\\top\\Phi$ is symmetric, $\\nabla_{\\vec{w}} \\vec{w}^\\top(\\Phi^\\top\\Phi)\\vec{w} = 2(\\Phi^\\top\\Phi)\\vec{w}$.\n",
    "* Treating $\\Phi^\\top t$ as a vector, $\\nabla_{\\vec{w}} \\vec{w}^\\top(\\Phi^\\top t) = \\Phi^\\top t$.\n",
    "* Finally, $t^\\top t$ is constant with respect to $\\vec{w}$.\n",
    "\n",
    "So,\n",
    "$\\nabla_{\\vec{w}} E(\\vec{w}) = (\\Phi^\\top\\Phi)\\vec{w} - \\Phi^\\top t$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Method I-2: Gradient Descent—Stochastic Gradient Descent\n",
    "**Main Idea:**  Instead of computing batch gradient (over entire training data), just compute gradient for individual (or a small subset of) training sample and update."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "<b> Exercise </b>: How do you implement the update rule for a minibatch gradient descent (of size, let's say, 5% of the whole dataset)?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "You randomly choose 5% of indices between 0 and $M$. Take the corresponding rows of $\\Phi$ and $t$. Compute the gradient on this subset of data and desend along."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Method II: Closed-Form solution, invertible case\n",
    "\n",
    "**Main Idea**, also **Exercise:**  Solve $\\nabla_\\vec{w} E(\\vec{w}) = 0$, assuming $\\Phi^\\top\\Phi$ is invertible. Discuss why it is sufficent to solve this equation to find optimal $\\vec{w}$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "*Answer*: It is sufficent to find a local minimum because $E(\\vec{w})$ is convex. The solution is $\\vec{w} = (\\Phi^\\top\\Phi)^{-1}\\Phi^\\top t$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "<b> Exercise </b>: Show that $\\Phi^\\top \\Phi$ is invertible if $\\Phi$ has *linearly independent columns*. Interpret its implications about our features."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "*Answer*: It implies that our features are linearly dependent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "<i> Challenge: </i> Similarly, we can show $\\Phi\\Phi^\\top$ is invertible if $\\Phi$ has linearly independent rows. Why do we care/not care about this case?\n",
    "\n",
    "<i> Challenge: </i> Show that $\\vec{b}$ is in the column space of $A$ if and only if there exists a vector $\\vec{x}$ such that $A\\vec{x} = \\vec{b}$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "#### Digression: Moore-Penrose Pseudoinverse\n",
    "- When we have a matrix $A$ that is non-invertible or *not even square*, we might want to invert anyway\n",
    "- For these situations we use $A^\\dagger$, the **Moore-Penrose Pseudoinverse** of $A$\n",
    "- In general, we can get $A^\\dagger$ by SVD: if we write $A \\in \\R^{m \\times n} = U_{m \\times m} \\Sigma_{m \\times n} V_{n \\times n}^\\top$ then $A^\\dagger \\in \\R^{n \\times m} = V \\Sigma^\\dagger U^\\top$, where $\\Sigma^\\dagger \\in \\R^{n \\times m}$ is obtained by taking reciprocals of *non-zero entries* of $\\Sigma^\\top$.\n",
    "- Particularly, when $A$ has linearly independent columns then $A^\\dagger = (A^\\top A)^{-1} A^\\top$. When $A$ is invertible, then $A^\\dagger = A^{-1}$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "** Exercise **: One property of Psuedoinverse is that $A A^\\dagger A = A$. \n",
    "Show that $$(A^{\\top} A)^{-1}A^\\top$$ satisfies this property (assuming linearly independent columns of $A$)\n",
    "\n",
    "*Challenge: * Show that $$\\hat{\\vec{w}} = (\\Phi^\\top\\Phi)^\\dagger \\Phi^\\top \\vec{t} = \\Phi^\\dagger \\vec{t}$$\n",
    "satisfies $\\nabla_\\vec{w} E(\\vec{w}) = \\Phi^\\top\\Phi \\vec{w} - \\Phi^\\top \\vec{t} = 0$.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "** Discuss **: What are the advantages and disadvtanges of each method we learned today (stochastic gradient descent, batch gradient descent, and closed-form solution)?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "*Answer*: There are no right answers, but you can say things like: matrix inversion is a cubic-time operation (technically $O(n^{2.37...})$). Performing better on your training data $\\Phi$ doesn't necessarily mean performing better on unseen test data."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}