{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Lab 1: Regression"
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "11th February 2014\n",
      "Gaussian Process Road Show, Pereira, Colombia"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "written by Neil Lawrence"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "$$\\newcommand{\\inputScalar}{x}\n",
      "\\newcommand{\\inputVector}{\\mathbf{x}}\n",
      "\\newcommand{\\inputMatrix}{\\mathbf{X}}\n",
      "\\newcommand{\\dataScalar}{y}\n",
      "\\newcommand{\\dataVector}{\\mathbf{y}}\n",
      "\\newcommand{\\dataMatrix}{\\mathbf{Y}}\n",
      "\\newcommand{\\lengthScale}{\\ell}\n",
      "\\newcommand{\\mappingScalar}{w}\n",
      "\\newcommand{\\mappingVector}{\\mathbf{w}}\n",
      "\\newcommand{\\mappingFunctionScalar}{f}\n",
      "\\newcommand{\\mappingFunctionVector}{\\mathbf{f}}\n",
      "\\newcommand{\\dataStd}{\\sigma}\n",
      "\\newcommand{\\numData}{n}\n",
      "\\newcommand{\\gaussianDist}[3]{\\mathcal{N}\\left(#1|#2,#3\\right)}\n",
      "\\newcommand{\\gaussianSamp}[2]{\\mathcal{N}\\left(#1,#2\\right)}\n",
      "\\newcommand{\\zerosVector}{\\mathbf{0}}\n",
      "\\newcommand{\\eye}{\\mathbf{I}}\n",
      "\\newcommand{\\noiseScalar}{\\epsilon}\n",
      "\\newcommand{\\noiseVector}{\\mathbf{\\epsilon}}\n",
      "\\newcommand{\\noiseMatrix}{\\mathbf{\\Epsilon}}\n",
      "\\newcommand{\\basisMatrix}{\\mathbf{\\Phi}}\n",
      "\\newcommand{\\basisVector}{\\mathbf{\\phi}}\n",
      "\\newcommand{\\basisScalar}{\\phi}\n",
      "\\newcommand{\\expSamp}[1]{\\left<#1\\right>}\n",
      "\\newcommand{\\expDist}[2]{\\left<#1\\right>_{#2}}\n",
      "\\newcommand{\\covarianceMatrix}{\\mathbf{C}}\n",
      "\\newcommand{\\meanVector}{\\boldsymbol{\\mu}}\n",
      "\\newcommand{\\kernelScalar}{k}\n",
      "\\newcommand{\\kernelVector}{\\mathbf{\\kernelScalar}}\n",
      "\\newcommand{\\kernelMatrix}{\\mathbf{K}}\n",
      "\\newcommand{\\meanScalar}{\\mu}\n",
      "\\newcommand{\\ltwoNorm}[1]{\\left\\Vert #1 \\right\\Vert_2}$$"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Welcome to the IPython notebook! We will be using the IPython notebook for all our lab classes and assignments. It is a really convenient way to interact with data using python. In this first lab session we are going to familiarise ourselves with the notebook and start getting used to python whilst we review some of the material from the first lecture.\n",
      "\n",
      "Python is a generic programming language with 'numerical' and scientific capabilities added on through the numpy and scipy libraries. There are excellent 2-D plotting facilities available through matplotlib.\n"
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Importing Libraries"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The numpy library provides most of the manipulations we need for arrays in python. numpy is short for numerical python, but as well as providing the numerics, numpy provides contiguous array objects. These objects weren't available in the original python. The first step is to import numpy. We'll then use it to draw samples from a \"standard normal\". A standard normal is a Gaussian density with mean of zero and variance of one. We'll draw 10 samples from the standard normal."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%pylab inline\n",
      "import numpy as np\n",
      "import pylab as pb\n",
      "import GPy\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To get help about any command in the notebook simply type that command followed by a question mark."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.random.normal?\n",
      "X = np.random.normal(0, 1, size=(10))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now let's look at the samples, we can show them using the print command."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print X"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Estimating Moments"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can compute the sample mean by adding all the samples together and dividing by the number of samples."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "X.mean()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "which is easy to write in code as follows"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "X.var()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We know in this case, because we sampled from a standard normal, that the mean and variance of the distribution should be 0 and 1. Why do you not get a mean of 0 and a variance of 1? Let's explore what happens as we increase the number of samples. To do this we are going to use for loops and python lists. We start by creating empty lists for the means and variances. Then we create a list of integers to iterate through. In Python, a for loop always iterates through a list (in some languages this is called a foreach loop, its counterpart the counter for loop only exists by creating a list of integers, see http://en.wikipedia.org/wiki/Foreach_loop#Python). We can use the range command to create the numbers of samples. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "means = []\n",
      "variances = []\n",
      "samples = [10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000] \n",
      "for n in samples:\n",
      "    x = np.random.randn(n)\n",
      "    mean = x.mean()\n",
      "    variance = (x**2).mean() - mean**2\n",
      "    means.append(mean)\n",
      "    variances.append(variance)\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Plotting in Python"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We'll now plot the variance and the mean against the number of samples. To do this, we need to first convert the samples, varianes and means from Python lists, to numpy arrays."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "means = np.asarray(means)\n",
      "variances = np.asarray(variances)\n",
      "samples = np.asarray(samples)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next we need to include the plotting functionality from matplotlib, and instruct IPython notebook to include the plots *inline* with the notebook, rather than in a different window. First we import the plotting library, matplotlib."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Here we plot the estimated mean against the number of samples. However, since the samples go up logarithmically it's better to use a logarithmic axis for the x axis, as follows."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "plt.semilogx(samples, means)\n",
      "xlabel('$\\log_{10}n$')\n",
      "ylabel('mean')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can do the same for the variances, again using a logarithmic axis for the samples. This time, we're going to lavel the x axis using a latex formula."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "plt.semilogx(samples, variances)\n",
      "xlabel('$\\log_{10}n$')\n",
      "ylabel('variance')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Linear Regression: Iterative Solution"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next we will code linear regression in python. We will do it in two ways, once using iterative updates (coordinate ascent) and then using linear algebra. \n",
      "For this part we are going to load in some real data, we will use the example from the Olympics: the pace of Marathon winners. To load their data (which is in comma separated values [csv] format) we first need to download it from [this link](http://staffwww.dcs.shef.ac.uk/people/N.Lawrence/dataset_mirror/olympic_marathon_men/olympicMarathonTimes.csv \"olympic marathon data\"), and then load it as follows:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "data = GPy.util.datasets.olympic_marathon_men()\n",
      "x = data['X']\n",
      "y = data['Y']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "You can se what these values are by typing:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(x)\n",
      "print(y)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Plotting the Data"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "And you can make a plot of $y$ vs $x$ with the following command:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "plt.plot(x, y, 'rx')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Maximum Likelihood: Iterative Solution"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we are going to fit a line, $y_i=mx_i + c$, to the data you've plotted. We are trying to minimize the error function:"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "$$E(m, c, \\sigma^2) = \\frac{N}{2} \\log \\sigma^2 + \\frac{1}{2\\sigma^2} \\sum_{i=1}^N(y_i-mx_i-c)^2$$"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "with respect to $m$, $c$ and $\\sigma^2$. We can start with an initial guess for $m$, "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "m = -0.4\n",
      "c = 80 "
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Then we use the maximum likelihood update, derived in the lecture, to find an estimate for the offset, $c$,"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "$$c^* = \\frac{\\sum_{i=1}^N(y_i-m^*x_i)}{N}$$"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "c = (y - m*x).mean()\n",
      "print c"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "And now we can make an estimate for the gradient of the line,"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "$$m^* = \\frac{\\sum_{i=1}^N ((y_i - c)*x_i))}{\\sum_{i=1}^N x_i^2}$$"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "m = ((y - c)*x).sum()/(x**2).sum()\n",
      "print m"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can have a look at how good our fit is by computing the prediction across the input space. First create a vector of 'test points',"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "x_test = np.linspace(1890, 2020, 130)[:, None]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now use this vector to compute some test predictions,"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "f_test = m*x_test + c"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now plot those test predictions with a blue line on the same plot as the data,"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "plt.plot(x_test, f_test, 'b-')\n",
      "plt.plot(x, y, 'rx')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The fit isn't very good, we need to iterate between these parameter updates in a loop to improve the fit, we have to do this several times,"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for i in np.arange(10):\n",
      "    m = ((y - c)*x).sum()/(x*x).sum()\n",
      "    c = (y-m*x).sum()/y.shape[0]\n",
      "print(m)\n",
      "print(c)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "And let's try plotting the result again"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "f_test = m*x_test + c\n",
      "plt.plot(x_test, f_test, 'b-')\n",
      "plt.plot(x, y, 'rx')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Clearly we need more iterations than 10! Let's try add more iterations above to try and get closer to the solution. How many iterations does it need to get a reasonable answer?"
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Direct Solution with Linear Algebra"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Hopefully, you are now persuaded of the merits of solving the entire system, simultaneously, using linear algebra. To do that, we need to make a design matrix of the data, which includes the $x_0=1$ column, to represent the bias, remember (from the lecture notes) that we are now moving to a system where our prediction is given by an inner product:"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "$$f(\\mathbf{x}_i) = \\mathbf{x}_i^\\top\\mathbf{w}$$"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "where each vector $\\mathbf{x}_i$ is given by appending a 1 onto the original vector"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "$$\\mathbf{x}_i = \n",
      "\\begin{bmatrix} \n",
      "1 \\\\\\\n",
      "x_i\n",
      "\\end{bmatrix}$$"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can do this for the entire data set to form a design matrix $\\mathbf{X}$,"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "$$\\mathbf{X} = \\begin{bmatrix} \n",
      "\\mathbf{x}_1^\\top \\\\\\ \n",
      "\\mathbf{x}_2^\\top \\\\\\ \n",
      "\\vdots \\\\\\\n",
      "\\mathbf{x}_N^\\top\n",
      "\\end{bmatrix} = \\begin{bmatrix}\n",
      "1 & x_1 \\\\\\\n",
      "1 & x_2 \\\\\\\n",
      "\\vdots & \\vdots \\\\\\\n",
      "1 & x_N \n",
      "\\end{bmatrix},$$"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "which in numpy is done with the following commands:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "X = np.hstack((np.ones_like(x), x))\n",
      "print(X)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "From the multivariate regression solution we derived in the lecture, the maximum likelihood solution for $\\mathbf{w}^*$ is given by"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "$$\\mathbf{w}^* = \\left[\\mathbf{X}^\\top \\mathbf{X}\\right]^{-1} \\mathbf{X}^\\top \\mathbf{y}$$ "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "First let's persuade ourselves of a few things. We suggested in the lecture that"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "$$\\sum_{i=1}^N \\mathbf{x}_i\\mathbf{x}_i^\\top = \\mathbf{X}^\\top\\mathbf{X}$$"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can show that this is, indeed the case for our data. First we need to know how to do matrix multiplication and transpose using numpy, this is done as"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.dot(X.T, X)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we will compute the same thing with a for loop"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "store = np.zeros((2, 2))\n",
      "for i in np.arange(X.shape[0]):\n",
      "    store += np.outer(X[i, :], X[i, :])\n",
      "print store\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "I hope that you agree that the first version is a little more compact. But is it quicker? We'll use the `time` module to find out."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Create a larger matrix for test\n",
      "A = np.random.normal(loc=0.0, scale=1.0, size=(10000, 4))\n",
      "import time\n",
      "start = time.time()\n",
      "np.dot(A.T, A)\n",
      "end = time.time()\n",
      "print \"Matrix multiplication: \", end - start\n",
      "\n",
      "start = time.time()\n",
      "store = np.zeros((A.shape[1], A.shape[1]))\n",
      "for i in np.arange(A.shape[0]):\n",
      "    store += np.outer(A[i, :], A[i, :])\n",
      "end = time.time()\n",
      "print \"For loop over outer products: \", end - start"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Solving the System"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The solution for $\\mathbf{w}$ is given in terms of a matrix inverse, but numerically this isn't the best way to compute it. What we actually want python to do is to *solve* the system of linear equations given by"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "$$\\mathbf{X}^\\top\\mathbf{X} \\mathbf{w} = \\mathbf{X}^\\top\\mathbf{y}$$"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "for $\\mathbf{w}$. This can be done in numpy using the command"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.linalg.solve?"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "so we can obtain the solution using"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "w = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))\n",
      "print w"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Allowing us to plot the fit as follows"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "m = w[1]; c=w[0]\n",
      "f_test = m*x_test + c\n",
      "print(m)\n",
      "print(c)\n",
      "plt.plot(x_test, f_test, 'b-')\n",
      "plt.plot(x, y, 'rx')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Quadratic Fit"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we will fit a quadratic model using basis functions. Given everything we've learnt above, this is now quite easy to do. Firstly, we need to create a design matrix that contains the quadratic basis, "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "$$\\mathbf{\\Phi} = \\left[ \\mathbf{1} \\quad \\mathbf{x} \\quad \\mathbf{x}^2\\right]$$"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "where this notation means that each column of $\\mathbf{\\Phi}$ is derived from the entire set of input years."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "Phi = np.hstack([np.ones(x.shape), x, x**2])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we can solve this system for $\\mathbf{w}$ just as we did for the linear case, so we have,"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "w = np.linalg.solve(np.dot(Phi.T, Phi), np.dot(Phi.T, y))\n",
      "print(w)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can plot the solution in two different ways, either we take"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "f_test = w[2]*x_test**2 + w[1]*x_test + w[0]\n",
      "plt.plot(x_test, f_test, 'b-')\n",
      "plt.plot(x, y, 'rx')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Or we can do the matrix form of this equation which first involves creating a design matrix for the test points,\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "Phi_test = np.hstack((np.ones_like(x_test), x_test, x_test**2))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "and then computing the value of the function using a matrix multiply"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "f_test = np.dot(Phi_test,w)\n",
      "plt.plot(x_test, f_test, 'b-')\n",
      "plt.plot(x, y, 'rx')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Note the values of the coefficient $w_2$ in particular. It is relatively small, because it is multiplying a large number (square of 2000 is 4 million). This need to use small coefficients becomes worse as we increase the order of the fit. As an exercise for later, try fitting higher order polynomials to the data. See what happens as you increase the polynomial order."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Generalization"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The aim of this notebook is to review the different methods of model selection: hold out validation, leave one out cross validation and cross validation. "
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Training Error"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The first thing we'll do is plot the training error for the polynomial fit. To do this let's set up some parameters."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "num_data = x.shape[0]\n",
      "num_pred_data = 100 # how many points to use for plotting predictions\n",
      "x_pred = linspace(1890, 2016, num_pred_data)[:, None] # input locations for predictions\n",
      "order = 4 # The polynomial order to use.\n",
      "                 "
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "now let's build the basis matrices.\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# build the basis set\n",
      "Phi = np.zeros((num_data, order+1))\n",
      "Phi_pred = np.zeros((num_pred_data, order+1))\n",
      "for i in range(0, order+1):\n",
      "    Phi[:, i:i+1] = x**i\n",
      "    Phi_pred[:, i:i+1] = x_pred**i\n",
      "\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "now we can solve for the regression weights and make predictions both for the training data points, and the test data points. That involves solving the linear system given by"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "$$\\basisMatrix^\\top \\basisMatrix \\mappingVector^* = \\basisMatrix^\\top \\dataVector$$"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# solve the linear system\n",
      "w_star = np.linalg.solve(np.dot(Phi.T, Phi), np.dot(Phi.T, y))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "and using the resulting vector to make predictions at the training points and test points,"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "$$\\mathbf{f} = \\basisMatrix \\mappingVector.$$"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To implement this in practice we need to use basis matrices for both the predictions and the training points."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# predict at training and test points\n",
      "f = np.dot(Phi, w_star)\n",
      "f_pred = np.dot(Phi_pred, w_star)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "These can be used to compute the error"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "$$E(\\mappingVector) =  \\frac{\\numData}{2} \\log \\dataStd^2 + \\frac{1}{2\\dataStd^2} \\sum_{i=1}^\\numData \\left(\\dataScalar_i - \\mappingVector^\\top \\phi(\\inputVector_i)\\right)^2 \\\\\\\n",
      "E(\\mappingVector) = \\frac{\\numData}{2} \\log \\dataStd^2 + \\frac{1}{2\\dataStd^2} \\sum_{i=1}^\\numData \\left(\\dataScalar_i - \\mappingFunctionScalar_i\\right)^2$$"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# compute the sum of squares term\n",
      "sum_squares = ((y-f)**2).sum()\n",
      "# fit the noise variance\n",
      "sigma2 = sum_squares/num_data\n",
      "error = 0.5*(num_data*np.log(sigma2) + sum_squares/sigma2)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we have the fit and the error, let's plot the fit and the error."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# print the error and plot the predictions\n",
      "print(\"The error is: %2.4f\"%error)\n",
      "plt.plot(x_pred, f_pred)\n",
      "plt.plot(x, y, 'rx')\n",
      "ax = plt.gca()\n",
      "ax.set_title('Predictions for Order 5')\n",
      "ax.set_xlabel('year')\n",
      "ax.set_ylabel('pace (min/km)')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now use the loop structure below to compute the error for different model orders.\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# import the time model to allow python to pause.\n",
      "import time\n",
      "# import the IPython display module to clear the output.\n",
      "from IPython.display import clear_output\n",
      "\n",
      "num_data = len(x)\n",
      "error_list = []\n",
      "max_order = 6\n",
      "sigma2 = 1\n",
      "fig, axes = plt.subplots(nrows=1, ncols=2)\n",
      "for order in range(0, max_order+1):\n",
      "    # 1. build the basis set\n",
      "    Phi = np.zeros((num_data, order+1))\n",
      "    Phi_pred = np.zeros((num_pred_data, order+1))\n",
      "    for i in range(0, order+1):\n",
      "        Phi[:, i] = x.flatten()**i\n",
      "        Phi_pred[:, i] = x_pred.flatten()**i\n",
      "    \n",
      "    # 2. solve the linear system\n",
      "    w = np.linalg.solve(np.dot(Phi.T, Phi), np.dot(Phi.T, y))\n",
      "    # 3. make predictions at training and test points\n",
      "    f_pred = np.dot(Phi_pred, w)\n",
      "    f = np.dot(Phi, w)\n",
      "    # 4. compute the error and append it to a list.\n",
      "    error_list.append(((y-f)**2).sum() + num_data/2.*np.log(sigma2))\n",
      "    # 5. plot the predictions\n",
      "    axes[0].clear()\n",
      "    axes[1].clear()    \n",
      "    axes[0].plot(x_pred, f_pred)\n",
      "    axes[0].plot(x, y, 'rx')\n",
      "    axes[0].set_ylim((2.5, 5.5))\n",
      "    axes[0].set_title('Predictions for Order ' + str(order) + ' model.')\n",
      "    axes[1].plot(np.arange(0, order+1), np.asarray(error_list))\n",
      "    axes[1].set_xlim((0, max_order))\n",
      "    axes[1].set_ylim((0, 10))\n",
      "    axes[1].set_title('Training Error')\n",
      "    display(fig)\n",
      "    time.sleep(1)\n",
      "    clear_output()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
  ],
   "metadata": {}
  }
 ]
}