{ "metadata": { "name": "", "signature": "sha256:ecd8cb1ffac00d7860d44112696ebe6127b1c13a81aaf3af3bbcd1c17a310477" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machines and Intelligence Lab Class\n", "\n", "### 8th November 2013 Neil Lawrence\n", "\n", "Welcome to the IPython notebook! We will be using the IPython notebook for our lab class on robotic inference. The IPython notebook is a very convenient way to interact with data using python. We will start the lab session by familiarising ourselves with the notebook and start getting used to python whilst we review some of the material from the first lecture.\n", "\n", "Python is a generic programming language with 'numerical' and scientific capabilities added on through the `numpy` and `scipy` libraries. There are excellent 2-D plotting facilities available through matplotlib.$$\\newcommand{\\inputScalar}{x}\n", "\\newcommand{\\lengthScale}{\\ell}\n", "\\newcommand{\\mappingVector}{\\mathbf{w}}\n", "\\newcommand{\\gaussianDist}[3]{\\mathcal{N}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gaussianSamp}[2]{\\mathcal{N}\\left(#1,#2\\right)}\n", "\\newcommand{\\zerosVector}{\\mathbf{0}}\n", "\\newcommand{\\eye}{\\mathbf{I}}\n", "\\newcommand{\\dataStd}{\\sigma}\n", "\\newcommand{\\dataScalar}{y}\n", "\\newcommand{\\dataVector}{\\mathbf{y}}\n", "\\newcommand{\\dataMatrix}{\\mathbf{Y}}\n", "\\newcommand{\\noiseScalar}{\\epsilon}\n", "\\newcommand{\\noiseVector}{\\mathbf{\\epsilon}}\n", "\\newcommand{\\noiseMatrix}{\\mathbf{\\Epsilon}}\n", "\\newcommand{\\latentScalar}{x}\n", "\\newcommand{\\latentVector}{\\mathbf{x}}\n", "\\newcommand{\\latentMatrix}{\\mathbf{X}}\n", "\\newcommand{\\inputScalar}{x}\n", "\\newcommand{\\inputVector}{\\mathbf{x}}\n", "\\newcommand{\\inputMatrix}{\\mathbf{X}}\n", "\\newcommand{\\kernelMatrix}{\\mathbf{K}}\n", "\\newcommand{\\basisMatrix}{\\mathbf{\\Phi}}\n", "\\newcommand{\\basisVector}{\\mathbf{\\phi}}\n", "\\newcommand{\\basisScalar}{\\phi}\n", "\\newcommand{\\expSamp}[1]{\\left<#1\\right>}\n", "\\newcommand{\\expDist}[2]{\\left<#1\\right>_{#2}}\n", "\\newcommand{\\covarianceMatrix}{\\mathbf{C}}\n", "\\newcommand{\\numData}{n}\n", "\\newcommand{\\mappingScalar}{w}\n", "\\newcommand{\\mappingFunctionScalar}{f}\n", "\\newcommand{\\mappingFunctionVector}{\\mathbf{f}}\n", "\\newcommand{\\meanVector}{\\boldsymbol{\\mu}}\n", "\\newcommand{\\meanScalar}{\\mu}\n", "\\newcommand{\\errorFunction}{E}$$\n", "\n", "\n", "Since we only have one lab session, I thought to try and cover some of the material we've seen in the module again in the lab. You won't have time to do it all *now*, but you can use this to help keep the material fresh in your mind, and to review things we've talked about in the lectures. You can always come back to this lab sheet and use it to help review material from the lab sessions. Before we start on the material we've seen in the lectures though, let's introduce some simple python programming. You don't have to write any programs in the lab, they've all been written for you. But with luck you might want to play with the code a bit, change a few things and see what effect the change has (that's a really good way to learn!). So it will be helpful to start with a bit of background to the language the lab class is constructed in.\n", "\n", "## Python Programming\n", "\n", "Being a computer used to be a job. 
A hundred years ago a computer was a person, often a young woman, who performed computations for creating statistical tables. Each person would be given a particular piece of arithmetic (which was often fairly repetitive), or perhaps if they were lucky, slightly complicated mathematics. Automatic computers, or electronic computers, were machines that did this job instead of the young women. From the very start then, computers were seen as machines that could help humans avoid repetitive tasks. However, I'm sure a lot of people's experience of computers is that they are machines that make you do a lot of repetitive tasks. For example, data entry. Sometimes data entry involves copying and pasting lots of different data from the web to an Excel spreadsheet. That's incredibly boring! Computer scientists hate repetitive tasks, because they know that computers should be doing them for us. This can be taken to extremes, as this `xkcd` comic illustrates, but generally if you find yourself doing something repetitive, you *really* want to find a way of making the computer do it for you.\n", "\n", "[![the general problem][2]][1]\n", "[1]: http://xkcd.com/974/\n", "[2]: http://imgs.xkcd.com/comics/the_general_problem.png\n", "\n", "\n", "When I first started using computers, the prevailing idea was that a language called `BASIC` was the right way to do this. As time evolved, the idea of writing scripts, short sets of instructions to avoid repetitive tasks, emerged, along with a new class of languages known as scripting languages (http://en.wikipedia.org/wiki/Scripting_language). Python is a scripting language that is also designed to be a general purpose language. It's designed to allow you to write small scripts, but it is also a fully featured language, like Java, allowing it to be used for larger applications.\n", "\n", "Python is a programming language, just like Java. It is also a *high level programming language*, just like Java. And it supports object oriented programming. However, its design philosophy is quite different. Java is a programming language designed for software engineering: it has its origins in a language called C. Because Python arises from scripting languages, it was designed to be easy to operate 'straight out of the box'. So whereas your first Java program is of the form \n", "\n", "```\n", "/* \n", "A simple Java program \n", "Written by: Guy J. Brown \n", "First written: 19/8/02 \n", "Last rewritten: 24/8/02 \n", "*/ \n", "public class Simple { \n", "    public static void main(String[] args) { \n", "        System.out.print(\"Running a Java application\"); \n", "        System.out.println(\"...finished.\"); \n", "    } \n", "}```\n", "\n", "Your first python program can be of the form:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "\"\"\"A simple Python program\n", "Written by: Neil D. Lawrence\n", "First written: 2013/10/11\n", "Last written: 2013/10/11\"\"\"\n", "\n", "print \"Running a Python application\",\n", "print \"... finished.\"\n", "\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python is freely available and easy to install. I like (love) running programs on Linux. On Ubuntu, Python comes preinstalled (and also on the Raspberry Pi). On Windows and OSX machines we are recommending the [Enthought Python distribution](https://www.enthought.com/products/epd/) which is freely available for academic use. 
However, if you don't want to install Python on your local machine you can also do what I'm doing here, and run it on [Wakari](http://www.wakari.io/). \n", "\n", "We are currently running the IPython Notebook. This is a web driven interface to python that allows us to treat programming as a conversation with a machine, rather than as a series of instructions to the machine. This means I can use Python more like a calculator. Java doesn't have facilities like this." ] }, { "cell_type": "code", "collapsed": false, "input": [ "a = 2*24\n", "print a" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "a = a + 10\n", "print a" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that there is no compilation stage here: we are calling the python interpreter directly to get these results. To do something a little more exciting than this though, it is useful to import some modules.\n", "\n", "## Importing Modules\n", "\n", "Modules in Python are like Java class libraries. They contain code that someone else has written that you can include and call yourself. Let's import a library that allows us to play with values sampled from a uniform distribution between 0 and 1." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy\n", "x = numpy.random.uniform() # This gives us a random number between 0 and 1, uniformly distributed\n", "print x" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of machine learning comes down to manipulation of matrices and vectors: linear algebra. Python wasn't originally a language designed for this, but the `numpy` module provides most of the manipulations we need for arrays in python. We may not want to keep writing `numpy` at the beginning of our commands, so very often we use a special type of import that allows us to abbreviate the module name:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "x = np.random.uniform()\n", "print x" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "numpy is short for numerical python, but as well as providing the numerics, numpy provides *contiguous array* objects. From our perspective, we can think of these objects as matrices or vectors. We can sample a vector of values from a uniform distribution as well." ] }, { "cell_type": "code", "collapsed": false, "input": [ "x = np.random.uniform(size=10)\n", "print x" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get help about any command in the notebook simply type that command followed by a question mark." ] }, { "cell_type": "code", "collapsed": false, "input": [ "np.random.uniform?" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As well as sampling from a uniform distribution, we can sample from the Gaussian or 'normal' distribution. 
The command for doing this in python is" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# sample 10 values from a normal (Gaussian) distribution with mean 0.0 (loc) and standard deviation 1.0 (scale) \n", "x = np.random.normal(loc=0.0, scale=1.0, size=10)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `numpy` module also allows us to perform linear algebra operations, like the inner product (or dot product). We can try a few different ways of doing it. First we'll create two vectors of the same length, then we'll compute their inner product in a couple of different ways." ] }, { "cell_type": "code", "collapsed": false, "input": [ "x = np.random.normal(size=4)\n", "y = np.random.normal(size=4)\n", "# let's see what we generated\n", "print x\n", "print y\n", "# inner product (or dot product) in python can be computed with np.dot\n", "ip = np.dot(x, y)\n", "print \"Inner product is \", ip\n", "# the function np.inner also does the same thing\n", "print \"Inner product is also \", np.inner(x, y)\n", "# Or we can do it explicitly, by multiplying all array elements together and summing\n", "print \"Inner product is also \", np.sum(x*y)\n", "# which can also be written as\n", "print \"Finally, inner product is also \", (x*y).sum()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Computing these inner products with numpy's built in functions can be very fast because it is all optimized code, whereas doing it with a *loop* in python is a lot slower. We'll show this by introducing python for loops. In python, a for loop always iterates through a *list*: in some languages this is called a foreach loop (see http://en.wikipedia.org/wiki/Foreach_loop#Python). To create a for loop a bit more like the counter-based for loop available in Java, we first need to create a list of integers to iterate over. We can use the `xrange` command to create that list of indices. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "# here we create a long array for x and y\n", "length = 1000000\n", "x = np.random.normal(size=length)\n", "y = np.random.normal(size=length)\n", "\n", "# computing with a for loop for 1,000,000 values takes a noticeable amount of time\n", "ip = 0\n", "for i in xrange(len(x)):\n", "    ip += x[i]*y[i]\n", "print ip" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# inner product computed as a built in\n", "np.dot(x, y)\n", "# this takes no noticeable amount of time" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TODO Need to introduce indexing in mathematics and how it maps to indexing in python.\n", "\n", "Now that we've explored python a bit, let's compute some *standard* statistics with it, by sampling from a Gaussian density and computing means and variances.\n", "\n", "## Estimating the Mean\n", "\n", "The sample mean of a set of values is given by the following formula:\n", "\n", "$$\\mu = \\frac{1}{N}\\sum_{i=1}^N x_i$$\n", "\n", "We can compute the sample mean in python by adding all the samples together and dividing by the total number of samples."
] }, { "cell_type": "code", "collapsed": false, "input": [ "x = np.random.normal(size=10)\n", "mean = x.sum()/10\n", "print mean" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or we can use the size of the array to compute the mean." ] }, { "cell_type": "code", "collapsed": false, "input": [ "mean = x.sum()/x.size\n", "print mean" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or finally we can just compute the mean directly." ] }, { "cell_type": "code", "collapsed": false, "input": [ "mean = x.mean()\n", "print mean" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The variance of a set of samples is given by " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\\sigma^2 = \\frac{1}{N}\\sum_{i=1}^N x_i^2 - \\mu^2$$ " ] }, { "cell_type": "code", "collapsed": false, "input": [ "variance = (x*x).mean() - mean*mean\n", "print variance" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convergence of Sample Moments to True Moments\n", "\n", "Underlying these samples is an actual distribution: the Gaussian (or normal) distribution. Because we sampled from a standard normal (i.e. one with a mean of 0 and a variance of 1) we know that in theory the mean and variance of the samples should be 0 and 1. In practice they aren't. Let's use matplotlib, Python's plotting library, to explore what happens as we increase the number of samples. We will increase the number of samples using *loops* and Python *lists*. \n", "\n", "We start by creating empty lists for the means and variances. Then we create a list of sample sizes to iterate through. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "means = []\n", "variances = []\n", "samples = [10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000] \n", "for N in samples:\n", "    x = np.random.normal(loc=0.0, scale=1.0, size=N)\n", "    mean = x.mean()\n", "    variance = (x**2).mean() - mean**2\n", "    means.append(mean)\n", "    variances.append(variance)\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting in Python\n", "\n", "We'll now plot the variance and the mean against the number of samples. To do this, we need to first convert the samples, variances and means from Python lists to numpy arrays." ] }, { "cell_type": "code", "collapsed": false, "input": [ "means = np.array(means)\n", "variances = np.array(variances)\n", "samples = np.array(samples)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we need to include the plotting functionality from matplotlib, and instruct the IPython notebook to include the plots *inline* with the notebook, rather than in a different window. First we import the plotting library, matplotlib." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import matplotlib.pyplot as plt\n", "# This is an IPython magic command to make plots appear inline. 
\n", "# These \"magic\" Commands start with % and are sent to the IPython \n", "# interpreter rather than the Python language interpreter.\n", "%pylab inline" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "plt.plot(samples, means)\n", "xlabel('samples')\n", "ylabel('mean')\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we plot the estimated mean against the number of samples. However, since the samples go up logarithmically it's better to use a logarithmic axis for the x axis, as follows." ] }, { "cell_type": "code", "collapsed": false, "input": [ "plt.semilogx(samples, means)\n", "xlabel('log samples')\n", "ylabel('mean')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Gaussian Density\n", "\n", "Finally, let's try plotting a Gaussian distribution. The Gaussian density is given by the following formula:\n", "\n", "$$p(y) = \\frac{1}{\\sqrt{2\\pi\\sigma^2}} \\exp\\left(-\\frac{1}{2\\sigma^2}(x-\\mu)^2\\right)$$\n", "\n", "where $\\sigma$ is the standard deviation ($\\sigma^2$ is the variance) and $\\mu$ is the mean. First let's set values for the location and scale" ] }, { "cell_type": "code", "collapsed": false, "input": [ "scale = 1.0 # set the standard deviation\n", "loc = 0.0 # set the mean" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's choose some $x$ points where to plot our density" ] }, { "cell_type": "code", "collapsed": false, "input": [ "x_plot = np.linspace(loc - 4*scale, loc + 4*scale, 100) # the x-values to use in the plot" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we use those $x$ values to plot the form of the normal density. In this equation we use the python *operator* ** to square a value," ] }, { "cell_type": "code", "collapsed": false, "input": [ "# compute the values of this density at the locations given by x_plot\n", "py = 1/np.sqrt(2*np.pi*scale**2)*np.exp(-0.5*(x_plot-loc)**2/(scale**2))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally let's sample some values from the same density using `np.random.normal`" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# sample some random values from this density\n", "x_samps = np.random.normal(loc=loc, scale=scale, size=100)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's plot the results. We'll plot the form of the density, and then we'll plot where the samples landed below." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Plot the density\n", "plt.plot(x_plot, py)\n", "\n", "# Plot crosses where the samples landed. Here we plot their x-location a\n", "# gainst an array filled with 0.05, just to show them of the bottom of the plot. \n", "plt.plot(x_samps, 0.05*np.ones_like(x_samps), 'gx')\n", "plt.title('The Normal Density')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Quadratic Error Function\n", "\n", "Remember in class we introduced the normal density and then said that probability distributions were related to error functions by taking the negative log of the probability. 
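\n", "\n", "To make that connection concrete (this is just the algebra, using the density formula above), taking the negative log of the Gaussian density gives\n", "\n", "$$-\\log p(x) = \\frac{1}{2\\sigma^2}(x-\\mu)^2 + \\frac{1}{2}\\log\\left(2\\pi\\sigma^2\\right),$$\n", "\n", "which is a quadratic function of $x$ plus a constant that doesn't depend on $x$. 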
We showed that the normal density was related to a quadratic error function. Let's review that here, by taking the log of this density, and plotting it. The result *should* look like a quadratic." ] }, { "cell_type": "code", "collapsed": false, "input": [ "plt.plot(x_plot, -np.log(py), 'b-')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we get a quadratic form, with the minimum at the same point that the maximum was at for the probability density. This is the essence of the idea that minimizing the error with respect to a paramter is the same as maximizing the probability with respect to the parameter (although normally we call it maximum likelihood, not maximum probability). \n", "\n", "## Supervised Learning\n", "\n", "### Classification\n", "\n", "The first approach to machine learning we are going to look at is classification. In classification we take in a *feature matrix* and make predictions of class labels given the features. The model for classification we saw in class was one where we predicte the class label, $\\dataScalar_i$, given the features associated with that data point, $\\inputVector_i$, using the following formula: \n", "\n", "$$\\dataScalar_i = \\text{sign}\\left(\\mappingVector^\\top \\inputVector_i + b\\right)$$\n", "\n", "This implies that the decision boundary for the classification is given by a *hyperplane*. The vector $\\mappingVector$ is the normal vector to the hyperplane (). The hyperplane is described by the formula $\\mappingVector^\\top \\inputVector = -b$ which uniquely defines the hyperplane in the input dimensions.\n", "\n", "### Perceptron Algorithm\n", "\n", "Now we have the `for` loop and some Python basics, let's see how the perceptron algorithm looks in python. First of all, we will create a data set to play with. We'll do this by sampling from a Gaussian." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# let's sample first from the positive class. To separate \n", "# it from the negative, we'll add 1.5 to the samples.\n", "x_plus = np.random.normal(loc=1.5, size=(50, 2))\n", "# And now the negative class, to separate from the positive\n", "# we subtract 1.5 from the samples\n", "x_minus = np.random.normal(loc=-1.5, size=(50, 2))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# now let's plot our data. We'll make the\n", "# positive class red crosses and the negative\n", "# class green circles\n", "plt.plot(x_plus[:, 0], x_plus[:, 1], 'rx')\n", "plt.plot(x_minus[:, 0], x_minus[:, 1], 'go')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the perceptron algorithm. The way we taught it in class, the first step involves initializing the the weight vector (which remember is *normal* to the decision boundary) with a data point. 
First we decide which data point to use, positive or negative, then we set the normal vector to be equal to that point multiplied by the class label\n", "\n", "$$\\mappingVector = \\dataScalar_i \\inputVector_i$$\n", "\n", "As we saw in class, the predicted label of the $i$th point is given by\n", "\n", "$$\\text{sign}(\\mappingVector^\\top\\inputVector_i)$$\n", "\n", "So setting the initial vector to $\\dataScalar_i\\inputVector_i$ means that the $i$th point's prediction will be given by\n", "\n", "$$\\text{sign}(\\mappingVector^\\top\\inputVector_i) = \\text{sign}(\\dataScalar_i\\inputVector_i^\\top \\inputVector_i)$$\n", "\n", "Because $\\inputVector_i^\\top \\inputVector_i$ is the inner product between a vector and itself, it is always positive, so the sign of the argument will be determined only by the sign of $\\dataScalar_i$. So setting the initial vector to this value ensures that the $i$th data point is correctly classified. Let's see the code now for doing this." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# flip a coin (i.e. generate a random number and check if it is greater than 0.5)\n", "choose_plus = np.random.rand(1)>0.5\n", "if choose_plus:\n", "    # generate a random point from the positives (shape[0] is the number of data points)\n", "    index = np.random.randint(0, x_plus.shape[0])\n", "    w = x_plus[index, :] # set the normal vector to that point.\n", "    b = 1\n", "else:\n", "    # generate a random point from the negatives\n", "    index = np.random.randint(0, x_minus.shape[0])\n", "    w = -x_minus[index, :] # set the normal vector to minus that point.\n", "    b = -1" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Decision Boundary\n", "\n", "Now we will try and plot the decision boundary. To do this we need to do a little bit of maths. It's not part of the perceptron algorithm as such, it's just to help you understand how the plot below is made. \n", "\n", "The decision boundary is defined as the point at which a data point moves from being classified as -1 to being classified as +1. For the model which predicts the label as $\\text{sign}(\\inputVector^\\top \\mappingVector)$, using the definition of the inner product, we can see that if the input is two dimensional, the model we use for prediction can be written as\n", "\n", "$$\\text{sign}(\\mappingScalar_0 + \\mappingScalar_1\\inputScalar_{i,1} + \\mappingScalar_2 \\inputScalar_{i, 2})$$\n", "\n", "where $\\inputScalar_{i, 1}$ is the first feature from the $i$th data point and $\\inputScalar_{i, 2}$ is the second feature from the $i$th data point. Here we are using the trick that includes an extra $\\inputScalar_{i, 0}=1$ to account for the bias. So if we think of $\\mappingScalar_0 = b$ we have\n", "\n", "$$\\text{sign}\\left(b + \\mappingScalar_1 \\inputScalar_{i, 1} + \\mappingScalar_2 \\inputScalar_{i, 2}\\right)$$\n", "\n", "The decision boundary is where the output of the function changes from -1 to +1 (or vice versa) so it's the point at which the argument of the $\\text{sign}$ function is zero. So in other words, the decision boundary is given by the *line* defined by $\\inputScalar_1 \\mappingScalar_1 + \\inputScalar_2 \\mappingScalar_2 = -b$ (where we have dropped the index $i$ for convenience). In this two dimensional space the decision boundary is defined by a line. In a three dimensional space it would be defined by a *plane* and in higher dimensional spaces it is defined by something called a *hyperplane* (http://en.wikipedia.org/wiki/Hyperplane). 
This equation is therefore often known as the *separating hyperplane* because it defines the hyperplane that separates the data. To draw it in 2-D we can choose some values to plot from $\\inputScalar_1$ and then find the corresponding values for $\\inputScalar_2$ to plot using the rearrangement of the hyperplane formula as follows\n", "\n", "$$\\inputScalar_2 = -\\frac{(b+\\inputScalar_1\\mappingScalar_1)}{\\mappingScalar_2}$$\n", "\n", "Of course, we can also choose to specify the values for $\\inputScalar_2$ and compute the values for $\\inputScalar_1$ given the values for $\\inputScalar_2$,\n", "\n", "$$\\inputScalar_1 = -\\frac{b + \\inputScalar_2\\mappingScalar_2}{\\mappingScalar_1}$$\n", "\n", "It turns out that sometimes we need to use the first formula, and sometimes we need to use the second. Which formula we use depends on how the separating hyperplane leaves the plot. \n", "\n", "We want to draw the separating hyperplane in the bounds of the plot which is showing our data. To think about which equation to use, let's consider two separate situations (actually there are a few more). \n", "\n", "1. If the separating hyperplane leaves the top and bottom of the plot then we want to plot a line with values in the $y$ direction (given by $x_2$) given by the upper and lower limits of our plot. The values in the $x$ direction can then be computed from the formula for the plane. \n", "\n", "2. Conversely if the line leaves the sides of the plot then we want to plot a line with values in the $x$ direction given by the limits of the plot. Then the values in the $y$ direction can be computed from the formula. Whether the line leaves the top/bottom or the sides of the plot is dependent on the relative values of $w_1$ and $w_2$. \n", "\n", "This motivates a simple `if` statement to check which situation we're in." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Compute coordinates for the separating hyperplane\n", "plot_limits = np.asarray([-5, 4])\n", "if abs(w[1])>abs(w[0]):\n", " # If w[1]>[w[0] in absolute value, plane is likely to be leaving tops of plot.\n", " x0 = plot_limits\n", " x1 = -(b + x0*w[0])/w[1]\n", "else:\n", " # otherwise plane is likely to be leaving sides of plot.\n", " x1 = plot_limits\n", " x0 = -(b + x1*w[1])/w[0]\n", "\n", "# Plot the data again\n", "plot(x_plus[:, 0], x_plus[:, 1], 'rx')\n", "plot(x_minus[:, 0], x_minus[:, 1], 'go')\n", "# plot a line to represent the separating 'hyperplane'\n", "plot(x0, x1, 'b-')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This little drawing routine we've defined is going to be useful for updating the hyperplane when we update the parameters (to visualize how things change). We don't want paste in the code *every* time we need it, so let's define it as a function which computes the coordinates of the hyperplane. 
To define a function we declare it as follows:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def hyperplane_coordinates(w, b, plot_limits):\n", " \n", " if abs(w[1])>abs(w[0]):\n", " # If w[1]>w[0] in absolute value, plane is likely to be leaving tops of plot.\n", " x0 = plot_limits\n", " x1 = -(b + x0*w[0])/w[1]\n", " else:\n", " # otherwise plane is likely to be leaving sides of plot.\n", " x1 = plot_limits\n", " x0 = -(b + x1*w[1])/w[0]\n", " return x0, x1" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Right, we are now almost ready to write some code that initializes the first value of the weight vector and draws the associated hyperplane. The final component we need is the perceptron weight update. For the perceptron algorithm, the next step is to test the weights we have. We can select a data point at random, and test to see if it is classified correctly. If it is classified correctly, we do nothing. However, if it isn't we need to try and change the weight vector to make the data point classified correctly. We already know that if the weight vector is set to $\\mappingVector = \\dataScalar_i \\inputVector_i$ then the $i$th data point will be correctly classified, but that's no good, because it means forgetting everything we've learnt so far. Instead, we choose to move the weight vector by adding some portion of $\\dataScalar_i \\inputVector_i$ to $\\mappingVector$, " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\\mappingVector_\\text{new} \\leftarrow \\mappingVector_\\text{old} + \\eta\\dataScalar_i \\inputVector_i$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "so that the new classification is given by " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\\text{sign}(\\mappingVector_\\text{old}^\\top\\inputVector_i + \\eta\\dataScalar_i \\inputVector_i^\\top\\inputVector_i)$$ " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So in other words, because $\\inputVector_i^\\top\\inputVector_i$ is positive, then the argument of $\\text{sign}()$ will be adapted in the direction of $\\dataScalar_i$. I.e. the weight vector is pulled in a direction which makes the data vector $\\inputVector_i$ more likely to be classified correctly. In fact, if $\\eta$ is large enough then $\\inputVector_i$ is guaranteed to be correctly classified. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's put it all together. To animate the drawing in the ipython notebook, we need two further modules: `time` which will allow us to pause the algorithm for updating the hyperplane drawing and `IPython.display.clear_output` which will allow us to clear the plot between updates." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "# import the time model to allow python to pause.\n", "import time\n", "# import the IPython display module to clear the output.\n", "from IPython.display import clear_output\n", "\n", "learn_rate = 0.1\n", "num_iterations = 1000\n", "for i in xrange(num_iterations):\n", " w_updated = False\n", " # select a point at random from the data\n", " choose_plus = np.random.uniform(size=1)>0.5\n", " if choose_plus:\n", " # choose a point from the positive data\n", " index = np.random.random_integers(x_plus.shape[0]-1)\n", " if not np.sign(np.dot(w, x_plus[index, :])+b) == 1:\n", " # point is currently incorrectly classified\n", " w = w + learn_rate*x_plus[index, :]\n", " b = b + learn_rate\n", " w_updated = True\n", " else:\n", " # choose a point from the negative data\n", " index = np.random.random_integers(x_minus.shape[0]-1)\n", " if not np.sign(np.dot(w, x_minus[index, :])+b) == -1:\n", " # point is currently incorrectly classified\n", " w = w - learn_rate*x_minus[index, :]\n", " b = b - learn_rate\n", " w_updated = True\n", " if w_updated or i==0: \n", " # Plot the data again\n", " plt.gca().clear()\n", " plt.plot(x_plus[:, 0], x_plus[:, 1], 'rx')\n", " plt.plot(x_minus[:, 0], x_minus[:, 1], 'go')\n", " # plot a line to represent the separating 'hyperplane'\n", " x0, x1 = hyperplane_coordinates(w, b, np.asarray([-5, 4]))\n", " plt.plot(x0, x1, 'b-')\n", " display(plt.gcf())\n", " time.sleep(1)\n", " clear_output()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Perceptron Reflection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How could we improve this algorithm? Think about the following issues. What happens when the classes aren't linearly separable? Why are we only using 1000 iterations, how do we know we have the right result when we stop? How could we select the learning rate, $\\eta$?" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Classification is about mapping an input point to a class label which is either positive or negative. Regression is mapping an input point to a real value. One example of a *model* for regression is a linear regression involving a slope and an intercept." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\\dataScalar_i = m\\inputScalar_i + c$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now need an *algorithm* to fit this model. We are going to use regression to introduce the idea of an error function. The error function is an objective function that we *minimize* to fit our model. One commonly used error function for regression is least squares. The idea in least squares is to minimize the sum of squared differences between our prediction and the observed data. We can write our error function in the following form" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\\errorFunction(m, c) = \\sum_{i=1}^\\numData (\\dataScalar_i - m\\inputScalar_i - c)^2$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create an artifical data set and plot what the contours of this error function look like in python. Artificial data sets are a really useful way to understand a model. By creating an artifical data set that is sampled from the model, we are forced to think about the type of data we are expecting. 
To create an artificial data set we first sample some inputs, let's sample 4 points. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "x = np.random.normal(size=4)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now need to decide on a *true* value for $m$ and a *true* value for *c* to use for generating the data. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "m_true = 1.4\n", "c_true = -3.1" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use these values to create our artificial data. The formula $\\dataScalar_i = m\\inputScalar_i + c$ is translated to code as follows:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "y = m_true*x+c_true" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now plot the artifical data we've created." ] }, { "cell_type": "code", "collapsed": false, "input": [ "plt.plot(x, y, 'rx') # plot data as red crosses" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These points lie exactly on a straight line, that's not very realistic, let's corrupt them with a bit of Gaussian 'noise'." ] }, { "cell_type": "code", "collapsed": false, "input": [ "noise = np.random.normal(scale=0.5, size=4) # standard deviation of the noise is 0.5\n", "y = m_true*x + c_true + noise\n", "plt.plot(x, y, 'rx')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Contour Plot of Error Function\n", "\n", "Now we are going to do something a little bit complicated. We are going to compute the error function for a grid of different values for $m$ and $c$. Doing this will help us visualize the shape of the error function. What we need is to compute a grid of all different values of $m$ and $c$ which we think are sensible. Since we know the true values, let's centre our grid around these. In python, a special command is provided for making the grid. First of all, we use the `linspace` command for creating an array of linearly spaced values of $m$ and $c$, then we use `meshgrid` to create a grid from those two vectors." ] }, { "cell_type": "code", "collapsed": false, "input": [ "m_vals = np.linspace(m_true-3, m_true+3, 100) # create an array of linearly separated values around m_true\n", "c_vals = np.linspace(c_true-3, c_true+3, 100) # create an array of linearly separated values around c_true" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we create the grid on which to compute the error function. `meshgrid` creates a matrix of values that form the grid, for both $c$ and $m$. 
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "m_grid, c_grid = np.meshgrid(m_vals, c_vals) \n", "# we can see that this is now a 100x100 matrix of values with the print command\n", "print \"The dimensionality of m_grid is\", m_grid.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we compute the error function at each of those combinations of $c$ and $m$.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "E_grid = np.zeros((100, 100))\n", "for i in xrange(100):\n", " for j in xrange(100):\n", " E_grid[i, j] = ((y - m_grid[i, j]*x - c_grid[i, j])**2).sum()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This loop computes the error function for each value in `m_grid` and `c_grid`, storing the result in `E_grid`. This allows us to make a contour function plot of the error. We can do this using the `matplotlib` module and the command `contour`. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "plt.contour(m_vals, c_vals, E_grid)\n", "plt.xlabel('$m$')\n", "plt.ylabel('$c$')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We fit a regression model by minimizing this sum of squares error function. One way of doing that is gradient descent. We initialize with a guess for $m$ and $c$ and we update that guess by subtracting a portion of the gradient from the guess. Like walking down a hill in the steepest direction of the hill in order to get to the bottom. So let's start with a guess for $m$ and $c$." ] }, { "cell_type": "code", "collapsed": false, "input": [ "m_star = 3.0\n", "c_star = -5.0" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we need to compute the gradient of the error function, firstly with respect to $c$,\n", "\n", "$$\\frac{\\text{d}\\errorFunction(m, c)}{\\text{d} c} = -2\\sum_{i=1}^\\numData (\\dataScalar_i - m\\inputScalar_i - c)$$\n", "\n", "We'll review how this gradient is derived in a moment, but first let's just see how it's computed in numpy." ] }, { "cell_type": "code", "collapsed": false, "input": [ "c_grad = -2*(y-m_star*x - c_star).sum()\n", "print \"Gradient with respect to c is \", c_grad" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see how the gradient was derived, first note that the $c$ appears in every term in the sum. So we are just differentiating $(\\dataScalar_i - m\\inputScalar_i - c)^2$ for each term in the sum. The gradient of this term with respect to $c$ is simply the gradient of the outer quadratic, multiplied by the gradient with respect to $c$ of the part inside the quadratic. The gradient of a quadratic is two times the argument of the quadratic, and the gradient of the inside linear term is just minus one. This is true for all terms in the sum, so we are left with the sum in the gradient. 
The gradient with respect tom $m$ is similar, but now the gradient of the quadratic's argument is $-\\inputScalar_i$ so the gradient with respect to $m$ is\n", "\n", "$$\\frac{\\text{d}\\errorFunction(m, c)}{\\text{d} m} = -2\\sum_{i=1}^\\numData \\inputScalar_i(\\dataScalar_i - m\\inputScalar_i - c)$$\n", "\n", "which can be implemented in python (numpy) as" ] }, { "cell_type": "code", "collapsed": false, "input": [ "m_grad = -2*(x*(y-m_star*x - c_star)).sum()\n", "print \"Gradient with respect to m is \", m_grad" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gives us the gradients with respect to $m$ and $c$. We can now update our inital guesses for $m$ and $c$ using the gradient. However, we don't want to just add the gradient to $m$ and $c$, we need to take a *small* step in the gradient direction, otherwise we might overshoot the minimum. We want to follow the gradient to get to the minimum, and the gradient changes all the time. The step size has already been introduced, it's again known as the learning rate and is denoted by $\\eta$. \n", "\n", "$$c_\\text{new} \\leftarrow c_{\\text{old}} - \\eta \\frac{\\text{d}\\errorFunction(m, c)}{\\text{d}c}$$ \n", "\n", "gives us an update for our estimate of $c$ (which in the code we've been calling `c_star` to represent a common way of writing a parameter estimate, $c^*$) and \n", "\n", "$$m_\\text{new} \\leftarrow m_{\\text{old}} - \\eta \\frac{\\text{d}\\errorFunction(m, c)}{\\text{d}m}$$\n", "\n", "gives us an update for our estimate of $m$. These two udpates can be coded as" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print \"Original m was \", m_star, \" and original c was \", c_star\n", "learn_rate = 0.01\n", "c_star = c_star - learn_rate*c_grad\n", "m_star = m_star - learn_rate*m_grad\n", "print \"New m is \", m_star, \" and new c is \", c_star\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, to fit the model we need to keep computing the gradients (they change as we change $m$ and $c$) and keep doing the parameter updates. Let's watch a gradient descent in action with the following code." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "# first let's plot the error surface\n", "f, (ax1, ax2) = plt.subplots(1, 2) # this is to create 'side by side axes'\n", "ax1.contour(m_vals, c_vals, E_grid) # this makes the contour plot on axes 1.\n", "plt.xlabel('$m$')\n", "plt.ylabel('$c$')\n", "m_star = 3.0\n", "c_star = -5.0\n", "for i in xrange(100): # do 100 iterations (parameter updates)\n", " # compute the gradients\n", " c_grad = -2*(y-m_star*x - c_star).sum()\n", " m_grad = -2*(x*(y-m_star*x - c_star)).sum()\n", " \n", " # update the parameters\n", " m_star = m_star - learn_rate*m_grad\n", " c_star = c_star - learn_rate*c_grad\n", " \n", " # update the location of our current best guess on the contour plot\n", " ax1.plot(m_star, c_star, 'g*')\n", " \n", " # show the current status on the plot of the data\n", " ax2.plot(x, y, 'rx')\n", " plt.ylim((-9, -1)) # set the y limits of the plot fixed\n", " x_plot = np.asarray(plt.xlim()) # get the x limits of the plot for plotting the current best line fit.\n", " y_plot = m_star*x_plot + c_star\n", " ax2.plot(x_plot, y_plot, 'b-')\n", " display(plt.gcf())\n", " time.sleep(0.25) # pause between iterations to see update\n", " ax2.cla()\n", " clear_output()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Stochastic Gradient Descent\n", "\n", "This approach of steepest gradient descent works fine if the number of data points, $\\numData$ isn't too large. But what do Google do when they have 1 billion data points? We can get an algorithm that is a bit more similar to the perceptron (it looks at one data point at a time rather than summing across *all* data points) by using *stochastic* gradient descent. In this case, instead of computing the true gradient of $\\errorFunction(m, c)$ we just look at one term from the full sum over $\\numData$. The real gradient with respect to $m$ is given by \n", "\n", "$$\\frac{\\text{d}\\errorFunction(m, c)}{\\text{d} m} = -2\\sum_{i=1}^\\numData \\inputScalar_i(\\dataScalar_i - m\\inputScalar_i - c)$$\n", "\n", "but it has $\\numData$ terms in the sum. Substituting in the gradient we can see that the full update is of the form\n", "\n", "$$m_\\text{new} \\leftarrow m_\\text{old} + 2\\eta \\left[\\inputScalar_1 (\\dataScalar_1 - m_\\text{old}\\inputScalar_1 - c_\\text{old}) + (\\inputScalar_2 (\\dataScalar_2 - m_\\text{old}\\inputScalar_2 - c_\\text{old}) + \\dots + (\\inputScalar_\\numData (\\dataScalar_\\numData - m_\\text{old}\\inputScalar_\\numData - c_\\text{old})\\right]$$\n", "\n", "This could be split up into lots of individual updates\n", "\n", "$$m_1 \\leftarrow m_\\text{old} + 2\\eta \\left[\\inputScalar_1 (\\dataScalar_1 - m_\\text{old}\\inputScalar_1 - c_\\text{old})\\right]$$\n", "$$m_2 \\leftarrow m_1 + 2\\eta \\left[\\inputScalar_2 (\\dataScalar_2 - m_\\text{old}\\inputScalar_2 - c_\\text{old})\\right]$$\n", "$$m_3 \\leftarrow m_2 + 2\\eta \\left[\\dots\\right]$$\n", "$$m_\\numData \\leftarrow m_{\\numData-1} + 2\\eta \\left[\\inputScalar_\\numData (\\dataScalar_\\numData - m_\\text{old}\\inputScalar_\\numData - c_\\text{old})\\right]$$\n", "\n", "which would lead to the same final update, but note that we aren't changing the $m$ and $c$ we use for computing the gradient term at each update. In stochastic gradient descent we *do* update $m$ and $c$ we use. This means that we aren't quite following the steepest gradient, but a stochastic approximation to the gradient. 
However, it also means we can just present each data point in a random order, like we did for the perceptron. This makes the algorithm suitable for large scale web use (recently this domain has become known as 'Big Data') and algorithms like this are widely used by Google, Microsoft, Yahoo! and Facebook.\n", "\n", "$$m_1 \\leftarrow m_\\text{old} + 2\\eta \\left[\\inputScalar_1 (\\dataScalar_1 - m_\\text{old}\\inputScalar_1 - c_\\text{old})\\right]$$\n", "$$m_2 \\leftarrow m_1 + 2\\eta \\left[\\inputScalar_2 (\\dataScalar_2 - m_1\\inputScalar_2 - c_1)\\right]$$\n", "$$m_3 \\leftarrow m_2 + 2\\eta \\left[\\dots\\right]$$\n", "$$m_\\numData \\leftarrow m_{\\numData-1} + 2\\eta \\left[\\inputScalar_\\numData (\\dataScalar_\\numData - m_{\\numData - 1}\\inputScalar_\\numData - c_{\\numData-1})\\right]$$\n", "\n", "Or more accurately, since the data is normally presented in a random order, we can just write\n", "\n", "$$m_\\text{new} = m_\\text{old} + 2\\eta\\left[\\inputScalar_i (\\dataScalar_i - m_\\text{old}\\inputScalar_i - c_\\text{old})\\right]$$\n", "\n", "which is easily implemented as code " ] }, { "cell_type": "code", "collapsed": false, "input": [ "# choose a random point for the update (randint excludes the upper bound, so this covers all data points)\n", "i = np.random.randint(x.shape[0])\n", "# update m\n", "m_star = m_star + 2*learn_rate*(x[i]*(y[i]-m_star*x[i] - c_star))\n", "# update c\n", "c_star = c_star + 2*learn_rate*(y[i]-m_star*x[i] - c_star)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Putting it all together in an algorithm, we can do stochastic gradient descent for our regression data." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# first let's plot the error surface\n", "f, (ax1, ax2) = plt.subplots(1, 2) # this is to create 'side by side axes'\n", "ax1.contour(m_vals, c_vals, E_grid) # this makes the contour plot on axes 1.\n", "plt.xlabel('$m$')\n", "plt.ylabel('$c$')\n", "m_star = 3.0\n", "c_star = -5.0\n", "for i in xrange(100): # do 100 iterations (parameter updates)\n", "    # choose a random point\n", "    index = np.random.randint(x.shape[0])\n", "\n", "    # update m\n", "    m_star = m_star + 2*learn_rate*(x[index]*(y[index]-m_star*x[index] - c_star))\n", "    # update c\n", "    c_star = c_star + 2*learn_rate*(y[index]-m_star*x[index] - c_star)\n", "\n", "    # update the location of our current best guess on the contour plot\n", "    ax1.plot(m_star, c_star, 'g*')\n", "\n", "    # show the current status on the plot of the data\n", "    ax2.plot(x, y, 'rx')\n", "    plt.ylim((-9, -1)) # set the y limits of the plot fixed\n", "    x_plot = np.asarray(plt.xlim()) # get the x limits of the plot for plotting the current best line fit.\n", "    y_plot = m_star*x_plot + c_star\n", "    ax2.plot(x_plot, y_plot, 'b-')\n", "    display(plt.gcf())\n", "    time.sleep(0.25) # pause between iterations to see update\n", "    ax2.cla()\n", "    clear_output()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Reflection on Linear Regression and Supervised Learning\n", "\n", "1. *Easy* What effect does the learning rate have in the optimization? What's the effect of making it too small, what's the effect of making it too big? Do you get the same result for both stochastic and steepest gradient descent?\n", "\n", "2. *Harder* The stochastic gradient descent doesn't help very much for such a small data set. It might be fun to try it with more than 4 data points. Can you modify the code above to run with thousands of data points? 
Do you also need to change the number of iterations? Which code base is faster? Does it make sense to summarize both algorithms in terms of the number of iterations?\n", "\n", "3. *Very Hard* Can you use what you've learnt to come up with a batch algorithm for the perceptron? We justified the perceptron by geometric arguments to construct the algorithm. Does it have an error function associated with it? What is the error function? Is the perceptron doing gradient descent on this error function?\n", "\n", "#### Nonlinear Regression with Basis Functions\n", "\n", "In the lectures, we also looked at the idea of doing non linear regression with the same algorithms. The way we chose to do this was to introduce the concept of a basis function. A basis function is a feature space that is computed from our original data. A very common one is a polynomial basis, for example a quadratic basis. In a quadratic basis we assume our regression is given by\n", "\n", "$$\\dataScalar_i = \\mappingScalar_2 \\inputScalar_i^2 + \\mappingScalar_1 \\inputScalar_i + \\mappingScalar_0.$$\n", "\n", "This gives a *nonlinear* relationship between our observation and the inputs. Here the basis vector is given by $\\basisVector = \\left[1\\ \\inputScalar_i\\ \\inputScalar_i^2\\right]$. Although the set up is non-linear we can still use the same type of algorithm we used above. For convenience we have replace $c$ with $\\mappingScalar_0$, $m$ with $\\mappingScalar_1$ and we have added $\\mappingScalar_2$ as the coefficient associated with the quadratic term. This approach to making linear algorithms nonlinear is very common, all sorts of different nonlinearities can be used (not just polynomials). The algorithms stay quite simple as long as they remain *linear* in the parameters. Here *linear* in the parameters means that the parameters can only appear in multiplications and sums, not inside *nonlinear* functions (like the basis functions). In other words they need to be writable as\n", "\n", "$$\\dataScalar_i = \\mappingScalar_3 \\basisScalar_3 + \\mappingScalar_2 \\basisScalar_2 + \\mappingScalar_1 \\basisScalar_1 + \\mappingScalar_0 \\basisScalar_0.$$\n", "\n", "This is what makes linear algebra so important (we've already seen the importance of differential calculus in finding a minimum!). Linear algebra gives us a short-cut to writing down functions like the one above, and makes algorithm derivation with such models much easier. For example, I would always write such an model down as an *inner product*,\n", "\n", "$$\\dataScalar_i = \\mappingVector^\\top \\basisVector$$\n", "\n", "which is a much more compact form. Here I've written the inner product in the matrix notation. This tuns out to be necessary to use when we are using more advanced methods to design algorithms to fit the models described above. We need to use linear algebra (i.e. matrices, determinants, inverses etc.) and *multivariate calculus*.\n", "\n", "## Unsupervised Learning\n", "\n", "The examples of *regression* and *classification* assume that we are given some data, $\\inputVector_i$ and some associated labels for the data $\\dataScalar_i$. Whether we use regression or classification is dependent on whether the label information is discrete (like in classifying whether someone is a friend of yours on Facebook) or whether the label information is continuous (like predicting the demand for bandwidth you might have on a particular day). Because there are labels present, these two examples of learning are both known as *supervised learning*. 
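\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a short aside before we leave supervised learning, here is a minimal sketch (not something we covered in the lectures, and the data and variable names are just made up for illustration) of how the quadratic basis above can be combined with the least squares error function. It builds a basis matrix with columns $1$, $\\inputScalar_i$ and $\\inputScalar_i^2$ and then fits the weights with `np.linalg.lstsq`, which minimizes the sum of squared errors for us." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# a sketch of nonlinear regression with a quadratic basis (illustrative only)\n", "x_data = np.random.normal(size=20)\n", "y_data = 0.5*x_data**2 - 1.0*x_data + 2.0 + np.random.normal(scale=0.1, size=20)\n", "\n", "# build the basis matrix Phi with columns [1, x, x**2]\n", "Phi = np.column_stack((np.ones_like(x_data), x_data, x_data**2))\n", "\n", "# solve the least squares problem for the weight vector w\n", "w, residuals, rank, sv = np.linalg.lstsq(Phi, y_data)\n", "print \"Estimated weights (w_0, w_1, w_2):\", w\n", "\n", "# predictions are an inner product between the weights and the basis vector\n", "x_test = np.linspace(-3, 3, 100)\n", "Phi_test = np.column_stack((np.ones_like(x_test), x_test, x_test**2))\n", "plt.plot(x_data, y_data, 'rx')\n", "plt.plot(x_test, np.dot(Phi_test, w), 'b-')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "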
Another type of learning you might be interested in is when there are no labels present. That is known as *unsupervised learning*. Broadly speaking there are two types of unsupervised learning. The first is when you are looking to automatically categorize your data into different groups (like political parties, or types of animal: mammal, reptile, bird etc). This is known as clustering or in some contexts vector quantisation (we will refer to it as clustering). It is the unsupervised equivalent of classification. \n", "\n", "### Clustering\n", "\n", "We will look at one clustering algorithm: $k$-means clustering. In $k$-means clustering the idea is to come up with a representative 'centre' associated with each of the clusters. This is the *mean* of the cluster. Points are allocated to that cluster centre if they are closer to that cluster centre than any of the other cluster centres. To start the algorithm, you first need a number of centres (these might be chosen randomly from the data). You then allocate each data point to the centre it's closest to. Then you update the values for each centre by setting it to the mean of the points that are allocated to it. To illustrate this in practice, we will first create a data set containing three clusters that we'll create by sampling from Gaussian densities. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Generate data from three Gaussian densities\n", "# X0 is Gaussian centred at 2.5, 2.5\n", "X0 = np.random.normal(loc=2.5, size=(30, 2))\n", "# X1 is Gaussian centred at -2.5, -2.5\n", "X1 = np.random.normal(loc=-2.5, size=(30, 2))\n", "# X2 empty for the moment\n", "X2 = np.zeros((30, 2))\n", "\n", "# put first column as Gaussian centred at 2.5\n", "X2[:, 0] = np.random.normal(loc=2.5, size=30)\n", "\n", "# put second column as Gaussian centred at -2.5\n", "X2[:, 1] = np.random.normal(loc=-2.5, size=30)\n", "\n", "plt.plot(X0[:, 0], X0[:, 1], 'rx')\n", "plt.plot(X1[:, 0], X1[:, 1], 'go')\n", "plt.plot(X2[:, 0], X2[:, 1], 'b+')\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Put all the data together in one matrix.\n", "X = np.vstack((X0, X1, X2))\n", "num_centres = 3\n", "num_data = X.shape[0]\n", "# choose three of the data points to be the initial centres of the clusters\n", "centre_indices = np.random.permutation(num_data)[:num_centres]\n", "centres = X[centre_indices, :]\n", "allocation = np.zeros(num_data)\n", "\n", "# one iteration of k-means: allocate each point to its nearest centre ...\n", "for i in xrange(num_data):\n", "    dists = ((centres - X[i, :])**2).sum(1)\n", "    allocation[i] = np.argmin(dists)\n", "# ... then update each centre to the mean of the points allocated to it.\n", "# In practice you would repeat these two steps until the allocations stop changing.\n", "for j in xrange(num_centres):\n", "    centres[j, :] = X[allocation==j, :].mean(0)\n", "print centres" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dimensionality Reduction\n", "\n", "Dimensionality reduction is the other main approach to unsupervised learning. We haven't looked closely at any dimensionality reduction algorithms, but the idea with them is to try and summarize your data set with a reduced number of 'latent' variables. Like a codebook. Dimensionality reduction algorithms were first studied in the social sciences, in particular in psychology. The IQ measurement is an attempt to reduce all facets of your intelligence down to one single value. Talking about how left wing or right wing someone is is an attempt to reduce all their political opinions down to one value. 
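\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The classic algorithm of this type is principal components analysis. We haven't covered it in the lectures, but as a rough sketch of the idea (on invented data, just for illustration), the code below centres a small two dimensional data set, computes its covariance matrix, and projects the data onto the eigenvector with the largest eigenvalue, summarizing each data point with a single 'latent' value." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# a rough sketch of principal components analysis (illustrative only)\n", "Y = np.random.normal(size=(100, 2))\n", "Y[:, 1] = 0.8*Y[:, 0] + 0.3*Y[:, 1] # make the two columns correlated\n", "\n", "Y_centred = Y - Y.mean(0) # centre the data\n", "C = np.dot(Y_centred.T, Y_centred)/Y.shape[0] # sample covariance matrix\n", "eig_vals, eig_vecs = np.linalg.eigh(C) # eigendecomposition of the symmetric matrix C\n", "\n", "# the principal component is the eigenvector with the largest eigenvalue\n", "principal = eig_vecs[:, np.argmax(eig_vals)]\n", "latent = np.dot(Y_centred, principal) # one 'latent' value per data point\n", "print \"Proportion of variance captured: \", eig_vals.max()/eig_vals.sum()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "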
In practice, to make good models, we often need more than one value to summarize the complexity of a data set. Common approaches for dimensionality reduction include principal components analysis and factor analysis.\n", "\n", "## Probability: The Calculus of Uncertainty\n", "\n", "We have spent most of our introduction to artificial intelligence and machine learning focussing on particular problem types which we have categorized into supervised and unsupervised learning. The wider field of artificial intelligence contains many challenges that aren't so easy to categorize. In fact there are times when such categorizations can become a hindrance to progress. For example, is robot navigation supervised or unsupervised learning? I'm not sure, but I think it depends on whether the robot has a map or not. But how does the map appear as labels? \n", "\n", "In general, modern artificial intelligence is about modelling your data and propagating the uncertainty around your system. The main way we do this is through probability, and in particular Bayes' rule. All the algorithms we have introduced have a probabilistic interpretation. The common way of finding it is to look at the error function and see what the equivalent maximum likelihood model is. Remember that the error function can often be derived as the negative log of the probability distribution we use for defining the model.\n", "\n", "When using probabilities for artificial intelligence, we think about two things: the state of the world (often represented by $\\latentMatrix$) and our belief about that state, which is represented by a probability distribution. The model involves specifying a probability distribution over what we think the most likely state of the world is, making observations, and updating that probability distribution. A nice example is robot navigation. \n", "\n", "A robot's state includes its $x$, $y$ position, the direction it's facing, and perhaps a number of other variables such as where its arms and legs are, etc. You can see quite quickly how the state space of the robot increases. Maybe the robot doesn't know exactly where it is: it is uncertain. The robot expresses this uncertainty by placing a probability distribution over its location: $p(\\latentMatrix)$. This is known as the *prior* distribution. When the robot makes an observation, for example of a landmark like a wall, or perhaps, if the robot is in the Peak District, a particular hill, the robot can absorb that information through probability. It does this by modelling the observations, $\\dataMatrix$, with a probability distribution that says what the observations should be given the state of the robot, $p(\\dataMatrix|\\latentMatrix)$. This distribution is known as the *likelihood*. What we want to know is the robot's updated belief about its location after the observation, $p(\\latentMatrix|\\dataMatrix)$. This is known as the *posterior* distribution. \n", "\n", "These three distributions are related by Bayes' rule, which comes about from combining the two main rules of probability. 
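\n", "\n", "As an aside, we can make the earlier remark about error functions concrete. As a sketch, using the notation of the regression section, suppose each observation is corrupted by Gaussian noise, so that $p(\\dataScalar_i|\\inputScalar_i, \\mappingVector) = \\gaussianDist{\\dataScalar_i}{\\mappingVector^\\top \\basisVector_i}{\\dataStd^2}$. Then the negative log probability of the data set is\n", "\n", "$$-\\sum_{i=1}^{\\numData} \\log p(\\dataScalar_i|\\inputScalar_i, \\mappingVector) = \\frac{1}{2\\dataStd^2}\\sum_{i=1}^{\\numData} \\left(\\dataScalar_i - \\mappingVector^\\top \\basisVector_i\\right)^2 + \\frac{\\numData}{2}\\log\\left(2\\pi\\dataStd^2\\right),$$\n", "\n", "which, up to terms that don't depend on $\\mappingVector$, is a sum of squares error. Maximising the likelihood is therefore the same as minimising that error function. With that connection in mind, here are the two rules.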
\n", "\n", "#### Product Rule of Probability\n", "\n", "The product rule of probability says that the joint distribution, $p(\\latentMatrix, \\dataMatrix)$ can be computed from the conditional and marginal distributions as follows \n", "\n", "$$p(\\latentMatrix, \\dataMatrix) = p(\\dataMatrix | \\latentMatrix) p(\\latentMatrix)$$\n", "\n", "or conversely\n", "\n", "$$p(\\latentMatrix, \\dataMatrix) = p(\\latentMatrix |\\dataMatrix) p(\\dataMatrix).$$\n", "\n", "The product rule gives the relationship between the marginal probability, the conditional probability and the joint probability.\n", "\n", "#### Sum Rule of Probability\n", "\n", "The sum rule of probability gives the relationship between the marginal probabability and the joint probability. \n", "\n", "$$p(\\dataMatrix) = \\sum_{\\latentMatrix} p(\\dataMatrix, \\latentMatrix)$$\n", "\n", "where here the sum is over all possible values that $\\latentMatrix$ can take (in the robot example, over all possible states of the robot! For continuous density functions, the sum has the form of an integral.\n", "\n", "#### Bayes' Rule\n", "\n", "Bayes' rule is so simple to derive from these two rules, that it doesn't really deserve a name (in fact I don't like the name Bayes rule, because it was first used by other people including Laplace). However, that's the common name for it. Bayes' rule gives us the relationship between what we new before we observed the location of the hill ($p(\\latentMatrix)$) and what we know after ($p(\\latentMatrix|\\dataMatrix)$. The relationship is given by equating the two possible forms of the joint distribution as given by the product rule, so we have\n", "\n", "$$p(\\latentMatrix|\\dataMatrix)p(\\dataMatrix) = p(\\dataMatrix|\\latentMatrix) p(\\latentMatrix)$$\n", "\n", "With a little rearrangement we can have\n", "\n", "$$p(\\latentMatrix|\\dataMatrix) = \\frac{p(\\dataMatrix|\\latentMatrix) p(\\latentMatrix)}{p(\\dataMatrix)}$$\n", "\n", "Everything in this rule is specified by our model of the world: our prior belief, $p(\\latentMatrix)$, and the relationship between our measurements and the state of the world, known as the likelihood, $p(\\dataMatrix|\\latentMatrix)$. The only thing we're missing is the marginal likelihood, $p(\\dataMatrix)$, but this can be derived through the sum rule and product rules,\n", "\n", "$$p(\\dataMatrix) = \\sum_{\\latentMatrix} p(\\dataMatrix, \\latentMatrix) = \\sum_{\\latentMatrix} p(\\dataMatrix|\\latentMatrix) p(\\latentMatrix).$$\n", "\n", "Now we can express everything we need to get our updated belief about the state of the world through knowing the likelihood and the prior, i.e. our initial belief about the state of the world, and the relationship between the state of the world and the types of measurements we can make. The major difficulty is in computing the sum: it can involve many terms and therefore be intractable. Or in continuous systems, it involves a high dimensional integral, which is also difficult to do. \n", "\n", "Despite the challenges associated with computing the marginal likelihood, *Bayesian inference* (as it's widely known, although I prefer the simpler term 'the calculus of uncertainty') is the most promising approach we have for practical Aritificial Intelligence. Many of the advances you may have heard about in the media (such as Sebastian Thrun and the Google driverless car) rely on application of this formula. Amazing what can emerge from a fairly simple set of rules. 
Of course, practical application requires a great deal of engineering knowledge and approximation. In robot navigation, Bayes' rule is applied repeatedly to absorb information as it arrives; when this is done the resulting algorithm is known as a *Bayes filter*. " ] } ], "metadata": {} } ] }