{ "metadata": { "name": "", "signature": "sha256:eeb4c569cb7ab3aedb2e10fb7a19a3555b751cfc190daad085140985dfce1c23" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Welcome to IPython Notebook. We will be using these notebooks to run the labs. If you're reading this it likely means that you have successfully setup IPython on your computer.\n", "\n", "As many of you likely already know python, feel free to skip to some of the harder stuff in sections 2.3, 3, 4 and 5. For those that have never seen python before please see these links:\n", "\n", "https://groklearning.com/course/intro-python-1/\n", "\n", "https://groklearning.com/course/intro-python-2/. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "# To allow compatibility between python 2 and python 3\n", "from __future__ import print_function" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Basic Python Syntax\n", "======================\n", "\n", "If you just need a quick refresher here is some simple python syntax. Remember that indentation is important in python. To run the *cell* you can press `Ctrl-Enter` or hit the *Play* button at the top. Please edit the code and experiment." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# This is a comment line\n", "# numbers and variables\n", "age = 26\n", "pi = 3.14159" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# strings and methods\n", "s = 'Hugh F Durrant-Whyte'\n", "\n", "# Strings have a method `split`\n", "tokens = s.split()\n", "# tokens is now a list of strings\n", "print(tokens)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "firstName = tokens[0]\n", "middleName = tokens[1]\n", "lastName = tokens[2]\n", "s2 = firstName + ' ' + middleName + ' ' + lastName" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# 'if' statement - indentation matters\n", "if s == s2:\n", " print('yes the strings are equal')\n", "else:\n", " print('no')\n", "\n", "# if statements can also be inline\n", "answer = 'yes' if s == s2 else 'no' " ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# list (mutable ordered sequence)\n", "beatles = ['John', 'Paul', 'George']\n", "beatles.append('Ringo')\n", "print(beatles)\n", "print('Ringo' in beatles)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# 'for' loop - indentation matters\n", "# Note that name is defined inside the for loop\n", "for name in beatles:\n", " print('Hello ' + name)\n", "\n", "# Iterating over a range of numbers is easy\n", "# range has the following arguments (start, stop, step) where stop isn't included\n", "for number in range(2, 10, 2):\n", " print(number)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# tuple (immutable ordered sequence)\n", "ages = (18, 21, 28, 21, 22, 18, 19, 34, 9)\n", "\n", "# Note you can't change the contents of a tuple\n", "\n", "# set (mutable, unordered, no duplicates)\n", "uniqueAges = set(ages)\n", "uniqueAges.add(18) # already in set, no effect\n", "uniqueAges.remove(21)\n", "\n", "# testing set membership (very fast)\n", "if 18 in uniqueAges:\n", " 
print('There is an 18-year-old present!')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# sorting a list\n", "sorted(beatles) # returns a new sorted list\n", "beatles.sort() # in-place - changes beatles list\n", "\n", "# Sorting a set returns a list\n", "orderedUniqueAges = sorted(uniqueAges)\n", "\n", "# There is no guaranteed order when iterating over a set\n", "for thisAge in uniqueAges:\n", " print(thisAge)\n", "\n", "# Instead iterate over the sorted set:\n", "for age in sorted(uniqueAges):\n", " print(age)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# dict - mapping unique keys to values\n", "netWorth = {}\n", "netWorth['Donald Trump'] = 3000000000\n", "netWorth['Bill Gates'] = 58000000000\n", "netWorth['Tom Cruise'] = 40000000\n", "netWorth['Joe Postdoc'] = 20000\n", "\n", "# Access the value associated with a key\n", "print(netWorth['Donald Trump'])\n", "\n", "# iterating over a dict gives keys\n", "for personName in netWorth:\n", " print(personName + \" is worth: \")\n", " print(netWorth[personName])\n", "\n", "# You can also iterate over key-value pairs:\n", "for (person, worth) in netWorth.items():\n", " if worth < 1000000:\n", " print('haha ' + person + ' is not a millionaire')\n", "\n", "# testing dict membership is the same as with a set\n", "if 'Tom Cruise' in netWorth:\n", " print('show me the money!')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "def some_func(arg1, arg2, n=5):\n", " \"\"\"\n", " Triple quoted strings act as function documentation.\n", " Arguments can be optional by including a default\n", " \"\"\"\n", " # Inline comment\n", " ranges = [0, 1, 2, 3] # Create a list\n", " # Create a range object, similar to a list of numbers\n", " # starting at 0 going up to but not including n\n", " ranges = range(0, n)\n", " \n", " print('Fancy result formating using .format')\n", " for i in ranges:\n", " result = arg1 + arg2 \n", " print('result ({}) = arg1 ({}) + arg2 ({})'.format(result, arg1, arg2))\n", " \n", "# Call the function with arguments (change me)\n", "some_func(4, 5, 3)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can import additional modules, including the builtin `math` module:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import math\n", "print(math.sqrt(1125899906842624))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.2 File I/O\n", "\n", "Now for a basic, but still very useful, data science example: a csv file parser." 
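, "\n", "The parser assumes a file called `sample-clusters.csv` exists in the working directory. If you don't have it, the next cell is a minimal sketch (an addition for convenience, not part of the original lab data) that writes a small stand-in file of whitespace-separated floats so the rest of this section can run." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Hedged sketch: only needed if you don't already have sample-clusters.csv.\n", "# Writes 20 rows of 3 whitespace-separated random floats as a stand-in file.\n", "import os\n", "import random\n", "\n", "if not os.path.exists('sample-clusters.csv'):\n", "    with open('sample-clusters.csv', 'w') as out:\n", "        for _ in range(20):\n", "            out.write(' '.join(str(random.random()) for _ in range(3)) + '\\n')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, peek at the first line of the file with a shell command (the `!` prefix runs a shell command from within the notebook):"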
] }, { "cell_type": "code", "collapsed": false, "input": [ "!head -n1 \"sample-clusters.csv\"" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "csvfilename = \"sample-clusters.csv\"\n", "all_data = []\n", "with open(csvfilename) as f:\n", " for linenumber, line in enumerate(f, start=1):\n", " stringData = line.strip().split()\n", " floatData = list(map(float, stringData))\n", " if linenumber < 5: \n", " print(linenumber, *floatData)\n", " all_data.append(floatData)\n", "\n", "print(\"Parsed CSV comprising {} rows by {} columns\".format(len(all_data), len(floatData)))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Depending on the file size you may not be able to fit everything in memory, so instead of appending each row to the `all_data` list you would do some more processing to extract just the feature you require. In more extreme cases you would `open` an output file and write the relevant features in a row by row fashion.\n", "\n", "If you want a bit of practise writing Python before continuing, write code to sum up each column in the given csv file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Linear Algebra in Python\n", "=========================\n", "\n", "2.1 Linear Algebra Examples\n", "----------------------------\n", "\n", "To perform linear algebra in python use the `numpy` library, you can import the numpy library like so:\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we have imported the numpy library with the alias `np`. With `numpy` you can perform basic linear algebra operations. It's important to remember that Python uses **zero-based-indexing** and uses `[ ]` for indexing variables. Numpy also uses the `:` to span a dimension.\n", "\n", "Finally `numpy` by default interprets the `*` to mean element-wise multiplication, which is the opposite of MATLAB. If you are a MATLAB user, you may find the following site helpful:\n", "http://wiki.scipy.org/NumPy_for_Matlab_Users\n", "\n", "The following snippet is some examples of numpy usage:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Create a new array with three elements\n", "A = np.array([1, 2, 3])\n", "print(A)\n", "\n", "# Note that the shape is the size of its dimensions, \n", "# while size is the number of elements.\n", "print('A is of dimension {A.ndim}, with shape: {A.shape}, and size {A.size}'.format(A=A))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that unlike `MATLAB`'s `rand` function, `numpy.random.random` has a single variable for size, but this can contain a list of dimension sizes. In the example below we pass the list `[4,3]` which is interpreted as the dimensions for the new matrix. 
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Create a random 4x3 array\n", "B = np.random.random(size=[4,3]) # Equivalent to rand(4,3) in Matlab.\n", "print(B)\n", "print('B is of dimension {A.ndim}, with shape: {A.shape}, and size {A.size}'.format(A=B))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# We can index an array like a list\n", "print(B[0]) # Access the first row\n", "\n", "# Multidimensions can be handled in two ways:\n", "print(B[0][0])\n", "print(B[0, 0])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# We can slice arrays\n", "print(B[0, :]) # Get the first row\n", "print(B[:, -1]) # Get the last column" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Slices can have start/stop/step points defined\n", "\n", "# First two elements of the first column\n", "print(B[:2, 0])\n", "\n", "# Every second element of the first column\n", "print(B[::2, 0])\n", "\n", "# same again but starting from 1\n", "print(B[1::2, 0])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For people familiar with MATLAB, zero indexing can be tricky. A common mistake is the length of vectors. In MATLAB `[0:2]` results in the vector `[0, 1, 2]`, but in python... indexing is **up to but not including** the last element:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Construct the vector [0, 1, ... 9]\n", "tmp = np.arange(0, 10)\n", "\n", "print(tmp)\n", "print(tmp[0:2])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Numpy Arrays have lots of methods defined on them\n", "print(B.mean())\n", "print(B.max())" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Many of these methods can operate on one axis\n", "# Eg to find the mean and max of each column of B:\n", "print(B.mean(axis=0))\n", "print(B.max(axis=0))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Examples of matrix manipulation" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Note that you can use variable defined earlier in the notebook.\n", "# Create a matrix of ones.\n", "C = np.ones((4, 3))\n", "print(C)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Calculate the element wise product.\n", "print(B*C)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Calculate the matrix product. Note the dot method performs matrix mult\n", "# .T is the transpose\n", "print(B.dot(C.T)) " ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also perform matrix inversions, using the `linalg` submodule." ] }, { "cell_type": "code", "collapsed": false, "input": [ "D = B.T.dot(B) # Form an invertible matrix (Positive Definite Matrix).\n", "print(D)\n", "print(np.linalg.inv(D)) # Matrix inverse.\n", "print(np.linalg.cholesky(D)) # Cholesky decomposition." 
], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Eigen decomposition returns two variables, eigen values and eigen vectors\n", "vals, vecs = np.linalg.eig(D)\n", "print('Eigen values = {}'.format(vals))\n", "print('''Eigen vectors = \n", "{}'''.format(vecs))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should also checkout `scipy` if you ever find that numpy is missing functionality that you require. http://www.scipy.org/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2.2 Getting Help\n", "----------------\n", "\n", "Occasionally, you will need to know the parameters of a function or need to read its description. You can use the ? operator to bring up a new window inside IPython's browser with that functions docstring. This works on functions and modules as well. Also IPython notebook can auto-complete for variables that have already been instanciated, just press ``. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "np.random.random?" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "np.random?" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also try auto-completing functions in the `np.random` module by typing `np.random.`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2.3 Input/Output and a simple problem\n", "--------------------------\n", "\n", "Its very useful to be able to plot your data and read and write data to files. Lets construct a toy problem, plot the results and then save them to a file.\n", "\n", "Problem: The position of a robot oscilates randomly in a 3-d volume for 1000 steps. The oscilation can be modelled as a random perturbation of Gaussian noise centred on its current location. The noise has a variance of 2 world units. It begins at the origin. Calculate a possible trajectory and save it to disk as a matlab matrix.\n", "\n", "The robot state at time $t$ is:\n", "\\begin{equation}\\mathbf{x}_t = \n", "\\begin{bmatrix}\n", "x \\\\\n", "y \\\\\n", "z\n", "\\end{bmatrix}\n", "\\end{equation}\n", "\n", "The robots position at time t=0 is:\n", "\n", "\\begin{equation}\\mathbf{x}_0 = \n", "\\begin{bmatrix}\n", "0 \\\\\n", "0 \\\\\n", "0\n", "\\end{bmatrix}\n", "\\end{equation}\n", "\n", "The robots position is perturbed by Gaussian noise. The robots state at time $t+1$ is:\n", "\\begin{equation}\\mathbf{x}_{t+1} = \\mathbf{x}_t + \\mathcal{N}\\left(\\begin{bmatrix}0\\\\ 0 \\\\ 0\\end{bmatrix}, \n", "\\begin{bmatrix}\n", "2 & 0 & 0\\\\\n", "0 & 2 & 0\\\\\n", "0 & 0 & 2\\end{bmatrix}\\right)\n", "\\end{equation}\n", "\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline \n", "import matplotlib.pyplot as pl \n", "\n", "dims = 3\n", "steps = 1000\n", "\n", "state = np.zeros((1, dims))\n", "\n", "history = np.zeros([steps, dims]) #Observe that the dimension is a single variable.\n", "\n", "step_size = 2\n", "for i in range(0, steps):\n", " history[i,:] = state\n", " \n", " state = state + 2 * np.random.normal(size=(1,dims))\n", "\n", "pl.plot(history[:,0], history[:,1]);" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting in 3D is not as simple in matplotlib but can be done as shown below. 
If you execute this code without the `%matplotlib inline` directive ever being called in this notebook, it should produce a new window that will allow you to manipulate the figures. You may need to comment `%matplotlib inline` above, restart the kernel (Kernel->Restart->Restart) and re-run all the above cells if you wish to do this." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import mpl_toolkits.mplot3d.art3d\n", "import mpl_toolkits.mplot3d.axes3d as a3\n", "\n", "ax = a3.Axes3D(pl.figure())\n", "ax.plot(history[:,0], history[:,1], history[:,2]);" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Save our numpy array to disk in a Python format\n", "np.save('robot.npy', history)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To save a matlab compatible file, use the `scipy.io` module. The second argument is a python dictionary that uses the name of the matrix as the key and the value as the actual numpy array." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Save as a matlab file \n", "import scipy.io\n", "# Save mat takes a filename and a 'dictionary' of variables to save.\n", "scipy.io.savemat('robot.mat', {'robot': history})" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Principal Component Analysis (PCA)\n", "========================================\n", "$\\newcommand{\\X}{\\mathbf{X}}$\n", "\n", "[Principal component analysis](http://en.wikipedia.org/wiki/Principal_component_analysis) is an incredibly useful technique for visualising data and as a pre-processing step for machine learning algorithms like regression and clustering. Essentially it is a dimensionality reduction technique for data, e.g. if your data has 100 dimensions and you would like to visualise it in 2D, PCA will find a linear combination of the 100 features \"best\" for visualising in 2D. In this case, \"best\" is assumed to be dimensions with the most variance (which in many cases may not actually be a good assumption).\n", "\n", "Alternatively, PCA can be thought of as a mechanism for de-correlating your data, i.e. it will find a projection of your data such that the covariance matrix of the projected data is *diagonal only*. Using this projected data can in some cases improve the speed and performance of other machine learning algorithms. The following description and exercises should give you a better intuition for that PCA is actually doing.\n", "\n", "Let $\\X$ be an $n\\times m$ matrix, where $n$ is the number of observations and $m$ are the \"features\". Features are observed variables or combinations thereof, and relate to a specific observation. \n", "\n", "If $\\X$ has zero mean, then we can construct the empirical covariance matrix $\\mathbf{C}$ as:\n", "\n", "$\\mathbf{C}$ = $\\X^T\\X$\n", "\n", "PCA attempts to find a new coordinate system such that the variance is maximised along the first axis or component. It turns out that this is equivalent to the eigen-decomposition of $\\mathbf{C} = \\mathbf{V}\\mathbf{\\Sigma}\\mathbf{V}^T$. Where $\\mathbf{V}$ is a $m \\times m$ matrix of where each column the new basis (or eigen-vector) and $\\mathbf{\\Sigma}$ is a diagonal matrix of eigenvalues. 
\n", "\n", "Each eigen-vector is associated with one eigenvalue, by choosing a subset of the eigenvectors (typically those associated with the largest eigenvalues), we can reduce the dimensionality of the feature vector by projecting it into this lower dimensional space.\n", "\n", "Let $\\hat{\\mathbf{V}}$ be a $m \\times k$ matrix of eigenvectors where only $k$ of the $m$ columns associated with the largest eigenvalues have been chosen. We can project the data onto the new space to from $\\hat{\\X}$ as follows:\n", "\n", "$\\hat{\\X} = \\X\\hat{\\mathbf{V}}$\n", "\n", "$\\hat{\\X}$ is an approximation of $\\X$ with shape $n \\times k$.\n", "\n", "See Section 12.1 of *Pattern Recognition and Machine Learning* by Bishop for more details. \n", "\n", "\n", "\n", "3.1 PCA on synthetic data\n", "-------------------------\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will generate two Gaussians such that they are partially overlapping" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Step 1: Generate a 2-d dataset.\n", "cov1 = np.array([[6,3],[3,6]])\n", "cov2 = np.array([[6,3],[3,6]])\n", "\n", "data1 = np.random.multivariate_normal(np.array([4, 4]), cov1, 500)\n", "data2 = np.random.multivariate_normal(np.array([-4, -4]), cov2, 500)\n", "data = np.vstack([data1, data2])\n", "\n", "# We need to 'centre' the data, by subtracting the mean of each column.\n", "data = data - data.mean(axis=0)\n", "\n", "# Slice the data by taking x = first column, y = second column.\n", "x = data[:, 0]\n", "y = data[:, 1]\n", "\n", "# Plot the data\n", "pl.scatter(x,y,marker='x',color='b')\n", "pl.title('Random Dataset')\n", "pl.xlabel('variable 1')\n", "pl.ylabel('variable 2');" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we will plot often, lets write a function to do this. Note that function variables can be given default values, below we will have a default variable `name` set to `'Random Dataset'`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def plot_points(data, name='Random Dataset'):\n", " \"\"\"Plot the given 2d data on a scatter plot\"\"\"\n", " pl.scatter(data[:,0], data[:,1], marker='x', color='b')\n", " pl.title(name)\n", " pl.xlabel('variable 1')\n", " pl.ylabel('variable 2')\n", " pl.axis('equal')\n", "\n", "plot_points(data)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "S = data.T.dot(data)\n", "S.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "u, v = np.linalg.eig(S)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the `eig` function does not return ordered eigenvalues, thus we must inspect the eigenvalues to know which eigenvector is the principal component" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print('Eigenvalues are: {}'.format(u))\n", "max_eigen_index = np.argsort(u)[-1] # np.argsort orders in ascending order, [-1] indexes the final element of the list\n", "min_eigen_index = np.argsort(u)[0]\n", "print('The largest eigenvalue is at index : {}'.format(max_eigen_index))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the data, use `pl.arrow` to draw the arrows of the eigenvalues. Transform the data and plot it using the eigenvectors as the principal axes. 
You can transform the data by:\n", "$\\mathbf{X}_p = \\mathbf{X}\\mathbf{V^T}$, where $\\mathbf{V}$ is the matrix of eigen-vectors." ] }, { "cell_type": "code", "collapsed": false, "input": [ "plot_points(data)\n", "\n", "pl.arrow(0, 0, 10*v[0,0], 10*v[1,0] , color='r', head_width=0.5)\n", "pl.arrow(0, 0, 10*v[0,1],10*v[1,1] , color='m', head_width=0.5);" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "transformed = data.dot(v)\n", "\n", "plot_points(transformed, name='transformed')\n", "\n", "pl.arrow(0, 0, 10,0 , color='r', head_width=0.5) \n", "pl.arrow(0, 0, 0,10 , color='m', head_width=0.5);" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets project the data onto the first axis, and plot these points. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "first_component = np.zeros((2,2))\n", "first_component[:,max_eigen_index] = v[:,max_eigen_index] \n", "\n", "reconstructed = data.dot(first_component.dot(first_component.T))\n", "\n", "\n", "pl.scatter(reconstructed[:,0], reconstructed[:,1])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous plot it is difficult to see if there has been any benefit from performing PCA, as all the points are overlayed ontop of each other. These exercises will explore the use of histograms to illustrate the benefit of PCA.\n", "\n", "### Exercise 3.1 (a)\n", "\n", "Plot a histogram of all the data projected onto the first principal component of the data. Use the command `pl.hist`.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "d1 = data.dot(v[:,max_eigen_index])\n", "pl.hist(d1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should see that two peaks are clearly visible. We can improve this plot by using color.\n", "\n", "### Exercise 3.1 (b)\n", "\n", "Project the data from each of the generating distributions contained in `data1` and `data2` through the principal component of the whole data. \n", "\n", "Plot histograms of both projections on the same plot. Change the color and alpha value of the plots so they are easily distinguished. \n", "\n", "Use the ? operator if you need help using `pl.hist` . The attributes you need to change are `color` and `alpha`.\n", "\n", "Do the same for the second principal component of the data." ] }, { "cell_type": "code", "collapsed": false, "input": [ "for i in [max_eigen_index, min_eigen_index]:\n", " pl.figure()\n", " # The zip command merges two lists like a 'zipper'\n", " for d, c in zip([data1, data2], ['r','b']):\n", " transformed = d.dot(v[:,i])\n", " pl.hist(transformed, color=c, alpha=0.5)\n", " pl.title('Axis = {}'.format(i))\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3.1(c)\n", "\n", "Construct a plot similar to the one in 3.1(b), but this time use the data's x-axis. Do the same for the data's y-axis. 
If your task was to separate the clusters, which of the four plots would you prefer to use?\n", "\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for i in range(2):\n", "    pl.figure()\n", "    # The zip command merges two lists like a 'zipper'\n", "    for d, c in zip([data1, data2], ['r','b']):\n", "        pl.hist(d[:,i], color=c, alpha=0.5)\n", "    pl.title('Axis = {}'.format(i))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3.2 PCA on faces dataset\n", "-------------------------\n", "\n", "So far we have seen that PCA can be used to produce a small number of informative features from a large set of features (in the previous case only 1 informative feature was created). However, is there any information left in the other principal components? This next section hopes to qualitatively demonstrate this using the Olivetti faces dataset.\n", "\n", "To do this we will need to download some open data. The ML package `scikit-learn` contains some useful tools for doing just this. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sklearn.datasets as datasets" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use the Olivetti faces dataset. You can read about it here:\n", "http://scikit-learn.org/stable/datasets/olivetti_faces.html. Importantly, it consists of 400 images of size 64x64, each stored in a single row of the data matrix. We can consider each pixel to be a feature, while each row is an observation. \n", "\n", "(Note: If you have been switching between Python 2 and Python 3 you may get an error when you run the cell below. Just delete the folder `~/scikit_learn_data` in your home directory and try the download again.)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Download the dataset (can take some time)\n", "dataObject = datasets.fetch_olivetti_faces(shuffle=True, random_state=100)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "raw_faces = dataObject.data" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# To improve the contrast, let's remove the average value of each pixel, and then also the average of each image.\n", "per_pixel_centred = raw_faces - raw_faces.mean(axis=0)\n", "\n", "# Note that this line uses broadcasting, see Section 4.2 for more about broadcasting.\n", "faces = per_pixel_centred - per_pixel_centred.mean(axis=1).reshape((400, 1)) # Remove the mean for each image." ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "def plot_face(face):\n", "    # Use the builtin max here: np.max(a, b) would treat b as an axis argument.\n", "    vmax = max(face.max(), -face.min())\n", "    pl.imshow(face.reshape((64,64)), cmap=pl.cm.gray,\n", "              vmin=-vmax, vmax=vmax, interpolation='nearest')\n", "    ax = pl.gca(); ax.set_yticks([]); ax.set_xticks([])\n", "\n", "def plot_faces(faces, number):\n", "    for i in range(0, number):\n", "        ax = pl.subplot(number//3, 3, i+1)\n", "        plot_face(faces[i,:])\n", "\n", "plot_faces(faces, 6)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Computing the eigen-decomposition of the $4096 \times 4096$ covariance matrix directly is expensive. An alternative decomposition is the Singular Value Decomposition (SVD), given by:\n", "$\newcommand{\U}{\mathbf{U}}\n", "\newcommand{\S}{\mathbf{S}}\n", "\newcommand{\W}{\mathbf{W}}\n", "\newcommand{\V}{\mathbf{V}}$\n", "$\X = \mathbf{U}\mathbf{S}\mathbf{W}$. Note that unlike the eigen-decomposition, the SVD does not require a square matrix. Let $\X$ have dimensions $n\times m$. The matrices $\U$ and $\W$ are orthogonal and have dimensions $n \times n$ and $m \times m$ respectively.\n", "\n", "Consider the empirical covariance and substitute the SVD:\n", "\n", "$\X^T\X = \W^T \S^T \U^T \U \S \W$. \n", "\n", "Since $\U$ is orthogonal, $\U^T\U = \mathbf{I}$, so:\n", "\n", "$\X^T\X = \W^T \S^T \S \W$\n", "\n", "Comparing this to the eigen-decomposition $\X^T\X = \V\mathbf{\Sigma}\V^T$, we can see that $\W^T = \V$, i.e. the *rows* of $\W$ are the eigenvectors, and the squared singular values on the diagonal of $\S^T\S$ are the eigenvalues. Using the SVD, we can compute the eigenvectors and eigenvalues much faster.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%time\n", "# time is an IPython 'magic' function that will record how long this block took to execute.\n", "\n", "_, s, w = np.linalg.svd(faces) # Since we don't care about the u term, we can use the dummy variable _" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see if we can reduce the size of the features in this data using PCA. Let's take the first 100 principal components." ] }, { "cell_type": "code", "collapsed": false, "input": [ "num_components = 100\n", "top_components = w[0:num_components, :]\n", "plot_faces(top_components, 9) # Plot the top (most important) eigenfaces" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe that the principal components of the faces dataset themselves resemble faces (they are often called *eigenfaces*). We can visualise how much information is lost by reconstructing the first few faces from only the top principal components."
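, "\n", "Before doing so, here is a small sketch (an addition, not part of the original lab): the squared singular values measure how much variance each component captures, so their normalised cumulative sum shows what fraction of the total variance the first $k$ components retain." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Sketch: cumulative fraction of variance captured by the first k components.\n", "# Uses the singular values s from the SVD above; the eigenvalues are s**2.\n", "explained = np.cumsum(s**2) / np.sum(s**2)\n", "pl.plot(explained)\n", "pl.xlabel('number of components k')\n", "pl.ylabel('fraction of variance retained');" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now reconstruct the first few faces from a subset of the eigenfaces:"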
] }, { "cell_type": "code", "collapsed": false, "input": [ "# For the first face, show the components of the top 5 eigenfaces\n", "face = faces[0, :]\n", "plot_face(face)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "first_face_components = face.dot(top_components[0:9,:].T)\n", "print(\"The contribution from the first 9 eigenfaces are:\")\n", "for i, c in enumerate(first_face_components, start=1):\n", " print(\"{:6.2f} * eigenface {} {}\".format(c, i, \"+\" if i < 9 else \"\"))\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Build up a representation of the first face using just these first 9 eigenfaces\n", "plot_face(first_face_components.dot(top_components[:9, :]))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Compare to the representation if we use the top 75 eigenvectors\n", "# Try other values!\n", "plot_face(face.dot(top_components[0:75,:].T.dot(top_components[:75, :])))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "transformed_faces = faces[0:5,:].dot(top_components.T).dot(top_components)\n", "def compare_faces(face1, face2, number):\n", " for i in range(0,number):\n", " pl.figure()\n", " for face,plot_num in zip([face1, face2], [121,122]):\n", " pl.subplot(plot_num)\n", " vmax = np.max([face[i,:].max(), -face[i,:].min()])\n", " pl.imshow(face[i,:].reshape((64,64)),cmap=pl.cm.gray,\n", " vmin=-vmax,vmax=vmax, interpolation='nearest')\n", "compare_faces(faces, transformed_faces, 5)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3.2 (a)\n", "\n", "Try with different values of principal components. See if you can identify faces with only 5 principal components.\n", "\n", "### Exercise 3.2 (b)\n", "\n", "Try using an Eigen-decompostion instead of a SVD on the faces data-set. How long does each method take?\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3.3 Scikit Learn \n", "-----------------\n", "\n", "We can also use the scikit learn library to achieve a similar thing. For decompositions we need to import the scikit learn decomposition submodule `sklearn.decomposition`.\n", "\n", "To use scikit learn we will first instanciate a PCA object, and specify the number of components for the problem." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sklearn.decomposition as decomp\n", "num_components = 100\n", "pca = decomp.PCA(n_components = num_components)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit-learn objects for supervised learning typically implement a `fit` method that is used to train the model. For the case of PCA, this *training* is equivalent to finding the `n` principal compenents of the faces as we did in section 3.2. These principal components are then stored inside the `pca` object so they can be used to transform the faces. 
The separation between finding the principal components and transforming the faces is useful if we wish to use the principal components of one set of faces as a basis for a different but hopefully very similar new set of faces.\n", "\n", "Pass in the `faces` array as training data and fit the model:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%time\n", "# Time is an IPython 'magic' function that will record how long this block took to execute.\n", "\n", "pca.fit(faces)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get the reducted dimensionality features, apply the `transform` method to the data. Since we are interested in visualising the difference between the true and approximate face, we need to immediately apply the `inverse_transform` to bring the data back into its original space. We can now plot once again." ] }, { "cell_type": "code", "collapsed": false, "input": [ "new_faces = pca.inverse_transform(pca.transform(faces))\n", "compare_faces(faces, new_faces, 5)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3.3 (a)\n", "\n", "Try modifying the number of components to get different results. \n", "\n", "### Exercise 3.3 (b)\n", "\n", "Scikit learn also implements whitening. Whitening is a process where the components are scaled by their eigenvalues, and is useful as many machine learning algorithms expect all the features to be at the same scale. Experiment with setting `whiten=True` when constructing the `PCA` object. Can you tell the difference between whitened and unwhitened eigenfaces?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pca = decomp.PCA(n_components = num_components, whiten=True)\n", "pca.fit(faces)\n", "whitened_faces = pca.inverse_transform(pca.transform(faces))\n", "compare_faces(faces, whitened_faces, 5)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3.3 (c)\n", "\n", "Implement 'whitening' for the faces dataset. This requires dividing each principal component by the square root of the its corrosponding eigenvalue. This can be done by replacing each instance of $\\V$ with $\\mathbf{\\Sigma}^{-1/2}\\V$ in the PCA equations.\n", "\n", "Note that if you choose to use the Singular Value Decomposition (SVD), note that the each singluar value is the square root of a corresponding eigenvalue. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "whitened_top_components = np.diag(1/s[0:num_components]).dot(w[0:num_components, :])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print(whitened_top_components)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "whitened_transformed_faces = faces[0:5,:].dot(whitened_top_components.T).dot(whitened_top_components)\n", "compare_faces(faces, whitened_transformed_faces, 5)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4 Advanced Numpy Usage - Optional\n", "====================\n", "\n", "In this section we will explore the concept of broadcasting in numpy arrays. This is a difficult topic, so we will motivate it by first demonstating its usefullness. 
\n", "\n", "4.1 Motivating Example\n", "-----------------------\n", "\n", "In this example we will construct a Gram matrix using a kernel function. (Attribution to L. McCalman who showed me this cool application of broadcasting).\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def kernel(x,xdash):\n", " \"\"\"\n", " Simple exponential kernel.\n", " \"\"\"\n", " sig = 2.0\n", " return np.exp(-(np.abs(np.sum(x-xdash)))/sig)\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imagine that this kernel encodes some concept of distance. We wish to know the 'distance' in as described by this kernel between n points. Lets use the data from the robot toy example at the beginning. This could be done niavely as follows:\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%time\n", "\n", "n = 1000 # Number of samples\n", "dim = 2 #Dimensionality of each sample.\n", "#points = np.random.random((n,dim))\n", "points = history\n", "DistMatrix = np.zeros((n,n)) # Construct a matrix to hold the result.\n", "\n", "for i in range(0, n):\n", " for j in range(0,n):\n", " DistMatrix[i,j] = kernel(points[i,:],points[j,:])\n", " \n", "pl.imshow(DistMatrix)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, we can get a great speed-up by using the concept of broadcasting." ] }, { "cell_type": "code", "collapsed": false, "input": [ "points.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%time\n", "bcDist = np.exp(-(np.abs((points[:,np.newaxis,:]-points[np.newaxis,:,:]).sum(axis=2))/2.0))\n", "bcDist.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "(DistMatrix-bcDist).sum()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe the use of the `np.newaxis` command, which adds a new axis to the vector. The purpose of this will become clear in the next section." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(points.shape)\n", "print(points[:,np.newaxis,:].shape)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4.2 How broadcasting works\n", "----------------------------\n", "\n", "Broadcasting allows the elementwise multiplication of matricies which upon first glance are an illegal matrix operation." ] }, { "cell_type": "code", "collapsed": false, "input": [ "A = np.random.random((4,1)) # Create a 4x1 matrix\n", "B = np.random.random((1,3)) # Create a 1x3 matrix\n", "print(A*B) # elementwise multiplication 4x1 and 1x3, have different sizes so this shouldn't work?!?" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Broadcasting is a short-hand for copying the vector along the unitary dimension. It is activated when an elementwise operation is specified that has incorrect shape, but one of the dimensions sizes is 1. 
Thus `A*B` in the previous cell is actually equivalent to:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "Abroadcast = np.repeat(A, 3, axis=1) # Repeat matrix A along the second axis 3 times.\n", "\n", "Bbroadcast = np.repeat(B, 4, axis=0) # Repeat matrix B along the first axis 4 times.\n", "\n", "diff = Abroadcast*Bbroadcast - A*B #demonstrate equivalence\n", "\n", "print('Matrix A is broadcast to: \\n {A} \\n'.format(A=Abroadcast))\n", "print('Matrix B is broadcast to: \\n {B} \\n'.format(B=Bbroadcast))\n", "\n", "print('If the following result is zero, then the two approaches are equivalent: \\n {}'.format(diff))\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4.3 Explaining the motivating example\n", "--------------------------------------\n", "\n", "The example in 4.1 demonstated an advanced usage of broadcasting, but the only difference is that matricies are copied accross dimensions instead of vectors in 4.2. Lets break it down further to show each step slowly." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print('inital shape is {}'.format(points.shape))\n", "A = points[:, np.newaxis, :]\n", "B = points[ np.newaxis,:,:]\n", "print('the first matrix shape is {}'.format(A.shape))\n", "print('the second matrix shape is {}'.format(B.shape))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "C = A-B\n", "print('The result of A-B has shape {}'.format(C.shape))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe that we have an extra axis. This contains the difference over every dimension. In order to calculate our kernel we need to sum over all these, so we can use the sum function. However, we need to specify the axis." ] }, { "cell_type": "code", "collapsed": false, "input": [ "D = C.sum(axis=2) # Sum over the second axis.\n", "print('The shape after the sum operation is {}'.format(D.shape))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The rest is straightforward, as we need to take the absolute value of the sum, divde by sig and then the exponent, to finally arrive at the correct result." ] }, { "cell_type": "code", "collapsed": false, "input": [ "E = np.exp(-np.abs(D)/2)\n", "print((DistMatrix-E).sum())" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 4.4 (a)\n", "\n", "In the cell below is a naive implementation that constructs a polynomial feature matrix (this will be covered more in the next workshop). 
\n", "\n", "Hints: \n", " - the `**` operator is an element-wise operator similar to addition and subtraction.\n", " - Don't forget to use `np.newaxis` to change the shape of the vectors.\n", "\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "x = np.arange(5)\n", "exponent = np.arange(5)\n", "\n", "naive_result = np.zeros((5,5))\n", "for i in range(x.shape[0]):\n", " for j in range(exponent.shape[0]):\n", " naive_result[i,j] = x[i]**exponent[j]\n", "print(naive_result)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the cell below, use broadcasting to re-implement the algorithm above:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "x = np.arange(5)[:,np.newaxis] # Make x a 5x1 matrix\n", "exponent = np.arange(5)[np.newaxis,:] # Make exponent a 1x5 matrix\n", "\n", "broadcast_result = x**exponent\n", "print(np.sum(np.abs(naive_result - broadcast_result)))\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. Sparse Matricies - Optional - Advanced\n", "======================\n", "\n", "Occasionally you may encounter a problem where you want to deal with a sparse matrix. Perhaps your problem is naturally sparse or you wish to make a sparse approximation. For simplicity's sake lets approximate the matrix in the toy robot example and then attempt to use it to solve a linear equation. This will allow us to compare dense vs sparse matrix techniques in python.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "denseMat = DistMatrix.copy() # Lets make a copy so that we don't over-write the DistMatrix variable\n", "\n", "denseMat[denseMat < 0.1] = 0 # Use boolean indexing to set matrix indecies to zero.\n", "\n", "pl.imshow(denseMat)\n", "pl.figure()\n", "pl.imshow(DistMatrix)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "import scipy.sparse as sparse\n", "import scipy.sparse.linalg as splinalg\n", "import numpy.linalg as linalg" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "sparseMat = sparse.csc_matrix(denseMat) #Construct a sparse matrix with column ordering.\n", "b = np.ones((sparseMat.shape[0]))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%%time\n", "\n", "# If this fails, attempt to make the matrix invertible by applying a small non-zero value to every diagonal entry. \n", "xsp = splinalg.spsolve(sparseMat, b)\n", "print('done')\n", "# sparseMat)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%%time\n", "\n", "xdense = linalg.solve(denseMat,b)\n", " \n", "print('done')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above shows that you can get a speed increase by using sparse matricies. Note that it is strongly dependent upon the size and sparsity of the problem. Try again but this time set all values below 0.8 to zero, you should see a significant speed improvement." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "# Verify that the results are numerically equivalent.\n", "print(np.sum(np.abs(xsp-xdense)))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }