{ "metadata": { "name": "", "signature": "sha256:dd0581545ef7f9a65392530dfc5a09cb17b8c4f96b26283a9243833854fc9af8" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Project: Linear Regression\n", "\n", "*Based on the notebook by [Michael Granitzer](http://nbviewer.ipython.org/github/mgrani/LODA-lecture-notes-on-data-analysis/blob/master/II.ML-and-DM/II.ML-and-DM-Example-Regression.ipynb#)*\n", "\n", "Goal: Implement and evaluate a linear model to predict car mileage." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Get the data!\n", "\n", "Location:\n", " * http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data\n", " * http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names\n", " \n", "Tools:\n", " * `!curl -o \"\"`\n", " * `urllib2.urlretrieve(, )`\n", " * `tempfileName = urllib2.urlretrieve()`\n", " \n", "Specifications:\n", " * Use `data` as the filename for the data.\n", " * Use `description` as the filename for the names." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# code to get the data" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#%load solutions/get-data-curl.py" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#%load solutions/get-data-urllib.py" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Load the data\n", "\n", "Tools:\n", " * `auto = numpy.loadtxt(\"\")`" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# code to load the data" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#%load solutions/load-data.py" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1. Inspection" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#look at the first few examples.\n", "!cat data | grep \"?\"" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2. Clean the data\n", "\n", "We don't want the car name and the data has missing values, marked by \"?\".\n", "\n", "Tools:\n", " * `auto = np.genfromtxt(\"\", usecols=(0,1,2,3,4,5,6,7), missing_values='?')`" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# code to load the data without strings" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%load solutions/load-data2.py" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "auto = np.genfromtxt(\"data\", usecols=(0,1,2,3,4,5,6,7),\n", " missing_values='?')\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now handle the missing values." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#missing values are displayed as nan. Let's take a look which columns have them\n", "np.any(np.isnan(auto), axis=0)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Now let's take a look which rows have them\n", "np.any(np.isnan(auto), axis=1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now print the rows which contain nans, remove them from the `auto` matrix and store the result in the same `auto` matrix." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# code to remove nans\n", "nan_rows = ? # fixme\n", "auto = auto[:,:] # fixme" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#%load solutions/remove-nans.py" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Fit the data\n", "\n", " * Let's assume that we are going to predict the mileage of a car from various *features* in the data. We use a *linear model*.\n", " * Denote the data by $X_{n \\times p}$, containing $n$ rows and $p$ columns.\n", " * A single *training example* is a single row of $X$, with each element being the value of a feature. \n", " * In a linear model, the target is a linear combination of the feature values.\n", " * We can write the *prediction* of a single row of $X$ as follows:\n", "\n", "$$\n", "\\hat{y}(X)_i = w_0 + \\sum_{j=1}^p w_jx_{ij}, \\quad x_{ij} \\in X[i,:]\n", "$$\n", "\n", " * If we augment $X$ with a leftmost column of 1's, we can write all predictions in vector form:\n", "\n", "$$\n", "\\hat{y}(X) = w^TX\n", "$$\n", "\n", " * Our predictions from the training examples may differ from the actualy values of the target. We define the error as:\n", " \n", "$$\n", "\\textrm{SSE} = \\sum_{i=1}^n (y_i - \\hat{y}(X)_i )^2\n", "$$\n", "\n", " * We want to find the value of $w$ that minimises this error. This is given by:\n", " \n", "$$\n", "w = (X^TX)^{-1}X^Ty\n", "$$" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy.linalg as la" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# code to fit and predict\n", "def linreg(X,y):\n", " # 1. Add a column of 1's to X; now it has a total of p columns\n", " \n", " # 2. Calculate the weight vector w (should have p columns too)\n", " \n", " # 3. Calculate the predicted values of the target, y_pred\n", " \n", " # 4. Calculate the error, sse\n", " \n", " return w, y_pred, sse\n", "\n", "# 1. Split the data into the training examples X (cylinders and displacement)\n", "# and target column y (miles per gallon)\n", "X = auto[:,:] # fixme\n", "y = auto[:,:] # fixme\n", "print linreg(X,y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#%load solutions/linreg.py" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A better error measure: $RMS = \\sqrt{SSE / n}$. Can be roughly interpreted as the prediction error in miles-per-gallon." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from math import sqrt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# code to print the RMS error" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#%load solutions/rms.py" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Visualise the data\n", "\n", "Let's make predictions using single features and plot the fit." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# predict and plot using just cylinders as a feature, print the sse\n", "X = auto[:,:] # fixme\n", "y = auto[:,:] # fixme\n", "w, y_pred, sse = linreg(X, y)\n", "\n", "plt.plot(X, y, 'o')\n", "plt.plot(X, y_pred, 'r')\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#%load solutions/predict-cylinders.py" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# predict and plot using just weight as a feature, print the sse" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#%load solutions/predict-and-plot.py" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }