{
 "metadata": {
  "name": "",
  "signature": "sha256:dd0581545ef7f9a65392530dfc5a09cb17b8c4f96b26283a9243833854fc9af8"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Project: Linear Regression\n",
      "\n",
      "*Based on the notebook by [Michael Granitzer](http://nbviewer.ipython.org/github/mgrani/LODA-lecture-notes-on-data-analysis/blob/master/II.ML-and-DM/II.ML-and-DM-Example-Regression.ipynb#)*\n",
      "\n",
      "Goal: Implement and evaluate a linear model to predict car mileage."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## 1. Get the data!\n",
      "\n",
      "Location:\n",
      "   * http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data\n",
      "   * http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names\n",
      "   \n",
      "Tools:\n",
      "   * `!curl -o <filename> \"<url>\"`\n",
      "   * `urllib2.urlretrieve(<url>, <filename>)`\n",
      "   * `tempfileName = urllib2.urlretrieve(<url>)`\n",
      "   \n",
      "Specifications:\n",
      "   * Use `data` as the filename for the data.\n",
      "   * Use `description` as the filename for the names."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# code to get the data"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#%load solutions/get-data-curl.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#%load solutions/get-data-urllib.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## 2. Load the data\n",
      "\n",
      "Tools:\n",
      "   * `auto = numpy.loadtxt(\"<filename>\")`"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# code to load the data"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#%load solutions/load-data.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### 2.1. Inspection"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#look at the first few examples.\n",
      "!cat data | grep \"?\""
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### 2.2. Clean the data\n",
      "\n",
      "We don't want the car name and the data has missing values, marked by \"?\".\n",
      "\n",
      "Tools:\n",
      "   * `auto = np.genfromtxt(\"<filename>\", usecols=(0,1,2,3,4,5,6,7), missing_values='?')`"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# code to load the data without strings"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%load solutions/load-data2.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "auto = np.genfromtxt(\"data\", usecols=(0,1,2,3,4,5,6,7),\n",
      "                     missing_values='?')\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's now handle the missing values."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#missing values are displayed as nan. Let's take a look which columns have them\n",
      "np.any(np.isnan(auto), axis=0)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Now let's take a look which rows have them\n",
      "np.any(np.isnan(auto), axis=1)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now print the rows which contain nans, remove them from the `auto` matrix and store the result in the same `auto` matrix."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# code to remove nans\n",
      "nan_rows = ? # fixme\n",
      "auto = auto[:,:] # fixme"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#%load solutions/remove-nans.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## 3. Fit the data\n",
      "\n",
      "   * Let's assume that we are going to predict the mileage of a car from various *features* in the data. We use a *linear model*.\n",
      "   * Denote the data by $X_{n \\times p}$, containing $n$ rows and $p$ columns.\n",
      "   * A single *training example* is a single row of $X$, with each element being the value of a feature. \n",
      "   * In a linear model, the target is a linear combination of the feature values.\n",
      "   * We can write the *prediction* of a single row of $X$ as follows:\n",
      "\n",
      "$$\n",
      "\\hat{y}(X)_i = w_0 + \\sum_{j=1}^p w_jx_{ij}, \\quad x_{ij} \\in X[i,:]\n",
      "$$\n",
      "\n",
      "   * If we augment $X$ with a leftmost column of 1's, we can write all predictions in vector form:\n",
      "\n",
      "$$\n",
      "\\hat{y}(X) = w^TX\n",
      "$$\n",
      "\n",
      "   * Our predictions from the training examples may differ from the actualy values of the target. We define the error as:\n",
      "   \n",
      "$$\n",
      "\\textrm{SSE} = \\sum_{i=1}^n (y_i - \\hat{y}(X)_i )^2\n",
      "$$\n",
      "\n",
      "   * We want to find the value of $w$ that minimises this error. This is given by:\n",
      "   \n",
      "$$\n",
      "w = (X^TX)^{-1}X^Ty\n",
      "$$"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy.linalg as la"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# code to fit and predict\n",
      "def linreg(X,y):\n",
      "    # 1. Add a column of 1's to X; now it has a total of p columns\n",
      "    \n",
      "    # 2. Calculate the weight vector w (should have p columns too)\n",
      "    \n",
      "    # 3. Calculate the predicted values of the target, y_pred\n",
      "    \n",
      "    # 4. Calculate the error, sse\n",
      "    \n",
      "    return w, y_pred, sse\n",
      "\n",
      "# 1. Split the data into the training examples X (cylinders and displacement)\n",
      "#    and target column y (miles per gallon)\n",
      "X = auto[:,:] # fixme\n",
      "y = auto[:,:] # fixme\n",
      "print linreg(X,y)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#%load solutions/linreg.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A better error measure: $RMS = \\sqrt{SSE / n}$. Can be roughly interpreted as the prediction error in miles-per-gallon."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from math import sqrt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# code to print the RMS error"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#%load solutions/rms.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## 4. Visualise the data\n",
      "\n",
      "Let's make predictions using single features and plot the fit."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import matplotlib.pyplot as plt\n",
      "%matplotlib inline"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# predict and plot using just cylinders as a feature, print the sse\n",
      "X = auto[:,:] # fixme\n",
      "y = auto[:,:] # fixme\n",
      "w, y_pred, sse = linreg(X, y)\n",
      "\n",
      "plt.plot(X, y, 'o')\n",
      "plt.plot(X, y_pred, 'r')\n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#%load solutions/predict-cylinders.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# predict and plot using just weight as a feature, print the sse"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#%load solutions/predict-and-plot.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}