{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Andreas Mueller, Kyle Kastner, Sebastian Raschka \n",
      "last updated: 2016-07-11 \n",
      "\n",
      "CPython 3.5.1\n",
      "IPython 4.1.2\n",
      "\n",
      "numpy 1.11.0\n",
      "scipy 0.17.1\n",
      "matplotlib 1.5.1\n",
      "scikit-learn 0.17.1\n"
     ]
    }
   ],
   "source": [
    "%load_ext watermark\n",
    "%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SciPy 2016 Scikit-learn Tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Supervised Learning Part 2 -- Regression Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In regression we are trying to predict a continuous output variable -- in contrast to the nominal variables we were predicting in the previous classification examples. \n",
    "\n",
    "Let's start with a simple toy example with one feature dimension (explanatory variable) and one target variable. We will create a dataset out of a sinus curve with some noise:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "x = np.linspace(-3, 3, 100)\n",
    "print(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "rng = np.random.RandomState(42)\n",
    "y = np.sin(4 * x) + x + rng.uniform(size=len(x))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.plot(x, y, 'o');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Linear Regression\n",
    "=================\n",
    "\n",
    "The first model that we will introduce is the so-called simple linear regression. Here, we want to fit a line to the data, which \n",
    "\n",
    "One of the simplest models again is a linear one, that simply tries to predict the data as lying on a line. One way to find such a line is `LinearRegression` (also known as [*Ordinary Least Squares (OLS)*](https://en.wikipedia.org/wiki/Ordinary_least_squares) regression).\n",
    "The interface for LinearRegression is exactly the same as for the classifiers before, only that ``y`` now contains float values, instead of classes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we remember, the scikit-learn API requires us to provide the target variable (`y`) as a 1-dimensional array; scikit-learn's API expects the samples (`X`) in form a 2-dimensional array -- even though it may only consist of 1 feature. Thus, let us convert the 1-dimensional `x` NumPy array into an `X` array with 2 axes:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print('Before: ', x.shape)\n",
    "X = x[:, np.newaxis]\n",
    "print('After: ', X.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, we start by splitting our dataset into a training (75%) and a test set (25%):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.cross_validation import train_test_split\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we use the learning algorithm implemented in `LinearRegression` to **fit a regression model to the training data**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "regressor = LinearRegression()\n",
    "regressor.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After fitting to the training data, we paramerterized a linear regression model with the following values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print('Weight coefficients: ', regressor.coef_)\n",
    "print('y-axis intercept: ', regressor.intercept_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since our regression model is a linear one, the relationship between the target variable (y) and the feature variable (x) is defined as \n",
    "\n",
    "$$y = weight \\times x + \\text{intercept}$$.\n",
    "\n",
    "Plugging in the min and max values into thos equation, we can plot the regression fit to our training data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "min_pt = X.min() * regressor.coef_[0] + regressor.intercept_\n",
    "max_pt = X.max() * regressor.coef_[0] + regressor.intercept_\n",
    "\n",
    "plt.plot([X.min(), X.max()], [min_pt, max_pt])\n",
    "plt.plot(X_train, y_train, 'o');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similar to the estimators for classification in the previous notebook, we use the `predict` method to predict the target variable. And we expect these predicted values to fall onto the line that we plotted previously:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "y_pred_train = regressor.predict(X_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.plot(X_train, y_train, 'o', label=\"data\")\n",
    "plt.plot(X_train, y_pred_train, 'o', label=\"prediction\")\n",
    "plt.plot([X.min(), X.max()], [min_pt, max_pt], label='fit')\n",
    "plt.legend(loc='best')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see in the plot above, the line is able to capture the general slope of the data, but not many details."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's try the test set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "y_pred_test = regressor.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.plot(X_test, y_test, 'o', label=\"data\")\n",
    "plt.plot(X_test, y_pred_test, 'o', label=\"prediction\")\n",
    "plt.plot([X.min(), X.max()], [min_pt, max_pt], label='fit')\n",
    "plt.legend(loc='best');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, scikit-learn provides an easy way to evaluate the prediction quantitatively using the ``score`` method. For regression tasks, this is the R<sup>2</sup> score. Another popular way would be the Mean Squared Error (MSE). As its name implies, the MSE is simply the average squared difference over the predicted and actual target values\n",
    "\n",
    "$$MSE = \\frac{1}{n} \\sum^{n}_{i=1} (\\text{predicted}_i - \\text{true}_i)^2$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "regressor.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "KNeighborsRegression\n",
    "=======================\n",
    "As for classification, we can also use a neighbor based method for regression. We can simply take the output of the nearest point, or we could average several nearest points. This method is less popular for regression than for classification, but still a good baseline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.neighbors import KNeighborsRegressor\n",
    "kneighbor_regression = KNeighborsRegressor(n_neighbors=1)\n",
    "kneighbor_regression.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, let us look at the behavior on training and test set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "y_pred_train = kneighbor_regression.predict(X_train)\n",
    "\n",
    "plt.plot(X_train, y_train, 'o', label=\"data\", markersize=10)\n",
    "plt.plot(X_train, y_pred_train, 's', label=\"prediction\", markersize=4)\n",
    "plt.legend(loc='best');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On the training set, we do a perfect job: each point is its own nearest neighbor!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "y_pred_test = kneighbor_regression.predict(X_test)\n",
    "\n",
    "plt.plot(X_test, y_test, 'o', label=\"data\", markersize=8)\n",
    "plt.plot(X_test, y_pred_test, 's', label=\"prediction\", markersize=4)\n",
    "plt.legend(loc='best');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On the test set, we also do a better job of capturing the variation, but our estimates look much messier than before.\n",
    "Let us look at the R<sup>2</sup> score:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "kneighbor_regression.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Much better than before! Here, the linear model was not a good fit for our problem; it was lacking in complexity and thus under-fit our data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Exercise\n",
    "=========\n",
    "Compare the KNeighborsRegressor and LinearRegression on the boston housing dataset. You can load the dataset using ``sklearn.datasets.load_boston``. You can learn about the dataset by reading the ``DESCR`` attribute."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# %load solutions/06A_knn_vs_linreg.py"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}