{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can read an overview of this Numerical Linear Algebra course in [this blog post](http://www.fast.ai/2017/07/17/num-lin-alg/).  The course was originally taught in the [University of San Francisco MS in Analytics](https://www.usfca.edu/arts-sciences/graduate-programs/analytics) graduate program.  Course lecture videos are [available on YouTube](https://www.youtube.com/playlist?list=PLtmWHNX-gukIc92m1K0P6bIOnZb-mg0hY) (note that the notebook numbers and video numbers do not line up, since some notebooks took longer than 1 video to cover).\n",
    "\n",
    "You can ask questions about the course on [our fast.ai forums](http://forums.fast.ai/c/lin-alg)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5. Health Outcomes with Linear Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn import datasets, linear_model, metrics\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "import math, scipy, numpy as np\n",
    "from scipy import linalg"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Diabetes Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use a dataset from patients with diabates.  The data consists of 442 samples and 10 variables (all are physiological characteristics), so it is tall and skinny.  The dependent variable is a quantitative measure of disease progression one year after baseline."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a classic dataset, famously used by Efron, Hastie, Johnstone, and Tibshirani in their [Least Angle Regression](https://arxiv.org/pdf/math/0406456.pdf) paper, and one of the [many datasets included with scikit-learn](http://scikit-learn.org/stable/datasets/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "data = datasets.load_diabetes()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "feature_names=['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "trn,test,y_trn,y_test = train_test_split(data.data, data.target, test_size=0.2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((353, 10), (89, 10))"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trn.shape, test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Linear regression in Scikit Learn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Consider a system $X\\beta = y$, where $X$ has more rows than columns.  This occurs when you have more data samples than variables.  We want to find $\\hat{\\beta}$ that minimizes: \n",
    "$$ \\big\\vert\\big\\vert X\\beta - y \\big\\vert\\big\\vert_2$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start by using the sklearn implementation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 173,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "458 µs ± 62.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
     ]
    }
   ],
   "source": [
    "regr = linear_model.LinearRegression()\n",
    "%timeit regr.fit(trn, y_trn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pred = regr.predict(test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It will be helpful to have some metrics on how good our prediciton is.  We will look at the mean squared norm (L2) and mean absolute error (L1)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 128,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "def regr_metrics(act, pred):\n",
    "    return (math.sqrt(metrics.mean_squared_error(act, pred)), \n",
    "     metrics.mean_absolute_error(act, pred))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 129,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(75.36166834955054, 60.629082113104403)"
      ]
     },
     "execution_count": 129,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "regr_metrics(y_test, regr.predict(test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Polynomial Features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Linear regression finds the best coefficients $\\beta_i$ for:\n",
    "\n",
    "$$ x_0\\beta_0 + x_1\\beta_1 + x_2\\beta_2 = y $$\n",
    "\n",
    "Adding polynomial features is still a linear regression problem, just with more terms:\n",
    "\n",
    "$$ x_0\\beta_0 + x_1\\beta_1 + x_2\\beta_2 + x_0^2\\beta_3 + x_0 x_1\\beta_4 + x_0 x_2\\beta_5 + x_1^2\\beta_6 + x_1 x_2\\beta_7 + x_2^2\\beta_8 = y $$\n",
    "\n",
    "We need to use our original data $X$ to calculate the additional polynomial features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 172,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(353, 10)"
      ]
     },
     "execution_count": 172,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trn.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we want to try improving our model's performance by adding some more features.  Currently, our model is linear in each variable, but we can add polynomial features to change this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 130,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "poly = PolynomialFeatures(include_bias=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 131,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "trn_feat = poly.fit_transform(trn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 132,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'age, sex, bmi, bp, s1, s2, s3, s4, s5, s6, age^2, age sex, age bmi, age bp, age s1, age s2, age s3, age s4, age s5, age s6, sex^2, sex bmi, sex bp, sex s1, sex s2, sex s3, sex s4, sex s5, sex s6, bmi^2, bmi bp, bmi s1, bmi s2, bmi s3, bmi s4, bmi s5, bmi s6, bp^2, bp s1, bp s2, bp s3, bp s4, bp s5, bp s6, s1^2, s1 s2, s1 s3, s1 s4, s1 s5, s1 s6, s2^2, s2 s3, s2 s4, s2 s5, s2 s6, s3^2, s3 s4, s3 s5, s3 s6, s4^2, s4 s5, s4 s6, s5^2, s5 s6, s6^2'"
      ]
     },
     "execution_count": 132,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "', '.join(poly.get_feature_names(feature_names))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 133,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(353, 65)"
      ]
     },
     "execution_count": 133,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trn_feat.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 134,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)"
      ]
     },
     "execution_count": 134,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "regr.fit(trn_feat, y_trn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 135,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(55.747345922929185, 42.836164292252235)"
      ]
     },
     "execution_count": 135,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "regr_metrics(y_test, regr.predict(poly.fit_transform(test)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Time is squared in #features and linear in #points, so this will get very slow!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 136,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "635 µs ± 9.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit poly.fit_transform(trn)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Speeding up feature generation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We would like to speed this up.  We will use [Numba](http://numba.pydata.org/numba-doc/0.12.2/tutorial_firststeps.html), a Python library that compiles code directly to C."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Numba is a compiler.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Resources"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[This tutorial](https://jakevdp.github.io/blog/2012/08/24/numba-vs-cython/) from Jake VanderPlas is a nice introduction.  Here Jake [implements a non-trivial algorithm](https://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/) (non-uniform fast Fourier transform) with Numba.\n",
    "\n",
    "Cython is another alternative.  I've found Cython to require more knowledge to use than Numba (it's closer to C), but to provide similar speed-ups to Numba.\n",
    "\n",
    "<img src=\"images/cython_vs_numba.png\" alt=\"Cython vs Numba\" style=\"width: 80%\"/>\n",
    "\n",
    "Here is a [thorough answer](https://softwareengineering.stackexchange.com/questions/246094/understanding-the-differences-traditional-interpreter-jit-compiler-jit-interp) on the differences between an Ahead Of Time (AOT) compiler, a Just In Time (JIT) compiler, and an interpreter."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Experiments with vectorization and native code"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's first get aquainted with Numba, and then we will return to our problem of polynomial features for regression on the diabates data set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 140,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 141,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import math, numpy as np, matplotlib.pyplot as plt\n",
    "from pandas_summary import DataFrameSummary\n",
    "from scipy import ndimage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 142,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from numba import jit, vectorize, guvectorize, cuda, float32, void, float64"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will show the impact of:\n",
    "- Avoiding memory allocations and copies (slower than CPU calculations)\n",
    "- Better locality\n",
    "- Vectorization\n",
    "\n",
    "If we use numpy on whole arrays at a time, it creates lots of temporaries, and can't use cache. If we use numba looping through an array item at a time, then we don't have to allocate large temporary arrays, and can reuse cached data since we're doing multiple calculations on each array item."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 175,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Untype and Unvectorized\n",
    "def proc_python(xx,yy):\n",
    "    zz = np.zeros(nobs, dtype='float32')\n",
    "    for j in range(nobs):   \n",
    "        x, y = xx[j], yy[j] \n",
    "        x = x*2 - ( y * 55 )\n",
    "        y = x + y*2         \n",
    "        z = x + y + 99      \n",
    "        z = z * ( z - .88 ) \n",
    "        zz[j] = z           \n",
    "    return zz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 176,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "nobs = 10000\n",
    "x = np.random.randn(nobs).astype('float32')\n",
    "y = np.random.randn(nobs).astype('float32')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 177,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "49.8 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit proc_python(x,y)   # Untyped and unvectorized"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Numpy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Numpy lets us vectorize this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 146,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Typed and Vectorized\n",
    "def proc_numpy(x,y):\n",
    "    z = np.zeros(nobs, dtype='float32')\n",
    "    x = x*2 - ( y * 55 )\n",
    "    y = x + y*2         \n",
    "    z = x + y + 99      \n",
    "    z = z * ( z - .88 ) \n",
    "    return z"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 147,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 147,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.allclose( proc_numpy(x,y), proc_python(x,y), atol=1e-4 )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 148,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "35.9 µs ± 166 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit proc_numpy(x,y)    # Typed and vectorized"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Numba"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Numba offers several different decorators.  We will try two different ones:\n",
    "\n",
    "- `@jit`: very general\n",
    "- `@vectorize`: don't need to write a for loop.  useful when operating on vectors of the same size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we will use Numba's jit  (just-in-time) compiler decorator, without explicitly vectorizing.  This avoids large memory allocations, so we have better locality:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 149,
   "metadata": {
    "collapsed": true,
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "@jit()\n",
    "def proc_numba(xx,yy,zz):\n",
    "    for j in range(nobs):   \n",
    "        x, y = xx[j], yy[j] \n",
    "        x = x*2 - ( y * 55 )\n",
    "        y = x + y*2         \n",
    "        z = x + y + 99      \n",
    "        z = z * ( z - .88 ) \n",
    "        zz[j] = z           \n",
    "    return zz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 150,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 150,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "z = np.zeros(nobs).astype('float32')\n",
    "np.allclose( proc_numpy(x,y), proc_numba(x,y,z), atol=1e-4 )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 151,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "6.4 µs ± 17.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit proc_numba(x,y,z)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we will use Numba's `vectorize` decorator. Numba's compiler optimizes this in a smarter way than what is possible with plain Python and Numpy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 152,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "@vectorize\n",
    "def vec_numba(x,y):\n",
    "    x = x*2 - ( y * 55 )\n",
    "    y = x + y*2         \n",
    "    z = x + y + 99      \n",
    "    return z * ( z - .88 ) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 153,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 153,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.allclose(vec_numba(x,y), proc_numba(x,y,z), atol=1e-4 )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 154,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5.82 µs ± 14.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit vec_numba(x,y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Numba is **amazing**.  Look how fast this is!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "### Numba polynomial features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 155,
   "metadata": {
    "collapsed": true,
    "hidden": true
   },
   "outputs": [],
   "source": [
    "@jit(nopython=True)\n",
    "def vec_poly(x, res):\n",
    "    m,n=x.shape\n",
    "    feat_idx=0\n",
    "    for i in range(n):\n",
    "        v1=x[:,i]\n",
    "        for k in range(m): res[k,feat_idx] = v1[k]\n",
    "        feat_idx+=1\n",
    "        for j in range(i,n):\n",
    "            for k in range(m): res[k,feat_idx] = v1[k]*x[k,j]\n",
    "            feat_idx+=1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true,
    "hidden": true
   },
   "source": [
    "#### Row-Major vs Column-Major Storage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "From this [blog post by Eli Bendersky](http://eli.thegreenplace.net/2015/memory-layout-of-multi-dimensional-arrays/):\n",
    "\n",
    "\"The row-major layout of a matrix puts the first row in contiguous memory, then the second row right after it, then the third, and so on. Column-major layout puts the first column in contiguous memory, then the second, etc.... While knowing which layout a particular data set is using is critical for good performance, there's no single answer to the question which layout 'is better' in general.\n",
    "\n",
    "\"It turns out that matching the way your algorithm works with the data layout can make or break the performance of an application.\n",
    "\n",
    "\"The short takeaway is: **always traverse the data in the order it was laid out**.\"\n",
    "\n",
    "**Column-major layout**: Fortran, Matlab, R, and Julia\n",
    "\n",
    "**Row-major layout**: C, C++, Python, Pascal, Mathematica"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 156,
   "metadata": {
    "collapsed": true,
    "hidden": true
   },
   "outputs": [],
   "source": [
    "trn = np.asfortranarray(trn)\n",
    "test = np.asfortranarray(test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 157,
   "metadata": {
    "collapsed": true,
    "hidden": true
   },
   "outputs": [],
   "source": [
    "m,n=trn.shape\n",
    "n_feat = n*(n+1)//2 + n\n",
    "trn_feat = np.zeros((m,n_feat), order='F')\n",
    "test_feat = np.zeros((len(y_test), n_feat), order='F')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 158,
   "metadata": {
    "collapsed": true,
    "hidden": true
   },
   "outputs": [],
   "source": [
    "vec_poly(trn, trn_feat)\n",
    "vec_poly(test, test_feat)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 159,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)"
      ]
     },
     "execution_count": 159,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "regr.fit(trn_feat, y_trn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 160,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(55.74734592292935, 42.836164292252306)"
      ]
     },
     "execution_count": 160,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "regr_metrics(y_test, regr.predict(test_feat))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 161,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7.33 µs ± 19.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit vec_poly(trn, trn_feat)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Recall, this was the time from the scikit learn implementation PolynomialFeatures, which was created by experts: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 136,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "635 µs ± 9.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit poly.fit_transform(trn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 162,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "78.57142857142857"
      ]
     },
     "execution_count": 162,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "605/7.7"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "This is a big deal!  Numba is **amazing**!  With a single line of code, we are getting a 78x speed-up over scikit learn (which was optimized by experts)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "## Regularization and noise"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Regularization is a way to reduce over-fitting and create models that better generalize to new data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true,
    "hidden": true
   },
   "source": [
    "### Regularization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Lasso regression uses an L1 penalty, which pushes towards sparse coefficients. The parameter $\\alpha$ is used to weight the penalty term. Scikit Learn's LassoCV performs cross validation with a number of different values for $\\alpha$.\n",
    "\n",
    "Watch this [Coursera video on Lasso regression](https://www.coursera.org/learn/machine-learning-data-analysis/lecture/0KIy7/what-is-lasso-regression) for more info."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 163,
   "metadata": {
    "collapsed": true,
    "hidden": true
   },
   "outputs": [],
   "source": [
    "reg_regr = linear_model.LassoCV(n_alphas=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 164,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/jhoward/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/coordinate_descent.py:484: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.\n",
      "  ConvergenceWarning)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,\n",
       "    max_iter=1000, n_alphas=10, n_jobs=1, normalize=False, positive=False,\n",
       "    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,\n",
       "    verbose=False)"
      ]
     },
     "execution_count": 164,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "reg_regr.fit(trn_feat, y_trn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 165,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0098199431661591518"
      ]
     },
     "execution_count": 165,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "reg_regr.alpha_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 166,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(50.0982471642817, 40.065199085003101)"
      ]
     },
     "execution_count": 166,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "regr_metrics(y_test, reg_regr.predict(test_feat))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true,
    "hidden": true
   },
   "source": [
    "### Noise"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Now we will add some noise to the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 167,
   "metadata": {
    "collapsed": true,
    "hidden": true
   },
   "outputs": [],
   "source": [
    "idxs = np.random.randint(0, len(trn), 10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 168,
   "metadata": {
    "collapsed": true,
    "hidden": true
   },
   "outputs": [],
   "source": [
    "y_trn2 = np.copy(y_trn)\n",
    "y_trn2[idxs] *= 10 # label noise"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 169,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(51.1766253181518, 41.415992803872754)"
      ]
     },
     "execution_count": 169,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "regr = linear_model.LinearRegression()\n",
    "regr.fit(trn, y_trn)\n",
    "regr_metrics(y_test, regr.predict(test))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 170,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(62.66110319520415, 53.21914420254862)"
      ]
     },
     "execution_count": 170,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "regr.fit(trn, y_trn2)\n",
    "regr_metrics(y_test, regr.predict(test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Huber loss is a loss function that is less sensitive to outliers than squared error loss.  It is quadratic for small error values, and linear for large values.\n",
    "\n",
    " $$L(x)= \n",
    "\\begin{cases}\n",
    "    \\frac{1}{2}x^2,         & \\text{for } \\lvert x\\rvert\\leq \\delta \\\\\n",
    "    \\delta(\\lvert x \\rvert - \\frac{1}{2}\\delta),  & \\text{otherwise}\n",
    "\\end{cases}$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 171,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(51.24055602541746, 41.670840571376822)"
      ]
     },
     "execution_count": 171,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hregr = linear_model.HuberRegressor()\n",
    "hregr.fit(trn, y_trn2)\n",
    "regr_metrics(y_test, hregr.predict(test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# End"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}