{
 "metadata": {
  "name": "06B_validation_exercise"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Exercise: Cross Validation and Model Selection"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%pylab inline"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Populating the interactive namespace from numpy and matplotlib\n"
       ]
      }
     ],
     "prompt_number": 0
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This exercise covers cross-validation of regression models on the Diabetes\n",
      "dataset.  The diabetes data consists of 10 physiological variables (age, sex, weight, blood pressure)\n",
      "measure on 442 patients, and an indication of disease progression after one year:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.datasets import load_diabetes\n",
      "data = load_diabetes()\n",
      "X, y = data.data, data.target"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print X.shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "(442, 10)\n"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print y.shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "(442,)\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Here we'll be fitting two regularized linear models,\n",
      "*Ridge Regression*, which uses $\\ell_2$ regularlization,\n",
      "and *Lasso Regression*, which uses $\\ell_1$ regularization."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.linear_model import Ridge, Lasso"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We'll first use the default hyper-parameters to see the baseline estimator.  We'll\n",
      "use the cross-validation score to determine goodness-of-fit."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.cross_validation import cross_val_score\n",
      "\n",
      "for Model in [Ridge, Lasso]:\n",
      "    model = Model()\n",
      "    print Model.__name__, cross_val_score(model, X, y).mean()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Ridge 0.409427438303\n",
        "Lasso 0.353800083299\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We see that for the default hyper-parameter values, Lasso outperforms Ridge.\n",
      "But is this the case for the *optimal* hyperparameters of each model?"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Exercise: Basic Hyperparameter Optimization"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Here spend some time writing a function which computes the cross-validation\n",
      "score as a function of ``alpha``, the strength of the regularization for\n",
      "``Lasso`` and ``Ridge``.  We'll choose 20 values of ``alpha`` between\n",
      "0.0001 and 1:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "alphas = np.logspace(-3, -1, 30)\n",
      "\n",
      "# plot the mean cross-validation score for a Ridge estimator and a Lasso estimator\n",
      "# as a function of alpha.  Which is more difficult to tune?"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Solution"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%load solutions/06B_basic_grid_search.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Automatically Performing Grid Search"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Because searching a grid of hyperparameters is such a common task, scikit-learn provides\n",
      "several hyper-parameter estimators to automate this.  We'll explore this more in depth\n",
      "later in the tutorial, but for now it is interesting to see how ``GridSearchCV`` works:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.grid_search import GridSearchCV"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "``GridSearchCV`` is constructed with an estimator, as well as a dictionary\n",
      "of parameter values to be searched.  We can find the optimal parameters this\n",
      "way:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for Model in [Ridge, Lasso]:\n",
      "    gscv = GridSearchCV(Model(), dict(alpha=alphas), cv=3).fit(X, y)\n",
      "    print Model.__name__, gscv.best_params_"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Ridge {'alpha': 0.062101694189156162}\n",
        "Lasso"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " {'alpha': 0.01268961003167922}\n"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Built-in Hyperparameter Search"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For some models within scikit-learn, cross-validation can be performed more efficiently\n",
      "on large datasets.  In this case, a cross-validated version of the particular model is\n",
      "included.  The cross-validated versions of ``Ridge`` and ``Lasso`` are ``RidgeCV`` and\n",
      "``LassoCV``, respectively.  The grid search on these estimators can be performed as\n",
      "follows:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.linear_model import RidgeCV, LassoCV\n",
      "for Model in [RidgeCV, LassoCV]:\n",
      "    model = Model(alphas=alphas, cv=3).fit(X, y)\n",
      "    print Model.__name__, model.alpha_"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "RidgeCV 0.0621016941892\n",
        "LassoCV"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " 0.00788046281567\n"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We see that the results match those returned by ``GridSearchCV``."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Exercise: Learning Curves"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Here we'll apply our learning curves to the diabetes data.  The question to answer is this:\n",
      "\n",
      "- Given the optimal models above, which is over-fitting and which is under-fitting the data?\n",
      "- To obtain better results, would you invest time and effort in gathering\n",
      "  more **training samples**, or gathering more **attributes** for each sample?\n",
      "  Recall the previous discussion of reading learning curves.\n",
      "\n",
      "You can follow the process used in the previous notebook to plot the learning curves.\n",
      "A good metric to use is the ``mean_squared_error``, which we'll import below:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.metrics import mean_squared_error\n",
      "# define a function that computes the learning curve (i.e. mean_squared_error as a function\n",
      "# of training set size, for both training and test sets) and plot the result\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Solution"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%load solutions/06B_learning_curves.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 12
    }
   ],
   "metadata": {}
  }
 ]
}