{ "metadata": { "name": "", "signature": "sha256:672c7d02621c234ccdc6ec679a1ab230c08d263193cf036d815a822fba313a3f" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Breakout: Model Validation\n", "\n", "Here we'll practice the process of model validation, and evaluating how we can improve our model. We'll return to the *Labeled Faces in the Wild* dataset that we saw previously, and use the cross-validation techniques we covered to find the best possible model." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from scipy import stats\n", "\n", "# use seaborn plotting defaults\n", "# If this causes an error, you can comment it out.\n", "import seaborn as sns\n", "sns.set()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import fetch_lfw_people\n", "faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)\n", "\n", "X, y = faces.data, faces.target" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Validation with Random Forests\n", "\n", "- Use a ``RandomForestClassifier`` with the default parameters, and use 10-fold cross-validation to determine the optimal accuracy.\n", "- Construct validation curves for the random forest classifier on this data, exploring the effect of ``max_depth`` on the result.\n", "- What is the best value for ``max_depth`` (approximately)? What is the best score for this estimator?\n", "- Construct a learning curve for the Random Forest Classifier using this value for ``max_depth``.\n", "- Given the validation and learning curves, how do you think you could improve this classifier \u2013 should you seek a better model/more features, or should you seek more data?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Validation with Support Vector Machines\n", "\n", "The Support Vector Classifier is often a much more powerful model than random forests, especially for smaller datasets.\n", "\n", "Here we'll repeat the above exercise, but use ``sklearn.svm.SVC`` instead.\n", "The support vector classifier that we'll use below does not scale well with data dimension. For this reason, we'll start by doing a dimensionality reduction of the data.\n", "\n", "- Use the ``SVC`` with the default parameters, and use 3-fold cross-validation to determine the optimal accuracy.\n", "\n", "You'll notice that this computation takes a relatively long time in comparison to the Random Forest Classifier.\n", "This is because the data has a very high dimension, and ``SVC`` does not scale well with data dimension.\n", "In order to make the remaining tasks computationally viable, we'll reduce the dimension of the data\n", "\n", "- Use the ``PCA`` estimator to project the data down to 100 dimensions\n", "- re-compute the ``SVC`` cross-validation on this result. Is the score similar?\n", "\n", "Now we'll carry-on with the learning/validation curves using this projected data.\n", "\n", "- Construct validation curves for ``SVC`` on this (projected) data, using a linear kernel (``kernel='linear'``) and exploring the effect of ``C`` on the result. 
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## 2. Validation with Support Vector Machines\n",
      "\n",
      "The Support Vector Classifier is often a much more powerful model than random forests, especially for smaller datasets.\n",
      "\n",
      "Here we'll repeat the above exercise, but use ``sklearn.svm.SVC`` instead.\n",
      "The support vector classifier that we'll use below does not scale well with the data dimension, so partway through we'll reduce the dimensionality of the data.\n",
      "\n",
      "- Fit an ``SVC`` with the default parameters, and use 3-fold cross-validation to estimate its accuracy.\n",
      "\n",
      "You'll notice that this computation takes a relatively long time in comparison to the Random Forest Classifier.\n",
      "This is because the data is very high-dimensional, and ``SVC`` does not scale well with the number of features.\n",
      "In order to make the remaining tasks computationally viable, we'll reduce the dimension of the data:\n",
      "\n",
      "- Use the ``PCA`` estimator to project the data down to 100 dimensions.\n",
      "- Re-compute the ``SVC`` cross-validation on this result. Is the score similar?\n",
      "\n",
      "Now we'll carry on with the learning and validation curves using this projected data.\n",
      "\n",
      "- Construct validation curves for ``SVC`` on this (projected) data, using a linear kernel (``kernel='linear'``) and exploring the effect of ``C`` on the result. Note that the effect of ``C`` only becomes apparent over very large scales: you should try logarithmically-spaced values between, say, $10^{-10}$ and $10^{-1}$.\n",
      "- What is the optimal value for ``C``? What is the best score for this estimator? What is the score for this value of ``C`` on the original (un-projected) data? Is it much different from the score on the projected data?\n",
      "- Construct a learning curve for the Support Vector Machine using this value for ``C``.\n",
      "- Given the validation and learning curves, how do you think you could improve this classifier \u2013 should you seek a better model/more features, or should you seek more data?\n",
      "- Overall, how does this compare to the Random Forest Classifier?\n",
      "\n",
      "If you'd like a starting point, a minimal sketch of the PCA projection and cross-validation steps is given in the cells below."
     ]
    },
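    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Again, this is only a hedged sketch, not the reference solution: it projects the data to 100 dimensions with ``PCA``, re-runs the 3-fold cross-validation of a default ``SVC`` on the projected data, and computes a validation curve over the suggested range of ``C`` with a linear kernel. It assumes the cells above have been run, and it again imports from ``sklearn.model_selection`` (older releases: ``sklearn.cross_validation`` and ``sklearn.learning_curve``)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Sketch only: PCA down to 100 dimensions, then cross-validate an SVC.\n",
      "from sklearn.decomposition import PCA\n",
      "from sklearn.svm import SVC\n",
      "from sklearn.model_selection import cross_val_score, validation_curve\n",
      "\n",
      "# project the data down to 100 dimensions\n",
      "pca = PCA(n_components=100)\n",
      "X_proj = pca.fit_transform(X)\n",
      "print(X_proj.shape)\n",
      "\n",
      "# 3-fold cross-validation of a default SVC on the projected data\n",
      "scores = cross_val_score(SVC(), X_proj, y, cv=3)\n",
      "print(scores.mean())\n",
      "\n",
      "# validation curve over C for a linear kernel (logarithmic grid)\n",
      "C = np.logspace(-10, -1, 10)\n",
      "train_scores, val_scores = validation_curve(SVC(kernel='linear'), X_proj, y,\n",
      "                                            param_name='C', param_range=C, cv=3)\n",
      "plt.semilogx(C, train_scores.mean(axis=1), label='training score')\n",
      "plt.semilogx(C, val_scores.mean(axis=1), label='validation score')\n",
      "plt.xlabel('C')\n",
      "plt.ylabel('accuracy')\n",
      "plt.legend();"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}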