{ "metadata": { "name": "", "signature": "sha256:672c7d02621c234ccdc6ec679a1ab230c08d263193cf036d815a822fba313a3f" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Breakout: Model Validation\n", "\n", "Here we'll practice the process of model validation, and evaluating how we can improve our model. We'll return to the *Labeled Faces in the Wild* dataset that we saw previously, and use the cross-validation techniques we covered to find the best possible model." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from scipy import stats\n", "\n", "# use seaborn plotting defaults\n", "# If this causes an error, you can comment it out.\n", "import seaborn as sns\n", "sns.set()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import fetch_lfw_people\n", "faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)\n", "\n", "X, y = faces.data, faces.target" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Validation with Random Forests\n", "\n", "- Use a ``RandomForestClassifier`` with the default parameters, and use 10-fold cross-validation to determine the optimal accuracy.\n", "- Construct validation curves for the random forest classifier on this data, exploring the effect of ``max_depth`` on the result.\n", "- What is the best value for ``max_depth`` (approximately)? What is the best score for this estimator?\n", "- Construct a learning curve for the Random Forest Classifier using this value for ``max_depth``.\n", "- Given the validation and learning curves, how do you think you could improve this classifier \u2013 should you seek a better model/more features, or should you seek more data?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Validation with Support Vector Machines\n", "\n", "The Support Vector Classifier is often a much more powerful model than random forests, especially for smaller datasets.\n", "\n", "Here we'll repeat the above exercise, but use ``sklearn.svm.SVC`` instead.\n", "The support vector classifier that we'll use below does not scale well with data dimension. For this reason, we'll start by doing a dimensionality reduction of the data.\n", "\n", "- Use the ``SVC`` with the default parameters, and use 3-fold cross-validation to determine the optimal accuracy.\n", "\n", "You'll notice that this computation takes a relatively long time in comparison to the Random Forest Classifier.\n", "This is because the data has a very high dimension, and ``SVC`` does not scale well with data dimension.\n", "In order to make the remaining tasks computationally viable, we'll reduce the dimension of the data\n", "\n", "- Use the ``PCA`` estimator to project the data down to 100 dimensions\n", "- re-compute the ``SVC`` cross-validation on this result. Is the score similar?\n", "\n", "Now we'll carry-on with the learning/validation curves using this projected data.\n", "\n", "- Construct validation curves for ``SVC`` on this (projected) data, using a linear kernel (``kernel='linear'``) and exploring the effect of ``C`` on the result. Note that the effect of ``C`` only changes on very large scales: you should try logarithmically-spaced values between, say $10^{-10}$ and $10^-1$.\n", "- What is the optimal value for ``C``? What is the best score for this estimator? What is the score for this value if you use the entire dataset? Is this much different than for the projected data?\n", "- Construct a learning curve for the Support Vector Machine using this value for ``max_depth``.\n", "- Given the validation and learning curves, how do you think you could improve this classifier \u2013 should you seek a better model/more features, or should you seek more data?\n", "- Overall, how does this compare to the Random Forest Classifier?" ] } ], "metadata": {} } ] }