{ "metadata": { "name": "07.2_in_depth_trees_and_forests" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Estimators In Depth: Trees and Forests" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%pylab inline" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we'll explore a class of algorithms based on Decision trees.\n", "Decision trees at their root (Ha!) are extremely intuitive. They\n", "encode a series of binary choices in a process that parallels how\n", "a person might classify things themselves, but using an information criterion\n", "to decide which question is most fruitful at each step. For example, if\n", "you wanted to create a guide to identifying an animal found in nature, you\n", "might ask the following series of questions:\n", "\n", "- Is the animal bigger or smaller than a meter long?\n", " + *bigger*: does the animal have horns?\n", " - *yes*: are the horns longer than ten centimeters?\n", " - *no*: is the animal wearing a collar\n", " + *smaller*: does the animal have two or four legs?\n", " - *two*: does the animal have wings?\n", " - *four*: does the animal have a bushy tail?\n", "\n", "and so on. This binary splitting of questions is the essence of a decision tree." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Decision trees, and the related boosted trees and random forest estimators,\n", "are powerful non-parametric approaches that are broadly useful. Non-parametric\n", "approaches are useful when the underlying structure or model of the data is\n", "unknown.\n", "\n", "For example, imagine you wanted to derive a model for some periodic function\n", "having to do with tides. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "Decision trees, and the related boosted tree and random forest estimators,\n", "are powerful non-parametric approaches with broad applicability. Non-parametric\n", "approaches are especially valuable when the underlying structure or model of the data is\n", "unknown.\n", "\n", "For example, imagine you wanted to derive a model for some periodic function\n", "having to do with tides. We'll mimic such data as the sum of two sine\n", "waves plus noise:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# noisy observations of the sum of two sine waves\n", "rng = np.random.RandomState(0)\n", "N = 100\n", "X = np.linspace(0, 6, N)[:, np.newaxis]\n", "error = 0.4\n", "y_true = np.sin(X).ravel() + np.sin(6 * X).ravel()\n", "y_noisy = y_true + rng.normal(0, error, X.shape[0])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "plt.plot(X.ravel(), y_true, color='gray')\n", "plt.plot(X.ravel(), y_noisy, '.k')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This data looks relatively complicated, but if we had the intuition to know that it is\n", "simply the combination of a small number of sine waves, we could work in this sparse\n", "Fourier representation and apply a fast linear estimator.\n", "\n", "Taking the FFT of the data gives us two peaks in frequency, indicating that the representation\n", "is sparse in this basis:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from scipy import fftpack\n", "plt.plot(fftpack.fftfreq(len(y_noisy))[:N // 2], abs(fftpack.fft(y_noisy))[:N // 2])\n", "plt.xlim(0, None)\n", "plt.xlabel('frequency')\n", "plt.ylabel('Fourier power')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This shows how important **intuition**, especially physical intuition, can be when\n", "tackling a learning problem on real-world data.\n", "If you blindly throw algorithms at a data set, you\n", "might miss a simple insight that leads to a sparse representation which\n", "is much more meaningful.\n", "\n", "But suppose we don't have any intuition that would lead to representing the data\n", "in a sparse basis. In this case we can benefit from using a non-parametric estimator\n", "to fit the data. One well-known and powerful non-parametric estimator is the\n", "Decision Tree." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Decision Tree Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A decision tree is a simple binary tree of splits which, in spirit, behaves much like a\n", "nearest-neighbor estimator: its prediction is constant within each region it learns.\n", "It can be used as follows:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# shuffle the data so that later cross-validation folds are not ordered\n", "i = np.random.permutation(X.shape[0])\n", "X = X[i]\n", "y_noisy = y_noisy[i]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.tree import DecisionTreeRegressor\n", "clf = DecisionTreeRegressor(max_depth=5)\n", "clf.fit(X, y_noisy)\n", "\n", "X_fit = np.linspace(0, 6, 1000).reshape((-1, 1))\n", "y_fit_1 = clf.predict(X_fit)\n", "\n", "plt.plot(X_fit.ravel(), y_fit_1, color='blue')\n", "plt.plot(X.ravel(), y_noisy, '.k')" ], "language": "python", "metadata": {}, "outputs": [] }
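, { "cell_type": "markdown", "metadata": {}, "source": [ "The shape of this fit depends strongly on the ``max_depth`` parameter. As a quick sketch,\n", "we can compare a very shallow tree with a very deep tree on the same data:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Sketch: a very shallow tree (high bias) vs. a very deep tree (high variance)\n", "for depth, color in [(2, 'green'), (20, 'red')]:\n", "    tree = DecisionTreeRegressor(max_depth=depth)\n", "    tree.fit(X, y_noisy)\n", "    plt.plot(X_fit.ravel(), tree.predict(X_fit), color=color,\n", "             label='max_depth=%i' % depth)\n", "plt.plot(X.ravel(), y_noisy, '.k')\n", "plt.legend()" ], "language": "python", "metadata": {}, "outputs": [] }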
, { "cell_type": "markdown", "metadata": {}, "source": [ "A single decision tree allows us to estimate the signal in a non-parametric way,\n", "but clearly has some issues. In some regions, the model shows high bias and\n", "under-fits the data\n", "(seen in the long flat lines which don't follow the contours of the data),\n", "while in other regions the model shows high variance and over-fits the data\n", "(reflected in the narrow spikes which are influenced by noise in individual points).\n", "\n", "One way to address this is to combine many such trees, so that the\n", "effects of their over-fitting average out. This approach is known as\n", "*Random Forests*." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Random Forests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we will use a random forest of 200 trees to average away the over-fitting\n", "of the individual trees." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.ensemble import RandomForestRegressor\n", "clf = RandomForestRegressor(n_estimators=200, max_depth=5)\n", "clf.fit(X, y_noisy)\n", "\n", "y_fit_200 = clf.predict(X_fit)\n", "\n", "plt.plot(X_fit.ravel(), y_fit_200, color='blue')\n", "plt.plot(X.ravel(), y_noisy, '.k')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Selecting the Optimal Estimator via Cross-Validation" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import grid_search  # in newer scikit-learn, GridSearchCV lives in sklearn.model_selection\n", "\n", "rf = RandomForestRegressor()\n", "parameters = {'n_estimators': [200, 300, 400],\n", "              'max_depth': [5, 7, 9]}\n", "\n", "# Warning: be sure your data is shuffled before using GridSearch!\n", "clf_grid = grid_search.GridSearchCV(rf, parameters)\n", "clf_grid.fit(X, y_noisy)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "rf_best = clf_grid.best_estimator_\n", "X_fit = np.linspace(0, 6, 1000).reshape((-1, 1))\n", "y_fit_best = rf_best.predict(X_fit)\n", "\n", "print rf_best.n_estimators, rf_best.max_depth\n", "\n", "plt.plot(X_fit.ravel(), y_fit_best, color='blue')\n", "plt.plot(X.ravel(), y_noisy, '.k')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Another Option: Gradient Boosting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another ensemble method that can be useful is *Boosting*: here, rather than\n", "fitting (say) 200 estimators in parallel, we construct a chain of 200 estimators,\n", "each of which iteratively refines the results of the previous one.\n", "The idea is that by sequentially applying very fast, simple models, we can obtain a\n", "combined model whose error is smaller than that of any of the individual pieces." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.ensemble import GradientBoostingRegressor\n", "clf = GradientBoostingRegressor(n_estimators=200, max_depth=2)\n", "clf.fit(X, y_noisy)\n", "\n", "y_fit_200 = clf.predict(X_fit)\n", "\n", "plt.plot(X_fit.ravel(), y_fit_200, color='blue')\n", "plt.plot(X.ravel(), y_noisy, '.k')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Exercise: Cross-validating Gradient Boosting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use a grid search to optimize ``n_estimators`` and ``max_depth`` for a Gradient Boosted\n", "Decision Tree."
] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plug this optimal ``max_depth`` into a *single* decision tree. Does this single tree over-fit or under-fit the data?" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Repeat this for the Random Forest. Construct a single decision tree using the ``max_depth``\n", "which is optimal for the Random Forest. Does this single tree over-fit or under-fit the data?" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Solution" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%load solutions/07B_grid_search.py" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }