{ "metadata": { "name": "", "signature": "sha256:877b87c4b57aa739dfdb6bb1e8a4eea764166f4a5cb38a435e6fae88d246d3c7" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Ensembling\n", "\n", "*Adapted from Chapter 8 of [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/)*\n", "\n", "Let's pretend that instead of building a single model to solve a classification problem, you created **five independent models**, and each model was correct 70% of the time. If you combined these models into an \"ensemble\" and used their majority vote as a prediction, how often would the ensemble be correct?\n", "\n", "Let's simulate it to find out!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "\n", "# set a seed for reproducibility\n", "np.random.seed(1234)\n", "\n", "# generate 1000 random numbers (between 0 and 1) for each model, representing 1000 observations\n", "mod1 = np.random.rand(1000)\n", "mod2 = np.random.rand(1000)\n", "mod3 = np.random.rand(1000)\n", "mod4 = np.random.rand(1000)\n", "mod5 = np.random.rand(1000)\n", "\n", "# each model independently predicts 1 (the \"correct response\") if random number was at least 0.3\n", "preds1 = np.where(mod1 > 0.3, 1, 0)\n", "preds2 = np.where(mod2 > 0.3, 1, 0)\n", "preds3 = np.where(mod3 > 0.3, 1, 0)\n", "preds4 = np.where(mod4 > 0.3, 1, 0)\n", "preds5 = np.where(mod5 > 0.3, 1, 0)\n", "\n", "# print the first 20 predictions from each model\n", "print preds1[:20]\n", "print preds2[:20]\n", "print preds3[:20]\n", "print preds4[:20]\n", "print preds5[:20]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[0 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1]\n", "[1 1 1 1 1 1 1 0 1 0 0 0 1 1 1 0 1 0 0 0]\n", "[1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1]\n", "[1 1 0 0 0 0 1 1 0 1 1 1 1 1 1 0 1 1 1 0]\n", "[0 0 1 0 0 0 1 0 1 0 0 0 1 1 1 1 1 1 
1 1]\n" ] } ], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "# add the predictions together\n", "sum_of_preds = preds1 + preds2 + preds3 + preds4 + preds5\n", "\n", "# ensemble predicts 1 (the \"correct response\") if at least 3 models predict 1\n", "ensemble_preds = np.where(sum_of_preds >= 3, 1, 0)\n", "\n", "# print the ensemble's first 20 predictions\n", "print ensemble_preds[:20]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[1 1 1 1 0 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1]\n" ] } ], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "# how accurate was the ensemble?\n", "ensemble_preds.mean()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ "0.84099999999999997" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Amazing, right? The ensemble was correct about 84% of the time, even though each individual model was correct only about 70% of the time.\n", "\n", "**Ensemble learning (or \"ensembling\")** is simply the process of combining several models to solve a prediction problem, with the goal of producing a combined model that is more accurate than any individual model. For **classification** problems, the combination is often done by majority vote.
For **regression** problems, the combination is often done by taking an average of the predictions.\n", "\n", "For ensembling to work well, the individual models must meet two conditions:\n", "\n", "- Models should be **accurate** (they must outperform random guessing)\n", "- Models should be **independent** (their predictions are not correlated with one another)\n", "\n", "The idea, then, is that if you have a collection of individually imperfect (and independent) models, the \"one-off\" mistakes made by each model are probably not going to be made by the rest of the models, and thus the mistakes will be discarded when averaging the models.\n", "\n", "It turns out that as you add more models that meet these two conditions to the voting process, the probability of error decreases. This is known as [Condorcet's Jury Theorem](http://en.wikipedia.org/wiki/Condorcet%27s_jury_theorem), which was developed by a French political scientist in the 18th century.\n", "\n", "Anyway, we'll see examples of ensembling below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bootstrapping\n", "\n", "**Some preliminary terminology:** In statistics, \"bootstrapping\" refers to the process of using \"bootstrap samples\" to quantify the uncertainty of a model. Bootstrap samples are simply random samples with replacement:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# set a seed for reproducibility\n", "np.random.seed(1)\n", "\n", "# create an array of 0 to 9, then sample 10 times with replacement\n", "np.random.choice(a=10, size=10, replace=True)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ "array([5, 8, 9, 5, 0, 0, 1, 7, 6, 9])" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bagging\n", "\n", "On their own, decision trees are not competitive with the best supervised learning methods in terms of **predictive accuracy**.
However, they can be used as the basis for more sophisticated methods that have much higher accuracy!\n", "\n", "One of the main issues with decision trees is **high variance**, meaning that different splits in the training data can lead to very different trees. **\"Bootstrap aggregation\" (aka \"bagging\")** is a general-purpose procedure for reducing the variance of a machine learning method, but is particularly useful for decision trees.\n", "\n", "What is the bagging process (in general)?\n", "\n", "- Take repeated bootstrap samples (random samples with replacement) from the training data set\n", "- Train our method on each bootstrapped training set and make predictions\n", "- Average the predictions\n", "\n", "This increases predictive accuracy by **reducing the variance**, similar to how cross-validation reduces the variance associated with the test set approach (for estimating out-of-sample error) by splitting many times and averaging the results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Applying bagging to decision trees\n", "\n", "So how exactly can bagging be used with decision trees? Here's how it applies to **regression trees**:\n", "\n", "- Grow B regression trees using B bootstrapped training sets\n", "- Grow each tree deep so that each one has low bias\n", "- Every tree makes a numeric prediction, and the predictions are averaged (to reduce the variance)\n", "\n", "It is applied in a similar fashion to **classification trees**, except that during the prediction stage, the overall prediction is based upon a majority vote of the trees.\n", "\n", "**What value should be used for B?** Simply use a large enough value that the error seems to have stabilized.
(Choosing a value of B that is \"too large\" will generally not lead to overfitting.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Manually implementing bagged decision trees (with B=3)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "\n", "# read in vehicle data\n", "vehicles = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT4/master/data/used_vehicles.csv')\n", "\n", "# convert car to 0 and truck to 1\n", "vehicles['type'] = vehicles.type.map({'car':0, 'truck':1})\n", "\n", "# print out data\n", "vehicles" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ " price year miles doors type\n", "0 22000 2012 13000 2 0\n", "1 14000 2010 30000 2 0\n", "2 13000 2010 73500 4 0\n", "3 9500 2009 78000 4 0\n", "4 9000 2007 47000 4 0\n", "5 4000 2006 124000 2 0\n", "6 3000 2004 177000 4 0\n", "7 2000 2004 209000 4 1\n", "8 3000 2003 138000 2 0\n", "9 1900 2003 160000 4 0\n", "10 2500 2003 190000 2 1\n", "11 5000 2001 62000 4 0\n", "12 1800 1999 163000 2 1\n", "13 1300 1997 138000 4 0" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "# calculate the number of rows in vehicles\n", "n_rows = vehicles.shape[0]\n", "\n", "# set a seed for reproducibility\n", "np.random.seed(123)\n", "\n", "# create three bootstrap samples (will be used to select rows from the DataFrame)\n", "sample1 = np.random.choice(a=n_rows, size=n_rows, replace=True)\n", "sample2 = np.random.choice(a=n_rows, size=n_rows, replace=True)\n", "sample3 = np.random.choice(a=n_rows, size=n_rows, replace=True)\n", "\n", "# print samples\n", "print sample1\n", "print sample2\n", "print sample3" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[13 2 12 2 6 1 3 10 11 9 6 1 0 1]\n", "[ 9 0 0 9 3 13 4 0 0 4 1 7 3 2]\n", "[ 4 7 2 4 8 13 0 7 9 3 12 12 4 6]\n" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "# use sample1 to select rows from DataFrame\n", "print vehicles.iloc[sample1, :]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " price year miles doors type\n", "13 1300 1997 138000 4 0\n", "2 13000 2010 73500 4 0\n", "12 1800 1999 163000 2 1\n", "2 13000 2010 73500 4 0\n", "6 3000 2004 177000 4 0\n", "1 14000 2010 30000 2 0\n", "3 9500 2009 78000 4 0\n", "10 2500 2003 190000 2 1\n", "11 5000 2001 62000 4 0\n", "9 1900 2003 160000 4 0\n", "6 3000 2004 177000 4 0\n", "1 14000 2010 30000 2 0\n", "0 22000 2012 
13000 2 0\n", "1 14000 2010 30000 2 0\n" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.tree import DecisionTreeRegressor\n", "\n", "# grow one regression tree with each bootstrapped training set\n", "treereg1 = DecisionTreeRegressor(random_state=123)\n", "treereg1.fit(vehicles.iloc[sample1, 1:], vehicles.iloc[sample1, 0])\n", "\n", "treereg2 = DecisionTreeRegressor(random_state=123)\n", "treereg2.fit(vehicles.iloc[sample2, 1:], vehicles.iloc[sample2, 0])\n", "\n", "treereg3 = DecisionTreeRegressor(random_state=123)\n", "treereg3.fit(vehicles.iloc[sample3, 1:], vehicles.iloc[sample3, 0])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ "DecisionTreeRegressor(compute_importances=None, criterion='mse',\n", " max_depth=None, max_features=None, max_leaf_nodes=None,\n", " min_density=None, min_samples_leaf=1, min_samples_split=2,\n", " random_state=123, splitter='best')" ] } ], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "# read in out-of-sample data\n", "oos = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT4/master/data/used_vehicles_oos.csv')\n", "\n", "# convert car to 0 and truck to 1\n", "oos['type'] = oos.type.map({'car':0, 'truck':1})\n", "\n", "# print data\n", "oos" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ " price year miles doors type\n", "0 3000 2003 130000 4 1\n", "1 6000 2005 82500 4 0\n", "2 12000 2010 60000 2 0" ] } ], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "# select feature columns (every column except for the 0th column)\n", "feature_cols = vehicles.columns[1:]\n", "\n", "# make predictions on out-of-sample data\n", "preds1 = treereg1.predict(oos[feature_cols])\n", "preds2 = treereg2.predict(oos[feature_cols])\n", "preds3 = treereg3.predict(oos[feature_cols])\n", "\n", "# print predictions\n", "print preds1\n", "print preds2\n", "print preds3" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[ 1300. 5000. 14000.]\n", "[ 1300. 1300. 13000.]\n", "[ 2000. 9000. 13000.]\n" ] } ], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "# average predictions and compare to actual values\n", "print (preds1 + preds2 + preds3)/3\n", "print oos.price.values" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[ 1533.33333333 5100. 13333.33333333]\n", "[ 3000 6000 12000]\n" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Estimating out-of-sample error\n", "\n", "Bagged models have a very nice property: **out-of-sample error can be estimated without using the test set approach or cross-validation!**\n", "\n", "Here's how the out-of-sample estimation process works with bagged trees:\n", "\n", "- On average, each bagged tree uses about two-thirds of the observations. **For each tree, the remaining observations are called \"out-of-bag\" observations.**\n", "- For the first observation in the training data, predict its response using **only** the trees in which that observation was out-of-bag. 
Average those predictions (for regression) or take a majority vote (for classification).\n", "- Repeat this process for every observation in the training data.\n", "- Compare all predictions to the actual responses in order to compute a mean squared error or classification error. This is known as the **out-of-bag error**.\n", "\n", "**When B is sufficiently large, the out-of-bag error is an accurate estimate of out-of-sample error.**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# set is a data structure used to identify unique elements\n", "print set(range(14))\n", "\n", "# only show the unique elements in sample1\n", "print set(sample1)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13])\n", "set([0, 1, 2, 3, 6, 9, 10, 11, 12, 13])\n" ] } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "# use the \"set difference\" to identify the out-of-bag observations for each tree\n", "print sorted(set(range(14)) - set(sample1))\n", "print sorted(set(range(14)) - set(sample2))\n", "print sorted(set(range(14)) - set(sample3))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[4, 5, 7, 8]\n", "[5, 6, 8, 10, 11, 12]\n", "[1, 5, 10, 11]\n" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thus, we would predict the response for **observation 4** by using tree 1 (because it is only out-of-bag for tree 1). We would predict the response for **observation 5** by averaging the predictions from trees 1, 2, and 3 (since it is out-of-bag for all three trees). We would repeat this process for all observations, and then calculate the MSE using those predictions." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Estimating variable importance\n", "\n", "Although bagging **increases predictive accuracy**, it **decreases model interpretability** because it's no longer possible to visualize the tree to understand the importance of each variable.\n", "\n", "However, we can still obtain an overall summary of \"variable importance\" from bagged models:\n", "\n", "- To compute variable importance for bagged regression trees, we can calculate the **total amount that the mean squared error is decreased due to splits over a given predictor, averaged over all trees**.\n", "- A similar process is used for bagged classification trees, except we use the Gini index instead of the mean squared error.\n", "\n", "(We'll see an example of this below.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Random Forests\n", "\n", "Random Forests is a **slight variation of bagged trees** that has even better performance! Here's how it works:\n", "\n", "- Exactly like bagging, we create an ensemble of decision trees using bootstrapped samples of the training set.\n", "- However, when building each tree, **each time a split is considered**, a random sample of m predictors is chosen as split candidates from the full set of p predictors. **The split is only allowed to use one of those m predictors.**\n", "\n", "Notes:\n", "\n", "- A new random sample of predictors is chosen for **every single tree at every single split**.\n", "- For **classification**, m is typically chosen to be the square root of p. For **regression**, m is typically chosen to be somewhere between p/3 and p.\n", "\n", "What's the point?\n", "\n", "- Suppose there is one very strong predictor in the data set. 
When using bagged trees, most of the trees will use that predictor as the top split, resulting in an ensemble of similar trees that are \"highly correlated\".\n", "- Averaging highly correlated quantities does not significantly reduce variance (which is the entire goal of bagging).\n", "- **By randomly leaving out candidate predictors from each split, Random Forests \"decorrelates\" the trees**, such that the averaging process can reduce the variance of the resulting model." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# read in the Titanic data\n", "titanic = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT4/master/data/titanic.csv')\n", "\n", "# encode sex feature\n", "titanic['sex'] = titanic.sex.map({'female':0, 'male':1})\n", "\n", "# fill in missing values for age\n", "titanic.age.fillna(titanic.age.mean(), inplace=True)\n", "\n", "# create three dummy variables, drop the first dummy variable, and store this as a DataFrame\n", "embarked_dummies = pd.get_dummies(titanic.embarked, prefix='embarked').iloc[:, 1:]\n", "\n", "# concatenate the two dummy variable columns onto the original DataFrame\n", "# note: axis=0 means rows, axis=1 means columns\n", "titanic = pd.concat([titanic, embarked_dummies], axis=1)\n", "\n", "# create a list of feature columns\n", "feature_cols = ['pclass', 'sex', 'age', 'embarked_Q', 'embarked_S']\n", "\n", "# print the updated DataFrame\n", "titanic.head(10)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ " survived pclass name sex \\\n", "0 0 3 Braund, Mr. Owen Harris 1 \n", "1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 \n", "2 1 3 Heikkinen, Miss. Laina 0 \n", "3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 \n", "4 0 3 Allen, Mr. William Henry 1 \n", "5 0 3 Moran, Mr. James 1 \n", "6 0 1 McCarthy, Mr. Timothy J 1 \n", "7 0 3 Palsson, Master. Gosta Leonard 1 \n", "8 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 0 \n", "9 1 2 Nasser, Mrs. Nicholas (Adele Achem) 0 \n", "\n", " age sibsp parch ticket fare cabin embarked \\\n", "0 22.000000 1 0 A/5 21171 7.2500 NaN S \n", "1 38.000000 1 0 PC 17599 71.2833 C85 C \n", "2 26.000000 0 0 STON/O2. 3101282 7.9250 NaN S \n", "3 35.000000 1 0 113803 53.1000 C123 S \n", "4 35.000000 0 0 373450 8.0500 NaN S \n", "5 29.699118 0 0 330877 8.4583 NaN Q \n", "6 54.000000 0 0 17463 51.8625 E46 S \n", "7 2.000000 3 1 349909 21.0750 NaN S \n", "8 27.000000 0 2 347742 11.1333 NaN S \n", "9 14.000000 1 0 237736 30.0708 NaN C \n", "\n", " embarked_Q embarked_S \n", "0 0 1 \n", "1 0 0 \n", "2 0 1 \n", "3 0 1 \n", "4 0 1 \n", "5 1 0 \n", "6 0 1 \n", "7 0 1 \n", "8 0 1 \n", "9 0 0 " ] } ], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "# import class, instantiate estimator, fit with all data\n", "from sklearn.ensemble import RandomForestClassifier\n", "rfclf = RandomForestClassifier(n_estimators=100, max_features='auto', oob_score=True, random_state=1)\n", "rfclf.fit(titanic[feature_cols], titanic.survived)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 15, "text": [ "RandomForestClassifier(bootstrap=True, compute_importances=None,\n", " criterion='gini', max_depth=None, max_features='auto',\n", " max_leaf_nodes=None, min_density=None, min_samples_leaf=1,\n", " min_samples_split=2, n_estimators=100, n_jobs=1,\n", " oob_score=True, random_state=1, 
verbose=0)" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are the most important tuning parameters for Random Forests:\n", "\n", "- **n_estimators:** more estimators (trees) generally improve performance but slow down training and prediction\n", "- **max_features:** the number of predictors (m) considered at each split; cross-validate to choose an ideal value" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# compute the feature importances\n", "pd.DataFrame({'feature':feature_cols, 'importance':rfclf.feature_importances_})" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ " feature importance\n", "0 pclass 0.160553\n", "1 sex 0.366700\n", "2 age 0.434528\n", "3 embarked_Q 0.012129\n", "4 embarked_S 0.026089" ] } ], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "# compute the out-of-bag classification accuracy\n", "rfclf.oob_score_" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 17, "text": [ "0.80022446689113358" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrapping up ensembling\n", "\n", "Ensembling is incredibly popular when the **primary goal is predictive accuracy**. For example, the team that eventually won the $1 million [Netflix Prize](http://en.wikipedia.org/wiki/Netflix_Prize) used an [ensemble of 107 models](http://www2.research.att.com/~volinsky/papers/chance.pdf) early on in the competition.\n", "\n", "There was a recent paper in the Journal of Machine Learning Research titled \"[Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?](http://jmlr.csail.mit.edu/papers/volume15/delgado14a/delgado14a.pdf)\" (**Spoiler alert:** Random Forests did very well.) In the [comments about the paper](https://news.ycombinator.com/item?id=8719723) on Hacker News, Ben Hamner (Kaggle's chief scientist) said the following:\n", "\n", "> This is consistent with our experience running hundreds of Kaggle competitions: for most classification problems, some variation on ensembled decision trees (random forests, gradient boosted machines, etc.) performs the best. This is typically in conjunction with clever data processing, feature selection, and internal validation.\n", "\n", "> One key exception is where the data is richly and hierarchically structured. Text, speech, and visual data falls under this category. In many cases here, variations of neural networks (deep neural nets/CNN's/RNN's/etc.)
provide very dramatic improvements.\n", "\n", "But as you can imagine, ensembling is often not practical in a real-time environment, because every model in the ensemble has to produce a prediction.\n", "\n", "**You can also build your own ensembles:** just build a variety of models and average them together! Here are some strategies for building independent models:\n", "\n", "- using different models\n", "- choosing different combinations of features\n", "- changing the tuning parameters\n", "\n", "Note that there is an entire class of well-known ensembling methods that we did not discuss, namely **boosting**. In boosting, instead of building independent models and averaging the predictions, the models are built sequentially on repeatedly modified versions of the data. More information is available in the scikit-learn documentation on Ensemble Methods, namely the sections on [AdaBoost](http://scikit-learn.org/stable/modules/ensemble.html#adaboost) and [Gradient Tree Boosting](http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resources\n", "\n", "- scikit-learn documentation: [Ensemble Methods](http://scikit-learn.org/stable/modules/ensemble.html)\n", "- Quora: [How do random forests work in layman's terms?](http://www.quora.com/How-do-random-forests-work-in-laymans-terms/answer/Edwin-Chen-1)" ] } ], "metadata": {} } ] }