{ "metadata": { "name": "", "signature": "sha256:d5b62b84ad3a1a1ddd8ff00ba9ff55ce7124cb5bc8f82e16b5a5c18f03ed2506" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Decision Trees\n", "\n", "*Adapted from Chapter 8 of [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/)*\n", "\n", "||continuous|categorical|\n", "|---|---|---|\n", "|**supervised**|**regression**|**classification**|\n", "|**unsupervised**|dimension reduction|clustering|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regression trees\n", "\n", "Let's look at a simple example to motivate our learning.\n", "\n", "Our goal is to **predict a baseball player's Salary** based on **Years** (number of years playing in the major leagues) and **Hits** (number of hits he made in the previous year). Here is the training data, represented visually (low salary is blue/green, high salary is red/yellow):" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**How might you \"stratify\" or \"segment\" the feature space into regions, based on salary?** Intuitively, you want to **maximize** the similarity (or \"homogeneity\") within a given region, and **minimize** the similarity between different regions.\n", "\n", "Below is a regression tree that has been fit to the data by a computer. (We will talk later about how the fitting algorithm actually works.) Note that Salary is measured in thousands and has been log-transformed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**How do we make Salary predictions (for out-of-sample data) using a decision tree?**\n", "\n", "- Start at the top, and examine the first \"splitting rule\" (Years < 4.5).\n", "- If the rule is True for a given player, follow the left branch. If the rule is False, follow the right branch.\n", "- Continue until reaching the bottom. The predicted Salary is the number in that particular \"bucket\".\n", "- *Side note:* Years and Hits are both integers, but the convention is to label these rules using the midpoint between adjacent values.\n", "\n", "Examples predictions:\n", "\n", "- Years=3, then predict 5.11 ($\\$1000 \\times e^{5.11} \\approx \\$166000$)\n", "- Years=5 and Hits=100, then predict 6.00 ($\\$1000 \\times e^{6.00} \\approx \\$403000$)\n", "- Years=8 and Hits=120, then predict 6.74 ($\\$1000 \\times e^{6.74} \\approx \\$846000$)\n", "\n", "**How did we come up with the numbers at the bottom of the tree?** Each number is just the **mean Salary in the training data** of players who fit that criteria. Here's the same diagram as before, split into the three regions:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This diagram is essentially a combination of the two previous diagrams (except that the observations are no longer color-coded). In $R_1$, the mean log Salary was 5.11. In $R_2$, the mean log Salary was 6.00. In $R_3$, the mean log Salary was 6.74. 
Thus, those values are used to predict out-of-sample data.\n", "\n", "Let's introduce some terminology:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**How might you interpret the \"meaning\" of this tree?**\n", "\n", "- Years is the most important factor determining Salary, with a lower number of Years corresponding to a lower Salary.\n", "- For a player with a lower number of Years, Hits is not an important factor determining Salary.\n", "- For a player with a higher number of Years, Hits is an important factor determining Salary, with a greater number of Hits corresponding to a higher Salary.\n", "\n", "What we have seen so far hints at the advantages and disadvantages of decision trees:\n", "\n", "**Advantages:**\n", "\n", "- Highly interpretable\n", "- Can be displayed graphically\n", "- Prediction is fast\n", "\n", "**Disadvantages:**\n", "\n", "- Predictive accuracy is not as high as some supervised learning methods\n", "- Can easily overfit the training data (high variance)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building a regression tree by hand\n", "\n", "How do you build a decision tree? You're going to find out by building one in pairs!\n", "\n", "Your training data is a tiny dataset of [used vehicle sale prices](https://raw.githubusercontent.com/justmarkham/DAT4/master/data/used_vehicles.csv). Your goal is to predict Price for out-of-sample data. Here are your instructions:\n", "\n", "- Read the data into Pandas.\n", "- Explore the data by sorting, plotting, or split-apply-combine (aka `groupby`).\n", "- Decide which feature is the most important predictor, and use that to make your first split. (Only binary splits are allowed!)\n", "- After making your first split, you should actually split your data in Pandas into two parts, and then explore each part to figure out what other splits to make.\n", "- Stop making splits once you are convinced that your tree strikes a good balance between underfitting and overfitting. (As always, your goal is to build a model that generalizes well!)\n", "- You are allowed to split on the same variable multiple times!\n", "- Draw your tree, making sure to label your leaves with the mean Price for the observations in that \"bucket\".\n", "- When you're finished, review your tree to make sure nothing is backwards. (Remember: follow the left branch if the rule is true, and follow the right branch if the rule is false.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How does a computer build a regression tree?\n", "\n", "The ideal approach would be for the computer to consider every possible partition of the feature space. However, this is computationally infeasible, so instead an approach called **recursive binary splitting** is used:\n", "\n", "- Begin at the top of the tree.\n", "- For every single predictor, examine every possible cutpoint, and choose the predictor and cutpoint such that the resulting tree has the **lowest possible mean squared error (MSE)**. Make that split.\n", "- Repeat the examination for the two resulting regions, and again make a single split (in one of the regions) to minimize the MSE.\n", "- Keep repeating this process until a stopping criterion is met.\n", "\n", "**How does it know when to stop?**\n", "\n", "1. We could define a stopping criterion, such as a **maximum depth** of the tree or the **minimum number of samples in the leaf**.\n", "2. 
We could grow the tree deep, and then \"prune\" it back using a method such as \"cost complexity pruning\" (aka \"weakest link pruning\").\n", "\n", "Method 2 involves setting a tuning parameter that penalizes the tree for having too many leaves. As the parameter is increased, branches automatically get pruned from the tree, resulting in smaller and smaller trees. The tuning parameter can be selected through cross-validation.\n", "\n", "Note: **Method 2 is not currently supported by scikit-learn**, and so we will use Method 1 instead.\n", "\n", "Here's an example of an **unpruned tree**, and a comparison of the training, test, and cross-validation errors for trees with different numbers of leaves:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the **training error** continues to go down as the tree size increases, but the lowest **cross-validation error** occurs for a tree with 3 leaves." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building a regression tree in scikit-learn" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# import pandas\n", "import pandas as pd\n", "\n", "# read in vehicle data\n", "vehicles = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT4/master/data/used_vehicles.csv')\n", "\n", "# print out data\n", "vehicles" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 1, "text": [ " price year miles doors type\n", "0 22000 2012 13000 2 car\n", "1 14000 2010 30000 2 car\n", "2 13000 2010 73500 4 car\n", "3 9500 2009 78000 4 car\n", "4 9000 2007 47000 4 car\n", "5 4000 2006 124000 2 car\n", "6 3000 2004 177000 4 car\n", "7 2000 2004 209000 4 truck\n", "8 3000 2003 138000 2 car\n", "9 1900 2003 160000 4 car\n", "10 2500 2003 190000 2 truck\n", "11 5000 2001 62000 4 car\n", "12 1800 1999 163000 2 truck\n", "13 1300 1997 138000 4 car" ] } ], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "# convert car to 0 and truck to 1\n", "vehicles['type'] = vehicles.type.map({'car':0, 'truck':1})" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "# select feature columns (every column except for the 0th column)\n", "feature_cols = vehicles.columns[1:]\n", "\n", "# define X (features) and y (response)\n", "X = vehicles[feature_cols]\n", "y = vehicles.price" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "# split into train/test\n", "from sklearn.cross_validation import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "# print out each of the arrays\n", "print X_train\n", "print y_train\n", "print X_test\n", "print y_test" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[ 2003 190000 2 1]\n", " [ 2007 47000 4 0]\n", " [ 2010 30000 2 0]\n", " [ 1999 163000 2 1]\n", " [ 2012 13000 2 0]\n", " [ 1997 138000 4 0]\n", " [ 2003 160000 4 0]\n", " [ 2003 138000 2 0]\n", " [ 2001 62000 4 0]\n", " [ 2006 124000 2 0]]\n", "[ 2500 9000 14000 1800 22000 1300 1900 3000 5000 4000]\n", "[[ 2009 78000 4 0]\n", " [ 2004 209000 4 1]\n", " [ 2004 177000 4 0]\n", " [ 2010 73500 4 0]]\n", "[ 9500 2000 3000 13000]\n" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "# import class, instantiate estimator, fit with training set\n", "from sklearn.tree import DecisionTreeRegressor\n", "treereg = DecisionTreeRegressor(random_state=1)\n", "treereg.fit(X_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 6, "text": [ "DecisionTreeRegressor(compute_importances=None, criterion='mse',\n", " max_depth=None, max_features=None, max_leaf_nodes=None,\n", " min_density=None, min_samples_leaf=1, min_samples_split=2,\n", " random_state=1, splitter='best')" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "# make predictions\n", "preds = treereg.predict(X_test)\n", "\n", "# print predictions and actual values\n", "print preds\n", "print y_test" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[ 5000. 1900. 1900. 
5000.]\n", "[ 9500 2000 3000 13000]\n" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "# print RMSE\n", "from sklearn import metrics\n", "import numpy as np\n", "np.sqrt(metrics.mean_squared_error(y_test, preds))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ "4622.4993239588475" ] } ], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "# use cross-validation to find best max_depth\n", "from sklearn.cross_validation import cross_val_score" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "# try max_depth=2\n", "treereg = DecisionTreeRegressor(max_depth=2, random_state=1)\n", "scores = cross_val_score(treereg, X, y, cv=3, scoring='mean_squared_error')\n", "np.mean(np.sqrt(-scores))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ "4804.3767888427128" ] } ], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "# try max_depth=3\n", "treereg = DecisionTreeRegressor(max_depth=3, random_state=1)\n", "scores = cross_val_score(treereg, X, y, cv=3, scoring='mean_squared_error')\n", "np.mean(np.sqrt(-scores))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ "4592.1554255755254" ] } ], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "# try max_depth=4\n", "treereg = DecisionTreeRegressor(max_depth=4, random_state=1)\n", "scores = cross_val_score(treereg, X, y, cv=3, scoring='mean_squared_error')\n", "np.mean(np.sqrt(-scores))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "4704.0052694797387" ] } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "# max_depth=3 was best, so fit a tree using that parameter with ALL DATA\n", "treereg = DecisionTreeRegressor(max_depth=3, random_state=1)\n", "treereg.fit(X, y)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ "DecisionTreeRegressor(compute_importances=None, criterion='mse', max_depth=3,\n", " max_features=None, max_leaf_nodes=None, min_density=None,\n", " min_samples_leaf=1, min_samples_split=2, random_state=1,\n", " splitter='best')" ] } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "# compute the \"Gini importance\" of each feature: the (normalized) total reduction of MSE brought by that feature\n", "pd.DataFrame({'feature':feature_cols, 'importance':treereg.feature_importances_})" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ " feature importance\n", "0 year 0.798744\n", "1 miles 0.201256\n", "2 doors 0.000000\n", "3 type 0.000000" ] } ], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "# create a Graphviz file\n", "from sklearn.tree import export_graphviz\n", "with open(\"15_vehicles.dot\", 'wb') as f:\n", " f = export_graphviz(treereg, out_file=f, feature_names=feature_cols)\n", "\n", "# at the command line, run this to convert to PNG:\n", "# dot -Tpng 15_vehicles.dot -o 15_vehicles.png" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Interpreting a tree diagram\n", "\n", "How do we read this decision tree?\n", "\n", "**Internal nodes:**\n", "\n", "- \"samples\" is the number of observations in that node before splitting\n", "- \"mse\" is the mean squared error calculated by comparing the actual response values in that node against the mean response value in that node\n", "- first line is the condition used to split that node (go left if true, go right if false)\n", "\n", "**Leaves:**\n", "\n", "- \"samples\" is the number of observations in that node\n", "- \"value\" is the mean response value in that node\n", "- \"mse\" is the mean squared error calculated by comparing the actual response values in that node against \"value\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predicting for out-of-sample data\n", "\n", "How accurate is scikit-learn's regression tree at predicting the out-of-sample data?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# read in out-of-sample data\n", "oos = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT4/master/data/used_vehicles_oos.csv')\n", "\n", "# convert car to 0 and truck to 1\n", "oos['type'] = oos.type.map({'car':0, 'truck':1})\n", "\n", "# print data\n", "oos" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ " price year miles doors type\n", "0 3000 2003 130000 4 1\n", "1 6000 2005 82500 4 0\n", "2 12000 2010 60000 2 0" ] } ], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "# define X and y\n", "X_oos = oos[feature_cols]\n", "y_oos = oos.price" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "# make predictions on out-of-sample data\n", "preds = treereg.predict(X_oos)\n", "\n", "# print predictions and actual values\n", "print preds\n", "print y_oos.values" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[ 4000. 5000. 13500.]\n", "[ 3000 6000 12000]\n" ] } ], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "# print RMSE\n", "np.sqrt(metrics.mean_squared_error(y_oos, preds))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ "1190.2380714238084" ] } ], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "# print RMSE for the tree you created!\n", "your_preds = [4000, 5000, 13500]\n", "np.sqrt(metrics.mean_squared_error(y_oos, your_preds))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 20, "text": [ "1190.2380714238084" ] } ], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classification trees\n", "\n", "Classification trees are very similar to regression trees. Here is a quick comparison:\n", "\n", "|regression trees|classification trees|\n", "|---|---|\n", "|predict a continuous response|predict a categorical response|\n", "|predict using mean response of each leaf|predict using most commonly occuring class of each leaf|\n", "|splits are chosen to minimize MSE|splits are chosen to minimize a different criterion (discussed below)|\n", "\n", "Note that classification trees easily handle **more than two response classes**! (How have other classification models we've seen handled this scenario?)\n", "\n", "Here's an **example of a classification tree**, which predicts whether or not a patient who presented with chest pain has heart disease:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Splitting criteria for classification trees\n", "\n", "Here are common options for the splitting criteria:\n", "\n", "- **classification error rate:** fraction of training observations in a region that don't belong to the most common class\n", "- **Gini index:** measure of total variance across classes in a region\n", "- **cross-entropy:** numerically similar to Gini index, but uses logarithms\n", "\n", "Which to use?\n", "\n", "- When growing a tree, Gini index and cross-entropy are better measures of \"node purity\" than classification error rate. 
The Gini index is faster to compute than cross-entropy, so it is generally preferred (and is used by scikit-learn by default).\n", "- When pruning a tree, classification error rate is preferable in order to maximize predictive accuracy.\n", "\n", "Why do some splits result in leaves with the same predicted class?\n", "\n", "- The split was performed to increase node purity, even though it didn't reduce the classification error.\n", "- Node purity is important because we're interested in the class proportions among the observations in each region." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Handling categorical predictors\n", "\n", "Some implementations of classification trees will allow you to handle categorical predictors **without creating dummy variables**. When splitting on a categorical predictor, they will try splitting on **every possible combination of categories** to find the best split. In the example above, \"ChestPain:bc\" means that the left-hand branch consists of observations with the second and third ChestPain categories, and the right-hand branch consists of remaining observations.\n", "\n", "**Unfortunately, scikit-learn's classification tree implementation does not support this approach.** Instead, here's how you can handle categorical predictors:\n", "\n", "- If a predictor only has **two possible values**, code it as a single binary variable (0 or 1). Since it's treated as a number, splits will naturally occur at 0.5.\n", "- If a predictor has **three or more possible values that are ordered**, code it as a single variable (1, 2, 3, etc). Splits will naturally occur at 1.5, 2.5, etc.\n", "- If a predictor has **three or more possible values that are unordered**, create dummy variables and drop one level as usual. The decision tree won't know that the dummy variables are related to one another, but that shouldn't matter in terms of predictive accuracy.\n", "- If a predictor has **thousands of possible unordered values**, then it may be best to code it as a single variable (1, 2, 3, etc) instead of using dummy variables to minimize the size of the resulting model. ([reference](http://stackoverflow.com/a/18736132/1636598))\n", "\n", "We'll see examples of these strategies below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building a classification tree in scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll build a classification tree using the [Titanic data](https://www.kaggle.com/c/titanic-gettingStarted/data) provided by Kaggle." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# read in the data\n", "titanic = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT4/master/data/titanic.csv')\n", "titanic.head(10)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ " survived pclass name \\\n", "0 0 3 Braund, Mr. Owen Harris \n", "1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... \n", "2 1 3 Heikkinen, Miss. Laina \n", "3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) \n", "4 0 3 Allen, Mr. William Henry \n", "5 0 3 Moran, Mr. James \n", "6 0 1 McCarthy, Mr. Timothy J \n", "7 0 3 Palsson, Master. Gosta Leonard \n", "8 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) \n", "9 1 2 Nasser, Mrs. Nicholas (Adele Achem) \n", "\n", " sex age sibsp parch ticket fare cabin embarked \n", "0 male 22 1 0 A/5 21171 7.2500 NaN S \n", "1 female 38 1 0 PC 17599 71.2833 C85 C \n", "2 female 26 0 0 STON/O2. 3101282 7.9250 NaN S \n", "3 female 35 1 0 113803 53.1000 C123 S \n", "4 male 35 0 0 373450 8.0500 NaN S \n", "5 male NaN 0 0 330877 8.4583 NaN Q \n", "6 male 54 0 0 17463 51.8625 E46 S \n", "7 male 2 3 1 349909 21.0750 NaN S \n", "8 female 27 0 2 347742 11.1333 NaN S \n", "9 female 14 1 0 237736 30.0708 NaN C " ] } ], "prompt_number": 21 }, { "cell_type": "code", "collapsed": false, "input": [ "# look for missing values\n", "titanic.isnull().sum()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 22, "text": [ "survived 0\n", "pclass 0\n", "name 0\n", "sex 0\n", "age 177\n", "sibsp 0\n", "parch 0\n", "ticket 0\n", "fare 0\n", "cabin 687\n", "embarked 2\n", "dtype: int64" ] } ], "prompt_number": 22 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's choose our response and a few features, and decide whether we need to adjust them:\n", "\n", "- **survived:** This is our response, and is already encoded as 0=died and 1=survived.\n", "- **pclass:** These are the passenger class categories (1=first class, 2=second class, 3=third class). They are ordered, so we'll leave them as-is.\n", "- **sex:** This is a binary category, so we should encode as 0=female and 1=male.\n", "- **age:** We need to fill in the missing values.\n", "- **embarked:** This is the port they emarked from. There are three unordered categories, so we'll create dummy variables." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# encode sex feature\n", "titanic['sex'] = titanic.sex.map({'female':0, 'male':1})\n", "\n", "# fill in missing values for age\n", "titanic.age.fillna(titanic.age.mean(), inplace=True)\n", "\n", "# print the updated DataFrame\n", "titanic.head(10)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 23, "text": [ " survived pclass name sex \\\n", "0 0 3 Braund, Mr. Owen Harris 1 \n", "1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 \n", "2 1 3 Heikkinen, Miss. Laina 0 \n", "3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 \n", "4 0 3 Allen, Mr. William Henry 1 \n", "5 0 3 Moran, Mr. James 1 \n", "6 0 1 McCarthy, Mr. Timothy J 1 \n", "7 0 3 Palsson, Master. Gosta Leonard 1 \n", "8 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 0 \n", "9 1 2 Nasser, Mrs. Nicholas (Adele Achem) 0 \n", "\n", " age sibsp parch ticket fare cabin embarked \n", "0 22.000000 1 0 A/5 21171 7.2500 NaN S \n", "1 38.000000 1 0 PC 17599 71.2833 C85 C \n", "2 26.000000 0 0 STON/O2. 3101282 7.9250 NaN S \n", "3 35.000000 1 0 113803 53.1000 C123 S \n", "4 35.000000 0 0 373450 8.0500 NaN S \n", "5 29.699118 0 0 330877 8.4583 NaN Q \n", "6 54.000000 0 0 17463 51.8625 E46 S \n", "7 2.000000 3 1 349909 21.0750 NaN S \n", "8 27.000000 0 2 347742 11.1333 NaN S \n", "9 14.000000 1 0 237736 30.0708 NaN C " ] } ], "prompt_number": 23 }, { "cell_type": "code", "collapsed": false, "input": [ "# create three dummy variables using get_dummies\n", "pd.get_dummies(titanic.embarked, prefix='embarked').head(10)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 24, "text": [ " embarked_C embarked_Q embarked_S\n", "0 0 0 1\n", "1 1 0 0\n", "2 0 0 1\n", "3 0 0 1\n", "4 0 0 1\n", "5 0 1 0\n", "6 0 0 1\n", "7 0 0 1\n", "8 0 0 1\n", "9 1 0 0" ] } ], "prompt_number": 24 }, { "cell_type": "code", "collapsed": false, "input": [ "# create three dummy variables, drop the first dummy variable, and store this as a DataFrame\n", "embarked_dummies = pd.get_dummies(titanic.embarked, prefix='embarked').iloc[:, 1:]\n", "\n", "# concatenate the two dummy variable columns onto the original DataFrame\n", "# note: axis=0 means rows, axis=1 means columns\n", "titanic = pd.concat([titanic, embarked_dummies], axis=1)\n", "\n", "# print the updated DataFrame\n", "titanic.head(10)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 25, "text": [ " survived pclass name sex \\\n", "0 0 3 Braund, Mr. Owen Harris 1 \n", "1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 \n", "2 1 3 Heikkinen, Miss. Laina 0 \n", "3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 \n", "4 0 3 Allen, Mr. William Henry 1 \n", "5 0 3 Moran, Mr. James 1 \n", "6 0 1 McCarthy, Mr. Timothy J 1 \n", "7 0 3 Palsson, Master. Gosta Leonard 1 \n", "8 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 0 \n", "9 1 2 Nasser, Mrs. Nicholas (Adele Achem) 0 \n", "\n", " age sibsp parch ticket fare cabin embarked \\\n", "0 22.000000 1 0 A/5 21171 7.2500 NaN S \n", "1 38.000000 1 0 PC 17599 71.2833 C85 C \n", "2 26.000000 0 0 STON/O2. 3101282 7.9250 NaN S \n", "3 35.000000 1 0 113803 53.1000 C123 S \n", "4 35.000000 0 0 373450 8.0500 NaN S \n", "5 29.699118 0 0 330877 8.4583 NaN Q \n", "6 54.000000 0 0 17463 51.8625 E46 S \n", "7 2.000000 3 1 349909 21.0750 NaN S \n", "8 27.000000 0 2 347742 11.1333 NaN S \n", "9 14.000000 1 0 237736 30.0708 NaN C \n", "\n", " embarked_Q embarked_S \n", "0 0 1 \n", "1 0 0 \n", "2 0 1 \n", "3 0 1 \n", "4 0 1 \n", "5 1 0 \n", "6 0 1 \n", "7 0 1 \n", "8 0 1 \n", "9 0 0 " ] } ], "prompt_number": 25 }, { "cell_type": "code", "collapsed": false, "input": [ "# create a list of feature columns\n", "feature_cols = ['pclass', 'sex', 'age', 'embarked_Q', 'embarked_S']\n", "\n", "# define X and y\n", "X = titanic[feature_cols]\n", "y = titanic.survived" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 26 }, { "cell_type": "code", "collapsed": false, "input": [ "# fit a classification tree with max_depth=3 on all data\n", "from sklearn.tree import DecisionTreeClassifier\n", "treeclf = DecisionTreeClassifier(max_depth=3, random_state=1)\n", "treeclf.fit(X, y)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 27, "text": [ "DecisionTreeClassifier(compute_importances=None, criterion='gini',\n", " max_depth=3, max_features=None, max_leaf_nodes=None,\n", " min_density=None, min_samples_leaf=1, min_samples_split=2,\n", " random_state=1, splitter='best')" ] } ], "prompt_number": 27 }, { "cell_type": "code", "collapsed": false, "input": [ "# create a Graphviz file\n", "with open(\"15_titanic.dot\", 'wb') as f:\n", " f = export_graphviz(treeclf, out_file=f, feature_names=feature_cols)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 28 }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the split in the bottom right, which was made only to increase node purity." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# compute the feature importances\n", "pd.DataFrame({'feature':feature_cols, 'importance':treeclf.feature_importances_})" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 29, "text": [ " feature importance\n", "0 pclass 0.242664\n", "1 sex 0.655584\n", "2 age 0.064494\n", "3 embarked_Q 0.000000\n", "4 embarked_S 0.037258" ] } ], "prompt_number": 29 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrapping up decision trees\n", "\n", "Here are some advantages and disadvantages of decision trees that we haven't yet talked about:\n", "\n", "**Advantages:**\n", "\n", "- Can be specified as a series of rules, and are thought to more closely approximate human decision-making than other models\n", "- Non-parametric (will do better than linear regression if relationship between predictors and response is highly non-linear)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Disadvantages:**\n", "\n", "- Small variations in the data can result in a completely different tree\n", "- Recursive binary splitting makes \"locally optimal\" decisions that may not result in a globally optimal tree\n", "- Can create biased trees if the classes are highly imbalanced\n", "\n", "Note that there is not just one decision tree algorithm; instead, there are many variations. A few common decision tree algorithms that are often referred to by name are C4.5, C5.0, and CART. (More details are available in the [scikit-learn documentation](http://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart).) scikit-learn uses an \"optimized version\" of CART." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resources\n", "\n", "- scikit-learn documentation: [Decision Trees](http://scikit-learn.org/stable/modules/tree.html)" ] } ], "metadata": {} } ] }