{ "metadata": { "name": "05_Decision_Trees" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

# Decision Trees

\n", "

## Introduction

\n", "Decision trees divide the input space into some number of cuboid regions. **Every** input within a given region is assigned the *same output*. A decision tree model can be represented by a binary tree where \n", "\n", "* Each tree node tests one attribute from the input vector $X$\n", "\n", "* Each tree branch selects one value (or range) for the given attribute\n", "\n", "* Each leaf node predicts an output, Y\n", "\n", "Decision tree models cab be applied to both regression and classification problems, however the focus here will be on classification trees. Note also, that decision trees can represent **any** boolean function. The tree will have $2^N$ leaf nodes where $N$ is the number of boolean variables, however there are $2^{2^N}$ possible trees given $N$ variables. For other decision trees (i.e. for non-Boolean variables) the optimal number of leaves is unknown and the input vector selection attribute may be repeated at multiple nodes. \n", "\n", "The algorithms presented here are the so-called *CART* (classification and regression trees) methods. Another popular method, is the C5.0 (formally ID3) method presented by Quinlan. The resulting tree models will be binary decision trees (i.e. there are exactly 2 edges from each node except for leaf nodes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

## Classification Trees

\n", "Assume our training data consists of $N$ input vectors each with $p$ elements and an associateed single output from a set of $K$ classes, i.e. each input is of the form $X_i = \\\\{x_{i1}, x_{i2}, \\ldots , x_{ip}\\\\}$ and with an associated output $y_i \\in \\\\{c_1, c_2, \\ldots, c_K\\\\}$ where $i \\in \\\\{1 \\ldots N\\\\}$. Assume the decision tree model has $M$ leaf nodes, i.e. the input space is divided into $M$ regions, denoted $R_m$. Every input in the region $R_m$ will be assigned to class $k(m)$ where \n", "\n", "$k(m) = \\max_k p^e_{mk} = \\max_k\\frac{1}{N_m} \\sum_{x_i \\in R_m} I(y_i = k)$\n", "\n", "where $I$ returns 0 if $y_i=k$ and 0 otherwise, $p^e$ is the estimate of the probability of the observation being in class $k$ given the training data, and \n", "\n", "$N_m = num\\\\{x_i \\in R_m\\\\}$\n", "\n", "is the number of inputs from the training set in region $R_m$. NOTE: I would prefer to use $\\hat{p}$ instead of $p^e$ but there appears to be a bug in IPython Notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

### Growing The Tree

\n", "The standard approach to creating a decision tree model is to use a greedy algorithm. The tree is started with a single root node that splits the data into two regions based on the selection of a single input feature, $j$, and an associated splitting value, $s$. That is, $j$ is the choice of feature and $s$ is the value of that feature used to split the training data. From there new nodes are added by successively repeating this process for the training data subregions resulting from the previous split. This process is continued until some stopping criteria is reached. From there, the tree pruning is applied as a means to prevent overfitting the data. \n", "

\n", "\n", "

#### Split Criteria

\n", "The first task is to define how to select the split variable, $j$, and value $s$. This is done by minimizing an impurity measure, $Q_m(T)$, summed over the two nodes resulting from the split. Some of the more impurity measures for classification problems are\n", "\n", "Misclassification error: $\\frac{1}{N_m} \\sum_{i\\in R_m} I \\left( y_i \\ne k(m) \\right) = 1 - p^e_{mk}(m)$\n", "\n", "Gini Index: $\\sum_{k \\ne k'} p^e_{mk} p^e_{mk'} = \\sum_{k=1}^{K} p^e_{mk} \\left(1 - p^e_{mk} \\right)$\n", "\n", "Cross-entropy: $- \\sum_{k=1}^K p^e_{mk} \\log p^e_{mk}$\n", "\n", "Given a choice of $Q_m$, it is a straightforward matter of iterating over each feature, $j$, and split value $s$ to determine the optimal choice. Note that even if a feature is continuous, implying an infinite space of possible split values, the training data set, from which $s$ must be selected, is finite.\n", "\n", "The stopping criteria is generally chosen as some number, $D$, of training values associated with the leaf nodes. That is, if any region $R_m$ contains greater than $D$ inputs from the training data, continue the splitting process.\n", "

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TODO** \n", "

#### Pruning

\n", "Given a starting tree, $T_0$, pruning is done by minimizing the cost complexity function\n", "\n", "$C_{\\lambda}(T) = \\sum_{m=1}^{|T|} N_m Q_m(T) + \\lambda |T|$\n", "\n", "where $T \\subset T_0$, $|T|$ is the number of leaf nodes in $T$, and $\\lambda$ is a tuning parameter.\n", "\n", "Reduced-Error Pruning 1. Split data inot training and validation set 2. Create tree that classifies training set data Prune until further pruning is harmful 1. Evaluate impact on validation data of pruning each possible node 2. Greedily remove the node that most improves the validation accuracy\n", "\n", "Rule Post-Pruning 1. Convert tree to equivalent set of rules 2. Prune each rule independently of others 3. Sort final rules into desired sequence for use" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "