{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Nonlinear Classification and Regression with Decision Trees \n", "\n", "#### Decision trees \n", "\n", "Decision trees are commonly learned by recursively splitting the set of training\n", "instances into subsets based on the instances' values for the explanatory variables. \n", "\n", "In classification tasks, the leaf nodes\n", "of the decision tree represent classes. In regression tasks, the values of the response\n", "variable for the instances contained in a leaf node may be averaged to produce the\n", "estimate for the response variable. After the decision tree has been constructed,\n", "making a prediction for a test instance requires only following the edges until a\n", "leaf node is reached. \n", "\n", "Let's create a decision tree using an algorithm called Iterative Dichotomiser 3 (ID3).\n", "Invented by Ross Quinlan, ID3 was one of the first algorithms used to train decision\n", "trees. \n", "\n", "But how to choose the first variable on which we have to divide the data so that we can have smaller tree. \n", "\n", "Measured in bits, entropy quantifies the amount of uncertainty in a variable. Entropy\n", "is given by the following equation, where n is the number of outcomes and ( ) i P x is\n", "the probability of the outcome i. Common values for b are 2, e, and 10. Because the\n", "log of a number less than one will be negative, the entire sum is negated to return a\n", "positive value. \n", "\n", "**entropy** $$ H(X) = -\\sum_{i=1}^{n} P(x_i)log_b P(x_i) $$ \n", "\n", "\n", "#### Information gain \n", "Selecting the test that produces the subsets with the lowest average entropy can produce a suboptimal tree. \n", "we will measure the reduction in entropy using a metric called information gain. \n", "Calculated with the following equation, information gain is the difference between the entropy of the parent\n", "node, H (T ), and the weighted average of the children nodes' entropies. 
\n", "\n", "![](data/information_gain.png) \n", "\n", "\n", "For creating Decision Tree, Algo **ID3** is the one mostly used. **C4.5** is a modified version of ID3\n", "that can be used with continuous explanatory variables and can accommodate\n", "missing values for features. C4.5 also can prune trees. \n", "Pruning reduces the size of a tree by replacing branches that classify few instances with leaf nodes. Used by\n", "scikit-learn's implementation of decision trees, **CART** is another learning algorithm\n", "that supports pruning." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Gini impurity \n", "Gini impurity measures the proportions of classes in a set. Gini impurity\n", "is given by the following equation, where j is the number of classes, t is the subset\n", "of instances for the node, and P(i|t) is the probability of selecting an element of\n", "class i from the node's subset: \n", "\n", "$$ Gini (t) = 1 - \\sum_{i=1}^{j} P(i|t)^2 $$ \n", "\n", "Intuitively, Gini impurity is zero when all of the elements of the set are the same\n", "class, as the probability of selecting an element of that class is equal to one. Like\n", "entropy, Gini impurity is greatest when each class has an equal probability of being\n", "selected. 
The maximum value of Gini impurity depends on the number of possible\n", "classes, and it is given by the following equation:\n", "\n", "$$ Gini_{max} = 1 - \\frac{1}{n} $$" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# imports\n", "import pandas as pd\n", "\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.model_selection import GridSearchCV, train_test_split\n", "from sklearn.metrics import classification_report, confusion_matrix, accuracy_score\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.ensemble import RandomForestClassifier" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Anaconda3\\lib\\site-packages\\IPython\\core\\interactiveshell.py:2723: DtypeWarning: Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.\n", "  interactivity=interactivity, compiler=compiler, result=result)\n" ] } ], "source": [ "df = pd.read_csv(\"data/ad.data\", header=None)\n", "\n", "explanatory_variable_columns = set(df.columns.values)\n", "response_variable_column = df[len(df.columns.values) - 1]\n", "# the last column describes the targets\n", "explanatory_variable_columns.remove(len(df.columns.values) - 1)\n", "y = [1 if e == 'ad.' else 0 for e in response_variable_column]\n", "X = df[list(explanatory_variable_columns)]\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456789...1548154915501551155215531554155515561557
01251251.01000000...0000000000
1574688.21051000000...0000000000
2332306.96961000000...0000000000
3604687.81000000...0000000000
4604687.81000000...0000000000
5604687.81000000...0000000000
6594607.79661000000...0000000000
7602343.91000000...0000000000
8604687.81000000...0000000000
9604687.81000000...0000000000
10-1-1-11000000...0000000000
1190520.57771000000...0000000000
1290600.66661000000...0000000000
1390600.66661000000...0000000000
14332306.96961000000...0000000000
15604687.81000000...0000001100
16604687.80000000...0000001100
171251251.01000000...0000000000
18604687.81000000...0000001110
193058519.51000000...0000000000
2090600.66661000000...0000000000
2190600.66661000000...0000000000
2290600.66661000000...0000000000
2390600.66661000000...0000000000
24-1-1-11000000...0000000000
2590520.57771000000...0000000000
2690600.66661000000...0000000000
27604687.81000000...0000000000
28602343.90000000...0000000000
29602343.90000000...0000000000
..................................................................
3249-1-1-11000000...0000000000
3250-1-1-11000000...0000000000
325116161.01000000...0000000000
325224753.1251000000...0000000000
3253-1-1-11000000...0000000000
3254251004.01000000...0000000000
3255-1-1-11000000...0000000000
3256551753.18181000000...0000000000
3257-1-1-11000000...0000000000
3258-1-1-11000000...0000000000
32591060060.01000000...0000000000
326011645.81811000000...0000000000
3261-1-1-11000000...0000000000
32621502001.33331000000...0000000000
326316161.01000000...0000000000
32641341841.37310000000...0000000000
326523261.13041000000...0000000000
3266401303.251000000...0000000000
32671581921.21511000000...0000000000
3268251004.01000000...0000000000
3269-1-1-11000000...0000000000
3270-1-1-11000000...0000000000
3271-1-1-11000000...0000000000
32721061101.03771000000...0000000000
327330301.00000000...1000000000
3274170940.55290000000...0000000000
32751011401.38611000000...0000000000
3276231205.21731000000...0000000000
3277-1-1-11000000...0000000000
327840401.01000000...0000000000
\n", "

3279 rows × 1558 columns

\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8 9 ... 1548 \\\n", "0 125 125 1.0 1 0 0 0 0 0 0 ... 0 \n", "1 57 468 8.2105 1 0 0 0 0 0 0 ... 0 \n", "2 33 230 6.9696 1 0 0 0 0 0 0 ... 0 \n", "3 60 468 7.8 1 0 0 0 0 0 0 ... 0 \n", "4 60 468 7.8 1 0 0 0 0 0 0 ... 0 \n", "5 60 468 7.8 1 0 0 0 0 0 0 ... 0 \n", "6 59 460 7.7966 1 0 0 0 0 0 0 ... 0 \n", "7 60 234 3.9 1 0 0 0 0 0 0 ... 0 \n", "8 60 468 7.8 1 0 0 0 0 0 0 ... 0 \n", "9 60 468 7.8 1 0 0 0 0 0 0 ... 0 \n", "10 -1 -1 -1 1 0 0 0 0 0 0 ... 0 \n", "11 90 52 0.5777 1 0 0 0 0 0 0 ... 0 \n", "12 90 60 0.6666 1 0 0 0 0 0 0 ... 0 \n", "13 90 60 0.6666 1 0 0 0 0 0 0 ... 0 \n", "14 33 230 6.9696 1 0 0 0 0 0 0 ... 0 \n", "15 60 468 7.8 1 0 0 0 0 0 0 ... 0 \n", "16 60 468 7.8 0 0 0 0 0 0 0 ... 0 \n", "17 125 125 1.0 1 0 0 0 0 0 0 ... 0 \n", "18 60 468 7.8 1 0 0 0 0 0 0 ... 0 \n", "19 30 585 19.5 1 0 0 0 0 0 0 ... 0 \n", "20 90 60 0.6666 1 0 0 0 0 0 0 ... 0 \n", "21 90 60 0.6666 1 0 0 0 0 0 0 ... 0 \n", "22 90 60 0.6666 1 0 0 0 0 0 0 ... 0 \n", "23 90 60 0.6666 1 0 0 0 0 0 0 ... 0 \n", "24 -1 -1 -1 1 0 0 0 0 0 0 ... 0 \n", "25 90 52 0.5777 1 0 0 0 0 0 0 ... 0 \n", "26 90 60 0.6666 1 0 0 0 0 0 0 ... 0 \n", "27 60 468 7.8 1 0 0 0 0 0 0 ... 0 \n", "28 60 234 3.9 0 0 0 0 0 0 0 ... 0 \n", "29 60 234 3.9 0 0 0 0 0 0 0 ... 0 \n", "... ... ... ... ... ... ... ... ... ... ... ... ... \n", "3249 -1 -1 -1 1 0 0 0 0 0 0 ... 0 \n", "3250 -1 -1 -1 1 0 0 0 0 0 0 ... 0 \n", "3251 16 16 1.0 1 0 0 0 0 0 0 ... 0 \n", "3252 24 75 3.125 1 0 0 0 0 0 0 ... 0 \n", "3253 -1 -1 -1 1 0 0 0 0 0 0 ... 0 \n", "3254 25 100 4.0 1 0 0 0 0 0 0 ... 0 \n", "3255 -1 -1 -1 1 0 0 0 0 0 0 ... 0 \n", "3256 55 175 3.1818 1 0 0 0 0 0 0 ... 0 \n", "3257 -1 -1 -1 1 0 0 0 0 0 0 ... 0 \n", "3258 -1 -1 -1 1 0 0 0 0 0 0 ... 0 \n", "3259 10 600 60.0 1 0 0 0 0 0 0 ... 0 \n", "3260 11 64 5.8181 1 0 0 0 0 0 0 ... 0 \n", "3261 -1 -1 -1 1 0 0 0 0 0 0 ... 0 \n", "3262 150 200 1.3333 1 0 0 0 0 0 0 ... 0 \n", "3263 16 16 1.0 1 0 0 0 0 0 0 ... 0 \n", "3264 134 184 1.3731 0 0 0 0 0 0 0 ... 
0 \n", "3265 23 26 1.1304 1 0 0 0 0 0 0 ... 0 \n", "3266 40 130 3.25 1 0 0 0 0 0 0 ... 0 \n", "3267 158 192 1.2151 1 0 0 0 0 0 0 ... 0 \n", "3268 25 100 4.0 1 0 0 0 0 0 0 ... 0 \n", "3269 -1 -1 -1 1 0 0 0 0 0 0 ... 0 \n", "3270 -1 -1 -1 1 0 0 0 0 0 0 ... 0 \n", "3271 -1 -1 -1 1 0 0 0 0 0 0 ... 0 \n", "3272 106 110 1.0377 1 0 0 0 0 0 0 ... 0 \n", "3273 30 30 1.0 0 0 0 0 0 0 0 ... 1 \n", "3274 170 94 0.5529 0 0 0 0 0 0 0 ... 0 \n", "3275 101 140 1.3861 1 0 0 0 0 0 0 ... 0 \n", "3276 23 120 5.2173 1 0 0 0 0 0 0 ... 0 \n", "3277 -1 -1 -1 1 0 0 0 0 0 0 ... 0 \n", "3278 40 40 1.0 1 0 0 0 0 0 0 ... 0 \n", "\n", " 1549 1550 1551 1552 1553 1554 1555 1556 1557 \n", "0 0 0 0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 0 0 0 \n", "5 0 0 0 0 0 0 0 0 0 \n", "6 0 0 0 0 0 0 0 0 0 \n", "7 0 0 0 0 0 0 0 0 0 \n", "8 0 0 0 0 0 0 0 0 0 \n", "9 0 0 0 0 0 0 0 0 0 \n", "10 0 0 0 0 0 0 0 0 0 \n", "11 0 0 0 0 0 0 0 0 0 \n", "12 0 0 0 0 0 0 0 0 0 \n", "13 0 0 0 0 0 0 0 0 0 \n", "14 0 0 0 0 0 0 0 0 0 \n", "15 0 0 0 0 0 1 1 0 0 \n", "16 0 0 0 0 0 1 1 0 0 \n", "17 0 0 0 0 0 0 0 0 0 \n", "18 0 0 0 0 0 1 1 1 0 \n", "19 0 0 0 0 0 0 0 0 0 \n", "20 0 0 0 0 0 0 0 0 0 \n", "21 0 0 0 0 0 0 0 0 0 \n", "22 0 0 0 0 0 0 0 0 0 \n", "23 0 0 0 0 0 0 0 0 0 \n", "24 0 0 0 0 0 0 0 0 0 \n", "25 0 0 0 0 0 0 0 0 0 \n", "26 0 0 0 0 0 0 0 0 0 \n", "27 0 0 0 0 0 0 0 0 0 \n", "28 0 0 0 0 0 0 0 0 0 \n", "29 0 0 0 0 0 0 0 0 0 \n", "... ... ... ... ... ... ... ... ... ... 
\n", "3249 0 0 0 0 0 0 0 0 0 \n", "3250 0 0 0 0 0 0 0 0 0 \n", "3251 0 0 0 0 0 0 0 0 0 \n", "3252 0 0 0 0 0 0 0 0 0 \n", "3253 0 0 0 0 0 0 0 0 0 \n", "3254 0 0 0 0 0 0 0 0 0 \n", "3255 0 0 0 0 0 0 0 0 0 \n", "3256 0 0 0 0 0 0 0 0 0 \n", "3257 0 0 0 0 0 0 0 0 0 \n", "3258 0 0 0 0 0 0 0 0 0 \n", "3259 0 0 0 0 0 0 0 0 0 \n", "3260 0 0 0 0 0 0 0 0 0 \n", "3261 0 0 0 0 0 0 0 0 0 \n", "3262 0 0 0 0 0 0 0 0 0 \n", "3263 0 0 0 0 0 0 0 0 0 \n", "3264 0 0 0 0 0 0 0 0 0 \n", "3265 0 0 0 0 0 0 0 0 0 \n", "3266 0 0 0 0 0 0 0 0 0 \n", "3267 0 0 0 0 0 0 0 0 0 \n", "3268 0 0 0 0 0 0 0 0 0 \n", "3269 0 0 0 0 0 0 0 0 0 \n", "3270 0 0 0 0 0 0 0 0 0 \n", "3271 0 0 0 0 0 0 0 0 0 \n", "3272 0 0 0 0 0 0 0 0 0 \n", "3273 0 0 0 0 0 0 0 0 0 \n", "3274 0 0 0 0 0 0 0 0 0 \n", "3275 0 0 0 0 0 0 0 0 0 \n", "3276 0 0 0 0 0 0 0 0 0 \n", "3277 0 0 0 0 0 0 0 0 0 \n", "3278 0 0 0 0 0 0 0 0 0 \n", "\n", "[3279 rows x 1558 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#X.replace(to_replace=' *\\?', value=-1, regex=True, inplace=True)\n", "X.replace(['?'], [-1])" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pipeline = Pipeline([\n", "('clf', DecisionTreeClassifier(criterion='entropy'))\n", "])\n", "\n", "parameters = {\n", "'clf__max_depth': (150, 155, 160),\n", "'clf__min_samples_split': (1, 2, 3),\n", "'clf__min_samples_leaf': (1, 2, 3)\n", "}" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [], "source": [ "grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1')" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#grid_search.fit(X_train, y_train)" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print( 'Best score: %0.3f' % grid_search.best_score_)\n", "print( 'Best parameters set:')\n", "best_parameters = grid_search.best_estimator_.get_params()\n", "for param_name in sorted(parameters.keys()):\n", " print( '\\t%s: %r' % (param_name, best_parameters[param_name]))\n", " \n", "predictions = grid_search.predict(X_test)\n", "\n", "print ('Accuracy:', accuracy_score(y_test, predictions))\n", "print ('Confusion Matrix:', confusion_matrix(y_test, predictions))\n", "print ('Classification Report:', classification_report(y_test, predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Tree ensembles (RandomForestClassifier) \n", "\n", "Ensemble learning methods combine a set of models to produce an estimator that\n", "has better predictive performance than its individual components. A random forest\n", "is a collection of decision trees that have been trained on randomly selected subsets\n", "of the training instances and explanatory variables. Random forests usually make\n", "predictions by returning the mode or mean of the predictions of their constituent\n", "trees. 
\n", "\n", "Random forests are less prone to overfitting than decision trees because no single\n", "tree can learn from all of the instances and explanatory variables; no single tree can\n", "memorize all of the noise in the representation" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pipeline = Pipeline([\n", "('clf', RandomForestClassifier(criterion='entropy'))\n", "])\n", "\n", "parameters = {\n", "'clf__n_estimators': (5, 10, 20, 50),\n", "'clf__max_depth': (50, 150, 250),\n", "'clf__min_samples_split': (1, 2, 3),\n", "'clf__min_samples_leaf': (1, 2, 3)\n", "}" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [], "source": [ "grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1')\n", "#grid_search.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The advantages and disadvantages of decision trees \n", "\n", "Decision trees are easy to use. Unlike many learning\n", "algorithms, decision trees do not require the data to have zero mean and unit\n", "variance. While decision trees can tolerate missing values for explanatory variables,\n", "scikit-learn's current implementation cannot. Decision trees can even learn to ignore\n", "explanatory variables that are not relevant to the task. \n", "\n", "Small decision trees can be easy to interpret and visualize with the export_graphviz\n", "function from scikit-learn's tree module. The branches of a decision tree are\n", "conjunctions of logical predicates, and they are easily visualized as flowcharts.\n", "Decision trees support multioutput tasks, and a single decision tree can be used for\n", "multiclass classification without employing a strategy like one-versus-all. \n", "\n", "\n", "decision trees are eager learners. 
Eager learners\n", "must build an input-independent model from the training data before they can be\n", "used to estimate the values of test instances, but can predict relatively quickly once\n", "the model has been built. In contrast, lazy learners such as the k-nearest neighbors\n", "algorithm defer all generalization until they must make a prediction. Lazy learners\n", "do not spend time training, but often predict slowly compared to eager learners. \n", "\n", "Decision trees are more prone to overfitting than many of the models, Pruning is a common\n", "strategy that removes some of the tallest nodes and leaves of a decision tree but\n", "it is not currently implemented in scikit-learn. However, similar effects can be\n", "achieved by setting a maximum depth for the tree or by creating child nodes only\n", "when the number of training instances they will contain exceeds a threshold.\n", "\n", "Some of Algo are :\n", "ID3, C4.5, J4.5, RandomeForest\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }