{ "metadata": { "name": "", "signature": "sha256:84213b66dc238f70d6ee451ea986e0e6f1b761b35f059af8bb7d29774323efba" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to data analysis using machine learning #\n", "\n", "## 06. Classification with Decision Trees ##\n", "\n", "by David Taylor, [www.prooffreader.com](http://www.prooffreader.com) (blog) [www.dtdata.io](http://dtdata.io) (hire me!)\n", "\n", "For links to more material including a slideshow explaining all this stuff in further detail, please see the front page of [this GitHub repo.](https://github.com/Prooffreader/intro_machine_learning)\n", "\n", "This is notebook 6 of 8. The next notebook is: [[07. Classification with Random Forest]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/07_Classification_Random_Forest.ipynb)\n", "\n", "[[01]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/01_The_Dataset.ipynb) [[02]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/02_Clustering_KMeans.ipynb) [[03]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/03_Clustering_OtherAlgos.ipynb) [[04]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/04_Classification_kNN.ipynb) [[05]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/05_Classification_OtherAlgos.ipynb) **[06]** [[07]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/07_Classification_Random_Forest.ipynb) [[08]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/08_Dimensionality_Reduction.ipynb)\n", "\n", "***\n", "\n", "We look further at one of the classification algorithms we saw in the [previous 
notebook](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/05_Classification_OtherAlgos.ipynb). Again, this is an algorithm that is not used much in practice, but it is very intuitive, which makes it useful for beginners. Don't worry, the [next algorithm](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/07_Classification_Random_Forest.ipynb) is one that's used a lot!\n", "\n", "For the first time, we encounter an algorithm that is convenient to visualize with all five of our features, not just the two we can see in a scatter plot." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1. Import libraries and load data #" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "from sklearn import tree\n", "from io import StringIO  # sklearn.externals.six is deprecated; use the standard library instead\n", "import re\n", "\n", "df = pd.read_csv('fruit.csv')\n", "fruitnames = {1: 'Orange', 2: 'Pear', 3: 'Apple'}\n", "colors = {1: '#e09028', 2: '#55aa33', 3: '#cc3333'}\n", "fruitlist = ['Orange', 'Pear', 'Apple']\n", "\n", "df.sort_values('fruit_id', inplace=True)  # sort() is deprecated; sorting is important because the\n", " # factorizer assigns numbers based on the order labels are first encountered, e.g. if the first\n", " # instance had fruit = 3, the y value would be 0.\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. Classify with a Decision Tree and view Confusion Matrix #\n", "\n", "With all five features used, the decision tree should be a perfect or near-perfect classifier on the testing set, as the confusion matrix will show."
] }, { "cell_type": "code", "collapsed": false, "input": [ "df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 # randomly assign ~75% of rows to the training set\n", "train, test = df[df['is_train']], df[~df['is_train']]\n", "features = ['color_id', 'elongatedness', 'weight', 'sweetness', 'acidity']\n", "y, _ = pd.factorize(train['fruit_id']) # factorized labels are 0, 1, 2 for fruit_ids 1, 2, 3\n", "clf = tree.DecisionTreeClassifier()\n", "clf = clf.fit(train[features], y)\n", "preds = clf.predict(test[features])\n", "test_result = pd.crosstab(np.array([fruitnames[x] for x in test['fruit_id']]),\n", " np.array([fruitnames[x+1] for x in preds]), # x+1 maps each factorized label back to its fruit_id\n", " rownames=['actual'], colnames=['predicted'])\n", "test_result" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr>\n", "      <th>predicted</th>\n", "      <th>Apple</th>\n", "      <th>Orange</th>\n", "      <th>Pear</th>\n", "    </tr>\n", "    <tr>\n", "      <th>actual</th>\n", "      <th></th>\n", "      <th></th>\n", "      <th></th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>Apple</th>\n", "      <td>11</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "    </tr>\n", "    <tr>\n", "      <th>Orange</th>\n", "      <td>1</td>\n", "      <td>13</td>\n", "      <td>0</td>\n", "    </tr>\n", "    <tr>\n", "      <th>Pear</th>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>15</td>\n", "    </tr>\n", "