{ "metadata": { "name": "", "signature": "sha256:63fd463d06ca725cacea9d6caa81981f0c791d4b95f4fdb09d09484770b6d833" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to data analysis using machine learning #\n", "\n", "## 07. Classification with Random Forest ##\n", "\n", "by David Taylor, [www.prooffreader.com](http://www.prooffreader.com) (blog) [www.dtdata.io](http://dtdata.io) (hire me!)\n", "\n", "For links to more material including a slideshow explaining all this stuff in further detail, please see the front page of [this GitHub repo.](https://github.com/Prooffreader/intro_machine_learning)\n", "\n", "This is notebook 7 of 8. The next notebook is: [[08. Dimensionality reduction with PCA]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/08_Dimensionality_Reduction.ipynb)\n", "\n", "[[01]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/01_The_Dataset.ipynb) [[02]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/02_Clustering_KMeans.ipynb) [[03]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/03_Clustering_OtherAlgos.ipynb) [[04]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/04_Classification_kNN.ipynb) [[05]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/05_Classification_OtherAlgos.ipynb) [[06]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/06_Classification_Decision_Trees.ipynb) **[07]** [[08]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/08_Dimensionality_Reduction.ipynb)\n", "\n", "***\n", "\n", "In the [previous notebook](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/06_Classification_Decision_Trees.ipynb), we saw Decision Trees; now we look at an ensemble method, Random Forest." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1. Import libraries and load data #" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "from matplotlib.colors import ListedColormap\n", "from sklearn import neighbors\n", "from sklearn.ensemble import RandomForestClassifier\n", "import time\n", "\n", "df = pd.read_csv('fruit.csv')\n", "fruitnames = {1: 'Orange', 2: 'Pear', 3: 'Apple'}\n", "colors = {1: '#e09028', 2: '#55aa33', 3: '#cc3333'}\n", "fruitlist = ['Orange', 'Pear', 'Apple']\n", "\n", "df.sort('fruit_id', inplace=True) # This is important because the factorizer assigns numbers\n", " # based on the order the first label is encountered, e.g. if the first instance had\n", " # fruit = 3, the y value would be 0.\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. 
Train a Random Forest Classifier and show a confusion matrix of the test set #" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 # randomly assign training and testing set\n", "train, test = df[df['is_train']==True], df[df['is_train']==False]\n", "features = ['color_id', 'elongatedness', 'weight', 'sweetness', 'acidity']\n", "y, _ = pd.factorize(train['fruit_id'])\n", "clf = RandomForestClassifier(n_jobs=2)\n", "clf = clf.fit(train[features], y)\n", "preds = clf.predict(test[features])\n", "test_result = pd.crosstab(np.array([fruitnames[x] for x in test['fruit_id']]), \n", " np.array([fruitnames[x+1] for x in preds]),\n", " rownames=['actual'], colnames=['predicted'])\n", "test_result" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
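The split-and-score step in the cell above can also be written with scikit-learn's own helpers, which makes the run reproducible and skips the `factorize` bookkeeping. A minimal sketch, assuming the `df` and `features` objects defined above and a scikit-learn recent enough to have `sklearn.model_selection` (very old versions kept `train_test_split` in `sklearn.cross_validation`); the `random_state` and `n_estimators` values are arbitrary choices, not the notebook's:

```python
# Minimal sketch, not the notebook's original code: the same single train/test
# experiment using scikit-learn's split and confusion-matrix helpers.
# Assumes df and features from the cells above; random_state values are arbitrary.
from sklearn.model_selection import train_test_split   # sklearn.cross_validation in very old versions
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df['fruit_id'], test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, n_jobs=2, random_state=0)
clf.fit(X_train, y_train)

# Rows are actual classes, columns are predicted classes, both ordered by
# fruit_id (1, 2, 3), i.e. Orange, Pear, Apple in this dataset.
print(confusion_matrix(y_test, clf.predict(X_test)))
```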
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ "predicted Apple Orange Pear\n", "actual \n", "Apple 15 0 0\n", "Orange 0 11 0\n", "Pear 1 0 18" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Show feature importances" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for i, score in enumerate(list(clf.feature_importances_)):\n", " print(round(100*score, 1), features[i])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "2.2 color_id\n", "53.1 elongatedness\n", "7.0 weight\n", "24.8 sweetness\n", "12.9 acidity\n" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Show confusion matrix of 100 runs of Random Forest Classifier using all features" ] }, { "cell_type": "code", "collapsed": false, "input": [ "reps=100\n", "features=['color_id', 'elongatedness', 'weight', 'sweetness', 'acidity']\n", "title_suffix='with all features'\n", "\n", "start = time.time()\n", "for i in range(reps):\n", " df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 # randomly assign training and testing set\n", " train, test = df[df['is_train']==True], df[df['is_train']==False]\n", " y, _ = pd.factorize(train['fruit_id'])\n", " clf = RandomForestClassifier(n_jobs=2)\n", " clf = clf.fit(train[features], y)\n", " preds = clf.predict(test[features])\n", " test_result = pd.crosstab(np.array([fruitnames[x] for x in test['fruit_id']]), \n", " np.array([fruitnames[x+1] for x in preds]), rownames=['actual'], colnames=['predicted'])\n", " if i == 0:\n", " final_result = test_result[:]\n", " else:\n", " final_result += test_result\n", "confmatrix = np.array(final_result)\n", "correct = 0\n", "for i in range(confmatrix.shape[0]):\n", " correct += confmatrix[i,i]\n", "accuracy = correct/confmatrix.sum()\n", "print('{} runs {}\\nFeatures: {}\\nAccuracy: {}\\ntime: {} sec'.format(reps, title_suffix, features, accuracy, int(time.time()-start)))\n", "final_result" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "100 runs with all features\n", "Features: ['color_id', 'elongatedness', 'weight', 'sweetness', 'acidity']\n", "Accuracy: 0.9856951274027715\n", "time: 23 sec\n" ] }, { "html": [ "
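The 100-run loop above is repeated below with two other feature lists; only the `features` assignment changes. A sketch of the same loop wrapped in a reusable function, assuming the `df` and `fruitnames` objects defined earlier; the function name, the seed, and the use of `DataFrame.add` with `fill_value=0` (which keeps the sum well defined even if one run's crosstab misses a label) are my choices, not the notebook's:

```python
# Sketch of a reusable version of the repeated-evaluation loop; assumes df and
# fruitnames from earlier cells. Not part of the original notebook.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def repeated_confusion(df, features, reps=100, train_frac=0.75, seed=0):
    """Sum the test-set confusion matrices over `reps` random train/test splits."""
    rng = np.random.RandomState(seed)
    total = None
    for _ in range(reps):
        is_train = rng.uniform(0, 1, len(df)) <= train_frac
        train, test = df[is_train], df[~is_train]
        y, _ = pd.factorize(train['fruit_id'])   # relies on df being sorted by fruit_id
        clf = RandomForestClassifier(n_jobs=2).fit(train[features], y)
        preds = clf.predict(test[features])
        result = pd.crosstab(np.array([fruitnames[x] for x in test['fruit_id']]),
                             np.array([fruitnames[x + 1] for x in preds]),
                             rownames=['actual'], colnames=['predicted'])
        # Align on labels so a run with a missing row or column cannot break the sum.
        total = result if total is None else total.add(result, fill_value=0)
    accuracy = np.diag(total).sum() / total.values.sum()
    return total, accuracy

# Example call, equivalent in spirit to the cell above:
# matrix, acc = repeated_confusion(df, ['color_id', 'elongatedness', 'weight', 'sweetness', 'acidity'])
```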
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ "predicted Apple Orange Pear\n", "actual \n", "Apple 1184 31 21\n", "Orange 4 1481 1\n", "Pear 2 5 1745" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Show confusion matrix of 100 runs of Random Forest Classifier using only two most important features" ] }, { "cell_type": "code", "collapsed": false, "input": [ "reps=100\n", "features=['elongatedness','sweetness',]\n", "title_suffix='with only 2 most important features'\n", "\n", "import time\n", "start = time.time()\n", "for i in range(reps):\n", " df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 # randomly assign training and testing set\n", " train, test = df[df['is_train']==True], df[df['is_train']==False]\n", " y, _ = pd.factorize(train['fruit_id'])\n", " clf = RandomForestClassifier(n_jobs=2)\n", " clf = clf.fit(train[features], y)\n", " preds = clf.predict(test[features])\n", " test_result = pd.crosstab(np.array([fruitnames[x] for x in test['fruit_id']]), \n", " np.array([fruitnames[x+1] for x in preds]), rownames=['actual'], colnames=['predicted'])\n", " if i == 0:\n", " final_result = test_result[:]\n", " else:\n", " final_result += test_result\n", "confmatrix = np.array(final_result)\n", "correct = 0\n", "for i in range(confmatrix.shape[0]):\n", " correct += confmatrix[i,i]\n", "accuracy = correct/confmatrix.sum()\n", "print('{} runs {}\\nFeatures: {}\\nAccuracy: {}\\ntime: {} sec'.format(reps, title_suffix, features, accuracy, int(time.time()-start)))\n", "final_result" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "100 runs with only 2 most important features\n", "Features: ['elongatedness', 'sweetness']\n", "Accuracy: 0.9867831541218638\n", "time: 23 sec\n" ] }, { "html": [ "
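In the cell above, the two most important features were typed in by hand from the importances printed earlier. A sketch of selecting the top k programmatically instead, assuming `df` from the cells above; `n_estimators`, `random_state` and the variable names are arbitrary choices:

```python
# Sketch, not original notebook code: pick the top-k features from a forest's
# feature_importances_ instead of hard-coding ['elongatedness', 'sweetness'].
import numpy as np
from sklearn.ensemble import RandomForestClassifier

all_features = ['color_id', 'elongatedness', 'weight', 'sweetness', 'acidity']
forest = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=2)
forest.fit(df[all_features], df['fruit_id'])

order = np.argsort(forest.feature_importances_)[::-1]   # indices, most important first
top_two = [all_features[i] for i in order[:2]]
print(top_two)
```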
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ "predicted Apple Orange Pear\n", "actual \n", "Apple 1221 18 31\n", "Orange 8 1481 0\n", "Pear 1 1 1703" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Show confusion matrix of 100 runs of Random Forest Classifier using only two least important features" ] }, { "cell_type": "code", "collapsed": false, "input": [ "reps=100\n", "features=['color_id','acidity',]\n", "title_suffix='with only 2 least important features'\n", "\n", "import time\n", "start = time.time()\n", "for i in range(reps):\n", " df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 # randomly assign training and testing set\n", " train, test = df[df['is_train']==True], df[df['is_train']==False]\n", " y, _ = pd.factorize(train['fruit_id'])\n", " clf = RandomForestClassifier(n_jobs=2)\n", " clf = clf.fit(train[features], y)\n", " preds = clf.predict(test[features])\n", " test_result = pd.crosstab(np.array([fruitnames[x] for x in test['fruit_id']]), \n", " np.array([fruitnames[x+1] for x in preds]), rownames=['actual'], colnames=['predicted'])\n", " if i == 0:\n", " final_result = test_result[:]\n", " else:\n", " final_result += test_result\n", "confmatrix = np.array(final_result)\n", "correct = 0\n", "for i in range(confmatrix.shape[0]):\n", " correct += confmatrix[i,i]\n", "accuracy = correct/confmatrix.sum()\n", "print('{} runs {}\\nFeatures: {}\\nAccuracy: {}\\ntime: {} sec'.format(reps, title_suffix, features, accuracy, int(time.time()-start)))\n", "final_result" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "100 runs with only 2 least important features\n", "Features: ['color_id', 'acidity']\n", "Accuracy: 0.7456040191824618\n", "time: 23 sec\n" ] }, { "html": [ "
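The three feature subsets can also be compared in one place with cross-validated accuracy instead of 100 hand-rolled random splits. A sketch, assuming `df` from the cells above and a scikit-learn recent enough to have `sklearn.model_selection`; the fold count, `n_estimators` and `random_state` are arbitrary:

```python
# Sketch, not original notebook code: cross-validated accuracy for each feature subset.
# Assumes df from the cells above; cv=5 and the estimator settings are arbitrary choices.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score   # sklearn.cross_validation in very old versions

subsets = [('all features',      ['color_id', 'elongatedness', 'weight', 'sweetness', 'acidity']),
           ('2 most important',  ['elongatedness', 'sweetness']),
           ('2 least important', ['color_id', 'acidity'])]

for name, cols in subsets:
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=2)
    scores = cross_val_score(clf, df[cols], df['fruit_id'], cv=5)
    print('{:<17s} mean accuracy: {:.3f}'.format(name, scores.mean()))
```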
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ "predicted Apple Orange Pear\n", "actual \n", "Apple 647 104 483\n", "Orange 79 1359 25\n", "Pear 353 70 1259" ] } ], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }