{
"metadata": {
"name": "",
"signature": "sha256:63fd463d06ca725cacea9d6caa81981f0c791d4b95f4fdb09d09484770b6d833"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to data analysis using machine learning #\n",
"\n",
"## 07. Classification with Random Forest ##\n",
"\n",
"by David Taylor, [www.prooffreader.com](http://www.prooffreader.com) (blog) [www.dtdata.io](http://dtdata.io) (hire me!)\n",
"\n",
"For links to more material including a slideshow explaining all this stuff in further detail, please see the front page of [this GitHub repo.](https://github.com/Prooffreader/intro_machine_learning)\n",
"\n",
"This is notebook 7 of 8. The next notebook is: [[08. Dimensionality reduction with PCA]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/08_Dimensionality_Reduction.ipynb)\n",
"\n",
"[[01]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/01_The_Dataset.ipynb) [[02]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/02_Clustering_KMeans.ipynb) [[03]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/03_Clustering_OtherAlgos.ipynb) [[04]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/04_Classification_kNN.ipynb) [[05]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/05_Classification_OtherAlgos.ipynb) [[06]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/06_Classification_Decision_Trees.ipynb) **[07]** [[08]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/08_Dimensionality_Reduction.ipynb)\n",
"\n",
"***\n",
"\n",
"In the [previous notebook](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/06_Classification_Decision_Trees.ipynb), we saw Decision Trees; now we look at an ensemble method, Random Forest."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1. Import libraries and load data #"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"from matplotlib.colors import ListedColormap\n",
"from sklearn import neighbors\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"import time\n",
"\n",
"df = pd.read_csv('fruit.csv')\n",
"fruitnames = {1: 'Orange', 2: 'Pear', 3: 'Apple'}\n",
"colors = {1: '#e09028', 2: '#55aa33', 3: '#cc3333'}\n",
"fruitlist = ['Orange', 'Pear', 'Apple']\n",
"\n",
"df.sort('fruit_id', inplace=True) # This is important because the factorizer assigns numbers\n",
" # based on the order the first label is encountered, e.g. if the first instance had\n",
" # fruit = 3, the y value would be 0.\n"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2. Train a Random Forest Classifier and show a confusion matrix of the test set #"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 # randomly assign training and testing set\n",
"train, test = df[df['is_train']==True], df[df['is_train']==False]\n",
"features = ['color_id', 'elongatedness', 'weight', 'sweetness', 'acidity']\n",
"y, _ = pd.factorize(train['fruit_id'])\n",
"clf = RandomForestClassifier(n_jobs=2)\n",
"clf = clf.fit(train[features], y)\n",
"preds = clf.predict(test[features])\n",
"test_result = pd.crosstab(np.array([fruitnames[x] for x in test['fruit_id']]), \n",
" np.array([fruitnames[x+1] for x in preds]),\n",
" rownames=['actual'], colnames=['predicted'])\n",
"test_result"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"
\n",
"
\n",
" \n",
" \n",
" predicted | \n",
" Apple | \n",
" Orange | \n",
" Pear | \n",
"
\n",
" \n",
" actual | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Apple | \n",
" 15 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" Orange | \n",
" 0 | \n",
" 11 | \n",
" 0 | \n",
"
\n",
" \n",
" Pear | \n",
" 1 | \n",
" 0 | \n",
" 18 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 7,
"text": [
"predicted Apple Orange Pear\n",
"actual \n",
"Apple 15 0 0\n",
"Orange 0 11 0\n",
"Pear 1 0 18"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show feature importances"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for i, score in enumerate(list(clf.feature_importances_)):\n",
" print(round(100*score, 1), features[i])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"2.2 color_id\n",
"53.1 elongatedness\n",
"7.0 weight\n",
"24.8 sweetness\n",
"12.9 acidity\n"
]
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show confusion matrix of 100 runs of Random Forest Classifier using all features"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"reps=100\n",
"features=['color_id', 'elongatedness', 'weight', 'sweetness', 'acidity']\n",
"title_suffix='with all features'\n",
"\n",
"start = time.time()\n",
"for i in range(reps):\n",
" df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 # randomly assign training and testing set\n",
" train, test = df[df['is_train']==True], df[df['is_train']==False]\n",
" y, _ = pd.factorize(train['fruit_id'])\n",
" clf = RandomForestClassifier(n_jobs=2)\n",
" clf = clf.fit(train[features], y)\n",
" preds = clf.predict(test[features])\n",
" test_result = pd.crosstab(np.array([fruitnames[x] for x in test['fruit_id']]), \n",
" np.array([fruitnames[x+1] for x in preds]), rownames=['actual'], colnames=['predicted'])\n",
" if i == 0:\n",
" final_result = test_result[:]\n",
" else:\n",
" final_result += test_result\n",
"confmatrix = np.array(final_result)\n",
"correct = 0\n",
"for i in range(confmatrix.shape[0]):\n",
" correct += confmatrix[i,i]\n",
"accuracy = correct/confmatrix.sum()\n",
"print('{} runs {}\\nFeatures: {}\\nAccuracy: {}\\ntime: {} sec'.format(reps, title_suffix, features, accuracy, int(time.time()-start)))\n",
"final_result"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"100 runs with all features\n",
"Features: ['color_id', 'elongatedness', 'weight', 'sweetness', 'acidity']\n",
"Accuracy: 0.9856951274027715\n",
"time: 23 sec\n"
]
},
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" predicted | \n",
" Apple | \n",
" Orange | \n",
" Pear | \n",
"
\n",
" \n",
" actual | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Apple | \n",
" 1184 | \n",
" 31 | \n",
" 21 | \n",
"
\n",
" \n",
" Orange | \n",
" 4 | \n",
" 1481 | \n",
" 1 | \n",
"
\n",
" \n",
" Pear | \n",
" 2 | \n",
" 5 | \n",
" 1745 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 9,
"text": [
"predicted Apple Orange Pear\n",
"actual \n",
"Apple 1184 31 21\n",
"Orange 4 1481 1\n",
"Pear 2 5 1745"
]
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show confusion matrix of 100 runs of Random Forest Classifier using only two most important features"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"reps=100\n",
"features=['elongatedness','sweetness',]\n",
"title_suffix='with only 2 most important features'\n",
"\n",
"import time\n",
"start = time.time()\n",
"for i in range(reps):\n",
" df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 # randomly assign training and testing set\n",
" train, test = df[df['is_train']==True], df[df['is_train']==False]\n",
" y, _ = pd.factorize(train['fruit_id'])\n",
" clf = RandomForestClassifier(n_jobs=2)\n",
" clf = clf.fit(train[features], y)\n",
" preds = clf.predict(test[features])\n",
" test_result = pd.crosstab(np.array([fruitnames[x] for x in test['fruit_id']]), \n",
" np.array([fruitnames[x+1] for x in preds]), rownames=['actual'], colnames=['predicted'])\n",
" if i == 0:\n",
" final_result = test_result[:]\n",
" else:\n",
" final_result += test_result\n",
"confmatrix = np.array(final_result)\n",
"correct = 0\n",
"for i in range(confmatrix.shape[0]):\n",
" correct += confmatrix[i,i]\n",
"accuracy = correct/confmatrix.sum()\n",
"print('{} runs {}\\nFeatures: {}\\nAccuracy: {}\\ntime: {} sec'.format(reps, title_suffix, features, accuracy, int(time.time()-start)))\n",
"final_result"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"100 runs with only 2 most important features\n",
"Features: ['elongatedness', 'sweetness']\n",
"Accuracy: 0.9867831541218638\n",
"time: 23 sec\n"
]
},
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" predicted | \n",
" Apple | \n",
" Orange | \n",
" Pear | \n",
"
\n",
" \n",
" actual | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Apple | \n",
" 1221 | \n",
" 18 | \n",
" 31 | \n",
"
\n",
" \n",
" Orange | \n",
" 8 | \n",
" 1481 | \n",
" 0 | \n",
"
\n",
" \n",
" Pear | \n",
" 1 | \n",
" 1 | \n",
" 1703 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 10,
"text": [
"predicted Apple Orange Pear\n",
"actual \n",
"Apple 1221 18 31\n",
"Orange 8 1481 0\n",
"Pear 1 1 1703"
]
}
],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show confusion matrix of 100 runs of Random Forest Classifier using only two least important features"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"reps=100\n",
"features=['color_id','acidity',]\n",
"title_suffix='with only 2 least important features'\n",
"\n",
"import time\n",
"start = time.time()\n",
"for i in range(reps):\n",
" df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 # randomly assign training and testing set\n",
" train, test = df[df['is_train']==True], df[df['is_train']==False]\n",
" y, _ = pd.factorize(train['fruit_id'])\n",
" clf = RandomForestClassifier(n_jobs=2)\n",
" clf = clf.fit(train[features], y)\n",
" preds = clf.predict(test[features])\n",
" test_result = pd.crosstab(np.array([fruitnames[x] for x in test['fruit_id']]), \n",
" np.array([fruitnames[x+1] for x in preds]), rownames=['actual'], colnames=['predicted'])\n",
" if i == 0:\n",
" final_result = test_result[:]\n",
" else:\n",
" final_result += test_result\n",
"confmatrix = np.array(final_result)\n",
"correct = 0\n",
"for i in range(confmatrix.shape[0]):\n",
" correct += confmatrix[i,i]\n",
"accuracy = correct/confmatrix.sum()\n",
"print('{} runs {}\\nFeatures: {}\\nAccuracy: {}\\ntime: {} sec'.format(reps, title_suffix, features, accuracy, int(time.time()-start)))\n",
"final_result"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"100 runs with only 2 least important features\n",
"Features: ['color_id', 'acidity']\n",
"Accuracy: 0.7456040191824618\n",
"time: 23 sec\n"
]
},
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" predicted | \n",
" Apple | \n",
" Orange | \n",
" Pear | \n",
"
\n",
" \n",
" actual | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Apple | \n",
" 647 | \n",
" 104 | \n",
" 483 | \n",
"
\n",
" \n",
" Orange | \n",
" 79 | \n",
" 1359 | \n",
" 25 | \n",
"
\n",
" \n",
" Pear | \n",
" 353 | \n",
" 70 | \n",
" 1259 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 11,
"text": [
"predicted Apple Orange Pear\n",
"actual \n",
"Apple 647 104 483\n",
"Orange 79 1359 25\n",
"Pear 353 70 1259"
]
}
],
"prompt_number": 11
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}