{
 "metadata": {
  "name": "",
  "signature": "sha256:63fd463d06ca725cacea9d6caa81981f0c791d4b95f4fdb09d09484770b6d833"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Introduction to data analysis using machine learning #\n",
      "\n",
      "## 07. Classification with Random Forest ##\n",
      "\n",
      "by David Taylor, [www.prooffreader.com](http://www.prooffreader.com) (blog) [www.dtdata.io](http://dtdata.io) (hire me!)\n",
      "\n",
      "For links to more material including a slideshow explaining all this stuff in further detail, please see the front page of [this GitHub repo.](https://github.com/Prooffreader/intro_machine_learning)\n",
      "\n",
      "This is notebook 7 of 8. The next notebook is: [[08. Dimensionality reduction with PCA]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/08_Dimensionality_Reduction.ipynb)\n",
      "\n",
      "[[01]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/01_The_Dataset.ipynb) [[02]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/02_Clustering_KMeans.ipynb) [[03]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/03_Clustering_OtherAlgos.ipynb) [[04]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/04_Classification_kNN.ipynb) [[05]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/05_Classification_OtherAlgos.ipynb) [[06]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/06_Classification_Decision_Trees.ipynb) **[07]** [[08]](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/08_Dimensionality_Reduction.ipynb)\n",
      "\n",
      "***\n",
      "\n",
      "In the [previous notebook](http://nbviewer.ipython.org/github/Prooffreader/intro_machine_learning/blob/master/06_Classification_Decision_Trees.ipynb), we saw Decision Trees; now we look at an ensemble method, Random Forest."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### 1. Import libraries and load data #"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "import numpy as np\n",
      "import matplotlib.pyplot as plt\n",
      "%matplotlib inline\n",
      "from matplotlib.colors import ListedColormap\n",
      "from sklearn import neighbors\n",
      "from sklearn.ensemble import RandomForestClassifier\n",
      "import time\n",
      "\n",
      "df = pd.read_csv('fruit.csv')\n",
      "fruitnames = {1: 'Orange', 2: 'Pear', 3: 'Apple'}\n",
      "colors = {1: '#e09028', 2: '#55aa33', 3: '#cc3333'}\n",
      "fruitlist = ['Orange', 'Pear', 'Apple']\n",
      "\n",
      "df.sort('fruit_id', inplace=True) # This is important because the factorizer assigns numbers\n",
      "    # based on the order the first label is encountered, e.g. if the first instance had\n",
      "    # fruit = 3, the y value would be 0.\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### 2. Train a Random Forest Classifier and show a confusion matrix of the test set #"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 # randomly assign training and testing set\n",
      "train, test = df[df['is_train']==True], df[df['is_train']==False]\n",
      "features = ['color_id', 'elongatedness', 'weight', 'sweetness', 'acidity']\n",
      "y, _ = pd.factorize(train['fruit_id'])\n",
      "clf = RandomForestClassifier(n_jobs=2)\n",
      "clf = clf.fit(train[features], y)\n",
      "preds = clf.predict(test[features])\n",
      "test_result = pd.crosstab(np.array([fruitnames[x] for x in test['fruit_id']]), \n",
      "                      np.array([fruitnames[x+1] for x in preds]),\n",
      "                      rownames=['actual'], colnames=['predicted'])\n",
      "test_result"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th>predicted</th>\n",
        "      <th>Apple</th>\n",
        "      <th>Orange</th>\n",
        "      <th>Pear</th>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>actual</th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>Apple</th>\n",
        "      <td> 15</td>\n",
        "      <td>  0</td>\n",
        "      <td>  0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>Orange</th>\n",
        "      <td>  0</td>\n",
        "      <td> 11</td>\n",
        "      <td>  0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>Pear</th>\n",
        "      <td>  1</td>\n",
        "      <td>  0</td>\n",
        "      <td> 18</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 7,
       "text": [
        "predicted  Apple  Orange  Pear\n",
        "actual                        \n",
        "Apple         15       0     0\n",
        "Orange         0      11     0\n",
        "Pear           1       0    18"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Show feature importances"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for i, score in enumerate(list(clf.feature_importances_)):\n",
      "    print(round(100*score, 1), features[i])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2.2 color_id\n",
        "53.1 elongatedness\n",
        "7.0 weight\n",
        "24.8 sweetness\n",
        "12.9 acidity\n"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Show confusion matrix of 100 runs of Random Forest Classifier using all features"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "reps=100\n",
      "features=['color_id', 'elongatedness', 'weight', 'sweetness', 'acidity']\n",
      "title_suffix='with all features'\n",
      "\n",
      "start = time.time()\n",
      "for i in range(reps):\n",
      "    df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 # randomly assign training and testing set\n",
      "    train, test = df[df['is_train']==True], df[df['is_train']==False]\n",
      "    y, _ = pd.factorize(train['fruit_id'])\n",
      "    clf = RandomForestClassifier(n_jobs=2)\n",
      "    clf = clf.fit(train[features], y)\n",
      "    preds = clf.predict(test[features])\n",
      "    test_result = pd.crosstab(np.array([fruitnames[x] for x in test['fruit_id']]), \n",
      "                          np.array([fruitnames[x+1] for x in preds]), rownames=['actual'], colnames=['predicted'])\n",
      "    if i == 0:\n",
      "        final_result = test_result[:]\n",
      "    else:\n",
      "        final_result += test_result\n",
      "confmatrix = np.array(final_result)\n",
      "correct = 0\n",
      "for i in range(confmatrix.shape[0]):\n",
      "    correct += confmatrix[i,i]\n",
      "accuracy = correct/confmatrix.sum()\n",
      "print('{} runs {}\\nFeatures: {}\\nAccuracy: {}\\ntime: {} sec'.format(reps, title_suffix, features, accuracy, int(time.time()-start)))\n",
      "final_result"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "100 runs with all features\n",
        "Features: ['color_id', 'elongatedness', 'weight', 'sweetness', 'acidity']\n",
        "Accuracy: 0.9856951274027715\n",
        "time: 23 sec\n"
       ]
      },
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th>predicted</th>\n",
        "      <th>Apple</th>\n",
        "      <th>Orange</th>\n",
        "      <th>Pear</th>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>actual</th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>Apple</th>\n",
        "      <td> 1184</td>\n",
        "      <td>   31</td>\n",
        "      <td>   21</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>Orange</th>\n",
        "      <td>    4</td>\n",
        "      <td> 1481</td>\n",
        "      <td>    1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>Pear</th>\n",
        "      <td>    2</td>\n",
        "      <td>    5</td>\n",
        "      <td> 1745</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 9,
       "text": [
        "predicted  Apple  Orange  Pear\n",
        "actual                        \n",
        "Apple       1184      31    21\n",
        "Orange         4    1481     1\n",
        "Pear           2       5  1745"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Show confusion matrix of 100 runs of Random Forest Classifier using only two most important features"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "reps=100\n",
      "features=['elongatedness','sweetness',]\n",
      "title_suffix='with only 2 most important features'\n",
      "\n",
      "import time\n",
      "start = time.time()\n",
      "for i in range(reps):\n",
      "    df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 # randomly assign training and testing set\n",
      "    train, test = df[df['is_train']==True], df[df['is_train']==False]\n",
      "    y, _ = pd.factorize(train['fruit_id'])\n",
      "    clf = RandomForestClassifier(n_jobs=2)\n",
      "    clf = clf.fit(train[features], y)\n",
      "    preds = clf.predict(test[features])\n",
      "    test_result = pd.crosstab(np.array([fruitnames[x] for x in test['fruit_id']]), \n",
      "                          np.array([fruitnames[x+1] for x in preds]), rownames=['actual'], colnames=['predicted'])\n",
      "    if i == 0:\n",
      "        final_result = test_result[:]\n",
      "    else:\n",
      "        final_result += test_result\n",
      "confmatrix = np.array(final_result)\n",
      "correct = 0\n",
      "for i in range(confmatrix.shape[0]):\n",
      "    correct += confmatrix[i,i]\n",
      "accuracy = correct/confmatrix.sum()\n",
      "print('{} runs {}\\nFeatures: {}\\nAccuracy: {}\\ntime: {} sec'.format(reps, title_suffix, features, accuracy, int(time.time()-start)))\n",
      "final_result"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "100 runs with only 2 most important features\n",
        "Features: ['elongatedness', 'sweetness']\n",
        "Accuracy: 0.9867831541218638\n",
        "time: 23 sec\n"
       ]
      },
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th>predicted</th>\n",
        "      <th>Apple</th>\n",
        "      <th>Orange</th>\n",
        "      <th>Pear</th>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>actual</th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>Apple</th>\n",
        "      <td> 1221</td>\n",
        "      <td>   18</td>\n",
        "      <td>   31</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>Orange</th>\n",
        "      <td>    8</td>\n",
        "      <td> 1481</td>\n",
        "      <td>    0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>Pear</th>\n",
        "      <td>    1</td>\n",
        "      <td>    1</td>\n",
        "      <td> 1703</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 10,
       "text": [
        "predicted  Apple  Orange  Pear\n",
        "actual                        \n",
        "Apple       1221      18    31\n",
        "Orange         8    1481     0\n",
        "Pear           1       1  1703"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Show confusion matrix of 100 runs of Random Forest Classifier using only two least important features"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "reps=100\n",
      "features=['color_id','acidity',]\n",
      "title_suffix='with only 2 least important features'\n",
      "\n",
      "import time\n",
      "start = time.time()\n",
      "for i in range(reps):\n",
      "    df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 # randomly assign training and testing set\n",
      "    train, test = df[df['is_train']==True], df[df['is_train']==False]\n",
      "    y, _ = pd.factorize(train['fruit_id'])\n",
      "    clf = RandomForestClassifier(n_jobs=2)\n",
      "    clf = clf.fit(train[features], y)\n",
      "    preds = clf.predict(test[features])\n",
      "    test_result = pd.crosstab(np.array([fruitnames[x] for x in test['fruit_id']]), \n",
      "                          np.array([fruitnames[x+1] for x in preds]), rownames=['actual'], colnames=['predicted'])\n",
      "    if i == 0:\n",
      "        final_result = test_result[:]\n",
      "    else:\n",
      "        final_result += test_result\n",
      "confmatrix = np.array(final_result)\n",
      "correct = 0\n",
      "for i in range(confmatrix.shape[0]):\n",
      "    correct += confmatrix[i,i]\n",
      "accuracy = correct/confmatrix.sum()\n",
      "print('{} runs {}\\nFeatures: {}\\nAccuracy: {}\\ntime: {} sec'.format(reps, title_suffix, features, accuracy, int(time.time()-start)))\n",
      "final_result"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "100 runs with only 2 least important features\n",
        "Features: ['color_id', 'acidity']\n",
        "Accuracy: 0.7456040191824618\n",
        "time: 23 sec\n"
       ]
      },
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th>predicted</th>\n",
        "      <th>Apple</th>\n",
        "      <th>Orange</th>\n",
        "      <th>Pear</th>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>actual</th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>Apple</th>\n",
        "      <td> 647</td>\n",
        "      <td>  104</td>\n",
        "      <td>  483</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>Orange</th>\n",
        "      <td>  79</td>\n",
        "      <td> 1359</td>\n",
        "      <td>   25</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>Pear</th>\n",
        "      <td> 353</td>\n",
        "      <td>   70</td>\n",
        "      <td> 1259</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 11,
       "text": [
        "predicted  Apple  Orange  Pear\n",
        "actual                        \n",
        "Apple        647     104   483\n",
        "Orange        79    1359    25\n",
        "Pear         353      70  1259"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}