{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#Week 4: Supervised Learning\n",
    "\n",
    "The end of week for marked the half way point of the structured curriculum portion of the program. The entire cohort is starting to get the drill down. Learn about a topic in the morning, impliment it by hand to see the nuts a bolts, and then finish up with the sklearn version to validate our results. Supervised Learning is all about classifying data to a category, or most often, to a 0 or a 1. For example, if you have data about high school students and you know if they got into college or not, you can use a model to predict whether a current high school student will get into a college.\n",
    "\n",
    "Topics of the week:\n",
    "1. kNN\n",
    "2. Decision Trees\n",
    "3. Entropy/Information Gain/Gini Impurity\n",
    "4. Random Forest\n",
    "5. Bagging/Boosting/Testing with Out Of Bag observations\n",
    "6. Maximum Margin/Support Vector Classifier/SVM/Tuning with Kernals\n",
    "7. Gradient Boosting/AdaBoosting\n",
    "8. Profit Curves\n",
    "\n",
    "For our code sample this week we are going to use Random Forests on a cell phone data set. We are going to try and predict if customers will churn or not based off of their cell usage statistics. Lets import our packages and see what our data looks like"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "from roc import plot_roc\n",
    "from sklearn.cross_validation import train_test_split\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.metrics import confusion_matrix\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df = pd.read_csv('data/churn.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "#####Were going to be working with cell phone data lets check out our features and clean it up a bit. There are some columns with \"yes/no\" and we want to change those over to 0's and 1's"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Account Length</th>\n",
       "      <th>Int'l Plan</th>\n",
       "      <th>VMail Plan</th>\n",
       "      <th>VMail Message</th>\n",
       "      <th>Day Mins</th>\n",
       "      <th>Day Calls</th>\n",
       "      <th>Day Charge</th>\n",
       "      <th>Eve Mins</th>\n",
       "      <th>Eve Calls</th>\n",
       "      <th>Eve Charge</th>\n",
       "      <th>Night Mins</th>\n",
       "      <th>Night Calls</th>\n",
       "      <th>Night Charge</th>\n",
       "      <th>Intl Mins</th>\n",
       "      <th>Intl Calls</th>\n",
       "      <th>Intl Charge</th>\n",
       "      <th>CustServ Calls</th>\n",
       "      <th>Churn?</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td> 128</td>\n",
       "      <td> False</td>\n",
       "      <td>  True</td>\n",
       "      <td> 25</td>\n",
       "      <td> 265.1</td>\n",
       "      <td> 110</td>\n",
       "      <td> 45.07</td>\n",
       "      <td> 197.4</td>\n",
       "      <td>  99</td>\n",
       "      <td> 16.78</td>\n",
       "      <td> 244.7</td>\n",
       "      <td>  91</td>\n",
       "      <td> 11.01</td>\n",
       "      <td> 10.0</td>\n",
       "      <td> 3</td>\n",
       "      <td> 2.70</td>\n",
       "      <td> 1</td>\n",
       "      <td> False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td> 107</td>\n",
       "      <td> False</td>\n",
       "      <td>  True</td>\n",
       "      <td> 26</td>\n",
       "      <td> 161.6</td>\n",
       "      <td> 123</td>\n",
       "      <td> 27.47</td>\n",
       "      <td> 195.5</td>\n",
       "      <td> 103</td>\n",
       "      <td> 16.62</td>\n",
       "      <td> 254.4</td>\n",
       "      <td> 103</td>\n",
       "      <td> 11.45</td>\n",
       "      <td> 13.7</td>\n",
       "      <td> 3</td>\n",
       "      <td> 3.70</td>\n",
       "      <td> 1</td>\n",
       "      <td> False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td> 137</td>\n",
       "      <td> False</td>\n",
       "      <td> False</td>\n",
       "      <td>  0</td>\n",
       "      <td> 243.4</td>\n",
       "      <td> 114</td>\n",
       "      <td> 41.38</td>\n",
       "      <td> 121.2</td>\n",
       "      <td> 110</td>\n",
       "      <td> 10.30</td>\n",
       "      <td> 162.6</td>\n",
       "      <td> 104</td>\n",
       "      <td>  7.32</td>\n",
       "      <td> 12.2</td>\n",
       "      <td> 5</td>\n",
       "      <td> 3.29</td>\n",
       "      <td> 0</td>\n",
       "      <td> False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>  84</td>\n",
       "      <td>  True</td>\n",
       "      <td> False</td>\n",
       "      <td>  0</td>\n",
       "      <td> 299.4</td>\n",
       "      <td>  71</td>\n",
       "      <td> 50.90</td>\n",
       "      <td>  61.9</td>\n",
       "      <td>  88</td>\n",
       "      <td>  5.26</td>\n",
       "      <td> 196.9</td>\n",
       "      <td>  89</td>\n",
       "      <td>  8.86</td>\n",
       "      <td>  6.6</td>\n",
       "      <td> 7</td>\n",
       "      <td> 1.78</td>\n",
       "      <td> 2</td>\n",
       "      <td> False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>  75</td>\n",
       "      <td>  True</td>\n",
       "      <td> False</td>\n",
       "      <td>  0</td>\n",
       "      <td> 166.7</td>\n",
       "      <td> 113</td>\n",
       "      <td> 28.34</td>\n",
       "      <td> 148.3</td>\n",
       "      <td> 122</td>\n",
       "      <td> 12.61</td>\n",
       "      <td> 186.9</td>\n",
       "      <td> 121</td>\n",
       "      <td>  8.41</td>\n",
       "      <td> 10.1</td>\n",
       "      <td> 3</td>\n",
       "      <td> 2.73</td>\n",
       "      <td> 3</td>\n",
       "      <td> False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Account Length Int'l Plan VMail Plan  VMail Message  Day Mins  Day Calls  \\\n",
       "0             128      False       True             25     265.1        110   \n",
       "1             107      False       True             26     161.6        123   \n",
       "2             137      False      False              0     243.4        114   \n",
       "3              84       True      False              0     299.4         71   \n",
       "4              75       True      False              0     166.7        113   \n",
       "\n",
       "   Day Charge  Eve Mins  Eve Calls  Eve Charge  Night Mins  Night Calls  \\\n",
       "0       45.07     197.4         99       16.78       244.7           91   \n",
       "1       27.47     195.5        103       16.62       254.4          103   \n",
       "2       41.38     121.2        110       10.30       162.6          104   \n",
       "3       50.90      61.9         88        5.26       196.9           89   \n",
       "4       28.34     148.3        122       12.61       186.9          121   \n",
       "\n",
       "   Night Charge  Intl Mins  Intl Calls  Intl Charge  CustServ Calls Churn?  \n",
       "0         11.01       10.0           3         2.70               1  False  \n",
       "1         11.45       13.7           3         3.70               1  False  \n",
       "2          7.32       12.2           5         3.29               0  False  \n",
       "3          8.86        6.6           7         1.78               2  False  \n",
       "4          8.41       10.1           3         2.73               3  False  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"Int'l Plan\"] = df[\"Int'l Plan\"].apply(lambda word: word == 'yes')\n",
    "df[\"VMail Plan\"] = df[\"VMail Plan\"].apply(lambda word: word == 'yes')\n",
    "df[\"Churn?\"] = df[\"Churn?\"].apply(lambda word: word != 'False.')\n",
    "df = df.drop(['State', 'Area Code', 'Phone'], axis=1)\n",
    "df.head()"
   ]
  },
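  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before modeling, it can help to check how many customers actually churned, since accuracy is hard to interpret when one class dominates. This is only a quick sketch, and it assumes the cleaned `df` from the cell above still has its boolean `Churn?` column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Count of churners (True) vs. non-churners (False)\n",
    "print(df['Churn?'].value_counts())\n",
    "# The mean of a boolean column is the fraction of True values\n",
    "print('Churn rate: %.3f' % df['Churn?'].mean())"
   ]
  },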
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#####Next lets split up our x and y so we can ultimately make our train/test split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "y = df.pop('Churn?').values\n",
    "X = df.values\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "rf = RandomForestClassifier(oob_score=True)\n",
    "rf.fit(X_train, y_train)\n",
    "y_predicted = rf.predict(X_test)\n",
    "cf = confusion_matrix(y_test, y_predicted, labels=[True, False])"
   ]
  },
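  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The forest above is fit and `cf` is computed, but nothing is displayed yet. Here is a minimal sketch of how the results might be inspected, assuming the `rf`, `X_test`, `y_test`, and `cf` variables from the previous cell. The row and column labels are just illustrative names. With a small number of trees, sklearn may warn that some rows were never out of bag, so treat the OOB figure as a rough estimate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Assumes rf, X_test, y_test and cf from the cell above\n",
    "print('OOB accuracy estimate: %.3f' % rf.oob_score_)\n",
    "print('Test accuracy: %.3f' % rf.score(X_test, y_test))\n",
    "\n",
    "# labels=[True, False] puts churners first: rows are actual, columns are predicted\n",
    "tp, fn = cf[0]\n",
    "fp, tn = cf[1]\n",
    "print('Precision: %.3f' % (tp / float(tp + fp)))\n",
    "print('Recall: %.3f' % (tp / float(tp + fn)))\n",
    "\n",
    "pd.DataFrame(cf, index=['actual churn', 'actual stay'],\n",
    "             columns=['predicted churn', 'predicted stay'])"
   ]
  },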
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
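  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Random forests also report how much each feature contributed to the splits. A rough sketch, assuming `rf` is the fitted forest from above and that `df` holds only the feature columns (true here because `Churn?` was popped off before `X` was built):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Pair each importance with its column name and list the largest first\n",
    "importances = sorted(zip(rf.feature_importances_, df.columns), reverse=True)\n",
    "for score, name in importances:\n",
    "    print('%-15s %.3f' % (name, score))"
   ]
  }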
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}