{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#Week 4: Supervised Learning\n", "\n", "The end of week for marked the half way point of the structured curriculum portion of the program. The entire cohort is starting to get the drill down. Learn about a topic in the morning, impliment it by hand to see the nuts a bolts, and then finish up with the sklearn version to validate our results. Supervised Learning is all about classifying data to a category, or most often, to a 0 or a 1. For example, if you have data about high school students and you know if they got into college or not, you can use a model to predict whether a current high school student will get into a college.\n", "\n", "Topics of the week:\n", "1. kNN\n", "2. Decision Trees\n", "3. Entropy/Information Gain/Gini Impurity\n", "4. Random Forest\n", "5. Bagging/Boosting/Testing with Out Of Bag observations\n", "6. Maximum Margin/Support Vector Classifier/SVM/Tuning with Kernals\n", "7. Gradient Boosting/AdaBoosting\n", "8. Profit Curves\n", "\n", "For our code sample this week we are going to use Random Forests on a cell phone data set. We are going to try and predict if customers will churn or not based off of their cell usage statistics. Lets import our packages and see what our data looks like" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from roc import plot_roc\n", "from sklearn.cross_validation import train_test_split\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.metrics import confusion_matrix\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = pd.read_csv('data/churn.csv')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "#####Were going to be working with cell phone data lets check out our features and clean it up a bit. There are some columns with \"yes/no\" and we want to change those over to 0's and 1's" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Account LengthInt'l PlanVMail PlanVMail MessageDay MinsDay CallsDay ChargeEve MinsEve CallsEve ChargeNight MinsNight CallsNight ChargeIntl MinsIntl CallsIntl ChargeCustServ CallsChurn?
0 128 False True 25 265.1 110 45.07 197.4 99 16.78 244.7 91 11.01 10.0 3 2.70 1 False
1 107 False True 26 161.6 123 27.47 195.5 103 16.62 254.4 103 11.45 13.7 3 3.70 1 False
2 137 False False 0 243.4 114 41.38 121.2 110 10.30 162.6 104 7.32 12.2 5 3.29 0 False
3 84 True False 0 299.4 71 50.90 61.9 88 5.26 196.9 89 8.86 6.6 7 1.78 2 False
4 75 True False 0 166.7 113 28.34 148.3 122 12.61 186.9 121 8.41 10.1 3 2.73 3 False
\n", "
" ], "text/plain": [ " Account Length Int'l Plan VMail Plan VMail Message Day Mins Day Calls \\\n", "0 128 False True 25 265.1 110 \n", "1 107 False True 26 161.6 123 \n", "2 137 False False 0 243.4 114 \n", "3 84 True False 0 299.4 71 \n", "4 75 True False 0 166.7 113 \n", "\n", " Day Charge Eve Mins Eve Calls Eve Charge Night Mins Night Calls \\\n", "0 45.07 197.4 99 16.78 244.7 91 \n", "1 27.47 195.5 103 16.62 254.4 103 \n", "2 41.38 121.2 110 10.30 162.6 104 \n", "3 50.90 61.9 88 5.26 196.9 89 \n", "4 28.34 148.3 122 12.61 186.9 121 \n", "\n", " Night Charge Intl Mins Intl Calls Intl Charge CustServ Calls Churn? \n", "0 11.01 10.0 3 2.70 1 False \n", "1 11.45 13.7 3 3.70 1 False \n", "2 7.32 12.2 5 3.29 0 False \n", "3 8.86 6.6 7 1.78 2 False \n", "4 8.41 10.1 3 2.73 3 False " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"Int'l Plan\"] = df[\"Int'l Plan\"].apply(lambda word: word == 'yes')\n", "df[\"VMail Plan\"] = df[\"VMail Plan\"].apply(lambda word: word == 'yes')\n", "df[\"Churn?\"] = df[\"Churn?\"].apply(lambda word: word != 'False.')\n", "df = df.drop(['State', 'Area Code', 'Phone'], axis=1)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#####Next lets split up our x and y so we can ultimately make our train/test split" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "y = df.pop('Churn?').values\n", "X = df.values\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "rf = RandomForestClassifier(oob_score=True)\n", "rf.fit(X_train, y_train)\n", "y_predicted = rf.predict(X_test)\n", "cf = confusion_matrix(y_test, y_predicted, labels=[True, False])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.8" } }, "nbformat": 4, "nbformat_minor": 0 }