{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Week 4: Supervised Learning\n", "\n", "The end of week four marked the halfway point of the structured curriculum portion of the program. The entire cohort is starting to get the drill down: learn about a topic in the morning, implement it by hand to see the nuts and bolts, and then finish up with the sklearn version to validate our results. The supervised learning we covered this week is all about classification: assigning data to a category, most often a 0 or a 1. For example, if you have data about past high school students and you know whether they got into college, you can train a model to predict whether a current high school student will get in.\n", "\n", "Topics of the week:\n", "1. kNN\n", "2. Decision Trees\n", "3. Entropy/Information Gain/Gini Impurity\n", "4. Random Forest\n", "5. Bagging/Boosting/Testing with Out Of Bag observations\n", "6. Maximum Margin/Support Vector Classifier/SVM/Tuning with Kernels\n", "7. Gradient Boosting/AdaBoost\n", "8. Profit Curves\n", "\n", "For our code sample this week we are going to use Random Forests on a cell phone data set. We are going to try to predict whether customers will churn based on their cell usage statistics. 
Let's import our packages and see what our data looks like." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from roc import plot_roc\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.metrics import confusion_matrix\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = pd.read_csv('data/churn.csv')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "##### We're going to be working with cell phone data, so let's check out our features and clean them up a bit. There are some columns with \"yes/no\" values, and we want to change those over to 0's and 1's." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
| | Account Length | Int'l Plan | VMail Plan | VMail Message | Day Mins | Day Calls | Day Charge | Eve Mins | Eve Calls | Eve Charge | Night Mins | Night Calls | Night Charge | Intl Mins | Intl Calls | Intl Charge | CustServ Calls | Churn? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 128 | False | True | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | False |
| 1 | 107 | False | True | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | False |
| 2 | 137 | False | False | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | 84 | True | False | 0 | 299.4 | 71 | 50.90 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | 75 | True | False | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |