{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#Agenda\n", "\n", "- Define the problem and the approach\n", "-

Data basics: loading data, looking at your data, basic commands

\n", "- Handling missing values\n", "- Intro to scikit-learn\n", "- Grouping and aggregating data\n", "- Feature selection\n", "- Fitting and evaluating a model\n", "- Deploying your work" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##In this notebook you will\n", "\n", "- Learn how to load data into Python\n", "- Learn the basics of working with data in `pandas`\n", "- Clean and manage your data\n", "- Wrangle missing data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Reading from a file" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import pylab as pl\n", "import numpy as np\n", "import re" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "np.sum" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ "" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're going to use the read_csv function in pandas" ] }, { "cell_type": "code", "collapsed": false, "input": [ "?pd.read_csv" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "! head -n 2 ./data/credit-training.csv" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents\r\n", "1,0.76612660900000007,45,2,0.8029821290000001,9120.0,13,0,6,0,2.0\r\n" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "df = pd.read_csv(\"./data/credit-training.csv\")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##What is df?\n", "Our data is represented by a DataFrame. You can think of data frames as a giant spreadsheet which you can program. It's a collection of series (or columns) with a common set of commands that make managing data in Python super easy." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#barf!!!\n", "df" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n",
        "<class 'pandas.core.frame.DataFrame'>\n",
        "Int64Index: 150000 entries, 0 to 149999\n",
        "Data columns (total 11 columns):\n",
        "SeriousDlqin2yrs                        150000  non-null values\n",
        "RevolvingUtilizationOfUnsecuredLines    150000  non-null values\n",
        "age                                     150000  non-null values\n",
        "NumberOfTime30-59DaysPastDueNotWorse    150000  non-null values\n",
        "DebtRatio                               150000  non-null values\n",
        "MonthlyIncome                           120269  non-null values\n",
        "NumberOfOpenCreditLinesAndLoans         150000  non-null values\n",
        "NumberOfTimes90DaysLate                 150000  non-null values\n",
        "NumberRealEstateLoansOrLines            150000  non-null values\n",
        "NumberOfTime60-89DaysPastDueNotWorse    150000  non-null values\n",
        "NumberOfDependents                      146076  non-null values\n",
        "dtypes: float64(4), int64(7)\n",
        "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 6, "text": [ "\n", "Int64Index: 150000 entries, 0 to 149999\n", "Data columns (total 11 columns):\n", "SeriousDlqin2yrs 150000 non-null values\n", "RevolvingUtilizationOfUnsecuredLines 150000 non-null values\n", "age 150000 non-null values\n", "NumberOfTime30-59DaysPastDueNotWorse 150000 non-null values\n", "DebtRatio 150000 non-null values\n", "MonthlyIncome 120269 non-null values\n", "NumberOfOpenCreditLinesAndLoans 150000 non-null values\n", "NumberOfTimes90DaysLate 150000 non-null values\n", "NumberRealEstateLoansOrLines 150000 non-null values\n", "NumberOfTime60-89DaysPastDueNotWorse 150000 non-null values\n", "NumberOfDependents 146076 non-null values\n", "dtypes: float64(4), int64(7)" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "#you'll learn more about head() later\n", "df.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SeriousDlqin2yrsRevolvingUtilizationOfUnsecuredLinesageNumberOfTime30-59DaysPastDueNotWorseDebtRatioMonthlyIncomeNumberOfOpenCreditLinesAndLoansNumberOfTimes90DaysLateNumberRealEstateLoansOrLinesNumberOfTime60-89DaysPastDueNotWorseNumberOfDependents
0 1 0.766127 45 2 0.802982 9120 13 0 6 0 2
1 0 0.957151 40 0 0.121876 2600 4 0 0 0 1
2 0 0.658180 38 1 0.085113 3042 2 1 0 0 0
3 0 0.233810 30 0 0.036050 3300 5 0 0 0 0
4 0 0.907239 49 1 0.024926 63588 7 0 1 0 0
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ " SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age \\\n", "0 1 0.766127 45 \n", "1 0 0.957151 40 \n", "2 0 0.658180 38 \n", "3 0 0.233810 30 \n", "4 0 0.907239 49 \n", "\n", " NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome \\\n", "0 2 0.802982 9120 \n", "1 0 0.121876 2600 \n", "2 1 0.085113 3042 \n", "3 0 0.036050 3300 \n", "4 1 0.024926 63588 \n", "\n", " NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate \\\n", "0 13 0 \n", "1 4 0 \n", "2 2 1 \n", "3 5 0 \n", "4 7 0 \n", "\n", " NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse \\\n", "0 6 0 \n", "1 0 0 \n", "2 0 0 \n", "3 0 0 \n", "4 1 0 \n", "\n", " NumberOfDependents \n", "0 2 \n", "1 1 \n", "2 0 \n", "3 0 \n", "4 0 " ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "type(df)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ "pandas.core.frame.DataFrame" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can think of a Series is a column in your data. A series will have a particular datatype associated with it. Datatypes can be integers, strings, floating point numbers, etc." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print df.SeriousDlqin2yrs.head()\n", "type(df.SeriousDlqin2yrs)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0 1\n", "1 0\n", "2 0\n", "3 0\n", "4 0\n", "Name: SeriousDlqin2yrs, dtype: int64\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ "pandas.core.series.Series" ] } ], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "df.dtypes" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ "SeriousDlqin2yrs int64\n", "RevolvingUtilizationOfUnsecuredLines float64\n", "age int64\n", "NumberOfTime30-59DaysPastDueNotWorse int64\n", "DebtRatio float64\n", "MonthlyIncome float64\n", "NumberOfOpenCreditLinesAndLoans int64\n", "NumberOfTimes90DaysLate int64\n", "NumberRealEstateLoansOrLines int64\n", "NumberOfTime60-89DaysPastDueNotWorse int64\n", "NumberOfDependents float64\n", "dtype: object" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Basic Commands\n", "Many of these commands will be familiar for those of you coming from a UNIX/Linux background. If not, don't sweat it, these commands are super easy, but extremely helpful for \"taking a look around\" at your data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Selecting a column" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df['DebtRatio']\n", "df.DebtRatio" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ "0 0.802982\n", "1 0.121876\n", "2 0.085113\n", "3 0.036050\n", "4 0.024926\n", "5 0.375607\n", "6 5710.000000\n", "7 0.209940\n", "8 46.000000\n", "9 0.606291\n", "10 0.309476\n", "11 0.531529\n", "12 0.298354\n", "13 0.382965\n", "14 477.000000\n", "...\n", "149985 0.324962\n", "149986 0.080384\n", "149987 0.055692\n", "149988 0.347924\n", "149989 0.001408\n", "149990 0.609779\n", "149991 0.477658\n", "149992 4132.000000\n", "149993 0.000000\n", "149994 0.404293\n", "149995 0.225131\n", "149996 0.716562\n", "149997 3870.000000\n", "149998 0.000000\n", "149999 0.249908\n", "Name: DebtRatio, Length: 150000, dtype: float64" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###head(n=5)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SeriousDlqin2yrsRevolvingUtilizationOfUnsecuredLinesageNumberOfTime30-59DaysPastDueNotWorseDebtRatioMonthlyIncomeNumberOfOpenCreditLinesAndLoansNumberOfTimes90DaysLateNumberRealEstateLoansOrLinesNumberOfTime60-89DaysPastDueNotWorseNumberOfDependents
0 1 0.766127 45 2 0.802982 9120 13 0 6 0 2
1 0 0.957151 40 0 0.121876 2600 4 0 0 0 1
2 0 0.658180 38 1 0.085113 3042 2 1 0 0 0
3 0 0.233810 30 0 0.036050 3300 5 0 0 0 0
4 0 0.907239 49 1 0.024926 63588 7 0 1 0 0
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ " SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age \\\n", "0 1 0.766127 45 \n", "1 0 0.957151 40 \n", "2 0 0.658180 38 \n", "3 0 0.233810 30 \n", "4 0 0.907239 49 \n", "\n", " NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome \\\n", "0 2 0.802982 9120 \n", "1 0 0.121876 2600 \n", "2 1 0.085113 3042 \n", "3 0 0.036050 3300 \n", "4 1 0.024926 63588 \n", "\n", " NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate \\\n", "0 13 0 \n", "1 4 0 \n", "2 2 1 \n", "3 5 0 \n", "4 7 0 \n", "\n", " NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse \\\n", "0 6 0 \n", "1 0 0 \n", "2 0 0 \n", "3 0 0 \n", "4 1 0 \n", "\n", " NumberOfDependents \n", "0 2 \n", "1 1 \n", "2 0 \n", "3 0 \n", "4 0 " ] } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "df.head(1)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SeriousDlqin2yrsRevolvingUtilizationOfUnsecuredLinesageNumberOfTime30-59DaysPastDueNotWorseDebtRatioMonthlyIncomeNumberOfOpenCreditLinesAndLoansNumberOfTimes90DaysLateNumberRealEstateLoansOrLinesNumberOfTime60-89DaysPastDueNotWorseNumberOfDependents
0 1 0.766127 45 2 0.802982 9120 13 0 6 0 2
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ " SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age \\\n", "0 1 0.766127 45 \n", "\n", " NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome \\\n", "0 2 0.802982 9120 \n", "\n", " NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate \\\n", "0 13 0 \n", "\n", " NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse \\\n", "0 6 0 \n", "\n", " NumberOfDependents \n", "0 2 " ] } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "df.SeriousDlqin2yrs.head()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ "0 1\n", "1 0\n", "2 0\n", "3 0\n", "4 0\n", "Name: SeriousDlqin2yrs, dtype: int64" ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###tail(n=5)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df.tail()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SeriousDlqin2yrsRevolvingUtilizationOfUnsecuredLinesageNumberOfTime30-59DaysPastDueNotWorseDebtRatioMonthlyIncomeNumberOfOpenCreditLinesAndLoansNumberOfTimes90DaysLateNumberRealEstateLoansOrLinesNumberOfTime60-89DaysPastDueNotWorseNumberOfDependents
149995 0 0.040674 74 0 0.225131 2100 4 0 1 0 0
149996 0 0.299745 44 0 0.716562 5584 4 0 1 0 2
149997 0 0.246044 58 0 3870.000000 NaN 18 0 1 0 0
149998 0 0.000000 30 0 0.000000 5716 4 0 0 0 0
149999 0 0.850283 64 0 0.249908 8158 8 0 2 0 0
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 15, "text": [ " SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age \\\n", "149995 0 0.040674 74 \n", "149996 0 0.299745 44 \n", "149997 0 0.246044 58 \n", "149998 0 0.000000 30 \n", "149999 0 0.850283 64 \n", "\n", " NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome \\\n", "149995 0 0.225131 2100 \n", "149996 0 0.716562 5584 \n", "149997 0 3870.000000 NaN \n", "149998 0 0.000000 5716 \n", "149999 0 0.249908 8158 \n", "\n", " NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate \\\n", "149995 4 0 \n", "149996 4 0 \n", "149997 18 0 \n", "149998 4 0 \n", "149999 8 0 \n", "\n", " NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse \\\n", "149995 1 0 \n", "149996 1 0 \n", "149997 1 0 \n", "149998 0 0 \n", "149999 2 0 \n", "\n", " NumberOfDependents \n", "149995 0 \n", "149996 2 \n", "149997 0 \n", "149998 0 \n", "149999 0 " ] } ], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "df.RevolvingUtilizationOfUnsecuredLines.tail()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ "149995 0.040674\n", "149996 0.299745\n", "149997 0.246044\n", "149998 0.000000\n", "149999 0.850283\n", "Name: RevolvingUtilizationOfUnsecuredLines, dtype: float64" ] } ], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###describe(percentile_width=50)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df.describe()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SeriousDlqin2yrsRevolvingUtilizationOfUnsecuredLinesageNumberOfTime30-59DaysPastDueNotWorseDebtRatioMonthlyIncomeNumberOfOpenCreditLinesAndLoansNumberOfTimes90DaysLateNumberRealEstateLoansOrLinesNumberOfTime60-89DaysPastDueNotWorseNumberOfDependents
count 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 120269.000000 150000.000000 150000.000000 150000.000000 150000.000000 146076.000000
mean 0.066840 6.048438 52.295207 0.421033 353.005076 6670.221237 8.452760 0.265973 1.018240 0.240387 0.757222
std 0.249746 249.755371 14.771866 4.192781 2037.818523 14384.674215 5.145951 4.169304 1.129771 4.155179 1.115086
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.029867 41.000000 0.000000 0.175074 3400.000000 5.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.154181 52.000000 0.000000 0.366508 5400.000000 8.000000 0.000000 1.000000 0.000000 0.000000
75% 0.000000 0.559046 63.000000 0.000000 0.868254 8249.000000 11.000000 0.000000 2.000000 0.000000 1.000000
max 1.000000 50708.000000 109.000000 98.000000 329664.000000 3008750.000000 58.000000 98.000000 54.000000 98.000000 20.000000
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 17, "text": [ " SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age \\\n", "count 150000.000000 150000.000000 150000.000000 \n", "mean 0.066840 6.048438 52.295207 \n", "std 0.249746 249.755371 14.771866 \n", "min 0.000000 0.000000 0.000000 \n", "25% 0.000000 0.029867 41.000000 \n", "50% 0.000000 0.154181 52.000000 \n", "75% 0.000000 0.559046 63.000000 \n", "max 1.000000 50708.000000 109.000000 \n", "\n", " NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome \\\n", "count 150000.000000 150000.000000 120269.000000 \n", "mean 0.421033 353.005076 6670.221237 \n", "std 4.192781 2037.818523 14384.674215 \n", "min 0.000000 0.000000 0.000000 \n", "25% 0.000000 0.175074 3400.000000 \n", "50% 0.000000 0.366508 5400.000000 \n", "75% 0.000000 0.868254 8249.000000 \n", "max 98.000000 329664.000000 3008750.000000 \n", "\n", " NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate \\\n", "count 150000.000000 150000.000000 \n", "mean 8.452760 0.265973 \n", "std 5.145951 4.169304 \n", "min 0.000000 0.000000 \n", "25% 5.000000 0.000000 \n", "50% 8.000000 0.000000 \n", "75% 11.000000 0.000000 \n", "max 58.000000 98.000000 \n", "\n", " NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse \\\n", "count 150000.000000 150000.000000 \n", "mean 1.018240 0.240387 \n", "std 1.129771 4.155179 \n", "min 0.000000 0.000000 \n", "25% 0.000000 0.000000 \n", "50% 1.000000 0.000000 \n", "75% 2.000000 0.000000 \n", "max 54.000000 98.000000 \n", "\n", " NumberOfDependents \n", "count 146076.000000 \n", "mean 0.757222 \n", "std 1.115086 \n", "min 0.000000 \n", "25% 0.000000 \n", "50% 0.000000 \n", "75% 1.000000 \n", "max 20.000000 " ] } ], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "df.age.describe(percentile_width=25)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ "count 150000.000000\n", "mean 52.295207\n", "std 14.771866\n", "min 0.000000\n", "37.5% 47.000000\n", "50% 52.000000\n", "62.5% 57.000000\n", "max 109.000000\n", "dtype: float64" ] } ], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###unqiue() and nunique()" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df.NumberOfDependents.unique()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ "array([ 2., 1., 0., nan, 3., 4., 5., 6., 8., 7., 20.,\n", " 10., 9., 13.])" ] } ], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "df.NumberOfDependents.nunique()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 20, "text": [ "13" ] } ], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###pd.value_counts(values_to_count)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.value_counts(df.NumberOfDependents)\n", "df.NumberOfDependents.value_counts()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ "0 86902\n", "1 26316\n", "2 19522\n", "3 9483\n", "4 2862\n", "5 746\n", "6 158\n", "7 51\n", "8 24\n", "10 5\n", "9 5\n", "20 1\n", "13 1\n", "dtype: int64" ] } ], "prompt_number": 21 }, { "cell_type": "code", "collapsed": false, "input": [ "pd.value_counts(df.NumberOfDependents, ascending=True)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 22, "text": [ "13 1\n", "20 1\n", "9 5\n", "10 5\n", "8 24\n", "7 51\n", "6 158\n", "5 746\n", "4 2862\n", "3 9483\n", "2 19522\n", "1 26316\n", "0 86902\n", "dtype: int64" ] } ], "prompt_number": 22 }, { "cell_type": "code", "collapsed": false, "input": [ "pd.value_counts(df.NumberOfDependents, sort=False)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 23, "text": [ "0 86902\n", "1 26316\n", "2 19522\n", "3 9483\n", "4 2862\n", "5 746\n", "6 158\n", "7 51\n", "8 24\n", "9 5\n", "10 5\n", "13 1\n", "20 1\n", "dtype: int64" ] } ], "prompt_number": 23 }, { "cell_type": "code", "collapsed": false, "input": [ "#chain value_counts together with head() to give you the top 3\n", "pd.value_counts(df.NumberOfDependents).head(3)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 24, "text": [ "0 86902\n", "1 26316\n", "2 19522\n", "dtype: int64" ] } ], "prompt_number": 24 }, { "cell_type": "code", "collapsed": false, "input": [ "pd.value_counts(df.NumberOfDependents).plot(kind='bar')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 25, "text": [ "" ] }, { "metadata": {}, "output_type": "display_data", "png": "iVBORw0KGgoAAAANSUhEUgAAAX8AAAELCAYAAAAx94awAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3X9UU3eaP/A3Gma6bZUIraESpqkSQJSCbSd6ek4rXQoo\nnVo7dqC6y4+Wdiqe2VHPHhfbPXt2948x8ezZU/VUdns6sdrOFnDcrjpdzThasm23gi1C7RZnTSvy\nI4nprBHFXyDwfP8Q7hcEBOXmEpL365wcej+59/O+92Ifkic3SYSICIiIKKxMmegdICIi7bH4ExGF\nIRZ/IqIwxOJPRBSGWPyJiMIQiz8RURgatfhv3boVqampmD9/PrZu3QoA8Pv9yMrKQmJiIrKzs9He\n3q6sb7VaYTabkZycjEOHDinjdXV1SE1Nhdlsxtq1a5Xxzs5O5Ofnw2w2Y9GiRWhublbz+IiIaBi3\nLP7/8z//g1//+tf44osv8NVXX+Gjjz7Cd999B5vNhqysLJw6dQqZmZmw2WwAgMbGRlRVVaGxsREO\nhwNr1qxB/9sISktLYbfb4XK54HK54HA4AAB2ux0xMTFwuVxYv349ysrKAnzIRER0y+L/xz/+EQsX\nLsRdd92FqVOnYvHixfj3f/937N+/H0VFRQCAoqIi7N27FwCwb98+rFy5EpGRkTCZTEhISEBtbS28\nXi86OjpgsVgAAIWFhco2A+dasWIFjhw5ErCDJSKiG25Z/OfPn49PP/0Ufr8fV65cwYEDB9DW1gaf\nzweDwQAAMBgM8Pl8AACPxwOj0ahsbzQa4Xa7h4zHxcXB7XYDANxuN+Lj4wEAOp0OUVFR8Pv96h4l\nERENorvVncnJySgrK0N2djbuuecepKenY+rUqYPWiYiIQEREREB3koiI1HXL4g8AL7/8Ml5++WUA\nwN/+7d/CaDTCYDDg7NmziI2NhdfrxcyZMwHceETf2tqqbNvW1gaj0Yi4uDi0tbUNGe/fpqWlBbNm\nzUJ3dzcuXLiA6OjoIfuRkJCA7777bnxHS0QUZtLS0tDQ0DBkfNSrfb7//nsAQEtLCz788EOsWrUK\ny5Ytw65duwAAu3btwvLlywEAy5YtQ2VlJbq6utDU1ASXywWLxYLY2FhMnz4dtbW1EBG8//77eO65\n55Rt+ufas2cPMjMzh92P7777DiJyx7e///u/H9f2zJicOaGSEUrHEioZk+VYvvrqq2Fr6qiP/F94\n4QWcO3cOkZGRKC8vR1RUFDZu3Ii8vDzY7XaYTCbs3r0bAJCSkoK8vDykpKRAp9OhvLxcaQmVl5ej\nuLgYV69eRW5uLpYsWQIAKCkpQUFBAcxmM2JiYlBZWTnaLt2RM2fOBGReZgR3TqhkaJXDjODLCVTG\nqMX/k08+GTIWHR2Nw4cPD7v+G2+8gTfeeGPI+KOPPoqvv/56yPgPf/hD5Y8HERFpI2ze4VtcXMyM\nIMrQKidUMrTKYUbw5QQqI0JEJsWXuURERGCS7CoRUdAYqXaGzSN/p9PJjCDK0ConVDK0ymFG8OUE\nKmPUnn+wmz49Gh0d51WZa9q0Gbh4kW8wI6LQN+nbPjeuJlLrENhaIqLQEvZtHyIi+v/CqPg7A58w\nift/WmdolRMqGVrlMCP4cgKVEUbFn4iI+rHnP3g29vyJKKSw509ERIowKv7OwCdM4v6f1hla5YRK\nhlY5zAi+HPb8iYhINez5D56NPX8iCins+RMRkSKMir8z8AmTuP+ndYZWOaGSoVUOM4Ivhz1/IiJS\nzag9f6vVit/85jeYMmUKUlNT8e677+Ly5cvIz89Hc3Oz8k1eer1eWX/Hjh2YOnUqtm3bhuzsbABA\nXV0diouLce3aNeTm5mLr1q0AgM7OThQWFuL48eOIiYlBVVUVHnzwwaE7yp4/EdFtu6Oe/5kzZ/DO\nO+/g+PHj+Prrr9HT04PKykrYbDZkZWXh1KlTyMzMhM1mAwA0NjaiqqoKjY2NcDgcWLNmjRJaWloK\nu90Ol8sFl8sFh8MBALDb7YiJiYHL5cL69etRVlam9rETEdFNbln8p0+fjsjISFy5cgXd3d24cuUK\nZs2ahf3796OoqAgAUFRUhL179wIA9u3bh5UrVyIyMhImkwkJCQmora2F1+tFR0cHLBYLAKCwsFDZ\nZuBcK1aswJEjRwJ0qM4AzTsgYRL3/7TO0ConVDK0ymFG8OVMSM8/Ojoaf/3Xf40f/ehHmDVrFvR6\nPbKysuDz+WAwGAAABoMBPp8PAODxeGA0GpXtjUYj3G73kPG4uDi43W4AgNvtRnx8PABAp9MhKioK\nfj8/U5+IKJBuWfy/++47bNmyBWfOnIHH48GlS5fwm9/8ZtA6ERERfX33YJcR+IQMZgRbTqhkaJXD\njODLCVTGLb/J68svv8Tjjz+OmJgYAMBPf/pTHD16FLGxsTh79ixiY2Ph9Xoxc+ZMADce0be2tirb\nt7W1wWg0Ii4uDm1tbUPG+7dpaWnBrFmz0N3djQsXLiA6OnrY/SkuLobJZAIA6PV6pKenD7jX2fcz\nY5zLfUt9T7X6TzyXucxlLk+GZafTiZ07dwKAUi+HJbfQ0NAg8+bNkytXrkhvb68UFhbKW2+9JRs2\nbBCbzSYiIlarVcrKykRE5JtvvpG0tDTp7OyU06dPy+zZs6W3t1dERCwWi9TU1Ehvb68sXbpUDh48\nKCIi27dvl9WrV4uISEVFheTn5w+7LyPtKgABZAy36jGsc8vTMarq6upxbR9OGVrlhEqGVjnMCL6c\n8WaMVNdu+cg/LS0NhYWFeOyxxzBlyhQ88sgj+PnPf46Ojg7k5eXBbrcrl3oCQEpKCvLy8pCSkgKd\nTofy8nKlJVReXo7i4mJcvXoVubm5WLJkCQCgpKQEBQUFMJvNiImJQWVl5a12iYiIVMDP9hk8G6/z\nJ6KQws/2ISIiRRgVf2fgEybxNb9aZ2iVEyoZWuUwI/hyApURRsWfiIj6sec/eDb2/IkopLDnT0RE\nijAq/s7AJ0zi/p/WGVrlhEqGVjnMCL4c9vyJiEg17PkPno09fyIKKez5ExGRIoyKvzPwCZO4/6d1\nhlY5oZKhVQ4zgi+HPX8iIlINe/6DZ2PPn4hCCnv+RESkCKPi7wx8wiTu/2mdoVVOqGRolcOM4Mth\nz5+IiFTDnv/g2djzJ6KQcsc9///93//FggULlFtUVBS2bdsGv9+PrKwsJCYmIjs7G+3t7co2VqsV\nZrMZycnJOHTokDJeV1eH1NRUmM1mrF27Vhnv7OxEfn4+zGYzFi1ahObm5vEeLxER3cKoxT8pKQn1\n9fWor69HXV0d7r77bjz//POw2WzIysrCqVOnkJmZCZvNBgBobGxEVVUVGhsb4XA4sGbNGuWvTmlp\nKex2O1wuF1wuFxwOBwDAbrcjJiYGLpcL69evR1lZWQAO1RmAOW9KmMT9P60ztMoJlQytcpgRfDlB\n0fM/fPgwEhISEB8fj/3796OoqAgAUFRUhL179wIA9u3bh5UrVyIyMhImkwkJCQmora2F1+tFR0cH\nLBYLAKCwsFDZZuBcK1aswJEjR1Q7QCIiGuq2in9lZSVWrlwJAPD5fDAYDAAAg8EAn88HAPB4PDAa\njco2RqMRbrd7yHhcXBzcbjcAwO12Iz4+HgCg0+kQFRUFv98/jsMaTobK8w2TkMGMYMsJlQytcpgR\nfDmBytCNdcWuri787ne/w+bNm4fcFxER0ffCa2AVFxfDZDIBAPR6PdLT0wfc6+z7mTHO5b6lvqda\n/Seey1zmMpcnw7LT6cTOnTsBQKmXw5Ix2rt3r+Tk5CjLSUlJ4vV6RUTE4/FIUlKSiIhYrVaxWq3K\nejk5OVJTUyNer1eSk5OV8Q8++EBWr16trHP06FEREbl+/brcd999Q/JH2lUAAsgYbtVjWGfMp2NY\n1dXV49o+nDK0ygmVDK1ymBF8OePNGKmujbntU1FRobR8AGDZsmXYtWsXAGDXrl1Yvny5Ml5ZWYmu\nri40NTXB5XLBYrEgNjYW06dPR21tLUQE77//Pp577rkhc+3ZsweZmZlj3S0iIroDY7rO//Lly3jw\nwQfR1NSEadOmAQD8fj/y8vLQ0tICk8mE3bt3Q6/XAwA2bdqEHTt2QKfTYevWrcjJyQFw41LP4uJi\nXL16Fbm5udi2bRuAG5d6FhQUoL6+HjExMaisrBzydIXX+RMR3b4Ra+dYin8wYPEnIrp9/GA3Xucf\nVBla5YRKhlY5zAi+nEBlhFHxJyKifmz7DJ6NbR8iCils+xARkSKMir8z8AmTuP+ndYZWOaGSoVUO\nM4Ivhz1/IiJSDXv+g2djz5+IQgp7/kREpAij4u8MfMIk7v9pnaFVTqhkaJXDjODLYc+fiIhUw57/\n4NnY8yeikMKePxERKcKo+DsDnzCJ+39aZ2iVEyoZWuUwI/hy2PMnIiLVsOc/eDb2/IkopLDnT0RE\nijEV//b2drzwwguYO3cuUlJSUFtbC7/fj6ysLCQmJiI7Oxvt7e3K+larFWazGcnJyTh06JAyXldX\nh9TUVJjNZqxdu1YZ7+zsRH5+PsxmMxYtWoTm5mYVD7GfMwBz3pQwift/WmdolRMqGVrlMCP4cia0\n57927Vrk5ubi5MmTOHHiBJKTk2Gz2ZCVlYVTp04hMzMTNpsNANDY2Iiqqio0NjbC4XBgzZo1ylOO\n0tJS2O12uFwuuFwuOBwOAIDdbkdMTAxcLhfWr1+PsrKygBwsERH1Ge2b39vb2+Whhx4aMp6UlCRn\nz54VERGv1ytJSUkiIrJp0yax2WzKejk5OXL06FHxeDySnJysjFdUVMhrr72mrFNTUyMiItevX5f7\n7rtvSN5IuwpAAFHpNurpICKaVEaqa6M+8m9qasL999+Pl156CY888gheffVVXL58GT6fDwaDAQBg\nMBjg8/kAAB6PB0ajUdneaDTC7XYPGY+Li4Pb7QYAuN1uxMfHAwB0Oh2ioqLg9/vV+etGRERD6EZb\nobu7G8ePH8dbb72FH//4x1i3bp3S4ukXERHRd9VNYBUXF8NkMgEA9Ho90tPTB9zr7PuZMcLyFgDp\nt7i/f7lvqa/PlpGRMeblhoYGrFu37o63H8ty/1ig5h84d6Dm71/m+eL5CtT52rJlC9LT0wP6+xh4\nDMF0vpxOJ3bu3AkASr0c1mhPGbxer5hMJmX5008/ldzcXElOThav1ysiIh6PR2n7WK1WsVqtyvr9\nLR2v1zuo7fPBBx/I6tWrlXWOHj0qIoFs+1QHvO1TXV09ru3DKUOrnFDJ0CqHGcGXM96MkeramK7z\nf/LJJ/HrX/8aiYmJ+Id/+AdcuXIFABATE4OysjLYbDa0t7fDZrOhsbERq1atwrFjx+B2u/H000/j\n22+/RUREBBYuXIht27bBYrHgmWeewS9/+UssWbIE5eXl+Prrr/Ev//IvqKysxN69e1FZWTloH3id\nPxHR7Ruxdo6l+H/11Vd45ZVX0NXVhTlz5uDdd99FT08P8vLy0NLSApPJhN27d0Ov1wMANm3ahB07\ndkCn02Hr1q3IyckBcONSz+LiYly9ehW5ubnYtm0bgBuXehYUFKC+vh4xMTGorKwc8nSFxZ+I6PaN\nVDsnzeUtI+0q2PaZlBla5YRKhlY5zAi+nEC1ffgOXyKiMMTP9hk8G9s+RBRS+Nk+RESkCKPi7wx8\nwiT+nA+tM7TKCZUMrXKYEXw5gcoIo+JPRET92PMfPBt7/kQUUtjzJyIiRRgVf2fgEyZx/0/rDK1y\nQiVDqxxmBF8Oe/5ERKQa9vwHz8aePxGFFPb8iYhIEUbF3xn4hEnc/9M6Q6ucUMnQKocZwZfDnj8R\nEamGPf/Bs7HnT0QhhT1/IiJSjKn4m0wmPPzww1iwYAEsFgsAwO/3IysrC4mJicjOzkZ7e7uyvtVq\nhdlsRnJyMg4dOqSM19XVITU1FWazGWvXrlXGOzs7kZ+fD7PZjEWLFqG5uVmt4xvAGYA5b0qYxP0/\nrTO0ygmVDK1ymBF8ORPa84+IiIDT6UR9fT2OHTsGALDZbMjKysKpU6eQmZmpfKl7Y2Mjqqqq0NjY\nCIfDgTVr1ihPOUpLS2G32+FyueByueBwOAAAdrsdMTExcLlcWL9+PcrKygJxrERE1G8s3wRjMpnk\n//7v/waNJSUlydmzZ0Xkxpe893+B+6ZNm8Rmsynr9X85u8fjGfQF7hUVFfLaa68p69TU1IhIIL/A\nfSy3SfPFZkREYzJSXRvzI/+nn34ajz32GN555x0AgM/ng8FgAAAYDAb4fD4AgMfjgdFoVLY1Go1w\nu91DxuPi4uB2uwEAbrcb8fHxAACdToeoqCj4/f5x/lkjIqKRjKn4//d//zfq6+tx8OBBbN++HZ9+\n+umg+yMiIvquuglmzsAnTOL+n9YZWuWESoZWOcwIvpxAZejGstIDDzwAALj//vvx/PPP49ixYzAY\nDDh79ixiY2Ph9Xoxc+ZMADce0be2tirbtrW1wWg0Ii4uDm1tbUPG+7dpaWnBrFmz0N3djQsXLiA6\nOnrIfhQXF8NkMgEA9Ho90tPTB9zr7PuZMcJywyj39y/3LfWd8IyMjDEvNzQ03Nb6d7I8nv0LtmWe\nL56vQC03NDRoktdvoo934LLT6cTOnTsBQKmXwxqtX3T58mW5ePGiiIhcunRJHn/8cfn9738vGzZs\nUHr7VqtVysrKRETkm2++kbS0NOns7JTTp0/L7Nmzpbe3V0RELBaL1NTUSG9vryxdulQOHjwoIiLb\nt2+X1atXi8iN1wLy8/PH3LcCe/5ERCMaqa6N+sjf5/Ph+eefBwB0d3fjL/7iL5CdnY3HHnsMeXl5\nsNvtMJlM2L17NwAgJSUFeXl5SElJgU6nQ3l5udISKi8vR3FxMa5evYrc3FwsWbIEAFBSUoKCggKY\nzWbExMSgsrJytN0iIqLx0PiP0B0baVcx5kf+1QF/5F9dXT2u7cMpQ6ucUMnQKocZwZcz3oyR6hrf\n4UtEFIb42T6DZ+Nn+xBRSOFn+xARkSKMir8z8AmT+JpfrTO0ygmVDK1ymBF8OYHKCKPiT0RE/djz\nHzwbe/5EFFLY8yciIkUYFX9n4BMmcf9P6wytckIlQ6scZgRfDnv+RESkGvb8B8/Gnj8RhRT2/ImI\nSBFGxd8Z+IRJ3P/TOkOrnFDJ0CqHGcGXw54/ERGphj3/wbOx509EIYU9fyIiUoRR8XcGPmES9/+0\nztAqJ1QytMphRvDlTGjPv6enBwsWLMCzzz4LAPD7/cjKykJiYiKys7PR3t6urGu1WmE2m5GcnIxD\nhw4p43V1dUhNTYXZbMbatWuV8c7OTuTn58NsNmPRokVobm5W69iIiGgkY/kmmH/+53+WVatWybPP\nPisiIhs2bJDNmzeLiIjNZhvy/b1dXV3S1NQkc+bMUb6/98c//rHU1taKiAz5/t7S0lIREamsrBz2\n+3tv9W004Hf4EhGNaKS6Nuoj/7a2Nhw4cACvvPKK8qLB/v37UVRUBAAoKirC3r17AQD79u3DypUr\nERkZCZPJhISEBNTW1sLr9aKjowMWiwUAUFhYqGwzcK4VK1bgyJEj6v51IyKiIUYt/uvXr8c//dM/\nYcqU/7+qz+eDwWAAABgMBvh8PgCAx+OB0WhU1jMajXC73UPG4+Li4Ha7AQButxvx8fEAAJ1Oh6io\nKPj9fhUO7WbOAMx5U8Ik7v9pnaFVTqhkaJXDjODLmZCe/0cffYSZM2diwYIFI14CGRER0Xe5JRER\nTRa6W935+eefY//+/Thw4ACuXbuGixcvoqCgAAaDAWfPnkVsbCy8Xi9mzpwJ4MYj+tbWVmX7trY2\nGI1GxMXFoa2tbch4/zYtLS2YNWsWuru7ceHCBURHRw+7P8XFxTCZTAAAvV6P9PT0Afc6+35mjLDc\nPzbS/U4M1P/XNiMj47aWx7t9MCxnZGRoltcvmI7/dpd5voLvfPWPBcPxjnf5ds+X0+nEzp07AUCp\nl8Ma64sGTqdTfvKTn4jIjRd8bTabiIhYrdYhL/h2dnbK6dOnZfbs2coLvhaLRWpqaqS3t3fIC76r\nV68WEZGKigq+4EtEpKKR6tptXeff397ZuHEj/vCHPyAxMREff/wxNm7cCABISUlBXl4eUlJSsHTp\nUpSXlyvblJeX45VXXoHZbEZCQgKWLFkCACgpKcG5c+dgNpuxZcsW2Gy229ml2+AM0LwDEiZx/0/r\nDK1yQiVDqxxmBF9OoDJu2fYZaPHixVi8eDEAIDo6GocPHx52vTfeeANvvPHGkPFHH30UX3/99ZDx\nH/7wh9i9e/dYd4OIiFTAz/YZPBs/24eIQgo/24eIiBRhVPydgU+YxP0/rTO0ygmVDK1ymBF8OYHK\nCKPiT0RE/djzHzwbe/5EFFLY8yciIkUYFX9n4BMmcf9P6wytckIlQ6scZgRfDnv+RESkGvb8B8/G\nnj8RhRT2/ImISBFGxd8Z+IRJ3P/TOkOrnFDJ0CqHGcGXw54/ERGphj3/wbOx509EIYU9fyIiUoRR\n8XcGPmES9/+0ztAqJ1QytMphRvDlsOdPRESquWXP/9q1a1i8eDE6OzvR1dWF5557DlarFX6/H/n5\n+WhubobJZMLu3buh1+sBAFarFTt27MDUqVOxbds2ZGdnAwDq6upQXFyMa9euITc3F1u3bgUAdHZ2\norCwEMePH0dMTAyqqqrw4IMPDt1R9vyJiG7bHfX877rrLlRXV6OhoQEnTpxAdXU1PvvsM9hsNmRl\nZeHUqVPIzMxUvnqxsbERVVVVaGxshMPhwJo1a5TQ0tJS2O12uFwuuFwuOBwOAIDdbkdMTAxcLhfW\nr1+PsrIytY+diIhuMmrb5+677wYAdHV1oaenBzNmzMD+/ftRVFQEACgqKsLevXsBAPv27cPKlSsR\nGRkJk8mEhIQE1NbWwuv1oqOjAxaLBQBQWFiobDNwrhUrVuDIkSPqHyUA9vyDK0OrnFDJ0CqHGcGX\nM2E9/97eXqSnp8NgMOCpp57CvHnz4PP5YDAYAAAGgwE+nw8A4PF4YDQalW2NRiPcbveQ8bi4OLjd\nbgCA2+1GfHw8AECn0yEqKgp+v1+9IyQioiFG/QL3KVOmoKGhARcuXEBOTg6qq6sH3R8REdHXdw92\nGYFPyGBGsOWESoZWOcwIvpxAZYxa/PtFRUXhmWeeQV1dHQwGA86ePYvY2Fh4vV7MnDkTwI1H9K2t\nrco2bW1tMBqNiIuLQ1tb25Dx/m1aWlowa9YsdHd348KFC4iOjh52H4qLi2EymQAAer0e6enpA+51\n9v3MGOdy31LfU63+E89lLnOZy5Nh2el0YufOnQCg1MthyS386U9/kvPnz4uIyJUrV+SJJ56Qw4cP\ny4YNG8Rms4mIiNVqlbKyMhER+eabbyQtLU06Ozvl9OnTMnv2bOnt7RUREYvFIjU1NdLb2ytLly6V\ngwcPiojI9u3bZfXq1SIiUlFRIfn5+cPuy0i7CkAAGcOtegzr3PJ0jKq6unpc24dThlY5oZKhVQ4z\ngi9nvBkj1bVbPvL3er0oKipCb28vent7UVBQgMzMTCxYsAB5eXmw2+3KpZ4AkJKSgry8PKSkpECn\n06G8vFxpCZWXl6O4uBhXr15Fbm4ulixZAgAoKSlBQUEBzGYzYmJiUFlZeatdIiIiFfCzfQbPxuv8\niSik8LN9iIhIEUbF3xn4hEl8za/WGVrlhEqGVjnMCL6cQGWEUfEnIqJ+7PkPno09fyIKKez5ExGR\nIoyKv/OOt5w+PVp5J/N4b9OnD/8GtjEfxSTuMU5ETqhkaJXDjODLYc9/AnV0nMeN1tJot+pR17kx\nFxHRxGLPf/BsfF2BiEIKe/5ERKQIo+LvDImMydxjnIicUMnQKocZwZfDnj8REamGPf/Bs7HnT0Qh\nhT1/IiJShFHxd4ZExmTuMU5ETqhkaJXDjODLYc+fiIhUw57/4NnY8yeikHLHPf/W1lY89dRTmDdv\nHubPn49t27YBAPx+P7KyspCYmIjs7Gy0t7cr21itVpjNZiQnJ+PQoUPKeF1dHVJTU2E2m7F27Vpl\nvLOzE/n5+TCbzVi0aBGam5vHdbBERHRroxb/yMhIvPnmm/jmm29QU1OD7du34+TJk7DZbMjKysKp\nU6eQmZkJm80GAGhsbERVVRUaGxvhcDiwZs0a5a9OaWkp7HY7XC4XXC4XHA4HAMButyMmJgYulwvr\n169HWVlZAA7VGYA5tc+YzD3GicgJlQytcpgRfDkT1vOPjY1Feno6AODee+/F3Llz4Xa7sX//fhQV\nFQEAioqKsHfvXgDAvn37sHLlSkRGRsJkMiEhIQG1tbXwer3o6OiAxWIBABQWFirbDJxrxYoVOHLk\niPpHSkREitt6wffMmTOor6/HwoUL4fP5YDAYAAAGgwE+nw8A4PF4YDQalW2MRiPcbveQ8bi4OLjd\nbgCA2+1GfHw8AECn0yEqKgp+v398RzZEhsrzTUxGRkZoZGiVEyoZWuUwI/hyApWhG+uKly5dwooV\nK7B161ZMmzZt0H39H1ccaMXFxTCZTAAAvV6vPCO5wdn3M2Ocy31LfU+1/v+JD+z8XOYyl7msxrLT\n6cTOnTsBQKmXw5Ix6OrqkuzsbHnzzTeVsaSkJPF6vSIi4vF4JCkpSURErFarWK1WZb2cnBypqakR\nr9crycnJyvgHH3wgq1evVtY5evSoiIhcv35d7rvvviH7MNKuAhBAxnCrHsM6E5cxVtXV1ePaPlgy\ntMoJlQytcpgRfDnjzRip5oza9hERlJSUICUlBevWrVPGly1bhl27dgEAdu3aheXLlyvjlZWV6Orq\nQlNTE1wuFywWC2JjYzF9+nTU1tZCRPD+++/jueeeGzLXnj17kJmZOdpuhaSxfGnMU089pcmXxhBR\naBv1Ov/PPvsMTz75JB5++GGltWO1WmGxWJCXl4eWlhaYTCbs3r0ber0eALBp0ybs2LEDOp0OW7du\nRU5ODoAbl3oWFxfj6tWryM3NVS4b7ezsREFBAerr6xETE4PKysohT1fC4Tp/vp+AiNQ2Yu0crfgH\nCxZ/9XKIKHzwg91C5Dr/UHkvgVY5oZKhVQ4zgi8nUBlhVPyJiKgf2z6DZ2Pbh4hCCts+RESkCKPi\n72TGWBNwundmAAAP/ElEQVTYLw26DK1ymBF8Oez5ExGRatjzHzwbe/5EFFLY8yciIkUYFX8nM8aa\nwH5p0GVolcOM4Mthz5+IiFTDnv/g2djzJ6KQwp4/EREpwqj4O5kx1gT2S4MuQ6scZgRfDnv+RESk\nGvb8B8/Gnj8RhRT2/ImISDFq8X/55ZdhMBiQmpqqjPn9fmRlZSExMRHZ2dlob29X7rNarTCbzUhO\nTsahQ4eU8bq6OqSmpsJsNmPt2rXKeGdnJ/Lz82E2m7Fo0SI0NzerdWw3cQZo3tDLYL80+DK0ymFG\n8OVMWM//pZdegsPhGDRms9mQlZWFU6dOITMzEzabDQDQ2NiIqqoqNDY2wuFwYM2aNcrTjdLSUtjt\ndrhcLrhcLmVOu92OmJgYuFwurF+/HmVlZWofIxER3Wws3/7e1NQk8+fPV5aTkpLk7NmzIiLi9Xol\nKSlJREQ2bdokNptNWS8nJ0eOHj0qHo9HkpOTlfGKigp57bXXlHVqampEROT69ety3333DbsPI+0q\nAAFEpdvEZWiZQ0ThY6RacEc9f5/PB4PBAAAwGAzw+XwAAI/HA6PRqKxnNBrhdruHjMfFxcHtdgMA\n3G434uPjAQA6nQ5RUVHw+/13sltERDRGuvFOEBER0XeVSuAVFxfDZDIBAPR6PdLT0wfc6+z7mTHC\n8hYA6be4v3+5b6mvz5aRMdr6A5cbAKy7o/mH9vVutX3GGPbnxpw3zz+W5YH7cifbj3W5oaEB69at\nC9j8A48hUPPzfN3+shbna8uWLUhPTw/o72PgMQTT+XI6ndi5cycAKPVyWGN52jBc28fr9YqIiMfj\nUdo+VqtVrFarsl5/S8fr9Q5q+3zwwQeyevVqZZ2jR4+KSKDbPtUatH3uPGPsOWPJGDln2rQZfTnj\nv02bNmPEYxmL6urqcW0fThla5TAj+HLGmzFiXRvLxjcX/w0bNii9favVKmVlZSIi8s0330haWpp0\ndnbK6dOnZfbs2dLb2ysiIhaLRWpqaqS3t1eWLl0qBw8eFBGR7du3K38IKioqJD8///YOQIM+uRYZ\noXYsRBQc7rj4v/jii/LAAw9IZGSkGI1G2bFjh5w7d04yMzPFbDZLVlaWnD9/Xln/V7/6lcyZM0eS\nkpLE4XAo419++aXMnz9f5syZI3/1V3+ljF+7dk1+9rOfSUJCgixcuFCamppu7wBCqGCG0rEQUXAY\n1yP/YMC2z+1kjPdYxpcxVpPhKXOwZGiVw4zgywlU24fv8CUiCkP8bJ/Bs01YhlY5/PwgovDCz/Yh\nIiJFGBV/JzOCKmNyfy6K1hla5TAj+HIClRFGxZ+IiPqx5z94Nvb8VcggouDBnj8RESnCqPg7mRFU\nGZO7X6p1hlY5zAi+HPb8iYhINez5D56NPX8VMogoeLDnT0REijAq/k5mBFXG5O6Xap2hVQ4zgi+H\nPX8iIlINe/6DZ2PPX4WM6dOj0dFxXpWEadNm4OJFfq0n0Z0asXay+A+ajcV/kmQQ0dgE/Qu+DocD\nycnJMJvN2Lx5cwASnAGYkxnBnjOZe7ITkcOM4MsJ6Z5/T08PfvGLX8DhcKCxsREVFRU4efKkyikN\nKs/HjMmQ09AQGhla5TAj+HIClREUxf/YsWNISEiAyWRCZGQkXnzxRezbt0/llHaV52PGZMhpbw+N\nDK1ymBF8OYHKCIri73a7ER8frywbjUa43e4J3COaDKZPj0ZERMQtb//4j/846jrTp0dP9KEQaS4o\niv+NFwgD7Qwzgipj/Dk3riiSUW5Fo64z3iuTzpw5M67tgymHGcGXE7CMcX0zsEqOHj0qOTk5yvKm\nTZvEZrMNWictLW20/8t544033ni76ZaWljZs3Q2KSz27u7uRlJSEI0eOYNasWbBYLKioqMDcuXMn\neteIiEKSbqJ3AAB0Oh3eeust5OTkoKenByUlJSz8REQBFBSP/ImISFtB8YIvERFpKyjaPoHg8/nQ\n1taGiIgIxMXFwWAwTPQu3REtjkOrcxUqx8LzFXy0OI6TJ09i3759ymXoRqMRy5Ytm7Qt6pBr+9TX\n16O0tBTt7e0wGo0AgLa2Nuj1epSXl+ORRx5RLSuQ/+C0OA6tzlWoHAvP1+0LdMHU6jg2b96MiooK\nvPjii0pOa2srqqqqkJ+fj9dff12VHE3/wATo6s0J8/DDD0tNTc2Q8aNHj8rDDz+sSsbx48dl4cKF\nkpSUJJmZmZKZmSlJSUmycOFCqaurUyVDi+PQIkOrnFDJ0CpHiwybzSZpaWlitVrl/fffl/fff182\nbdokaWlpsmnTJlUytPqdJCQkSFdX15Dxzs5OmTNnjioZWpyvgUKu+CckJIx4n1q/JC3+wWlxHFpk\naJUTKhla5WiVEeiCqdXvJCkpSZqamoaMNzU1SWJioioZWpyvgUKu57906VLk5uaiqKgI8fHxEBG0\ntrbivffew5IlS1TJuHLlChYuXDhkfNGiRbh8+bIqGVochxYZWuWESoZWOVpkTJ06FW63GyaTadC4\nx+PB1KlTVcnQ6neyZcsWPP3000hISFA+iqa1tRUulwtvvfWWKhlanK+BQq7nDwAHDhzA/v37lb5Z\nXFwcli1bhtzcXFXm/+Uvf4lvv/122H9ws2fPVu0fQ6CPQ6sMrXJCJUOrnEBnOBwO/OIXvxixYC5d\nulSVHK1+Jz09PTh27BjcbrfyOt9jjz0GnU6dx9Bana9+IVn8taDVPziiySzQBTPUaHq+VG8kBbF/\n/dd/nehdUIUWx6HVuQqVY+H5Cj5aHUdubq4mOWrjm7xU9vbbb0/0LhAFvWeeeWaid0E177zzTsAz\nAnG+QrLtM5Fvxnj77bfx2muvqTLXyZMn4fF4sHDhQtx7773K+MGDB1Xr/3322WeIjo5GSkoKnE4n\nvvzySyxYsACZmZmqzD+SwsJCvPfeewGb/9NPP8WxY8eQmpqK7OxsVeasqanB3LlzERUVhStXrsBm\ns+H48eOYN28eXn/9dej1elVytm3bhueff37Qd1yorbOzE5WVlYiLi8PTTz+Nf/u3f8Pnn3+OlJQU\nvPrqq/jBD34QsGzgxouYs2bNUmWu7777Dh9++CHa2towZcoUJCUlYdWqVZg+fboq8wcDNc9Xv5Ar\n/lq9GWMkO3bswMsvvzzuebZt24bt27dj7ty5qK+vx9atW7F8+XIAwIIFC1BfXz/ujNdffx3V1dXo\n6enBU089hU8++QTPPPMM/vCHP+DZZ5/Fhg0bxp0BAM8+++yQL5H++OOP8ed//ueIiIjA/v37x51h\nsVhw7NgxADceiW3fvh3PP/88Dh06hJ/85Ceq/N5TUlJw4sQJ6HQ6vPrqq7jnnnvwwgsv4PDhwzhx\n4gQ+/PDDcWcAQFRUFO6++27MmTMHq1atws9+9jPcf//9qszdb9WqVejp6cGVK1eg1+tx6dIl/PSn\nP8Xhw4cBALt27VI1L1C2bt2Kjz76CIsXL8Z//ud/YsGCBdDr9fiP//gPlJeX46mnnlIlp729HTab\nDXv37oXP50NERARmzpyJ5cuXY+PGjar94dfUhDadAkDra2VvZjQaVZln3rx50tHRISI3riV+9NFH\n5c033xQRkfT0dFUy5s6dK9evX5fLly/LvffeK+3t7SIicuXKFUlNTVUlQ+TG/q5atUo+/vhjcTqd\nUl1dLbGxseJ0OsXpdKqW0e/RRx+V77//XkRELl26JPPmzVMlIzk5WfnvBQsWDLpPzTcUpaenS09P\nj/z+97+Xl156Se677z7JycmRnTt3ysWLF1XJmD9/voiIXL9+Xe6//365fv26iIj09vYq943X+fPn\npaysTJKSkkSv18uMGTMkKSlJysrK5Pz586pkzJs3T7q7u0VE5PLly/Lkk0+KiEhzc/OIn2N/J7Ky\nssRms4nX65Xe3l4REfF4PGK1WiUrK0uVjJaWFikpKVHOT3FxscybN0/+8i//Unw+nyoZA4Vcz7//\nWtmbqXmtbGpq6og3n8+nSoaIKK0ek8kEp9OJgwcPYv369YMeQY/HD37wA+h0OuVRZlRUFADgz/7s\nzzBlinr/NL788ks8+uij+NWvfoXp06cjIyMDd911FxYvXozFixerktHT0wO/349z586hp6dHeaR8\nzz33qHalxLx587Bjxw4AQFpaGr744gsAwKlTp1Rvk0yZMgXZ2dnYsWMH3G43SktLcfDgQTz00EOq\nzN/b24vOzk50dHTg6tWruHDhAgDg2rVr6O3tVSUjLy8PM2bMgNPphN/vh9/vR3V1NfR6PfLy8lTJ\niIiIwPXr1wHc2Pf+99n86Ec/UsbVcObMGZSVlSE2Nlb55sEHHngAGzduVO2btoqLi5GWloaoqCgs\nWrQISUlJOHDgACwWC0pLS1XJGET1PycT7ODBgzJnzhzJycmRV155RV555RXJycmR2bNny4EDB1TJ\nmDlzphw/flyampqG3B544AFVMjIyMqS+vn7QWFdXlxQUFEhERIQqGRaLRS5fviwiIj09Pcr4+fPn\nhzyyVUNra6u88MILsmbNGtWeIfV78MEHxWQyiclkkoceekg8Ho+IiFy8eFG1R4Dnz5+XwsJCeeih\nh8RisYhOpxOTySRPPPGENDQ0qJIhcutndpcuXVIlY9OmTfLQQw9JYmKivP322zJ37lwpKSmRefPm\nyebNm1XJMJvNd3Tf7diyZYvMnz9fSkpKJDExUex2u4iI+Hw+eeKJJ1TJEBF5+umnZfPmzXL27Fll\nzOv1is1mk8zMTFUyBv47jY+PH/E+tYRc8RcR6e7uls8//1x++9vfyp49e+To0aPK01o1vPTSS/LJ\nJ58Me9+LL76oSkZLS4t4vd4h4729vfLpp5+qknH16tVhx//0pz/JiRMnVMkYzu9+9zt5/fXXAzb/\nQJcvX5bTp0+rOmd7e7vU19fLF198MezvaLz++Mc/qj7ncJqamuTcuXMiIvLtt99KZWWlqn/EtCiY\nIiJff/21/Pa3v5WTJ0+qNufNzp07Jxs2bFBaWHq9XpKSkmTDhg3KORyvga3DN954Y9B9arXiBgq5\nF3yJKDj4/X7YbDbs379faYcaDAYsW7YMGzduRHR09ATv4e05efIk3G43Fi5ciGnTpinjDodDlY+S\n+Lu/+zv8zd/8zaC5AcDlcuH111/Hnj17xp0xEIs/EWnu3XffxUsvvTTRuzFmWlx9dytqXUU4EIs/\nEWkuPj4era2tE70bYzZ//nzU1NTg3nvvxZkzZ7BixQoUFBRg3bp1mhT/QJwvfsAGEQVEamrqiPep\ndVWcVuSmq+/+67/+CytWrEBzc7NqV99pfb5Y/IkoIL7//ns4HA7MmDFjyH2PP/74BOzRnZs5cyYa\nGhqQnp4OALj33nvx0UcfoaSkBCdOnFAlQ+vzxeJPRAHxzDPP4NKlS1iwYMGQ+9R6f4dW3nvvPURG\nRg4ai4yMxK5du/Dzn/9clQytzxd7/kREYSjk3uFLRESjY/EnIgpDLP5ERGGIxZ+IKAyx+BMRhaH/\nB2+07pE4sVYVAAAAAElFTkSuQmCC\n", "text": [ "" ] } ], "prompt_number": 25 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##pd.crosstab(rows, cols)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.crosstab(df.NumberOfTimes90DaysLate, df.SeriousDlqin2yrs)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SeriousDlqin2yrs01
NumberOfTimes90DaysLate
0 135108 6554
1 3478 1765
2 779 776
3 282 385
4 96 195
5 48 83
6 32 48
7 7 31
8 6 15
9 5 14
10 3 5
11 2 3
12 1 1
13 2 2
14 1 1
15 2 0
17 0 1
96 1 4
98 121 143
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 26, "text": [ "SeriousDlqin2yrs 0 1\n", "NumberOfTimes90DaysLate \n", "0 135108 6554\n", "1 3478 1765\n", "2 779 776\n", "3 282 385\n", "4 96 195\n", "5 48 83\n", "6 32 48\n", "7 7 31\n", "8 6 15\n", "9 5 14\n", "10 3 5\n", "11 2 3\n", "12 1 1\n", "13 2 2\n", "14 1 1\n", "15 2 0\n", "17 0 1\n", "96 1 4\n", "98 121 143" ] } ], "prompt_number": 26 }, { "cell_type": "markdown", "metadata": {}, "source": [ "####Use `pd.crosstab` to make a table that contains customer's ages in the lefthand column and the number of dependents they have in the right" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.crosstab(df.age, df.NumberOfDependents)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n",
        "<class 'pandas.core.frame.DataFrame'>\n",
        "Int64Index: 84 entries, 0 to 107\n",
        "Data columns (total 13 columns):\n",
        "0.0     84  non-null values\n",
        "1.0     84  non-null values\n",
        "2.0     84  non-null values\n",
        "3.0     84  non-null values\n",
        "4.0     84  non-null values\n",
        "5.0     84  non-null values\n",
        "6.0     84  non-null values\n",
        "7.0     84  non-null values\n",
        "8.0     84  non-null values\n",
        "9.0     84  non-null values\n",
        "10.0    84  non-null values\n",
        "13.0    84  non-null values\n",
        "20.0    84  non-null values\n",
        "dtypes: int64(13)\n",
        "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 30, "text": [ "\n", "Int64Index: 84 entries, 0 to 107\n", "Data columns (total 13 columns):\n", "0.0 84 non-null values\n", "1.0 84 non-null values\n", "2.0 84 non-null values\n", "3.0 84 non-null values\n", "4.0 84 non-null values\n", "5.0 84 non-null values\n", "6.0 84 non-null values\n", "7.0 84 non-null values\n", "8.0 84 non-null values\n", "9.0 84 non-null values\n", "10.0 84 non-null values\n", "13.0 84 non-null values\n", "20.0 84 non-null values\n", "dtypes: int64(13)" ] } ], "prompt_number": 30 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Basic Cleanliness" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's fix for formatting of the column names. I personally like snake_case (and so does Python). I found [this handy function](http://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-camel-case) on stackoverflow for converting camelCase to snake_case.\n", "\n", "Now we can apply the camel_to_snake function on each column name." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%load https://gist.github.com/glamp/6529725/raw/e38ffd2fc4cb840be21098486ffe5df991946736/camel_to_snake.py" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "def camel_to_snake(column_name):\n", " \"\"\"\n", " converts a string that is camelCase into snake_case\n", " Example:\n", " print camel_to_snake(\"javaLovesCamelCase\")\n", " > java_loves_camel_case\n", " See Also:\n", " http://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-camel-case\n", " \"\"\"\n", " s1 = re.sub('(.)([A-Z][a-z]+)', r'\\1_\\2', column_name)\n", " return re.sub('([a-z0-9])([A-Z])', r'\\1_\\2', s1).lower()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 31 }, { "cell_type": "code", "collapsed": false, "input": [ "camel_to_snake(\"javaLovesCamelCase\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 32, "text": [ "'java_loves_camel_case'" ] } ], "prompt_number": 32 }, { "cell_type": "code", "collapsed": false, "input": [ "df.columns = [camel_to_snake(col) for col in df.columns]\n", "df.columns.tolist()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 33, "text": [ "['serious_dlqin2yrs',\n", " 'revolving_utilization_of_unsecured_lines',\n", " 'age',\n", " 'number_of_time30-59_days_past_due_not_worse',\n", " 'debt_ratio',\n", " 'monthly_income',\n", " 'number_of_open_credit_lines_and_loans',\n", " 'number_of_times90_days_late',\n", " 'number_real_estate_loans_or_lines',\n", " 'number_of_time60-89_days_past_due_not_worse',\n", " 'number_of_dependents']" ] } ], "prompt_number": 33 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Slicing and Indexing Data\n", "pandas (like R) uses a system of boolean indexing. What this means is that when selecting particular rows or columns in your dataset..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Grabbing columns" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df['monthly_income'].head()\n", "df.monthly_income.head()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 34, "text": [ "0 9120\n", "1 2600\n", "2 3042\n", "3 3300\n", "4 63588\n", "Name: monthly_income, dtype: float64" ] } ], "prompt_number": 34 }, { "cell_type": "code", "collapsed": false, "input": [ "df[['monthly_income', 'serious_dlqin2yrs']].head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
monthly_incomeserious_dlqin2yrs
0 9120 1
1 2600 0
2 3042 0
3 3300 0
4 63588 0
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 35, "text": [ " monthly_income serious_dlqin2yrs\n", "0 9120 1\n", "1 2600 0\n", "2 3042 0\n", "3 3300 0\n", "4 63588 0" ] } ], "prompt_number": 35 }, { "cell_type": "code", "collapsed": false, "input": [ "columns_i_want = ['monthly_income', 'serious_dlqin2yrs']\n", "df[columns_i_want].head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
monthly_incomeserious_dlqin2yrs
0 9120 1
1 2600 0
2 3042 0
3 3300 0
4 63588 0
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 36, "text": [ " monthly_income serious_dlqin2yrs\n", "0 9120 1\n", "1 2600 0\n", "2 3042 0\n", "3 3300 0\n", "4 63588 0" ] } ], "prompt_number": 36 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Adding Columns" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df.newcolumn = 1\n", "# this will throw an error\n", "df['newcolumn']" ], "language": "python", "metadata": {}, "outputs": [ { "ename": "KeyError", "evalue": "u'no item named newcolumn'", "output_type": "pyerr", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnewcolumn\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;31m# this will throw an error\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'newcolumn'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/usr/local/lib/python2.7/site-packages/pandas/core/frame.pyc\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 2001\u001b[0m \u001b[0;31m# get column\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2002\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mis_unique\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2003\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_get_item_cache\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2004\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2005\u001b[0m \u001b[0;31m# duplicate columns\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python2.7/site-packages/pandas/core/generic.pyc\u001b[0m in \u001b[0;36m_get_item_cache\u001b[0;34m(self, item)\u001b[0m\n\u001b[1;32m 665\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mcache\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 666\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mException\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 667\u001b[0;31m \u001b[0mvalues\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_data\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 668\u001b[0m \u001b[0mres\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_box_item_values\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalues\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 669\u001b[0m \u001b[0mcache\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mres\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python2.7/site-packages/pandas/core/internals.pyc\u001b[0m in \u001b[0;36mget\u001b[0;34m(self, item)\u001b[0m\n\u001b[1;32m 1653\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1654\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mis_unique\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1655\u001b[0;31m \u001b[0m_\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mblock\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_find_block\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1656\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mblock\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1657\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python2.7/site-packages/pandas/core/internals.pyc\u001b[0m in \u001b[0;36m_find_block\u001b[0;34m(self, item)\u001b[0m\n\u001b[1;32m 1933\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1934\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_find_block\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1935\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_check_have\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1936\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mi\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mblock\u001b[0m \u001b[0;32min\u001b[0m \u001b[0menumerate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mblocks\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1937\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mitem\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mblock\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python2.7/site-packages/pandas/core/internals.pyc\u001b[0m in \u001b[0;36m_check_have\u001b[0;34m(self, item)\u001b[0m\n\u001b[1;32m 1940\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_check_have\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1941\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mitem\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1942\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'no item named %s'\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mcom\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpprint_thing\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1943\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1944\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mreindex_axis\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnew_axis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmethod\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mKeyError\u001b[0m: u'no item named newcolumn'" ] } ], "prompt_number": 37 }, { "cell_type": "code", "collapsed": false, "input": [ "df['one'] = 1\n", "df.one.head()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 38, "text": [ "0 1\n", "1 1\n", "2 1\n", "3 1\n", "4 1\n", "Name: one, dtype: int64" ] } ], "prompt_number": 38 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Removing a column" ] }, { "cell_type": "code", "collapsed": false, "input": [ "del df['one']" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 39 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Boolean Indexing" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df.monthly_income > 5000" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 40, "text": [ "0 True\n", "1 False\n", "2 False\n", "3 False\n", "4 True\n", "5 False\n", "6 False\n", "7 False\n", "8 False\n", "9 True\n", "10 False\n", "11 True\n", "12 True\n", "13 True\n", "14 False\n", "...\n", "149985 False\n", "149986 False\n", "149987 False\n", "149988 True\n", "149989 True\n", "149990 False\n", "149991 True\n", "149992 False\n", "149993 False\n", "149994 False\n", "149995 False\n", "149996 True\n", "149997 False\n", "149998 True\n", "149999 True\n", "Name: monthly_income, Length: 150000, dtype: bool" ] } ], "prompt_number": 40 }, { "cell_type": "code", "collapsed": false, "input": [ "gt_5k = df[df.monthly_income > 5000]\n", "print len(gt_5k),\"people with monthly_income > 5000\"\n", "df[df.monthly_income > 5000].head()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "64410 people with monthly_income > 5000\n" ] }, { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
serious_dlqin2yrsrevolving_utilization_of_unsecured_linesagenumber_of_time30-59_days_past_due_not_worsedebt_ratiomonthly_incomenumber_of_open_credit_lines_and_loansnumber_of_times90_days_latenumber_real_estate_loans_or_linesnumber_of_time60-89_days_past_due_not_worsenumber_of_dependents
0 1 0.766127 45 2 0.802982 9120 13 0 6 0 2
4 0 0.907239 49 1 0.024926 63588 7 0 1 0 0
9 0 0.189169 57 0 0.606291 23684 9 0 4 0 2
11 0 0.018798 51 0 0.531529 6501 7 0 2 0 2
12 0 0.010352 46 0 0.298354 12454 13 0 2 0 2
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 41, "text": [ " serious_dlqin2yrs revolving_utilization_of_unsecured_lines age \\\n", "0 1 0.766127 45 \n", "4 0 0.907239 49 \n", "9 0 0.189169 57 \n", "11 0 0.018798 51 \n", "12 0 0.010352 46 \n", "\n", " number_of_time30-59_days_past_due_not_worse debt_ratio monthly_income \\\n", "0 2 0.802982 9120 \n", "4 1 0.024926 63588 \n", "9 0 0.606291 23684 \n", "11 0 0.531529 6501 \n", "12 0 0.298354 12454 \n", "\n", " number_of_open_credit_lines_and_loans number_of_times90_days_late \\\n", "0 13 0 \n", "4 7 0 \n", "9 9 0 \n", "11 7 0 \n", "12 13 0 \n", "\n", " number_real_estate_loans_or_lines \\\n", "0 6 \n", "4 1 \n", "9 4 \n", "11 2 \n", "12 2 \n", "\n", " number_of_time60-89_days_past_due_not_worse number_of_dependents \n", "0 0 2 \n", "4 0 0 \n", "9 0 2 \n", "11 0 2 \n", "12 0 2 " ] } ], "prompt_number": 41 }, { "cell_type": "code", "collapsed": false, "input": [ "df.ix[40:45, 0:2]" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
serious_dlqin2yrsrevolving_utilization_of_unsecured_lines
40 0 0.719985
41 0 0.072898
42 0 0.469057
43 0 0.718705
44 0 0.368876
45 0 0.000603
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 42, "text": [ " serious_dlqin2yrs revolving_utilization_of_unsecured_lines\n", "40 0 0.719985\n", "41 0 0.072898\n", "42 0 0.469057\n", "43 0 0.718705\n", "44 0 0.368876\n", "45 0 0.000603" ] } ], "prompt_number": 42 }, { "cell_type": "code", "collapsed": false, "input": [ "mask = (df.monthly_income > 5000) & (df.serious_dlqin2yrs==1)\n", "df[mask].head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
serious_dlqin2yrsrevolving_utilization_of_unsecured_linesagenumber_of_time30-59_days_past_due_not_worsedebt_ratiomonthly_incomenumber_of_open_credit_lines_and_loansnumber_of_times90_days_latenumber_real_estate_loans_or_linesnumber_of_time60-89_days_past_due_not_worsenumber_of_dependents
0 1 0.766127 45 2 0.802982 9120 13 0 6 0 2
13 1 0.964673 40 3 0.382965 13700 9 3 1 1 2
74 1 0.133063 49 1 0.182882 10257 9 0 2 0 3
162 1 1.046279 47 1 1.104301 5416 6 0 2 0 1
184 1 0.085338 51 0 0.477504 6200 11 0 1 0 0
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 43, "text": [ " serious_dlqin2yrs revolving_utilization_of_unsecured_lines age \\\n", "0 1 0.766127 45 \n", "13 1 0.964673 40 \n", "74 1 0.133063 49 \n", "162 1 1.046279 47 \n", "184 1 0.085338 51 \n", "\n", " number_of_time30-59_days_past_due_not_worse debt_ratio monthly_income \\\n", "0 2 0.802982 9120 \n", "13 3 0.382965 13700 \n", "74 1 0.182882 10257 \n", "162 1 1.104301 5416 \n", "184 0 0.477504 6200 \n", "\n", " number_of_open_credit_lines_and_loans number_of_times90_days_late \\\n", "0 13 0 \n", "13 9 3 \n", "74 9 0 \n", "162 6 0 \n", "184 11 0 \n", "\n", " number_real_estate_loans_or_lines \\\n", "0 6 \n", "13 1 \n", "74 2 \n", "162 2 \n", "184 1 \n", "\n", " number_of_time60-89_days_past_due_not_worse number_of_dependents \n", "0 0 2 \n", "13 1 2 \n", "74 0 3 \n", "162 0 1 \n", "184 0 0 " ] } ], "prompt_number": 43 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Write some code that filters to the following:\n", "\n", "- people 35 or older\n", "- who have not been delinquent in the past 2 years\n", "- who have less than 10 open credit lines/loans" ] }, { "cell_type": "code", "collapsed": false, "input": [ "mask = (df.age >= 35) & (df.serious_dlqin2yrs==0) & (df.number_of_open_credit_lines_and_loans < 10)\n", "len(df[mask])==76151" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 46, "text": [ "True" ] } ], "prompt_number": 46 }, { "cell_type": "markdown", "metadata": {}, "source": [ "####Write something that filters:\n", "\n", "- people who have been delinquent in the past 2 years\n", "- have are in the 90th percentile for monthly_income\n", "\n", "*HINT: use `quantile`*" ] }, { "cell_type": "code", "collapsed": false, "input": [ "mask = (df.monthly_income == df.monthly_income.quantile(0.90)) & (df.serious_dlqin2yrs==0)\n", "len(df[mask])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 51, "text": [ "147" ] } ], "prompt_number": 51 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Handling Missing Values\n", "One of the most frustrating parts of data science can be handling null or missing data. pandas has a lot of built in features for making is super easy to handle missing data. The first thing we need to do is determine which fields have missing data. To do that we're going to use `pd.melt`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###[Long vs. Wide Data](http://en.wikipedia.org/wiki/Wide_and_narrow_data)\n", "Depending on the problem you're solving, you may need to rotate between having your data in wide/long format.\n", "\n", "Wide data is probably what you think of when the work \"spreadsheet\" comes to mind. We're talking about data in which each row represents 1 datapoint and each value is in a particular column. This is well suited for things like modeling and producing summary statistics.\n", "\n", "I often find that having data in `long` format is often best for doing the same task against multiple variables. Things like plotting distributions of each variable, making frequency tables, or, in our case, determining what portion of a dataframe's variables are null." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###pd.melt()\n", "For converting data from `wide` to `long` format.\n", "```\n", ">>> df\n", "A B C\n", "a 1 2\n", "b 3 4\n", "c 5 6\n", "\n", ">>> pd.melt(df, id_vars=['A'], value_vars=['B'])\n", "A variable value\n", "a B 1\n", "b B 3\n", "c B 5\n", "```" ] }, { "cell_type": "code", "collapsed": false, "input": [ "?pd.melt" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 64 }, { "cell_type": "code", "collapsed": false, "input": [ "# By not specifying id_vars, we're going to melt EVERYTHING\n", "df_lng = pd.melt(df)\n", "# now our data is a series of (key, value) rows. \n", "#think of when you've done this in Excel so that you can\n", "#create a pivot table \n", "df_lng.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
variablevalue
0 serious_dlqin2yrs 1
1 serious_dlqin2yrs 0
2 serious_dlqin2yrs 0
3 serious_dlqin2yrs 0
4 serious_dlqin2yrs 0
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 45, "text": [ " variable value\n", "0 serious_dlqin2yrs 1\n", "1 serious_dlqin2yrs 0\n", "2 serious_dlqin2yrs 0\n", "3 serious_dlqin2yrs 0\n", "4 serious_dlqin2yrs 0" ] } ], "prompt_number": 45 }, { "cell_type": "code", "collapsed": false, "input": [ "null_variables = df_lng.value.isnull()\n", "null_variables.sum()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 46, "text": [ "29731" ] } ], "prompt_number": 46 }, { "cell_type": "code", "collapsed": false, "input": [ "# crosstab creates a frequency table between 2 variables\n", "# it's going to automatically enumerate the possibilities between\n", "# the two Series and show you a count of occurrences \n", "#in each possible bucket\n", "pd.crosstab(df_lng.variable, null_variables)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
valueFalseTrue
variable
age 150000 0
debt_ratio 150000 0
monthly_income 120269 29731
number_of_dependents 150000 0
number_of_open_credit_lines_and_loans 150000 0
number_of_time30-59_days_past_due_not_worse 150000 0
number_of_time60-89_days_past_due_not_worse 150000 0
number_of_times90_days_late 150000 0
number_real_estate_loans_or_lines 150000 0
revolving_utilization_of_unsecured_lines 150000 0
serious_dlqin2yrs 150000 0
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 47, "text": [ "value False True \n", "variable \n", "age 150000 0\n", "debt_ratio 150000 0\n", "monthly_income 120269 29731\n", "number_of_dependents 150000 0\n", "number_of_open_credit_lines_and_loans 150000 0\n", "number_of_time30-59_days_past_due_not_worse 150000 0\n", "number_of_time60-89_days_past_due_not_worse 150000 0\n", "number_of_times90_days_late 150000 0\n", "number_real_estate_loans_or_lines 150000 0\n", "revolving_utilization_of_unsecured_lines 150000 0\n", "serious_dlqin2yrs 150000 0" ] } ], "prompt_number": 47 }, { "cell_type": "code", "collapsed": false, "input": [ "# let's abstract that code into a function so we can easily \n", "# recalculate it\n", "def print_null_freq(df):\n", " \"\"\"\n", " for a given DataFrame, calculates how many values for \n", " each variable is null and prints the resulting table to stdout\n", " \"\"\"\n", " df_lng = pd.melt(df)\n", " null_variables = df_lng.value.isnull()\n", " return pd.crosstab(df_lng.variable, null_variables)\n", "print_null_freq(df)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
valueFalseTrue
variable
age 150000 0
debt_ratio 150000 0
monthly_income 120269 29731
number_of_dependents 150000 0
number_of_open_credit_lines_and_loans 150000 0
number_of_time30-59_days_past_due_not_worse 150000 0
number_of_time60-89_days_past_due_not_worse 150000 0
number_of_times90_days_late 150000 0
number_real_estate_loans_or_lines 150000 0
revolving_utilization_of_unsecured_lines 150000 0
serious_dlqin2yrs 150000 0
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 48, "text": [ "value False True \n", "variable \n", "age 150000 0\n", "debt_ratio 150000 0\n", "monthly_income 120269 29731\n", "number_of_dependents 150000 0\n", "number_of_open_credit_lines_and_loans 150000 0\n", "number_of_time30-59_days_past_due_not_worse 150000 0\n", "number_of_time60-89_days_past_due_not_worse 150000 0\n", "number_of_times90_days_late 150000 0\n", "number_real_estate_loans_or_lines 150000 0\n", "revolving_utilization_of_unsecured_lines 150000 0\n", "serious_dlqin2yrs 150000 0" ] } ], "prompt_number": 48 }, { "cell_type": "markdown", "metadata": {}, "source": [ "####Use pd.melt to create a data frame in the following format:\n", "```\n", " serious_dlqin2yrs variable\t value\n", "0\t 1\t age\t 45\n", "1\t 0\t age\t 40\n", "2\t 0\t age\t 38\n", "3\t 0\t age\t 30\n", "4\t 0\t age\t 49\n", "...\t ...\t ...\t ...\n", "299999 1\t debt_ratio 0.423\n", "300000 0\t debt_ratio 0.8923\n", "```\n", "Only include values for `age` and `debt_ratio`" ] }, { "cell_type": "code", "collapsed": false, "input": [ "melted = pd.melt(..., id_vars=[...], value_vars=[...])\n", "\n", "print len(melted)==300000\n", "print melted.variable.unique()==np.array(['age', 'debt_ratio'])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "True\n", "[ True True]\n" ] } ], "prompt_number": 50 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Filling NA's" ] }, { "cell_type": "code", "collapsed": false, "input": [ "s = pd.Series([1, 2, None, 4])\n", "s" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 49, "text": [ "0 1\n", "1 2\n", "2 NaN\n", "3 4" ] } ], "prompt_number": 49 }, { "cell_type": "code", "collapsed": false, "input": [ "s.fillna(3)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 50, "text": [ "0 1\n", "1 2\n", "2 3\n", "3 4" ] } ], "prompt_number": 50 }, { "cell_type": "code", "collapsed": false, "input": [ "s.ffill()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 51, "text": [ "0 1\n", "1 2\n", "2 2\n", "3 4" ] } ], "prompt_number": 51 }, { "cell_type": "code", "collapsed": false, "input": [ "s.bfill()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 52, "text": [ "0 1\n", "1 2\n", "2 4\n", "3 4" ] } ], "prompt_number": 52 }, { "cell_type": "code", "collapsed": false, "input": [ "s.fillna(s.mean())" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 53, "text": [ "0 1.000000\n", "1 2.000000\n", "2 2.333333\n", "3 4.000000" ] } ], "prompt_number": 53 }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you look at `df` you can see that there are 2 columns which don't have a full 150,000 values: `monthly_income` and `number_of_dependents`. In order to incorporate these variables into our analysis, we need to specify how to treat these missing values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For number_of_dependents let's keep things simple and intuitive. if someone didn't specify how many dependents they had then let's assume it's becasue they don't have any to begin with." ] }, { "cell_type": "code", "collapsed": false, "input": [ "df.number_of_dependents = df.number_of_dependents.fillna(0)\n", "# proof that the number_of_dependents no longer contains nulls\n", "print_null_freq(df)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
valueFalseTrue
variable
age 150000 0
debt_ratio 150000 0
monthly_income 120269 29731
number_of_dependents 150000 0
number_of_open_credit_lines_and_loans 150000 0
number_of_time30-59_days_past_due_not_worse 150000 0
number_of_time60-89_days_past_due_not_worse 150000 0
number_of_times90_days_late 150000 0
number_real_estate_loans_or_lines 150000 0
revolving_utilization_of_unsecured_lines 150000 0
serious_dlqin2yrs 150000 0
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 54, "text": [ "value False True \n", "variable \n", "age 150000 0\n", "debt_ratio 150000 0\n", "monthly_income 120269 29731\n", "number_of_dependents 150000 0\n", "number_of_open_credit_lines_and_loans 150000 0\n", "number_of_time30-59_days_past_due_not_worse 150000 0\n", "number_of_time60-89_days_past_due_not_worse 150000 0\n", "number_of_times90_days_late 150000 0\n", "number_real_estate_loans_or_lines 150000 0\n", "revolving_utilization_of_unsecured_lines 150000 0\n", "serious_dlqin2yrs 150000 0" ] } ], "prompt_number": 54 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Taking a look at `monthly_income` we see that it's a bit more complicated than `number_of_dependents`. We have a few options for replacing missing data. We could do something like set it to the mean or median or the dataset but this might skew our distribution. We could also set it to 0 but this might not be right either. Instead we're going to use a technique called imputation. We'll go into this more after we take a look at `scikit-learn`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "df.monthly_income.describe()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 55, "text": [ "count 120269.000000\n", "mean 6670.221237\n", "std 14384.674215\n", "min 0.000000\n", "25% 3400.000000\n", "50% 5400.000000\n", "75% 8249.000000\n", "max 3008750.000000" ] } ], "prompt_number": 55 }, { "cell_type": "code", "collapsed": false, "input": [ "df.to_csv(\"./data/credit-data-post-import.csv\", index=False)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 56 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##We just did the following\n", "\n", "- Used `read_csv` to load data Python\n", "- Used `head`, `tail`, `unique`, and other functions to inspect our data\n", "- Sliced data and selected data frames\n", "- Cleaned our data and learned the basics of handling missing values" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }