{
"metadata": {
"name": "",
"signature": "sha256:85afef5cf90a62abad47b2010737fca1c44e468a30d78f409c1270a5d8ca0ac6"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example of cleaning up data in Pandas\n",
"This example is actually the exercises provided by the following link:\n",
"A2. Linear Regression - Data Exploration - Lending Club, which is part of the site for Learn Data Science.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### First import ipython settings"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 43
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Import the data to be cleaned and inspect them"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df = pd.read_csv('https://cdn.rawgit.com/benedict-chan/pytest/master/datasets/loansData.csv')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 44
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" Amount.Requested | \n",
" Amount.Funded.By.Investors | \n",
" Interest.Rate | \n",
" Loan.Length | \n",
" Loan.Purpose | \n",
" Debt.To.Income.Ratio | \n",
" State | \n",
" Home.Ownership | \n",
" Monthly.Income | \n",
" FICO.Range | \n",
" Open.CREDIT.Lines | \n",
" Revolving.CREDIT.Balance | \n",
" Inquiries.in.the.Last.6.Months | \n",
" Employment.Length | \n",
"
\n",
" \n",
" \n",
" \n",
" 81174 | \n",
" 20000 | \n",
" 20000 | \n",
" 8.90% | \n",
" 36 months | \n",
" debt_consolidation | \n",
" 14.90% | \n",
" SC | \n",
" MORTGAGE | \n",
" 6541.67 | \n",
" 735-739 | \n",
" 14 | \n",
" 14272 | \n",
" 2 | \n",
" < 1 year | \n",
"
\n",
" \n",
" 99592 | \n",
" 19200 | \n",
" 19200 | \n",
" 12.12% | \n",
" 36 months | \n",
" debt_consolidation | \n",
" 28.36% | \n",
" TX | \n",
" MORTGAGE | \n",
" 4583.33 | \n",
" 715-719 | \n",
" 12 | \n",
" 11140 | \n",
" 1 | \n",
" 2 years | \n",
"
\n",
" \n",
" 80059 | \n",
" 35000 | \n",
" 35000 | \n",
" 21.98% | \n",
" 60 months | \n",
" debt_consolidation | \n",
" 23.81% | \n",
" CA | \n",
" MORTGAGE | \n",
" 11500.00 | \n",
" 690-694 | \n",
" 14 | \n",
" 21977 | \n",
" 1 | \n",
" 2 years | \n",
"
\n",
" \n",
" 15825 | \n",
" 10000 | \n",
" 9975 | \n",
" 9.99% | \n",
" 36 months | \n",
" debt_consolidation | \n",
" 14.30% | \n",
" KS | \n",
" MORTGAGE | \n",
" 3833.33 | \n",
" 695-699 | \n",
" 10 | \n",
" 9346 | \n",
" 0 | \n",
" 5 years | \n",
"
\n",
" \n",
" 33182 | \n",
" 12000 | \n",
" 12000 | \n",
" 11.71% | \n",
" 36 months | \n",
" credit_card | \n",
" 18.78% | \n",
" NJ | \n",
" RENT | \n",
" 3195.00 | \n",
" 695-699 | \n",
" 11 | \n",
" 14469 | \n",
" 0 | \n",
" 9 years | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 45,
"text": [
" Amount.Requested Amount.Funded.By.Investors Interest.Rate Loan.Length \\\n",
"81174 20000 20000 8.90% 36 months \n",
"99592 19200 19200 12.12% 36 months \n",
"80059 35000 35000 21.98% 60 months \n",
"15825 10000 9975 9.99% 36 months \n",
"33182 12000 12000 11.71% 36 months \n",
"\n",
" Loan.Purpose Debt.To.Income.Ratio State Home.Ownership \\\n",
"81174 debt_consolidation 14.90% SC MORTGAGE \n",
"99592 debt_consolidation 28.36% TX MORTGAGE \n",
"80059 debt_consolidation 23.81% CA MORTGAGE \n",
"15825 debt_consolidation 14.30% KS MORTGAGE \n",
"33182 credit_card 18.78% NJ RENT \n",
"\n",
" Monthly.Income FICO.Range Open.CREDIT.Lines Revolving.CREDIT.Balance \\\n",
"81174 6541.67 735-739 14 14272 \n",
"99592 4583.33 715-719 12 11140 \n",
"80059 11500.00 690-694 14 21977 \n",
"15825 3833.33 695-699 10 9346 \n",
"33182 3195.00 695-699 11 14469 \n",
"\n",
" Inquiries.in.the.Last.6.Months Employment.Length \n",
"81174 2 < 1 year \n",
"99592 1 2 years \n",
"80059 1 2 years \n",
"15825 0 5 years \n",
"33182 0 9 years "
]
}
],
"prompt_number": 45
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.dtypes"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 46,
"text": [
"Amount.Requested int64\n",
"Amount.Funded.By.Investors float64\n",
"Interest.Rate object\n",
"Loan.Length object\n",
"Loan.Purpose object\n",
"Debt.To.Income.Ratio object\n",
"State object\n",
"Home.Ownership object\n",
"Monthly.Income float64\n",
"FICO.Range object\n",
"Open.CREDIT.Lines float64\n",
"Revolving.CREDIT.Balance float64\n",
"Inquiries.in.the.Last.6.Months float64\n",
"Employment.Length object\n",
"dtype: object"
]
}
],
"prompt_number": 46
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We will find out that several columns are not regonized as numeric.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fix the Interest.Rate and Debt.To.Income.Ratio\n",
"First check whether any of the rate doesn't contains %"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"len(df[ ( df['Interest.Rate'].str.contains('%') == False) ] ) , len(df[ ( df['Debt.To.Income.Ratio'].str.contains('%') == False) ] )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 47,
"text": [
"(0, 0)"
]
}
],
"prompt_number": 47
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"remove_percent_converter = lambda x: float(x.replace('%', ''))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 48
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fix Loan.Length\n",
"First check whether any of the loan doesn't contains \"months\"."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"len(df[df['Loan.Length'].str.contains('months') == False])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 49,
"text": [
"0"
]
}
],
"prompt_number": 49
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"remove_month_converter = lambda x: float(x.replace('months', ''))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 50
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Apply the converters"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"clean_df = pd.read_csv('https://cdn.rawgit.com/benedict-chan/pytest/master/datasets/loansData.csv'\n",
"\t, converters={'Interest.Rate': remove_percent_converter\n",
"\t,'Debt.To.Income.Ratio': remove_percent_converter\n",
"\t, 'Loan.Length': remove_month_converter})"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 51
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fix FICO.Score\n",
"For FICO, we are trying to get the lower part of the Range as our preferred value.\n",
"We will do it by:\n",
"1. Creating a function to get the lower part of the Range\n",
"2. Create a new column to store this value"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"fico_function = lambda x: float(x.split('-')[0])\n",
"clean_df['FICO.Score'] = clean_df['FICO.Range'].map(fico_function)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 52
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finalize the result"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Rename column"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"clean_df = clean_df.rename(columns={'Amount.Requested':'Loan.Amount'})"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 53
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create a new index column"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"clean_df['NewIndex'] = pd.Series(range(1, clean_df['Amount.Funded.By.Investors'].count()+1), index=clean_df.index)\n",
"clean_df = clean_df.set_index('NewIndex')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 54
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Filter only the columns we need"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"final_columns = [\"Interest.Rate\",\"FICO.Score\",\"Loan.Length\",\"Monthly.Income\",\"Loan.Amount\"]\n",
"final_df = clean_df[final_columns]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 55
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Remove extremes (strange monthly income)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"final_df = final_df[final_df['Monthly.Income'] < 100000]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 56
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Comparing to the expected results\n",
"Getting the expected results and format it"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"checking_df = pd.read_csv('https://cdn.rawgit.com/benedict-chan/pytest/master/datasets/loanf.csv')\n",
"checking_df = checking_df.sort_index()\n",
"checking_df.index.name = 'NewIndex'"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 57
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Some magic\n",
"To see the difference of our cleaned data to the expected one."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"ne_stacked = pd.concat([final_df, checking_df])\n",
"ne_stacked = ne_stacked.reset_index(drop=True)\n",
"df_gpby = ne_stacked.groupby(list(ne_stacked.columns))\n",
"\n",
"idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]\n",
"\n",
"ne_stacked.reindex(idx)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Interest.Rate | \n",
" FICO.Score | \n",
" Loan.Length | \n",
" Monthly.Income | \n",
" Loan.Amount | \n",
"
\n",
" \n",
" \n",
" \n",
" 4349 | \n",
" 7.62 | \n",
" 805 | \n",
" 36 | \n",
" 102750 | \n",
" 33000 | \n",
"
\n",
" \n",
" 2864 | \n",
" 7.43 | \n",
" 800 | \n",
" 36 | \n",
" NaN | \n",
" 5000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 62,
"text": [
" Interest.Rate FICO.Score Loan.Length Monthly.Income Loan.Amount\n",
"4349 7.62 805 36 102750 33000\n",
"2864 7.43 800 36 NaN 5000"
]
}
],
"prompt_number": 62
}
],
"metadata": {}
}
]
}