{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Linear Regression - Data Cleanup\n", "================================" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we get into the detailed steps involved in data cleanup of the Lending Club dataset. We glossed over this in the Data Exploration lesson and left it as an exercise. This is where we actually do the cleanup step by step." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "1. Remove '%'\n", " * strip the last letter\n", " * map to float\n", "\n", "\n", "2. Remove ' months' suffix\n", " * strip the tailing letters\n", " * map to int\n", "\n", "3. Remove bad data\n", " * n/a values\n", " * outlier income\n", " * histograms\n", " * describe\n", "\n", "4. Map FICO.Range to single int\n", " * split on '-'\n", " * use bottom value\n", " * convert to int\n", " * show how to do mid-point\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Amount.RequestedAmount.Funded.By.InvestorsInterest.RateLoan.LengthLoan.PurposeDebt.To.Income.RatioStateHome.OwnershipMonthly.IncomeFICO.RangeOpen.CREDIT.LinesRevolving.CREDIT.BalanceInquiries.in.the.Last.6.MonthsEmployment.Length
81174 20000 20000.00 8.90% 36 months debt_consolidation 14.90% SC MORTGAGE 6541.67 735-739 14 14272 2 < 1 year
99592 19200 19200.00 12.12% 36 months debt_consolidation 28.36% TX MORTGAGE 4583.33 715-719 12 11140 1 2 years
80059 35000 35000.00 21.98% 60 months debt_consolidation 23.81% CA MORTGAGE 11500.00 690-694 14 21977 1 2 years
15825 10000 9975.00 9.99% 36 months debt_consolidation 14.30% KS MORTGAGE 3833.33 695-699 10 9346 0 5 years
33182 12000 12000.00 11.71% 36 months credit_card 18.78% NJ RENT 3195.00 695-699 11 14469 0 9 years
62403 6000 6000.00 15.31% 36 months other 20.05% CT OWN 4891.67 670-674 17 10391 2 3 years
48808 10000 10000.00 7.90% 36 months debt_consolidation 26.09% MA RENT 2916.67 720-724 10 15957 0 10+ years
22090 33500 33450.00 17.14% 60 months credit_card 14.70% LA MORTGAGE 13863.42 705-709 12 27874 0 10+ years
76404 14675 14675.00 14.33% 36 months credit_card 26.92% CA RENT 3150.00 685-689 9 7246 1 8 years
15867 7000 7000.00 6.91% 36 months credit_card 7.10% CA RENT 5000.00 715-719 8 7612 0 3 years
94971 2000 2000.00 19.72% 36 months moving 10.29% FL RENT 3575.00 670-674 10 12036 0 6 years
36911 10625 10625.00 14.27% 36 months debt_consolidation 12.54% CA MORTGAGE 4250.00 665-669 14 10767 0 < 1 year
41200 28000 27975.00 21.67% 60 months debt_consolidation 13.07% CT MORTGAGE 14166.67 670-674 12 10311 0 1 year
83869 35000 34950.00 8.90% 36 months debt_consolidation 20.46% CT RENT 9166.67 735-739 19 21536 0 1 year
53853 9600 9600.00 7.62% 36 months debt_consolidation 3.45% DC RENT 11250.00 725-729 13 4606 0 < 1 year
21399 25000 24975.00 15.65% 60 months debt_consolidation 21.99% CA RENT 5416.67 730-734 6 13929 0 9 years
62127 10000 10000.00 12.12% 36 months debt_consolidation 17.72% CA RENT 9000.00 695-699 18 20317 0 7 years
23446 14000 13900.25 10.37% 60 months debt_consolidation 11.95% OH RENT 4333.33 740-744 6 7419 0 9 years
44987 10000 10000.00 9.76% 36 months credit_card 7.13% FL RENT 2733.33 730-734 7 6112 2 3 years
17977 5200 5175.00 9.99% 60 months debt_consolidation 10.29% AL MORTGAGE 3750.00 760-764 10 16094 0 < 1 year
86099 22000 21975.00 21.98% 36 months debt_consolidation 11.19% TX MORTGAGE 6666.67 665-669 9 23124 0 10+ years
99483 30000 30000.00 19.05% 60 months credit_card 21.25% FL MORTGAGE 6250.00 695-699 12 34927 0 6 years
28798 6500 6500.00 17.99% 60 months car 19.63% FL RENT 4100.00 665-669 11 11697 1 2 years
24168 17400 17400.00 11.99% 36 months credit_card 12.47% AZ RENT 6833.33 695-699 7 26587 0 7 years
10356 4000 4000.00 16.82% 60 months vacation 13.71% GA MORTGAGE 4500.00 670-674 5 20804 0 3 years
46027 7200 7200.00 7.90% 36 months debt_consolidation 24.82% TX RENT 5416.67 705-709 8 12017 0 7 years
2238 8000 8000.00 14.42% 36 months debt_consolidation 24.63% MA RENT 2964.17 675-679 9 8928 2 6 years
65278 8000 8000.00 15.31% 36 months debt_consolidation 15.46% CA MORTGAGE 2916.67 675-679 13 7152 1 5 years
4227 3000 3000.00 8.59% 36 months other 3.72% MA MORTGAGE 4167.00 765-769 4 7074 0 5 years
50182 14500 14500.00 7.90% 36 months debt_consolidation 4.85% GA MORTGAGE 3958.33 760-764 4 9598 0 4 years
.............................................
84265 20000 20000.00 22.95% 60 months debt_consolidation 7.10% NJ RENT 6750.00 665-669 6 16104 1 7 years
80231 19000 19000.00 7.90% 36 months debt_consolidation 9.76% RI MORTGAGE 5166.67 770-774 18 43617 2 10+ years
49533 17300 17250.00 22.45% 60 months wedding 3.58% PA MORTGAGE 5500.00 685-689 11 2306 3 4 years
102514 7000 711.54 15.13% 36 months major_purchase 18.91% CO MORTGAGE 3833.00 650-654 13 12634 0 2 years
78618 7200 7200.00 18.75% 36 months debt_consolidation 16.21% MI RENT 8333.33 660-664 8 28916 0 10+ years
86953 10000 10000.00 14.09% 36 months major_purchase 9.71% CO RENT 4583.33 675-679 6 3859 0 < 1 year
80129 4000 3925.00 14.09% 36 months credit_card 12.27% FL MORTGAGE 8583.33 675-679 9 36943 1 10+ years
85216 17500 17500.00 8.90% 36 months debt_consolidation 10.94% UT MORTGAGE 25000.00 730-734 9 34545 0 6 years
38247 20000 20000.00 11.71% 36 months credit_card 9.58% SD MORTGAGE 5416.67 725-729 12 18267 1 4 years
91245 16200 16200.00 15.80% 60 months debt_consolidation 7.92% PA MORTGAGE 4833.33 680-684 12 12313 2 10+ years
53041 10000 10000.00 6.03% 36 months small_business 13.03% FL RENT 5000.00 760-764 8 3952 0 2 years
63051 27000 27000.00 6.62% 36 months debt_consolidation 12.21% OH MORTGAGE 9250.00 810-814 12 4211 0 5 years
14446 4500 4475.00 7.51% 36 months small_business 20.27% VA MORTGAGE 7075.50 720-724 15 68618 2 10+ years
68628 15875 15875.00 14.33% 36 months small_business 17.44% MD MORTGAGE 3416.67 675-679 11 15891 0 2 years
98758 15000 15000.00 10.16% 36 months credit_card 28.28% OH MORTGAGE 6666.67 690-694 15 14880 0 10+ years
13070 25000 24950.00 10.75% 36 months debt_consolidation 20.48% OR MORTGAGE 7083.33 765-769 10 25429 0 6 years
45836 7000 7000.00 17.27% 36 months other 18.38% NY OWN 2464.37 665-669 9 7089 0 3 years
52330 15000 15000.00 19.99% 36 months wedding 18.05% CA RENT 8000.00 660-664 6 45976 1 2 years
48243 17000 17000.00 15.81% 36 months debt_consolidation 17.01% CO RENT 3833.33 685-689 6 15484 1 6 years
63256 19075 19075.00 18.75% 36 months debt_consolidation 15.23% NY RENT 5166.67 670-674 17 13749 3 10+ years
42124 10000 10000.00 11.71% 36 months debt_consolidation 8.40% CA RENT 4500.00 710-714 8 8404 1 3 years
78043 8475 8475.00 7.62% 36 months debt_consolidation 15.88% CA RENT 3983.33 720-724 9 6882 0 n/a
925 6400 6350.00 10.08% 36 months debt_consolidation 8.11% NJ MORTGAGE 5166.67 710-714 5 5815 2 10+ years
74047 30000 30000.00 23.28% 60 months other 12.10% IL MORTGAGE 7083.33 675-679 16 17969 1 10+ years
49957 24000 23975.00 14.65% 36 months debt_consolidation 15.29% WA MORTGAGE 6666.67 685-689 13 17521 0 5 years
23735 30000 29950.00 16.77% 60 months debt_consolidation 19.23% NY MORTGAGE 9250.00 705-709 15 45880 1 8 years
65882 16000 16000.00 14.09% 60 months home_improvement 21.54% MD OWN 8903.25 740-744 18 18898 1 10+ years
55610 10000 10000.00 13.99% 36 months debt_consolidation 4.89% PA MORTGAGE 2166.67 680-684 4 4544 0 10+ years
38576 6000 6000.00 12.42% 36 months major_purchase 16.66% NJ RENT 3500.00 675-679 8 7753 0 5 years
3116 9000 5242.75 13.79% 36 months debt_consolidation 6.76% NY RENT 3875.00 670-674 7 7589 0 10+ years
\n", "

2500 rows × 14 columns

\n", "
" ], "text/plain": [ " Amount.Requested Amount.Funded.By.Investors Interest.Rate \\\n", "81174 20000 20000.00 8.90% \n", "99592 19200 19200.00 12.12% \n", "80059 35000 35000.00 21.98% \n", "15825 10000 9975.00 9.99% \n", "33182 12000 12000.00 11.71% \n", "62403 6000 6000.00 15.31% \n", "48808 10000 10000.00 7.90% \n", "22090 33500 33450.00 17.14% \n", "76404 14675 14675.00 14.33% \n", "15867 7000 7000.00 6.91% \n", "94971 2000 2000.00 19.72% \n", "36911 10625 10625.00 14.27% \n", "41200 28000 27975.00 21.67% \n", "83869 35000 34950.00 8.90% \n", "53853 9600 9600.00 7.62% \n", "21399 25000 24975.00 15.65% \n", "62127 10000 10000.00 12.12% \n", "23446 14000 13900.25 10.37% \n", "44987 10000 10000.00 9.76% \n", "17977 5200 5175.00 9.99% \n", "86099 22000 21975.00 21.98% \n", "99483 30000 30000.00 19.05% \n", "28798 6500 6500.00 17.99% \n", "24168 17400 17400.00 11.99% \n", "10356 4000 4000.00 16.82% \n", "46027 7200 7200.00 7.90% \n", "2238 8000 8000.00 14.42% \n", "65278 8000 8000.00 15.31% \n", "4227 3000 3000.00 8.59% \n", "50182 14500 14500.00 7.90% \n", "... ... ... ... \n", "84265 20000 20000.00 22.95% \n", "80231 19000 19000.00 7.90% \n", "49533 17300 17250.00 22.45% \n", "102514 7000 711.54 15.13% \n", "78618 7200 7200.00 18.75% \n", "86953 10000 10000.00 14.09% \n", "80129 4000 3925.00 14.09% \n", "85216 17500 17500.00 8.90% \n", "38247 20000 20000.00 11.71% \n", "91245 16200 16200.00 15.80% \n", "53041 10000 10000.00 6.03% \n", "63051 27000 27000.00 6.62% \n", "14446 4500 4475.00 7.51% \n", "68628 15875 15875.00 14.33% \n", "98758 15000 15000.00 10.16% \n", "13070 25000 24950.00 10.75% \n", "45836 7000 7000.00 17.27% \n", "52330 15000 15000.00 19.99% \n", "48243 17000 17000.00 15.81% \n", "63256 19075 19075.00 18.75% \n", "42124 10000 10000.00 11.71% \n", "78043 8475 8475.00 7.62% \n", "925 6400 6350.00 10.08% \n", "74047 30000 30000.00 23.28% \n", "49957 24000 23975.00 14.65% \n", "23735 30000 29950.00 16.77% \n", "65882 16000 16000.00 14.09% \n", "55610 10000 10000.00 13.99% \n", "38576 6000 6000.00 12.42% \n", "3116 9000 5242.75 13.79% \n", "\n", " Loan.Length Loan.Purpose Debt.To.Income.Ratio State \\\n", "81174 36 months debt_consolidation 14.90% SC \n", "99592 36 months debt_consolidation 28.36% TX \n", "80059 60 months debt_consolidation 23.81% CA \n", "15825 36 months debt_consolidation 14.30% KS \n", "33182 36 months credit_card 18.78% NJ \n", "62403 36 months other 20.05% CT \n", "48808 36 months debt_consolidation 26.09% MA \n", "22090 60 months credit_card 14.70% LA \n", "76404 36 months credit_card 26.92% CA \n", "15867 36 months credit_card 7.10% CA \n", "94971 36 months moving 10.29% FL \n", "36911 36 months debt_consolidation 12.54% CA \n", "41200 60 months debt_consolidation 13.07% CT \n", "83869 36 months debt_consolidation 20.46% CT \n", "53853 36 months debt_consolidation 3.45% DC \n", "21399 60 months debt_consolidation 21.99% CA \n", "62127 36 months debt_consolidation 17.72% CA \n", "23446 60 months debt_consolidation 11.95% OH \n", "44987 36 months credit_card 7.13% FL \n", "17977 60 months debt_consolidation 10.29% AL \n", "86099 36 months debt_consolidation 11.19% TX \n", "99483 60 months credit_card 21.25% FL \n", "28798 60 months car 19.63% FL \n", "24168 36 months credit_card 12.47% AZ \n", "10356 60 months vacation 13.71% GA \n", "46027 36 months debt_consolidation 24.82% TX \n", "2238 36 months debt_consolidation 24.63% MA \n", "65278 36 months debt_consolidation 15.46% CA \n", "4227 36 months other 3.72% MA \n", "50182 36 months debt_consolidation 4.85% GA \n", "... ... ... ... ... \n", "84265 60 months debt_consolidation 7.10% NJ \n", "80231 36 months debt_consolidation 9.76% RI \n", "49533 60 months wedding 3.58% PA \n", "102514 36 months major_purchase 18.91% CO \n", "78618 36 months debt_consolidation 16.21% MI \n", "86953 36 months major_purchase 9.71% CO \n", "80129 36 months credit_card 12.27% FL \n", "85216 36 months debt_consolidation 10.94% UT \n", "38247 36 months credit_card 9.58% SD \n", "91245 60 months debt_consolidation 7.92% PA \n", "53041 36 months small_business 13.03% FL \n", "63051 36 months debt_consolidation 12.21% OH \n", "14446 36 months small_business 20.27% VA \n", "68628 36 months small_business 17.44% MD \n", "98758 36 months credit_card 28.28% OH \n", "13070 36 months debt_consolidation 20.48% OR \n", "45836 36 months other 18.38% NY \n", "52330 36 months wedding 18.05% CA \n", "48243 36 months debt_consolidation 17.01% CO \n", "63256 36 months debt_consolidation 15.23% NY \n", "42124 36 months debt_consolidation 8.40% CA \n", "78043 36 months debt_consolidation 15.88% CA \n", "925 36 months debt_consolidation 8.11% NJ \n", "74047 60 months other 12.10% IL \n", "49957 36 months debt_consolidation 15.29% WA \n", "23735 60 months debt_consolidation 19.23% NY \n", "65882 60 months home_improvement 21.54% MD \n", "55610 36 months debt_consolidation 4.89% PA \n", "38576 36 months major_purchase 16.66% NJ \n", "3116 36 months debt_consolidation 6.76% NY \n", "\n", " Home.Ownership Monthly.Income FICO.Range Open.CREDIT.Lines \\\n", "81174 MORTGAGE 6541.67 735-739 14 \n", "99592 MORTGAGE 4583.33 715-719 12 \n", "80059 MORTGAGE 11500.00 690-694 14 \n", "15825 MORTGAGE 3833.33 695-699 10 \n", "33182 RENT 3195.00 695-699 11 \n", "62403 OWN 4891.67 670-674 17 \n", "48808 RENT 2916.67 720-724 10 \n", "22090 MORTGAGE 13863.42 705-709 12 \n", "76404 RENT 3150.00 685-689 9 \n", "15867 RENT 5000.00 715-719 8 \n", "94971 RENT 3575.00 670-674 10 \n", "36911 MORTGAGE 4250.00 665-669 14 \n", "41200 MORTGAGE 14166.67 670-674 12 \n", "83869 RENT 9166.67 735-739 19 \n", "53853 RENT 11250.00 725-729 13 \n", "21399 RENT 5416.67 730-734 6 \n", "62127 RENT 9000.00 695-699 18 \n", "23446 RENT 4333.33 740-744 6 \n", "44987 RENT 2733.33 730-734 7 \n", "17977 MORTGAGE 3750.00 760-764 10 \n", "86099 MORTGAGE 6666.67 665-669 9 \n", "99483 MORTGAGE 6250.00 695-699 12 \n", "28798 RENT 4100.00 665-669 11 \n", "24168 RENT 6833.33 695-699 7 \n", "10356 MORTGAGE 4500.00 670-674 5 \n", "46027 RENT 5416.67 705-709 8 \n", "2238 RENT 2964.17 675-679 9 \n", "65278 MORTGAGE 2916.67 675-679 13 \n", "4227 MORTGAGE 4167.00 765-769 4 \n", "50182 MORTGAGE 3958.33 760-764 4 \n", "... ... ... ... ... \n", "84265 RENT 6750.00 665-669 6 \n", "80231 MORTGAGE 5166.67 770-774 18 \n", "49533 MORTGAGE 5500.00 685-689 11 \n", "102514 MORTGAGE 3833.00 650-654 13 \n", "78618 RENT 8333.33 660-664 8 \n", "86953 RENT 4583.33 675-679 6 \n", "80129 MORTGAGE 8583.33 675-679 9 \n", "85216 MORTGAGE 25000.00 730-734 9 \n", "38247 MORTGAGE 5416.67 725-729 12 \n", "91245 MORTGAGE 4833.33 680-684 12 \n", "53041 RENT 5000.00 760-764 8 \n", "63051 MORTGAGE 9250.00 810-814 12 \n", "14446 MORTGAGE 7075.50 720-724 15 \n", "68628 MORTGAGE 3416.67 675-679 11 \n", "98758 MORTGAGE 6666.67 690-694 15 \n", "13070 MORTGAGE 7083.33 765-769 10 \n", "45836 OWN 2464.37 665-669 9 \n", "52330 RENT 8000.00 660-664 6 \n", "48243 RENT 3833.33 685-689 6 \n", "63256 RENT 5166.67 670-674 17 \n", "42124 RENT 4500.00 710-714 8 \n", "78043 RENT 3983.33 720-724 9 \n", "925 MORTGAGE 5166.67 710-714 5 \n", "74047 MORTGAGE 7083.33 675-679 16 \n", "49957 MORTGAGE 6666.67 685-689 13 \n", "23735 MORTGAGE 9250.00 705-709 15 \n", "65882 OWN 8903.25 740-744 18 \n", "55610 MORTGAGE 2166.67 680-684 4 \n", "38576 RENT 3500.00 675-679 8 \n", "3116 RENT 3875.00 670-674 7 \n", "\n", " Revolving.CREDIT.Balance Inquiries.in.the.Last.6.Months \\\n", "81174 14272 2 \n", "99592 11140 1 \n", "80059 21977 1 \n", "15825 9346 0 \n", "33182 14469 0 \n", "62403 10391 2 \n", "48808 15957 0 \n", "22090 27874 0 \n", "76404 7246 1 \n", "15867 7612 0 \n", "94971 12036 0 \n", "36911 10767 0 \n", "41200 10311 0 \n", "83869 21536 0 \n", "53853 4606 0 \n", "21399 13929 0 \n", "62127 20317 0 \n", "23446 7419 0 \n", "44987 6112 2 \n", "17977 16094 0 \n", "86099 23124 0 \n", "99483 34927 0 \n", "28798 11697 1 \n", "24168 26587 0 \n", "10356 20804 0 \n", "46027 12017 0 \n", "2238 8928 2 \n", "65278 7152 1 \n", "4227 7074 0 \n", "50182 9598 0 \n", "... ... ... \n", "84265 16104 1 \n", "80231 43617 2 \n", "49533 2306 3 \n", "102514 12634 0 \n", "78618 28916 0 \n", "86953 3859 0 \n", "80129 36943 1 \n", "85216 34545 0 \n", "38247 18267 1 \n", "91245 12313 2 \n", "53041 3952 0 \n", "63051 4211 0 \n", "14446 68618 2 \n", "68628 15891 0 \n", "98758 14880 0 \n", "13070 25429 0 \n", "45836 7089 0 \n", "52330 45976 1 \n", "48243 15484 1 \n", "63256 13749 3 \n", "42124 8404 1 \n", "78043 6882 0 \n", "925 5815 2 \n", "74047 17969 1 \n", "49957 17521 0 \n", "23735 45880 1 \n", "65882 18898 1 \n", "55610 4544 0 \n", "38576 7753 0 \n", "3116 7589 0 \n", "\n", " Employment.Length \n", "81174 < 1 year \n", "99592 2 years \n", "80059 2 years \n", "15825 5 years \n", "33182 9 years \n", "62403 3 years \n", "48808 10+ years \n", "22090 10+ years \n", "76404 8 years \n", "15867 3 years \n", "94971 6 years \n", "36911 < 1 year \n", "41200 1 year \n", "83869 1 year \n", "53853 < 1 year \n", "21399 9 years \n", "62127 7 years \n", "23446 9 years \n", "44987 3 years \n", "17977 < 1 year \n", "86099 10+ years \n", "99483 6 years \n", "28798 2 years \n", "24168 7 years \n", "10356 3 years \n", "46027 7 years \n", "2238 6 years \n", "65278 5 years \n", "4227 5 years \n", "50182 4 years \n", "... ... \n", "84265 7 years \n", "80231 10+ years \n", "49533 4 years \n", "102514 2 years \n", "78618 10+ years \n", "86953 < 1 year \n", "80129 10+ years \n", "85216 6 years \n", "38247 4 years \n", "91245 10+ years \n", "53041 2 years \n", "63051 5 years \n", "14446 10+ years \n", "68628 2 years \n", "98758 10+ years \n", "13070 6 years \n", "45836 3 years \n", "52330 2 years \n", "48243 6 years \n", "63256 10+ years \n", "42124 3 years \n", "78043 n/a \n", "925 10+ years \n", "74047 10+ years \n", "49957 5 years \n", "23735 8 years \n", "65882 10+ years \n", "55610 10+ years \n", "38576 5 years \n", "3116 10+ years \n", "\n", "[2500 rows x 14 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%pylab inline\n", "import pandas as pd\n", "from numpy import nan as NA\n", "loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')\n", "loansData" ] }, { "cell_type": "code", "execution_count": 330, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "81174 8.90%\n", "99592 12.12%\n", "80059 21.98%\n", "15825 9.99%\n", "33182 11.71%\n", "Name: Interest.Rate, dtype: object" ] }, "execution_count": 330, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loansData['Interest.Rate'][0:5] # first five rows of Interest.Rate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* removing % signs from rates" ] }, { "cell_type": "code", "execution_count": 331, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "81174 8.90%\n", "99592 12.12%\n", "80059 21.98%\n", "15825 9.99%\n", "33182 11.71%\n", "Name: Interest.Rate, dtype: object" ] }, "execution_count": 331, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ir = loansData['Interest.Rate']\n", "ir[0:5]" ] }, { "cell_type": "code", "execution_count": 332, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n",
       "<class 'pandas.core.frame.DataFrame'>\n",
       "Int64Index: 2500 entries, 81174 to 3116\n",
       "Data columns (total 14 columns):\n",
       "Amount.Requested                  2500  non-null values\n",
       "Amount.Funded.By.Investors        2500  non-null values\n",
       "Interest.Rate                     2500  non-null values\n",
       "Loan.Length                       2500  non-null values\n",
       "Loan.Purpose                      2500  non-null values\n",
       "Debt.To.Income.Ratio              2500  non-null values\n",
       "State                             2500  non-null values\n",
       "Home.Ownership                    2500  non-null values\n",
       "Monthly.Income                    2499  non-null values\n",
       "FICO.Range                        2500  non-null values\n",
       "Open.CREDIT.Lines                 2498  non-null values\n",
       "Revolving.CREDIT.Balance          2498  non-null values\n",
       "Inquiries.in.the.Last.6.Months    2498  non-null values\n",
       "Employment.Length                 2500  non-null values\n",
       "dtypes: float64(5), int64(1), object(8)\n",
       "
" ], "text/plain": [ "\n", "Int64Index: 2500 entries, 81174 to 3116\n", "Data columns (total 14 columns):\n", "Amount.Requested 2500 non-null values\n", "Amount.Funded.By.Investors 2500 non-null values\n", "Interest.Rate 2500 non-null values\n", "Loan.Length 2500 non-null values\n", "Loan.Purpose 2500 non-null values\n", "Debt.To.Income.Ratio 2500 non-null values\n", "State 2500 non-null values\n", "Home.Ownership 2500 non-null values\n", "Monthly.Income 2499 non-null values\n", "FICO.Range 2500 non-null values\n", "Open.CREDIT.Lines 2498 non-null values\n", "Revolving.CREDIT.Balance 2498 non-null values\n", "Inquiries.in.the.Last.6.Months 2498 non-null values\n", "Employment.Length 2500 non-null values\n", "dtypes: float64(5), int64(1), object(8)" ] }, "execution_count": 332, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(ir[0:1])\n", "irbak = ir\n", "type(irbak)\n", "loansDataBak = loansData\n", "loansDataBak.reset_index()\n", "loansDataBak\n", "loansDataBak['Interest.Rate']\n", "loansData" ] }, { "cell_type": "code", "execution_count": 333, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['8.90%', '12.12%', '21.98%', '9.99%', '11.71%']" ] }, "execution_count": 333, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ldb = loansDataBak.reset_index() # explain\n", "ldb['Interest.Rate'][0:5]\n", "irates = ldb['Interest.Rate'][0:]\n", "type(irates)\n", "\n", "[irates[k] for k in (0,1,2,3,4)]\n", "#int_rates = [ float(x[:-1]) for x in irates[k] ]\n", "#\"\"\"\n", "#---------------------------------------------------------------------------\n", "#ValueError Traceback (most recent call last)\n", "# in ()\n", "# 5 \n", "# 6 [irates[k] for k in (0,1,2,3,4)]\n", "#----> 7 int_rates = [ float(x[:-1]) for x in irates[k] ]\n", "#\n", "#ValueError: could not convert string to float: \n", "#\"\"\"\n", "\n" ] }, { "cell_type": "code", "execution_count": 334, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 334, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(ldb)" ] }, { "cell_type": "code", "execution_count": 335, "metadata": { "collapsed": false }, "outputs": [], "source": [ "srates = ldb['Interest.Rate']" ] }, { "cell_type": "code", "execution_count": 336, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#nas = [ x for x in srates if x.isnull() ] # AttributeError: 'str' object has no attribute 'isnull'" ] }, { "cell_type": "code", "execution_count": 337, "metadata": { "collapsed": false }, "outputs": [], "source": [ "nas = [ x for x in srates if not(x[0].isdigit()) ] # AttributeError: 'str' object has no attribute 'isnull'" ] }, { "cell_type": "code", "execution_count": 338, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 338, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(nas)" ] }, { "cell_type": "code", "execution_count": 339, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'8.90'" ] }, "execution_count": 339, "metadata": {}, "output_type": "execute_result" } ], "source": [ "srates[0][:-1]" ] }, { "cell_type": "code", "execution_count": 340, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "8.9" ] }, "execution_count": 340, "metadata": {}, "output_type": "execute_result" } ], "source": [ "float(srates[0][:-1])" ] }, { "cell_type": "code", "execution_count": 341, "metadata": { "collapsed": false }, "outputs": [], "source": [ "nopct = [ x[:-1] for x in srates ]" ] }, { "cell_type": "code", "execution_count": 342, "metadata": { "collapsed": false }, "outputs": [], "source": [ "flrates = [float(x) for x in nopct]" ] }, { "cell_type": "code", "execution_count": 343, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[8.9, 12.12, 21.98, 9.99, 11.71]" ] }, "execution_count": 343, "metadata": {}, "output_type": "execute_result" } ], "source": [ "flrates[0:5]" ] }, { "cell_type": "code", "execution_count": 344, "metadata": { "collapsed": false }, "outputs": [], "source": [ "flrate = map(float, nopct)" ] }, { "cell_type": "code", "execution_count": 345, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[8.9, 12.12, 21.98, 9.99, 11.71]" ] }, "execution_count": 345, "metadata": {}, "output_type": "execute_result" } ], "source": [ "flrate[0:5]" ] }, { "cell_type": "code", "execution_count": 346, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2500" ] }, "execution_count": 346, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(flrate)" ] }, { "cell_type": "code", "execution_count": 347, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n",
       "<class 'pandas.core.frame.DataFrame'>\n",
       "Int64Index: 2500 entries, 0 to 2499\n",
       "Data columns (total 15 columns):\n",
       "index                             2500  non-null values\n",
       "Amount.Requested                  2500  non-null values\n",
       "Amount.Funded.By.Investors        2500  non-null values\n",
       "Interest.Rate                     2500  non-null values\n",
       "Loan.Length                       2500  non-null values\n",
       "Loan.Purpose                      2500  non-null values\n",
       "Debt.To.Income.Ratio              2500  non-null values\n",
       "State                             2500  non-null values\n",
       "Home.Ownership                    2500  non-null values\n",
       "Monthly.Income                    2499  non-null values\n",
       "FICO.Range                        2500  non-null values\n",
       "Open.CREDIT.Lines                 2498  non-null values\n",
       "Revolving.CREDIT.Balance          2498  non-null values\n",
       "Inquiries.in.the.Last.6.Months    2498  non-null values\n",
       "Employment.Length                 2500  non-null values\n",
       "dtypes: float64(5), int64(2), object(8)\n",
       "
" ], "text/plain": [ "\n", "Int64Index: 2500 entries, 0 to 2499\n", "Data columns (total 15 columns):\n", "index 2500 non-null values\n", "Amount.Requested 2500 non-null values\n", "Amount.Funded.By.Investors 2500 non-null values\n", "Interest.Rate 2500 non-null values\n", "Loan.Length 2500 non-null values\n", "Loan.Purpose 2500 non-null values\n", "Debt.To.Income.Ratio 2500 non-null values\n", "State 2500 non-null values\n", "Home.Ownership 2500 non-null values\n", "Monthly.Income 2499 non-null values\n", "FICO.Range 2500 non-null values\n", "Open.CREDIT.Lines 2498 non-null values\n", "Revolving.CREDIT.Balance 2498 non-null values\n", "Inquiries.in.the.Last.6.Months 2498 non-null values\n", "Employment.Length 2500 non-null values\n", "dtypes: float64(5), int64(2), object(8)" ] }, "execution_count": 347, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ldb" ] }, { "cell_type": "code", "execution_count": 348, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ldb['Interest.Rate'] = flrate" ] }, { "cell_type": "code", "execution_count": 349, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n",
       "<class 'pandas.core.frame.DataFrame'>\n",
       "Int64Index: 5 entries, 0 to 4\n",
       "Data columns (total 15 columns):\n",
       "index                             5  non-null values\n",
       "Amount.Requested                  5  non-null values\n",
       "Amount.Funded.By.Investors        5  non-null values\n",
       "Interest.Rate                     5  non-null values\n",
       "Loan.Length                       5  non-null values\n",
       "Loan.Purpose                      5  non-null values\n",
       "Debt.To.Income.Ratio              5  non-null values\n",
       "State                             5  non-null values\n",
       "Home.Ownership                    5  non-null values\n",
       "Monthly.Income                    5  non-null values\n",
       "FICO.Range                        5  non-null values\n",
       "Open.CREDIT.Lines                 5  non-null values\n",
       "Revolving.CREDIT.Balance          5  non-null values\n",
       "Inquiries.in.the.Last.6.Months    5  non-null values\n",
       "Employment.Length                 5  non-null values\n",
       "dtypes: float64(6), int64(2), object(7)\n",
       "
" ], "text/plain": [ " index Amount.Requested Amount.Funded.By.Investors Interest.Rate \\\n", "0 81174 20000 20000 8.90 \n", "1 99592 19200 19200 12.12 \n", "2 80059 35000 35000 21.98 \n", "3 15825 10000 9975 9.99 \n", "4 33182 12000 12000 11.71 \n", "\n", " Loan.Length Loan.Purpose Debt.To.Income.Ratio State Home.Ownership \\\n", "0 36 months debt_consolidation 14.90% SC MORTGAGE \n", "1 36 months debt_consolidation 28.36% TX MORTGAGE \n", "2 60 months debt_consolidation 23.81% CA MORTGAGE \n", "3 36 months debt_consolidation 14.30% KS MORTGAGE \n", "4 36 months credit_card 18.78% NJ RENT \n", "\n", " Monthly.Income FICO.Range Open.CREDIT.Lines Revolving.CREDIT.Balance \\\n", "0 6541.67 735-739 14 14272 \n", "1 4583.33 715-719 12 11140 \n", "2 11500.00 690-694 14 21977 \n", "3 3833.33 695-699 10 9346 \n", "4 3195.00 695-699 11 14469 \n", "\n", " Inquiries.in.the.Last.6.Months Employment.Length \n", "0 2 < 1 year \n", "1 1 2 years \n", "2 1 2 years \n", "3 0 5 years \n", "4 0 9 years " ] }, "execution_count": 349, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ldb[0:5]" ] }, { "cell_type": "code", "execution_count": 350, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 8.90\n", "1 12.12\n", "2 21.98\n", "3 9.99\n", "4 11.71\n", "Name: Interest.Rate, dtype: float64" ] }, "execution_count": 350, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ldb['Interest.Rate'][0:5]" ] }, { "cell_type": "code", "execution_count": 351, "metadata": { "collapsed": false }, "outputs": [], "source": [ "srates = loansData['Interest.Rate']" ] }, { "cell_type": "code", "execution_count": 352, "metadata": { "collapsed": false }, "outputs": [], "source": [ "nopct = [ x[:-1] for x in srates ]" ] }, { "cell_type": "code", "execution_count": 353, "metadata": { "collapsed": false }, "outputs": [], "source": [ "flrates = [float(x) for x in nopct]\n" ] }, { "cell_type": "code", "execution_count": 354, "metadata": { "collapsed": false }, "outputs": [], "source": [ "rates = [float(x[:-1]) for x in srates] # use this" ] }, { "cell_type": "code", "execution_count": 355, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 355, "metadata": {}, "output_type": "execute_result" } ], "source": [ "flrates == rates" ] }, { "cell_type": "code", "execution_count": 356, "metadata": { "collapsed": false }, "outputs": [], "source": [ "loansData['Interest.Rate'] = flrates" ] }, { "cell_type": "code", "execution_count": 357, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "81174 8.90\n", "99592 12.12\n", "80059 21.98\n", "15825 9.99\n", "33182 11.71\n", "Name: Interest.Rate, dtype: float64" ] }, "execution_count": 357, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loansData['Interest.Rate'][0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conclusion of step 1\n", "* ok! whew! we're done with the % symbol stuff\n", "* we learnt quite a few things along the way that will be useful in the next part" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Remove the months" ] }, { "cell_type": "code", "execution_count": 358, "metadata": { "collapsed": false }, "outputs": [], "source": [ "withmons = ldb['Loan.Length'] " ] }, { "cell_type": "code", "execution_count": 358, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 359, "metadata": { "collapsed": false }, "outputs": [], "source": [ "wmons = withmons[0:]" ] }, { "cell_type": "code", "execution_count": 360, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 36 months\n", "1 36 months\n", "2 60 months\n", "3 36 months\n", "4 36 months\n", "Name: Loan.Length, dtype: object" ] }, "execution_count": 360, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wmons[0:5]" ] }, { "cell_type": "code", "execution_count": 361, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 36 months\n", "1 36 months\n", "2 60 months\n", "3 36 months\n", "4 36 months\n", "5 36 months\n", "6 36 months\n", "7 60 months\n", "8 36 months\n", "9 36 months\n", "10 36 months\n", "11 36 months\n", "12 60 months\n", "13 36 months\n", "14 36 months\n", "...\n", "2485 36 months\n", "2486 36 months\n", "2487 36 months\n", "2488 36 months\n", "2489 36 months\n", "2490 36 months\n", "2491 36 months\n", "2492 36 months\n", "2493 60 months\n", "2494 36 months\n", "2495 60 months\n", "2496 60 months\n", "2497 36 months\n", "2498 36 months\n", "2499 36 months\n", "Name: Loan.Length, Length: 2500, dtype: object" ] }, "execution_count": 361, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wmons" ] }, { "cell_type": "code", "execution_count": 362, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['36', 'months']" ] }, "execution_count": 362, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wmons[0].split()" ] }, { "cell_type": "code", "execution_count": 363, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'36'" ] }, "execution_count": 363, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wmons[0].split()[0]" ] }, { "cell_type": "code", "execution_count": 364, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "36" ] }, "execution_count": 364, "metadata": {}, "output_type": "execute_result" } ], "source": [ "int(wmons[0].split()[0])" ] }, { "cell_type": "code", "execution_count": 365, "metadata": { "collapsed": false }, "outputs": [], "source": [ "x = wmons[0].split()\n" ] }, { "cell_type": "code", "execution_count": 366, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'36'" ] }, "execution_count": 366, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x[0]" ] }, { "cell_type": "code", "execution_count": 367, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "36" ] }, "execution_count": 367, "metadata": {}, "output_type": "execute_result" } ], "source": [ "int(x[0])" ] }, { "cell_type": "code", "execution_count": 368, "metadata": { "collapsed": false }, "outputs": [], "source": [ "intmons = [ int(x.split()[0]) for x in wmons ] " ] }, { "cell_type": "code", "execution_count": 369, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[36, 36, 60, 36, 36, 36, 36, 60, 36, 36]" ] }, "execution_count": 369, "metadata": {}, "output_type": "execute_result" } ], "source": [ "intmons[0:10]" ] }, { "cell_type": "code", "execution_count": 370, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "81174 36 months\n", "99592 36 months\n", "80059 60 months\n", "15825 36 months\n", "33182 36 months\n", "62403 36 months\n", "48808 36 months\n", "22090 60 months\n", "76404 36 months\n", "15867 36 months\n", "94971 36 months\n", "36911 36 months\n", "41200 60 months\n", "83869 36 months\n", "53853 36 months\n", "...\n", "13070 36 months\n", "45836 36 months\n", "52330 36 months\n", "48243 36 months\n", "63256 36 months\n", "42124 36 months\n", "78043 36 months\n", "925 36 months\n", "74047 60 months\n", "49957 36 months\n", "23735 60 months\n", "65882 60 months\n", "55610 36 months\n", "38576 36 months\n", "3116 36 months\n", "Name: Loan.Length, Length: 2500, dtype: object" ] }, "execution_count": 370, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loansData['Loan.Length']" ] }, { "cell_type": "code", "execution_count": 371, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "36 months 1952\n", "60 months 548\n", "dtype: int64" ] }, "execution_count": 371, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loansData['Loan.Length'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conclusion of Step 2\n", "* Here we used the techniques we learned in Step 1.\n", " * Pull out a column from a data frame\n", " * Operate on it, perform some transformations\n", " * Replace the column in the original dataframe with this new column \n", "* We applied them to removing the ' months' suffix in the Loan.Length column." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Remove bad data" ] }, { "cell_type": "code", "execution_count": 372, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "count 2499.000000\n", "mean 5688.931321\n", "std 3963.118185\n", "min 588.500000\n", "25% 3500.000000\n", "50% 5000.000000\n", "75% 6800.000000\n", "max 102750.000000\n", "dtype: float64" ] }, "execution_count": 372, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loansData['Monthly.Income'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First remove implausible values. We see the max value to be 102750. \n", "This is a MONTHLY income of 100K dollars, which is certainly possible, but .... \n", "highly implausible for a person seeking a loan of a few 10's of K dollars, i.e. implausible in this context. " ] }, { "cell_type": "code", "execution_count": 373, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "54487" ] }, "execution_count": 373, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loansData['Monthly.Income'].idxmax() # find the place where the max occurs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But there's a better way - a row filter i.e. an expression used as a way to restrict the rows in a dataframe. \n", "In our case we want to eliminate rows above 100K dollars. i.e. only keep those less than 100K dollars." ] }, { "cell_type": "code", "execution_count": 374, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "81174 6541.67\n", "99592 4583.33\n", "80059 11500.00\n", "15825 3833.33\n", "33182 3195.00\n", "62403 4891.67\n", "48808 2916.67\n", "22090 13863.42\n", "76404 3150.00\n", "15867 5000.00\n", "94971 3575.00\n", "36911 4250.00\n", "41200 14166.67\n", "83869 9166.67\n", "53853 11250.00\n", "...\n", "13070 7083.33\n", "45836 2464.37\n", "52330 8000.00\n", "48243 3833.33\n", "63256 5166.67\n", "42124 4500.00\n", "78043 3983.33\n", "925 5166.67\n", "74047 7083.33\n", "49957 6666.67\n", "23735 9250.00\n", "65882 8903.25\n", "55610 2166.67\n", "38576 3500.00\n", "3116 3875.00\n", "Name: Monthly.Income, Length: 2498, dtype: float64" ] }, "execution_count": 374, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loansData['Monthly.Income'][loansData['Monthly.Income'] < 100000]" ] }, { "cell_type": "code", "execution_count": 374, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 375, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "count 2499.000000\n", "mean 5688.931321\n", "std 3963.118185\n", "min 588.500000\n", "25% 3500.000000\n", "50% 5000.000000\n", "75% 6800.000000\n", "max 102750.000000\n", "dtype: float64" ] }, "execution_count": 375, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loansData['Monthly.Income'].describe()" ] }, { "cell_type": "code", "execution_count": 376, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ldlt100 = ldb[ldb['Monthly.Income'] < 100000]" ] }, { "cell_type": "code", "execution_count": 377, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n",
       "<class 'pandas.core.frame.DataFrame'>\n",
       "Int64Index: 2498 entries, 0 to 2499\n",
       "Data columns (total 15 columns):\n",
       "index                             2498  non-null values\n",
       "Amount.Requested                  2498  non-null values\n",
       "Amount.Funded.By.Investors        2498  non-null values\n",
       "Interest.Rate                     2498  non-null values\n",
       "Loan.Length                       2498  non-null values\n",
       "Loan.Purpose                      2498  non-null values\n",
       "Debt.To.Income.Ratio              2498  non-null values\n",
       "State                             2498  non-null values\n",
       "Home.Ownership                    2498  non-null values\n",
       "Monthly.Income                    2498  non-null values\n",
       "FICO.Range                        2498  non-null values\n",
       "Open.CREDIT.Lines                 2497  non-null values\n",
       "Revolving.CREDIT.Balance          2497  non-null values\n",
       "Inquiries.in.the.Last.6.Months    2497  non-null values\n",
       "Employment.Length                 2498  non-null values\n",
       "dtypes: float64(6), int64(2), object(7)\n",
       "
" ], "text/plain": [ "\n", "Int64Index: 2498 entries, 0 to 2499\n", "Data columns (total 15 columns):\n", "index 2498 non-null values\n", "Amount.Requested 2498 non-null values\n", "Amount.Funded.By.Investors 2498 non-null values\n", "Interest.Rate 2498 non-null values\n", "Loan.Length 2498 non-null values\n", "Loan.Purpose 2498 non-null values\n", "Debt.To.Income.Ratio 2498 non-null values\n", "State 2498 non-null values\n", "Home.Ownership 2498 non-null values\n", "Monthly.Income 2498 non-null values\n", "FICO.Range 2498 non-null values\n", "Open.CREDIT.Lines 2497 non-null values\n", "Revolving.CREDIT.Balance 2497 non-null values\n", "Inquiries.in.the.Last.6.Months 2497 non-null values\n", "Employment.Length 2498 non-null values\n", "dtypes: float64(6), int64(2), object(7)" ] }, "execution_count": 377, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ldlt100" ] }, { "cell_type": "code", "execution_count": 378, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2498" ] }, "execution_count": 378, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(ldlt100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now drop any rows that have 'NA' values ie data not available. \n", "In database terminiology these would be 'NULL' values." ] }, { "cell_type": "code", "execution_count": 379, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ldb2 = ldlt100.dropna()" ] }, { "cell_type": "code", "execution_count": 380, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n",
       "<class 'pandas.core.frame.DataFrame'>\n",
       "Int64Index: 2497 entries, 0 to 2499\n",
       "Data columns (total 15 columns):\n",
       "index                             2497  non-null values\n",
       "Amount.Requested                  2497  non-null values\n",
       "Amount.Funded.By.Investors        2497  non-null values\n",
       "Interest.Rate                     2497  non-null values\n",
       "Loan.Length                       2497  non-null values\n",
       "Loan.Purpose                      2497  non-null values\n",
       "Debt.To.Income.Ratio              2497  non-null values\n",
       "State                             2497  non-null values\n",
       "Home.Ownership                    2497  non-null values\n",
       "Monthly.Income                    2497  non-null values\n",
       "FICO.Range                        2497  non-null values\n",
       "Open.CREDIT.Lines                 2497  non-null values\n",
       "Revolving.CREDIT.Balance          2497  non-null values\n",
       "Inquiries.in.the.Last.6.Months    2497  non-null values\n",
       "Employment.Length                 2497  non-null values\n",
       "dtypes: float64(6), int64(2), object(7)\n",
       "
" ], "text/plain": [ "\n", "Int64Index: 2497 entries, 0 to 2499\n", "Data columns (total 15 columns):\n", "index 2497 non-null values\n", "Amount.Requested 2497 non-null values\n", "Amount.Funded.By.Investors 2497 non-null values\n", "Interest.Rate 2497 non-null values\n", "Loan.Length 2497 non-null values\n", "Loan.Purpose 2497 non-null values\n", "Debt.To.Income.Ratio 2497 non-null values\n", "State 2497 non-null values\n", "Home.Ownership 2497 non-null values\n", "Monthly.Income 2497 non-null values\n", "FICO.Range 2497 non-null values\n", "Open.CREDIT.Lines 2497 non-null values\n", "Revolving.CREDIT.Balance 2497 non-null values\n", "Inquiries.in.the.Last.6.Months 2497 non-null values\n", "Employment.Length 2497 non-null values\n", "dtypes: float64(6), int64(2), object(7)" ] }, "execution_count": 380, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ldb2" ] }, { "cell_type": "code", "execution_count": 381, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2497" ] }, "execution_count": 381, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(ldb2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we dropped one row that had an NA value somewhere." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conclusion of step 3\n", "* We used techniques we learned in step 1 to pick out a columns and operate on it\n", "* We also learned how to filter the data based on expressions involving column values\n", "* Finally we learnt how to drop NA values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have removed the \"bad\" data, let's take on the final data-cleaning task for this data set - converting the ranges \n", "to single integers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Change FICO range to a single value" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the FICO values are given in a range which is in the form of a string that looks like lowerlimit-upperlimit, eg 720-724. \n", "We want to convert these values to a single int value representing the lower limit. e.g. 720 in the above example.\n" ] }, { "cell_type": "code", "execution_count": 382, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ficostr = ldb2['FICO.Range'] " ] }, { "cell_type": "code", "execution_count": 383, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 735-739\n", "1 715-719\n", "2 690-694\n", "3 695-699\n", "4 695-699\n", "5 670-674\n", "6 720-724\n", "7 705-709\n", "8 685-689\n", "9 715-719\n", "Name: FICO.Range, dtype: object" ] }, "execution_count": 383, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ficostr[0:10]" ] }, { "cell_type": "code", "execution_count": 384, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'735-739'" ] }, "execution_count": 384, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ficostr[0]" ] }, { "cell_type": "code", "execution_count": 385, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ficoint = [ int(x.split('-')[0]) for x in ficostr ]" ] }, { "cell_type": "code", "execution_count": 386, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[735, 715, 690, 695, 695, 670, 720, 705, 685, 715]" ] }, "execution_count": 386, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ficoint[0:10]" ] }, { "cell_type": "code", "execution_count": 387, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ldb2['FICO.Range'] = ficoint" ] }, { "cell_type": "code", "execution_count": 388, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2497" ] }, "execution_count": 388, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(ficoint)" ] }, { "cell_type": "code", "execution_count": 389, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 735\n", "1 715\n", "2 690\n", "3 695\n", "4 695\n", "5 670\n", "6 720\n", "7 705\n", "8 685\n", "9 715\n", "10 670\n", "11 665\n", "12 670\n", "13 735\n", "14 725\n", "...\n", "2485 765\n", "2486 665\n", "2487 660\n", "2488 685\n", "2489 670\n", "2490 710\n", "2491 720\n", "2492 710\n", "2493 675\n", "2494 685\n", "2495 705\n", "2496 740\n", "2497 680\n", "2498 675\n", "2499 670\n", "Name: FICO.Range, Length: 2497, dtype: int64" ] }, "execution_count": 389, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ldb2['FICO.Range']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Conclusion of Step 4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We used techniques similar to the ones in Step 1\n", " * we picked a column - the FICO.Range column\n", " * we split the values on the separator '-'\n", " * we picked the fist value, i.e. the lower limit\n", " * we converted it to an int" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have a dataset that we can use for our data exploration and analysis" ] }, { "cell_type": "code", "execution_count": 389, "metadata": { "collapsed": false }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.8" } }, "nbformat": 4, "nbformat_minor": 0 }