{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Linear Regression - Data Cleanup\n", "================================" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we get into the detailed steps involved in data cleanup of the Lending Club dataset. We glossed over this in the Data Exploration lesson and left it as an exercise. This is where we actually do the cleanup step by step." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "1. Remove '%'\n", " * strip the last letter\n", " * map to float\n", "\n", "\n", "2. Remove ' months' suffix\n", " * strip the tailing letters\n", " * map to int\n", "\n", "3. Remove bad data\n", " * n/a values\n", " * outlier income\n", " * histograms\n", " * describe\n", "\n", "4. Map FICO.Range to single int\n", " * split on '-'\n", " * use bottom value\n", " * convert to int\n", " * show how to do mid-point\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] }, { "data": { "text/html": [ "
\n", " | Amount.Requested | \n", "Amount.Funded.By.Investors | \n", "Interest.Rate | \n", "Loan.Length | \n", "Loan.Purpose | \n", "Debt.To.Income.Ratio | \n", "State | \n", "Home.Ownership | \n", "Monthly.Income | \n", "FICO.Range | \n", "Open.CREDIT.Lines | \n", "Revolving.CREDIT.Balance | \n", "Inquiries.in.the.Last.6.Months | \n", "Employment.Length | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
81174 | \n", "20000 | \n", "20000.00 | \n", "8.90% | \n", "36 months | \n", "debt_consolidation | \n", "14.90% | \n", "SC | \n", "MORTGAGE | \n", "6541.67 | \n", "735-739 | \n", "14 | \n", "14272 | \n", "2 | \n", "< 1 year | \n", "
99592 | \n", "19200 | \n", "19200.00 | \n", "12.12% | \n", "36 months | \n", "debt_consolidation | \n", "28.36% | \n", "TX | \n", "MORTGAGE | \n", "4583.33 | \n", "715-719 | \n", "12 | \n", "11140 | \n", "1 | \n", "2 years | \n", "
80059 | \n", "35000 | \n", "35000.00 | \n", "21.98% | \n", "60 months | \n", "debt_consolidation | \n", "23.81% | \n", "CA | \n", "MORTGAGE | \n", "11500.00 | \n", "690-694 | \n", "14 | \n", "21977 | \n", "1 | \n", "2 years | \n", "
15825 | \n", "10000 | \n", "9975.00 | \n", "9.99% | \n", "36 months | \n", "debt_consolidation | \n", "14.30% | \n", "KS | \n", "MORTGAGE | \n", "3833.33 | \n", "695-699 | \n", "10 | \n", "9346 | \n", "0 | \n", "5 years | \n", "
33182 | \n", "12000 | \n", "12000.00 | \n", "11.71% | \n", "36 months | \n", "credit_card | \n", "18.78% | \n", "NJ | \n", "RENT | \n", "3195.00 | \n", "695-699 | \n", "11 | \n", "14469 | \n", "0 | \n", "9 years | \n", "
62403 | \n", "6000 | \n", "6000.00 | \n", "15.31% | \n", "36 months | \n", "other | \n", "20.05% | \n", "CT | \n", "OWN | \n", "4891.67 | \n", "670-674 | \n", "17 | \n", "10391 | \n", "2 | \n", "3 years | \n", "
48808 | \n", "10000 | \n", "10000.00 | \n", "7.90% | \n", "36 months | \n", "debt_consolidation | \n", "26.09% | \n", "MA | \n", "RENT | \n", "2916.67 | \n", "720-724 | \n", "10 | \n", "15957 | \n", "0 | \n", "10+ years | \n", "
22090 | \n", "33500 | \n", "33450.00 | \n", "17.14% | \n", "60 months | \n", "credit_card | \n", "14.70% | \n", "LA | \n", "MORTGAGE | \n", "13863.42 | \n", "705-709 | \n", "12 | \n", "27874 | \n", "0 | \n", "10+ years | \n", "
76404 | \n", "14675 | \n", "14675.00 | \n", "14.33% | \n", "36 months | \n", "credit_card | \n", "26.92% | \n", "CA | \n", "RENT | \n", "3150.00 | \n", "685-689 | \n", "9 | \n", "7246 | \n", "1 | \n", "8 years | \n", "
15867 | \n", "7000 | \n", "7000.00 | \n", "6.91% | \n", "36 months | \n", "credit_card | \n", "7.10% | \n", "CA | \n", "RENT | \n", "5000.00 | \n", "715-719 | \n", "8 | \n", "7612 | \n", "0 | \n", "3 years | \n", "
94971 | \n", "2000 | \n", "2000.00 | \n", "19.72% | \n", "36 months | \n", "moving | \n", "10.29% | \n", "FL | \n", "RENT | \n", "3575.00 | \n", "670-674 | \n", "10 | \n", "12036 | \n", "0 | \n", "6 years | \n", "
36911 | \n", "10625 | \n", "10625.00 | \n", "14.27% | \n", "36 months | \n", "debt_consolidation | \n", "12.54% | \n", "CA | \n", "MORTGAGE | \n", "4250.00 | \n", "665-669 | \n", "14 | \n", "10767 | \n", "0 | \n", "< 1 year | \n", "
41200 | \n", "28000 | \n", "27975.00 | \n", "21.67% | \n", "60 months | \n", "debt_consolidation | \n", "13.07% | \n", "CT | \n", "MORTGAGE | \n", "14166.67 | \n", "670-674 | \n", "12 | \n", "10311 | \n", "0 | \n", "1 year | \n", "
83869 | \n", "35000 | \n", "34950.00 | \n", "8.90% | \n", "36 months | \n", "debt_consolidation | \n", "20.46% | \n", "CT | \n", "RENT | \n", "9166.67 | \n", "735-739 | \n", "19 | \n", "21536 | \n", "0 | \n", "1 year | \n", "
53853 | \n", "9600 | \n", "9600.00 | \n", "7.62% | \n", "36 months | \n", "debt_consolidation | \n", "3.45% | \n", "DC | \n", "RENT | \n", "11250.00 | \n", "725-729 | \n", "13 | \n", "4606 | \n", "0 | \n", "< 1 year | \n", "
21399 | \n", "25000 | \n", "24975.00 | \n", "15.65% | \n", "60 months | \n", "debt_consolidation | \n", "21.99% | \n", "CA | \n", "RENT | \n", "5416.67 | \n", "730-734 | \n", "6 | \n", "13929 | \n", "0 | \n", "9 years | \n", "
62127 | \n", "10000 | \n", "10000.00 | \n", "12.12% | \n", "36 months | \n", "debt_consolidation | \n", "17.72% | \n", "CA | \n", "RENT | \n", "9000.00 | \n", "695-699 | \n", "18 | \n", "20317 | \n", "0 | \n", "7 years | \n", "
23446 | \n", "14000 | \n", "13900.25 | \n", "10.37% | \n", "60 months | \n", "debt_consolidation | \n", "11.95% | \n", "OH | \n", "RENT | \n", "4333.33 | \n", "740-744 | \n", "6 | \n", "7419 | \n", "0 | \n", "9 years | \n", "
44987 | \n", "10000 | \n", "10000.00 | \n", "9.76% | \n", "36 months | \n", "credit_card | \n", "7.13% | \n", "FL | \n", "RENT | \n", "2733.33 | \n", "730-734 | \n", "7 | \n", "6112 | \n", "2 | \n", "3 years | \n", "
17977 | \n", "5200 | \n", "5175.00 | \n", "9.99% | \n", "60 months | \n", "debt_consolidation | \n", "10.29% | \n", "AL | \n", "MORTGAGE | \n", "3750.00 | \n", "760-764 | \n", "10 | \n", "16094 | \n", "0 | \n", "< 1 year | \n", "
86099 | \n", "22000 | \n", "21975.00 | \n", "21.98% | \n", "36 months | \n", "debt_consolidation | \n", "11.19% | \n", "TX | \n", "MORTGAGE | \n", "6666.67 | \n", "665-669 | \n", "9 | \n", "23124 | \n", "0 | \n", "10+ years | \n", "
99483 | \n", "30000 | \n", "30000.00 | \n", "19.05% | \n", "60 months | \n", "credit_card | \n", "21.25% | \n", "FL | \n", "MORTGAGE | \n", "6250.00 | \n", "695-699 | \n", "12 | \n", "34927 | \n", "0 | \n", "6 years | \n", "
28798 | \n", "6500 | \n", "6500.00 | \n", "17.99% | \n", "60 months | \n", "car | \n", "19.63% | \n", "FL | \n", "RENT | \n", "4100.00 | \n", "665-669 | \n", "11 | \n", "11697 | \n", "1 | \n", "2 years | \n", "
24168 | \n", "17400 | \n", "17400.00 | \n", "11.99% | \n", "36 months | \n", "credit_card | \n", "12.47% | \n", "AZ | \n", "RENT | \n", "6833.33 | \n", "695-699 | \n", "7 | \n", "26587 | \n", "0 | \n", "7 years | \n", "
10356 | \n", "4000 | \n", "4000.00 | \n", "16.82% | \n", "60 months | \n", "vacation | \n", "13.71% | \n", "GA | \n", "MORTGAGE | \n", "4500.00 | \n", "670-674 | \n", "5 | \n", "20804 | \n", "0 | \n", "3 years | \n", "
46027 | \n", "7200 | \n", "7200.00 | \n", "7.90% | \n", "36 months | \n", "debt_consolidation | \n", "24.82% | \n", "TX | \n", "RENT | \n", "5416.67 | \n", "705-709 | \n", "8 | \n", "12017 | \n", "0 | \n", "7 years | \n", "
2238 | \n", "8000 | \n", "8000.00 | \n", "14.42% | \n", "36 months | \n", "debt_consolidation | \n", "24.63% | \n", "MA | \n", "RENT | \n", "2964.17 | \n", "675-679 | \n", "9 | \n", "8928 | \n", "2 | \n", "6 years | \n", "
65278 | \n", "8000 | \n", "8000.00 | \n", "15.31% | \n", "36 months | \n", "debt_consolidation | \n", "15.46% | \n", "CA | \n", "MORTGAGE | \n", "2916.67 | \n", "675-679 | \n", "13 | \n", "7152 | \n", "1 | \n", "5 years | \n", "
4227 | \n", "3000 | \n", "3000.00 | \n", "8.59% | \n", "36 months | \n", "other | \n", "3.72% | \n", "MA | \n", "MORTGAGE | \n", "4167.00 | \n", "765-769 | \n", "4 | \n", "7074 | \n", "0 | \n", "5 years | \n", "
50182 | \n", "14500 | \n", "14500.00 | \n", "7.90% | \n", "36 months | \n", "debt_consolidation | \n", "4.85% | \n", "GA | \n", "MORTGAGE | \n", "3958.33 | \n", "760-764 | \n", "4 | \n", "9598 | \n", "0 | \n", "4 years | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
84265 | \n", "20000 | \n", "20000.00 | \n", "22.95% | \n", "60 months | \n", "debt_consolidation | \n", "7.10% | \n", "NJ | \n", "RENT | \n", "6750.00 | \n", "665-669 | \n", "6 | \n", "16104 | \n", "1 | \n", "7 years | \n", "
80231 | \n", "19000 | \n", "19000.00 | \n", "7.90% | \n", "36 months | \n", "debt_consolidation | \n", "9.76% | \n", "RI | \n", "MORTGAGE | \n", "5166.67 | \n", "770-774 | \n", "18 | \n", "43617 | \n", "2 | \n", "10+ years | \n", "
49533 | \n", "17300 | \n", "17250.00 | \n", "22.45% | \n", "60 months | \n", "wedding | \n", "3.58% | \n", "PA | \n", "MORTGAGE | \n", "5500.00 | \n", "685-689 | \n", "11 | \n", "2306 | \n", "3 | \n", "4 years | \n", "
102514 | \n", "7000 | \n", "711.54 | \n", "15.13% | \n", "36 months | \n", "major_purchase | \n", "18.91% | \n", "CO | \n", "MORTGAGE | \n", "3833.00 | \n", "650-654 | \n", "13 | \n", "12634 | \n", "0 | \n", "2 years | \n", "
78618 | \n", "7200 | \n", "7200.00 | \n", "18.75% | \n", "36 months | \n", "debt_consolidation | \n", "16.21% | \n", "MI | \n", "RENT | \n", "8333.33 | \n", "660-664 | \n", "8 | \n", "28916 | \n", "0 | \n", "10+ years | \n", "
86953 | \n", "10000 | \n", "10000.00 | \n", "14.09% | \n", "36 months | \n", "major_purchase | \n", "9.71% | \n", "CO | \n", "RENT | \n", "4583.33 | \n", "675-679 | \n", "6 | \n", "3859 | \n", "0 | \n", "< 1 year | \n", "
80129 | \n", "4000 | \n", "3925.00 | \n", "14.09% | \n", "36 months | \n", "credit_card | \n", "12.27% | \n", "FL | \n", "MORTGAGE | \n", "8583.33 | \n", "675-679 | \n", "9 | \n", "36943 | \n", "1 | \n", "10+ years | \n", "
85216 | \n", "17500 | \n", "17500.00 | \n", "8.90% | \n", "36 months | \n", "debt_consolidation | \n", "10.94% | \n", "UT | \n", "MORTGAGE | \n", "25000.00 | \n", "730-734 | \n", "9 | \n", "34545 | \n", "0 | \n", "6 years | \n", "
38247 | \n", "20000 | \n", "20000.00 | \n", "11.71% | \n", "36 months | \n", "credit_card | \n", "9.58% | \n", "SD | \n", "MORTGAGE | \n", "5416.67 | \n", "725-729 | \n", "12 | \n", "18267 | \n", "1 | \n", "4 years | \n", "
91245 | \n", "16200 | \n", "16200.00 | \n", "15.80% | \n", "60 months | \n", "debt_consolidation | \n", "7.92% | \n", "PA | \n", "MORTGAGE | \n", "4833.33 | \n", "680-684 | \n", "12 | \n", "12313 | \n", "2 | \n", "10+ years | \n", "
53041 | \n", "10000 | \n", "10000.00 | \n", "6.03% | \n", "36 months | \n", "small_business | \n", "13.03% | \n", "FL | \n", "RENT | \n", "5000.00 | \n", "760-764 | \n", "8 | \n", "3952 | \n", "0 | \n", "2 years | \n", "
63051 | \n", "27000 | \n", "27000.00 | \n", "6.62% | \n", "36 months | \n", "debt_consolidation | \n", "12.21% | \n", "OH | \n", "MORTGAGE | \n", "9250.00 | \n", "810-814 | \n", "12 | \n", "4211 | \n", "0 | \n", "5 years | \n", "
14446 | \n", "4500 | \n", "4475.00 | \n", "7.51% | \n", "36 months | \n", "small_business | \n", "20.27% | \n", "VA | \n", "MORTGAGE | \n", "7075.50 | \n", "720-724 | \n", "15 | \n", "68618 | \n", "2 | \n", "10+ years | \n", "
68628 | \n", "15875 | \n", "15875.00 | \n", "14.33% | \n", "36 months | \n", "small_business | \n", "17.44% | \n", "MD | \n", "MORTGAGE | \n", "3416.67 | \n", "675-679 | \n", "11 | \n", "15891 | \n", "0 | \n", "2 years | \n", "
98758 | \n", "15000 | \n", "15000.00 | \n", "10.16% | \n", "36 months | \n", "credit_card | \n", "28.28% | \n", "OH | \n", "MORTGAGE | \n", "6666.67 | \n", "690-694 | \n", "15 | \n", "14880 | \n", "0 | \n", "10+ years | \n", "
13070 | \n", "25000 | \n", "24950.00 | \n", "10.75% | \n", "36 months | \n", "debt_consolidation | \n", "20.48% | \n", "OR | \n", "MORTGAGE | \n", "7083.33 | \n", "765-769 | \n", "10 | \n", "25429 | \n", "0 | \n", "6 years | \n", "
45836 | \n", "7000 | \n", "7000.00 | \n", "17.27% | \n", "36 months | \n", "other | \n", "18.38% | \n", "NY | \n", "OWN | \n", "2464.37 | \n", "665-669 | \n", "9 | \n", "7089 | \n", "0 | \n", "3 years | \n", "
52330 | \n", "15000 | \n", "15000.00 | \n", "19.99% | \n", "36 months | \n", "wedding | \n", "18.05% | \n", "CA | \n", "RENT | \n", "8000.00 | \n", "660-664 | \n", "6 | \n", "45976 | \n", "1 | \n", "2 years | \n", "
48243 | \n", "17000 | \n", "17000.00 | \n", "15.81% | \n", "36 months | \n", "debt_consolidation | \n", "17.01% | \n", "CO | \n", "RENT | \n", "3833.33 | \n", "685-689 | \n", "6 | \n", "15484 | \n", "1 | \n", "6 years | \n", "
63256 | \n", "19075 | \n", "19075.00 | \n", "18.75% | \n", "36 months | \n", "debt_consolidation | \n", "15.23% | \n", "NY | \n", "RENT | \n", "5166.67 | \n", "670-674 | \n", "17 | \n", "13749 | \n", "3 | \n", "10+ years | \n", "
42124 | \n", "10000 | \n", "10000.00 | \n", "11.71% | \n", "36 months | \n", "debt_consolidation | \n", "8.40% | \n", "CA | \n", "RENT | \n", "4500.00 | \n", "710-714 | \n", "8 | \n", "8404 | \n", "1 | \n", "3 years | \n", "
78043 | \n", "8475 | \n", "8475.00 | \n", "7.62% | \n", "36 months | \n", "debt_consolidation | \n", "15.88% | \n", "CA | \n", "RENT | \n", "3983.33 | \n", "720-724 | \n", "9 | \n", "6882 | \n", "0 | \n", "n/a | \n", "
925 | \n", "6400 | \n", "6350.00 | \n", "10.08% | \n", "36 months | \n", "debt_consolidation | \n", "8.11% | \n", "NJ | \n", "MORTGAGE | \n", "5166.67 | \n", "710-714 | \n", "5 | \n", "5815 | \n", "2 | \n", "10+ years | \n", "
74047 | \n", "30000 | \n", "30000.00 | \n", "23.28% | \n", "60 months | \n", "other | \n", "12.10% | \n", "IL | \n", "MORTGAGE | \n", "7083.33 | \n", "675-679 | \n", "16 | \n", "17969 | \n", "1 | \n", "10+ years | \n", "
49957 | \n", "24000 | \n", "23975.00 | \n", "14.65% | \n", "36 months | \n", "debt_consolidation | \n", "15.29% | \n", "WA | \n", "MORTGAGE | \n", "6666.67 | \n", "685-689 | \n", "13 | \n", "17521 | \n", "0 | \n", "5 years | \n", "
23735 | \n", "30000 | \n", "29950.00 | \n", "16.77% | \n", "60 months | \n", "debt_consolidation | \n", "19.23% | \n", "NY | \n", "MORTGAGE | \n", "9250.00 | \n", "705-709 | \n", "15 | \n", "45880 | \n", "1 | \n", "8 years | \n", "
65882 | \n", "16000 | \n", "16000.00 | \n", "14.09% | \n", "60 months | \n", "home_improvement | \n", "21.54% | \n", "MD | \n", "OWN | \n", "8903.25 | \n", "740-744 | \n", "18 | \n", "18898 | \n", "1 | \n", "10+ years | \n", "
55610 | \n", "10000 | \n", "10000.00 | \n", "13.99% | \n", "36 months | \n", "debt_consolidation | \n", "4.89% | \n", "PA | \n", "MORTGAGE | \n", "2166.67 | \n", "680-684 | \n", "4 | \n", "4544 | \n", "0 | \n", "10+ years | \n", "
38576 | \n", "6000 | \n", "6000.00 | \n", "12.42% | \n", "36 months | \n", "major_purchase | \n", "16.66% | \n", "NJ | \n", "RENT | \n", "3500.00 | \n", "675-679 | \n", "8 | \n", "7753 | \n", "0 | \n", "5 years | \n", "
3116 | \n", "9000 | \n", "5242.75 | \n", "13.79% | \n", "36 months | \n", "debt_consolidation | \n", "6.76% | \n", "NY | \n", "RENT | \n", "3875.00 | \n", "670-674 | \n", "7 | \n", "7589 | \n", "0 | \n", "10+ years | \n", "
2500 rows × 14 columns
\n", "\n", "<class 'pandas.core.frame.DataFrame'>\n", "Int64Index: 2500 entries, 81174 to 3116\n", "Data columns (total 14 columns):\n", "Amount.Requested 2500 non-null values\n", "Amount.Funded.By.Investors 2500 non-null values\n", "Interest.Rate 2500 non-null values\n", "Loan.Length 2500 non-null values\n", "Loan.Purpose 2500 non-null values\n", "Debt.To.Income.Ratio 2500 non-null values\n", "State 2500 non-null values\n", "Home.Ownership 2500 non-null values\n", "Monthly.Income 2499 non-null values\n", "FICO.Range 2500 non-null values\n", "Open.CREDIT.Lines 2498 non-null values\n", "Revolving.CREDIT.Balance 2498 non-null values\n", "Inquiries.in.the.Last.6.Months 2498 non-null values\n", "Employment.Length 2500 non-null values\n", "dtypes: float64(5), int64(1), object(8)\n", "" ], "text/plain": [ "
\n", "<class 'pandas.core.frame.DataFrame'>\n", "Int64Index: 2500 entries, 0 to 2499\n", "Data columns (total 15 columns):\n", "index 2500 non-null values\n", "Amount.Requested 2500 non-null values\n", "Amount.Funded.By.Investors 2500 non-null values\n", "Interest.Rate 2500 non-null values\n", "Loan.Length 2500 non-null values\n", "Loan.Purpose 2500 non-null values\n", "Debt.To.Income.Ratio 2500 non-null values\n", "State 2500 non-null values\n", "Home.Ownership 2500 non-null values\n", "Monthly.Income 2499 non-null values\n", "FICO.Range 2500 non-null values\n", "Open.CREDIT.Lines 2498 non-null values\n", "Revolving.CREDIT.Balance 2498 non-null values\n", "Inquiries.in.the.Last.6.Months 2498 non-null values\n", "Employment.Length 2500 non-null values\n", "dtypes: float64(5), int64(2), object(8)\n", "" ], "text/plain": [ "
\n", "<class 'pandas.core.frame.DataFrame'>\n", "Int64Index: 5 entries, 0 to 4\n", "Data columns (total 15 columns):\n", "index 5 non-null values\n", "Amount.Requested 5 non-null values\n", "Amount.Funded.By.Investors 5 non-null values\n", "Interest.Rate 5 non-null values\n", "Loan.Length 5 non-null values\n", "Loan.Purpose 5 non-null values\n", "Debt.To.Income.Ratio 5 non-null values\n", "State 5 non-null values\n", "Home.Ownership 5 non-null values\n", "Monthly.Income 5 non-null values\n", "FICO.Range 5 non-null values\n", "Open.CREDIT.Lines 5 non-null values\n", "Revolving.CREDIT.Balance 5 non-null values\n", "Inquiries.in.the.Last.6.Months 5 non-null values\n", "Employment.Length 5 non-null values\n", "dtypes: float64(6), int64(2), object(7)\n", "" ], "text/plain": [ " index Amount.Requested Amount.Funded.By.Investors Interest.Rate \\\n", "0 81174 20000 20000 8.90 \n", "1 99592 19200 19200 12.12 \n", "2 80059 35000 35000 21.98 \n", "3 15825 10000 9975 9.99 \n", "4 33182 12000 12000 11.71 \n", "\n", " Loan.Length Loan.Purpose Debt.To.Income.Ratio State Home.Ownership \\\n", "0 36 months debt_consolidation 14.90% SC MORTGAGE \n", "1 36 months debt_consolidation 28.36% TX MORTGAGE \n", "2 60 months debt_consolidation 23.81% CA MORTGAGE \n", "3 36 months debt_consolidation 14.30% KS MORTGAGE \n", "4 36 months credit_card 18.78% NJ RENT \n", "\n", " Monthly.Income FICO.Range Open.CREDIT.Lines Revolving.CREDIT.Balance \\\n", "0 6541.67 735-739 14 14272 \n", "1 4583.33 715-719 12 11140 \n", "2 11500.00 690-694 14 21977 \n", "3 3833.33 695-699 10 9346 \n", "4 3195.00 695-699 11 14469 \n", "\n", " Inquiries.in.the.Last.6.Months Employment.Length \n", "0 2 < 1 year \n", "1 1 2 years \n", "2 1 2 years \n", "3 0 5 years \n", "4 0 9 years " ] }, "execution_count": 349, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ldb[0:5]" ] }, { "cell_type": "code", "execution_count": 350, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 8.90\n", "1 12.12\n", "2 21.98\n", "3 9.99\n", "4 11.71\n", "Name: Interest.Rate, dtype: float64" ] }, "execution_count": 350, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ldb['Interest.Rate'][0:5]" ] }, { "cell_type": "code", "execution_count": 351, "metadata": { "collapsed": false }, "outputs": [], "source": [ "srates = loansData['Interest.Rate']" ] }, { "cell_type": "code", "execution_count": 352, "metadata": { "collapsed": false }, "outputs": [], "source": [ "nopct = [ x[:-1] for x in srates ]" ] }, { "cell_type": "code", "execution_count": 353, "metadata": { "collapsed": false }, "outputs": [], "source": [ "flrates = [float(x) for x in nopct]\n" ] }, { "cell_type": "code", "execution_count": 354, "metadata": { "collapsed": false }, "outputs": [], "source": [ "rates = [float(x[:-1]) for x in srates] # use this" ] }, { "cell_type": "code", "execution_count": 355, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 355, "metadata": {}, "output_type": "execute_result" } ], "source": [ "flrates == rates" ] }, { "cell_type": "code", "execution_count": 356, "metadata": { "collapsed": false }, "outputs": [], "source": [ "loansData['Interest.Rate'] = flrates" ] }, { "cell_type": "code", "execution_count": 357, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "81174 8.90\n", "99592 12.12\n", "80059 21.98\n", "15825 9.99\n", "33182 11.71\n", "Name: Interest.Rate, dtype: float64" ] }, "execution_count": 357, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loansData['Interest.Rate'][0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conclusion of step 1\n", "* ok! whew! we're done with the % symbol stuff\n", "* we learnt quite a few things along the way that will be useful in the next part" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Remove the months" ] }, { "cell_type": "code", "execution_count": 358, "metadata": { "collapsed": false }, "outputs": [], "source": [ "withmons = ldb['Loan.Length'] " ] }, { "cell_type": "code", "execution_count": 358, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 359, "metadata": { "collapsed": false }, "outputs": [], "source": [ "wmons = withmons[0:]" ] }, { "cell_type": "code", "execution_count": 360, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 36 months\n", "1 36 months\n", "2 60 months\n", "3 36 months\n", "4 36 months\n", "Name: Loan.Length, dtype: object" ] }, "execution_count": 360, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wmons[0:5]" ] }, { "cell_type": "code", "execution_count": 361, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 36 months\n", "1 36 months\n", "2 60 months\n", "3 36 months\n", "4 36 months\n", "5 36 months\n", "6 36 months\n", "7 60 months\n", "8 36 months\n", "9 36 months\n", "10 36 months\n", "11 36 months\n", "12 60 months\n", "13 36 months\n", "14 36 months\n", "...\n", "2485 36 months\n", "2486 36 months\n", "2487 36 months\n", "2488 36 months\n", "2489 36 months\n", "2490 36 months\n", "2491 36 months\n", "2492 36 months\n", "2493 60 months\n", "2494 36 months\n", "2495 60 months\n", "2496 60 months\n", "2497 36 months\n", "2498 36 months\n", "2499 36 months\n", "Name: Loan.Length, Length: 2500, dtype: object" ] }, "execution_count": 361, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wmons" ] }, { "cell_type": "code", "execution_count": 362, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['36', 'months']" ] }, "execution_count": 362, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wmons[0].split()" ] }, { "cell_type": "code", "execution_count": 363, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'36'" ] }, "execution_count": 363, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wmons[0].split()[0]" ] }, { "cell_type": "code", "execution_count": 364, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "36" ] }, "execution_count": 364, "metadata": {}, "output_type": "execute_result" } ], "source": [ "int(wmons[0].split()[0])" ] }, { "cell_type": "code", "execution_count": 365, "metadata": { "collapsed": false }, "outputs": [], "source": [ "x = wmons[0].split()\n" ] }, { "cell_type": "code", "execution_count": 366, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'36'" ] }, "execution_count": 366, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x[0]" ] }, { "cell_type": "code", "execution_count": 367, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "36" ] }, "execution_count": 367, "metadata": {}, "output_type": "execute_result" } ], "source": [ "int(x[0])" ] }, { "cell_type": "code", "execution_count": 368, "metadata": { "collapsed": false }, "outputs": [], "source": [ "intmons = [ int(x.split()[0]) for x in wmons ] " ] }, { "cell_type": "code", "execution_count": 369, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[36, 36, 60, 36, 36, 36, 36, 60, 36, 36]" ] }, "execution_count": 369, "metadata": {}, "output_type": "execute_result" } ], "source": [ "intmons[0:10]" ] }, { "cell_type": "code", "execution_count": 370, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "81174 36 months\n", "99592 36 months\n", "80059 60 months\n", "15825 36 months\n", "33182 36 months\n", "62403 36 months\n", "48808 36 months\n", "22090 60 months\n", "76404 36 months\n", "15867 36 months\n", "94971 36 months\n", "36911 36 months\n", "41200 60 months\n", "83869 36 months\n", "53853 36 months\n", "...\n", "13070 36 months\n", "45836 36 months\n", "52330 36 months\n", "48243 36 months\n", "63256 36 months\n", "42124 36 months\n", "78043 36 months\n", "925 36 months\n", "74047 60 months\n", "49957 36 months\n", "23735 60 months\n", "65882 60 months\n", "55610 36 months\n", "38576 36 months\n", "3116 36 months\n", "Name: Loan.Length, Length: 2500, dtype: object" ] }, "execution_count": 370, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loansData['Loan.Length']" ] }, { "cell_type": "code", "execution_count": 371, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "36 months 1952\n", "60 months 548\n", "dtype: int64" ] }, "execution_count": 371, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loansData['Loan.Length'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conclusion of Step 2\n", "* Here we used the techniques we learned in Step 1.\n", " * Pull out a column from a data frame\n", " * Operate on it, perform some transformations\n", " * Replace the column in the original dataframe with this new column \n", "* We applied them to removing the ' months' suffix in the Loan.Length column." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Remove bad data" ] }, { "cell_type": "code", "execution_count": 372, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "count 2499.000000\n", "mean 5688.931321\n", "std 3963.118185\n", "min 588.500000\n", "25% 3500.000000\n", "50% 5000.000000\n", "75% 6800.000000\n", "max 102750.000000\n", "dtype: float64" ] }, "execution_count": 372, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loansData['Monthly.Income'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First remove implausible values. We see the max value to be 102750. \n", "This is a MONTHLY income of 100K dollars, which is certainly possible, but .... \n", "highly implausible for a person seeking a loan of a few 10's of K dollars, i.e. implausible in this context. " ] }, { "cell_type": "code", "execution_count": 373, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "54487" ] }, "execution_count": 373, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loansData['Monthly.Income'].idxmax() # find the place where the max occurs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But there's a better way - a row filter i.e. an expression used as a way to restrict the rows in a dataframe. \n", "In our case we want to eliminate rows above 100K dollars. i.e. only keep those less than 100K dollars." ] }, { "cell_type": "code", "execution_count": 374, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "81174 6541.67\n", "99592 4583.33\n", "80059 11500.00\n", "15825 3833.33\n", "33182 3195.00\n", "62403 4891.67\n", "48808 2916.67\n", "22090 13863.42\n", "76404 3150.00\n", "15867 5000.00\n", "94971 3575.00\n", "36911 4250.00\n", "41200 14166.67\n", "83869 9166.67\n", "53853 11250.00\n", "...\n", "13070 7083.33\n", "45836 2464.37\n", "52330 8000.00\n", "48243 3833.33\n", "63256 5166.67\n", "42124 4500.00\n", "78043 3983.33\n", "925 5166.67\n", "74047 7083.33\n", "49957 6666.67\n", "23735 9250.00\n", "65882 8903.25\n", "55610 2166.67\n", "38576 3500.00\n", "3116 3875.00\n", "Name: Monthly.Income, Length: 2498, dtype: float64" ] }, "execution_count": 374, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loansData['Monthly.Income'][loansData['Monthly.Income'] < 100000]" ] }, { "cell_type": "code", "execution_count": 374, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 375, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "count 2499.000000\n", "mean 5688.931321\n", "std 3963.118185\n", "min 588.500000\n", "25% 3500.000000\n", "50% 5000.000000\n", "75% 6800.000000\n", "max 102750.000000\n", "dtype: float64" ] }, "execution_count": 375, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loansData['Monthly.Income'].describe()" ] }, { "cell_type": "code", "execution_count": 376, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ldlt100 = ldb[ldb['Monthly.Income'] < 100000]" ] }, { "cell_type": "code", "execution_count": 377, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "<class 'pandas.core.frame.DataFrame'>\n", "Int64Index: 2498 entries, 0 to 2499\n", "Data columns (total 15 columns):\n", "index 2498 non-null values\n", "Amount.Requested 2498 non-null values\n", "Amount.Funded.By.Investors 2498 non-null values\n", "Interest.Rate 2498 non-null values\n", "Loan.Length 2498 non-null values\n", "Loan.Purpose 2498 non-null values\n", "Debt.To.Income.Ratio 2498 non-null values\n", "State 2498 non-null values\n", "Home.Ownership 2498 non-null values\n", "Monthly.Income 2498 non-null values\n", "FICO.Range 2498 non-null values\n", "Open.CREDIT.Lines 2497 non-null values\n", "Revolving.CREDIT.Balance 2497 non-null values\n", "Inquiries.in.the.Last.6.Months 2497 non-null values\n", "Employment.Length 2498 non-null values\n", "dtypes: float64(6), int64(2), object(7)\n", "" ], "text/plain": [ "
\n", "<class 'pandas.core.frame.DataFrame'>\n", "Int64Index: 2497 entries, 0 to 2499\n", "Data columns (total 15 columns):\n", "index 2497 non-null values\n", "Amount.Requested 2497 non-null values\n", "Amount.Funded.By.Investors 2497 non-null values\n", "Interest.Rate 2497 non-null values\n", "Loan.Length 2497 non-null values\n", "Loan.Purpose 2497 non-null values\n", "Debt.To.Income.Ratio 2497 non-null values\n", "State 2497 non-null values\n", "Home.Ownership 2497 non-null values\n", "Monthly.Income 2497 non-null values\n", "FICO.Range 2497 non-null values\n", "Open.CREDIT.Lines 2497 non-null values\n", "Revolving.CREDIT.Balance 2497 non-null values\n", "Inquiries.in.the.Last.6.Months 2497 non-null values\n", "Employment.Length 2497 non-null values\n", "dtypes: float64(6), int64(2), object(7)\n", "" ], "text/plain": [ "