{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 05\n", "\n", "## Data preparation and model evaluation exercise with credit scoring\n", "\n", "Banks play a crucial role in market economies. They decide who can get finance and on what terms, and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.\n", "\n", "Credit scoring algorithms, which estimate the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years. [Dataset](https://www.kaggle.com/c/GiveMeSomeCredit)\n", "\n", "Attribute Information:\n", "\n", "|Variable Name | Description | Type|\n", "|----|----|----|\n", "|SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse | Y/N|\n", "|RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of credit limits | percentage|\n", "|age | Age of borrower in years | integer|\n", "|NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due but no worse in the last 2 years | integer|\n", "|DebtRatio | Monthly debt payments, alimony, and living costs divided by monthly gross income | percentage|\n", "|MonthlyIncome | Monthly income | real|\n", "|NumberOfOpenCreditLinesAndLoans | Number of open loans (installment, such as a car loan or mortgage) and lines of credit (e.g. credit cards) | integer|\n", "|NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due | integer|\n", "|NumberRealEstateLoansOrLines | Number of mortgage and real estate loans, including home equity lines of credit | integer|\n", "|NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in the last 2 years | integer|\n", "|NumberOfDependents | Number of dependents in family, excluding the borrower (spouse, children, etc.) | integer|\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read the data into pandas." ] }, { "cell_type": "code", 
"execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr><th></th><th>Unnamed: 0</th><th>SeriousDlqin2yrs</th><th>RevolvingUtilizationOfUnsecuredLines</th><th>age</th><th>NumberOfTime30-59DaysPastDueNotWorse</th><th>DebtRatio</th><th>MonthlyIncome</th><th>NumberOfOpenCreditLinesAndLoans</th><th>NumberOfTimes90DaysLate</th><th>NumberRealEstateLoansOrLines</th><th>NumberOfTime60-89DaysPastDueNotWorse</th><th>NumberOfDependents</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>1</td><td>0.766127</td><td>45.0</td><td>2.0</td><td>0.802982</td><td>9120.0</td><td>13.0</td><td>0.0</td><td>6.0</td><td>0.0</td><td>2.0</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>0</td><td>0.957151</td><td>40.0</td><td>0.0</td><td>0.121876</td><td>2600.0</td><td>4.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td></tr>\n",
"    <tr><th>2</th><td>2</td><td>0</td><td>0.658180</td><td>38.0</td><td>1.0</td><td>0.085113</td><td>3042.0</td><td>2.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"    <tr><th>3</th><td>3</td><td>0</td><td>0.233810</td><td>30.0</td><td>0.0</td><td>0.036050</td><td>3300.0</td><td>5.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"    <tr><th>4</th><td>4</td><td>0</td><td>0.907239</td><td>49.0</td><td>1.0</td><td>0.024926</td><td>63588.0</td><td>7.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Unnamed: 0 SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age \\\n", "0 0 1 0.766127 45.0 \n", "1 1 0 0.957151 40.0 \n", "2 2 0 0.658180 38.0 \n", "3 3 0 0.233810 30.0 \n", "4 4 0 0.907239 49.0 \n", "\n", " NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome \\\n", "0 2.0 0.802982 9120.0 \n", "1 0.0 0.121876 2600.0 \n", "2 1.0 0.085113 3042.0 \n", "3 0.0 0.036050 3300.0 \n", "4 1.0 0.024926 63588.0 \n", "\n", " NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate \\\n", "0 13.0 0.0 \n", "1 4.0 0.0 \n", "2 2.0 1.0 \n", "3 5.0 0.0 \n", "4 7.0 0.0 \n", "\n", " NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse \\\n", "0 6.0 0.0 \n", "1 0.0 0.0 \n", "2 0.0 0.0 \n", "3 0.0 0.0 \n", "4 1.0 0.0 \n", "\n", " NumberOfDependents \n", "0 2.0 \n", "1 1.0 \n", "2 0.0 \n", "3 0.0 \n", "4 0.0 " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import zipfile\n", "\n", "import pandas as pd\n", "\n", "pd.set_option('display.max_columns', 500)\n", "\n", "# Read the CSV straight from the zip archive without extracting it to disk\n", "with zipfile.ZipFile('../datasets/KaggleCredit2.csv.zip', 'r') as z:\n", "    with z.open('KaggleCredit2.csv') as f:\n", "        data = pd.read_csv(f)\n", "\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Target: whether the borrower became seriously delinquent within two years\n", "y = data['SeriousDlqin2yrs']\n", "# Features: all remaining columns\n", "X = data.drop('SeriousDlqin2yrs', axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 5.1\n", "\n", "Impute the missing values of `age` and `NumberOfDependents`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 5.2\n", "\n", "From the set of features, select those that maximize the model's **F1 score** under K-fold cross-validation." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 5.3\n", "\n", "Now find the best set of features when selecting by AUC instead of the F1 score." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 1 }