{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## License \n", "\n", "Copyright 2017 - 2020 Patrick Hall and the H2O.ai team\n", "\n", "Licensed under the Apache License, Version 2.0 (the \"License\");\n", "you may not use this file except in compliance with the License.\n", "You may obtain a copy of the License at\n", "\n", " http://www.apache.org/licenses/LICENSE-2.0\n", "\n", "Unless required by applicable law or agreed to in writing, software\n", "distributed under the License is distributed on an \"AS IS\" BASIS,\n", "WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "See the License for the specific language governing permissions and\n", "limitations under the License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**DISCLAIMER:** This notebook is not legal compliance advice." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Testing machine learning models for accuracy, trustworthiness, and stability with Python and H2O\n", "#### Performing residual analysis and sensitivity analysis to validate complex models\n", "\n", "This notebook provides a basic introduction to two traditional data analysis and model diagnostic techniques that can be applied to machine learning models: residual analysis and sensitivity analysis. The notebook starts by loading the UCI credit card default dataset and using h2o to train a GBM model to predict credit card defaults. Then, residual analysis is used to discover and debug an issue with the GBM, and the GBM is retrained and improved. The notebook concludes by conducting sensitivity analysis to test the GBM credit card default model for fairness and stability. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Python imports\n", "In general, NumPy and Pandas will be used for data manipulation purposes and h2o will be used for modeling tasks. 
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# h2o Python API with specific classes\n", "import h2o \n", "from h2o.estimators.gbm import H2OGradientBoostingEstimator\n", "\n", "import numpy as np # array, vector, matrix calculations\n", "import pandas as pd # DataFrame handling\n", "\n", "pd.options.display.max_columns = 999 # enable display of all columns in notebook\n", "\n", "# plotting functionality\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "# display plots in notebook\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Start h2o\n", "H2o is both a library and a server. The machine learning algorithms in the library take advantage of the multithreaded and distributed architecture provided by the server to train machine learning algorithms extremely efficiently. The API for the library was imported above in cell 1, but the server still needs to be started." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.\n", "Attempting to start a local H2O server...\n", " Java Version: java version \"1.8.0_201\"; Java(TM) SE Runtime Environment (build 1.8.0_201-b09); Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)\n", " Starting server from /home/patrickh/anaconda3/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar\n", " Ice root: /tmp/tmpfq80fyby\n", " JVM stdout: /tmp/tmpfq80fyby/h2o_patrickh_started_from_python.out\n", " JVM stderr: /tmp/tmpfq80fyby/h2o_patrickh_started_from_python.err\n", " Server is running at http://127.0.0.1:54321\n", "Connecting to H2O server at http://127.0.0.1:54321 ... successful.\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "-------------------------- ---------------------------------------------------\n", "H2O cluster uptime: 01 secs\n", "H2O cluster timezone: America/New_York\n", "H2O data parsing timezone: UTC\n", "H2O cluster version: 3.26.0.3\n", "H2O cluster version age: 6 days\n", "H2O cluster name: H2O_from_python_patrickh_ov75l0\n", "H2O cluster total nodes: 1\n", "H2O cluster free memory: 1.778 Gb\n", "H2O cluster total cores: 8\n", "H2O cluster allowed cores: 8\n", "H2O cluster status: accepting new members, healthy\n", "H2O connection url: http://127.0.0.1:54321\n", "H2O connection proxy:\n", "H2O internal security: False\n", "H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4\n", "Python version: 3.6.4 final\n", "-------------------------- ---------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "h2o.init(max_mem_size='2G') # start h2o\n", "h2o.remove_all() # remove any existing data structures from h2o memory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Download, explore, and prepare UCI credit card default data\n", "\n", "UCI credit card default data: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients\n", "\n", "The UCI credit card default data contains demographic and payment information about credit card customers in Taiwan in the year 2005. The data set contains 23 input variables: \n", "\n", "* **`LIMIT_BAL`**: Amount of given credit (NT dollar)\n", "* **`SEX`**: 1 = male; 2 = female\n", "* **`EDUCATION`**: 1 = graduate school; 2 = university; 3 = high school; 4 = others \n", "* **`MARRIAGE`**: 1 = married; 2 = single; 3 = others\n", "* **`AGE`**: Age in years \n", "* **`PAY_0`, `PAY_2` - `PAY_6`**: History of past payment; `PAY_0` = the repayment status in September, 2005; `PAY_2` = the repayment status in August, 2005; ...; `PAY_6` = the repayment status in April, 2005. 
The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above. \n", "* **`BILL_AMT1` - `BILL_AMT6`**: Amount of bill statement (NT dollar). `BILL_AMT1` = amount of bill statement in September, 2005; `BILL_AMT2` = amount of bill statement in August, 2005; ...; `BILL_AMT6` = amount of bill statement in April, 2005. \n", "* **`PAY_AMT1` - `PAY_AMT6`**: Amount of previous payment (NT dollar). `PAY_AMT1` = amount paid in September, 2005; `PAY_AMT2` = amount paid in August, 2005; ...; `PAY_AMT6` = amount paid in April, 2005. \n", "\n", "These 23 input variables are used to predict the target variable, whether or not a customer defaulted on their credit card bill in late 2005.\n", "\n", "Because h2o accepts both numeric and character inputs, some variables will be recoded into more transparent character values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Import data and clean\n", "The credit card default data is available as an `.xls` file. Pandas can read `.xls` files directly with `read_excel()`, so it is used to load the credit card default data and give the prediction target a shorter name: `DEFAULT_NEXT_MONTH`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# import XLS file\n", "path = 'default_of_credit_card_clients.xls'\n", "data = pd.read_excel(path,\n", " skiprows=1)\n", "\n", "# remove spaces from target column name \n", "data = data.rename(columns={'default payment next month': 'DEFAULT_NEXT_MONTH'}) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Assign modeling roles\n", "The shorthand name `y` is assigned to the prediction target. `X` is assigned to all other input variables in the credit card default data except the row identifier, `ID`."
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "y = DEFAULT_NEXT_MONTH\n", "X = ['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']\n" ] } ], "source": [ "# assign target and inputs for GBM\n", "y = 'DEFAULT_NEXT_MONTH'\n", "X = [name for name in data.columns if name not in [y, 'ID']]\n", "print('y =', y)\n", "print('X =', X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Helper function for recoding values in the UCI credit card default data\n", "This simple function replaces the original integer codes of the categorical input variables with the longer, more understandable character string values from the UCI credit card default data dictionary. These character values can be used directly in h2o decision tree models, and the function returns the recoded Pandas DataFrame as an h2o object, an H2OFrame. H2o models cannot run on Pandas DataFrames; they require H2OFrames." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parse progress: |█████████████████████████████████████████████████████████| 100%\n" ] } ], "source": [ "def recode_cc_data(frame):\n", "    \n", "    \"\"\" Recodes numeric categorical variables into categorical character variables\n", "    with more transparent values. \n", "    \n", "    Args:\n", "        frame: Pandas DataFrame version of UCI credit card default data.\n", "    \n", "    Returns: \n", "        H2OFrame with recoded values.\n", "    \n", "    \"\"\"\n", "    \n", "    # define recoded values\n", "    sex_dict = {1:'male', 2:'female'}\n", "    education_dict = {0:'other', 1:'graduate school', 2:'university', 3:'high school', \n", "                      4:'other', 5:'other', 6:'other'}\n", "    marriage_dict = {0:'other', 1:'married', 2:'single', 3:'divorced'}\n", "    pay_dict = {-2:'no consumption', -1:'pay duly', 0:'use of revolving credit', 1:'1 month delay', \n", "                2:'2 month delay', 3:'3 month delay', 4:'4 month delay', 5:'5 month delay', 6:'6 month delay', \n", "                7:'7 month delay', 8:'8 month delay', 9:'9+ month delay'}\n", "    \n", "    # recode values using Pandas apply() and anonymous function\n", "    frame['SEX'] = frame['SEX'].apply(lambda i: sex_dict[i])\n", "    frame['EDUCATION'] = frame['EDUCATION'].apply(lambda i: education_dict[i]) \n", "    frame['MARRIAGE'] = frame['MARRIAGE'].apply(lambda i: marriage_dict[i]) \n", "    for name in frame.columns:\n", "        if name in ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']:\n", "            frame[name] = frame[name].apply(lambda i: pay_dict[i]) \n", "    \n", "    return h2o.H2OFrame(frame)\n", "\n", "data = recode_cc_data(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Ensure target is handled as a categorical variable\n", "In h2o, a numeric variable can be treated as numeric or categorical. The target variable `DEFAULT_NEXT_MONTH` takes on values of `0` or `1`. To ensure this numeric variable is treated as a categorical variable, the `asfactor()` function is used to explicitly declare that it is a categorical variable." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "data[y] = data[y].asfactor() " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Display descriptive statistics\n", "The h2o `describe()` function displays a brief description of the credit card default data. For the categorical input variables `SEX`, `EDUCATION`, `MARRIAGE`, and `PAY_0`-`PAY_6`, the new character values created above in cell 5 are visible. Basic descriptive statistics are displayed for numeric inputs such as `LIMIT_BAL` and `AGE`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Rows:30000\n", "Cols:25\n", "\n", "\n" ] }, { "data": { "text/html": [ "
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 DEFAULT_NEXT_MONTH
type int int enum enum enum int enum enum enum enum enum enum int int int int int int int int int int int int enum
mins 1.0 10000.0 21.0 -165580.0 -69777.0 -157264.0 -170000.0 -81334.0 -339603.0 0.0 0.0 0.0 0.0 0.0 0.0
mean 15000.5 167484.32266666688 35.48549999999994 51223.3309000000949179.0751666666847013.1547999997143262.9489666666 40311.4009666665338871.760399999915663.580500000014 5921.16350000001 5225.681500000005 4826.076866666661 4799.387633333302 5215.502566666664
maxs 30000.0 1000000.0 79.0 964511.0 983931.0 1664089.0 891586.0 927171.0 961664.0 873552.0 1684259.0 896040.0 621000.0 426529.0 528666.0
sigma 8660.398374208891129747.66156720225 9.21790406809016 73635.8605755295971173.7687825283669349.3874270368164332.8561339164160797.1557702648 59554.1075367457416563.28035402576323040.87040205722617606.96146980311515666.15974403199315278.30567914479317777.465775435332
zeros 0 0 0 2008 2506 2870 3195 3506 4020 5249 5396 5968 6408 6703 7173
missing0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1.0 20000.0 femaleuniversity married 24.0 2 month delay 2 month delay pay duly pay duly no consumption no consumption 3913.0 3102.0 689.0 0.0 0.0 0.0 0.0 689.0 0.0 0.0 0.0 0.0 1
1 2.0 120000.0 femaleuniversity single 26.0 pay duly 2 month delay use of revolving credituse of revolving credituse of revolving credit2 month delay 2682.0 1725.0 2682.0 3272.0 3455.0 3261.0 0.0 1000.0 1000.0 1000.0 0.0 2000.0 1
2 3.0 90000.0 femaleuniversity single 34.0 use of revolving credituse of revolving credituse of revolving credituse of revolving credituse of revolving credituse of revolving credit29239.0 14027.0 13559.0 14331.0 14948.0 15549.0 1518.0 1500.0 1000.0 1000.0 1000.0 5000.0 0
3 4.0 50000.0 femaleuniversity married 37.0 use of revolving credituse of revolving credituse of revolving credituse of revolving credituse of revolving credituse of revolving credit46990.0 48233.0 49291.0 28314.0 28959.0 29547.0 2000.0 2019.0 1200.0 1100.0 1069.0 1000.0 0
4 5.0 50000.0 male university married 57.0 pay duly use of revolving creditpay duly use of revolving credituse of revolving credituse of revolving credit8617.0 5670.0 35835.0 20940.0 19146.0 19131.0 2000.0 36681.0 10000.0 9000.0 689.0 679.0 0
5 6.0 50000.0 male graduate schoolsingle 37.0 use of revolving credituse of revolving credituse of revolving credituse of revolving credituse of revolving credituse of revolving credit64400.0 57069.0 57608.0 19394.0 19619.0 20024.0 2500.0 1815.0 657.0 1000.0 1000.0 800.0 0
6 7.0 500000.0 male graduate schoolsingle 29.0 use of revolving credituse of revolving credituse of revolving credituse of revolving credituse of revolving credituse of revolving credit367965.0 412023.0 445007.0 542653.0 483003.0 473944.0 55000.0 40000.0 38000.0 20239.0 13750.0 13770.0 0
7 8.0 100000.0 femaleuniversity single 23.0 use of revolving creditpay duly pay duly use of revolving credituse of revolving creditpay duly 11876.0 380.0 601.0 221.0 -159.0 567.0 380.0 601.0 0.0 581.0 1687.0 1542.0 0
8 9.0 140000.0 femalehigh school married 28.0 use of revolving credituse of revolving credit2 month delay use of revolving credituse of revolving credituse of revolving credit11285.0 14096.0 12108.0 12211.0 11793.0 3719.0 3329.0 0.0 432.0 1000.0 1000.0 1000.0 0
9 10.0 20000.0 male high school single 35.0 no consumption no consumption no consumption no consumption pay duly pay duly 0.0 0.0 0.0 0.0 13007.0 13912.0 0.0 0.0 0.0 13007.0 1122.0 0.0 0
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "data.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Train an H2O GBM classifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Split data into training and test sets for early stopping\n", "The credit card default data is split into training and test sets to monitor and prevent overtraining. Reproducibility is also an important factor in creating trustworthy models, and randomly splitting datasets can introduce randomness in model predictions and other results. A random seed is used here to ensure the data split is reproducible." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train data rows = 21060, columns = 25\n", "Test data rows = 8940, columns = 25\n" ] } ], "source": [ "# split into training (70%) and test (30%) sets\n", "train, test = data.split_frame([0.7], seed=12345)\n", "\n", "# summarize split\n", "print('Train data rows = %d, columns = %d' % (train.shape[0], train.shape[1]))\n", "print('Test data rows = %d, columns = %d' % (test.shape[0], test.shape[1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Train h2o GBM classifier\n", "Many tuning parameters must be specified to train a GBM using h2o. Typically a grid search would be performed to identify the best parameters for a given modeling task using the `H2OGridSearch` class. For brevity's sake, a previously discovered set of good tuning parameters is specified here. Because gradient boosting methods typically resample training data, an additional random seed is also specified for the h2o GBM using the `seed` parameter to create reproducible predictions, error rates, and variable importance values. To avoid overfitting, the `stopping_rounds` parameter is used to stop the training process after the test error fails to decrease for 5 iterations." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gbm Model Build progress: |███████████████████████████████████████████████| 100%\n", "GBM Test AUC = 0.7804\n" ] } ], "source": [ "# initialize GBM model\n", "model = H2OGradientBoostingEstimator(ntrees=150, # maximum 150 trees in GBM\n", " max_depth=4, # trees can have maximum depth of 4\n", " sample_rate=0.9, # use 90% of rows in each iteration (tree)\n", " col_sample_rate=0.9, # use 90% of variables in each iteration (tree)\n", " stopping_rounds=5, # stop if validation error does not decrease for 5 iterations (trees)\n", " seed=12345) # for reproducibility\n", "\n", "# train a GBM model\n", "model.train(y=y, x=X, training_frame=train, validation_frame=test)\n", "\n", "# print AUC\n", "print('GBM Test AUC = %.4f' % model.auc(valid=True))\n", "\n", "# uncomment to see model details\n", "# print(model) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Display variable importance\n", "During training, the h2o GBM aggregates the improvement in error caused by each split in each decision tree across all the decision trees in the ensemble classifier. These values are attributed to the input variable used in each split and give an indication of the contribution each input variable makes toward the model's predictions. The variable importance ranking should be consistent with human domain knowledge and reasonable expectations. In this case, a customer's most recent payment behavior, `PAY_0`, is by far the most important variable, followed by their second most recent payment behavior, `PAY_2`, their credit limit, `LIMIT_BAL`, and their third most recent payment behavior, `PAY_3`. This result is well aligned with business practices in credit lending: people who miss their most recent payments are likely to default soon."
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "model.varimp_plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Conduct residual analysis to debug model\n", "Residuals refer to the difference between the recorded value of a dependent variable and the predicted value of a dependent variable for every row in a data set. Plotting the residual values against the predicted values is a time-honored model assessment technique and a great way to see all your modeling results in two dimensions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Bind model predictions onto test data \n", "To calculate the residuals for our GBM model, the model predictions are first merged onto the test set. The test data is used here to see how the model behaves on holdout data, which should be closer to its behavior on new data than residuals calculated from the training inputs and predictions." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gbm prediction progress: |████████████████████████████████████████████████| 100%\n" ] } ], "source": [ "yhat = 'p_DEFAULT_NEXT_MONTH'\n", "preds1 = model.predict(test).drop(['predict', 'p0'])\n", "preds1.columns = [yhat]\n", "test_yhat = test.cbind(preds1[yhat])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Calculate deviance residuals for binomial classification\n", "For binomial classification, deviance residuals are related to the logloss cost function. Like $y - \\hat{y}$ in linear regression, these residuals are the quantities that the GBM sought to minimize. Deviance residual values are calculated by applying the simple formula in the cell directly below."
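, "\n", "\n", "As a sketch of that formula, assume $y_i$ is the recorded target for row $i$ (0 or 1), $\\hat{y}_i$ is the predicted probability of default (`p_DEFAULT_NEXT_MONTH`), and $s_i$ is a sign term; the subscripted symbols are notation introduced here for illustration. The deviance residual computed in the next cell can then be written as:\n", "\n", "$$r_i = s_i \\sqrt{-2 \\left[ y_i \\log(\\hat{y}_i) + (1 - y_i) \\log(1 - \\hat{y}_i) \\right]}$$\n", "\n", "where $s_i = 1$ if $y_i = 1$ and $s_i = -1$ if $y_i = 0$, and $\\log$ is the natural logarithm. For example, a customer who defaulted ($y_i = 1$) with a predicted probability of 0.5 would have $r_i = \\sqrt{-2 \\log(0.5)} \\approx 1.18$."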
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# use Pandas for adding columns and plotting\n", "test_yhat = test_yhat.as_data_frame()\n", "test_yhat['s'] = 1\n", "test_yhat.loc[test_yhat['DEFAULT_NEXT_MONTH'] == 0, 's'] = -1\n", "test_yhat['r_DEFAULT_NEXT_MONTH'] = test_yhat['s'] * np.sqrt(-2*(test_yhat[y]*np.log(test_yhat[yhat]) +\n", " ((1 - test_yhat[y])*np.log(1 - test_yhat[yhat]))))\n", "test_yhat = test_yhat.drop('s', axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Plot residuals\n", "Plotting residuals is a model debugging and diagnostic tool that enables users to see modeling results, and any anomalies, in a single two-dimensional plot. Here the green points represent customers who defaulted, and the blue points represent customers who did not. A few potential outliers are visible. There appear to be several cases in the test data with relatively large negative residuals. Understanding and addressing the factors that cause these outliers could lead to a more accurate model."
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "groups = test_yhat.groupby('DEFAULT_NEXT_MONTH') # define groups\n", "fig, ax_ = plt.subplots(figsize=(8, 8)) # initialize figure\n", "\n", "plt.xlabel('Predicted: DEFAULT_NEXT_MONTH')\n", "plt.ylabel('Residual: DEFAULT_NEXT_MONTH')\n", "\n", "# plot groups with appropriate color\n", "color_list = ['b', 'g'] \n", "c_idx = 0\n", "for name, group in groups:\n", " ax_.plot(group.p_DEFAULT_NEXT_MONTH, group.r_DEFAULT_NEXT_MONTH, label=' '.join(['DEFAULT_NEXT_MONTH:', str(name)]),\n", " marker='o', linestyle='', color=color_list[c_idx], alpha=0.3)\n", " c_idx += 1\n", "\n", "_ = ax_.legend(loc=1) # legend" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Sort data by residuals and display data and residuals\n", "Printing a table with model inputs, actual target values, and model predictions sorted by residuals is another simple way to analyze residuals. Customers that defaulted, but were predicted not to, are listed at the top of the table below. Scroll to the bottom of the table to see the customers who were predicted to default, but then did not. Also notice the jumps in residual values. These are the potential outliers pictured in the residual plot above. " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
[HTML table: `test_yhat` sorted by `r_DEFAULT_NEXT_MONTH` in descending order. Columns: ID, LIMIT_BAL, SEX, EDUCATION, MARRIAGE, AGE, PAY_0, PAY_2-PAY_6, BILL_AMT1-BILL_AMT6, PAY_AMT1-PAY_AMT6, DEFAULT_NEXT_MONTH, p_DEFAULT_NEXT_MONTH, r_DEFAULT_NEXT_MONTH. The first and last 30 rows are listed in the plain-text output that follows.]
\n", "

8940 rows × 27 columns

\n", "
" ], "text/plain": [ " ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE \\\n", "0 2561 310000 female graduate school single 32 \n", "1 3016 350000 male graduate school married 38 \n", "2 11462 210000 female graduate school single 46 \n", "3 25772 350000 female graduate school married 33 \n", "4 6933 500000 male graduate school single 37 \n", "5 22505 260000 female university single 33 \n", "6 22751 350000 female graduate school married 32 \n", "7 13381 400000 female university single 35 \n", "8 19530 350000 female university married 36 \n", "9 15549 450000 male university single 36 \n", "10 25692 330000 female graduate school single 42 \n", "11 971 300000 male graduate school married 42 \n", "12 22390 310000 female graduate school married 32 \n", "13 6854 290000 male high school single 34 \n", "14 8959 340000 male graduate school single 44 \n", "15 1980 500000 female graduate school married 35 \n", "16 18899 140000 female other single 28 \n", "17 2302 230000 female graduate school married 30 \n", "18 523 360000 male graduate school single 28 \n", "19 11745 220000 male graduate school single 51 \n", "20 8339 480000 male graduate school married 58 \n", "21 22712 320000 female high school married 35 \n", "22 26005 320000 female university married 35 \n", "23 13797 390000 male graduate school married 36 \n", "24 10147 450000 female graduate school married 46 \n", "25 1199 340000 female high school single 44 \n", "26 10754 160000 female university married 31 \n", "27 25816 350000 female university married 47 \n", "28 12668 210000 female graduate school married 37 \n", "29 15482 150000 male graduate school married 37 \n", "... ... ... ... ... ... ... 
\n", "8910 2865 50000 female university married 46 \n", "8911 8115 120000 female university single 26 \n", "8912 13422 100000 female graduate school single 29 \n", "8913 12846 90000 male university married 42 \n", "8914 1629 140000 female high school married 31 \n", "8915 4249 90000 male high school married 42 \n", "8916 9756 140000 male graduate school married 31 \n", "8917 27215 60000 male university married 35 \n", "8918 5986 100000 male high school married 44 \n", "8919 15186 30000 female graduate school single 25 \n", "8920 2974 30000 female university married 24 \n", "8921 10785 80000 female university married 33 \n", "8922 13290 230000 female graduate school married 34 \n", "8923 27255 80000 male graduate school married 46 \n", "8924 17703 60000 female university married 35 \n", "8925 13811 40000 male graduate school married 47 \n", "8926 16920 50000 male high school married 52 \n", "8927 17748 30000 female high school married 54 \n", "8928 18659 40000 female university married 28 \n", "8929 26565 200000 female high school married 55 \n", "8930 21098 200000 male graduate school married 42 \n", "8931 7068 90000 female graduate school single 30 \n", "8932 3087 30000 female university single 24 \n", "8933 14589 280000 male graduate school married 50 \n", "8934 16957 270000 male graduate school married 50 \n", "8935 29505 20000 male university married 40 \n", "8936 19316 110000 female graduate school married 41 \n", "8937 22725 100000 female university married 38 \n", "8938 9672 170000 male graduate school single 48 \n", "8939 5916 110000 female graduate school married 41 \n", "\n", " PAY_0 PAY_2 \\\n", "0 no consumption no consumption \n", "1 no consumption no consumption \n", "2 pay duly pay duly \n", "3 use of revolving credit pay duly \n", "4 pay duly pay duly \n", "5 pay duly pay duly \n", "6 pay duly pay duly \n", "7 use of revolving credit use of revolving credit \n", "8 use of revolving credit use of revolving credit \n", "9 no consumption no consumption 
\n", "10 no consumption no consumption \n", "11 pay duly use of revolving credit \n", "12 use of revolving credit pay duly \n", "13 use of revolving credit use of revolving credit \n", "14 use of revolving credit use of revolving credit \n", "15 use of revolving credit use of revolving credit \n", "16 use of revolving credit use of revolving credit \n", "17 pay duly pay duly \n", "18 pay duly pay duly \n", "19 pay duly pay duly \n", "20 no consumption no consumption \n", "21 use of revolving credit use of revolving credit \n", "22 pay duly pay duly \n", "23 no consumption no consumption \n", "24 pay duly pay duly \n", "25 use of revolving credit use of revolving credit \n", "26 use of revolving credit use of revolving credit \n", "27 use of revolving credit use of revolving credit \n", "28 use of revolving credit use of revolving credit \n", "29 no consumption no consumption \n", "... ... ... \n", "8910 2 month delay 2 month delay \n", "8911 3 month delay 3 month delay \n", "8912 2 month delay 2 month delay \n", "8913 2 month delay 2 month delay \n", "8914 2 month delay 2 month delay \n", "8915 2 month delay 2 month delay \n", "8916 2 month delay use of revolving credit \n", "8917 2 month delay 2 month delay \n", "8918 2 month delay 2 month delay \n", "8919 2 month delay 2 month delay \n", "8920 2 month delay 2 month delay \n", "8921 2 month delay 2 month delay \n", "8922 2 month delay 2 month delay \n", "8923 2 month delay 2 month delay \n", "8924 2 month delay 2 month delay \n", "8925 2 month delay 2 month delay \n", "8926 2 month delay 2 month delay \n", "8927 2 month delay 2 month delay \n", "8928 2 month delay 2 month delay \n", "8929 2 month delay 2 month delay \n", "8930 2 month delay 2 month delay \n", "8931 2 month delay 2 month delay \n", "8932 2 month delay 2 month delay \n", "8933 3 month delay 5 month delay \n", "8934 2 month delay 4 month delay \n", "8935 1 month delay 2 month delay \n", "8936 3 month delay 2 month delay \n", "8937 3 month delay 2 
month delay \n", "8938 2 month delay 2 month delay \n", "8939 2 month delay 2 month delay \n", "\n", " PAY_3 PAY_4 \\\n", "0 no consumption no consumption \n", "1 pay duly use of revolving credit \n", "2 pay duly use of revolving credit \n", "3 pay duly pay duly \n", "4 pay duly pay duly \n", "5 pay duly pay duly \n", "6 no consumption no consumption \n", "7 use of revolving credit use of revolving credit \n", "8 use of revolving credit use of revolving credit \n", "9 no consumption no consumption \n", "10 no consumption no consumption \n", "11 use of revolving credit pay duly \n", "12 use of revolving credit use of revolving credit \n", "13 use of revolving credit use of revolving credit \n", "14 use of revolving credit use of revolving credit \n", "15 use of revolving credit use of revolving credit \n", "16 pay duly use of revolving credit \n", "17 use of revolving credit use of revolving credit \n", "18 pay duly use of revolving credit \n", "19 pay duly pay duly \n", "20 no consumption no consumption \n", "21 use of revolving credit use of revolving credit \n", "22 pay duly pay duly \n", "23 no consumption no consumption \n", "24 pay duly pay duly \n", "25 use of revolving credit use of revolving credit \n", "26 use of revolving credit pay duly \n", "27 use of revolving credit use of revolving credit \n", "28 pay duly pay duly \n", "29 no consumption no consumption \n", "... ... ... 
\n", "8910 2 month delay 2 month delay \n", "8911 2 month delay 2 month delay \n", "8912 2 month delay 2 month delay \n", "8913 2 month delay 2 month delay \n", "8914 2 month delay 2 month delay \n", "8915 2 month delay 3 month delay \n", "8916 use of revolving credit 2 month delay \n", "8917 2 month delay 2 month delay \n", "8918 2 month delay 2 month delay \n", "8919 2 month delay 2 month delay \n", "8920 2 month delay 2 month delay \n", "8921 2 month delay 2 month delay \n", "8922 2 month delay 2 month delay \n", "8923 2 month delay 2 month delay \n", "8924 2 month delay 2 month delay \n", "8925 2 month delay 2 month delay \n", "8926 2 month delay 2 month delay \n", "8927 2 month delay 2 month delay \n", "8928 3 month delay 2 month delay \n", "8929 3 month delay 2 month delay \n", "8930 2 month delay 2 month delay \n", "8931 3 month delay 3 month delay \n", "8932 7 month delay 7 month delay \n", "8933 4 month delay 3 month delay \n", "8934 3 month delay 3 month delay \n", "8935 3 month delay 2 month delay \n", "8936 2 month delay 7 month delay \n", "8937 2 month delay 3 month delay \n", "8938 7 month delay 7 month delay \n", "8939 7 month delay 7 month delay \n", "\n", " PAY_5 PAY_6 BILL_AMT1 BILL_AMT2 \\\n", "0 no consumption no consumption 20138 8267 \n", "1 use of revolving credit no consumption 16459 4120 \n", "2 use of revolving credit pay duly 15655 3918 \n", "3 pay duly pay duly 82964 68532 \n", "4 pay duly pay duly 4331 60446 \n", "5 pay duly use of revolving credit 5188 12357 \n", "6 no consumption no consumption 30625 60003 \n", "7 use of revolving credit use of revolving credit 109943 222085 \n", "8 use of revolving credit use of revolving credit 21026 35588 \n", "9 no consumption no consumption 8012 4009 \n", "10 no consumption no consumption 565 20650 \n", "11 use of revolving credit use of revolving credit 11973 61834 \n", "12 use of revolving credit use of revolving credit 4762 26943 \n", "13 use of revolving credit use of revolving credit 5451 
6230 \n", "14 use of revolving credit use of revolving credit 83059 85634 \n", "15 use of revolving credit use of revolving credit 35176 36193 \n", "16 use of revolving credit use of revolving credit 108018 6500 \n", "17 use of revolving credit use of revolving credit 2212 17402 \n", "18 use of revolving credit pay duly 1210 820 \n", "19 pay duly pay duly 20730 -270 \n", "20 no consumption no consumption 24610 -310 \n", "21 use of revolving credit use of revolving credit 157249 148123 \n", "22 use of revolving credit use of revolving credit 2276 6626 \n", "23 pay duly pay duly 3931 3625 \n", "24 pay duly pay duly 28205 3760 \n", "25 use of revolving credit use of revolving credit 142836 145125 \n", "26 pay duly use of revolving credit 42781 42774 \n", "27 use of revolving credit use of revolving credit 97500 84202 \n", "28 pay duly pay duly 24547 48302 \n", "29 no consumption no consumption 22109 10876 \n", "... ... ... ... ... \n", "8910 2 month delay 2 month delay 28390 29639 \n", "8911 3 month delay 2 month delay 12034 12548 \n", "8912 2 month delay 2 month delay 74032 75557 \n", "8913 2 month delay use of revolving credit 95773 95489 \n", "8914 2 month delay 2 month delay 89910 92588 \n", "8915 3 month delay 3 month delay 48674 49895 \n", "8916 2 month delay 2 month delay 51028 52112 \n", "8917 2 month delay 2 month delay 20195 21267 \n", "8918 2 month delay 2 month delay 30076 31287 \n", "8919 2 month delay 2 month delay 7593 9634 \n", "8920 2 month delay 2 month delay 150 150 \n", "8921 2 month delay 2 month delay 53843 55933 \n", "8922 2 month delay 2 month delay 190784 195724 \n", "8923 2 month delay 2 month delay 40509 40551 \n", "8924 2 month delay 2 month delay 3167 5601 \n", "8925 2 month delay 2 month delay 11084 12605 \n", "8926 4 month delay 3 month delay 36428 37530 \n", "8927 4 month delay 3 month delay 22147 24770 \n", "8928 2 month delay 2 month delay 31131 33815 \n", "8929 2 month delay 2 month delay 159017 162697 \n", "8930 2 month delay 2 
month delay 168289 172001 \n", "8931 3 month delay 3 month delay 750 750 \n", "8932 7 month delay 7 month delay 300 300 \n", "8933 2 month delay use of revolving credit 327918 321476 \n", "8934 2 month delay 2 month delay 213616 208784 \n", "8935 3 month delay 3 month delay 14829 17267 \n", "8936 7 month delay 7 month delay 150 150 \n", "8937 3 month delay 3 month delay 750 750 \n", "8938 7 month delay 7 month delay 2400 2400 \n", "8939 7 month delay 7 month delay 150 150 \n", "\n", " BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 \\\n", "0 65993 8543 1695 750 8267 66008 \n", "1 44164 35233 884 9924 941 44743 \n", "2 29881 24247 21664 1556 4854 30366 \n", "3 17926 17966 30741 31088 68940 18018 \n", "4 30592 154167 13410 25426 60446 30594 \n", "5 28656 7497 7685 15434 13000 29022 \n", "6 7147 9950 22117 4874 60396 7147 \n", "7 223350 213831 210563 211925 120018 10071 \n", "8 38002 40357 43663 52735 15000 3000 \n", "9 5226 4715 3275 6422 4021 5241 \n", "10 15360 0 12923 1816 20650 15360 \n", "11 25145 37666 19453 10492 20979 5000 \n", "12 7488 10276 96059 6434 26943 5000 \n", "13 140802 143354 146225 148820 1200 135000 \n", "14 73950 59324 156094 110234 20000 5000 \n", "15 44157 48322 21593 13866 2504 10004 \n", "16 6327 138485 140492 141006 1000 6327 \n", "17 32450 17285 9766 9981 17402 20013 \n", "18 64644 125984 106584 125557 390 75720 \n", "19 53895 -105 20895 20835 0 54165 \n", "20 148544 18791 5909 68988 4 149654 \n", "21 142852 133583 129049 128325 6000 6048 \n", "22 11131 13824 17992 15250 6626 12446 \n", "23 1600 3815 8330 4765 3625 1600 \n", "24 4148 2312 6909 4189 3793 4148 \n", "25 146682 150407 147868 149349 7000 5500 \n", "26 41817 749 5572 10573 2300 2300 \n", "27 82933 80501 79038 80694 3010 2970 \n", "28 4549 3085 7300 6583 4519 9098 \n", "29 10268 5872 3068 2181 10943 10273 \n", "... ... ... ... ... ... ... 
\n", "8910 30854 30062 32705 33519 2000 2000 \n", "8911 12056 13958 13468 6144 1000 0 \n", "8912 76434 74611 79292 80945 3300 2700 \n", "8913 94681 93965 90545 90529 4000 3500 \n", "8914 91936 94623 94952 95234 5000 1800 \n", "8915 52570 53614 54534 53374 2300 4116 \n", "8916 55232 55932 54910 57344 2500 4600 \n", "8917 21332 21680 23011 23498 1700 700 \n", "8918 31676 32259 31608 33524 2000 1200 \n", "8919 4476 8830 8153 6422 2379 7 \n", "8920 150 150 150 300 0 0 \n", "8921 56575 57303 58593 59738 3500 2100 \n", "8922 198707 201634 205949 210077 9300 7500 \n", "8923 42592 43296 43892 43060 1000 3000 \n", "8924 5366 6772 6515 7906 2500 0 \n", "8925 13102 12595 14386 14005 2000 1000 \n", "8926 38630 41774 40806 41357 2000 2000 \n", "8927 26068 28842 28094 27361 3000 2000 \n", "8928 33002 32173 34629 33940 3500 0 \n", "8929 163143 161906 165807 169599 9159 4842 \n", "8930 175281 177895 180078 184048 8000 7500 \n", "8931 750 750 2450 2150 0 0 \n", "8932 300 300 300 300 0 0 \n", "8933 314931 176439 154010 134334 0 0 \n", "8934 212058 207226 202394 231339 0 8000 \n", "8935 16706 18694 19049 18459 3000 0 \n", "8936 150 150 150 150 0 0 \n", "8937 750 750 750 750 0 0 \n", "8938 2400 2400 2400 2400 0 0 \n", "8939 150 150 150 150 0 0 \n", "\n", " PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 DEFAULT_NEXT_MONTH \\\n", "0 8543 1695 750 7350 1 \n", "1 0 884 9924 10824 1 \n", "2 0 433 1556 14047 1 \n", "3 18058 30897 31244 88461 1 \n", "4 150843 163881 25426 39526 1 \n", "5 7500 27769 12000 6200 1 \n", "6 9950 22117 4874 0 1 \n", "7 8037 8018 8809 5022 1 \n", "8 3000 4000 10000 25000 1 \n", "9 4729 3284 6441 4285 1 \n", "10 0 12923 1816 17050 1 \n", "11 37676 8808 2000 2709 1 \n", "12 6000 93000 3000 5000 1 \n", "13 5200 5500 5500 5400 1 \n", "14 2000 112000 4234 4000 1 \n", "15 5178 1047 2019 1004 1 \n", "16 135000 4700 5000 5000 1 \n", "17 346 5000 8000 5000 1 \n", "18 62520 17000 132200 167000 1 \n", "19 0 21000 20940 33460 1 \n", "20 18885 5940 69337 200655 1 \n", "21 6000 5000 5000 
5000 1 \n", "22 17746 6000 5749 928 1 \n", "23 3315 11645 4765 2171 1 \n", "24 2312 6909 4189 1539 1 \n", "25 6027 5328 5390 6047 1 \n", "26 749 5572 5573 13793 1 \n", "27 2886 2824 2925 2987 1 \n", "28 3085 7300 6583 5060 1 \n", "29 5978 3068 2181 3242 1 \n", "... ... ... ... ... ... \n", "8910 0 3300 1500 0 0 \n", "8911 2400 100 0 57258 0 \n", "8912 0 5900 3100 0 0 \n", "8913 3500 0 3500 4200 0 \n", "8914 5000 1900 4700 0 0 \n", "8915 2500 2052 0 0 0 \n", "8916 2200 0 3513 3000 0 \n", "8917 1000 2000 1000 0 0 \n", "8918 1400 0 2600 0 0 \n", "8919 7002 13 155 1 0 \n", "8920 0 0 150 0 0 \n", "8921 2200 2300 2200 2100 0 \n", "8922 7500 7500 7500 7600 0 \n", "8923 1700 1600 0 3500 0 \n", "8924 1500 0 1500 0 0 \n", "8925 0 2000 0 2000 0 \n", "8926 4086 0 1500 1000 0 \n", "8927 3500 0 0 1000 0 \n", "8928 0 3000 0 2000 0 \n", "8929 3000 8000 7000 3000 0 \n", "8930 7000 6600 7000 7100 0 \n", "8931 0 2000 0 0 0 \n", "8932 0 0 0 0 0 \n", "8933 500 0 6267 2257 0 \n", "8934 0 0 32236 3000 0 \n", "8935 2560 955 0 661 0 \n", "8936 0 0 0 0 0 \n", "8937 0 0 0 1500 0 \n", "8938 0 0 0 0 0 \n", "8939 0 0 0 0 0 \n", "\n", " p_DEFAULT_NEXT_MONTH r_DEFAULT_NEXT_MONTH \n", "0 0.045837 2.483007 \n", "1 0.050230 2.445869 \n", "2 0.050527 2.443456 \n", "3 0.051503 2.435615 \n", "4 0.051717 2.433910 \n", "5 0.053061 2.423347 \n", "6 0.056650 2.396193 \n", "7 0.058753 2.380929 \n", "8 0.058911 2.379801 \n", "9 0.059060 2.378739 \n", "10 0.059089 2.378531 \n", "11 0.060745 2.366883 \n", "12 0.061522 2.361507 \n", "13 0.061900 2.358911 \n", "14 0.061935 2.358675 \n", "15 0.062027 2.358041 \n", "16 0.063140 2.350490 \n", "17 0.063386 2.348836 \n", "18 0.063629 2.347206 \n", "19 0.063813 2.345973 \n", "20 0.063860 2.345661 \n", "21 0.065316 2.336031 \n", "22 0.065442 2.335207 \n", "23 0.065657 2.333801 \n", "24 0.065732 2.333313 \n", "25 0.065756 2.333153 \n", "26 0.066522 2.328184 \n", "27 0.066585 2.327775 \n", "28 0.066960 2.325363 \n", "29 0.067985 2.318820 \n", "... ... ... 
\n", "8910 0.772660 -1.721227 \n", "8911 0.772848 -1.721707 \n", "8912 0.774303 -1.725433 \n", "8913 0.775734 -1.729118 \n", "8914 0.776880 -1.732076 \n", "8915 0.779941 -1.740033 \n", "8916 0.780607 -1.741776 \n", "8917 0.781268 -1.743508 \n", "8918 0.784777 -1.752759 \n", "8919 0.787675 -1.760476 \n", "8920 0.793444 -1.776054 \n", "8921 0.794019 -1.777621 \n", "8922 0.796132 -1.783414 \n", "8923 0.797857 -1.788172 \n", "8924 0.802237 -1.800380 \n", "8925 0.805651 -1.810028 \n", "8926 0.807385 -1.814973 \n", "8927 0.812416 -1.829496 \n", "8928 0.814786 -1.836434 \n", "8929 0.815782 -1.839367 \n", "8930 0.816460 -1.841371 \n", "8931 0.825031 -1.867161 \n", "8932 0.825105 -1.867387 \n", "8933 0.832569 -1.890599 \n", "8934 0.841972 -1.920928 \n", "8935 0.852781 -1.957464 \n", "8936 0.866568 -2.007067 \n", "8937 0.869051 -2.016405 \n", "8938 0.874018 -2.035492 \n", "8939 0.886468 -2.085985 \n", "\n", "[8940 rows x 27 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_yhat = test_yhat.sort_values(by='r_DEFAULT_NEXT_MONTH', ascending=False).reset_index(drop=True)\n", "test_yhat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This simple analysis has uncovered some of the most difficult customers for the GBM to correctly predict default. Perhaps because of the high importance of the payment features, `PAY_0`-`PAY_6`, the GBM struggles to correctly predict several cases in which customers made timely recent payments and then suddenly defaulted (high positive residuals) and those customers that were chronically late making payments but did not default (high negative residuals)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Plot residuals by most important input variable \n", "Residuals can also be plotted for important input variables to understand how the values of a single input variable affect prediction errors. 
When plotted by `PAY_0`, the residuals confirm that the GBM is struggling to accurately predict cases where default status is not correlated with recent payment behavior in an obvious way. The residual plots for values of `PAY_0` indicating timely payment behavior (e.g., `use of revolving credit`, `pay duly`, and `no consumption`) generally display the highest positive residuals and relatively small negative residuals. Residuals for the other values of `PAY_0`, those that represent late recent payments, tend to show large negative residuals and relatively small positive residuals." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# use Seaborn FacetGrid for convenience\n", "g = sns.FacetGrid(test_yhat, row='PAY_0', hue=y)\n", "_ = g.map(plt.scatter, yhat, 'r_DEFAULT_NEXT_MONTH', alpha=0.4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Retrain GBM classifier based on results of residual analysis\n", "Now that an issue has been discovered using residual analysis, can it be resolved? \n", "\n", "#### Create a feature that contains information about behavior over time\n", "One strategy to improve prediction accuracy is to introduce a new feature that summarizes a customer's spending behavior over time to expose any potential financial instability: the standard deviation of a customer's bill amounts over six months. Pandas has a one-liner for calculating standard deviations for a set of columns, so the H2OFrame is cast back into a Pandas DataFrame for convenience." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 \\\n", "0 1 20000 female university married 24 2 month delay \n", "1 2 120000 female university single 26 pay duly \n", "2 3 90000 female university single 34 use of revolving credit \n", "\n", " PAY_2 PAY_3 PAY_4 \\\n", "0 2 month delay pay duly pay duly \n", "1 2 month delay use of revolving credit use of revolving credit \n", "2 use of revolving credit use of revolving credit use of revolving credit \n", "\n", " PAY_5 PAY_6 BILL_AMT1 BILL_AMT2 \\\n", "0 no consumption no consumption 3913 3102 \n", "1 use of revolving credit 2 month delay 2682 1725 \n", "2 use of revolving credit use of revolving credit 29239 14027 \n", "\n", " BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 \\\n", "0 689 0 0 0 0 689 0 \n", "1 2682 3272 3455 3261 0 1000 1000 \n", "2 13559 14331 14948 15549 1518 1500 1000 \n", "\n", " PAY_AMT4 PAY_AMT5 PAY_AMT6 DEFAULT_NEXT_MONTH bill_std \n", "0 0 0 0 1 1761.633219 \n", "1 1000 0 2000 1 637.967841 \n", "2 1000 1000 5000 0 6064.518593 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = data.as_data_frame()\n", "data['bill_std'] = data[['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']].std(axis=1)\n", "data.head(n=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Convert Pandas DataFrame back to H2OFrame for modeling\n", "To retrain the model with the new feature, an H2OFrame is required and that H2OFrame is split using the same proportion and random seed as in cell 8 for the first GBM model." 
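, "\n", "\n", "Before the conversion, the new `bill_std` column can be sanity-checked by hand. The sketch below is a minimal illustration rather than part of the original workflow: it copies the bill amounts of the first two rows from the output above and assumes the pandas default of a sample standard deviation (`ddof=1`). The names `bills` and `manual` are introduced only for this example.\n", "\n", "
```python
import numpy as np
import pandas as pd

bill_cols = ['BILL_AMT%d' % i for i in range(1, 7)]

# bill amounts of the first two rows, copied from the output above
bills = pd.DataFrame([[3913, 3102, 689, 0, 0, 0],
                      [2682, 1725, 2682, 3272, 3455, 3261]],
                     columns=bill_cols)

# same one-liner as in the notebook: row-wise standard deviation
# (pandas uses the sample standard deviation, ddof=1, by default)
bill_std = bills[bill_cols].std(axis=1)

# explicit NumPy version of the same quantity, for comparison
values = bills[bill_cols].values
centered = values - values.mean(axis=1, keepdims=True)
manual = np.sqrt((centered ** 2).sum(axis=1) / (len(bill_cols) - 1))

print(bill_std.round(6).tolist())
print(np.allclose(bill_std.values, manual))
```
", "\n", "Both rows should reproduce the `bill_std` values displayed above (about 1761.63 and 637.97), which supports the `ddof=1` assumption."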
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parse progress: |█████████████████████████████████████████████████████████| 100%\n" ] } ], "source": [ "data = h2o.H2OFrame(data) # convert \n", "data[y] = data[y].asfactor() # ensure target is handled as a categorical variable\n", "train, test = data.split_frame([0.7], seed=12345) # split into training and validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Retrain GBM with new feature\n", "The `train()` function is used to retrain the GBM model with nearly the same hyperparameters used previously in cell 9. A slight, but noticeable, increase in accuracy results from retraining with the new feature. " ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gbm Model Build progress: |███████████████████████████████████████████████| 100%\n", "GBM Test AUC = 0.7825\n" ] } ], "source": [ "# initialize GBM model\n", "model = H2OGradientBoostingEstimator(ntrees=150, # maximum 150 trees in GBM\n", " max_depth=6, # trees can have maximum depth of 6\n", " sample_rate=0.9, # use 90% of rows in each iteration (tree)\n", " col_sample_rate=0.85, # use 85% of variables in each iteration (tree)\n", " stopping_rounds=5, # stop if validation error does not decrease for 5 iterations (trees)\n", " seed=12345) # for reproducibility\n", "\n", "# retrain GBM model\n", "model.train(y=y,\n", " x=X + ['bill_std'], # add new feature\n", " training_frame=train, \n", " validation_frame=test)\n", "\n", "# print AUC\n", "print('GBM Test AUC = %.4f' % model.auc(valid=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While there may be other more complex features or a more optimal set of hyperparameters that could lead to further incremental increases in accuracy, more information is needed to achieve meaningful improvement in prediction 
performance. In particular, a common measure for credit lending, the customers' debt-to-income ratio, for each payment and billing period could be particularly useful. Spikes in debt-to-income ratio, representing loss of income or large increases in debt, would likely be very indicative of a default and would expose the GBM to information not currently available in the UCI credit card default data. Introducing new data could also de-emphasize `PAY_0`, which would likely result in a more stable model as well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Perform sensitivity analysis to test model performance on unseen data\n", "\n", "Sensitivity analysis investigates whether model behavior and outputs remain stable when data is intentionally perturbed or other changes are simulated in data. Beyond traditional assessment practices, sensitivity analysis of machine learning model predictions is perhaps the most important validation technique for machine learning models. Machine learning models can make drastically differing predictions for only minor changes in input variable values. In practice, many linear model validation techniques focus on the numerical instability of regression parameters due to correlation between input variables or between input variables and the dependent variable. It may be prudent for those switching from linear modeling techniques to machine learning techniques to focus less on numerical instability of model parameters and to focus more on the potential instability of model predictions.\n", "\n", "Here sensitivity analysis is used to understand the impact of changing the most important input variable, `PAY_0`, and the impact of a sociologically sensitive variable, `SEX`, in the model. If the model changes in reasonable and expected ways when important variable values are changed this can enhance trust in the model. 
If the contribution of potentially sensitive variables, such as those related to gender, race, age, marital status, or disability status, can be shown to have minimal impact on the model, this is an indication of fairness in the model predictions and can also increase overall trust in the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Bind new model predictions onto test data \n", "Typically, a productive exercise in model debugging and validation is to investigate customers with very high or low predicted probabilities to determine if their predictions stay within reasonable bounds when important variables are changed. The predictions from the new, more accurate model are merged onto the test set to find these potentially interesting customers. " ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gbm prediction progress: |████████████████████████████████████████████████| 100%\n" ] } ], "source": [ "preds2 = model.predict(test).drop(['predict', 'p0'])\n", "preds2.columns = [yhat]\n", "test_yhat = test.cbind(preds2[yhat])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Helper function for finding percentile indices\n", "The function below finds and returns the row indices for the minimum, the maximum, and the deciles of one column in terms of another -- in this case, the model predictions (`p_DEFAULT_NEXT_MONTH`) and the row identifier (`ID`), respectively. These indices are used as a starting point for boundary testing. Outlying predictions found through residual analysis are another group of potentially interesting local predictions to investigate."
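, "\n", "\n", "As a complement to the percentile-based selection, the rows with the largest residuals can be picked out in a similar way. The sketch below is only an illustration under a stated assumption: it treats `r_DEFAULT_NEXT_MONTH` as a signed deviance residual, which is consistent with the values printed in the residual analysis above. The function `deviance_residuals`, the toy frame, and its `ID` values are hypothetical and introduced only for this example.\n", "\n", "
```python
import numpy as np
import pandas as pd

def deviance_residuals(y_true, p_hat):
    # signed deviance residual for a binary target (assumed definition):
    # sign(y - p) * sqrt(-2 * log-likelihood of the observed label)
    y_true = np.asarray(y_true, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    log_lik = y_true * np.log(p_hat) + (1.0 - y_true) * np.log(1.0 - p_hat)
    return np.sign(y_true - p_hat) * np.sqrt(-2.0 * log_lik)

# toy frame: the first two probabilities are copied from the first and last
# rows of the sorted residual output above, the third row is made up
toy = pd.DataFrame({'ID': [101, 102, 103],  # hypothetical IDs
                    'DEFAULT_NEXT_MONTH': [1, 0, 0],
                    'p_DEFAULT_NEXT_MONTH': [0.045837, 0.886468, 0.10]})

toy['r_DEFAULT_NEXT_MONTH'] = deviance_residuals(toy['DEFAULT_NEXT_MONTH'],
                                                 toy['p_DEFAULT_NEXT_MONTH'])

# ID values of the n rows with the largest absolute residuals
n = 2
top_idx = toy['r_DEFAULT_NEXT_MONTH'].abs().nlargest(n).index
outlier_ids = toy.loc[top_idx, 'ID'].tolist()
print(outlier_ids)
```
", "\n", "The resulting `ID` values could then be used as test cases in the same way as the entries of the percentile dictionary."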
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{0: 28716,\n", " 10: 8942,\n", " 20: 28257,\n", " 30: 4074,\n", " 40: 13411,\n", " 50: 16633,\n", " 60: 2402,\n", " 70: 19769,\n", " 80: 25069,\n", " 90: 21372,\n", " 99: 29116}" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def get_percentile_dict(yhat, id_, frame):\n", "\n", "    \"\"\" Returns the minimum, the maximum, and the deciles of a column, yhat, \n", "    as the indices based on another column id_.\n", "    \n", "    Args:\n", "        yhat: Column in which to find percentiles.\n", "        id_: Id column that stores indices for percentiles of yhat.\n", "        frame: H2OFrame containing yhat and id_. \n", "    \n", "    Returns:\n", "        Dictionary of percentile values and index column values.\n", "    \n", "    \"\"\"\n", "    \n", "    # create a copy of frame and sort it by yhat\n", "    sort_df = frame.as_data_frame()\n", "    sort_df.sort_values(yhat, inplace=True)\n", "    sort_df.reset_index(inplace=True)\n", "    \n", "    # find top and bottom percentiles\n", "    percentiles_dict = {}\n", "    percentiles_dict[0] = sort_df.loc[0, id_]\n", "    percentiles_dict[99] = sort_df.loc[sort_df.shape[0]-1, id_]\n", "\n", "    # find 10th-90th percentiles\n", "    inc = sort_df.shape[0]//10\n", "    for i in range(1, 10):\n", "        percentiles_dict[i * 10] = sort_df.loc[i * inc, id_]\n", "\n", "    return percentiles_dict\n", "\n", "# display percentiles dictionary\n", "# ID values for rows\n", "# from lowest prediction \n", "# to highest prediction\n", "pred_percentile_dict = get_percentile_dict(yhat, 'ID', test_yhat)\n", "pred_percentile_dict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Display test data prediction range\n", "Unlike some regression models and neural networks that can produce outrageous predictions for changes in input variables, GBM predictions in new data are bounded by the lowest and highest probability leaf nodes in each constituent decision tree 
in the trained model. While unbounded, extreme predictions are typically not an issue for tree models and classification tasks, it is often a good idea to check that the model predictions cover a full range of useful values in the test set. Below, we can see that the model produces both low and high predictions in the test set, indicating that it is likely responsive to signal in new data and not simply predicting the majority class or an average value. " ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Lowest prediction: " ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
DEFAULT_NEXT_MONTH p_DEFAULT_NEXT_MONTH
0 0.0383668
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Highest prediction: " ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
DEFAULT_NEXT_MONTH p_DEFAULT_NEXT_MONTH
1 0.895285
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "print('Lowest prediction:', test_yhat[test_yhat['ID'] == int(pred_percentile_dict[0])][[y, yhat]])\n", "print('Highest prediction:', test_yhat[test_yhat['ID'] == int(pred_percentile_dict[99])][[y, yhat]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Use trained model to test predictions for interesting situations: customer least likely to default\n", "As a starting point for further analysis, sensitivity analysis is performed for the customer least likely to default. This woman has a very low probability of defaulting according to the trained GBM." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
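<table><thead><tr><th>ID</th><th>LIMIT_BAL</th><th>SEX</th><th>EDUCATION</th><th>MARRIAGE</th><th>AGE</th><th>PAY_0</th><th>PAY_2</th><th>PAY_3</th><th>PAY_4</th><th>PAY_5</th><th>PAY_6</th><th>BILL_AMT1</th><th>BILL_AMT2</th><th>BILL_AMT3</th><th>BILL_AMT4</th><th>BILL_AMT5</th><th>BILL_AMT6</th><th>PAY_AMT1</th><th>PAY_AMT2</th><th>PAY_AMT3</th><th>PAY_AMT4</th><th>PAY_AMT5</th><th>PAY_AMT6</th><th>DEFAULT_NEXT_MONTH</th><th>bill_std</th><th>p_DEFAULT_NEXT_MONTH</th></tr></thead>
<tbody><tr><td>28716</td><td>780000</td><td>female</td><td>university</td><td>single</td><td>41</td><td>no consumption</td><td>no consumption</td><td>no consumption</td><td>no consumption</td><td>no consumption</td><td>no consumption</td><td>101957</td><td>61715</td><td>38686</td><td>21482</td><td>72628</td><td>182792</td><td>62819</td><td>39558</td><td>22204</td><td>82097</td><td>184322</td><td>25695</td><td>0</td><td>57564.1</td><td>0.0383668</td></tr></tbody></table>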
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_case = test_yhat[test_yhat['ID'] == int(pred_percentile_dict[0])]\n", "test_case" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test effect of changing `SEX`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`SEX` should not have a large impact on predictions. A large impact could indicate unwanted sociological bias in the GBM model." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gbm prediction progress: |████████████████████████████████████████████████| 100%\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
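<table><thead><tr><th>ID</th><th>LIMIT_BAL</th><th>SEX</th><th>EDUCATION</th><th>MARRIAGE</th><th>AGE</th><th>PAY_0</th><th>PAY_2</th><th>PAY_3</th><th>PAY_4</th><th>PAY_5</th><th>PAY_6</th><th>BILL_AMT1</th><th>BILL_AMT2</th><th>BILL_AMT3</th><th>BILL_AMT4</th><th>BILL_AMT5</th><th>BILL_AMT6</th><th>PAY_AMT1</th><th>PAY_AMT2</th><th>PAY_AMT3</th><th>PAY_AMT4</th><th>PAY_AMT5</th><th>PAY_AMT6</th><th>DEFAULT_NEXT_MONTH</th><th>bill_std</th><th>predict</th><th>p0</th><th>p1</th></tr></thead>
<tbody><tr><td>28716</td><td>780000</td><td>male</td><td>university</td><td>single</td><td>41</td><td>no consumption</td><td>no consumption</td><td>no consumption</td><td>no consumption</td><td>no consumption</td><td>no consumption</td><td>101957</td><td>61715</td><td>38686</td><td>21482</td><td>72628</td><td>182792</td><td>62819</td><td>39558</td><td>22204</td><td>82097</td><td>184322</td><td>25695</td><td>0</td><td>57564.1</td><td>0</td><td>0.959052</td><td>0.0409481</td></tr></tbody></table>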
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_case = test_yhat[test_yhat['ID'] == int(pred_percentile_dict[0])]\n", "test_case = test_case.drop([yhat])\n", "test_case['SEX'] = 'male'\n", "test_case = test_case.cbind(model.predict(test_case))\n", "test_case" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As desired, simulating this person as a male does not have a large impact on their probability of default, which moves from about 0.038 to about 0.041." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test effect of changing `PAY_0`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Variable importance and residual analysis indicate that the value of `PAY_0` can have a strong effect on model predictions. Measuring the change in predicted probability when the value of `PAY_0` is changed from a timely payment to a late payment is probably a good test case for prediction stability. " ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gbm prediction progress: |████████████████████████████████████████████████| 100%\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
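<table><thead><tr><th>ID</th><th>LIMIT_BAL</th><th>SEX</th><th>EDUCATION</th><th>MARRIAGE</th><th>AGE</th><th>PAY_0</th><th>PAY_2</th><th>PAY_3</th><th>PAY_4</th><th>PAY_5</th><th>PAY_6</th><th>BILL_AMT1</th><th>BILL_AMT2</th><th>BILL_AMT3</th><th>BILL_AMT4</th><th>BILL_AMT5</th><th>BILL_AMT6</th><th>PAY_AMT1</th><th>PAY_AMT2</th><th>PAY_AMT3</th><th>PAY_AMT4</th><th>PAY_AMT5</th><th>PAY_AMT6</th><th>DEFAULT_NEXT_MONTH</th><th>bill_std</th><th>predict</th><th>p0</th><th>p1</th></tr></thead>
<tbody><tr><td>28716</td><td>780000</td><td>female</td><td>university</td><td>single</td><td>41</td><td>2 month delay</td><td>no consumption</td><td>no consumption</td><td>no consumption</td><td>no consumption</td><td>no consumption</td><td>101957</td><td>61715</td><td>38686</td><td>21482</td><td>72628</td><td>182792</td><td>62819</td><td>39558</td><td>22204</td><td>82097</td><td>184322</td><td>25695</td><td>0</td><td>57564.1</td><td>1</td><td>0.571032</td><td>0.428968</td></tr></tbody></table>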
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_case = test_yhat[test_yhat['ID'] == int(pred_percentile_dict[0])]\n", "test_case = test_case.drop([yhat])\n", "test_case['PAY_0'] = '2 month delay' \n", "test_case = test_case.cbind(model.predict(test_case))\n", "test_case" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When the value is changed from `no consumption` to `2 month delay`, the predicted probability of default rises sharply, from about 0.04 to about 0.43. Such a marked change related to the value of one variable is problematic for numerous reasons." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Use trained model to test predictions for interesting situations: customer most likely to default\n", "Now the same test will be performed on the customer most likely to default. This woman has a very high probability of default under the GBM model. " ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
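<table><thead><tr><th>ID</th><th>LIMIT_BAL</th><th>SEX</th><th>EDUCATION</th><th>MARRIAGE</th><th>AGE</th><th>PAY_0</th><th>PAY_2</th><th>PAY_3</th><th>PAY_4</th><th>PAY_5</th><th>PAY_6</th><th>BILL_AMT1</th><th>BILL_AMT2</th><th>BILL_AMT3</th><th>BILL_AMT4</th><th>BILL_AMT5</th><th>BILL_AMT6</th><th>PAY_AMT1</th><th>PAY_AMT2</th><th>PAY_AMT3</th><th>PAY_AMT4</th><th>PAY_AMT5</th><th>PAY_AMT6</th><th>DEFAULT_NEXT_MONTH</th><th>bill_std</th><th>p_DEFAULT_NEXT_MONTH</th></tr></thead>
<tbody><tr><td>29116</td><td>20000</td><td>female</td><td>university</td><td>married</td><td>59</td><td>3 month delay</td><td>2 month delay</td><td>3 month delay</td><td>2 month delay</td><td>2 month delay</td><td>4 month delay</td><td>8803</td><td>11137</td><td>10672</td><td>11201</td><td>12721</td><td>11946</td><td>2800</td><td>0</td><td>1000</td><td>2000</td><td>0</td><td>0</td><td>1</td><td>1327.55</td><td>0.895285</td></tr></tbody></table>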
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_case = test_yhat[test_yhat['ID'] == int(pred_percentile_dict[99])]\n", "test_case" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test effect of changing `SEX`\n", "Changing the value for `SEX` from female to male for this customer decreases the predicted probability by a relatively small amount." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gbm prediction progress: |████████████████████████████████████████████████| 100%\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
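<table><thead><tr><th>ID</th><th>LIMIT_BAL</th><th>SEX</th><th>EDUCATION</th><th>MARRIAGE</th><th>AGE</th><th>PAY_0</th><th>PAY_2</th><th>PAY_3</th><th>PAY_4</th><th>PAY_5</th><th>PAY_6</th><th>BILL_AMT1</th><th>BILL_AMT2</th><th>BILL_AMT3</th><th>BILL_AMT4</th><th>BILL_AMT5</th><th>BILL_AMT6</th><th>PAY_AMT1</th><th>PAY_AMT2</th><th>PAY_AMT3</th><th>PAY_AMT4</th><th>PAY_AMT5</th><th>PAY_AMT6</th><th>DEFAULT_NEXT_MONTH</th><th>bill_std</th><th>predict</th><th>p0</th><th>p1</th></tr></thead>
<tbody><tr><td>29116</td><td>20000</td><td>male</td><td>university</td><td>married</td><td>59</td><td>3 month delay</td><td>2 month delay</td><td>3 month delay</td><td>2 month delay</td><td>2 month delay</td><td>4 month delay</td><td>8803</td><td>11137</td><td>10672</td><td>11201</td><td>12721</td><td>11946</td><td>2800</td><td>0</td><td>1000</td><td>2000</td><td>0</td><td>0</td><td>1</td><td>1327.55</td><td>1</td><td>0.161579</td><td>0.838421</td></tr></tbody></table>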
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_case = test_yhat[test_yhat['ID'] == int(pred_percentile_dict[99])]\n", "test_case = test_case.drop([yhat])\n", "test_case['SEX'] = 'male'\n", "test_case = test_case.cbind(model.predict(test_case))\n", "test_case" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test effect of changing `PAY_0`\n", "Switching the riskiest customer's value for `PAY_0` from `3 month delay` to `pay duly` reduces their predicted probability of default from about 0.90 to about 0.73. This is a noticeable swing, but the probability remains high, well above common lending cutoffs." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gbm prediction progress: |████████████████████████████████████████████████| 100%\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
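<table><thead><tr><th>ID</th><th>LIMIT_BAL</th><th>SEX</th><th>EDUCATION</th><th>MARRIAGE</th><th>AGE</th><th>PAY_0</th><th>PAY_2</th><th>PAY_3</th><th>PAY_4</th><th>PAY_5</th><th>PAY_6</th><th>BILL_AMT1</th><th>BILL_AMT2</th><th>BILL_AMT3</th><th>BILL_AMT4</th><th>BILL_AMT5</th><th>BILL_AMT6</th><th>PAY_AMT1</th><th>PAY_AMT2</th><th>PAY_AMT3</th><th>PAY_AMT4</th><th>PAY_AMT5</th><th>PAY_AMT6</th><th>DEFAULT_NEXT_MONTH</th><th>bill_std</th><th>predict</th><th>p0</th><th>p1</th></tr></thead>
<tbody><tr><td>29116</td><td>20000</td><td>female</td><td>university</td><td>married</td><td>59</td><td>pay duly</td><td>2 month delay</td><td>3 month delay</td><td>2 month delay</td><td>2 month delay</td><td>4 month delay</td><td>8803</td><td>11137</td><td>10672</td><td>11201</td><td>12721</td><td>11946</td><td>2800</td><td>0</td><td>1000</td><td>2000</td><td>0</td><td>0</td><td>1</td><td>1327.55</td><td>1</td><td>0.273858</td><td>0.726142</td></tr></tbody></table>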
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_case = test_yhat[test_yhat['ID'] == int(pred_percentile_dict[99])]\n", "test_case = test_case.drop([yhat])\n", "test_case['PAY_0'] = 'pay duly' \n", "test_case = test_case.cbind(model.predict(test_case))\n", "test_case" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this small number of boundary test cases, the GBM model appears stable. However, if large swings in predictions occur for sensitive or important variables, practitioners are urged to retrain unstable models without the problematic variables or combinations of variables, which may unfortunately involve some trial and error. Also, four test cases are woefully inadequate for real-world models. Automated sensitivity analysis across many variables, combinations of variables, and for many different rows of data seems more appropriate for mission-critical machine learning." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Shut down H2O\n", "After using h2o, it's typically best to shut it down. However, before doing so, users should ensure that they have saved any h2o data structures, such as models, H2OFrames, or scoring artifacts, such as POJOs or MOJOs." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Are you sure you want to shutdown the H2O instance running at http://127.0.0.1:54321 (Y/N)? n\n" ] } ], "source": [ "# be careful, this can erase your work!\n", "h2o.cluster().shutdown(prompt=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, a complex GBM classifier was trained to predict credit card defaults. 
Residual analysis was used to debug the GBM model predictions and enabled a slight improvement in accuracy. Sensitivity analysis was used to test the GBM for trustworthiness and stability. In a small number of boundary test cases, the trained GBM appeared stable. Residual analysis and sensitivity analysis are powerful model debugging techniques and can increase trust in complex models. These techniques should generalize well for many types of business and research problems, enabling you to train a complex model and justify it to your colleagues, bosses, and potentially, external regulators. " ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }