{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#
Credit scoring for the microloan organization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Feature and data explanation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have data from a microloan organization. The data includes applications, statuses (whether an application was granted or not) and portfolio snapshots. The current risk level (the ratio of defaulted clients) is too high, so we need to implement at least a basic scorecard to select the best clients." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data can be found here https://cloud.mail.ru/public/MTPz/zvoEGUjz9" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**applications.csv** - contains information about applications\n", "\n", "**status.csv** - contains information on whether the loan was granted or not, plus short information about the client's previous applications\n", "\n", "**portfolio.csv** - a daily snapshot of whether the client was in delinquency or not" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**applications.csv:**\n", "\n", "Term - term of the loan\n", "\n", "expired_date - the date until which the client can accept the loan\n", "\n", "loans_amount - total amount (RUR) of loans the client has taken\n", "\n", "loans_cnt - number of loans the client has taken\n", "\n", "paid_amount - total amount (RUR) of loans the client has repaid\n", "\n", "paid_cnt - number of loans the client has repaid\n", "\n", "pretention - the client has a complaint about previous loans (for example, unpaid penalties)\n", "\n", "location_country - the client's country (from the IP address)\n", "\n", "location_city - the client's city (from the IP address)\n", "\n", "app_dt - date of the application\n", "\n", "verif_data - the client verified his personal data\n", "\n", "verif_doc - the client verified his document\n", "\n", "verif_appl - the client verified his photo\n", "\n", "verif_addres - the client verified his address\n", "\n", "doc_date - issue date of the passport\n", "\n", "client_date - date of registration in the community of the
microloan organization\n", "\n", "credit_status - status of the client (2: has an active loan, 1: has closed loans, 0: never had loans in the system)\n", "\n", "rating, rating2 - ratings in the community of the microloan organization\n", "\n", "amount_issued - amount of the current loan\n", "\n", "amount_return - amount of the current loan plus interest\n", "\n", "friends - number of friends in the community of the microloan organization\n", "\n", "app_id - id of the application\n", "\n", "client_id - id of the client\n", "\n", "**status.csv**:\n", "\n", "comments - staff comments (usually a comment means something bad)\n", "\n", "decision - whether the loan was granted or not (1 - granted)\n", "\n", "**portfolio.csv**:\n", "\n", "report_dt - date of the snapshot\n", "\n", "delq - whether the client was in delinquency or not (1 - was)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Primary data analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.model_selection import train_test_split, cross_val_predict, GridSearchCV\n", "from sklearn.linear_model import LogisticRegression, Ridge, LinearRegression\n", "from xgboost import XGBClassifier\n", "from sklearn.metrics import roc_auc_score\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from scipy.sparse import csr_matrix, hstack\n", "from datetime import timedelta, date\n", "import datetime\n", "import math\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Loading the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "status = pd.read_csv('status.csv')\n", "applications = pd.read_csv('applications.csv')\n", "portfolio = pd.read_csv('portfolio.csv')" ] }, { "cell_type": "code", "execution_count":
null, "metadata": {}, "outputs": [], "source": [ "status.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applications.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applications.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "portfolio.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to define the target variable. Of course, it should reflect that the loan is not repaid, but if a client delayed a payment by a single day we shouldn't consider him a defaulted client. Therefore, we need to examine how many days of delay we will treat as a default (target = 1)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "portfolio['report_dt'] = pd.to_datetime(portfolio['report_dt'], format = '%Y-%m-%d')\n", "portfolio['key'] = portfolio['client_id'].apply(lambda x: str(x)) + portfolio['report_dt'].apply(lambda x: str(x))\n", "portfolio['key_last'] = portfolio['client_id'].apply(lambda x: str(x)) + portfolio['report_dt'].apply(lambda x: str(x + timedelta(-1)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res = []\n", "d = {}\n", "d[0] = portfolio[(portfolio['delq'] == 0) & (portfolio['report_dt'] < '2018-11-15')]\n", "for i in range(1,30):\n", " d[i] = portfolio[(portfolio['key_last'].isin(d[i-1]['key'])) & (portfolio['delq'] == 1)]\n", " res.append(d[i].shape[0])\n", "res = np.array(res)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recovery rate (what percentage of the clients who were in delinquency for one day are still in delinquency, depending on the delinquency day):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.plot(res[1:]/res[1], 'bo')\n", "plt.ylim(0,1)\n", "res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"We'll define default as being in delinquency for **more than five days**, since after five days the recovery rate decreases only slowly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Calculating the target variable**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We exclude all applications with a term of more than 60 days, since the company is not going to grant such loans in the future. We also need to exclude all loans granted in the last 60 + 6 days, since otherwise we can't calculate the target variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applications = applications[applications['Term'] <= 60]\n", "applications['app_dt'] = pd.to_datetime(applications['app_dt'], format = '%Y-%m-%d')\n", "applications = applications[applications['app_dt'] <= '2018-10-10']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add a flag indicating that the loan was granted:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applications['granted'] = 0\n", "applications.loc[applications['app_id'].isin(status[status['decision'] == 1]['app_id']),'granted'] = 1\n", "applications['granted'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the stability of the granted rate:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applications['month_gr'] = applications['app_dt'].apply(lambda x: str(x.year*100+x.month))\n", "res = applications[['month_gr', 'granted']].groupby('month_gr').mean().sort_values(by = ['month_gr'])\n", "plt.plot(res)\n", "plt.xticks(rotation='vertical');\n", "plt.ylim((0,1));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a dip in June 2018, but in general the rate is stable and there are no trends." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Leave only granted applications:" ] }, { "cell_type":
"code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applications = applications[applications['granted'] == 1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is no data for app_id = 12558 in the portfolio, so we delete this single application (otherwise the next script raises an error):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applications = applications[applications['app_id'] != 12558]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "for i in applications['app_id']:\n", " dt = pd.to_datetime(applications[applications['app_id'] == i]['app_dt'].values[0])\n", " client_id = applications[applications['app_id'] == i]['client_id'].values[0]\n", " term = int(applications[applications['app_id'] == i]['Term'].values[0])\n", " a = portfolio.loc[(portfolio['report_dt'] >= dt + timedelta(term+2)) & (portfolio['report_dt'] <= dt + timedelta(term+6)) & (portfolio['client_id'] == client_id), ['client_id', 'delq']].groupby('client_id').min()\n", " applications.loc[applications['app_id'] == i, 'target'] = a['delq'].values[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applications['target'].mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res = applications[['month_gr', 'target']].groupby('month_gr').mean().sort_values(by = ['month_gr'])\n", "plt.plot(res)\n", "plt.xticks(rotation='vertical');\n", "plt.ylim((0, 0.3));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The default rate is less stable because one client can take more than one loan.
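As a side note, the per-application loop above performs many pandas lookups and is slow. The same target definition (the minimum of `delq` over the window from `app_dt + Term + 2` to `app_dt + Term + 6`) can be computed with a single merge. A sketch on toy frames with the same column names as above (not a drop-in replacement; applications with no portfolio rows in the window would still need handling):

```python
import pandas as pd

# toy stand-ins for the applications / portfolio frames, same column names as above
apps = pd.DataFrame({'app_id': [1, 2],
                     'client_id': [10, 20],
                     'app_dt': pd.to_datetime(['2018-01-01', '2018-01-05']),
                     'Term': [30, 7]})
port = pd.DataFrame({'client_id': [10] * 5 + [20] * 5,
                     'report_dt': (list(pd.date_range('2018-02-02', periods=5)) +
                                   list(pd.date_range('2018-01-14', periods=5))),
                     'delq': [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]})

m = apps.merge(port, on='client_id')                         # all snapshots per application
lo = m['app_dt'] + pd.to_timedelta(m['Term'] + 2, unit='D')  # window start: app_dt + Term + 2
hi = m['app_dt'] + pd.to_timedelta(m['Term'] + 6, unit='D')  # window end:   app_dt + Term + 6
in_window = m[(m['report_dt'] >= lo) & (m['report_dt'] <= hi)]
# target = 1 only if the client was delinquent on every snapshot in the window
target = in_window.groupby('app_id')['delq'].min()
print(target.to_dict())  # → {1: 1, 2: 0}
```

On real data this replaces the O(applications × lookups) loop with one join plus a groupby.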
And the bad rate strongly depends on such clients (on whether they are good or bad):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applications[['client_id', 'app_id']].groupby('client_id').count().reset_index().sort_values(by = 'app_id', ascending = False).head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To avoid this we'll keep only one active loan per client at any moment in time, e.g. if a client took a loan on the 1st of January for 30 days, we'll exclude all loans granted to this client before the 31st of January." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "client_id = 0\n", "applications = applications.sort_values(by = ['client_id' , 'app_dt', 'app_id'])\n", "applications['sample'] = 0\n", "\n", "for i in applications['app_id']:\n", " if client_id != applications[applications['app_id'] == i]['client_id'].values[0]:\n", " client_id = applications[applications['app_id'] == i]['client_id'].values[0]\n", " final_dt = pd.to_datetime(applications[applications['app_id'] == i]['app_dt'].values[0], format = '%Y-%m-%d') + timedelta(int(applications[applications['app_id'] == i]['Term'].values[0]))\n", " applications.loc[applications['app_id'] == i, 'sample'] = 1 \n", " else:\n", " if pd.to_datetime(applications[applications['app_id'] == i]['app_dt'].values[0], format = '%Y-%m-%d') >= final_dt:\n", " final_dt = pd.to_datetime(applications[applications['app_id'] == i]['app_dt'].values[0], format = '%Y-%m-%d') + timedelta(int(applications[applications['app_id'] == i]['Term'].values[0]))\n", " applications.loc[applications['app_id'] == i, 'sample'] = 1 " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res = applications[applications['sample'] == 1][['month_gr', 'target']].groupby('month_gr').mean().sort_values(by = ['month_gr'])\n", "plt.plot(res)\n", "plt.xticks(rotation='vertical');\n", "plt.ylim((0,
0.3));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now it's better. But we can see growth in summer 2018; at the same time we saw a decrease of the approval rate in that period, which means that the inflow of clients was worse at that time.\n", "\n", "Note that we don't have too much data now:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res = applications[applications['sample'] == 1][['month_gr', 'target']].groupby('month_gr').count().sort_values(by = ['month_gr'])\n", "plt.plot(res)\n", "plt.xticks(rotation='vertical');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Data preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check for NaN values:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applications.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applications[['Term', 'expired_date', 'loans_amount', 'loans_cnt',\n", " 'paid_amount', 'paid_cnt', 'pretention', 'location_country',\n", " 'location_city', 'app_dt', 'verif_data', 'verif_doc', 'verif_appl',\n", " 'verif_addres', 'doc_date', 'client_date', 'credit_status', 'rating',\n", " 'rating2', 'amount_issued', 'amount_return', 'friends']].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**pretention:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applications[['pretention', 'month_gr']].groupby('month_gr').count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since one third of this variable's values are missing, we won't use it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check **verif_doc**:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applications[['verif_doc', 'app_id']].groupby('verif_doc').count()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {},
"outputs": [], "source": [ "applications['verif_doc'] = applications['verif_doc'].fillna(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check **verif_appl**:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applications[['verif_appl', 'app_id']].groupby('verif_appl').count()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applications['verif_appl'] = applications['verif_appl'].fillna(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check **location_country** and **location_city**:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applications['location_country'] = applications['location_country'].fillna('n\\a')\n", "applications['location_city'] = applications['location_city'].fillna('n\\a')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 9. Creation of new features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add new features: the time since client registration and since the document was issued (for both, the longer the better)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applications['doc_date_i'] = ((pd.to_datetime(applications['app_dt'], format = '%Y-%m-%d')-pd.to_datetime(applications['doc_date'], format = '%Y-%m-%d')).dt.days/365.25).fillna(0)\n", "applications['client_date_i'] = (pd.to_datetime(applications['app_dt'], format = '%Y-%m-%d')-pd.to_datetime(applications['client_date'], format = '%Y-%m-%d')).dt.days/365.25" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add a new feature: the average monthly interest rate. According to the current loan process, the client accepts the maximum rate he agrees to.
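For intuition, the monthly rate defined below as (amount_return − amount_issued) / (Term / 30) / amount_issued can be checked on a toy loan (the numbers here are made up):

```python
# hypothetical loan: 10000 issued, 11500 to return, 30-day term
amount_issued, amount_return, term = 10000.0, 11500.0, 30

ir = (amount_return - amount_issued) / (term / 30) / amount_issued
print(ir)  # → 0.15, i.e. 15% per month
```

For a 7-day loan with the same amounts the same formula would give 0.15 / (7/30) ≈ 0.64, which is why very short loans show much higher monthly rates.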
The process won't be changed, therefore we can use such a variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applications['ir'] = (applications['amount_return'] - applications['amount_issued'])/(applications['Term']/30)/applications['amount_issued']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add one more feature, a debt-burden ratio: the client's outstanding amount plus the current loan's return amount, divided by the amount he has already repaid:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applications['pti'] = (applications['loans_amount'] - applications['paid_amount'] + applications['amount_return'])/applications['paid_amount']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Primary visual data analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "appl_short = applications[applications['sample'] == 1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "appl_short.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each variable we plot the number of observations per bucket and the average target per bucket. Since the plotting code is identical for every variable, we wrap it into a helper function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_var(var, ratio, limit = None):\n", " # bucket the variable by `ratio`, capping the bucket index at `limit` if given\n", " bucket = appl_short[var].apply(lambda x: int(x/ratio) if limit is None else min(int(x/ratio), limit))\n", " df = pd.DataFrame({'x': bucket.values, 'y': appl_short['target'].values})\n", " plt.figure(figsize=(15,5))\n", " plt.subplot(1,2,1)\n", " df.groupby('x').count().reset_index().rename(columns = {'y': var + ' (cnt)'}).plot(x = 'x', kind = 'bar', ax = plt.gca());\n", " plt.subplot(1,2,2)\n", " df.groupby('x').mean().reset_index().rename(columns = {'y': var + ' (avg target)'}).plot(x = 'x', kind = 'bar', ax = plt.gca());" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_var('Term', 15)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_var('loans_amount', 200000, 5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_var('loans_cnt', 100, 3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_var('rating', 100, 3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_var('rating2', 10, 1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_var('amount_issued', 5000, 3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_var('friends', 15, 5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_var('client_date_i', 3, 5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_var('doc_date_i', 3, 5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_var('ir', 0.02, 15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A very high interest rate means very high risk. The decrease of risk in the last bucket is due to clients with a very short term (usually they have a much higher interest rate)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Metrics selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to split clients into \"good\" and \"bad\", but we don't need to separate \"bad\" clients from \"very bad\" ones. Therefore, we first decline the very bad clients - those with a very high interest rate.\n", "After that we'll use the standard metric for bank scoring - ROC AUC." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = appl_short[['target', 'ir', 'Term']].copy()\n", "df['Term_gr'] = df['Term'].apply(lambda x: 7 if x <= 7 else 29 if x <= 29 else 30 if x == 30 else 60)\n", "df['ir_gr'] = df['ir'].apply(lambda x: 15 if int(x/0.02) > 15 else int(x/0.02))\n", "df['cnt'] = 1\n", "df2 = df.groupby(['Term_gr', 'ir_gr']).sum().reset_index()\n", "df2['avg_target'] = df2['target']/df2['cnt']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2[['Term_gr', 'ir_gr', 'avg_target', 'cnt']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = appl_short[['target', 'ir', 'Term']].copy()\n", "df['Term_gr'] = df['Term'].apply(lambda x: 7 if x <= 7 else 29 if x <= 29 else 30 if x == 30 else 60)\n", "df['ir_gr'] = df['ir'].apply(lambda x: 11 if int(x/0.02) > 11 else 7 if int(x/0.02) >= 7 else 6)\n", "df['cnt'] = 1\n", "df2 = df.groupby(['Term_gr', 'ir_gr']).sum().reset_index()\n", "df2['avg_target'] = df2['target']/df2['cnt']\n", "df2[['Term_gr', 'ir_gr', 'avg_target', 'cnt']]" ] }, { "cell_type": "markdown",
"metadata": {}, "source": [ "Finally, we'll decline clients with:\n", " - interest rate >= 0.13 if term >= 30\n", " - interest rate >= 0.21 if term > 7" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applications['auto_decl'] = 0\n", "applications.loc[(applications['ir'] >= 0.13) & (applications['Term'] >= 30),'auto_decl'] = 1\n", "applications.loc[(applications['ir'] >= 0.21) & (applications['Term'] > 7),'auto_decl'] = 1\n", "appl_short = applications[(applications['sample'] == 1) & (applications['auto_decl'] == 0)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applications[(applications['sample'] == 1) & (applications['auto_decl'] == 0)]['target'].mean(), \\\n", "applications[(applications['sample'] == 1) & (applications['auto_decl'] == 1)]['target'].mean(), \\\n", "applications[(applications['sample'] == 1) & (applications['auto_decl'] == 0)]['target'].count(), \\\n", "applications[(applications['sample'] == 1) & (applications['auto_decl'] == 1)]['target'].count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We decline 124 observations with a risk rate of more than 30%." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Model selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll choose the CatBoost model since it works very well with categorical variables.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "appl_short = appl_short.sort_values(by = ['app_dt', 'app_id'])\n", "X_train = appl_short[['location_country','location_city', 'verif_data', 'verif_doc', 'verif_appl',\n", " 'verif_addres', 'credit_status', 'Term', 'rating', 'pti',\n", " 'rating2', 'amount_issued', 'amount_return', 'friends', 'doc_date_i','client_date_i', 'ir']].values\n", "y_train = appl_short['target'].values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train_part, X_valid, y_train_part, y_valid = \\\n", " train_test_split(X_train, y_train, \n", " test_size=0.3, random_state=17)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from catboost import CatBoostClassifier\n", "\n", "model = CatBoostClassifier(random_seed = 17, thread_count = 4, verbose = False)\n", "model.fit(X_train_part, y_train_part, cat_features = [0,1,2,3,4,5,6])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cb_train_pred = model.predict_proba(X_valid)[:, 1]\n", "roc_auc_score(y_valid, cb_train_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. 
Cross-validation and adjustment of model hyperparameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train = appl_short[['location_country','location_city', 'verif_data', 'verif_doc', 'verif_appl',\n", " 'verif_addres', 'credit_status', 'Term', 'rating', 'pti',\n", " 'rating2', 'amount_issued', 'amount_return', 'friends', 'doc_date_i','client_date_i', 'ir']].values\n", "y_train = appl_short['target'].values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll find the best values of max_depth and n_estimators, using cv = 3 time-ordered folds (the training part always precedes the validation part, since the data is sorted by application date)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "res = []\n", "cv = 3\n", "\n", "step = int(X_train.shape[0]/(cv + 1))\n", "\n", "for max_depth in range(2,7,2):\n", " for n_estimators in range(500, 2000, 500): \n", " for i in range(1,cv+1):\n", " X_train_part = X_train[:step*i-1,:]\n", " y_train_part = y_train[:step*i-1]\n", " if i < cv:\n", " X_valid = X_train[step*i:step*(i+1),:]\n", " y_valid = y_train[step*i:step*(i+1)]\n", " else:\n", " X_valid = X_train[step*i:,:]\n", " y_valid = y_train[step*i:]\n", "\n", " model = CatBoostClassifier(random_seed = 17, thread_count = 4, verbose = False,\n", " max_depth = max_depth, n_estimators = n_estimators)\n", " model.fit(X_train_part, y_train_part, cat_features = [0,1,2,3,4,5,6])\n", " \n", " cb_train_pred = model.predict_proba(X_train_part)[:, 1]\n", " train_res = roc_auc_score(y_train_part, cb_train_pred)\n", " \n", " cb_valid_pred = model.predict_proba(X_valid)[:, 1]\n", " cv_res = roc_auc_score(y_valid, cb_valid_pred)\n", " res.append([max_depth, n_estimators, i, train_res, cv_res])\n", " print(max_depth, n_estimators, i, train_res, cv_res)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res = pd.DataFrame(data = res, columns = ['max_depth', 'n_estimators', 'cv_iter', 'train_res', 'cv_res'])\n",
"res[['max_depth', 'n_estimators', 'train_res', 'cv_res']].groupby(['max_depth', 'n_estimators']).\\\n", " mean().reset_index().sort_values(by = ['cv_res'], ascending = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10. Plotting training and validation curves" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Max_depth:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "res = []\n", "cv = 3\n", "\n", "step = int(X_train.shape[0]/(cv + 1))\n", "\n", "n_estimators = 100\n", "\n", "for max_depth in range(2,11):\n", " for i in range(1,cv+1):\n", " X_train_part = X_train[:step*i-1,:]\n", " y_train_part = y_train[:step*i-1]\n", " if i < cv:\n", " X_valid = X_train[step*i:step*(i+1),:]\n", " y_valid = y_train[step*i:step*(i+1)]\n", " else:\n", " X_valid = X_train[step*i:,:]\n", " y_valid = y_train[step*i:]\n", "\n", " model = CatBoostClassifier(random_seed = 17, thread_count = 4, verbose = False,\n", " max_depth = max_depth, n_estimators = n_estimators)\n", " model.fit(X_train_part, y_train_part, cat_features = [0,1,2,3,4,5,6])\n", "\n", " cb_train_pred = model.predict_proba(X_train_part)[:, 1]\n", " train_res = roc_auc_score(y_train_part, cb_train_pred)\n", "\n", " cb_valid_pred = model.predict_proba(X_valid)[:, 1]\n", " cv_res = roc_auc_score(y_valid, cb_valid_pred)\n", " res.append([max_depth, n_estimators, i, train_res, cv_res])\n", "\n", "df_res = pd.DataFrame(data = res, columns = ['max_depth', 'n_estimators', 'cv_iter', 'train_res', 'cv_res'])\n", "x = np.linspace(2,10,9)\n", "y1 = df_res[['max_depth', 'n_estimators', 'train_res']].groupby(['max_depth', 'n_estimators']).\\\n", " mean()['train_res'].values\n", "y2 = df_res[['max_depth', 'n_estimators', 'cv_res']].groupby(['max_depth', 'n_estimators']).\\\n", " mean().reset_index()['cv_res'].values\n", "plt.plot(x, y1, '-b', label = 'train')\n", "plt.plot(x, y2, '-r', label = 'valid')\n", "plt.legend(loc='upper left')" ] }, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "n_estimators:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "res = []\n", "cv = 3\n", "\n", "step = int(X_train.shape[0]/(cv + 1))\n", "\n", "max_depth = 2\n", "\n", "for n_estimators in range(100,1500,100):\n", " for i in range(1,cv+1):\n", " X_train_part = X_train[:step*i-1,:]\n", " y_train_part = y_train[:step*i-1]\n", " if i < cv:\n", " X_valid = X_train[step*i:step*(i+1),:]\n", " y_valid = y_train[step*i:step*(i+1)]\n", " else:\n", " X_valid = X_train[step*i:,:]\n", " y_valid = y_train[step*i:]\n", "\n", " model = CatBoostClassifier(random_seed = 17, thread_count = 4, verbose = False,\n", " max_depth = max_depth, n_estimators = n_estimators)\n", " model.fit(X_train_part, y_train_part, cat_features = [0,1,2,3,4,5,6])\n", "\n", " cb_train_pred = model.predict_proba(X_train_part)[:, 1]\n", " train_res = roc_auc_score(y_train_part, cb_train_pred)\n", "\n", " cb_valid_pred = model.predict_proba(X_valid)[:, 1]\n", " cv_res = roc_auc_score(y_valid, cb_valid_pred)\n", " res.append([max_depth, n_estimators, i, train_res, cv_res])\n", " \n", "df_res = pd.DataFrame(data = res, columns = ['max_depth', 'n_estimators', 'cv_iter', 'train_res', 'cv_res'])\n", "x = np.linspace(100,1400,14)\n", "y1 = df_res[['max_depth', 'n_estimators', 'train_res']].groupby(['max_depth', 'n_estimators']).\\\n", " mean()['train_res'].values\n", "y2 = df_res[['max_depth', 'n_estimators', 'cv_res']].groupby(['max_depth', 'n_estimators']).\\\n", " mean().reset_index()['cv_res'].values\n", "plt.plot(x, y1, '-b', label = 'train')\n", "plt.plot(x, y2, '-r', label = 'valid')\n", "plt.legend(loc='upper left')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 11. 
Prediction for test or hold-out samples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Final model:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train = appl_short[['location_country','location_city', 'verif_data', 'verif_doc', 'verif_appl',\n", " 'verif_addres', 'credit_status', 'Term', 'rating', 'pti',\n", " 'rating2', 'amount_issued', 'amount_return', 'friends', 'doc_date_i','client_date_i', 'ir']].values\n", "y_train = appl_short['target'].values\n", "\n", "X_train_part, X_valid, y_train_part, y_valid = \\\n", " train_test_split(X_train, y_train, \n", " test_size=0.3, random_state=17)\n", "model = CatBoostClassifier(random_seed = 17, thread_count = 4, verbose = False,\n", " max_depth = 2, n_estimators = 1000)\n", "model.fit(X_train_part, y_train_part, cat_features = [0,1,2,3,4,5,6])\n", "cb_train_pred = model.predict_proba(X_valid)[:, 1]\n", "roc_auc_score(y_valid, cb_train_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This result is lower than the one from the first iteration of the model, but it should be more stable." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 12. Conclusions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to set a cut-off, since the main goal of this project is to estimate the potential volume of loans. To set the cut-off, we need to choose a score value that matches the acceptable risk level. 
The acceptable risk level is 2% (calculated from the margin)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# stack scores and actual outcomes into one table\n", "res = np.column_stack((cb_train_pred, y_valid))\n", "df_res = pd.DataFrame(data = res, columns = ['score', 'bad'])\n", "df_res = df_res.sort_values(by = ['score'])\n", "# cut-off chosen so that the 'good' bucket stays within the 2% risk level\n", "df_res['score_gr'] = df_res['score'].apply(lambda x: 'good' if x <= 0.027 else 'bad')\n", "print(df_res[['score_gr', 'bad']].groupby('score_gr').mean(), df_res[['score_gr', 'bad']].groupby('score_gr').count())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The share of good clients is too small" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Conclusion:** \n", "\n", "1. Building a model on application data alone is not feasible; additional data sources should be used\n", "2. Credit history variables should be included in the model (they are currently not available in the credit process)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }