{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "## Открытый курс по машинному обучению. Сессия № 2\n", "\n", "###
Автор материала: Мария Тараканова (@Masha Noiseless)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##
Индивидуальный проект по анализу данных
" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "В проекте используются данные по кредитным картам клиентов.\n", "Данные можно скачать отсюда: [default of credit card clients Data Set](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from matplotlib import pyplot as plt\n", "import seaborn as sns\n", "%matplotlib inline\n", "%pylab inline\n", "\n", "pylab.rcParams['figure.figsize'] = (16, 7)\n", "mpl.rcParams['figure.figsize'] = (16,7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Описание набора данных и признаков" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Представлены данные за полгода (апрель - сентябрь 2005) по клиентам банка в Тайване.\n", "\n", "**Список переменных:**\n", "\n", " - **LIMIT_BAL** - Величина выдаваемого кредита: включает как индивидуальный потребительский кредит, так и его / ее семейный (дополнительный) кредит.\n", "\n", "- **SEX** - Пол (1 = мужчина; 2 = женщина)\n", "\n", "- **EDUCATION** - Образование (1 = аспиратура; 2 = высшее образование; 3 = среднее образование; 4 = другое).\n", "\n", "- **MARRIAGE** - Семейное положение (1 = женат/замужем; 2 = одинокий; 3 = другое).\n", "\n", "- **AGE** - Возраст (в годах).\n", "\n", "- **PAY_0 ... PAY_6** - История платежей. Отслеживались прошлые ежемесячные платежные записи (с Апреля по Сентябрь, 2005) следующим образом: PAY_0 = статус погашения в Сентябре, 2005, ..., PAY_6 = статус погашения в Апреле, 2005. Шкала измерения для статуса погашения: -1 = оплачен вовремя; 1 = оплата задержана на месяц; ... 8 = оплата задержана на 8 месяцев; 9 = оплата задержана на 9 месяцев и больше.\n", "\n", "- **BILL_AMT1\t... BILL_AMT6** - Сумма выписки по счету в каждый из месяцев. BILL_AMT1 = Сентябрь, 2005; ... BILL_AMT6 = Апрель, 2005.\n", "\n", "- **PAY_AMT1 ... PAY_AMT6** - Сумма предыдущего платежа. PAY_AMT1 = выплаченная сумма в Сентябре, 2005; ... \n", " PAY_AMT6 = выплаченная сумма в Апреле, 2005.\n", " \n", "Целевая переменная: **default payment next month** - неуплата в следующем месяце, бинарная (1 = да, 0 = нет). \n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pd.read_excel('default of credit card clients.xls', header=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Первичный анализ данных" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.head().T" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Как видно, в данных нет пропущенных значений, тип всех переменных int64.\n", "\n", "Выведем основные харакетристики признаков." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.describe(include='all').T" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# для удобства переименуем целевую переменную\n", "data = data.rename(columns = {'default payment next month':'target'})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax = data['target'].hist(orientation='horizontal', figsize=(10,5))\n", "ax.set_title(\"default payment next month distribution\")\n", "\n", "print('Distribution of target:')\n", "data['target'].value_counts() / data.shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Выборка не сбалансирована, одного класса больше чем другого." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Исследуем категориальные признаки $SEX$ (пол), $EDUCATION$ (образование) и $MARRIAGE$ (семейное положение)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(data['SEX'].value_counts())\n", "print('***')\n", "print(data['EDUCATION'].value_counts())\n", "print('***')\n", "print(data['MARRIAGE'].value_counts())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "В переменной $EDUCATION$ кроме заявленных категорий (1 = аспиратура; 2 = высшее образование; 3 = среднее образование; 4 = другое) появились неизвестные категории 0, 5 и 6. Проведем дополнительное исследование этих параметров." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.manifold import TSNE\n", "from sklearn.preprocessing import StandardScaler" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "data2 = data[(data['EDUCATION'] == 0) | (data['EDUCATION'] == 5)| (data['EDUCATION'] == 6) |\n", " (data['EDUCATION'] == 4)]\n", "scaler = StandardScaler()\n", "scaler.fit(data2)\n", "print('Размер нового датасета: {}'.format(data2.shape))\n", "tsne_model = TSNE(2, random_state=17)\n", "data2_2d = tsne_model.fit_transform(scaler.transform(data2))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.scatter(data2_2d[:, 0], data2_2d[:, 1], color = ['red' if i == 0 else (\n", " 'blue' if i==5 else ('black' if i == 4 else 'green')) for i in data2['EDUCATION']])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Категория 0, выделенная красным, четко отделена от категорий 4,5 и 6. Значит можем объединить их в одну категорию 4." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data['EDUCATION'].replace([5, 6], 4, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Аналогично в признаке $MARRIAGE$ появилась неизвестная категория 0. Исследуем этот случай." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "data2 = data[(data['MARRIAGE'] == 0) | (data['MARRIAGE'] == 3)]\n", "scaler = StandardScaler()\n", "scaler.fit(data2)\n", "print('Размер нового датасета: {}'.format(data2.shape))\n", "tsne_model = TSNE(2, random_state=17)\n", "data2_2d = tsne_model.fit_transform(scaler.transform(data2))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.scatter(data2_2d[:, 0], data2_2d[:, 1], color = ['red' if i == 0 else 'blue' for i in data2['MARRIAGE']])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "В целом, категория 0 выделена в отдельную группу, кроме нескольких точек." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Первичный визуальный анализ данных" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Исследуем зависимость категориальных признаков $SEX$ (пол), $EDUCATION$ (образование), $PAY\\_i, i = 1,...,6$ (статус платежа в $i$-том месяце) и $MARRIAGE$ (семейное положение) от целевой переменной $target$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.barplot(x='SEX', y='target', data=data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Доля неуплативших долг у мужчин ($SEX$=1), чуть больше, чем у женщин ($SEX$=2), но в целом существенной разницы нет." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.barplot(x='MARRIAGE', y='target', data=data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Доля неуплативших долг в незаявленной категории ($MARRIAGE$ = 0) меньше, чем в других." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.barplot(x='EDUCATION', y='target', data=data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Доля неуплативших долг людей со средним образованием ($EDUCATION$ = 3) больше, чем в остальных. Причем в незаявленной категории\n", "\n", "($EDUCATION$ = 0) неуплативших нет." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = data.rename(columns = {'PAY_0':'PAY_1'})\n", "pays = ['PAY_%s' % i for i in range(1,7)]\n", "for i, pay in enumerate(pays):\n", " plt.subplot(3, 2, i + 1)\n", " sns.barplot(x=pay,y='target', data=data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Можно заметить, что чем больше задержка оплаты долга, тем вероятней его неуплата." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corr_matrix = data.drop('ID', axis=1).corr()\n", "sns.heatmap(corr_matrix, cmap=\"Blues\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Видно, что признаки $BILL\\_AMTi, i = 1,...6$ (сумма выписки по счету в $i$-том месяце) коррелируют между собой, как и признаки $PAY\\_i, i = 1,...,6$ (статус платежа в $i$-том месяце)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bills = ['BILL_AMT%s' % i for i in range(1,7)]\n", "plots = data[bills].hist(figsize=(12,10))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pay_amts = ['PAY_AMT%s' % i for i in range(1,7)]\n", "plots = data[pay_amts].hist(figsize=(12,10), bins = 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Выбор метрики и модели" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Так как в выборке есть дисбаланс данных, то в качестве целевой метрики будем использовать *roc_auc_score*, так же, для контроля, будем использовать *f1_score* и *accuracy*. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Для построения модели будем использовать три разных классификатора: логистическую регрессию, случайный лес и градиентный бустинг." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n", "from sklearn.linear_model import LogisticRegression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def learning_model(model, X_train, y_train, X_test, y_test):\n", " model.fit(X_train, y_train)\n", " y_pred = model.predict(X_test)\n", " y_roc = model.predict_proba(X_test)[:,1]\n", " score_roc = roc_auc_score(y_test, y_roc)\n", " score_f1 = f1_score(y_test, y_pred, average='weighted')\n", " score_acc = accuracy_score(y_test, y_pred)\n", " print('roc_auc = {}'.format(score_roc))\n", " print('f1_score = {}'.format(score_f1))\n", " print('acuracy_score={}'.format(score_acc))\n", " return y_roc, score_roc, score_f1, score_acc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Так как выборка не сбалансирована, то будем использовать дополнительную настройку для случайного леса и логистической регрессии - *class_weight='balanced'*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forest = RandomForestClassifier(n_estimators=50, random_state=17, \n", " class_weight='balanced')\n", "logit = LogisticRegression(random_state=17, class_weight= 'balanced')\n", "boosting = GradientBoostingClassifier(random_state=17)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5. Предобработка данных и создание новых признаков" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Удалим столбец $ID$, для дальнейшего исседования он не нужен. \n", "Признак $SEX$ перекодируем в значения 0 ($SEX$=2) и 1 ($SEX$=1) и переименуем этот признак в $male$ (0 – женщина, 1 – мужчина)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = data.drop('ID', axis=1)\n", "data['SEX'] = data['SEX'].apply(lambda x: 0 if x==2 else 1 )\n", "data = data.rename(columns = {'SEX':'male'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Можно заметить, что переменные $BILL\\_AMTi, i = 1,...6$ сильно коррелируют между собой, создадим переменную $balance$, в которую запишем разницу между $BILL\\_AMTi$ сумма выписки по счету в $i$-том месяце) и $PAY\\_AMT\\_i$ (выплаченная сумма в $i$-том месяце)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "balances = ['balance_%s' % i for i in range(1,7)]\n", "\n", "for i in range(6):\n", " data[balances[i]] = data[bills[i]] - data[pay_amts[i]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Скорее всего хорошие клиенты никогда (или редко) задерживают платеж, тогда у клиентов с низким средним значением $PAY\\_i, i = 1,...,6$ вероятнее не будет неуплаты платежа в следующем месяце. Создадим бинарный признак $good\\_client$, где 1 - среднее значение < 0, 0 - среднее значение > 0. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data['pay_mean'] = (data['PAY_1'] + data['PAY_2'] + data['PAY_3'] + data['PAY_4'] + data['PAY_5']\\\n", " + data['PAY_6'] ) / 6\n", "data['good_client'] = data['pay_mean'].apply(lambda x: 1 if x <= 0 else 0).astype('int')\n", "data = data.drop('pay_mean', axis=1)\n", "sns.barplot(x='good_client', y='target', data=data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Судя по графику разница между \"хорошими\" клиентами и \"плохими\" есть, ее влияние на предсказание оценим ниже." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Из категориальных переменных $EDUCATION$, $MARRIAGE$ и $PAY\\_i$ (статус погашения в $i$-том месяце) сделаем дамми-переменные" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "col_for_dum = ['EDUCATION', 'MARRIAGE'] + pays\n", "data = pd.get_dummies(data, columns=col_for_dum )\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Разобьем выборку на тестовую и обучающую." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(data.drop('target', axis=1), data['target'], \n", " test_size=0.3, \n", " random_state=17,\n", " stratify = data['target'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Отдельно маштабируем признаки $PAY\\_AMT\\_i$ (выплаченная сумма в $i$-том месяце), $AGE$, $LIMIT\\_BAL$ и новые колонки $balance\\_i$ в тестовой и отложенной выборке." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scaler = StandardScaler()\n", "X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)\n", "X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_train.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Построим предсказания на заданных классификаторах без настройки гиперпараметров." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('RandomForest:') \n", "learning_model(forest, X_train_scaled, y_train, X_test_scaled, y_test);\n", "print('LogisticRegression:')\n", "learning_model(logit, X_train_scaled, y_train, X_test_scaled, y_test);\n", "print('GradientBoostingClassifier:')\n", "learning_model(boosting, X_train_scaled, y_train, X_test_scaled, y_test);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Оценим важность признаков с помощью дерева решений." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeClassifier" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tree_sel = DecisionTreeClassifier(random_state=17)\n", "tree_sel.fit(X_train_scaled, y_train)\n", "features = pd.DataFrame(tree_sel.feature_importances_,\n", " index=X_train.columns, \n", " columns=['Importance'])\n", "\n", "features.sort_values(by = 'Importance', ascending=False).head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Наибольшую важность имеют признаки $good\\_client$, $AGE$, $LIMIT\\_BAL$, созданные признаки $balance\\_i$ так же входят в 20 самых важных признаков." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "f_sort = features.sort_values(by = 'Importance', ascending=False)\n", "plt.plot(range(len(f_sort.Importance.tolist())), \n", " f_sort.Importance.tolist())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Отберем только 45 самых важных признаков." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "col = f_sort.iloc[:45].T.columns\n", "X_train_new = X_train_scaled[col]\n", "X_test_new = X_test_scaled[col]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6. Кросс-валидация и настройка гиперпараметров модели" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sss = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=17)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "logit_parameters = {'C': np.logspace(0, 1, 10)}\n", "\n", "boosting_parameters = {'n_estimators': [100, 150],\n", " 'learning_rate': [0.01, 0.05, 0.5],\n", " 'max_depth': [5, 6, 7],\n", " 'max_features': [0.3, 0.5, 0.75, 1.0], \n", " 'min_samples_leaf': [4, 5, 6]}\n", "\n", "forest_parameters = {'n_estimators': [50, 75, 100],\n", " 'max_depth': [3, 6, 9, 12],\n", " 'max_features': [0.5, 0.75, 1.0]}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grid_logit = GridSearchCV(logit, logit_parameters, n_jobs=-1, scoring ='roc_auc', cv=sss)\n", "\n", "grid_boosting = GridSearchCV(boosting, boosting_parameters, n_jobs=-1, scoring ='roc_auc', cv=sss)\n", "\n", "grid_forest = GridSearchCV(forest, forest_parameters, n_jobs=-1, scoring ='roc_auc', cv=sss)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_score = [] \n", "roc_scores, f1_scores, acc_scores = [], [], []\n", "for grid in (grid_logit, grid_boosting, grid_forest):\n", " y, roc, f1, acc = learning_model(grid, X_train_new, y_train, \n", " X_test_new , y_test);\n", " y_score.append(y)\n", " roc_scores.append(roc)\n", " f1_scores.append(f1)\n", " acc_scores.append(acc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Лучший результат показал градиентный бустинг с $roc\\_auc = 0.786$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "max_feat = grid_boosting.best_params_['max_features']\n", "min_leaf = grid_boosting.best_params_['min_samples_leaf']\n", "n_est = grid_boosting.best_params_['n_estimators']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7. Построение кривых валидации и обучения" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import learning_curve, validation_curve\n", "\n", "def plot_learning_curve(estimator, title, X, y,label_y, ylim=None, cv=None, scoring='roc_auc', \n", " n_jobs=1, train_sizes=np.linspace(.1, 1.0, 10)):\n", " plt.figure()\n", " plt.title(title)\n", " if ylim is not None:\n", " plt.ylim(*ylim)\n", " plt.xlabel(\"Training examples\")\n", " plt.ylabel(label_y)\n", " train_sizes, train_scores, test_scores = learning_curve(\n", " estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, scoring=scoring)\n", " train_scores_mean = np.mean(train_scores, axis=1)\n", " train_scores_std = np.std(train_scores, axis=1)\n", " test_scores_mean = np.mean(test_scores, axis=1)\n", " test_scores_std = np.std(test_scores, axis=1)\n", " plt.grid()\n", "\n", " plt.fill_between(train_sizes, train_scores_mean - train_scores_std,\n", " train_scores_mean + train_scores_std, alpha=0.1,\n", " color=\"r\")\n", " plt.fill_between(train_sizes, test_scores_mean - test_scores_std,\n", " test_scores_mean + test_scores_std, alpha=0.1, color=\"g\")\n", " plt.plot(train_sizes, train_scores_mean, 'o-', color=\"r\",\n", " label=\"Training score\")\n", " plt.plot(train_sizes, test_scores_mean, 'o-', color=\"g\",\n", " label=\"Cross-validation score\")\n", "\n", " plt.legend(loc=\"best\")\n", " return plt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_learning_curve(GradientBoostingClassifier(random_state=17, learning_rate=0.05, max_depth=6,\n", " n_estimators=n_est,\n", " min_samples_leaf=min_leaf, max_features=max_feat), \n", " 'Gradient Boosting',\n", " X_train_new, y_train, scoring='roc_auc', cv=sss, n_jobs=-1, label_y='roc_auc')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_learning_curve(GradientBoostingClassifier(random_state=17, learning_rate=0.05, max_depth=6,\n", " n_estimators=n_est,\n", " min_samples_leaf=min_leaf, max_features=max_feat), \n", " 'Gradient Boosting',\n", " X_train_new, y_train, scoring='f1', cv=sss, n_jobs=-1, label_y='f1')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_learning_curve(GradientBoostingClassifier(random_state=17, learning_rate=0.05, max_depth=6,\n", " n_estimators=n_est,\n", " min_samples_leaf=min_leaf, max_features=max_feat), \n", " 'Gradient Boosting',\n", " X_train_new, y_train, scoring='accuracy', cv=sss, n_jobs=-1, label_y='accuracy')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_validation_curve(estimator, X, y, cv_param_name, cv_param_values,ylim=None, cv=None,\n", " scoring='roc_auc',\n", " n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 10)):\n", " plt.figure()\n", " if ylim is not None:\n", " plt.ylim(*ylim)\n", " plt.xlabel(cv_param_name) \n", " plt.ylabel(\"Score\")\n", " train_sizes, train_scores, test_scores = learning_curve(\n", " estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, scoring=scoring)\n", " val_train, val_test = validation_curve(estimator, X, y, cv_param_name,\n", " cv_param_values, cv=cv,\n", " scoring=scoring)\n", " val_train_mean = np.mean(val_train, axis=1)\n", " val_train_std = np.std(val_train, axis=1)\n", " val_test_mean = np.mean(val_test, axis=1)\n", " val_test_std = np.std(val_test, axis=1)\n", " plt.grid()\n", "\n", " plt.fill_between(cv_param_values, val_train_mean - val_train_std,\n", " val_train_mean + val_train_std, alpha=0.1,\n", " color=\"r\")\n", " plt.fill_between(cv_param_values, val_test_mean - val_test_std,\n", " val_test_mean + val_test_std, alpha=0.1, color=\"g\")\n", " plt.plot(cv_param_values, val_train_mean, 'o-', color=\"r\",\n", " label=\"Training score\")\n", " plt.plot(cv_param_values, val_test_mean, 'o-', color=\"g\",\n", " label=\"Cross-validation score\")\n", "\n", " plt.legend(loc=\"best\")\n", " return plt\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "estimators = np.arange(25, 425, 25)\n", "plot_validation_curve(GradientBoostingClassifier(random_state=17, learning_rate=0.05, max_depth=6,\n", " min_samples_leaf =min_leaf, max_features=max_feat), \n", " X_train_new, y_train, \n", " cv_param_name='n_estimators', \n", " cv_param_values= estimators,\n", " scoring='roc_auc')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "depth = np.arange(3, 25)\n", "plot_validation_curve(GradientBoostingClassifier(random_state=17, learning_rate=0.05, \n", " n_estimators=n_est,\n", " min_samples_leaf =min_leaf, max_features=max_feat), \n", " X_train_new, y_train, \n", " cv_param_name='max_depth', \n", " cv_param_values= depth,\n", " scoring='roc_auc')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn_rate = np.linspace(0.01, 1.0, 10)\n", "plot_validation_curve(GradientBoostingClassifier(random_state=17, max_depth=6,\n", " n_estimators=n_est,\n", " min_samples_leaf =min_leaf, max_features=max_feat), \n", " X_train_new, y_train, \n", " cv_param_name='learning_rate', \n", " cv_param_values= learn_rate,\n", " scoring='roc_auc')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "min_l = np.arange(3, 10)\n", "plot_validation_curve(GradientBoostingClassifier(random_state=17, max_depth=6,\n", " n_estimators=n_est,max_features=max_feat), \n", " X_train_new, y_train, \n", " cv_param_name='min_samples_leaf', \n", " cv_param_values= min_l,\n", " scoring='roc_auc')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 8. Прогноз для тестовой или отложенной выборке" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "labels = ['LogisticRegression', 'GradientBoosting', 'RandomForest']\n", "from sklearn.metrics import roc_curve, auc\n", "for i in range(3):\n", " # Compute ROC curve and ROC area\n", " fpr, tpr, _ = roc_curve(y_test, y_score[i])\n", " roc_auc = auc(fpr, tpr)\n", " # Plot of a ROC curve for a specific class\n", " plt.figure(1)\n", " plt.plot(fpr, tpr, label=labels[i]+', ROC curve (area = %0.2f)' % roc_auc)\n", " plt.plot([0, 1], [0, 1], 'k--')\n", " plt.xlim([0.0, 1.0])\n", " plt.ylim([0.0, 1.05])\n", " plt.xlabel('False Positive Rate')\n", " plt.ylabel('True Positive Rate')\n", " plt.legend(loc=\"lower right\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import itertools\n", "def plot_confusion_matrix(cm, classes,\n", " normalize=False,\n", " title='Confusion matrix',\n", " cmap=plt.cm.Blues):\n", " \"\"\"\n", " This function prints and plots the confusion matrix.\n", " Normalization can be applied by setting `normalize=True`.\n", " \"\"\"\n", " plt.imshow(cm, interpolation='nearest', cmap=cmap)\n", " plt.title(title)\n", " plt.colorbar()\n", " tick_marks = np.arange(len(classes))\n", " plt.xticks(tick_marks, classes, rotation=45)\n", " plt.yticks(tick_marks, classes)\n", "\n", " if normalize:\n", " cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n", " print(\"Normalized confusion matrix\")\n", " else:\n", " print('Confusion matrix, without normalization')\n", "\n", " #print(cm)\n", "\n", " thresh = cm.max() / 2.\n", " for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n", " plt.text(j, i, cm[i, j],\n", " horizontalalignment=\"center\",\n", " color=\"white\" if cm[i, j] > thresh else \"black\")\n", "\n", " plt.tight_layout()\n", " plt.ylabel('True label')\n", " plt.xlabel('Predicted label')\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "font = {'size' : 15}\n", "plt.rc('font', **font)\n", "cnf_matrix = confusion_matrix(y_test, grid_logit.predict(X_test_new))\n", "plt.figure(figsize=(8, 6))\n", "plot_confusion_matrix(cnf_matrix, classes=['No default', 'default'], title=labels[0]+' Confusion matrix')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "font = {'size' : 15}\n", "plt.rc('font', **font)\n", "cnf_matrix = confusion_matrix(y_test, grid_boosting.predict(X_test_new))\n", "plt.figure(figsize=(8, 6))\n", "plot_confusion_matrix(cnf_matrix, classes=['No default', 'default'], title=labels[1]+' Confusion matrix')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "font = {'size' : 15}\n", "plt.rc('font', **font)\n", "cnf_matrix = confusion_matrix(y_test, grid_forest.predict(X_test_new))\n", "plt.figure(figsize=(8, 6))\n", "plot_confusion_matrix(cnf_matrix, classes=['No default', 'default'], title=labels[2]+' Confusion matrix')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scores = pd.DataFrame({'roc_auc_score': roc_scores, 'f1_score': f1_scores, 'accuracy': acc_scores}, \n", " index=labels)\n", "scores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 9. Выводы" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Построены модели предсказания оплаты долга клиентом в следующем месяце по информации о предыдущих шести месяцев. Модель градиентного бустинга дает примерно 79% точности на отложенных 30% выборки, но по матрицам ошибок видно, что градиентный бустинг сильно ошибается, предсказывая класс \"нет уплаты\" (*\"No default\"*), когда на самом деле неуплата (*\"default\"*) происходит. \n", "\n", "Были построены кривые обучения и валидационные кривые. Кривые обучения показывают, что при увеличении размера выборки, кривые будут сходиться. В данном случае, объем выборки недостаточен.\n", "\n", "По валидационным кривым видно, что гиперпараметры, подобранные с помощью GridSearchCV являются оптимальными." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }