{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Stacking Ensemble\n", "> A tutorial of stacking ensemble (a.k.a. stacked generalization)\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- categories: [notebook, kaggle]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook was originally published [here](https://www.kaggle.com/jeongyoonlee/stacking-ensemble) at Kaggle.\n", "\n", "---\n", "\n", "This notebook shows how to perform stacking ensemble (a.k.a. stacked generalization).\n", "\n", "In [Ensemble-learning meta-classifier for stacking](https://www.kaggle.com/remekkinas/ensemble-learning-meta-classifier-for-stacking), @remekkinas shares how to do stacking ensemble using `MLExtend'`s `StackingCVClassifier`.\n", "\n", "To demonstrate how stacking works, this notebook shows how to prepare the baseline model predictions using cross-validation (CV), then use them for level-2 stacking. It trains four classifiers, Random Forests, Extremely Randomized Trees, LightGBM, and CatBoost as level-1 base models. It also uses CV predictions of two models, LightGBM with DAE features and supervised DAE trained from my previous notebook, [Supervised Emphasized Denoising AutoEncoder](https://www.kaggle.com/jeongyoonlee/supervised-emphasized-denoising-autoencoder) to show why keeping CV predictions for **every** model is important. :)\n", "\n", "The contents of this notebook are as follows:\n", "1. **Feature Engineering**: Same as in the [Supervised Emphasized Denoising AutoEncoder](https://www.kaggle.com/jeongyoonlee/supervised-emphasized-denoising-autoencoder) and [AutoEncoder + Pseudo Label + AutoLGB](https://www.kaggle.com/jeongyoonlee/autoencoder-pseudo-label-autolgb).\n", "2. **Level-1 Base Model Training**: Training four base models, Random Forests, Extremely Randomized Trees, LightGBM, and CatBoost using the same 5-fold CV.\n", "3. **Level-2 Stacking**: Training the LightGBM model with CV predictions of base models, original features, and DAE features. Performing feature selection and hyperparameter optimization using `Kaggler`'s `AutoLGB`.\n", "\n", "This notebook is inspired and/or based on other Kagglers' notebooks as follows:\n", "* [TPS-APR21-EDA+MODEL](https://www.kaggle.com/udbhavpangotra/tps-apr21-eda-model) by @udbhavpangotra\n", "* [Ensemble-learning meta-classifier for stacking](https://www.kaggle.com/remekkinas/ensemble-learning-meta-classifier-for-stacking) by @remekkinas\n", "* [TPS Apr 2021 pseudo labeling/voting ensemble](https://www.kaggle.com/hiro5299834/tps-apr-2021-pseudo-labeling-voting-ensemble?scriptVersionId=60616606) by @hiro5299834\n", "\n", "Thanks!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 1: Data Loading & Feature Engineering" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_kg_hide-input": true }, "outputs": [], "source": [ "from catboost import CatBoostClassifier\n", "from joblib import dump\n", "import lightgbm as lgb\n", "from lightgbm import LGBMClassifier\n", "from matplotlib import pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "from pathlib import Path\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.ensemble import ExtraTreesClassifier\n", "from sklearn.metrics import roc_auc_score, confusion_matrix\n", "from sklearn.model_selection import StratifiedKFold\n", "from sklearn.preprocessing import StandardScaler\n", "import warnings" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_kg_hide-input": true, "_kg_hide-output": true }, "outputs": [], "source": [ "!pip install kaggler" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_kg_hide-input": true }, "outputs": [], "source": [ "import kaggler\n", "from kaggler.model import AutoLGB\n", "from kaggler.preprocessing import LabelEncoder\n", "\n", "print(f'Kaggler: {kaggler.__version__}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_kg_hide-input": true }, "outputs": [], "source": [ "warnings.simplefilter('ignore')\n", "pd.set_option('max_columns', 100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_dir = Path('/kaggle/input/tabular-playground-series-apr-2021/')\n", "trn_file = data_dir / 'train.csv'\n", "tst_file = data_dir / 'test.csv'\n", "sample_file = data_dir / 'sample_submission.csv'\n", "pseudo_label_file = '/kaggle/input/tps-apr-2021-pseudo-label-dae/tps04-sub-006.csv'\n", "dae_feature_file = '/kaggle/input/tps-apr-2021-pseudo-label-dae/dae.csv'\n", "lgb_dae_predict_val_file = '/kaggle/input/tps-apr-2021-pseudo-label-dae/lgb_dae.val.txt'\n", "lgb_dae_predict_tst_file = '/kaggle/input/tps-apr-2021-pseudo-label-dae/lgb_dae.tst.txt'\n", "sdae_dae_predict_val_file = '/kaggle/input/tps-apr-2021-pseudo-label-dae/sdae_dae.val.txt'\n", "sdae_dae_predict_tst_file = '/kaggle/input/tps-apr-2021-pseudo-label-dae/sdae_dae.tst.txt'\n", "\n", "target_col = 'Survived'\n", "id_col = 'PassengerId'\n", "\n", "feature_name = 'dae'\n", "algo_name = 'esb'\n", "model_name = f'{algo_name}_{feature_name}'\n", "\n", "feature_file = f'{feature_name}.csv'\n", "predict_val_file = f'{model_name}.val.txt'\n", "predict_tst_file = f'{model_name}.tst.txt'\n", "submission_file = f'{model_name}.sub.csv'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n_fold = 5\n", "seed = 42\n", "n_est = 1000\n", "encoding_dim = 128" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trn = pd.read_csv(trn_file, index_col=id_col)\n", "tst = pd.read_csv(tst_file, index_col=id_col)\n", "sub = pd.read_csv(sample_file, index_col=id_col)\n", "pseudo_label = pd.read_csv(pseudo_label_file, index_col=id_col)\n", "dae_features = np.loadtxt(dae_feature_file, delimiter=',')\n", "lgb_dae_predict_val = np.loadtxt(lgb_dae_predict_val_file)\n", "lgb_dae_predict_tst = np.loadtxt(lgb_dae_predict_tst_file)\n", "sdae_dae_predict_val = np.loadtxt(sdae_dae_predict_val_file)\n", "sdae_dae_predict_tst = np.loadtxt(sdae_dae_predict_tst_file)\n", "\n", "print(trn.shape, tst.shape, sub.shape, pseudo_label.shape, dae_features.shape)\n", 
"print(lgb_dae_predict_val.shape, lgb_dae_predict_tst.shape)\n", "print(sdae_dae_predict_val.shape, sdae_dae_predict_tst.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tst[target_col] = pseudo_label[target_col]\n", "n_trn = trn.shape[0]\n", "df = pd.concat([trn, tst], axis=0)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Loading 128 DAE features generated from [Supervised Emphasized Denoising AutoEncoder](https://www.kaggle.com/jeongyoonlee/supervised-emphasized-denoising-autoencoder/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_dae = pd.DataFrame(dae_features, columns=[f'enc_{x}' for x in range(encoding_dim)])\n", "print(df_dae.shape)\n", "df_dae.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Feature engineering using @udbhavpangotra's [code](https://www.kaggle.com/udbhavpangotra/tps-apr21-eda-model)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Feature engineering code from https://www.kaggle.com/udbhavpangotra/tps-apr21-eda-model\n", "\n", "df['Embarked'] = df['Embarked'].fillna('No')\n", "df['Cabin'] = df['Cabin'].fillna('_')\n", "df['CabinType'] = df['Cabin'].apply(lambda x:x[0])\n", "df.Ticket = df.Ticket.map(lambda x:str(x).split()[0] if len(str(x).split()) > 1 else 'X')\n", "\n", "df['Age'].fillna(round(df['Age'].median()), inplace=True,)\n", "df['Age'] = df['Age'].apply(round).astype(int)\n", "\n", "# Fare, fillna with mean value\n", "fare_map = df[['Fare', 'Pclass']].dropna().groupby('Pclass').median().to_dict()\n", "df['Fare'] = df['Fare'].fillna(df['Pclass'].map(fare_map['Fare']))\n", "\n", "df['FirstName'] = df['Name'].str.split(', ').str[0]\n", "df['SecondName'] = df['Name'].str.split(', ').str[1]\n", "\n", "df['n'] = 1\n", "\n", "gb = df.groupby('FirstName')\n", "df_names = gb['n'].sum()\n", "df['SameFirstName'] = df['FirstName'].apply(lambda x:df_names[x]).fillna(1)\n", "\n", "gb = df.groupby('SecondName')\n", "df_names = gb['n'].sum()\n", "df['SameSecondName'] = df['SecondName'].apply(lambda x:df_names[x]).fillna(1)\n", "\n", "df['Sex'] = (df['Sex'] == 'male').astype(int)\n", "\n", "df['FamilySize'] = df.SibSp + df.Parch + 1\n", "\n", "feature_cols = ['Pclass', 'Age','Embarked','Parch','SibSp','Fare','CabinType','Ticket','SameFirstName', 'SameSecondName', 'Sex',\n", " 'FamilySize', 'FirstName', 'SecondName']\n", "cat_cols = ['Pclass','Embarked','CabinType','Ticket', 'FirstName', 'SecondName']\n", "num_cols = [x for x in feature_cols if x not in cat_cols]\n", "print(len(feature_cols), len(cat_cols), len(num_cols))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Applying `log2(1 + x)` for numerical features and label-encoding categorical features using `kaggler.preprocessing.LabelEncoder`, which handles `NaN`s and groups rare categories together." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for col in ['SameFirstName', 'SameSecondName', 'Fare', 'FamilySize', 'Parch', 'SibSp']:\n", " df[col] = np.log2(1 + df[col])\n", " \n", "scaler = StandardScaler()\n", "df[num_cols] = scaler.fit_transform(df[num_cols])\n", "\n", "lbe = LabelEncoder(min_obs=50)\n", "df[cat_cols] = lbe.fit_transform(df[cat_cols]).astype(int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 2: Level-1 Base Model Training" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Model params from https://www.kaggle.com/remekkinas/ensemble-learning-meta-classifier-for-stacking by remekkinas\n", "\n", "lgb_params = {\n", " 'metric': 'binary_logloss',\n", " 'n_estimators': n_est,\n", " 'objective': 'binary',\n", " 'random_state': seed,\n", " 'learning_rate': 0.01,\n", " 'min_child_samples': 20,\n", " 'reg_alpha': 3e-5,\n", " 'reg_lambda': 9e-2,\n", " 'num_leaves': 63,\n", " 'colsample_bytree': 0.8,\n", " 'subsample': 0.8,\n", "}\n", "\n", "ctb_params = {\n", " 'bootstrap_type': 'Poisson',\n", " 'loss_function': 'Logloss',\n", " 'eval_metric': 'Logloss',\n", " 'random_seed': seed,\n", " 'task_type': 'GPU',\n", " 'max_depth': 8,\n", " 'learning_rate': 0.01,\n", " 'n_estimators': n_est,\n", " 'max_bin': 280,\n", " 'min_data_in_leaf': 64,\n", " 'l2_leaf_reg': 0.01,\n", " 'subsample': 0.8\n", "}\n", "\n", "rf_params = {\n", " 'max_depth': 15,\n", " 'min_samples_leaf': 8,\n", " 'random_state': seed\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "base_models = {'rf': RandomForestClassifier(**rf_params), \n", " 'cbt': CatBoostClassifier(**ctb_params, verbose=None, logging_level='Silent'),\n", " 'lgb': LGBMClassifier(**lgb_params),\n", " 'et': ExtraTreesClassifier(bootstrap=True, criterion='entropy', max_features=0.55, min_samples_leaf=8, min_samples_split=4, n_estimators=100)}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make sure that you use the same CV folds across all level-1 models." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from copy import copy\n", "\n", "X = pd.concat((df[feature_cols], df_dae), axis=1)\n", "y = df[target_col]\n", "X_tst = X.iloc[n_trn:]\n", "\n", "cv = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=seed)\n", "\n", "p_dict = {}\n", "for name in base_models:\n", " print(f'Training {name}:')\n", " p = np.zeros_like(y, dtype=float)\n", " p_tst = np.zeros((tst.shape[0],))\n", " for i, (i_trn, i_val) in enumerate(cv.split(X, y)):\n", " clf = copy(base_models[name])\n", " clf.fit(X.iloc[i_trn], y[i_trn])\n", " \n", " p[i_val] = clf.predict_proba(X.iloc[i_val])[:, 1]\n", " print(f'\\tCV #{i + 1} AUC: {roc_auc_score(y[i_val], p[i_val]):.6f}')\n", "\n", " p_dict[name] = p\n", " print(f'\\tCV AUC: {roc_auc_score(y, p):.6f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Adding CV predictions of two additional models trained separately. 
You can use all models trained throughout the competition as long as they are trained with the same CV folds.\n", "\n", "**ALWAYS SAVE CV PREDICTIONS!!!**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p_dict.update({\n", "    'lgb_dae': lgb_dae_predict_val,\n", "    'sdae_dae': sdae_dae_predict_val\n", "})\n", "\n", "dump(p_dict, 'predict_val_dict.joblib')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 3: Level-2 Stacking" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Training a level-2 LightGBM model with the level-1 CV predictions, the original features, and the DAE features as inputs. If you have enough level-1 model predictions, you can train level-2 models with the level-1 predictions alone. Here, since we only have six level-1 models, we use additional features and perform feature selection." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = pd.concat([pd.DataFrame(p_dict), df[feature_cols], df_dae], axis=1)\n", "X_tst = X.iloc[n_trn:]\n", "\n", "p = np.zeros_like(y, dtype=float)\n", "p_tst = np.zeros((tst.shape[0],))\n", "print('Training a stacking ensemble LightGBM model:')\n", "for i, (i_trn, i_val) in enumerate(cv.split(X, y)):\n", "    # Tune features and hyperparameters with AutoLGB on the first fold only.\n", "    if i == 0:\n", "        clf = AutoLGB(objective='binary', metric='auc', random_state=seed)\n", "        clf.tune(X.iloc[i_trn], y[i_trn])\n", "        features = clf.features\n", "        params = clf.params\n", "        n_best = clf.n_best\n", "        print(n_best)\n", "        print(params)\n", "        print(features)\n", "\n", "    trn_data = lgb.Dataset(X.iloc[i_trn], y[i_trn])\n", "    val_data = lgb.Dataset(X.iloc[i_val], y[i_val])\n", "    clf = lgb.train(params, trn_data, n_best, val_data, verbose_eval=100)\n", "    p[i_val] = clf.predict(X.iloc[i_val])\n", "    p_tst += clf.predict(X_tst) / n_fold\n", "    print(f'CV #{i + 1} AUC: {roc_auc_score(y[i_val], p[i_val]):.6f}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f' CV AUC: {roc_auc_score(y, p):.6f}')\n", "# Note: measured against the pseudo labels, not the true test labels.\n", "print(f'Test AUC: {roc_auc_score(pseudo_label[target_col], p_tst):.6f}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Pick the threshold so that the positive rate of the binarized predictions is about 34.911%.\n", "n_pos = int(0.34911 * tst.shape[0])\n", "th = sorted(p_tst, reverse=True)[n_pos]\n", "print(th)\n", "confusion_matrix(pseudo_label[target_col], (p_tst > th).astype(int))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sub[target_col] = (p_tst > th).astype(int)\n", "sub.to_csv(submission_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you find this notebook useful, please upvote it and leave your feedback. It will be greatly appreciated!\n", "\n", "Also please check out my previous notebooks:\n", "* [AutoEncoder + Pseudo Label + AutoLGB](https://www.kaggle.com/jeongyoonlee/autoencoder-pseudo-label-autolgb): shows how to build a basic AutoEncoder using Keras, and how to perform automated feature selection and hyperparameter optimization using `Kaggler`'s `AutoLGB`.\n", "* [Supervised Emphasized Denoising AutoEncoder](https://www.kaggle.com/jeongyoonlee/supervised-emphasized-denoising-autoencoder): shows how to build a more sophisticated version of an AutoEncoder, called the supervised emphasized Denoising AutoEncoder (DAE), which trains the DAE and a classifier simultaneously.\n", "\n", "Happy Kaggling! 
;)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }