{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# AutoEncoder + Pseudo Label + AutoLGB\n", "> A tutorial of applying AutoEncoder and Kaggler's AutoLGB\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- categories: [notebook, kaggle]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook was originally published [here](https://www.kaggle.com/jeongyoonlee/autoencoder-pseudo-label-autolgb) at Kaggle.\n", "\n", "---\n", "\n", "In this notebook, I will show how to use autoencoder, feature selection, hyperparameter optimization, and pseudo labeling using the `Keras` and `Kaggler` Python packages.\n", "\n", "The contents of the notebook are as follows:\n", "1. **Package installation**: Installing latest version of `Kaggler` using `Pip`\n", "2. **Regular feature engineering**: [code](https://www.kaggle.com/udbhavpangotra/tps-apr21-eda-model) by @udbhavpangotra\n", "3. **Feature transformation**: Using `kaggler.preprocessing.LabelEncoder` to impute missing values and group rare categories automatically.\n", "4. **Stacked AutoEncoder**: Notebooks for DAE will be shared later.\n", "5. **Model training**: with 5-fold CV and pseudo label from @hiro5299834's [data](https://www.kaggle.com/hiro5299834/tps-apr-2021-voting-pseudo-labeling).\n", "6. **Feature selection and hyperparameter optimization**: Using `kaggler.model.AutoLGB`\n", "7. 
**Saving a submission file**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load libraries and install `Kaggler`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5" }, "outputs": [], "source": [ "# This Python 3 environment comes with many helpful analytics libraries installed\n", "# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python\n", "# For example, here's several helpful packages to load\n", "\n", "import numpy as np # linear algebra\n", "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n", "\n", "# Input data files are available in the read-only \"../input/\" directory\n", "# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\n", "\n", "import os\n", "for dirname, _, filenames in os.walk('/kaggle/input'):\n", " for filename in filenames:\n", " print(os.path.join(dirname, filename))\n", "\n", "# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using \"Save & Run All\" \n", "# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import lightgbm as lgb\n", "from matplotlib import pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "from pathlib import Path\n", "import tensorflow as tf\n", "from tensorflow import keras\n", "import seaborn as sns\n", "from sklearn.model_selection import StratifiedKFold\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.metrics import roc_auc_score, confusion_matrix\n", "import warnings" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ 
"!pip install kaggler" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import kaggler\n", "from kaggler.model import AutoLGB\n", "from kaggler.preprocessing import LabelEncoder\n", "\n", "print(f'Kaggler: {kaggler.__version__}')\n", "print(f'TensorFlow: {tf.__version__}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "warnings.simplefilter('ignore')\n", "plt.style.use('fivethirtyeight')\n", "pd.set_option('max_columns', 100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Engineering (ref: [code](https://www.kaggle.com/udbhavpangotra/tps-apr21-eda-model) by @udbhavpangotra)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_name = 'ae'\n", "algo_name = 'lgb'\n", "model_name = f'{algo_name}_{feature_name}'\n", "\n", "data_dir = Path('/kaggle/input/tabular-playground-series-apr-2021/')\n", "trn_file = data_dir / 'train.csv'\n", "tst_file = data_dir / 'test.csv'\n", "sample_file = data_dir / 'sample_submission.csv'\n", "pseudo_label_file = '/kaggle/input/tps-apr-2021-label/voting_submission_from_5_best.csv'\n", "\n", "feature_file = f'{feature_name}.csv'\n", "predict_val_file = f'{model_name}.val.txt'\n", "predict_tst_file = f'{model_name}.tst.txt'\n", "submission_file = f'{model_name}.sub.csv'\n", "\n", "target_col = 'Survived'\n", "id_col = 'PassengerId'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trn = pd.read_csv(trn_file, index_col=id_col)\n", "tst = pd.read_csv(tst_file, index_col=id_col)\n", "sub = pd.read_csv(sample_file, index_col=id_col)\n", "pseudo_label = pd.read_csv(pseudo_label_file, index_col=id_col)\n", "print(trn.shape, tst.shape, sub.shape, pseudo_label.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tst[target_col] = pseudo_label[target_col]\n", "n_trn = 
trn.shape[0]\n", "df = pd.concat([trn, tst], axis=0)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.nunique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Feature engineering code from https://www.kaggle.com/udbhavpangotra/tps-apr21-eda-model\n", "\n", "df['Embarked'] = df['Embarked'].fillna('No')\n", "df['Cabin'] = df['Cabin'].fillna('_')\n", "df['CabinType'] = df['Cabin'].apply(lambda x:x[0])\n", "df.Ticket = df.Ticket.map(lambda x:str(x).split()[0] if len(str(x).split()) > 1 else 'X')\n", "\n", "df['Age'].fillna(round(df['Age'].median()), inplace=True,)\n", "df['Age'] = df['Age'].apply(round).astype(int)\n", "\n", "df['Fare'].fillna(round(df['Fare'].median()), inplace=True,)\n", "\n", "df['FirstName'] = df['Name'].str.split(', ').str[0]\n", "df['SecondName'] = df['Name'].str.split(', ').str[1]\n", "\n", "df['n'] = 1\n", "\n", "gb = df.groupby('FirstName')\n", "df_names = gb['n'].sum()\n", "df['SameFirstName'] = df['FirstName'].apply(lambda x:df_names[x])\n", "\n", "gb = df.groupby('SecondName')\n", "df_names = gb['n'].sum()\n", "df['SameSecondName'] = df['SecondName'].apply(lambda x:df_names[x])\n", "\n", "df['Sex'] = (df['Sex'] == 'male').astype(int)\n", "\n", "df['FamilySize'] = df.SibSp + df.Parch + 1\n", "\n", "feature_cols = ['Pclass', 'Age','Embarked','Parch','SibSp','Fare','CabinType','Ticket','SameFirstName', 'SameSecondName', 'Sex',\n", " 'FamilySize', 'FirstName', 'SecondName']\n", "cat_cols = ['Pclass','Embarked','CabinType','Ticket', 'FirstName', 'SecondName']\n", "num_cols = [x for x in feature_cols if x not in cat_cols]\n", "print(len(feature_cols), len(cat_cols), len(num_cols))" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[num_cols].describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(16, 16))\n", "for i, col in enumerate(num_cols):\n", " ax = plt.subplot(4, 2, i + 1)\n", " ax.set_title(col)\n", " df[col].hist(bins=50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Transformation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply `log2(1 + x)` transformation followed by standardization for count variables to make them close to the normal distribution. `log2(1 + x)` has better resolution than `log1p` and it preserves the values of 0 and 1." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for col in ['SameFirstName', 'SameSecondName', 'Fare', 'FamilySize', 'Parch', 'SibSp']:\n", " df[col] = np.log2(1 + df[col])\n", " \n", "df.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scaler = StandardScaler()\n", "df[num_cols] = scaler.fit_transform(df[num_cols])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Label-encode categorical variables using `kaggler.preprocessing.LabelEncoder`, which creates new categories for `NaN`s as well as rare categories (using the threshold of `min_obs`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lbe = LabelEncoder(min_obs=50)\n", "df[cat_cols] = lbe.fit_transform(df[cat_cols]).astype(int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## AutoEncoder using `Keras`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Basic stacked autoencoder. I will add the versions with DAE and emphasized DAE later." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "encoding_dim = 64\n", "\n", "def get_model(encoding_dim, dropout=.2):\n", " num_dim = len(num_cols)\n", " num_input = keras.layers.Input((num_dim,), name='num_input')\n", " cat_inputs = []\n", " cat_embs = []\n", " emb_dims = 0\n", " for col in cat_cols:\n", " cat_input = keras.layers.Input((1,), name=f'{col}_input')\n", " emb_dim = max(8, int(np.log2(1 + df[col].nunique()) * 4))\n", " cat_emb = keras.layers.Embedding(input_dim=df[col].max() + 1, output_dim=emb_dim)(cat_input)\n", " cat_emb = keras.layers.Dropout(dropout)(cat_emb)\n", " cat_emb = keras.layers.Reshape((emb_dim,))(cat_emb)\n", "\n", " cat_inputs.append(cat_input)\n", " cat_embs.append(cat_emb)\n", " emb_dims += emb_dim\n", "\n", " merged_inputs = keras.layers.Concatenate()([num_input] + cat_embs)\n", "\n", " encoded = keras.layers.Dense(encoding_dim * 3, activation='relu')(merged_inputs)\n", " encoded = keras.layers.Dropout(dropout)(encoded)\n", " encoded = keras.layers.Dense(encoding_dim * 2, activation='relu')(encoded)\n", " encoded = keras.layers.Dropout(dropout)(encoded) \n", " encoded = keras.layers.Dense(encoding_dim, activation='relu')(encoded)\n", " \n", " decoded = keras.layers.Dense(encoding_dim * 2, activation='relu')(encoded)\n", " decoded = keras.layers.Dropout(dropout)(decoded)\n", " decoded = keras.layers.Dense(encoding_dim * 3, activation='relu')(decoded)\n", " decoded = keras.layers.Dropout(dropout)(decoded) \n", " decoded = keras.layers.Dense(num_dim + emb_dims, activation='linear')(decoded)\n", "\n", " encoder = keras.Model([num_input] + cat_inputs, encoded)\n", " ae = keras.Model([num_input] + cat_inputs, decoded)\n", " ae.add_loss(keras.losses.mean_squared_error(merged_inputs, decoded))\n", " ae.compile(optimizer='adam')\n", " return ae, encoder" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ae, encoder = get_model(encoding_dim)\n", 
"ae.summary()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "inputs = [df[num_cols].values] + [df[x].values for x in cat_cols]\n", "ae.fit(inputs, inputs,\n", " epochs=100,\n", " batch_size=16384,\n", " shuffle=True,\n", " validation_split=.2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "encoding = encoder.predict(inputs)\n", "print(encoding.shape)\n", "np.savetxt(feature_file, encoding, fmt='%.6f', delimiter=',')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Training + Feature Selection + Hyperparameter Optimization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Train the `LightGBM` model with pseudo label and 5-fold CV. In the first fold, perform feature selection and hyperparameter optimization using `kaggler.model.AutoLGB`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "seed = 42\n", "n_fold = 5\n", "X = pd.concat((df[feature_cols], \n", " pd.DataFrame(encoding, columns=[f'enc_{x}' for x in range(encoding_dim)])), axis=1)\n", "y = df[target_col]\n", "X_tst = X.iloc[n_trn:]\n", "\n", "cv = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=seed)\n", "p = np.zeros_like(y, dtype=float)\n", "p_tst = np.zeros((tst.shape[0],))\n", "for i, (i_trn, i_val) in enumerate(cv.split(X, y)):\n", " if i == 0:\n", " clf = AutoLGB(objective='binary', metric='auc', random_state=seed)\n", " clf.tune(X.iloc[i_trn], y[i_trn])\n", " features = clf.features\n", " params = clf.params\n", " n_best = clf.n_best\n", " print(f'{n_best}')\n", " print(f'{params}')\n", " print(f'{features}')\n", " \n", " trn_data = lgb.Dataset(X.iloc[i_trn], y[i_trn])\n", " val_data = lgb.Dataset(X.iloc[i_val], y[i_val])\n", " clf = lgb.train(params, trn_data, n_best, val_data, verbose_eval=100)\n", " p[i_val] = clf.predict(X.iloc[i_val])\n", " p_tst += clf.predict(X_tst) / n_fold\n", " print(f'CV #{i + 1} 
AUC: {roc_auc_score(y.iloc[i_val], p[i_val]):.6f}')\n", "\n", "np.savetxt(predict_val_file, p, fmt='%.6f')\n", "np.savetxt(predict_tst_file, p_tst, fmt='%.6f')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f'CV AUC: {roc_auc_score(y, p):.6f}')\n", "# The test AUC is measured against the pseudo labels, not the true test labels\n", "print(f'Test AUC (vs. pseudo labels): {roc_auc_score(pseudo_label[target_col], p_tst):.6f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Submission File" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Pick the threshold so that roughly 34.911% of the test rows are predicted positive\n", "n_pos = int(0.34911 * tst.shape[0])\n", "th = sorted(p_tst, reverse=True)[n_pos]\n", "print(th)\n", "confusion_matrix(pseudo_label[target_col], (p_tst > th).astype(int))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sub[target_col] = (p_tst > th).astype(int)\n", "sub.to_csv(submission_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you find this notebook helpful, please upvote it and give a star to [Kaggler](https://github.com/jeongyoonlee/Kaggler). If you have questions or feature requests for Kaggler, please open an issue in the `Kaggler` GitHub repository.\n", "\n", "Happy Kaggling!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }