{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "nterop": { "id": "1" } }, "source": [ "# DAE with 2 Lines of Code with Kaggler\n", "> A tutorial on Kaggler's new DAE feature transformation\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- categories: [notebook, kaggle]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "nterop": { "id": "2" } }, "source": [ "# **UPDATE on 5/1/2021**\n", "\n", "Today, [`Kaggler`](https://github.com/jeongyoonlee/Kaggler) v0.9.4 was released with the following additional DAE features:\n", "* In addition to swap noise (`swap_prob`), Gaussian noise (`noise_std`) and zero masking (`mask_prob`) have been added to DAE to reduce overfitting.\n", "* Stacked DAE is available through the `n_layer` input argument (see Figure 3 in [Vincent et al. (2010), \"Stacked Denoising Autoencoders\"](https://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf) for reference).\n", "\n", "For example, to build a stacked DAE with 3 encoder/decoder pairs and all three types of noise, you can do:\n", "```python\n", "from kaggler.preprocessing import DAE\n", "\n", "dae = DAE(cat_cols=cat_cols, num_cols=num_cols, n_layer=3, noise_std=.05, swap_prob=.2, mask_prob=.1)\n", "X = dae.fit_transform(pd.concat([trn, tst], axis=0))\n", "```\n", "\n", "If you're using an earlier version, please upgrade `Kaggler` with `pip install -U kaggler`.\n", "\n", "---\n", "\n", "Today I released a new version (v0.9.0) of the `Kaggler` package with a Denoising AutoEncoder (DAE) that uses swap noise. 
\n", "\n", "Now you can train a DAE with only 2 lines of code as follows:\n", "\n", "```python\n", "dae = DAE(cat_cols=cat_cols, num_cols=num_cols, encoding_dim=encoding_dim)\n", "X = dae.fit_transform(df[feature_cols])\n", "```\n", "\n", "In addition to the new DAE feature encoder, `Kaggler` supports many of the feature transformations commonly used in Kaggle competitions, including:\n", "* `TargetEncoder`: with smoothing and cross-validation to avoid overfitting\n", "* `FrequencyEncoder`\n", "* `LabelEncoder`: that imputes missing values and groups rare categories\n", "* `OneHotEncoder`: that imputes missing values and groups rare categories\n", "* `EmbeddingEncoder`: that transforms categorical features into embeddings\n", "* `QuantileEncoder`: that transforms numerical features into quantiles\n", "\n", "In the notebook below, I will show how to use `Kaggler`'s `LabelEncoder`, `TargetEncoder`, and `DAE` for feature engineering, then use `Kaggler`'s `AutoLGB` to do feature selection and hyperparameter optimization." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "nterop": { "id": "29" } }, "source": [ "This notebook was originally published [here](https://www.kaggle.com/jeongyoonlee/dae-with-2-lines-of-code-with-kaggler) at Kaggle."
] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "nterop": { "id": "3" } }, "source": [ "# Part 1: Data Loading & Feature Engineering" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_kg_hide-input": true, "nterop": { "id": "4" } }, "outputs": [], "source": [ "import lightgbm as lgb\n", "import numpy as np\n", "import pandas as pd\n", "from pathlib import Path\n", "import tensorflow as tf\n", "from tensorflow import keras\n", "from sklearn.model_selection import StratifiedKFold\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.metrics import roc_auc_score, confusion_matrix\n", "import warnings" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_kg_hide-output": true, "nterop": { "id": "5" } }, "outputs": [], "source": [ "!pip install kaggler" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nterop": { "id": "6" } }, "outputs": [], "source": [ "import kaggler\n", "from kaggler.model import AutoLGB\n", "from kaggler.preprocessing import DAE, TargetEncoder, LabelEncoder\n", "\n", "print(f'Kaggler: {kaggler.__version__}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_kg_hide-input": true, "nterop": { "id": "7" } }, "outputs": [], "source": [ "warnings.simplefilter('ignore')\n", "# Use the fully qualified option name; the short 'max_columns' pattern is ambiguous in newer pandas\n", "pd.set_option('display.max_columns', 100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nterop": { "id": "8" } }, "outputs": [], "source": [ "feature_name = 'dae'\n", "algo_name = 'lgb'\n", "model_name = f'{algo_name}_{feature_name}'\n", "\n", "data_dir = Path('/kaggle/input/tabular-playground-series-apr-2021/')\n", "trn_file = data_dir / 'train.csv'\n", "tst_file = data_dir / 'test.csv'\n", "sample_file = data_dir / 'sample_submission.csv'\n", "pseudo_label_file = '../input/tps-apr-2021-pseudo-label-dae/tps04-sub-006.csv'\n", "\n", "feature_file = f'{feature_name}.csv'\n", "predict_val_file = f'{model_name}.val.txt'\n", "predict_tst_file = 
f'{model_name}.tst.txt'\n", "submission_file = f'{model_name}.sub.csv'\n", "\n", "target_col = 'Survived'\n", "id_col = 'PassengerId'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nterop": { "id": "9" } }, "outputs": [], "source": [ "n_fold = 5\n", "seed = 42\n", "encoding_dim = 64" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nterop": { "id": "10" } }, "outputs": [], "source": [ "trn = pd.read_csv(trn_file, index_col=id_col)\n", "tst = pd.read_csv(tst_file, index_col=id_col)\n", "sub = pd.read_csv(sample_file, index_col=id_col)\n", "pseudo_label = pd.read_csv(pseudo_label_file, index_col=id_col)\n", "print(trn.shape, tst.shape, sub.shape, pseudo_label.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nterop": { "id": "11" } }, "outputs": [], "source": [ "tst[target_col] = pseudo_label[target_col]\n", "n_trn = trn.shape[0]\n", "df = pd.concat([trn, tst], axis=0)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nterop": { "id": "12" } }, "outputs": [], "source": [ "# Feature engineering code from https://www.kaggle.com/udbhavpangotra/tps-apr21-eda-model\n", "\n", "df['Embarked'] = df['Embarked'].fillna('No')\n", "df['Cabin'] = df['Cabin'].fillna('_')\n", "df['CabinType'] = df['Cabin'].apply(lambda x: x[0])\n", "df.Ticket = df.Ticket.map(lambda x: str(x).split()[0] if len(str(x).split()) > 1 else 'X')\n", "\n", "# Age: fill missing values with the rounded median, then convert to int\n", "df['Age'] = df['Age'].fillna(round(df['Age'].median()))\n", "df['Age'] = df['Age'].apply(round).astype(int)\n", "\n", "# Fare: fill missing values with the median fare of the passenger's Pclass\n", "fare_map = df[['Fare', 'Pclass']].dropna().groupby('Pclass').median().to_dict()\n", "df['Fare'] = df['Fare'].fillna(df['Pclass'].map(fare_map['Fare']))\n", "\n", "# Split Name on ', ': 'FirstName' is the part before the comma, 'SecondName' the part after\n", "df['FirstName'] = df['Name'].str.split(', ').str[0]\n", "df['SecondName'] = df['Name'].str.split(', ').str[1]\n", "\n", "df['n'] = 1\n", "\n", "gb = df.groupby('FirstName')\n", "df_names = gb['n'].sum()\n", "df['SameFirstName'] = 
df['FirstName'].apply(lambda x:df_names[x]).fillna(1)\n", "\n", "gb = df.groupby('SecondName')\n", "df_names = gb['n'].sum()\n", "df['SameSecondName'] = df['SecondName'].apply(lambda x:df_names[x]).fillna(1)\n", "\n", "df['Sex'] = (df['Sex'] == 'male').astype(int)\n", "\n", "df['FamilySize'] = df.SibSp + df.Parch + 1\n", "\n", "feature_cols = ['Pclass', 'Age','Embarked','Parch','SibSp','Fare','CabinType','Ticket','SameFirstName', 'SameSecondName', 'Sex',\n", "                'FamilySize', 'FirstName', 'SecondName']\n", "cat_cols = ['Pclass','Embarked','CabinType','Ticket', 'FirstName', 'SecondName']\n", "num_cols = [x for x in feature_cols if x not in cat_cols]\n", "print(len(feature_cols), len(cat_cols), len(num_cols))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nterop": { "id": "13" } }, "outputs": [], "source": [ "for col in ['SameFirstName', 'SameSecondName', 'Fare', 'FamilySize', 'Parch', 'SibSp']:\n", "    df[col] = np.log2(1 + df[col])\n", "\n", "scaler = StandardScaler()\n", "df[num_cols] = scaler.fit_transform(df[num_cols])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "nterop": { "id": "14" } }, "source": [ "## Label encoding with rare category grouping and missing value imputation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nterop": { "id": "15" } }, "outputs": [], "source": [ "lbe = LabelEncoder(min_obs=50)\n", "df[cat_cols] = lbe.fit_transform(df[cat_cols]).astype(int)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "nterop": { "id": "16" } }, "source": [ "## Target encoding with smoothing and 5-fold cross-validation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nterop": { "id": "17" } }, "outputs": [], "source": [ "cv = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=seed)\n", "te = TargetEncoder(cv=cv)\n", "df_te = te.fit_transform(df[cat_cols], df[target_col])\n", "df_te.columns = [f'te_{col}' for col in cat_cols]\n", "df_te.head()" ] }, { 
"attachments": {}, "cell_type": "markdown", "metadata": { "nterop": { "id": "18" } }, "source": [ "## DAE" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nterop": { "id": "19" } }, "outputs": [], "source": [ "dae = DAE(cat_cols=cat_cols, num_cols=num_cols, encoding_dim=encoding_dim)\n", "X = dae.fit_transform(df[feature_cols])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nterop": { "id": "20" } }, "outputs": [], "source": [ "# Reuse df's index so the DAE features align row by row with df in the later pd.concat\n", "df_dae = pd.DataFrame(X, columns=[f'dae_{i}' for i in range(encoding_dim)], index=df.index)\n", "print(df_dae.shape)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "nterop": { "id": "21" } }, "source": [ "# Part 2: Model Training" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "nterop": { "id": "22" } }, "source": [ "## AutoLGB for Feature Selection and Hyperparameter Optimization" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nterop": { "id": "23" } }, "outputs": [], "source": [ "X = pd.concat([df[feature_cols], df_te, df_dae], axis=1)\n", "y = df[target_col]\n", "X_tst = X.iloc[n_trn:]\n", "\n", "p = np.zeros_like(y, dtype=float)\n", "p_tst = np.zeros((tst.shape[0],))\n", "print('Training a stacking ensemble LightGBM model:')\n", "for i, (i_trn, i_val) in enumerate(cv.split(X, y)):\n", "    # Tune features and hyperparameters once, on the first fold, and reuse them for the other folds\n", "    if i == 0:\n", "        clf = AutoLGB(objective='binary', metric='auc', random_state=seed)\n", "        clf.tune(X.iloc[i_trn], y.iloc[i_trn])\n", "        features = clf.features\n", "        params = clf.params\n", "        n_best = clf.n_best\n", "        print(f'{n_best}')\n", "        print(f'{params}')\n", "        print(f'{features}')\n", "\n", "    # cv.split returns positional indices, so index y with .iloc\n", "    trn_data = lgb.Dataset(X.iloc[i_trn], y.iloc[i_trn])\n", "    val_data = lgb.Dataset(X.iloc[i_val], y.iloc[i_val])\n", "    clf = lgb.train(params, trn_data, n_best, val_data, verbose_eval=100)\n", "    p[i_val] = clf.predict(X.iloc[i_val])\n", "    p_tst += clf.predict(X_tst) / n_fold\n", "    print(f'CV #{i + 1} AUC: {roc_auc_score(y.iloc[i_val], p[i_val]):.6f}')" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": { "nterop": { "id": "24" } }, "outputs": [], "source": [ "print(f' CV AUC: {roc_auc_score(y, p):.6f}')\n", "print(f'Test AUC: {roc_auc_score(pseudo_label[target_col], p_tst):.6f}')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "nterop": { "id": "25" } }, "source": [ "## Submission" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nterop": { "id": "26" } }, "outputs": [], "source": [ "n_pos = int(0.34911 * tst.shape[0])\n", "th = sorted(p_tst, reverse=True)[n_pos]\n", "print(th)\n", "confusion_matrix(pseudo_label[target_col], (p_tst > th).astype(int))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nterop": { "id": "27" } }, "outputs": [], "source": [ "sub[target_col] = (p_tst > th).astype(int)\n", "sub.to_csv(submission_file)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "nterop": { "id": "28" } }, "source": [ "If you find it useful, please upvote the notebook and leave your feedback. It will be greatly appreciated!\n", "\n", "Please also check out my previous notebooks:\n", "* [AutoEncoder + Pseudo Label + AutoLGB](https://www.kaggle.com/jeongyoonlee/autoencoder-pseudo-label-autolgb): shows how to build a basic AutoEncoder using Keras, and perform automated feature selection and hyperparameter optimization using Kaggler's AutoLGB.\n", "* [Supervised Emphasized Denoising AutoEncoder](https://www.kaggle.com/jeongyoonlee/supervised-emphasized-denoising-autoencoder): shows how to build a more sophisticated version of AutoEncoder, called supervised emphasized Denoising AutoEncoder (DAE), which trains a DAE and a classifier simultaneously.\n", "* [Stacking Ensemble](https://www.kaggle.com/jeongyoonlee/stacking-ensemble): shows how to build a stacking ensemble." 
] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" }, "nterop": { "seedId": "29" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }