{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "nterop": {
     "id": "1"
    }
   },
   "source": [
    "# DAE with 2 Lines of Code with Kaggler\n",
    "> A tutorial on Kaggler's new DAE feature transformation\n",
    "\n",
    "- toc: true \n",
    "- badges: true\n",
    "- comments: true\n",
    "- categories: [notebook, kaggle]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "nterop": {
     "id": "2"
    }
   },
   "source": [
    "# **UPDATE on 5/1/2021**\n",
    "\n",
    "Today, [`Kaggler`](https://github.com/jeongyoonlee/Kaggler) v0.9.4 was released with additional features for DAE:\n",
    "* In addition to swap noise (`swap_prob`), Gaussian noise (`noise_std`) and zero masking (`mask_prob`) have been added to DAE to reduce overfitting.\n",
    "* Stacked DAE is available through the `n_layer` input argument (see Figure 3 in [Vincent et al. (2010), \"Stacked Denoising Autoencoders\"](https://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf) for reference).\n",
    "\n",
    "For example, to build a stacked DAE with 3 encoder/decoder pairs and all three types of noise, you can do:\n",
    "```python\n",
    "import pandas as pd\n",
    "from kaggler.preprocessing import DAE\n",
    "\n",
    "dae = DAE(cat_cols=cat_cols, num_cols=num_cols, n_layer=3, noise_std=.05, swap_prob=.2, mask_prob=.1)\n",
    "X = dae.fit_transform(pd.concat([trn, tst], axis=0))\n",
    "```\n",
    "\n",
    "If you're using a previous version, please upgrade `Kaggler` using `pip install -U kaggler`.\n",
    "\n",
    "---\n",
    "\n",
    "Today I released a new version (v0.9.0) of the `Kaggler` package that adds a Denoising AutoEncoder (DAE) with swap noise.\n",
    "\n",
    "Now you can train a DAE with only 2 lines of code as follows:\n",
    "\n",
    "```python\n",
    "dae = DAE(cat_cols=cat_cols, num_cols=num_cols, encoding_dim=encoding_dim)\n",
    "X = dae.fit_transform(df[feature_cols])\n",
    "```\n",
    "\n",
    "In addition to the new DAE feature encoder, `Kaggler` supports many of the feature transformations commonly used in Kaggle competitions, including:\n",
    "* `TargetEncoder`: with smoothing and cross-validation to avoid overfitting\n",
    "* `FrequencyEncoder`\n",
    "* `LabelEncoder`: that imputes missing values and groups rare categories\n",
    "* `OneHotEncoder`: that imputes missing values and groups rare categories\n",
    "* `EmbeddingEncoder`: that transforms categorical features into embeddings\n",
    "* `QuantileEncoder`: that transforms numerical features into quantiles\n",
    "\n",
    "In the notebook below, I will show how to use `Kaggler`'s `LabelEncoder`, `TargetEncoder`, and `DAE` for feature engineering, then use `Kaggler`'s `AutoLGB` to do feature selection and hyperparameter optimization."
   ]
  },
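  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the three noise types mentioned above (`swap_prob`, `noise_std`, `mask_prob`) more concrete, here is a minimal NumPy sketch of what each one does to a numeric feature matrix. It is only an illustration: the helper names `swap_noise`, `gaussian_noise`, and `zero_mask` are hypothetical, and this is not `Kaggler`'s implementation, which may differ in its details.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def swap_noise(x, prob, rng):\n",
    "    # With probability `prob`, replace a cell with the value found in the\n",
    "    # same column of a randomly chosen row.\n",
    "    mask = rng.random(x.shape) < prob\n",
    "    rows = rng.integers(0, x.shape[0], size=x.shape)\n",
    "    cols = np.broadcast_to(np.arange(x.shape[1]), x.shape)\n",
    "    return np.where(mask, x[rows, cols], x)\n",
    "\n",
    "\n",
    "def gaussian_noise(x, std, rng):\n",
    "    # Add zero-mean Gaussian noise with standard deviation `std`.\n",
    "    return x + rng.normal(0.0, std, size=x.shape)\n",
    "\n",
    "\n",
    "def zero_mask(x, prob, rng):\n",
    "    # Set each cell to zero with probability `prob`.\n",
    "    return np.where(rng.random(x.shape) < prob, 0.0, x)\n",
    "\n",
    "\n",
    "rng = np.random.default_rng(42)\n",
    "x = rng.normal(size=(1000, 5))  # toy data standing in for scaled features\n",
    "x_noisy = swap_noise(x, 0.2, rng)\n",
    "x_noisy = gaussian_noise(x_noisy, 0.05, rng)\n",
    "x_noisy = zero_mask(x_noisy, 0.1, rng)\n",
    "print(x_noisy.shape)\n",
    "```\n",
    "\n",
    "A denoising autoencoder is then trained to reconstruct the clean `x` from the corrupted `x_noisy`. Swap noise keeps every corrupted value inside the empirical distribution of its own column, which makes it a natural fit for tabular data."
   ]
  },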
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "nterop": {
     "id": "29"
    }
   },
   "source": [
    "This notebook was originally published [here](https://www.kaggle.com/jeongyoonlee/dae-with-2-lines-of-code-with-kaggler) at Kaggle."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "nterop": {
     "id": "3"
    }
   },
   "source": [
    "# Part 1: Data Loading & Feature Engineering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_kg_hide-input": true,
    "nterop": {
     "id": "4"
    }
   },
   "outputs": [],
   "source": [
    "import lightgbm as lgb\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from pathlib import Path\n",
    "import tensorflow as tf\n",
    "from tensorflow import keras\n",
    "from sklearn.model_selection import StratifiedKFold\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.metrics import roc_auc_score, confusion_matrix\n",
    "import warnings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_kg_hide-output": true,
    "nterop": {
     "id": "5"
    }
   },
   "outputs": [],
   "source": [
    "!pip install kaggler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nterop": {
     "id": "6"
    }
   },
   "outputs": [],
   "source": [
    "import kaggler\n",
    "from kaggler.model import AutoLGB\n",
    "from kaggler.preprocessing import DAE, TargetEncoder, LabelEncoder\n",
    "\n",
    "print(f'Kaggler: {kaggler.__version__}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_kg_hide-input": true,
    "nterop": {
     "id": "7"
    }
   },
   "outputs": [],
   "source": [
    "warnings.simplefilter('ignore')\n",
    "pd.set_option('display.max_columns', 100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nterop": {
     "id": "8"
    }
   },
   "outputs": [],
   "source": [
    "feature_name = 'dae'\n",
    "algo_name = 'lgb'\n",
    "model_name = f'{algo_name}_{feature_name}'\n",
    "\n",
    "data_dir = Path('/kaggle/input/tabular-playground-series-apr-2021/')\n",
    "trn_file = data_dir / 'train.csv'\n",
    "tst_file = data_dir / 'test.csv'\n",
    "sample_file = data_dir / 'sample_submission.csv'\n",
    "pseudo_label_file = '../input/tps-apr-2021-pseudo-label-dae/tps04-sub-006.csv'\n",
    "\n",
    "feature_file = f'{feature_name}.csv'\n",
    "predict_val_file = f'{model_name}.val.txt'\n",
    "predict_tst_file = f'{model_name}.tst.txt'\n",
    "submission_file = f'{model_name}.sub.csv'\n",
    "\n",
    "target_col = 'Survived'\n",
    "id_col = 'PassengerId'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nterop": {
     "id": "9"
    }
   },
   "outputs": [],
   "source": [
    "n_fold = 5\n",
    "seed = 42\n",
    "encoding_dim = 64"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nterop": {
     "id": "10"
    }
   },
   "outputs": [],
   "source": [
    "trn = pd.read_csv(trn_file, index_col=id_col)\n",
    "tst = pd.read_csv(tst_file, index_col=id_col)\n",
    "sub = pd.read_csv(sample_file, index_col=id_col)\n",
    "pseudo_label = pd.read_csv(pseudo_label_file, index_col=id_col)\n",
    "print(trn.shape, tst.shape, sub.shape, pseudo_label.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nterop": {
     "id": "11"
    }
   },
   "outputs": [],
   "source": [
    "tst[target_col] = pseudo_label[target_col]\n",
    "n_trn = trn.shape[0]\n",
    "df = pd.concat([trn, tst], axis=0)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nterop": {
     "id": "12"
    }
   },
   "outputs": [],
   "source": [
    "# Feature engineering code from https://www.kaggle.com/udbhavpangotra/tps-apr21-eda-model\n",
    "\n",
    "df['Embarked'] = df['Embarked'].fillna('No')\n",
    "df['Cabin'] = df['Cabin'].fillna('_')\n",
    "df['CabinType'] = df['Cabin'].apply(lambda x:x[0])\n",
    "df.Ticket = df.Ticket.map(lambda x:str(x).split()[0] if len(str(x).split()) > 1 else 'X')\n",
    "\n",
    "df['Age'] = df['Age'].fillna(round(df['Age'].median()))\n",
    "df['Age'] = df['Age'].apply(round).astype(int)\n",
    "\n",
    "# Fare: fill missing values with the median fare of the passenger's Pclass\n",
    "fare_map = df[['Fare', 'Pclass']].dropna().groupby('Pclass').median().to_dict()\n",
    "df['Fare'] = df['Fare'].fillna(df['Pclass'].map(fare_map['Fare']))\n",
    "\n",
    "df['FirstName'] = df['Name'].str.split(', ').str[0]\n",
    "df['SecondName'] = df['Name'].str.split(', ').str[1]\n",
    "\n",
    "df['n'] = 1\n",
    "\n",
    "gb = df.groupby('FirstName')\n",
    "df_names = gb['n'].sum()\n",
    "df['SameFirstName'] = df['FirstName'].apply(lambda x:df_names[x]).fillna(1)\n",
    "\n",
    "gb = df.groupby('SecondName')\n",
    "df_names = gb['n'].sum()\n",
    "df['SameSecondName'] = df['SecondName'].apply(lambda x:df_names[x]).fillna(1)\n",
    "\n",
    "df['Sex'] = (df['Sex'] == 'male').astype(int)\n",
    "\n",
    "df['FamilySize'] = df.SibSp + df.Parch + 1\n",
    "\n",
    "feature_cols = ['Pclass', 'Age','Embarked','Parch','SibSp','Fare','CabinType','Ticket','SameFirstName', 'SameSecondName', 'Sex',\n",
    "                'FamilySize', 'FirstName', 'SecondName']\n",
    "cat_cols = ['Pclass','Embarked','CabinType','Ticket', 'FirstName', 'SecondName']\n",
    "num_cols = [x for x in feature_cols if x not in cat_cols]\n",
    "print(len(feature_cols), len(cat_cols), len(num_cols))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nterop": {
     "id": "13"
    }
   },
   "outputs": [],
   "source": [
    "for col in ['SameFirstName', 'SameSecondName', 'Fare', 'FamilySize', 'Parch', 'SibSp']:\n",
    "    df[col] = np.log2(1 + df[col])\n",
    "    \n",
    "scaler = StandardScaler()\n",
    "df[num_cols] = scaler.fit_transform(df[num_cols])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "nterop": {
     "id": "14"
    }
   },
   "source": [
    "## Label encoding with rare category grouping and missing value imputation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nterop": {
     "id": "15"
    }
   },
   "outputs": [],
   "source": [
    "lbe = LabelEncoder(min_obs=50)\n",
    "df[cat_cols] = lbe.fit_transform(df[cat_cols]).astype(int)"
   ]
  },
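  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The idea behind `min_obs` can be sketched in plain pandas. This is a simplified illustration, not `Kaggler`'s implementation: the function name `label_encode`, the `'__missing__'` placeholder, and the choice of code `0` for rare categories are all assumptions made for the example.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def label_encode(s, min_obs=2):\n",
    "    # Missing values become an explicit category, so they are encoded like\n",
    "    # any other value (and grouped as rare if they are infrequent).\n",
    "    s = s.fillna('__missing__')\n",
    "    counts = s.value_counts()\n",
    "    # Categories seen at least `min_obs` times get their own code: 1, 2, ...\n",
    "    frequent = counts[counts >= min_obs].index\n",
    "    mapping = {cat: i + 1 for i, cat in enumerate(frequent)}\n",
    "    # All remaining (rare) categories share the code 0.\n",
    "    return s.map(mapping).fillna(0).astype(int)\n",
    "\n",
    "\n",
    "s = pd.Series(['S', 'S', 'C', 'C', 'C', 'Q', None])\n",
    "codes = label_encode(s, min_obs=2)\n",
    "print(codes.tolist())\n",
    "```\n",
    "\n",
    "Grouping rare categories keeps high-cardinality columns such as `Ticket` or the name features from producing many labels that are each backed by only a handful of rows."
   ]
  },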
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "nterop": {
     "id": "16"
    }
   },
   "source": [
    "## Target encoding with smoothing and 5-fold cross-validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nterop": {
     "id": "17"
    }
   },
   "outputs": [],
   "source": [
    "cv = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=seed)\n",
    "te = TargetEncoder(cv=cv)\n",
    "df_te = te.fit_transform(df[cat_cols], df[target_col])\n",
    "df_te.columns = [f'te_{col}' for col in cat_cols]\n",
    "df_te.head()"
   ]
  },
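  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For intuition, out-of-fold target encoding with smoothing can be sketched in a few lines of pandas. This is a simplified illustration under stated assumptions, not `Kaggler`'s implementation: `target_encode_oof` is a hypothetical helper, it uses plain random folds instead of `StratifiedKFold`, and the additive smoothing formula shown is just one common choice.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def target_encode_oof(x, y, n_fold=5, smoothing=1.0, seed=42):\n",
    "    # Assign each row to one of `n_fold` random folds.\n",
    "    rng = np.random.default_rng(seed)\n",
    "    fold = rng.permutation(len(x)) % n_fold\n",
    "    encoded = pd.Series(np.nan, index=x.index, dtype=float)\n",
    "    for k in range(n_fold):\n",
    "        trn, val = fold != k, fold == k\n",
    "        prior = y[trn].mean()\n",
    "        stats = y[trn].groupby(x[trn]).agg(['sum', 'count'])\n",
    "        # Shrink each category mean toward the prior; rare categories shrink more.\n",
    "        smoothed = (stats['sum'] + smoothing * prior) / (stats['count'] + smoothing)\n",
    "        # Rows in fold k only see statistics computed on the other folds;\n",
    "        # categories unseen there fall back to the prior.\n",
    "        encoded.loc[val] = x[val].map(smoothed).fillna(prior).to_numpy()\n",
    "    return encoded\n",
    "\n",
    "\n",
    "x = pd.Series(list('aabbbbccab'))  # toy categorical feature\n",
    "y = pd.Series([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])  # toy binary target\n",
    "encoded = target_encode_oof(x, y)\n",
    "print(encoded.round(3).tolist())\n",
    "```\n",
    "\n",
    "Because each row is encoded with statistics that exclude its own target value, the encoded feature leaks less label information into the model than a naive per-category mean would."
   ]
  },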
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "nterop": {
     "id": "18"
    }
   },
   "source": [
    "## DAE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nterop": {
     "id": "19"
    }
   },
   "outputs": [],
   "source": [
    "dae = DAE(cat_cols=cat_cols, num_cols=num_cols, encoding_dim=encoding_dim)\n",
    "X = dae.fit_transform(df[feature_cols])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nterop": {
     "id": "20"
    }
   },
   "outputs": [],
   "source": [
    "df_dae = pd.DataFrame(X, columns=[f'dae_{i}' for i in range(encoding_dim)])\n",
    "print(df_dae.shape)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "nterop": {
     "id": "21"
    }
   },
   "source": [
    "# Part 2: Model Training"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "nterop": {
     "id": "22"
    }
   },
   "source": [
    "## AutoLGB for Feature Selection and Hyperparameter Optimization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nterop": {
     "id": "23"
    }
   },
   "outputs": [],
   "source": [
    "X = pd.concat([df[feature_cols], df_te, df_dae], axis=1)\n",
    "y = df[target_col]\n",
    "X_tst = X.iloc[n_trn:]\n",
    "\n",
    "p = np.zeros_like(y, dtype=float)\n",
    "p_tst = np.zeros((tst.shape[0],))\n",
    "print('Training a stacking ensemble LightGBM model:')\n",
    "for i, (i_trn, i_val) in enumerate(cv.split(X, y)):\n",
    "    if i == 0:\n",
    "        # Run AutoLGB once, on the first fold, and reuse its parameters and\n",
    "        # number of boosting rounds for every fold.\n",
    "        clf = AutoLGB(objective='binary', metric='auc', random_state=seed)\n",
    "        clf.tune(X.iloc[i_trn], y.iloc[i_trn])\n",
    "        features = clf.features\n",
    "        params = clf.params\n",
    "        n_best = clf.n_best\n",
    "        print(f'{n_best}')\n",
    "        print(f'{params}')\n",
    "        print(f'{features}')\n",
    "    \n",
    "    # Index y by position (.iloc) so the loop does not depend on PassengerId labels.\n",
    "    trn_data = lgb.Dataset(X.iloc[i_trn], y.iloc[i_trn])\n",
    "    val_data = lgb.Dataset(X.iloc[i_val], y.iloc[i_val])\n",
    "    clf = lgb.train(params, trn_data, n_best, val_data, verbose_eval=100)\n",
    "    p[i_val] = clf.predict(X.iloc[i_val])\n",
    "    p_tst += clf.predict(X_tst) / n_fold\n",
    "    print(f'CV #{i + 1} AUC: {roc_auc_score(y.iloc[i_val], p[i_val]):.6f}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nterop": {
     "id": "24"
    }
   },
   "outputs": [],
   "source": [
    "print(f'  CV AUC: {roc_auc_score(y, p):.6f}')\n",
    "print(f'Test AUC: {roc_auc_score(pseudo_label[target_col], p_tst)}')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "nterop": {
     "id": "25"
    }
   },
   "source": [
    "## Submission"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nterop": {
     "id": "26"
    }
   },
   "outputs": [],
   "source": [
    "n_pos = int(0.34911 * tst.shape[0])\n",
    "th = sorted(p_tst, reverse=True)[n_pos]\n",
    "print(th)\n",
    "confusion_matrix(pseudo_label[target_col], (p_tst > th).astype(int))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nterop": {
     "id": "27"
    }
   },
   "outputs": [],
   "source": [
    "sub[target_col] = (p_tst > th).astype(int)\n",
    "sub.to_csv(submission_file)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "nterop": {
     "id": "28"
    }
   },
   "source": [
    "If you find this notebook useful, please upvote it and leave your feedback. It will be greatly appreciated!\n",
    "\n",
    "Please also check out my previous notebooks:\n",
    "* [AutoEncoder + Pseudo Label + AutoLGB](https://www.kaggle.com/jeongyoonlee/autoencoder-pseudo-label-autolgb): shows how to build a basic AutoEncoder using Keras, and perform automated feature selection and hyperparameter optimization using Kaggler's AutoLGB.\n",
    "* [Supervised Emphasized Denoising AutoEncoder](https://www.kaggle.com/jeongyoonlee/supervised-emphasized-denoising-autoencoder): shows how to build a more sophisticated version of AutoEncoder, called supervised emphasized Denoising AutoEncoder (DAE), which trains a DAE and a classifier simultaneously.\n",
    "* [Stacking Ensemble](https://www.kaggle.com/jeongyoonlee/stacking-ensemble): shows how to build a stacking ensemble."
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  },
  "nterop": {
   "seedId": "29"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}