{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c1006651",
   "metadata": {},
   "source": [
    "_This notebook contains code and comments from Section 8.4 of the book [Ensemble Methods for Machine Learning](https://www.manning.com/books/ensemble-methods-for-machine-learning). Please see the book for additional details on this topic. This notebook and code are released under the [MIT license](https://github.com/gkunapuli/ensemble-methods-notebooks/blob/master/LICENSE)._\n",
    "\n",
    "## 8.4 Encoding High-Cardinality String Features\n",
    "We wrap up this chapter by exploring encoding techniques for high-cardinality categorical features. The cardinality of a categorical feature is simply the number of unique categories in it. The number of categories is an important consideration in categorical encoding.\n",
    "\n",
    "Real-world data sets often contain categorical string features, where feature values are strings. For example, consider a categorical feature of job titles at an organization. This feature can contain dozens to hundreds of job titles from ‘Intern’ to ‘President and CEO’, each with their own unique roles and responsibilities. \n",
    "\n",
    "Such features contain a large number of categories and are inherently high-cardinality. This disqualifies encoding approaches such as one-hot encoding (because it increases feature dimension significantly), or ordinal encoding (because no natural ordering typically exists). What’s more, in real-world data sets, such high-cardinality are also ‘dirty’, as in they contain many variations of the same class or concept.\n",
    "\n",
    "To address this issue, we will need to determine categories (and how to encode them) by string similarity rather than by exact matching! The intuition behind this approach is to encode similar categories together in a way that a human might, to ensure that the downstream learning algorithm treats them similarly (as it should). \n",
    "\n",
    "The ``dirty-cat``  package provides such functionality off-the-shelf and can be used in seamlessly in modeling pipelines. The package provides three specialized encoders to handle so called “dirty categories”, which are essentially noisy and/or high-cardinality   string categories. \n",
    "- ``SimilarityEncoder``, a version of one-hot encoding constructed using string similarities,\n",
    "- ``GapEncoder``, that encodes categories by considering frequently co-occurring substring combinations, and\n",
    "- ``MinHashEncoder``, that encodes categories by applying hashing techniques to substrings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "dd06db5d",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "be732663",
   "metadata": {},
   "outputs": [],
   "source": [
    "# # Pre-process accordint to the example in dirty_cat gitbub\n",
    "# # https://dirty-cat.github.io/stable/auto_examples/01_dirty_categories.html#id2\n",
    "\n",
    "# from dirty_cat.datasets import fetch_employee_salaries\n",
    "# employee_salaries = fetch_employee_salaries()\n",
    "# X = employee_salaries.X\n",
    "# y = employee_salaries.y\n",
    "\n",
    "# X['date_first_hired'] = pd.to_datetime(X['date_first_hired'])\n",
    "# X['year_first_hired'] = X['date_first_hired'].apply(lambda x: x.year)\n",
    "# # Get mask of rows with missing values in gender\n",
    "# mask = X.isna()['gender']\n",
    "# # And remove the lines accordingly\n",
    "# X.dropna(subset=['gender'], inplace=True)\n",
    "# y = y[~mask]\n",
    "# X['salary'] = y\n",
    "# X = X.drop(['date_first_hired', 'division', 'department'], axis=1)\n",
    "# X = X.sample(frac=1)\n",
    "# X.to_csv('./data/ch08/employee_salaries.csv', index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "2c237251",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>gender</th>\n",
       "      <th>department_name</th>\n",
       "      <th>assignment_category</th>\n",
       "      <th>employee_position_title</th>\n",
       "      <th>underfilled_job_title</th>\n",
       "      <th>year_first_hired</th>\n",
       "      <th>salary</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>F</td>\n",
       "      <td>Department of Environmental Protection</td>\n",
       "      <td>Fulltime-Regular</td>\n",
       "      <td>Program Specialist II</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2013</td>\n",
       "      <td>75362.93</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>F</td>\n",
       "      <td>Department of Recreation</td>\n",
       "      <td>Fulltime-Regular</td>\n",
       "      <td>Recreation Supervisor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1997</td>\n",
       "      <td>79522.62</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>F</td>\n",
       "      <td>Department of Transportation</td>\n",
       "      <td>Fulltime-Regular</td>\n",
       "      <td>Bus Operator</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2014</td>\n",
       "      <td>42053.83</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>M</td>\n",
       "      <td>Fire and Rescue Services</td>\n",
       "      <td>Fulltime-Regular</td>\n",
       "      <td>Fire/Rescue Captain</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1995</td>\n",
       "      <td>114587.02</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>F</td>\n",
       "      <td>Department of Public Libraries</td>\n",
       "      <td>Fulltime-Regular</td>\n",
       "      <td>Library Assistant I</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1996</td>\n",
       "      <td>55139.67</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  gender                         department_name assignment_category  \\\n",
       "0      F  Department of Environmental Protection    Fulltime-Regular   \n",
       "1      F                Department of Recreation    Fulltime-Regular   \n",
       "2      F            Department of Transportation    Fulltime-Regular   \n",
       "3      M                Fire and Rescue Services    Fulltime-Regular   \n",
       "4      F          Department of Public Libraries    Fulltime-Regular   \n",
       "\n",
       "  employee_position_title underfilled_job_title  year_first_hired     salary  \n",
       "0   Program Specialist II                   NaN              2013   75362.93  \n",
       "1   Recreation Supervisor                   NaN              1997   79522.62  \n",
       "2            Bus Operator                   NaN              2014   42053.83  \n",
       "3     Fire/Rescue Captain                   NaN              1995  114587.02  \n",
       "4     Library Assistant I                   NaN              1996   55139.67  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('./data/ch08/employee_salaries.csv')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "d4144031",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(9211, 6)\n",
      "Number of categories\n",
      "gender: 2 categories\n",
      "department_name: 37 categories\n",
      "assignment_category: 2 categories\n",
      "employee_position_title: 385 categories\n",
      "underfilled_job_title: 83 categories\n",
      "year_first_hired: 51 categories\n"
     ]
    }
   ],
   "source": [
    "X, y = df.drop('salary', axis=1), df['salary']  # Split the data into features and targets\n",
    "print(X.shape)\n",
    "\n",
    "print('Number of categories')\n",
    "for col in X.columns:\n",
    "    print('{0}: {1} categories'.format(col, df[col].nunique()))\n",
    "\n",
    "from sklearn.model_selection import train_test_split\n",
    "Xtrn, Xtst, ytrn, ytst = train_test_split(X, y, test_size=0.2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "6ed54f4d",
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "SimilarityEncoder: 0.8995625658800894\n",
      "MinHashEncoder: 0.8996750692009536\n",
      "GapEncoder: 0.8895356402510632\n"
     ]
    }
   ],
   "source": [
    "from sklearn.preprocessing import OneHotEncoder, MinMaxScaler\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.compose import ColumnTransformer\n",
    "from dirty_cat import SimilarityEncoder, MinHashEncoder, GapEncoder\n",
    "from xgboost import XGBRegressor\n",
    "from sklearn.metrics import r2_score\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\", category=UserWarning)\n",
    "\n",
    "lo_card = ['gender', 'department_name', 'assignment_category']\n",
    "hi_card = ['employee_position_title']\n",
    "continuous = ['year_first_hired']\n",
    "\n",
    "encoders = [# OneHotEncoder(sparse=False), \n",
    "            SimilarityEncoder(similarity='ngram'),\n",
    "            MinHashEncoder(n_components=100),\n",
    "            GapEncoder(n_components=100)]\n",
    "\n",
    "for encoder in encoders:\n",
    "    ensemble = XGBRegressor(objective='reg:squarederror', learning_rate=0.1, \n",
    "                            n_estimators=100, max_depth=3)\n",
    "\n",
    "    preprocess = ColumnTransformer(\n",
    "                         transformers=[('continuous', MinMaxScaler(), continuous),\n",
    "                                       ('onehot-encode', OneHotEncoder(sparse=False), lo_card),\n",
    "                                       ('sim-encode', encoder, hi_card)],\n",
    "                         remainder='drop')\n",
    "    \n",
    "    pipe = Pipeline(steps=[('preprocess', preprocess), \n",
    "                           ('train', ensemble)])\n",
    "    pipe.fit(Xtrn, ytrn)\n",
    "    \n",
    "    ypred = pipe.predict(Xtst)\n",
    "    print('{0}: {1}'.format(encoder.__class__.__name__, r2_score(ytst, ypred)))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}