{ "cells": [ { "cell_type": "markdown", "id": "c1006651", "metadata": {}, "source": [ "_This notebook contains code and comments from Section 8.4 of the book [Ensemble Methods for Machine Learning](https://www.manning.com/books/ensemble-methods-for-machine-learning). Please see the book for additional details on this topic. This notebook and code are released under the [MIT license](https://github.com/gkunapuli/ensemble-methods-notebooks/blob/master/LICENSE)._\n", "\n", "## 8.4 Encoding High-Cardinality String Features\n", "We wrap up this chapter by exploring encoding techniques for high-cardinality categorical features. The cardinality of a categorical feature is simply the number of unique categories in it. The number of categories is an important consideration in categorical encoding.\n", "\n", "Real-world data sets often contain categorical string features, where feature values are strings. For example, consider a categorical feature of job titles at an organization. This feature can contain dozens to hundreds of job titles from ‘Intern’ to ‘President and CEO’, each with their own unique roles and responsibilities. \n", "\n", "Such features contain a large number of categories and are inherently high-cardinality. This disqualifies encoding approaches such as one-hot encoding (because it increases feature dimension significantly), or ordinal encoding (because no natural ordering typically exists). What’s more, in real-world data sets, such high-cardinality are also ‘dirty’, as in they contain many variations of the same class or concept.\n", "\n", "To address this issue, we will need to determine categories (and how to encode them) by string similarity rather than by exact matching! The intuition behind this approach is to encode similar categories together in a way that a human might, to ensure that the downstream learning algorithm treats them similarly (as it should). \n", "\n", "The ``dirty-cat`` package provides such functionality off-the-shelf and can be used in seamlessly in modeling pipelines. The package provides three specialized encoders to handle so called “dirty categories”, which are essentially noisy and/or high-cardinality string categories. \n", "- ``SimilarityEncoder``, a version of one-hot encoding constructed using string similarities,\n", "- ``GapEncoder``, that encodes categories by considering frequently co-occurring substring combinations, and\n", "- ``MinHashEncoder``, that encodes categories by applying hashing techniques to substrings." 
] }, { "cell_type": "code", "execution_count": 1, "id": "dd06db5d", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "be732663", "metadata": {}, "outputs": [], "source": [ "# # Pre-process accordint to the example in dirty_cat gitbub\n", "# # https://dirty-cat.github.io/stable/auto_examples/01_dirty_categories.html#id2\n", "\n", "# from dirty_cat.datasets import fetch_employee_salaries\n", "# employee_salaries = fetch_employee_salaries()\n", "# X = employee_salaries.X\n", "# y = employee_salaries.y\n", "\n", "# X['date_first_hired'] = pd.to_datetime(X['date_first_hired'])\n", "# X['year_first_hired'] = X['date_first_hired'].apply(lambda x: x.year)\n", "# # Get mask of rows with missing values in gender\n", "# mask = X.isna()['gender']\n", "# # And remove the lines accordingly\n", "# X.dropna(subset=['gender'], inplace=True)\n", "# y = y[~mask]\n", "# X['salary'] = y\n", "# X = X.drop(['date_first_hired', 'division', 'department'], axis=1)\n", "# X = X.sample(frac=1)\n", "# X.to_csv('./data/ch08/employee_salaries.csv', index=False)" ] }, { "cell_type": "code", "execution_count": 3, "id": "2c237251", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
genderdepartment_nameassignment_categoryemployee_position_titleunderfilled_job_titleyear_first_hiredsalary
0FDepartment of Environmental ProtectionFulltime-RegularProgram Specialist IINaN201375362.93
1FDepartment of RecreationFulltime-RegularRecreation SupervisorNaN199779522.62
2FDepartment of TransportationFulltime-RegularBus OperatorNaN201442053.83
3MFire and Rescue ServicesFulltime-RegularFire/Rescue CaptainNaN1995114587.02
4FDepartment of Public LibrariesFulltime-RegularLibrary Assistant INaN199655139.67
\n", "
" ], "text/plain": [ " gender department_name assignment_category \\\n", "0 F Department of Environmental Protection Fulltime-Regular \n", "1 F Department of Recreation Fulltime-Regular \n", "2 F Department of Transportation Fulltime-Regular \n", "3 M Fire and Rescue Services Fulltime-Regular \n", "4 F Department of Public Libraries Fulltime-Regular \n", "\n", " employee_position_title underfilled_job_title year_first_hired salary \n", "0 Program Specialist II NaN 2013 75362.93 \n", "1 Recreation Supervisor NaN 1997 79522.62 \n", "2 Bus Operator NaN 2014 42053.83 \n", "3 Fire/Rescue Captain NaN 1995 114587.02 \n", "4 Library Assistant I NaN 1996 55139.67 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./data/ch08/employee_salaries.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 4, "id": "d4144031", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(9211, 6)\n", "Number of categories\n", "gender: 2 categories\n", "department_name: 37 categories\n", "assignment_category: 2 categories\n", "employee_position_title: 385 categories\n", "underfilled_job_title: 83 categories\n", "year_first_hired: 51 categories\n" ] } ], "source": [ "X, y = df.drop('salary', axis=1), df['salary'] # Split the data into features and targets\n", "print(X.shape)\n", "\n", "print('Number of categories')\n", "for col in X.columns:\n", " print('{0}: {1} categories'.format(col, df[col].nunique()))\n", "\n", "from sklearn.model_selection import train_test_split\n", "Xtrn, Xtst, ytrn, ytst = train_test_split(X, y, test_size=0.2)" ] }, { "cell_type": "code", "execution_count": 5, "id": "6ed54f4d", "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SimilarityEncoder: 0.8995625658800894\n", "MinHashEncoder: 0.8996750692009536\n", "GapEncoder: 0.8895356402510632\n" ] } ], "source": [ "from sklearn.preprocessing import OneHotEncoder, MinMaxScaler\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.compose import ColumnTransformer\n", "from dirty_cat import SimilarityEncoder, MinHashEncoder, GapEncoder\n", "from xgboost import XGBRegressor\n", "from sklearn.metrics import r2_score\n", "\n", "import warnings\n", "warnings.filterwarnings(\"ignore\", category=UserWarning)\n", "\n", "lo_card = ['gender', 'department_name', 'assignment_category']\n", "hi_card = ['employee_position_title']\n", "continuous = ['year_first_hired']\n", "\n", "encoders = [# OneHotEncoder(sparse=False), \n", " SimilarityEncoder(similarity='ngram'),\n", " MinHashEncoder(n_components=100),\n", " GapEncoder(n_components=100)]\n", "\n", "for encoder in encoders:\n", " ensemble = XGBRegressor(objective='reg:squarederror', learning_rate=0.1, \n", " n_estimators=100, max_depth=3)\n", "\n", " preprocess = ColumnTransformer(\n", " transformers=[('continuous', MinMaxScaler(), continuous),\n", " ('onehot-encode', OneHotEncoder(sparse=False), lo_card),\n", " ('sim-encode', encoder, hi_card)],\n", " remainder='drop')\n", " \n", " pipe = Pipeline(steps=[('preprocess', preprocess), \n", " ('train', ensemble)])\n", " pipe.fit(Xtrn, ytrn)\n", " \n", " ypred = pipe.predict(Xtst)\n", " print('{0}: {1}'.format(encoder.__class__.__name__, r2_score(ytst, ypred)))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": 
"text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" } }, "nbformat": 4, "nbformat_minor": 5 }