{ "cells": [ { "cell_type": "markdown", "id": "c1006651", "metadata": {}, "source": [ "_This notebook contains code and comments from Section 8.4 of the book [Ensemble Methods for Machine Learning](https://www.manning.com/books/ensemble-methods-for-machine-learning). Please see the book for additional details on this topic. This notebook and code are released under the [MIT license](https://github.com/gkunapuli/ensemble-methods-notebooks/blob/master/LICENSE)._\n", "\n", "## 8.4 Encoding High-Cardinality String Features\n", "We wrap up this chapter by exploring encoding techniques for high-cardinality categorical features. The cardinality of a categorical feature is simply the number of unique categories in it. The number of categories is an important consideration in categorical encoding.\n", "\n", "Real-world data sets often contain categorical string features, where feature values are strings. For example, consider a categorical feature of job titles at an organization. This feature can contain dozens to hundreds of job titles from ‘Intern’ to ‘President and CEO’, each with its own unique roles and responsibilities. \n", "\n", "Such features contain a large number of categories and are inherently high-cardinality. This disqualifies encoding approaches such as one-hot encoding (because it adds one column per category and so increases the feature dimension significantly) and ordinal encoding (because no natural ordering typically exists). What’s more, in real-world data sets, such high-cardinality features are often also ‘dirty’, in that they contain many variations of the same class or concept.\n", "\n", "To address this issue, we will need to determine categories (and how to encode them) by string similarity rather than by exact matching! The intuition behind this approach is to encode similar categories together in a way that a human might, to ensure that the downstream learning algorithm treats them similarly (as it should). 
\n", "\n", "The ``dirty-cat`` package provides such functionality off-the-shelf and can be used seamlessly in modeling pipelines. The package provides three specialized encoders to handle so-called “dirty categories”, which are essentially noisy and/or high-cardinality string categories:\n", "- ``SimilarityEncoder``, a version of one-hot encoding constructed using string similarities,\n", "- ``GapEncoder``, which encodes categories by considering frequently co-occurring substring combinations, and\n", "- ``MinHashEncoder``, which encodes categories by applying hashing techniques to substrings." ] }, { "cell_type": "code", "execution_count": 1, "id": "dd06db5d", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "be732663", "metadata": {}, "outputs": [], "source": [ "# # Pre-process according to the example in the dirty_cat documentation\n", "# # https://dirty-cat.github.io/stable/auto_examples/01_dirty_categories.html#id2\n", "\n", "# from dirty_cat.datasets import fetch_employee_salaries\n", "# employee_salaries = fetch_employee_salaries()\n", "# X = employee_salaries.X\n", "# y = employee_salaries.y\n", "\n", "# X['date_first_hired'] = pd.to_datetime(X['date_first_hired'])\n", "# X['year_first_hired'] = X['date_first_hired'].apply(lambda x: x.year)\n", "# # Get mask of rows with missing values in gender\n", "# mask = X.isna()['gender']\n", "# # And remove those rows from both X and y\n", "# X.dropna(subset=['gender'], inplace=True)\n", "# y = y[~mask]\n", "# X['salary'] = y\n", "# X = X.drop(['date_first_hired', 'division', 'department'], axis=1)\n", "# X = X.sample(frac=1)\n", "# X.to_csv('./data/ch08/employee_salaries.csv', index=False)" ] }, { "cell_type": "code", "execution_count": 3, "id": "2c237251", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | gender | department_name | assignment_category | employee_position_title | underfilled_job_title | year_first_hired | salary |
|---|---|---|---|---|---|---|---|
| 0 | F | Department of Environmental Protection | Fulltime-Regular | Program Specialist II | NaN | 2013 | 75362.93 |
| 1 | F | Department of Recreation | Fulltime-Regular | Recreation Supervisor | NaN | 1997 | 79522.62 |
| 2 | F | Department of Transportation | Fulltime-Regular | Bus Operator | NaN | 2014 | 42053.83 |
| 3 | M | Fire and Rescue Services | Fulltime-Regular | Fire/Rescue Captain | NaN | 1995 | 114587.02 |
| 4 | F | Department of Public Libraries | Fulltime-Regular | Library Assistant I | NaN | 1996 | 55139.67 |
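
Columns such as ``employee_position_title`` in the table above are the kind of high-cardinality string feature this section is about. The idea of encoding by string similarity rather than exact matching can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the ``dirty-cat`` implementation: the helper names (`char_ngrams`, `ngram_similarity`, `similarity_encode`), the choice of character 3-grams, and the use of Jaccard similarity are all hypothetical choices made for this example.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two strings with no n-grams are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(values, prototypes, n=3):
    """Encode each value as a vector of similarities to the prototype categories."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


# Prototype categories (assumed here to be a few job titles from the table above)
prototypes = ["Bus Operator", "Library Assistant I", "Fire/Rescue Captain"]

# Values to encode: an exact match, a near-variant, and an unrelated title
values = ["Bus Operator", "Library Assistant II", "Police Officer III"]

encoded = similarity_encode(values, prototypes)
for value, row in zip(values, encoded):
    print(value, [round(s, 2) for s in row])
```

An exact match yields a similarity of 1 in its own column, just like one-hot encoding, while a near-variant such as "Library Assistant II" receives a high but not perfect score against "Library Assistant I". This is how a similarity-based encoding lets a downstream model treat related categories alike.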