{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# RareLabelEncoder\n", "\n", "The RareLabelEncoder() groups labels that appear in only a small number of observations into a new category called 'Rare'. This helps to avoid overfitting.\n", "\n", "The argument tol indicates the minimum fraction of observations (for example, 0.05 for 5%) that a label needs to have in order not to be re-grouped into the \"Rare\" label.
The argument n_categories indicates the minimum number of distinct categories that a variable needs to have for any of the labels to be re-grouped into 'Rare'.
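
The interplay of tol and n_categories can be sketched in plain Python. This is a minimal illustration under stated assumptions, not feature_engine's actual implementation: the helpers find_frequent_labels and group_rare and the colour data are hypothetical, and the handling of a variable with exactly n_categories labels follows the wording above rather than the library source.

```python
from collections import Counter


def find_frequent_labels(values, tol=0.05, n_categories=10):
    # Hypothetical helper (not part of feature_engine): return the set of
    # labels that would be kept as they are.
    counts = Counter(values)
    # Assumption: with fewer distinct labels than n_categories,
    # every label is treated as frequent and nothing is grouped.
    if len(counts) < n_categories:
        return set(counts)
    total = len(values)
    # A label is frequent when its share of observations is at least tol.
    return {label for label, count in counts.items() if count / total >= tol}


def group_rare(values, frequent, replace_with='Rare'):
    # Hypothetical helper: replace every infrequent label with replace_with.
    return [v if v in frequent else replace_with for v in values]


# Made-up data: 100 observations spread over 4 colours.
colours = ['red'] * 50 + ['blue'] * 46 + ['cyan'] * 2 + ['magenta'] * 2

frequent = find_frequent_labels(colours, tol=0.05, n_categories=3)
encoded = group_rare(colours, frequent)

print(sorted(frequent))          # ['blue', 'red']
print(Counter(encoded)['Rare'])  # 4
```

With n_categories raised above the number of distinct colours (for example n_categories=10), the same call returns all four labels and nothing is grouped.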

\n", "#### Note\n", "If the number of labels is smaller than n_categories, then the encoder will not group the labels for that variable." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.model_selection import train_test_split\n", "from feature_engine.encoding import RareLabelEncoder" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Load titanic dataset from OpenML\n", "\n", "def load_titanic():\n", "    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')\n", "    data = data.replace('?', np.nan)\n", "    data['cabin'] = data['cabin'].astype(str).str[0]\n", "    data['pclass'] = data['pclass'].astype('O')\n", "    data['age'] = data['age'].astype('float')\n", "    data['fare'] = data['fare'].astype('float')\n", "    # assign the result back instead of calling fillna in place on a column selection\n", "    data['embarked'] = data['embarked'].fillna('C')\n", "    data.drop(labels=['boat', 'body', 'home.dest'], axis=1, inplace=True)\n", "    return data" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>pclass</th><th>survived</th><th>name</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>ticket</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>1</td><td>1</td><td>Allen, Miss. Elisabeth Walton</td><td>female</td><td>29.0000</td><td>0</td><td>0</td><td>24160</td><td>211.3375</td><td>B</td><td>S</td></tr>\n",
"<tr><th>1</th><td>1</td><td>1</td><td>Allison, Master. Hudson Trevor</td><td>male</td><td>0.9167</td><td>1</td><td>2</td><td>113781</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
"<tr><th>2</th><td>1</td><td>0</td><td>Allison, Miss. Helen Loraine</td><td>female</td><td>2.0000</td><td>1</td><td>2</td><td>113781</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
"<tr><th>3</th><td>1</td><td>0</td><td>Allison, Mr. Hudson Joshua Creighton</td><td>male</td><td>30.0000</td><td>1</td><td>2</td><td>113781</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
"<tr><th>4</th><td>1</td><td>0</td><td>Allison, Mrs. Hudson J C (Bessie Waldo Daniels)</td><td>female</td><td>25.0000</td><td>1</td><td>2</td><td>113781</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " pclass survived name sex \\\n", "0 1 1 Allen, Miss. Elisabeth Walton female \n", "1 1 1 Allison, Master. Hudson Trevor male \n", "2 1 0 Allison, Miss. Helen Loraine female \n", "3 1 0 Allison, Mr. Hudson Joshua Creighton male \n", "4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female \n", "\n", " age sibsp parch ticket fare cabin embarked \n", "0 29.0000 0 0 24160 211.3375 B S \n", "1 0.9167 1 2 113781 151.5500 C S \n", "2 2.0000 1 2 113781 151.5500 C S \n", "3 30.0000 1 2 113781 151.5500 C S \n", "4 25.0000 1 2 113781 151.5500 C S " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = load_titanic()\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "X = data.drop(['survived', 'name', 'ticket'], axis=1)\n", "y = data.survived" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "cabin 0\n", "pclass 0\n", "embarked 0\n", "dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we will encode the below variables, they have no missing values\n", "X[['cabin', 'pclass', 'embarked']].isnull().sum()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "cabin object\n", "pclass object\n", "embarked object\n", "dtype: object" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "''' Make sure that the variables are type (object).\n", "if not, cast it as object , otherwise the transformer will either send an error (if we pass it as argument) \n", "or not pick it up (if we leave variables=None). 
'''\n", "\n", "X[['cabin', 'pclass', 'embarked']].dtypes" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((916, 8), (393, 8))" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's separate into training and testing set\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)\n", "\n", "X_train.shape, X_test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The RareLabelEncoder() groups rare / infrequent categories in\n", "a new category called \"Rare\", or any other name entered by the user.\n", "\n", "For example in the variable colour,
if the percentage of observations\n", "for each of the categories magenta, cyan and burgundy \n", "is below 5%, all those\n", "categories will be replaced by the new label \"Rare\".\n", "\n", "Note that infrequent labels can also be grouped under a user-defined name, for\n", "example 'Other'. The name that replaces infrequent categories is set\n", "with the parameter replace_with.\n", " \n", "The encoder will encode only categorical variables (type 'object'). A list\n", "of variables can be passed as an argument. If no variables are passed as an \n", "argument, the encoder will find and encode all categorical variables\n", "(object type)." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "c:\\users\\king_ashok\\desktop\\feature_engine\\feature_engine\\encoding\\rare_label.py:139: UserWarning: The number of unique categories for variable pclass is less than that indicated in n_categories. Thus, all categories will be considered frequent\n", " \"considered frequent\".format(var)\n", "c:\\users\\king_ashok\\desktop\\feature_engine\\feature_engine\\encoding\\rare_label.py:139: UserWarning: The number of unique categories for variable embarked is less than that indicated in n_categories. Thus, all categories will be considered frequent\n", " \"considered frequent\".format(var)\n" ] }, { "data": { "text/plain": [ "RareLabelEncoder(n_categories=5, variables=['cabin', 'pclass', 'embarked'])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Rare value encoder\n", "'''\n", "Parameters\n", "----------\n", "\n", "tol: float, default=0.05\n", " the minimum frequency a label should have to be considered frequent.\n", " Categories with frequencies lower than tol will be grouped.\n", "\n", "n_categories: int, default=10\n", " the minimum number of categories a variable should have for the encoder\n", " to find frequent labels. 
If the variable contains less categories, all\n", " of them will be considered frequent.\n", "\n", "max_n_categories: int, default=None\n", " the maximum number of categories that should be considered frequent.\n", " If None, all categories with frequency above the tolerance (tol) will be\n", " considered.\n", "\n", "variables : list, default=None\n", " The list of categorical variables that will be encoded. If None, the \n", " encoder will find and select all object type variables.\n", "\n", "replace_with : string, default='Rare'\n", " The category name that will be used to replace infrequent categories.\n", "'''\n", "\n", "rare_encoder = RareLabelEncoder(tol=0.05, \n", " n_categories=5,\n", " variables=['cabin', 'pclass', 'embarked'])\n", "rare_encoder.fit(X_train)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'cabin': Index(['n', 'C'], dtype='object'),\n", " 'pclass': array([2, 3, 1], dtype=object),\n", " 'embarked': array(['S', 'C', 'Q'], dtype=object)}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rare_encoder.encoder_dict_" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>501</th><td>2</td><td>female</td><td>13.0</td><td>0</td><td>1</td><td>19.5000</td><td>n</td><td>S</td></tr>\n",
"<tr><th>588</th><td>2</td><td>female</td><td>4.0</td><td>1</td><td>1</td><td>23.0000</td><td>n</td><td>S</td></tr>\n",
"<tr><th>402</th><td>2</td><td>female</td><td>30.0</td><td>1</td><td>0</td><td>13.8583</td><td>n</td><td>C</td></tr>\n",
"<tr><th>1193</th><td>3</td><td>male</td><td>NaN</td><td>0</td><td>0</td><td>7.7250</td><td>n</td><td>Q</td></tr>\n",
"<tr><th>686</th><td>3</td><td>female</td><td>22.0</td><td>0</td><td>0</td><td>7.7250</td><td>n</td><td>Q</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " pclass sex age sibsp parch fare cabin embarked\n", "501 2 female 13.0 0 1 19.5000 n S\n", "588 2 female 4.0 1 1 23.0000 n S\n", "402 2 female 30.0 1 0 13.8583 n C\n", "1193 3 male NaN 0 0 7.7250 n Q\n", "686 3 female 22.0 0 0 7.7250 n Q" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_t = rare_encoder.transform(X_train)\n", "test_t = rare_encoder.transform(X_train)\n", "\n", "test_t.head()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "n 702\n", "Rare 143\n", "C 71\n", "Name: cabin, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_t.cabin.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The user can change the string from 'Rare' to something else." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>1059</th><td>3</td><td>male</td><td>28.0</td><td>0</td><td>0</td><td>8.0500</td><td>n</td><td>S</td></tr>\n",
"<tr><th>1227</th><td>3</td><td>female</td><td>22.0</td><td>0</td><td>0</td><td>9.8375</td><td>n</td><td>S</td></tr>\n",
"<tr><th>470</th><td>2</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>12.3500</td><td>n</td><td>Q</td></tr>\n",
"<tr><th>66</th><td>1</td><td>female</td><td>36.0</td><td>0</td><td>0</td><td>262.3750</td><td>B</td><td>C</td></tr>\n",
"<tr><th>950</th><td>3</td><td>male</td><td>29.0</td><td>0</td><td>0</td><td>9.4833</td><td>n</td><td>S</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " pclass sex age sibsp parch fare cabin embarked\n", "1059 3 male 28.0 0 0 8.0500 n S\n", "1227 3 female 22.0 0 0 9.8375 n S\n", "470 2 male 35.0 0 0 12.3500 n Q\n", "66 1 female 36.0 0 0 262.3750 B C\n", "950 3 male 29.0 0 0 9.4833 n S" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Rare value encoder\n", "\n", "rare_encoder = RareLabelEncoder(tol = 0.03,\n", " replace_with='Other', #replacing 'Rare' with 'Other'\n", " variables=['cabin', 'pclass', 'embarked'],\n", " n_categories=2\n", " )\n", "\n", "rare_encoder.fit(X_train)\n", "\n", "train_t = rare_encoder.transform(X_train)\n", "test_t = rare_encoder.transform(X_train)\n", "\n", "test_t.sample(5)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'cabin': Index(['n', 'C', 'B', 'E', 'D'], dtype='object'),\n", " 'pclass': Int64Index([3, 1, 2], dtype='int64'),\n", " 'embarked': Index(['S', 'C', 'Q'], dtype='object')}" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rare_encoder.encoder_dict_" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "n 702\n", "C 71\n", "B 42\n", "Other 37\n", "E 32\n", "D 32\n", "Name: cabin, dtype: int64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_t.cabin.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The user can choose to retain only the most popular categories with the argument max_n_categories." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>1222</th><td>3</td><td>male</td><td>33.0</td><td>0</td><td>0</td><td>8.6625</td><td>n</td><td>C</td></tr>\n",
"<tr><th>781</th><td>3</td><td>male</td><td>33.0</td><td>0</td><td>0</td><td>7.8958</td><td>n</td><td>C</td></tr>\n",
"<tr><th>272</th><td>1</td><td>female</td><td>23.0</td><td>1</td><td>0</td><td>82.2667</td><td>B</td><td>S</td></tr>\n",
"<tr><th>1043</th><td>3</td><td>female</td><td>NaN</td><td>1</td><td>0</td><td>15.5000</td><td>n</td><td>Q</td></tr>\n",
"<tr><th>867</th><td>3</td><td>female</td><td>22.0</td><td>1</td><td>1</td><td>12.2875</td><td>n</td><td>S</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " pclass sex age sibsp parch fare cabin embarked\n", "1222 3 male 33.0 0 0 8.6625 n C\n", "781 3 male 33.0 0 0 7.8958 n C\n", "272 1 female 23.0 1 0 82.2667 B S\n", "1043 3 female NaN 1 0 15.5000 n Q\n", "867 3 female 22.0 1 1 12.2875 n S" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Rare value encoder\n", "\n", "rare_encoder = RareLabelEncoder(tol = 0.03,\n", " variables=['cabin', 'pclass', 'embarked'],\n", " n_categories=2,\n", " \n", " max_n_categories=3 #keeps only the most popular 3 categories in every variable.\n", " \n", " )\n", "\n", "rare_encoder.fit(X_train)\n", "\n", "train_t = rare_encoder.transform(X_train)\n", "test_t = rare_encoder.transform(X_train)\n", "\n", "test_t.sample(5)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'cabin': Index(['n', 'C', 'B'], dtype='object'),\n", " 'pclass': Int64Index([3, 1, 2], dtype='int64'),\n", " 'embarked': Index(['S', 'C', 'Q'], dtype='object')}" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rare_encoder.encoder_dict_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Automatically select all categorical variables\n", "\n", "If no variable list is passed as argument, it selects all the categorical variables." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "c:\\users\\king_ashok\\desktop\\feature_engine\\feature_engine\\encoding\\rare_label.py:139: UserWarning: The number of unique categories for variable pclass is less than that indicated in n_categories. 
Thus, all categories will be considered frequent\n", " \"considered frequent\".format(var)\n", "c:\\users\\king_ashok\\desktop\\feature_engine\\feature_engine\\encoding\\rare_label.py:139: UserWarning: The number of unique categories for variable sex is less than that indicated in n_categories. Thus, all categories will be considered frequent\n", " \"considered frequent\".format(var)\n", "c:\\users\\king_ashok\\desktop\\feature_engine\\feature_engine\\encoding\\rare_label.py:139: UserWarning: The number of unique categories for variable embarked is less than that indicated in n_categories. Thus, all categories will be considered frequent\n", " \"considered frequent\".format(var)\n" ] }, { "data": { "text/plain": [ "{'pclass': array([2, 3, 1], dtype=object),\n", " 'sex': array(['female', 'male'], dtype=object),\n", " 'cabin': Index(['n', 'C', 'B', 'E', 'D'], dtype='object'),\n", " 'embarked': array(['S', 'C', 'Q'], dtype=object)}" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Rare value encoder\n", "\n", "rare_encoder = RareLabelEncoder(tol = 0.03, n_categories=3)\n", "\n", "rare_encoder.fit(X_train)\n", "\n", "rare_encoder.encoder_dict_" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>385</th><td>2</td><td>male</td><td>8.0</td><td>1</td><td>1</td><td>36.750</td><td>n</td><td>S</td></tr>\n",
"<tr><th>154</th><td>1</td><td>male</td><td>55.0</td><td>1</td><td>1</td><td>93.500</td><td>B</td><td>S</td></tr>\n",
"<tr><th>323</th><td>2</td><td>male</td><td>30.0</td><td>1</td><td>0</td><td>24.000</td><td>n</td><td>C</td></tr>\n",
"<tr><th>572</th><td>2</td><td>female</td><td>28.0</td><td>0</td><td>0</td><td>12.650</td><td>n</td><td>S</td></tr>\n",
"<tr><th>809</th><td>3</td><td>male</td><td>18.0</td><td>2</td><td>2</td><td>34.375</td><td>n</td><td>S</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " pclass sex age sibsp parch fare cabin embarked\n", "385 2 male 8.0 1 1 36.750 n S\n", "154 1 male 55.0 1 1 93.500 B S\n", "323 2 male 30.0 1 0 24.000 n C\n", "572 2 female 28.0 0 0 12.650 n S\n", "809 3 male 18.0 2 2 34.375 n S" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_t = rare_encoder.transform(X_train)\n", "test_t = rare_encoder.transform(X_train)\n", "\n", "test_t.sample(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }