{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# RareLabelEncoder\n",
"\n",
"The RareLabelEncoder() groups labels that show a small number of observations in the dataset into a new category called 'Rare'. This helps to avoid overfitting.\n",
"\n",
"The argument ' tol ' indicates the percentage of observations that the label needs to have in order not to be re-grouped into the \"Rare\" label.
The argument n_categories indicates the minimum number of distinct categories that a variable needs to have for any of the labels to be re-grouped into 'Rare'.
\n",
"#### Note\n",
"If the number of labels is smaller than n_categories, then the encoder will not group the labels for that variable."
]
},
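{
"cell_type": "markdown",
"metadata": {},
"source": [
"The grouping rule described above can be sketched in plain Python. This is a minimal illustration, not feature_engine's implementation: the function name find_frequent_labels and the toy data are made up, and it assumes that a label is kept when its frequency is at least tol and that grouping only happens when the variable has more than n_categories distinct labels."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"# Hypothetical helper, for illustration only (not part of feature_engine).\n",
"def find_frequent_labels(values, tol=0.05, n_categories=10):\n",
"    counts = Counter(values)\n",
"    # Assumption: with n_categories or fewer distinct labels, nothing is grouped.\n",
"    if len(counts) <= n_categories:\n",
"        return sorted(counts)\n",
"    # Assumption: a label is frequent when its share of the rows is >= tol.\n",
"    return sorted(label for label, n in counts.items() if n / len(values) >= tol)\n",
"\n",
"# Toy variable with 5 labels covering 60%, 30%, 6%, 3% and 1% of the rows.\n",
"values = ['a'] * 60 + ['b'] * 30 + ['c'] * 6 + ['d'] * 3 + ['e'] * 1\n",
"\n",
"# 5 labels > n_categories=3, so labels below 5% would be grouped into 'Rare'.\n",
"print(find_frequent_labels(values, tol=0.05, n_categories=3))\n",
"\n",
"# 5 labels <= n_categories=10, so every label is treated as frequent.\n",
"print(find_frequent_labels(values, tol=0.05, n_categories=10))"
]
},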
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"from feature_engine.encoding import RareLabelEncoder"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Load titanic dataset from OpenML\n",
"\n",
"def load_titanic():\n",
" data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')\n",
" data = data.replace('?', np.nan)\n",
" data['cabin'] = data['cabin'].astype(str).str[0]\n",
" data['pclass'] = data['pclass'].astype('O')\n",
" data['age'] = data['age'].astype('float')\n",
" data['fare'] = data['fare'].astype('float')\n",
" data['embarked'].fillna('C', inplace=True)\n",
" data.drop(labels=['boat', 'body', 'home.dest'], axis=1, inplace=True)\n",
" return data"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pclass | \n",
" survived | \n",
" name | \n",
" sex | \n",
" age | \n",
" sibsp | \n",
" parch | \n",
" ticket | \n",
" fare | \n",
" cabin | \n",
" embarked | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" Allen, Miss. Elisabeth Walton | \n",
" female | \n",
" 29.0000 | \n",
" 0 | \n",
" 0 | \n",
" 24160 | \n",
" 211.3375 | \n",
" B | \n",
" S | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" Allison, Master. Hudson Trevor | \n",
" male | \n",
" 0.9167 | \n",
" 1 | \n",
" 2 | \n",
" 113781 | \n",
" 151.5500 | \n",
" C | \n",
" S | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 0 | \n",
" Allison, Miss. Helen Loraine | \n",
" female | \n",
" 2.0000 | \n",
" 1 | \n",
" 2 | \n",
" 113781 | \n",
" 151.5500 | \n",
" C | \n",
" S | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 0 | \n",
" Allison, Mr. Hudson Joshua Creighton | \n",
" male | \n",
" 30.0000 | \n",
" 1 | \n",
" 2 | \n",
" 113781 | \n",
" 151.5500 | \n",
" C | \n",
" S | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" 0 | \n",
" Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | \n",
" female | \n",
" 25.0000 | \n",
" 1 | \n",
" 2 | \n",
" 113781 | \n",
" 151.5500 | \n",
" C | \n",
" S | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" pclass survived name sex \\\n",
"0 1 1 Allen, Miss. Elisabeth Walton female \n",
"1 1 1 Allison, Master. Hudson Trevor male \n",
"2 1 0 Allison, Miss. Helen Loraine female \n",
"3 1 0 Allison, Mr. Hudson Joshua Creighton male \n",
"4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female \n",
"\n",
" age sibsp parch ticket fare cabin embarked \n",
"0 29.0000 0 0 24160 211.3375 B S \n",
"1 0.9167 1 2 113781 151.5500 C S \n",
"2 2.0000 1 2 113781 151.5500 C S \n",
"3 30.0000 1 2 113781 151.5500 C S \n",
"4 25.0000 1 2 113781 151.5500 C S "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = load_titanic()\n",
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"X = data.drop(['survived', 'name', 'ticket'], axis=1)\n",
"y = data.survived"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"cabin 0\n",
"pclass 0\n",
"embarked 0\n",
"dtype: int64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# we will encode the below variables, they have no missing values\n",
"X[['cabin', 'pclass', 'embarked']].isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"cabin object\n",
"pclass object\n",
"embarked object\n",
"dtype: object"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"''' Make sure that the variables are type (object).\n",
"if not, cast it as object , otherwise the transformer will either send an error (if we pass it as argument) \n",
"or not pick it up (if we leave variables=None). '''\n",
"\n",
"X[['cabin', 'pclass', 'embarked']].dtypes"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((916, 8), (393, 8))"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# let's separate into training and testing set\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)\n",
"\n",
"X_train.shape, X_test.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The RareLabelEncoder() groups rare / infrequent categories in\n",
"a new category called \"Rare\", or any other name entered by the user.\n",
"\n",
"For example in the variable colour,
if the percentage of observations\n",
"for the categories magenta, cyan and burgundy \n",
"are < 5%, all those\n",
"categories will be replaced by the new label \"Rare\".\n",
"\n",
"Note, infrequent labels can also be grouped under a user defined name, for\n",
"example 'Other'. The name to replace infrequent categories is defined\n",
"with the parameter replace_with.\n",
" \n",
"The encoder will encode only categorical variables (type 'object'). A list\n",
"of variables can be passed as an argument. If no variables are passed as \n",
"argument, the encoder will find and encode all categorical variables\n",
"(object type)."
]
},
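{
"cell_type": "markdown",
"metadata": {},
"source": [
"The colour example above can be reproduced with plain pandas. This is a minimal sketch, not the encoder's actual code: the toy counts are invented, and it assumes that categories whose frequency is at least tol are kept while all others receive the replacement label."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy data: three common colours and three that each cover 2% of the rows.\n",
"colour = pd.Series(\n",
"    ['blue'] * 50 + ['red'] * 30 + ['green'] * 14\n",
"    + ['magenta'] * 2 + ['cyan'] * 2 + ['burgundy'] * 2\n",
")\n",
"\n",
"tol = 0.05\n",
"replace_with = 'Rare'  # any other string, for example 'Other', works the same way\n",
"\n",
"# Fraction of observations per category.\n",
"freq = colour.value_counts(normalize=True)\n",
"\n",
"# Assumption: categories at or above the tolerance are kept as they are.\n",
"frequent = freq[freq >= tol].index\n",
"\n",
"# Every other category is replaced by the single label.\n",
"encoded = colour.where(colour.isin(frequent), replace_with)\n",
"\n",
"print(encoded.value_counts())"
]
},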
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\users\\king_ashok\\desktop\\feature_engine\\feature_engine\\encoding\\rare_label.py:139: UserWarning: The number of unique categories for variable pclass is less than that indicated in n_categories. Thus, all categories will be considered frequent\n",
" \"considered frequent\".format(var)\n",
"c:\\users\\king_ashok\\desktop\\feature_engine\\feature_engine\\encoding\\rare_label.py:139: UserWarning: The number of unique categories for variable embarked is less than that indicated in n_categories. Thus, all categories will be considered frequent\n",
" \"considered frequent\".format(var)\n"
]
},
{
"data": {
"text/plain": [
"RareLabelEncoder(n_categories=5, variables=['cabin', 'pclass', 'embarked'])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## Rare value encoder\n",
"'''\n",
"Parameters\n",
"----------\n",
"\n",
"tol: float, default=0.05\n",
" the minimum frequency a label should have to be considered frequent.\n",
" Categories with frequencies lower than tol will be grouped.\n",
"\n",
"n_categories: int, default=10\n",
" the minimum number of categories a variable should have for the encoder\n",
" to find frequent labels. If the variable contains less categories, all\n",
" of them will be considered frequent.\n",
"\n",
"max_n_categories: int, default=None\n",
" the maximum number of categories that should be considered frequent.\n",
" If None, all categories with frequency above the tolerance (tol) will be\n",
" considered.\n",
"\n",
"variables : list, default=None\n",
" The list of categorical variables that will be encoded. If None, the \n",
" encoder will find and select all object type variables.\n",
"\n",
"replace_with : string, default='Rare'\n",
" The category name that will be used to replace infrequent categories.\n",
"'''\n",
"\n",
"rare_encoder = RareLabelEncoder(tol=0.05, \n",
" n_categories=5,\n",
" variables=['cabin', 'pclass', 'embarked'])\n",
"rare_encoder.fit(X_train)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'cabin': Index(['n', 'C'], dtype='object'),\n",
" 'pclass': array([2, 3, 1], dtype=object),\n",
" 'embarked': array(['S', 'C', 'Q'], dtype=object)}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rare_encoder.encoder_dict_"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pclass | \n",
" sex | \n",
" age | \n",
" sibsp | \n",
" parch | \n",
" fare | \n",
" cabin | \n",
" embarked | \n",
"
\n",
" \n",
" \n",
" \n",
" 501 | \n",
" 2 | \n",
" female | \n",
" 13.0 | \n",
" 0 | \n",
" 1 | \n",
" 19.5000 | \n",
" n | \n",
" S | \n",
"
\n",
" \n",
" 588 | \n",
" 2 | \n",
" female | \n",
" 4.0 | \n",
" 1 | \n",
" 1 | \n",
" 23.0000 | \n",
" n | \n",
" S | \n",
"
\n",
" \n",
" 402 | \n",
" 2 | \n",
" female | \n",
" 30.0 | \n",
" 1 | \n",
" 0 | \n",
" 13.8583 | \n",
" n | \n",
" C | \n",
"
\n",
" \n",
" 1193 | \n",
" 3 | \n",
" male | \n",
" NaN | \n",
" 0 | \n",
" 0 | \n",
" 7.7250 | \n",
" n | \n",
" Q | \n",
"
\n",
" \n",
" 686 | \n",
" 3 | \n",
" female | \n",
" 22.0 | \n",
" 0 | \n",
" 0 | \n",
" 7.7250 | \n",
" n | \n",
" Q | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" pclass sex age sibsp parch fare cabin embarked\n",
"501 2 female 13.0 0 1 19.5000 n S\n",
"588 2 female 4.0 1 1 23.0000 n S\n",
"402 2 female 30.0 1 0 13.8583 n C\n",
"1193 3 male NaN 0 0 7.7250 n Q\n",
"686 3 female 22.0 0 0 7.7250 n Q"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_t = rare_encoder.transform(X_train)\n",
"test_t = rare_encoder.transform(X_train)\n",
"\n",
"test_t.head()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"n 702\n",
"Rare 143\n",
"C 71\n",
"Name: cabin, dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_t.cabin.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### The user can change the string from 'Rare' to something else."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pclass | \n",
" sex | \n",
" age | \n",
" sibsp | \n",
" parch | \n",
" fare | \n",
" cabin | \n",
" embarked | \n",
"
\n",
" \n",
" \n",
" \n",
" 1059 | \n",
" 3 | \n",
" male | \n",
" 28.0 | \n",
" 0 | \n",
" 0 | \n",
" 8.0500 | \n",
" n | \n",
" S | \n",
"
\n",
" \n",
" 1227 | \n",
" 3 | \n",
" female | \n",
" 22.0 | \n",
" 0 | \n",
" 0 | \n",
" 9.8375 | \n",
" n | \n",
" S | \n",
"
\n",
" \n",
" 470 | \n",
" 2 | \n",
" male | \n",
" 35.0 | \n",
" 0 | \n",
" 0 | \n",
" 12.3500 | \n",
" n | \n",
" Q | \n",
"
\n",
" \n",
" 66 | \n",
" 1 | \n",
" female | \n",
" 36.0 | \n",
" 0 | \n",
" 0 | \n",
" 262.3750 | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" 950 | \n",
" 3 | \n",
" male | \n",
" 29.0 | \n",
" 0 | \n",
" 0 | \n",
" 9.4833 | \n",
" n | \n",
" S | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" pclass sex age sibsp parch fare cabin embarked\n",
"1059 3 male 28.0 0 0 8.0500 n S\n",
"1227 3 female 22.0 0 0 9.8375 n S\n",
"470 2 male 35.0 0 0 12.3500 n Q\n",
"66 1 female 36.0 0 0 262.3750 B C\n",
"950 3 male 29.0 0 0 9.4833 n S"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## Rare value encoder\n",
"\n",
"rare_encoder = RareLabelEncoder(tol = 0.03,\n",
" replace_with='Other', #replacing 'Rare' with 'Other'\n",
" variables=['cabin', 'pclass', 'embarked'],\n",
" n_categories=2\n",
" )\n",
"\n",
"rare_encoder.fit(X_train)\n",
"\n",
"train_t = rare_encoder.transform(X_train)\n",
"test_t = rare_encoder.transform(X_train)\n",
"\n",
"test_t.sample(5)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'cabin': Index(['n', 'C', 'B', 'E', 'D'], dtype='object'),\n",
" 'pclass': Int64Index([3, 1, 2], dtype='int64'),\n",
" 'embarked': Index(['S', 'C', 'Q'], dtype='object')}"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rare_encoder.encoder_dict_"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"n 702\n",
"C 71\n",
"B 42\n",
"Other 37\n",
"E 32\n",
"D 32\n",
"Name: cabin, dtype: int64"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_t.cabin.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### The user can choose to retain only the most popular categories with the argument max_n_categories."
]
},
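{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how a cap such as max_n_categories could work, using plain pandas rather than the encoder itself. It assumes that the cap is applied after the tol filter, keeping the first labels in order of decreasing frequency; the toy counts are invented."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy variable with six labels covering 70%, 10%, 8%, 6%, 4% and 2% of the rows.\n",
"cabin = pd.Series(['n'] * 70 + ['C'] * 10 + ['B'] * 8 + ['E'] * 6 + ['D'] * 4 + ['A'] * 2)\n",
"\n",
"tol = 0.03\n",
"max_n_categories = 3\n",
"\n",
"# value_counts sorts the labels from most to least frequent.\n",
"freq = cabin.value_counts(normalize=True)\n",
"\n",
"# Labels n, C, B, E and D pass the tolerance; A does not.\n",
"frequent = freq[freq >= tol].index\n",
"\n",
"# Assumption: the cap keeps only the first max_n_categories of those labels.\n",
"kept = list(frequent[:max_n_categories])\n",
"\n",
"encoded = cabin.where(cabin.isin(kept), 'Rare')\n",
"\n",
"print(kept)\n",
"print(encoded.value_counts())"
]
},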
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pclass | \n",
" sex | \n",
" age | \n",
" sibsp | \n",
" parch | \n",
" fare | \n",
" cabin | \n",
" embarked | \n",
"
\n",
" \n",
" \n",
" \n",
" 1222 | \n",
" 3 | \n",
" male | \n",
" 33.0 | \n",
" 0 | \n",
" 0 | \n",
" 8.6625 | \n",
" n | \n",
" C | \n",
"
\n",
" \n",
" 781 | \n",
" 3 | \n",
" male | \n",
" 33.0 | \n",
" 0 | \n",
" 0 | \n",
" 7.8958 | \n",
" n | \n",
" C | \n",
"
\n",
" \n",
" 272 | \n",
" 1 | \n",
" female | \n",
" 23.0 | \n",
" 1 | \n",
" 0 | \n",
" 82.2667 | \n",
" B | \n",
" S | \n",
"
\n",
" \n",
" 1043 | \n",
" 3 | \n",
" female | \n",
" NaN | \n",
" 1 | \n",
" 0 | \n",
" 15.5000 | \n",
" n | \n",
" Q | \n",
"
\n",
" \n",
" 867 | \n",
" 3 | \n",
" female | \n",
" 22.0 | \n",
" 1 | \n",
" 1 | \n",
" 12.2875 | \n",
" n | \n",
" S | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" pclass sex age sibsp parch fare cabin embarked\n",
"1222 3 male 33.0 0 0 8.6625 n C\n",
"781 3 male 33.0 0 0 7.8958 n C\n",
"272 1 female 23.0 1 0 82.2667 B S\n",
"1043 3 female NaN 1 0 15.5000 n Q\n",
"867 3 female 22.0 1 1 12.2875 n S"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## Rare value encoder\n",
"\n",
"rare_encoder = RareLabelEncoder(tol = 0.03,\n",
" variables=['cabin', 'pclass', 'embarked'],\n",
" n_categories=2,\n",
" \n",
" max_n_categories=3 #keeps only the most popular 3 categories in every variable.\n",
" \n",
" )\n",
"\n",
"rare_encoder.fit(X_train)\n",
"\n",
"train_t = rare_encoder.transform(X_train)\n",
"test_t = rare_encoder.transform(X_train)\n",
"\n",
"test_t.sample(5)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'cabin': Index(['n', 'C', 'B'], dtype='object'),\n",
" 'pclass': Int64Index([3, 1, 2], dtype='int64'),\n",
" 'embarked': Index(['S', 'C', 'Q'], dtype='object')}"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rare_encoder.encoder_dict_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Automatically select all categorical variables\n",
"\n",
"If no variable list is passed as argument, it selects all the categorical variables."
]
},
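{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see which columns count as object type, pandas can list them directly. The sketch below uses a made-up DataFrame and select_dtypes; whether the encoder relies on this exact call internally is an assumption. It also shows why an integer-coded variable such as pclass needs to be cast to object before it is picked up."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Made-up data for illustration only.\n",
"df = pd.DataFrame({\n",
"    'pclass': [1, 2, 3, 3],\n",
"    'fare': [211.3, 13.0, 7.7, 8.1],\n",
"    'sex': pd.Series(['female', 'male', 'male', 'female'], dtype=object),\n",
"})\n",
"\n",
"# Integer-coded categories are not object-typed, so they are not listed at first.\n",
"before = list(df.select_dtypes(include='object').columns)\n",
"\n",
"# Casting to object makes the column show up among the categorical candidates.\n",
"df['pclass'] = df['pclass'].astype('O')\n",
"after = list(df.select_dtypes(include='object').columns)\n",
"\n",
"print(before)\n",
"print(after)"
]
},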
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\users\\king_ashok\\desktop\\feature_engine\\feature_engine\\encoding\\rare_label.py:139: UserWarning: The number of unique categories for variable pclass is less than that indicated in n_categories. Thus, all categories will be considered frequent\n",
" \"considered frequent\".format(var)\n",
"c:\\users\\king_ashok\\desktop\\feature_engine\\feature_engine\\encoding\\rare_label.py:139: UserWarning: The number of unique categories for variable sex is less than that indicated in n_categories. Thus, all categories will be considered frequent\n",
" \"considered frequent\".format(var)\n",
"c:\\users\\king_ashok\\desktop\\feature_engine\\feature_engine\\encoding\\rare_label.py:139: UserWarning: The number of unique categories for variable embarked is less than that indicated in n_categories. Thus, all categories will be considered frequent\n",
" \"considered frequent\".format(var)\n"
]
},
{
"data": {
"text/plain": [
"{'pclass': array([2, 3, 1], dtype=object),\n",
" 'sex': array(['female', 'male'], dtype=object),\n",
" 'cabin': Index(['n', 'C', 'B', 'E', 'D'], dtype='object'),\n",
" 'embarked': array(['S', 'C', 'Q'], dtype=object)}"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## Rare value encoder\n",
"\n",
"rare_encoder = RareLabelEncoder(tol = 0.03, n_categories=3)\n",
"\n",
"rare_encoder.fit(X_train)\n",
"\n",
"rare_encoder.encoder_dict_"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pclass | \n",
" sex | \n",
" age | \n",
" sibsp | \n",
" parch | \n",
" fare | \n",
" cabin | \n",
" embarked | \n",
"
\n",
" \n",
" \n",
" \n",
" 385 | \n",
" 2 | \n",
" male | \n",
" 8.0 | \n",
" 1 | \n",
" 1 | \n",
" 36.750 | \n",
" n | \n",
" S | \n",
"
\n",
" \n",
" 154 | \n",
" 1 | \n",
" male | \n",
" 55.0 | \n",
" 1 | \n",
" 1 | \n",
" 93.500 | \n",
" B | \n",
" S | \n",
"
\n",
" \n",
" 323 | \n",
" 2 | \n",
" male | \n",
" 30.0 | \n",
" 1 | \n",
" 0 | \n",
" 24.000 | \n",
" n | \n",
" C | \n",
"
\n",
" \n",
" 572 | \n",
" 2 | \n",
" female | \n",
" 28.0 | \n",
" 0 | \n",
" 0 | \n",
" 12.650 | \n",
" n | \n",
" S | \n",
"
\n",
" \n",
" 809 | \n",
" 3 | \n",
" male | \n",
" 18.0 | \n",
" 2 | \n",
" 2 | \n",
" 34.375 | \n",
" n | \n",
" S | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" pclass sex age sibsp parch fare cabin embarked\n",
"385 2 male 8.0 1 1 36.750 n S\n",
"154 1 male 55.0 1 1 93.500 B S\n",
"323 2 male 30.0 1 0 24.000 n C\n",
"572 2 female 28.0 0 0 12.650 n S\n",
"809 3 male 18.0 2 2 34.375 n S"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_t = rare_encoder.transform(X_train)\n",
"test_t = rare_encoder.transform(X_train)\n",
"\n",
"test_t.sample(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}