{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MeanEncoder\n", "\n", "The MeanEncoder() replaces each label of a categorical variable with the mean value of the target for that label.
For example, in the variable colour, if the mean value of the binary target is 0.5 for the label blue, then blue is replaced by 0.5." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.model_selection import train_test_split\n", "from feature_engine.encoding import MeanEncoder" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Load titanic dataset from OpenML\n", "\n", "def load_titanic():\n", "    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')\n", "    # missing values are coded as '?' in the raw file\n", "    data = data.replace('?', np.nan)\n", "    # keep only the first character of cabin (missing values become 'n', from 'nan')\n", "    data['cabin'] = data['cabin'].astype(str).str[0]\n", "    data['pclass'] = data['pclass'].astype('O')\n", "    data['age'] = data['age'].astype('float')\n", "    data['fare'] = data['fare'].astype('float')\n", "    # assign the result back instead of calling fillna(inplace=True) on a column selection\n", "    data['embarked'] = data['embarked'].fillna('C')\n", "    data = data.drop(labels=['boat', 'body', 'home.dest'], axis=1)\n", "    return data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>pclass</th><th>survived</th><th>name</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>ticket</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>1</td><td>Allen, Miss. Elisabeth Walton</td><td>female</td><td>29.0000</td><td>0</td><td>0</td><td>24160</td><td>211.3375</td><td>B</td><td>S</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>1</td><td>Allison, Master. Hudson Trevor</td><td>male</td><td>0.9167</td><td>1</td><td>2</td><td>113781</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>0</td><td>Allison, Miss. Helen Loraine</td><td>female</td><td>2.0000</td><td>1</td><td>2</td><td>113781</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
"    <tr><th>3</th><td>1</td><td>0</td><td>Allison, Mr. Hudson Joshua Creighton</td><td>male</td><td>30.0000</td><td>1</td><td>2</td><td>113781</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
"    <tr><th>4</th><td>1</td><td>0</td><td>Allison, Mrs. Hudson J C (Bessie Waldo Daniels)</td><td>female</td><td>25.0000</td><td>1</td><td>2</td><td>113781</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " pclass survived name sex \\\n", "0 1 1 Allen, Miss. Elisabeth Walton female \n", "1 1 1 Allison, Master. Hudson Trevor male \n", "2 1 0 Allison, Miss. Helen Loraine female \n", "3 1 0 Allison, Mr. Hudson Joshua Creighton male \n", "4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female \n", "\n", " age sibsp parch ticket fare cabin embarked \n", "0 29.0000 0 0 24160 211.3375 B S \n", "1 0.9167 1 2 113781 151.5500 C S \n", "2 2.0000 1 2 113781 151.5500 C S \n", "3 30.0000 1 2 113781 151.5500 C S \n", "4 25.0000 1 2 113781 151.5500 C S " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = load_titanic()\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "X = data.drop(['survived', 'name', 'ticket'], axis=1)\n", "y = data.survived" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "cabin 0\n", "pclass 0\n", "embarked 0\n", "dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we will encode the variables below; they have no missing values\n", "X[['cabin', 'pclass', 'embarked']].isnull().sum()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "cabin object\n", "pclass object\n", "embarked object\n", "dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Make sure that the variables are of type object. If they are not, cast them\n", "# to object. Otherwise the transformer will either raise an error (if we pass\n", "# them in the variables argument) or not pick them up (if we leave variables=None).\n", "\n", "X[['cabin', 'pclass', 'embarked']].dtypes" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((916, 8), (393, 8))" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's separate the data into training and test sets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)\n", "\n", "X_train.shape, X_test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The MeanEncoder() replaces categories by the mean value of the\n", "target for each category.

\n", "For example, in the variable colour, if the mean of the target for blue, red\n", "and grey is 0.5, 0.8 and 0.1 respectively, blue is replaced by 0.5, red by 0.8\n", "and grey by 0.1.
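
The colour example can be reproduced with plain pandas. This is a minimal sketch, assuming a hypothetical toy DataFrame with the columns colour and target (neither is part of the Titanic data used in this notebook). It only illustrates the kind of mapping the encoder learns and is not meant to show how MeanEncoder() is implemented internally:

```python
import pandas as pd

# Toy data (hypothetical), built so that the mean of the target is
# 0.5 for blue, 0.8 for red and 0.1 for grey.
toy = pd.DataFrame({
    'colour': ['blue'] * 2 + ['red'] * 5 + ['grey'] * 10,
    'target': [1, 0] + [1, 1, 1, 1, 0] + [1] + [0] * 9,
})

# Learn the mapping: mean of the target per category.
mapping = toy.groupby('colour')['target'].mean().to_dict()

# Apply the mapping: each label is replaced by its learned mean.
toy['colour_encoded'] = toy['colour'].map(mapping)

print(mapping)
```

In practice the mapping should be learned on the training set only and then applied to the test set, which is what the fit and transform steps of the encoder do further down.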

\n", "The encoder will encode only categorical variables (type 'object'). A list\n", "of variables can be passed as an argument. If no variables are passed,\n", "the encoder will find and encode all categorical variables\n", "(object type)." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MeanEncoder(variables=['cabin', 'pclass', 'embarked'])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we will transform 3 variables\n", "'''\n", "Parameters\n", "----------\n", "variables : list, default=None\n", "    The list of categorical variables that will be encoded. If None, the\n", "    encoder will find and select all object type variables.\n", "'''\n", "\n", "mean_enc = MeanEncoder(variables=['cabin', 'pclass', 'embarked'])\n", "\n", "# Note: the MeanEncoder needs the target to fit\n", "mean_enc.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'cabin': {'A': 0.5294117647058824,\n", " 'B': 0.7619047619047619,\n", " 'C': 0.5633802816901409,\n", " 'D': 0.71875,\n", " 'E': 0.71875,\n", " 'F': 0.6666666666666666,\n", " 'G': 0.5,\n", " 'T': 0.0,\n", " 'n': 0.30484330484330485},\n", " 'pclass': {1: 0.6173913043478261,\n", " 2: 0.43617021276595747,\n", " 3: 0.25903614457831325},\n", " 'embarked': {'C': 0.5580110497237569,\n", " 'Q': 0.37349397590361444,\n", " 'S': 0.3389570552147239}}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# see the dictionary with the mappings per variable\n", "\n", "mean_enc.encoder_dict_" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1139</th><td>0.259036</td><td>male</td><td>38.0</td><td>0</td><td>0</td><td>7.8958</td><td>0.304843</td><td>0.338957</td></tr>\n",
"    <tr><th>533</th><td>0.436170</td><td>female</td><td>21.0</td><td>0</td><td>1</td><td>21.0000</td><td>0.304843</td><td>0.338957</td></tr>\n",
"    <tr><th>459</th><td>0.436170</td><td>male</td><td>42.0</td><td>1</td><td>0</td><td>27.0000</td><td>0.304843</td><td>0.338957</td></tr>\n",
"    <tr><th>1150</th><td>0.259036</td><td>male</td><td>NaN</td><td>0</td><td>0</td><td>14.5000</td><td>0.304843</td><td>0.338957</td></tr>\n",
"    <tr><th>393</th><td>0.436170</td><td>male</td><td>25.0</td><td>0</td><td>0</td><td>31.5000</td><td>0.304843</td><td>0.338957</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " pclass sex age sibsp parch fare cabin embarked\n", "1139 0.259036 male 38.0 0 0 7.8958 0.304843 0.338957\n", "533 0.436170 female 21.0 0 1 21.0000 0.304843 0.338957\n", "459 0.436170 male 42.0 1 0 27.0000 0.304843 0.338957\n", "1150 0.259036 male NaN 0 0 14.5000 0.304843 0.338957\n", "393 0.436170 male 25.0 0 0 31.5000 0.304843 0.338957" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we can see the transformed variables in the head view\n", "\n", "train_t = mean_enc.transform(X_train)\n", "test_t = mean_enc.transform(X_test)\n", "\n", "test_t.head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# The MeanEncoder returns variables that are monotonic with the target,\n", "# that is, encoded variables whose values increase as the mean of the target increases.\n", "\n", "# let's explore the monotonic relationship\n", "plt.figure(figsize=(7,5))\n", "pd.concat([test_t,y_test], axis=1).groupby(\"pclass\")[\"survived\"].mean().plot()\n", "#plt.xticks([0,1,2])\n", "plt.yticks(np.arange(0,1.1,0.1))\n", "plt.title(\"Relationship between pclass and target\")\n", "plt.xlabel(\"Pclass\")\n", "plt.ylabel(\"Mean of target\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Automatically select the variables\n", "\n", "When no variables are specified, the encoder selects all categorical variables to encode." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MeanEncoder(variables=['pclass', 'sex', 'cabin', 'embarked'])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_enc = MeanEncoder()\n", "\n", "mean_enc.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['pclass', 'sex', 'cabin', 'embarked']" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_enc.variables" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1139</th><td>0.259036</td><td>0.187608</td><td>38.0</td><td>0</td><td>0</td><td>7.8958</td><td>0.304843</td><td>0.338957</td></tr>\n",
"    <tr><th>533</th><td>0.436170</td><td>0.728358</td><td>21.0</td><td>0</td><td>1</td><td>21.0000</td><td>0.304843</td><td>0.338957</td></tr>\n",
"    <tr><th>459</th><td>0.436170</td><td>0.187608</td><td>42.0</td><td>1</td><td>0</td><td>27.0000</td><td>0.304843</td><td>0.338957</td></tr>\n",
"    <tr><th>1150</th><td>0.259036</td><td>0.187608</td><td>NaN</td><td>0</td><td>0</td><td>14.5000</td><td>0.304843</td><td>0.338957</td></tr>\n",
"    <tr><th>393</th><td>0.436170</td><td>0.187608</td><td>25.0</td><td>0</td><td>0</td><td>31.5000</td><td>0.304843</td><td>0.338957</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " pclass sex age sibsp parch fare cabin embarked\n", "1139 0.259036 0.187608 38.0 0 0 7.8958 0.304843 0.338957\n", "533 0.436170 0.728358 21.0 0 1 21.0000 0.304843 0.338957\n", "459 0.436170 0.187608 42.0 1 0 27.0000 0.304843 0.338957\n", "1150 0.259036 0.187608 NaN 0 0 14.5000 0.304843 0.338957\n", "393 0.436170 0.187608 25.0 0 0 31.5000 0.304843 0.338957" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we can see the transformed variables in the head view\n", "\n", "train_t = mean_enc.transform(X_train)\n", "test_t = mean_enc.transform(X_test)\n", "\n", "test_t.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }