{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## WoEEncoder (weight of evidence)\n", "\n", "This encoder replaces each category with its weight of evidence (WoE).\n", "#### It only works for binary classification.\n", "\n", "The weight of evidence is given by: log( p(1) / p(0) ), where p(1) is the share of the\n", "observations with target = 1 that fall in the category, and p(0) is the share of the\n", "observations with target = 0 that fall in the category." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.model_selection import train_test_split\n", "from feature_engine.encoding import WoEEncoder\n", "\n", "from feature_engine.encoding import RareLabelEncoder  # to reduce cardinality" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Load titanic dataset from OpenML\n", "\n", "def load_titanic():\n", "    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')\n", "    data = data.replace('?', np.nan)\n", "    # keep only the first letter of the cabin; missing values become 'n'\n", "    data['cabin'] = data['cabin'].astype(str).str[0]\n", "    data['pclass'] = data['pclass'].astype('O')\n", "    data['age'] = data['age'].astype('float')\n", "    data['fare'] = data['fare'].astype('float')\n", "    # assign the result instead of filling in place on a column selection\n", "    data['embarked'] = data['embarked'].fillna('C')\n", "    data = data.drop(columns=['boat', 'body', 'home.dest'])\n", "    return data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>pclass</th><th>survived</th><th>name</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>ticket</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>1</td><td>Allen, Miss. Elisabeth Walton</td><td>female</td><td>29.0000</td><td>0</td><td>0</td><td>24160</td><td>211.3375</td><td>B</td><td>S</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>1</td><td>Allison, Master. Hudson Trevor</td><td>male</td><td>0.9167</td><td>1</td><td>2</td><td>113781</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>0</td><td>Allison, Miss. Helen Loraine</td><td>female</td><td>2.0000</td><td>1</td><td>2</td><td>113781</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
"    <tr><th>3</th><td>1</td><td>0</td><td>Allison, Mr. Hudson Joshua Creighton</td><td>male</td><td>30.0000</td><td>1</td><td>2</td><td>113781</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
"    <tr><th>4</th><td>1</td><td>0</td><td>Allison, Mrs. Hudson J C (Bessie Waldo Daniels)</td><td>female</td><td>25.0000</td><td>1</td><td>2</td><td>113781</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " pclass survived name sex \\\n", "0 1 1 Allen, Miss. Elisabeth Walton female \n", "1 1 1 Allison, Master. Hudson Trevor male \n", "2 1 0 Allison, Miss. Helen Loraine female \n", "3 1 0 Allison, Mr. Hudson Joshua Creighton male \n", "4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female \n", "\n", " age sibsp parch ticket fare cabin embarked \n", "0 29.0000 0 0 24160 211.3375 B S \n", "1 0.9167 1 2 113781 151.5500 C S \n", "2 2.0000 1 2 113781 151.5500 C S \n", "3 30.0000 1 2 113781 151.5500 C S \n", "4 25.0000 1 2 113781 151.5500 C S " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = load_titanic()\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "X = data.drop(['survived', 'name', 'ticket'], axis=1)\n", "y = data.survived" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "cabin 0\n", "pclass 0\n", "embarked 0\n", "dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we will encode the variables below; they have no missing values\n", "X[['cabin', 'pclass', 'embarked']].isnull().sum()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "cabin object\n", "pclass object\n", "embarked object\n", "dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "''' Make sure that the variables are of type object.\n", "If not, cast them as object; otherwise the transformer will either raise an error (if we pass them as argument)\n", "or not pick them up (if we leave variables=None).
'''\n", "\n", "X[['cabin', 'pclass', 'embarked']].dtypes" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((916, 8), (393, 8))" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's separate into training and testing set\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)\n", "\n", "X_train.shape, X_test.shape" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# apply the rare label encoder first to reduce the cardinality\n", "# see the RareLabelEncoder jupyter notebook for more details on this encoder\n", "rare_encoder = RareLabelEncoder(\n", "    tol=0.03,\n", "    n_categories=2,\n", "    variables=['cabin', 'pclass', 'embarked'],\n", ")\n", "\n", "rare_encoder.fit(X_train)\n", "\n", "# transform\n", "train_t = rare_encoder.transform(X_train)\n", "test_t = rare_encoder.transform(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The WoEEncoder() replaces categories by the weight of evidence.\n", "\n", "The weight of evidence is given by: log(P(X=xj|Y=1) / P(X=xj|Y=0))\n", "\n", "\n", "Note: This categorical encoding is exclusive for binary classification.\n", "\n", "For example, in the variable colour, if blue accounts for 80% of the observations\n", "with target = 1 and for 20% of the observations with target = 0, blue will be replaced by:\n", "np.log(0.8/0.2) = 1.386\n", "#### Note: \n", "Division by 0 and log(0) are not defined.\n", "Thus, if P(X=xj|Y=0) = 0 or P(X=xj|Y=1) = 0 for any category\n", "in any of the variables, the encoder will raise an error.\n", " \n", "The encoder will encode only categorical variables (type 'object'). A list\n", "of variables can be passed as an argument. If no variables are passed as\n", "argument, the encoder will find and encode all categorical variables\n", "(object type).\n", "\n", "For details on the calculation of the weight of evidence visit:
\n", "https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Weight of evidence" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "WoEEncoder(variables=['cabin', 'pclass', 'embarked'])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "woe_enc = WoEEncoder(variables=['cabin', 'pclass', 'embarked'])\n", "\n", "# to fit you need to pass the target y\n", "woe_enc.fit(train_t, y_train)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'cabin': {'B': 1.6299623810120747,\n", " 'C': 0.7217038208351837,\n", " 'D': 1.405081209799324,\n", " 'E': 1.405081209799324,\n", " 'Rare': 0.7387452866900354,\n", " 'n': -0.35752781962490193},\n", " 'pclass': {1: 0.9453018143294478,\n", " 2: 0.21009172435857942,\n", " 3: -0.5841726684724614},\n", " 'embarked': {'C': 0.6999054533737715,\n", " 'Q': -0.05044494288988759,\n", " 'S': -0.20113381737960143}}" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "woe_enc.encoder_dict_" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>310</th><td>0.945302</td><td>male</td><td>57.0</td><td>1</td><td>1</td><td>164.8667</td><td>-0.357528</td><td>-0.201134</td></tr>\n",
"    <tr><th>853</th><td>-0.584173</td><td>male</td><td>25.0</td><td>0</td><td>0</td><td>7.2500</td><td>-0.357528</td><td>-0.201134</td></tr>\n",
"    <tr><th>1090</th><td>-0.584173</td><td>female</td><td>23.0</td><td>0</td><td>0</td><td>8.6625</td><td>-0.357528</td><td>-0.201134</td></tr>\n",
"    <tr><th>988</th><td>-0.584173</td><td>male</td><td>NaN</td><td>0</td><td>0</td><td>7.7500</td><td>-0.357528</td><td>-0.050445</td></tr>\n",
"    <tr><th>875</th><td>-0.584173</td><td>male</td><td>30.0</td><td>0</td><td>0</td><td>7.2292</td><td>-0.357528</td><td>0.699905</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " pclass sex age sibsp parch fare cabin embarked\n", "310 0.945302 male 57.0 1 1 164.8667 -0.357528 -0.201134\n", "853 -0.584173 male 25.0 0 0 7.2500 -0.357528 -0.201134\n", "1090 -0.584173 female 23.0 0 0 8.6625 -0.357528 -0.201134\n", "988 -0.584173 male NaN 0 0 7.7500 -0.357528 -0.050445\n", "875 -0.584173 male 30.0 0 0 7.2292 -0.357528 0.699905" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# transform and visualise the data\n", "\n", "train_t = woe_enc.transform(train_t)\n", "test_t = woe_enc.transform(test_t)\n", "\n", "test_t.sample(5)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "''' The WoEEncoder returns monotonic variables, that is,\n", " encoded variables whose values increase as the mean of the target increases'''\n", "\n", "# let's explore the monotonic relationship\n", "plt.figure(figsize=(7,5))\n", "pd.concat([test_t,y_test], axis=1).groupby(\"pclass\")[\"survived\"].mean().plot()\n", "#plt.xticks([0,1,2])\n", "plt.yticks(np.arange(0,1.1,0.1))\n", "plt.title(\"Relationship between pclass and target\")\n", "plt.xlabel(\"Pclass\")\n", "plt.ylabel(\"Mean of target\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Automatically select the variables\n", "\n", "When no variables are specified at initialisation, the encoder selects all categorical variables to encode." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "WoEEncoder(variables=['sex'])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ratio_enc = WoEEncoder()\n", "\n", "# to fit we need to pass the target y\n", "ratio_enc.fit(train_t, y_train)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1139</th><td>-0.584173</td><td>-0.99882</td><td>38.0</td><td>0</td><td>0</td><td>7.8958</td><td>-0.357528</td><td>-0.201134</td></tr>\n",
"    <tr><th>533</th><td>0.210092</td><td>1.45312</td><td>21.0</td><td>0</td><td>1</td><td>21.0000</td><td>-0.357528</td><td>-0.201134</td></tr>\n",
"    <tr><th>459</th><td>0.210092</td><td>-0.99882</td><td>42.0</td><td>1</td><td>0</td><td>27.0000</td><td>-0.357528</td><td>-0.201134</td></tr>\n",
"    <tr><th>1150</th><td>-0.584173</td><td>-0.99882</td><td>NaN</td><td>0</td><td>0</td><td>14.5000</td><td>-0.357528</td><td>-0.201134</td></tr>\n",
"    <tr><th>393</th><td>0.210092</td><td>-0.99882</td><td>25.0</td><td>0</td><td>0</td><td>31.5000</td><td>-0.357528</td><td>-0.201134</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " pclass sex age sibsp parch fare cabin embarked\n", "1139 -0.584173 -0.99882 38.0 0 0 7.8958 -0.357528 -0.201134\n", "533 0.210092 1.45312 21.0 0 1 21.0000 -0.357528 -0.201134\n", "459 0.210092 -0.99882 42.0 1 0 27.0000 -0.357528 -0.201134\n", "1150 -0.584173 -0.99882 NaN 0 0 14.5000 -0.357528 -0.201134\n", "393 0.210092 -0.99882 25.0 0 0 31.5000 -0.357528 -0.201134" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# transform and visualise the data\n", "\n", "train_t = ratio_enc.transform(train_t)\n", "test_t = ratio_enc.transform(test_t)\n", "\n", "test_t.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }