{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# EqualFrequencyDiscretiser + WoEEncoder\n", "\n", "Combining discretisation with a monotonic encoding is very useful for linear models: it turns variables that were not monotonically related to the target into variables that are, which tends to improve the performance of the linear model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## EqualFrequencyDiscretiser\n", "\n", "The EqualFrequencyDiscretiser() divides continuous numerical variables\n", "into contiguous equal frequency intervals, that is, intervals that contain\n", "approximately the same proportion of observations.\n", "\n", "The interval limits are determined by the quantiles. The number of intervals,\n", "i.e., the number of quantiles into which the variable should be divided, is\n", "determined by the user.\n", "\n", "Note: Check out the EqualFrequencyDiscretiser notebook to learn more about this transformer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## WoEEncoder\n", "\n", "This encoder replaces each label with its weight of evidence (WoE), that is,\n", "the logarithm of the ratio between the proportion of positive cases and the\n", "proportion of negative cases that fall in that label.\n", "\n", "**It only works for binary classification.**\n", "\n", "Note: Check out the WoEEncoder notebook to learn more about this transformer."
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.pipeline import Pipeline\n", "\n", "from feature_engine.discretisation import EqualFrequencyDiscretiser\n", "from feature_engine.encoding import WoEEncoder\n", "\n", "plt.rcParams[\"figure.figsize\"] = [15, 5]" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Load the Titanic dataset from OpenML\n", "\n", "def load_titanic():\n", "    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')\n", "    # missing values are encoded as '?' in this file\n", "    data = data.replace('?', np.nan)\n", "    # keep only the first letter of the cabin ('n' marks a missing cabin)\n", "    data['cabin'] = data['cabin'].astype(str).str[0]\n", "    data['pclass'] = data['pclass'].astype('O')\n", "    # cast to float first, then impute missing values with the median\n", "    data['age'] = data['age'].astype('float')\n", "    data['age'] = data['age'].fillna(data['age'].median())\n", "    data['fare'] = data['fare'].astype('float')\n", "    data['fare'] = data['fare'].fillna(data['fare'].median())\n", "    # assign back instead of filling in place on a column selection\n", "    data['embarked'] = data['embarked'].fillna('C')\n", "    data = data.drop(labels=['boat', 'body', 'home.dest', 'name', 'ticket'], axis=1)\n", "    return data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>pclass</th><th>survived</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>1</td><td>1</td><td>female</td><td>29.0000</td><td>0</td><td>0</td><td>211.3375</td><td>B</td><td>S</td></tr>\n",
" <tr><th>1</th><td>1</td><td>1</td><td>male</td><td>0.9167</td><td>1</td><td>2</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
" <tr><th>2</th><td>1</td><td>0</td><td>female</td><td>2.0000</td><td>1</td><td>2</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
" <tr><th>3</th><td>1</td><td>0</td><td>male</td><td>30.0000</td><td>1</td><td>2</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
" <tr><th>4</th><td>1</td><td>0</td><td>female</td><td>25.0000</td><td>1</td><td>2</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " pclass survived sex age sibsp parch fare cabin embarked\n", "0 1 1 female 29.0000 0 0 211.3375 B S\n", "1 1 1 male 0.9167 1 2 151.5500 C S\n", "2 1 0 female 2.0000 1 2 151.5500 C S\n", "3 1 0 male 30.0000 1 2 151.5500 C S\n", "4 1 0 female 25.0000 1 2 151.5500 C S" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = load_titanic()\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X_train : (916, 8)\n", "X_test : (393, 8)\n" ] } ], "source": [ "# let's separate into training and testing set\n", "X = data.drop(['survived'], axis=1)\n", "y = data.survived\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)\n", "\n", "print(\"X_train :\" ,X_train.shape)\n", "print(\"X_test :\" ,X_test.shape)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# we will use two continuous variables for the transformations\n", "X_train[[\"age\", 'fare']].hist(bins=30)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pipeline(memory=None,\n", " steps=[('EqualFrequencyDiscretiser',\n", " EqualFrequencyDiscretiser(q=4, return_boundaries=False,\n", " return_object=True,\n", " variables=['age', 'fare'])),\n", " ('WoEEncoder', WoEEncoder(variables=['age', 'fare']))],\n", " verbose=False)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# set up the discretiser\n", "\n", "efd = EqualFrequencyDiscretiser(\n", " q=4,\n", " variables=['age', 'fare'],\n", " # important: return values as categorical\n", " return_object=True)\n", "\n", "# set up the encoder\n", "woe = WoEEncoder(variables=['age', 'fare'])\n", "\n", "# pipeline\n", "transformer = Pipeline(steps=[('EqualFrequencyDiscretiser', efd),\n", " ('WoEEncoder', woe),\n", " ])\n", "\n", "transformer.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'age': [-inf, 23.0, 28.0, 35.0, inf],\n", " 'fare': [-inf, 7.8958, 14.4542, 31.275, inf]}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transformer.named_steps['EqualFrequencyDiscretiser'].binner_dict_" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'age': {0: 0.07533270507296917,\n", " 1: -0.260402163917158,\n", " 2: 0.3237107275657203,\n", " 3: 0.05769015189511875},\n", " 'fare': {0: -0.5990108946387251,\n", " 1: -0.41504696424627724,\n", " 2: 0.142571903020815,\n", " 3: 0.7852653023249282}}" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"transformer.named_steps['WoEEncoder'].encoder_dict_" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>1139</th><td>3</td><td>male</td><td>0.057690</td><td>0</td><td>0</td><td>-0.599011</td><td>n</td><td>S</td></tr>\n",
" <tr><th>533</th><td>2</td><td>female</td><td>0.075333</td><td>0</td><td>1</td><td>0.142572</td><td>n</td><td>S</td></tr>\n",
" <tr><th>459</th><td>2</td><td>male</td><td>0.057690</td><td>1</td><td>0</td><td>0.142572</td><td>n</td><td>S</td></tr>\n",
" <tr><th>1150</th><td>3</td><td>male</td><td>-0.260402</td><td>0</td><td>0</td><td>0.142572</td><td>n</td><td>S</td></tr>\n",
" <tr><th>393</th><td>2</td><td>male</td><td>-0.260402</td><td>0</td><td>0</td><td>0.785265</td><td>n</td><td>S</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " pclass sex age sibsp parch fare cabin embarked\n", "1139 3 male 0.057690 0 0 -0.599011 n S\n", "533 2 female 0.075333 0 1 0.142572 n S\n", "459 2 male 0.057690 1 0 0.142572 n S\n", "1150 3 male -0.260402 0 0 0.142572 n S\n", "393 2 male -0.260402 0 0 0.785265 n S" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_t = transformer.transform(X_train)\n", "test_t = transformer.transform(X_test)\n", "\n", "test_t.head()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# let's explore the monotonic relationship\n", "plt.figure(figsize=(7,5))\n", "pd.concat([test_t,y_test], axis=1).groupby(\"fare\")[\"survived\"].mean().plot()\n", "plt.title(\"Relationship between fare and target\")\n", "plt.xlabel(\"fare\")\n", "plt.ylabel(\"Mean of target\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how now the intervals are monotonically sorted respect to the target." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "fengine", "language": "python", "name": "fengine" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }