{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# EqualWidthDiscretiser + OrdinalEncoder\n", "\n", "\n", "Combining discretisation with a monotonic encoding is very useful for linear models: it turns variables that were not monotonically related to the target into variables that are. This tends to improve the performance of the linear model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## EqualWidthDiscretiser\n", "\n", "The EqualWidthDiscretiser() divides continuous numerical variables into\n", "intervals of the same width, that is, equidistant intervals. Note that the\n", "proportion of observations per interval may vary.\n", "\n", "The number of intervals\n", "into which the variable is divided must be indicated by the user.\n", "\n", "Note: Check out the EqualWidthDiscretiser notebook to learn more about this transformer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## OrdinalEncoder\n", "\n", "The OrdinalEncoder() replaces the variable labels with integers, from 0 to n-1, where n is the number of different labels.\n", "\n", "If we select \"arbitrary\", the encoder assigns numbers in the order in which the labels appear in the variable (first come, first served).\n", "\n", "If we select \"ordered\", the encoder assigns numbers in ascending order of the mean of the target per label. So the label with the smallest target mean gets the number 0, and the label with the highest target mean gets the number n-1.\n", "\n", "Note: Check out the OrdinalEncoder notebook to learn more about this transformer."
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.pipeline import Pipeline\n", "\n", "from feature_engine.discretisation import EqualWidthDiscretiser\n", "from feature_engine.encoding import OrdinalEncoder\n", "\n", "plt.rcParams[\"figure.figsize\"] = [15, 5]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Load titanic dataset from OpenML\n", "\n", "def load_titanic():\n", "    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')\n", "    data = data.replace('?', np.nan)\n", "    data['cabin'] = data['cabin'].astype(str).str[0]\n", "    data['pclass'] = data['pclass'].astype('O')\n", "    # cast to float first so the median is computed on numeric values\n", "    data['age'] = data['age'].astype('float')\n", "    data['age'] = data['age'].fillna(data['age'].median())\n", "    data['fare'] = data['fare'].astype('float')\n", "    data['fare'] = data['fare'].fillna(data['fare'].median())\n", "    # assign the result back instead of filling in place on a column selection\n", "    data['embarked'] = data['embarked'].fillna('C')\n", "    data = data.drop(columns=['boat', 'body', 'home.dest', 'name', 'ticket'])\n", "    return data" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr><th></th><th>pclass</th><th>survived</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>1</td><td>female</td><td>29.0000</td><td>0</td><td>0</td><td>211.3375</td><td>B</td><td>S</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>1</td><td>male</td><td>0.9167</td><td>1</td><td>2</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>0</td><td>female</td><td>2.0000</td><td>1</td><td>2</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
"    <tr><th>3</th><td>1</td><td>0</td><td>male</td><td>30.0000</td><td>1</td><td>2</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
"    <tr><th>4</th><td>1</td><td>0</td><td>female</td><td>25.0000</td><td>1</td><td>2</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " pclass survived sex age sibsp parch fare cabin embarked\n", "0 1 1 female 29.0000 0 0 211.3375 B S\n", "1 1 1 male 0.9167 1 2 151.5500 C S\n", "2 1 0 female 2.0000 1 2 151.5500 C S\n", "3 1 0 male 30.0000 1 2 151.5500 C S\n", "4 1 0 female 25.0000 1 2 151.5500 C S" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = load_titanic()\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X_train : (916, 8)\n", "X_test : (393, 8)\n" ] } ], "source": [ "# let's separate into training and testing set\n", "X = data.drop(['survived'], axis=1)\n", "y = data.survived\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.3, random_state=0)\n", "\n", "print(\"X_train :\", X_train.shape)\n", "print(\"X_test :\", X_test.shape)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# we will use two continuous variables for the transformations\n", "\n", "X_train[[\"age\", 'fare']].hist(bins=30)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pipeline(memory=None,\n", " steps=[('EqualWidthDiscretiser',\n", " EqualWidthDiscretiser(bins=5, return_boundaries=False,\n", " return_object=True,\n", " variables=['age', 'fare'])),\n", " ('OrdinalEncoder',\n", " OrdinalEncoder(encoding_method='ordered',\n", " variables=['age', 'fare']))],\n", " verbose=False)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# set up the discretiser\n", "ewd = EqualWidthDiscretiser(\n", " bins=5,\n", " variables=['age', 'fare'],\n", " # important: return values as categorical\n", " return_object=True)\n", "\n", "# set up the encoder\n", "oe = OrdinalEncoder(variables=['age', 'fare'])\n", "\n", "# pipeline\n", "transformer = Pipeline(steps=[('EqualWidthDiscretiser', ewd),\n", " ('OrdinalEncoder', oe),\n", " ])\n", "\n", "transformer.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'age': [-inf,\n", " 14.933359999999999,\n", " 29.700019999999995,\n", " 44.46667999999999,\n", " 59.23333999999999,\n", " inf],\n", " 'fare': [-inf, 102.46584, 204.93168, 307.39752, 409.86336, inf]}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transformer.named_steps['EqualWidthDiscretiser'].binner_dict_" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'age': {4: 0, 1: 1, 2: 2, 3: 3, 0: 4}, 'fare': {0: 0, 2: 1, 1: 2, 4: 3}}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transformer.named_steps['OrdinalEncoder'].encoder_dict_" ] }, { 
"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr><th></th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>1139</th><td>3</td><td>male</td><td>2</td><td>0</td><td>0</td><td>0</td><td>n</td><td>S</td></tr>\n",
"    <tr><th>533</th><td>2</td><td>female</td><td>1</td><td>0</td><td>1</td><td>0</td><td>n</td><td>S</td></tr>\n",
"    <tr><th>459</th><td>2</td><td>male</td><td>2</td><td>1</td><td>0</td><td>0</td><td>n</td><td>S</td></tr>\n",
"    <tr><th>1150</th><td>3</td><td>male</td><td>1</td><td>0</td><td>0</td><td>0</td><td>n</td><td>S</td></tr>\n",
"    <tr><th>393</th><td>2</td><td>male</td><td>1</td><td>0</td><td>0</td><td>0</td><td>n</td><td>S</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " pclass sex age sibsp parch fare cabin embarked\n", "1139 3 male 2 0 0 0 n S\n", "533 2 female 1 0 1 0 n S\n", "459 2 male 2 1 0 0 n S\n", "1150 3 male 1 0 0 0 n S\n", "393 2 male 1 0 0 0 n S" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_t = transformer.transform(X_train)\n", "test_t = transformer.transform(X_test)\n", "\n", "test_t.head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# let's explore the monotonic relationship\n", "plt.figure(figsize=(7,5))\n", "pd.concat([test_t,y_test], axis=1).groupby(\"fare\")[\"survived\"].mean().plot()\n", "plt.title(\"Relationship between fare and target\")\n", "plt.xlabel(\"fare\")\n", "plt.ylabel(\"Mean of target\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how the bins are monotonically ordered with the target." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "fengine", "language": "python", "name": "fengine" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }