{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ArbitraryDiscretiser + MeanEncoder\n", "\n", "Combining discretisation with mean encoding is useful for linear models. Sorting a continuous variable into intervals and replacing each interval with the mean of the target yields a variable that is monotonically related to the target on the training data, even when the original variable was not. This tends to improve the performance of linear models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ArbitraryDiscretiser\n", "\n", "The ArbitraryDiscretiser() divides continuous numerical variables into contiguous intervals whose limits are defined by the user.\n", "\n", "The user passes a dictionary with variable names as keys and lists of interval limits as values, for example {'var1': [0, 10, 100, 1000], 'var2': [5, 10, 15, 20]}.\n", "\n", "Note: Check out the ArbitraryDiscretiser notebook to learn more about this transformer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## MeanEncoder\n", "\n", "The MeanEncoder() replaces each label of a variable with the mean value of the target for that label.
For example, in the variable colour, if the mean value of the binary target is 0.5 for the label blue, then blue is replaced by 0.5.\n", "\n", "Note: See the MeanEncoder notebook to learn more about this transformer." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.pipeline import Pipeline\n", "\n", "from feature_engine.discretisation import ArbitraryDiscretiser\n", "from feature_engine.encoding import MeanEncoder\n", "\n", "plt.rcParams[\"figure.figsize\"] = [15, 5]" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Load titanic dataset from OpenML\n", "\n", "def load_titanic():\n", "    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')\n", "    # missing values are coded as '?' in the raw file\n", "    data = data.replace('?', np.nan)\n", "    # keep only the first letter of the cabin ('n' when missing)\n", "    data['cabin'] = data['cabin'].astype(str).str[0]\n", "    data['pclass'] = data['pclass'].astype('O')\n", "    # cast to float first, then fill missing values with the median\n", "    data['age'] = data['age'].astype('float')\n", "    data['age'] = data['age'].fillna(data['age'].median())\n", "    data['fare'] = data['fare'].astype('float')\n", "    data['fare'] = data['fare'].fillna(data['fare'].median())\n", "    # assign the result instead of calling fillna in place on a column\n", "    data['embarked'] = data['embarked'].fillna('C')\n", "    data.drop(labels=['boat', 'body', 'home.dest', 'name', 'ticket'], axis=1, inplace=True)\n", "    return data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>pclass</th><th>survived</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>1</td><td>female</td><td>29.0000</td><td>0</td><td>0</td><td>211.3375</td><td>B</td><td>S</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>1</td><td>male</td><td>0.9167</td><td>1</td><td>2</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>0</td><td>female</td><td>2.0000</td><td>1</td><td>2</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
"    <tr><th>3</th><td>1</td><td>0</td><td>male</td><td>30.0000</td><td>1</td><td>2</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
"    <tr><th>4</th><td>1</td><td>0</td><td>female</td><td>25.0000</td><td>1</td><td>2</td><td>151.5500</td><td>C</td><td>S</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " pclass survived sex age sibsp parch fare cabin embarked\n", "0 1 1 female 29.0000 0 0 211.3375 B S\n", "1 1 1 male 0.9167 1 2 151.5500 C S\n", "2 1 0 female 2.0000 1 2 151.5500 C S\n", "3 1 0 male 30.0000 1 2 151.5500 C S\n", "4 1 0 female 25.0000 1 2 151.5500 C S" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = load_titanic()\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X_train : (916, 8)\n", "X_test : (393, 8)\n" ] } ], "source": [ "# let's separate into training and testing set\n", "X = data.drop(['survived'], axis=1)\n", "y = data.survived\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.3, random_state=0)\n", "\n", "print(\"X_train :\", X_train.shape)\n", "print(\"X_test :\", X_test.shape)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# we will transform two continuous variables\n", "X_train[[\"age\", 'fare']].hist(bins=30)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pipeline(memory=None,\n", " steps=[('ArbitraryDiscretiser',\n", " ArbitraryDiscretiser(binning_dict={'age': [0, 18, 30, 50, 100],\n", " 'fare': [-1, 20, 40, 60, 80,\n", " 600]},\n", " return_boundaries=False,\n", " return_object=True)),\n", " ('MeanEncoder', MeanEncoder(variables=['age', 'fare']))],\n", " verbose=False)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# set up the discretiser\n", "arb_disc = ArbitraryDiscretiser(\n", " binning_dict={'age': [0, 18, 30, 50, 100],\n", " 'fare': [-1, 20, 40, 60, 80, 600]},\n", " # returns values as categorical\n", " return_object=True)\n", "\n", "# set up the mean encoder\n", "mean_enc = MeanEncoder(variables=['age', 'fare'])\n", "\n", "# set up the pipeline\n", "transformer = Pipeline(steps=[('ArbitraryDiscretiser', arb_disc),\n", " ('MeanEncoder', mean_enc),\n", " ])\n", "# train the pipeline\n", "transformer.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'age': [0, 18, 30, 50, 100], 'fare': [-1, 20, 40, 60, 80, 600]}" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transformer.named_steps['ArbitraryDiscretiser'].binner_dict_" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'age': {0: 0.45081967213114754,\n", " 1: 0.34309623430962344,\n", " 2: 0.4262948207171315,\n", " 3: 0.4153846153846154},\n", " 'fare': {0: 0.288135593220339,\n", " 1: 0.43283582089552236,\n", " 2: 0.5636363636363636,\n", " 3: 0.45652173913043476,\n", " 4: 0.7349397590361446}}" ] }, "execution_count": 8, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "transformer.named_steps['MeanEncoder'].encoder_dict_" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1139</th><td>3</td><td>male</td><td>0.426295</td><td>0</td><td>0</td><td>0.288136</td><td>n</td><td>S</td></tr>\n",
"    <tr><th>533</th><td>2</td><td>female</td><td>0.343096</td><td>0</td><td>1</td><td>0.432836</td><td>n</td><td>S</td></tr>\n",
"    <tr><th>459</th><td>2</td><td>male</td><td>0.426295</td><td>1</td><td>0</td><td>0.432836</td><td>n</td><td>S</td></tr>\n",
"    <tr><th>1150</th><td>3</td><td>male</td><td>0.343096</td><td>0</td><td>0</td><td>0.288136</td><td>n</td><td>S</td></tr>\n",
"    <tr><th>393</th><td>2</td><td>male</td><td>0.343096</td><td>0</td><td>0</td><td>0.432836</td><td>n</td><td>S</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " pclass sex age sibsp parch fare cabin embarked\n", "1139 3 male 0.426295 0 0 0.288136 n S\n", "533 2 female 0.343096 0 1 0.432836 n S\n", "459 2 male 0.426295 1 0 0.432836 n S\n", "1150 3 male 0.343096 0 0 0.288136 n S\n", "393 2 male 0.343096 0 0 0.432836 n S" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_t = transformer.transform(X_train)\n", "test_t = transformer.transform(X_test)\n", "\n", "test_t.head()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# let's explore the monotonic relationship\n", "plt.figure(figsize=(7, 5))\n", "pd.concat([test_t, y_test], axis=1).groupby(\"fare\")[\"survived\"].mean().plot()\n", "plt.title(\"Relationship between fare and target\")\n", "plt.xlabel(\"fare\")\n", "plt.ylabel(\"Mean of target\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can observe an almost linear relationship between the variable \"fare\" after the transformation and the target." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "fengine", "language": "python", "name": "fengine" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }