{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ArbitraryDiscretiser\n", "\n", "The ArbitraryDiscretiser() divides continuous numerical variables into contiguous intervals whose limits are arbitrarily defined by the user.\n", "\n", "The user needs to enter a dictionary with variable names as keys, and a list of the interval limits as values. For example: {'var1': [0, 10, 100, 1000], 'var2': [5, 10, 15, 20]}.\n", "\n", "**Note**\n", "\n", "For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock:\n", "\n", "Dean De Cock (2011) Ames, Iowa: Alternative to the Boston Housing\n", "Data as an End of Semester Regression Project, Journal of Statistics Education, Vol.19, No. 3\n", "\n", "http://jse.amstat.org/v19n3/decock.pdf\n", "\n", "https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627\n", "\n", "The version of the dataset used in this notebook can be obtained from [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.model_selection import train_test_split\n", "\n", "from feature_engine.discretisation import ArbitraryDiscretiser\n", "plt.rcParams[\"figure.figsize\"] = [15, 5]" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n", "0 1 60 RL 65.0 8450 Pave NaN Reg \n", "1 2 20 RL 80.0 9600 Pave NaN Reg \n", "2 3 60 RL 68.0 11250 Pave NaN IR1 \n", "3 4 70 RL 60.0 9550 Pave NaN IR1 \n", "4 5 60 RL 84.0 14260 Pave NaN IR1 \n", "\n", " LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold \\\n", "0 Lvl AllPub ... 0 NaN NaN NaN 0 2 \n", "1 Lvl AllPub ... 0 NaN NaN NaN 0 5 \n", "2 Lvl AllPub ... 0 NaN NaN NaN 0 9 \n", "3 Lvl AllPub ... 0 NaN NaN NaN 0 2 \n", "4 Lvl AllPub ... 0 NaN NaN NaN 0 12 \n", "\n", " YrSold SaleType SaleCondition SalePrice \n", "0 2008 WD Normal 208500 \n", "1 2007 WD Normal 181500 \n", "2 2008 WD Normal 223500 \n", "3 2006 WD Abnorml 140000 \n", "4 2008 WD Normal 250000 \n", "\n", "[5 rows x 81 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv('housing.csv')\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X_train : (1022, 79)\n", "X_test : (438, 79)\n" ] } ], "source": [ "# let's separate the data into training and testing sets\n", "X = data.drop([\"Id\", \"SalePrice\"], axis=1)\n", "y = data.SalePrice\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.3, random_state=0)\n", "\n", "print(\"X_train :\", X_train.shape)\n", "print(\"X_test :\", X_test.shape)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# we will discretise two continuous variables\n", "\n", "X_train[[\"LotArea\", \"GrLivArea\"]].hist(bins=50)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ArbitraryDiscretiser() works only with numerical variables. The discretiser first\n", "checks that the variables in the dictionary entered by the user are present in the\n", "training set, and that these variables are cast as numerical, before doing any\n", "transformation.\n", "\n", "Then, with transform(), it sorts the values of each variable into the intervals." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ArbitraryDiscretiser(binning_dict={'GrLivArea': [-inf, 500, 1000, 1500, 2000,\n", " 2500, inf],\n", " 'LotArea': [-inf, 4000, 8000, 12000, 16000,\n", " 20000, inf]},\n", " return_boundaries=False, return_object=False)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'''\n", "Parameters\n", "----------\n", "\n", "binning_dict : dict\n", " The dictionary with the variable: interval limits pairs, provided by the user.\n", " A valid dictionary looks like this:\n", "\n", " binning_dict = {'var1':[0, 10, 100, 1000], 'var2':[5, 10, 15, 20]}.\n", "\n", "return_object : bool, default=False\n", " Whether the numbers in the discrete variable should be returned as\n", " numeric or as object. The decision is made by the user based on\n", " whether they would like to proceed with the engineering of the variable as\n", " if it were numerical or categorical.\n", "\n", "return_boundaries: bool, default=False\n", " Whether the output should be the interval boundaries. If True, it returns\n", " the interval boundaries. If False, it returns integers.\n", "'''\n", "\n", "atd = ArbitraryDiscretiser(binning_dict={\"LotArea\": [-np.inf, 4000, 8000, 12000, 16000, 20000, np.inf],\n", " \"GrLivArea\": [-np.inf, 500, 1000, 1500, 2000, 2500, np.inf]})\n", "\n", "atd.fit(X_train)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'LotArea': [-inf, 4000, 8000, 12000, 16000, 20000, inf],\n", " 'GrLivArea': [-inf, 500, 1000, 1500, 2000, 2500, inf]}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# binner_dict_ contains the boundaries of the bins for each variable\n", "atd.binner_dict_" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "train_t = atd.transform(X_train)\n", "test_t = atd.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[4 2 1 0 3 5]\n", "[2 0 1 3 5 4]\n" ] } ], "source": [ "# the numbers below are the bins into which the observations were sorted\n", "print(train_t['GrLivArea'].unique())\n", "print(train_t['LotArea'].unique())" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " LotArea GrLivArea LotArea_binned GrLivArea_binned\n", "64 9375 2034 2 4\n", "682 2887 1291 0 2\n", "960 7207 858 1 1\n", "1384 9060 1258 2 2\n", "1100 8400 438 2 0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# place the original and the transformed variables side by side\n", "tmp = pd.concat([X_train[[\"LotArea\", \"GrLivArea\"]], train_t[[\"LotArea\", \"GrLivArea\"]]], axis=1)\n", "tmp.columns = [\"LotArea\", \"GrLivArea\", \"LotArea_binned\", \"GrLivArea_binned\"]\n", "tmp.head()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# number of observations per bin, for each discretised variable\n", "plt.subplot(1,2,1)\n", "tmp.groupby('GrLivArea_binned')['GrLivArea'].count().plot.bar()\n", "plt.ylabel('Number of houses')\n", "plt.title('Number of observations per bin')\n", "plt.subplot(1,2,2)\n", "tmp.groupby('LotArea_binned')['LotArea'].count().plot.bar()\n", "plt.ylabel('Number of houses')\n", "plt.title('Number of observations per bin')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Now return interval boundaries instead" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ArbitraryDiscretiser(binning_dict={'GrLivArea': [-inf, 500, 1000, 1500, 2000,\n", " 2500, inf],\n", " 'LotArea': [-inf, 4000, 8000, 12000, 16000,\n", " 20000, inf]},\n", " return_boundaries=True, return_object=False)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "atd = ArbitraryDiscretiser(binning_dict={\"LotArea\": [-np.inf, 4000, 8000, 12000, 16000, 20000, np.inf],\n", " \"GrLivArea\": [-np.inf, 500, 1000, 1500, 2000, 2500, np.inf]},\n", " # to return the boundary limits\n", " return_boundaries=True)\n", "\n", "atd.fit(X_train)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "train_t = atd.transform(X_train)\n", "test_t = atd.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([Interval(-inf, 500.0, closed='right'),\n", " Interval(500.0, 1000.0, closed='right'),\n", " Interval(1000.0, 1500.0, closed='right'),\n", " Interval(1500.0, 2000.0, closed='right'),\n", " Interval(2000.0, 2500.0, closed='right'),\n", " Interval(2500.0, inf, closed='right')], dtype=object)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the intervals below are the bins into which the observations\n", "# were sorted\n", "np.sort(np.ravel(train_t['GrLivArea'].unique()))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([Interval(500.0, 1000.0, closed='right'),\n", " Interval(1000.0, 1500.0, closed='right'),\n", " Interval(1500.0, 2000.0, closed='right'),\n", " Interval(2000.0, 2500.0, closed='right'),\n", " Interval(2500.0, inf, closed='right')], dtype=object)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# intervals found in the test set (no observation falls in the first interval)\n", "np.sort(np.ravel(test_t['GrLivArea'].unique()))" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# bar plot to show the intervals returned by the transformer\n", "test_t.LotArea.value_counts(sort=False).plot.bar(figsize=(6,4))\n", "plt.ylabel('Number of houses')\n", "plt.title('Number of houses per interval')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "fengine", "language": "python", "name": "fengine" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }