{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# DecisionTreeDiscretiser\n", "\n", "The DecisionTreeDiscretiser() divides continuous numerical variables into a finite set of discrete values estimated by a decision tree.\n", "\n", "The method is inspired by the following article from the winners of the KDD 2009 competition:\n", "http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf\n", "\n", "**Note**\n", "\n", "For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock:\n", "\n", "Dean De Cock (2011) Ames, Iowa: Alternative to the Boston Housing\n", "Data as an End of Semester Regression Project, Journal of Statistics Education, Vol.19, No. 3\n", "\n", "http://jse.amstat.org/v19n3/decock.pdf\n", "\n", "https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627\n", "\n", "The version of the dataset used in this notebook can be obtained from [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.model_selection import train_test_split\n", "\n", "from feature_engine.discretisation import DecisionTreeDiscretiser\n", "\n", "plt.rcParams[\"figure.figsize\"] = [15,5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DecisionTreeDiscretiser with Regression" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>Id</th><th>MSSubClass</th><th>MSZoning</th><th>LotFrontage</th><th>LotArea</th><th>Street</th><th>Alley</th><th>LotShape</th><th>LandContour</th><th>Utilities</th><th>...</th><th>PoolArea</th><th>PoolQC</th><th>Fence</th><th>MiscFeature</th><th>MiscVal</th><th>MoSold</th><th>YrSold</th><th>SaleType</th><th>SaleCondition</th><th>SalePrice</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>1</td><td>60</td><td>RL</td><td>65.0</td><td>8450</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>2</td><td>2008</td><td>WD</td><td>Normal</td><td>208500</td></tr>\n",
" <tr><th>1</th><td>2</td><td>20</td><td>RL</td><td>80.0</td><td>9600</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>5</td><td>2007</td><td>WD</td><td>Normal</td><td>181500</td></tr>\n",
" <tr><th>2</th><td>3</td><td>60</td><td>RL</td><td>68.0</td><td>11250</td><td>Pave</td><td>NaN</td><td>IR1</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>9</td><td>2008</td><td>WD</td><td>Normal</td><td>223500</td></tr>\n",
" <tr><th>3</th><td>4</td><td>70</td><td>RL</td><td>60.0</td><td>9550</td><td>Pave</td><td>NaN</td><td>IR1</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>2</td><td>2006</td><td>WD</td><td>Abnorml</td><td>140000</td></tr>\n",
" <tr><th>4</th><td>5</td><td>60</td><td>RL</td><td>84.0</td><td>14260</td><td>Pave</td><td>NaN</td><td>IR1</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>12</td><td>2008</td><td>WD</td><td>Normal</td><td>250000</td></tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 81 columns</p>\n",
"</div>
" ], "text/plain": [ " Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n", "0 1 60 RL 65.0 8450 Pave NaN Reg \n", "1 2 20 RL 80.0 9600 Pave NaN Reg \n", "2 3 60 RL 68.0 11250 Pave NaN IR1 \n", "3 4 70 RL 60.0 9550 Pave NaN IR1 \n", "4 5 60 RL 84.0 14260 Pave NaN IR1 \n", "\n", " LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold \\\n", "0 Lvl AllPub ... 0 NaN NaN NaN 0 2 \n", "1 Lvl AllPub ... 0 NaN NaN NaN 0 5 \n", "2 Lvl AllPub ... 0 NaN NaN NaN 0 9 \n", "3 Lvl AllPub ... 0 NaN NaN NaN 0 2 \n", "4 Lvl AllPub ... 0 NaN NaN NaN 0 12 \n", "\n", " YrSold SaleType SaleCondition SalePrice \n", "0 2008 WD Normal 208500 \n", "1 2007 WD Normal 181500 \n", "2 2008 WD Normal 223500 \n", "3 2006 WD Abnorml 140000 \n", "4 2008 WD Normal 250000 \n", "\n", "[5 rows x 81 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv('housing.csv')\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X_train : (1022, 79)\n", "X_test : (438, 79)\n" ] } ], "source": [ "# let's separate into training and testing set\n", "X = data.drop([\"Id\", \"SalePrice\"], axis=1)\n", "y = data.SalePrice\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.3, random_state=0)\n", "\n", "print(\"X_train :\", X_train.shape)\n", "print(\"X_test :\", X_test.shape)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# we will discretise two continuous variables\n", "\n", "X_train[[\"LotArea\", 'GrLivArea']].hist(bins=50)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The DecisionTreeDiscretiser() works only with numerical variables.\n", "A list of variables can be passed as an argument. Alternatively, the\n", "discretiser will automatically select and transform all numerical variables.\n", "\n", "During fit(), the DecisionTreeDiscretiser() trains a decision tree for each\n", "variable, using the target.\n", "\n", "During transform(), it replaces the values of each variable with the\n", "predictions of the corresponding trained decision tree." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DecisionTreeDiscretiser(cv=3, param_grid={'max_depth': [1, 2, 3, 4]},\n", " random_state=29, regression=True,\n", " scoring='neg_mean_squared_error',\n", " variables=['LotArea', 'GrLivArea'])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'''\n", "Parameters\n", "----------\n", "\n", "cv : int, default=3\n", " Desired number of cross-validation folds to be used to fit the decision\n", " tree.\n", "\n", "scoring: str, default='neg_mean_squared_error'\n", " Desired metric to optimise the performance of the tree. Comes from\n", " sklearn metrics. See DecisionTreeRegressor or DecisionTreeClassifier\n", " model evaluation documentation for more options:\n", " https://scikit-learn.org/stable/modules/model_evaluation.html\n", "\n", "variables : list\n", " The list of numerical variables that will be transformed. 
If None, the\n", " discretiser will automatically select all numerical type variables.\n", "\n", "regression : boolean, default=True\n", " Indicates whether the discretiser should train a regression or a classification\n", " decision tree.\n", "\n", "param_grid : dictionary, default=None\n", " The list of parameters over which the decision tree should be optimised\n", " during the grid search. The param_grid can contain any of the permitted\n", " parameters for Scikit-learn's DecisionTreeRegressor() or\n", " DecisionTreeClassifier().\n", "\n", " If None, then param_grid = {'max_depth': [1, 2, 3, 4]}\n", "\n", "random_state : int, default=None\n", " The random_state to initialise the training of the decision tree. It is one\n", " of the parameters of the Scikit-learn's DecisionTreeRegressor() or\n", " DecisionTreeClassifier(). For reproducibility it is recommended to set\n", " the random_state to an integer.\n", "'''\n", "\n", "treeDisc = DecisionTreeDiscretiser(cv=3,\n", " scoring='neg_mean_squared_error',\n", " variables=['LotArea', 'GrLivArea'],\n", " regression=True,\n", " random_state=29)\n", "\n", "# the DecisionTreeDiscretiser needs the target for fitting\n", "treeDisc.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'LotArea': GridSearchCV(cv=3, error_score=nan,\n", " estimator=DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse',\n", " max_depth=None, max_features=None,\n", " max_leaf_nodes=None,\n", " min_impurity_decrease=0.0,\n", " min_impurity_split=None,\n", " min_samples_leaf=1,\n", " min_samples_split=2,\n", " min_weight_fraction_leaf=0.0,\n", " presort='deprecated',\n", " random_state=29, splitter='best'),\n", " iid='deprecated', n_jobs=None,\n", " param_grid={'max_depth': [1, 2, 3, 4]}, pre_dispatch='2*n_jobs',\n", " refit=True, return_train_score=False,\n", " scoring='neg_mean_squared_error', verbose=0),\n", " 'GrLivArea': GridSearchCV(cv=3, error_score=nan,\n", " 
estimator=DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse',\n", " max_depth=None, max_features=None,\n", " max_leaf_nodes=None,\n", " min_impurity_decrease=0.0,\n", " min_impurity_split=None,\n", " min_samples_leaf=1,\n", " min_samples_split=2,\n", " min_weight_fraction_leaf=0.0,\n", " presort='deprecated',\n", " random_state=29, splitter='best'),\n", " iid='deprecated', n_jobs=None,\n", " param_grid={'max_depth': [1, 2, 3, 4]}, pre_dispatch='2*n_jobs',\n", " refit=True, return_train_score=False,\n", " scoring='neg_mean_squared_error', verbose=0)}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the binner_dict_ contains the best decision tree for each variable\n", "treeDisc.binner_dict_" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "train_t = treeDisc.transform(X_train)\n", "test_t = treeDisc.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([246372.77165354, 149540.32663317, 122286.38839286, 88631.59375 ,\n", " 165174.20895522, 198837.68608414, 312260.5 , 509937.5 ])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the unique values below are the bins found, i.e., the tree predictions\n", "\n", "train_t['GrLivArea'].unique()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([181711.59622642, 145405.30751708, 213802.86363636, 251997.13333333])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the unique values below are the bins found, i.e., the tree predictions\n", "\n", "train_t['LotArea'].unique()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>LotArea</th><th>GrLivArea</th><th>LotArea_binned</th><th>GrLivArea_binned</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>64</th><td>9375</td><td>2034</td><td>181711.596226</td><td>246372.771654</td></tr>\n",
" <tr><th>682</th><td>2887</td><td>1291</td><td>145405.307517</td><td>149540.326633</td></tr>\n",
" <tr><th>960</th><td>7207</td><td>858</td><td>145405.307517</td><td>122286.388393</td></tr>\n",
" <tr><th>1384</th><td>9060</td><td>1258</td><td>181711.596226</td><td>149540.326633</td></tr>\n",
" <tr><th>1100</th><td>8400</td><td>438</td><td>145405.307517</td><td>88631.593750</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " LotArea GrLivArea LotArea_binned GrLivArea_binned\n", "64 9375 2034 181711.596226 246372.771654\n", "682 2887 1291 145405.307517 149540.326633\n", "960 7207 858 145405.307517 122286.388393\n", "1384 9060 1258 181711.596226 149540.326633\n", "1100 8400 438 145405.307517 88631.593750" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# here I put side by side the original variable and the transformed variable\n", "\n", "tmp = pd.concat([X_train[[\"LotArea\", 'GrLivArea']],\n", " train_t[[\"LotArea\", 'GrLivArea']]], axis=1)\n", "\n", "tmp.columns = [\"LotArea\", 'GrLivArea', \"LotArea_binned\", 'GrLivArea_binned']\n", "\n", "tmp.head()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# unlike equal-frequency discretisation, the tree-derived bins do not need to\n", "# contain the same number of observations.\n", "\n", "plt.subplot(1,2,1)\n", "tmp.groupby('GrLivArea_binned')['GrLivArea'].count().plot.bar()\n", "plt.ylabel('Number of houses')\n", "plt.title('Number of houses per discrete value')\n", "\n", "plt.subplot(1,2,2)\n", "tmp.groupby('LotArea_binned')['LotArea'].count().plot.bar()\n", "plt.ylabel('Number of houses')\n", "plt.title('Number of houses per discrete value')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DecisionTreeDiscretiser with binary classification" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Load titanic dataset from OpenML\n", "\n", "def load_titanic():\n", " data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')\n", " data = data.replace('?', np.nan)\n", " data['cabin'] = data['cabin'].astype(str).str[0]\n", " data['pclass'] = data['pclass'].astype('O')\n", " data['embarked'] = data['embarked'].fillna('C')\n", " data['fare'] = data['fare'].astype('float').fillna(0)\n", " data['age'] = data['age'].astype('float').fillna(0)\n", " data.drop(['name', 'ticket', 'boat', 'home.dest'], axis=1, inplace=True)\n", " return data" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>pclass</th><th>survived</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th><th>body</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>1</td><td>1</td><td>female</td><td>29.0000</td><td>0</td><td>0</td><td>211.3375</td><td>B</td><td>S</td><td>NaN</td></tr>\n",
" <tr><th>1</th><td>1</td><td>1</td><td>male</td><td>0.9167</td><td>1</td><td>2</td><td>151.5500</td><td>C</td><td>S</td><td>NaN</td></tr>\n",
" <tr><th>2</th><td>1</td><td>0</td><td>female</td><td>2.0000</td><td>1</td><td>2</td><td>151.5500</td><td>C</td><td>S</td><td>NaN</td></tr>\n",
" <tr><th>3</th><td>1</td><td>0</td><td>male</td><td>30.0000</td><td>1</td><td>2</td><td>151.5500</td><td>C</td><td>S</td><td>135</td></tr>\n",
" <tr><th>4</th><td>1</td><td>0</td><td>female</td><td>25.0000</td><td>1</td><td>2</td><td>151.5500</td><td>C</td><td>S</td><td>NaN</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " pclass survived sex age sibsp parch fare cabin embarked \\\n", "0 1 1 female 29.0000 0 0 211.3375 B S \n", "1 1 1 male 0.9167 1 2 151.5500 C S \n", "2 1 0 female 2.0000 1 2 151.5500 C S \n", "3 1 0 male 30.0000 1 2 151.5500 C S \n", "4 1 0 female 25.0000 1 2 151.5500 C S \n", "\n", " body \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 135 \n", "4 NaN " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# load data\n", "data = load_titanic()\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(916, 9)\n", "(393, 9)\n" ] } ], "source": [ "# let's separate into training and testing set\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(data.drop(['survived'], axis=1),\n", " data['survived'],\n", " test_size=0.3, \n", " random_state=0)\n", "\n", "print(X_train.shape)\n", "print(X_test.shape)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "fare float64\n", "age float64\n", "dtype: object" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#this discretiser transforms the numerical variables\n", "X_train[['fare', 'age']].dtypes" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DecisionTreeDiscretiser(cv=3, param_grid={'max_depth': [1, 2]}, random_state=29,\n", " regression=False, scoring='roc_auc',\n", " variables=['fare', 'age'])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "treeDisc = DecisionTreeDiscretiser(cv=3,\n", " scoring='roc_auc',\n", " variables=['fare', 'age'],\n", " regression=False,\n", " param_grid={'max_depth': [1, 2]},\n", " random_state=29,\n", " )\n", "\n", "treeDisc.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ 
{ "data": { "text/plain": [ "{'fare': GridSearchCV(cv=3, error_score=nan,\n", " estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,\n", " criterion='gini', max_depth=None,\n", " max_features=None,\n", " max_leaf_nodes=None,\n", " min_impurity_decrease=0.0,\n", " min_impurity_split=None,\n", " min_samples_leaf=1,\n", " min_samples_split=2,\n", " min_weight_fraction_leaf=0.0,\n", " presort='deprecated',\n", " random_state=29,\n", " splitter='best'),\n", " iid='deprecated', n_jobs=None, param_grid={'max_depth': [1, 2]},\n", " pre_dispatch='2*n_jobs', refit=True, return_train_score=False,\n", " scoring='roc_auc', verbose=0),\n", " 'age': GridSearchCV(cv=3, error_score=nan,\n", " estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,\n", " criterion='gini', max_depth=None,\n", " max_features=None,\n", " max_leaf_nodes=None,\n", " min_impurity_decrease=0.0,\n", " min_impurity_split=None,\n", " min_samples_leaf=1,\n", " min_samples_split=2,\n", " min_weight_fraction_leaf=0.0,\n", " presort='deprecated',\n", " random_state=29,\n", " splitter='best'),\n", " iid='deprecated', n_jobs=None, param_grid={'max_depth': [1, 2]},\n", " pre_dispatch='2*n_jobs', refit=True, return_train_score=False,\n", " scoring='roc_auc', verbose=0)}" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "treeDisc.binner_dict_" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "train_t = treeDisc.transform(X_train)\n", "test_t = treeDisc.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.41295547, 0.26857143])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the unique values below are the bins found, i.e., the tree predictions\n", "# in this case, the tree has found that dividing the data in 2 bins is enough\n", "train_t['age'].unique()" ] }, { "cell_type": "code", "execution_count": 
20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.42379182, 0.26778243, 0.52307692, 0.74038462])" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the unique values below are the bins found, i.e., the tree predictions\n", "# in this case, the tree has found that dividing the data in 4 bins is enough\n", "train_t['fare'].unique()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>fare</th><th>age</th><th>fare_binned</th><th>age_binned</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>501</th><td>19.5000</td><td>13.0</td><td>0.423792</td><td>0.412955</td></tr>\n",
" <tr><th>588</th><td>23.0000</td><td>4.0</td><td>0.423792</td><td>0.412955</td></tr>\n",
" <tr><th>402</th><td>13.8583</td><td>30.0</td><td>0.267782</td><td>0.412955</td></tr>\n",
" <tr><th>1193</th><td>7.7250</td><td>0.0</td><td>0.267782</td><td>0.268571</td></tr>\n",
" <tr><th>686</th><td>7.7250</td><td>22.0</td><td>0.267782</td><td>0.412955</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " fare age fare_binned age_binned\n", "501 19.5000 13.0 0.423792 0.412955\n", "588 23.0000 4.0 0.423792 0.412955\n", "402 13.8583 30.0 0.267782 0.412955\n", "1193 7.7250 0.0 0.267782 0.268571\n", "686 7.7250 22.0 0.267782 0.412955" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# here I put side by side the original variable and the transformed variable\n", "\n", "tmp = pd.concat([X_train[[\"fare\", 'age']], train_t[[\"fare\", 'age']]], axis=1)\n", "\n", "tmp.columns = [\"fare\", 'age', \"fare_binned\", 'age_binned']\n", "\n", "tmp.head()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.subplot(1,2,1)\n", "tmp.groupby('fare_binned')['fare'].count().plot.bar()\n", "plt.ylabel('Number of passengers')\n", "plt.title('Number of passengers per discrete value')\n", "\n", "plt.subplot(1,2,2)\n", "tmp.groupby('age_binned')['age'].count().plot.bar()\n", "plt.ylabel('Number of passengers')\n", "plt.title('Number of passengers per discrete value')\n", "\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# The DecisionTreeDiscretiser() returns values which show\n", "# a monotonic relationship with target\n", "\n", "pd.concat([test_t, y_test], axis=1).groupby(\n", " 'age')['survived'].mean().plot(figsize=(6, 4))\n", "\n", "plt.ylabel(\"Mean of target\")\n", "plt.title(\"Relationship between age and target\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# The DecisionTreeDiscretiser() returns values which show\n", "# a monotonic relationship with target\n", "\n", "pd.concat([test_t, y_test], axis=1).groupby(\n", " 'fare')['survived'].mean().plot(figsize=(6, 4))\n", "\n", "plt.ylabel(\"Mean of target\")\n", "plt.title(\"Relationship between fare and target\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DecisionTreeDiscretiser for Multi-class classification" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>sepal length (cm)</th><th>sepal width (cm)</th><th>petal length (cm)</th><th>petal width (cm)</th><th>type</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>5.1</td><td>3.5</td><td>1.4</td><td>0.2</td><td>0</td></tr>\n",
" <tr><th>1</th><td>4.9</td><td>3.0</td><td>1.4</td><td>0.2</td><td>0</td></tr>\n",
" <tr><th>2</th><td>4.7</td><td>3.2</td><td>1.3</td><td>0.2</td><td>0</td></tr>\n",
" <tr><th>3</th><td>4.6</td><td>3.1</td><td>1.5</td><td>0.2</td><td>0</td></tr>\n",
" <tr><th>4</th><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td><td>0</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", "0 5.1 3.5 1.4 0.2 \n", "1 4.9 3.0 1.4 0.2 \n", "2 4.7 3.2 1.3 0.2 \n", "3 4.6 3.1 1.5 0.2 \n", "4 5.0 3.6 1.4 0.2 \n", "\n", " type \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load iris dataset from sklearn\n", "from sklearn.datasets import load_iris\n", "\n", "data = pd.DataFrame(load_iris().data, \n", " columns=load_iris().feature_names).join(\n", " pd.Series(load_iris().target, name='type'))\n", "\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 2])" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.type.unique() # 3 - class classification" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(105, 4)\n", "(45, 4)\n" ] } ], "source": [ "# let's separate into training and testing set\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(data.drop('type', axis=1),\n", " data['type'],\n", " test_size=0.3,\n", " random_state=0)\n", "\n", "print(X_train.shape)\n", "print(X_test.shape)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sepal length (cm) float64\n", "sepal width (cm) float64\n", "dtype: object" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#selected two numerical variables\n", "X_train[['sepal length (cm)', 'sepal width (cm)']].dtypes" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DecisionTreeDiscretiser(cv=3, param_grid={'max_depth': [1, 2, 3, 4]},\n", " random_state=29, regression=False, scoring='accuracy',\n", " variables=['sepal length (cm)', 'sepal 
width (cm)'])" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "treeDisc = DecisionTreeDiscretiser(cv=3,\n", " scoring='accuracy',\n", " variables=[\n", " 'sepal length (cm)', 'sepal width (cm)'],\n", " regression=False,\n", " random_state=29,\n", " )\n", "\n", "treeDisc.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'sepal length (cm)': GridSearchCV(cv=3, error_score=nan,\n", " estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,\n", " criterion='gini', max_depth=None,\n", " max_features=None,\n", " max_leaf_nodes=None,\n", " min_impurity_decrease=0.0,\n", " min_impurity_split=None,\n", " min_samples_leaf=1,\n", " min_samples_split=2,\n", " min_weight_fraction_leaf=0.0,\n", " presort='deprecated',\n", " random_state=29,\n", " splitter='best'),\n", " iid='deprecated', n_jobs=None,\n", " param_grid={'max_depth': [1, 2, 3, 4]}, pre_dispatch='2*n_jobs',\n", " refit=True, return_train_score=False, scoring='accuracy',\n", " verbose=0),\n", " 'sepal width (cm)': GridSearchCV(cv=3, error_score=nan,\n", " estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,\n", " criterion='gini', max_depth=None,\n", " max_features=None,\n", " max_leaf_nodes=None,\n", " min_impurity_decrease=0.0,\n", " min_impurity_split=None,\n", " min_samples_leaf=1,\n", " min_samples_split=2,\n", " min_weight_fraction_leaf=0.0,\n", " presort='deprecated',\n", " random_state=29,\n", " splitter='best'),\n", " iid='deprecated', n_jobs=None,\n", " param_grid={'max_depth': [1, 2, 3, 4]}, pre_dispatch='2*n_jobs',\n", " refit=True, return_train_score=False, scoring='accuracy',\n", " verbose=0)}" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "treeDisc.binner_dict_" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "train_t = treeDisc.transform(X_train)\n", "test_t = 
treeDisc.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>sepal length (cm)</th><th>sepal width (cm)</th><th>sepalLen_binned</th><th>sepalWid_binned</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>60</th><td>5.0</td><td>2.0</td><td>0.125000</td><td>1.000000</td></tr>\n",
" <tr><th>116</th><td>6.5</td><td>3.0</td><td>0.296296</td><td>0.250000</td></tr>\n",
" <tr><th>144</th><td>6.7</td><td>3.3</td><td>0.296296</td><td>0.200000</td></tr>\n",
" <tr><th>119</th><td>6.0</td><td>2.2</td><td>0.296296</td><td>0.500000</td></tr>\n",
" <tr><th>108</th><td>6.7</td><td>2.5</td><td>0.296296</td><td>0.434783</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " sepal length (cm) sepal width (cm) sepalLen_binned sepalWid_binned\n", "60 5.0 2.0 0.125000 1.000000\n", "116 6.5 3.0 0.296296 0.250000\n", "144 6.7 3.3 0.296296 0.200000\n", "119 6.0 2.2 0.296296 0.500000\n", "108 6.7 2.5 0.296296 0.434783" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# here I put side by side the original variable and the transformed variable\n", "tmp = pd.concat([X_train[['sepal length (cm)', 'sepal width (cm)']],\n", " train_t[['sepal length (cm)', 'sepal width (cm)']]], axis=1)\n", "\n", "tmp.columns = ['sepal length (cm)', 'sepal width (cm)',\n", " 'sepalLen_binned', 'sepalWid_binned']\n", "\n", "tmp.head()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.subplot(1, 2, 1)\n", "tmp.groupby('sepalLen_binned')['sepal length (cm)'].count().plot.bar()\n", "plt.ylabel('Number of flowers')\n", "plt.title('Number of observations per discrete value')\n", "\n", "plt.subplot(1, 2, 2)\n", "tmp.groupby('sepalWid_binned')['sepal width (cm)'].count().plot.bar()\n", "plt.ylabel('Number of flowers')\n", "plt.title('Number of observations per discrete value')\n", "\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "fengine", "language": "python", "name": "fengine" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }