{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# EndTailImputer\n", "\n", "The EndTailImputer() replaces missing data with a value at either tail of the distribution. It determines the imputation value automatically, using the mean plus or minus a factor of the standard deviation, or the inter-quartile range proximity rule. Alternatively, it can use a factor of the maximum value.\n", "\n", "The EndTailImputer() is, in essence, very similar to the ArbitraryNumberImputer, but it selects the imputation value automatically instead of having the user pre-define it.\n", "\n", "It works only with numerical variables.\n", "\n", "**For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock:**\n", "\n", "[Dean De Cock (2011) Ames, Iowa: Alternative to the Boston Housing\n", "Data as an End of Semester Regression Project, Journal of Statistics Education, Vol.19, No. 3](http://jse.amstat.org/v19n3/decock.pdf)\n", "\n", "The version of the dataset used in this notebook can be obtained from [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Version" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'1.2.0'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Make sure you are using this\n", "# Feature-engine version.\n", "\n", "import feature_engine\n", "\n", "feature_engine.__version__" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.model_selection import train_test_split\n", "\n", "from feature_engine.imputation import EndTailImputer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load data" ] }, { "cell_type": "code", 
"execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>Id</th><th>MSSubClass</th><th>MSZoning</th><th>LotFrontage</th><th>LotArea</th><th>Street</th><th>Alley</th><th>LotShape</th><th>LandContour</th><th>Utilities</th><th>...</th><th>PoolArea</th><th>PoolQC</th><th>Fence</th><th>MiscFeature</th><th>MiscVal</th><th>MoSold</th><th>YrSold</th><th>SaleType</th><th>SaleCondition</th><th>SalePrice</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>1</td><td>60</td><td>RL</td><td>65.0</td><td>8450</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>2</td><td>2008</td><td>WD</td><td>Normal</td><td>208500</td></tr>\n",
" <tr><th>1</th><td>2</td><td>20</td><td>RL</td><td>80.0</td><td>9600</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>5</td><td>2007</td><td>WD</td><td>Normal</td><td>181500</td></tr>\n",
" <tr><th>2</th><td>3</td><td>60</td><td>RL</td><td>68.0</td><td>11250</td><td>Pave</td><td>NaN</td><td>IR1</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>9</td><td>2008</td><td>WD</td><td>Normal</td><td>223500</td></tr>\n",
" <tr><th>3</th><td>4</td><td>70</td><td>RL</td><td>60.0</td><td>9550</td><td>Pave</td><td>NaN</td><td>IR1</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>2</td><td>2006</td><td>WD</td><td>Abnorml</td><td>140000</td></tr>\n",
" <tr><th>4</th><td>5</td><td>60</td><td>RL</td><td>84.0</td><td>14260</td><td>Pave</td><td>NaN</td><td>IR1</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>12</td><td>2008</td><td>WD</td><td>Normal</td><td>250000</td></tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 81 columns</p>\n",
"</div>
" ], "text/plain": [ " Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n", "0 1 60 RL 65.0 8450 Pave NaN Reg \n", "1 2 20 RL 80.0 9600 Pave NaN Reg \n", "2 3 60 RL 68.0 11250 Pave NaN IR1 \n", "3 4 70 RL 60.0 9550 Pave NaN IR1 \n", "4 5 60 RL 84.0 14260 Pave NaN IR1 \n", "\n", " LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold \\\n", "0 Lvl AllPub ... 0 NaN NaN NaN 0 2 \n", "1 Lvl AllPub ... 0 NaN NaN NaN 0 5 \n", "2 Lvl AllPub ... 0 NaN NaN NaN 0 9 \n", "3 Lvl AllPub ... 0 NaN NaN NaN 0 2 \n", "4 Lvl AllPub ... 0 NaN NaN NaN 0 12 \n", "\n", " YrSold SaleType SaleCondition SalePrice \n", "0 2008 WD Normal 208500 \n", "1 2007 WD Normal 181500 \n", "2 2008 WD Normal 223500 \n", "3 2006 WD Abnorml 140000 \n", "4 2008 WD Normal 250000 \n", "\n", "[5 rows x 81 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Download the data from Kaggle and store it\n", "# in the same folder as this notebook.\n", "\n", "data = pd.read_csv('houseprice.csv')\n", "\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((1022, 79), (438, 79))" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Separate the data into train and test sets.\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " data.drop(['Id', 'SalePrice'], axis=1),\n", " data['SalePrice'],\n", " test_size=0.3,\n", " random_state=0,\n", ")\n", "\n", "X_train.shape, X_test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check missing data" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LotFrontage 0.184932\n", "MasVnrArea 0.004892\n", "dtype: float64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# numerical variables with missing data\n", "\n", "X_train[['LotFrontage', 
'MasVnrArea']].isnull().mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The EndTailImputer can replace NA with a value at the left or right end of the distribution.\n", "\n", "In addition, it uses 3 different methods to identify the imputation values.\n", "\n", "In the following cells, we show how to use each method.\n", "\n", "## Gaussian, right tail\n", "\n", "Let's begin by finding the values automatically at the right tail, by using the mean and the standard deviation." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "imputer = EndTailImputer(\n", " \n", " # uses mean and standard deviation to determine the value\n", " imputation_method='gaussian',\n", " \n", " # value at right tail of distribution\n", " tail='right',\n", " \n", " # multiply the std by 3\n", " fold=3,\n", " \n", " # the variables to impute\n", " variables=['LotFrontage', 'MasVnrArea'],\n", ")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "EndTailImputer(variables=['LotFrontage', 'MasVnrArea'])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# find the imputation values\n", "\n", "imputer.fit(X_train)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'LotFrontage': 138.9022201686726, 'MasVnrArea': 648.3947111415165}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The values for the imputation\n", "\n", "imputer.imputer_dict_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we use different values for different variables." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# impute the data\n", "\n", "train_t = imputer.transform(X_train)\n", "test_t = imputer.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# check we no longer have NA\n", "\n", "train_t['LotFrontage'].isnull().sum()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# The variable distribution changed slightly with\n", "# more values accumulating towards the right tail\n", "\n", "fig = plt.figure()\n", "ax = fig.add_subplot(111)\n", "X_train['LotFrontage'].plot(kind='kde', ax=ax)\n", "train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')\n", "lines, labels = ax.get_legend_handles_labels()\n", "ax.legend(lines, labels, loc='best')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## IQR, left tail\n", "\n", "Now, we will impute variables with values at the left tail. The values are identified using the inter-quartile range proximity rule. \n", "\n", "The IQR rule is better suited for skewed variables." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "imputer = EndTailImputer(\n", " \n", " # uses the inter-quartile range proximity rule\n", " imputation_method='iqr',\n", " \n", " # determines values at the left tail of the distribution\n", " tail='left',\n", " \n", " # multiplies the IQR by 3\n", " fold=3,\n", " \n", " # the variables to impute\n", " variables=['LotFrontage', 'MasVnrArea'],\n", ")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "EndTailImputer(imputation_method='iqr', tail='left',\n", " variables=['LotFrontage', 'MasVnrArea'])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# finds the imputation values\n", "\n", "imputer.fit(X_train)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'LotFrontage': -8.0, 'MasVnrArea': -510.0}" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# imputation values per variable\n", "\n", "imputer.imputer_dict_" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# 
transform the data\n", "\n", "train_t = imputer.transform(X_train)\n", "test_t = imputer.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LotFrontage 0\n", "MasVnrArea 0\n", "dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check we have no NA after the transformation\n", "\n", "train_t[['LotFrontage', 'MasVnrArea']].isnull().sum()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# The variable distribution changed with the\n", "# transformation, with more values\n", "# accumulating towards the left tail.\n", "\n", "fig = plt.figure()\n", "ax = fig.add_subplot(111)\n", "X_train['LotFrontage'].plot(kind='kde', ax=ax)\n", "train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')\n", "lines, labels = ax.get_legend_handles_labels()\n", "ax.legend(lines, labels, loc='best')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Impute with the maximum value\n", "\n", "We can find imputation values with a factor of the maximum variable value." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "imputer = EndTailImputer(\n", " \n", " # imputes beyond the maximum value\n", " imputation_method='max',\n", " \n", " # multiplies the maximum value by 3\n", " fold=3,\n", " \n", " # the variables to impute\n", " variables=['LotFrontage', 'MasVnrArea'],\n", ")" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "EndTailImputer(imputation_method='max', variables=['LotFrontage', 'MasVnrArea'])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# find imputation values\n", "\n", "imputer.fit(X_train)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'LotFrontage': 939.0, 'MasVnrArea': 4800.0}" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The imputation values.\n", "\n", "imputer.imputer_dict_" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LotFrontage 313.0\n", "MasVnrArea 1600.0\n", "dtype: float64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the maximum values of the variables,\n", 
"# note how the imputer multiplied them by 3\n", "# to determine the imputation values.\n", "\n", "X_train[imputer.variables_].max()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# impute the data\n", "\n", "train_t = imputer.transform(X_train)\n", "test_t = imputer.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LotFrontage 0\n", "MasVnrArea 0\n", "dtype: int64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check we have no NA in the imputed data\n", "\n", "train_t[['LotFrontage', 'MasVnrArea']].isnull().sum()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# The variable distribution changed with the\n", "# transformation, with now more values\n", "# beyond the maximum.\n", "\n", "fig = plt.figure()\n", "ax = fig.add_subplot(111)\n", "X_train['LotFrontage'].plot(kind='kde', ax=ax)\n", "train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')\n", "lines, labels = ax.get_legend_handles_labels()\n", "ax.legend(lines, labels, loc='best')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Automatically impute all variables\n", "\n", "As with all Feature-engine transformers, the EndTailImputer can also find and impute all numerical variables in the data." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# Start the imputer\n", "\n", "imputer = EndTailImputer()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'gaussian'" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the default parameters\n", "\n", "# how to find the imputation value\n", "imputer.imputation_method" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'right'" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# which tail to use\n", "\n", "imputer.tail" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# how far out\n", "imputer.fold" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "EndTailImputer()" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find variables and imputation values\n", "\n", "imputer.fit(X_train)" ] }, { 
"cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['MSSubClass',\n", " 'LotFrontage',\n", " 'LotArea',\n", " 'OverallQual',\n", " 'OverallCond',\n", " 'YearBuilt',\n", " 'YearRemodAdd',\n", " 'MasVnrArea',\n", " 'BsmtFinSF1',\n", " 'BsmtFinSF2',\n", " 'BsmtUnfSF',\n", " 'TotalBsmtSF',\n", " '1stFlrSF',\n", " '2ndFlrSF',\n", " 'LowQualFinSF',\n", " 'GrLivArea',\n", " 'BsmtFullBath',\n", " 'BsmtHalfBath',\n", " 'FullBath',\n", " 'HalfBath',\n", " 'BedroomAbvGr',\n", " 'KitchenAbvGr',\n", " 'TotRmsAbvGrd',\n", " 'Fireplaces',\n", " 'GarageYrBlt',\n", " 'GarageCars',\n", " 'GarageArea',\n", " 'WoodDeckSF',\n", " 'OpenPorchSF',\n", " 'EnclosedPorch',\n", " '3SsnPorch',\n", " 'ScreenPorch',\n", " 'PoolArea',\n", " 'MiscVal',\n", " 'MoSold',\n", " 'YrSold']" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The variables to impute\n", "\n", "imputer.variables_" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'MSSubClass': 183.0960051903714,\n", " 'LotFrontage': 138.9022201686726,\n", " 'LotArea': 41441.796589850215,\n", " 'OverallQual': 10.152919665538322,\n", " 'OverallCond': 8.918356149675976,\n", " 'YearBuilt': 2061.66731604675,\n", " 'YearRemodAdd': 2046.2161089423614,\n", " 'MasVnrArea': 648.3947111415165,\n", " 'BsmtFinSF1': 1732.0016007094835,\n", " 'BsmtFinSF2': 520.9882766560984,\n", " 'BsmtUnfSF': 1865.113698435333,\n", " 'TotalBsmtSF': 2286.0497168767233,\n", " '1stFlrSF': 2283.6805173062803,\n", " '2ndFlrSF': 1677.2392305771546,\n", " 'LowQualFinSF': 149.0787736885176,\n", " 'GrLivArea': 3075.569310556133,\n", " 'BsmtFullBath': 1.9636856192070633,\n", " 'BsmtHalfBath': 0.7637721815299992,\n", " 'FullBath': 3.2012740993879882,\n", " 'HalfBath': 1.877166869732324,\n", " 'BedroomAbvGr': 5.303758597292265,\n", " 'KitchenAbvGr': 1.7084277213255645,\n", " 'TotRmsAbvGrd': 11.395793721778118,\n", " 
'Fireplaces': 2.519529226227064,\n", " 'GarageYrBlt': 2052.9707419772235,\n", " 'GarageCars': 3.966386813249906,\n", " 'GarageArea': 1095.8302008827814,\n", " 'WoodDeckSF': 480.04361090824267,\n", " 'OpenPorchSF': 250.26561495660084,\n", " 'EnclosedPorch': 216.43485488519244,\n", " '3SsnPorch': 89.5229867716376,\n", " 'ScreenPorch': 184.35773738383577,\n", " 'PoolArea': 101.82445982535369,\n", " 'MiscVal': 1817.7712851835915,\n", " 'MoSold': 14.42955308807171,\n", " 'YrSold': 2011.8643245428148}" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The imputation values\n", "\n", "imputer.imputer_dict_" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# impute the data\n", "\n", "train_t = imputer.transform(X_train)\n", "test_t = imputer.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Sanity check:\n", "\n", "# No numerical variable with NA is left in the\n", "# transformed data. Use > 0 so that a variable\n", "# with a single remaining NA is also reported.\n", "\n", "[v for v in train_t.columns if train_t[v].dtypes !=\n", " 'O' and train_t[v].isnull().sum() > 0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "fenotebook", "language": "python", "name": "fenotebook" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 
4, "nbformat_minor": 4 }