{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Missing value imputation: RandomSampleImputer\n", "\n", "\n", "The RandomSampleImputer extracts a random sample of observations where data is available, and uses it to replace the NA. It is suitable for numerical and categorical variables.\n", "\n", "To control the random sample extraction, there are various ways to set a seed and ensure or maximize reproducibility.\n", "\n", "\n", "**For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock:**\n", "\n", "[Dean De Cock (2011) Ames, Iowa: Alternative to the Boston Housing\n", "Data as an End of Semester Regression Project, Journal of Statistics Education, Vol.19, No. 3](http://jse.amstat.org/v19n3/decock.pdf)\n", "\n", "The version of the dataset used in this notebook can be obtained from [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Version" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'1.2.0'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Make sure you are using this \n", "# Feature-engine version.\n", "\n", "import feature_engine\n", "\n", "feature_engine.__version__" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.model_selection import train_test_split\n", "\n", "from feature_engine.imputation import RandomSampleImputer" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Id</th><th>MSSubClass</th><th>MSZoning</th><th>LotFrontage</th><th>LotArea</th><th>Street</th><th>Alley</th><th>LotShape</th><th>LandContour</th><th>Utilities</th><th>...</th><th>PoolArea</th><th>PoolQC</th><th>Fence</th><th>MiscFeature</th><th>MiscVal</th><th>MoSold</th><th>YrSold</th><th>SaleType</th><th>SaleCondition</th><th>SalePrice</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>60</td><td>RL</td><td>65.0</td><td>8450</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>2</td><td>2008</td><td>WD</td><td>Normal</td><td>208500</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>20</td><td>RL</td><td>80.0</td><td>9600</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>5</td><td>2007</td><td>WD</td><td>Normal</td><td>181500</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>60</td><td>RL</td><td>68.0</td><td>11250</td><td>Pave</td><td>NaN</td><td>IR1</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>9</td><td>2008</td><td>WD</td><td>Normal</td><td>223500</td></tr>\n",
"    <tr><th>3</th><td>4</td><td>70</td><td>RL</td><td>60.0</td><td>9550</td><td>Pave</td><td>NaN</td><td>IR1</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>2</td><td>2006</td><td>WD</td><td>Abnorml</td><td>140000</td></tr>\n",
"    <tr><th>4</th><td>5</td><td>60</td><td>RL</td><td>84.0</td><td>14260</td><td>Pave</td><td>NaN</td><td>IR1</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>12</td><td>2008</td><td>WD</td><td>Normal</td><td>250000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5 rows × 81 columns</p>\n",
"</div>
" ], "text/plain": [ " Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n", "0 1 60 RL 65.0 8450 Pave NaN Reg \n", "1 2 20 RL 80.0 9600 Pave NaN Reg \n", "2 3 60 RL 68.0 11250 Pave NaN IR1 \n", "3 4 70 RL 60.0 9550 Pave NaN IR1 \n", "4 5 60 RL 84.0 14260 Pave NaN IR1 \n", "\n", " LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold \\\n", "0 Lvl AllPub ... 0 NaN NaN NaN 0 2 \n", "1 Lvl AllPub ... 0 NaN NaN NaN 0 5 \n", "2 Lvl AllPub ... 0 NaN NaN NaN 0 9 \n", "3 Lvl AllPub ... 0 NaN NaN NaN 0 2 \n", "4 Lvl AllPub ... 0 NaN NaN NaN 0 12 \n", "\n", " YrSold SaleType SaleCondition SalePrice \n", "0 2008 WD Normal 208500 \n", "1 2007 WD Normal 181500 \n", "2 2008 WD Normal 223500 \n", "3 2006 WD Abnorml 140000 \n", "4 2008 WD Normal 250000 \n", "\n", "[5 rows x 81 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Download the data from Kaggle and store it\n", "# in the same folder as this notebook.\n", "\n", "data = pd.read_csv('houseprice.csv')\n", "\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((1022, 79), (438, 79))" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Separate the data into train and test sets.\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " data.drop(['Id', 'SalePrice'], axis=1),\n", " data['SalePrice'],\n", " test_size=0.3,\n", " random_state=0,\n", ")\n", "\n", "X_train.shape, X_test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imputation in batch\n", "\n", "We can set the imputer to impute several observations in batch with a unique seed. This is the equivalent of setting the `random_state` to an integer in `pandas.sample()`." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Set up the imputer\n", "\n", "imputer = RandomSampleImputer(\n", "\n", "    # the variables to impute\n", "    variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'],\n", "\n", "    # the random state for reproducibility\n", "    random_state=10,\n", "\n", "    # equivalent to setting random_state in\n", "    # pandas.sample()\n", "    seed='general',\n", ")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomSampleImputer(random_state=10,\n", " variables=['Alley', 'MasVnrType', 'LotFrontage',\n", " 'MasVnrArea'])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# fit() stores a copy of the train set variables\n", "\n", "imputer.fit(X_train)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Alley</th><th>MasVnrType</th><th>LotFrontage</th><th>MasVnrArea</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>64</th><td>NaN</td><td>BrkFace</td><td>NaN</td><td>573.0</td></tr>\n",
"    <tr><th>682</th><td>NaN</td><td>None</td><td>NaN</td><td>0.0</td></tr>\n",
"    <tr><th>960</th><td>NaN</td><td>None</td><td>50.0</td><td>0.0</td></tr>\n",
"    <tr><th>1384</th><td>NaN</td><td>None</td><td>60.0</td><td>0.0</td></tr>\n",
"    <tr><th>1100</th><td>NaN</td><td>None</td><td>60.0</td><td>0.0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Alley MasVnrType LotFrontage MasVnrArea\n", "64 NaN BrkFace NaN 573.0\n", "682 NaN None NaN 0.0\n", "960 NaN None 50.0 0.0\n", "1384 NaN None 60.0 0.0\n", "1100 NaN None 60.0 0.0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the imputer saves a copy of the variables \n", "# from the training set to impute new data.\n", "\n", "imputer.X_.head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Alley 0.939335\n", "MasVnrType 0.004892\n", "LotFrontage 0.184932\n", "MasVnrArea 0.004892\n", "dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check missing data in train set\n", "\n", "X_train[['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea']].isnull().mean()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# impute data\n", "\n", "train_t = imputer.transform(X_train)\n", "test_t = imputer.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Alley 0.0\n", "MasVnrType 0.0\n", "LotFrontage 0.0\n", "MasVnrArea 0.0\n", "dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check missing data after the transformation\n", "\n", "train_t[['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea']].isnull().mean()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# when using the random sample imputer, \n", "# the distribution of the variable does not change.\n", "\n", "# This imputation method is useful for models that \n", "# are sensitive to changes in the variable distributions.\n", "\n", "fig = plt.figure()\n", "ax = fig.add_subplot(111)\n", "X_train['LotFrontage'].plot(kind='kde', ax=ax)\n", "train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')\n", "lines, labels = ax.get_legend_handles_labels()\n", "ax.legend(lines, labels, loc='best')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Specific seeds for each observation\n", "\n", "Sometimes, we want to guarantee that the same observation is imputed with the same value, run after run. \n", "\n", "To achieve this, we need to always use the same seed for every particular observation. \n", "\n", "To do this, we can use the values in neighboring variables as seed.\n", "\n", "In this case, the seed will be calculated observation per observation, either by adding or multiplying the seeding variable values, and passed to the random_state of pandas.sample(), which is used under the hood by the imputer.\n", "Then, a value will be extracted from the train set using that seed and used to replace the NAN in particular observation.\n", "\n", "**To know more about how the observation per seed is used check this [notebook](https://github.com/solegalli/feature-engineering-for-machine-learning/blob/master/Section-04-Missing-Data-Imputation/04.07-Random-Sample-Imputation.ipynb)** " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "imputer = RandomSampleImputer(\n", "\n", " # the values of these variables will be used as seed\n", " random_state=['MSSubClass', 'YrSold'],\n", "\n", " # 1 seed per observation\n", " seed='observation',\n", "\n", " # how to combine the values of the seeding variables\n", " 
seeding_method='add',\n", " \n", " # impute all variables, numerical and categorical\n", " variables=None,\n", ")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomSampleImputer(random_state=['MSSubClass', 'YrSold'], seed='observation')" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Stores a copy of the train set.\n", "\n", "imputer.fit(X_train)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>MSSubClass</th><th>MSZoning</th><th>LotFrontage</th><th>LotArea</th><th>Street</th><th>Alley</th><th>LotShape</th><th>LandContour</th><th>Utilities</th><th>LotConfig</th><th>...</th><th>ScreenPorch</th><th>PoolArea</th><th>PoolQC</th><th>Fence</th><th>MiscFeature</th><th>MiscVal</th><th>MoSold</th><th>YrSold</th><th>SaleType</th><th>SaleCondition</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>64</th><td>60</td><td>RL</td><td>NaN</td><td>9375</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>Inside</td><td>...</td><td>0</td><td>0</td><td>NaN</td><td>GdPrv</td><td>NaN</td><td>0</td><td>2</td><td>2009</td><td>WD</td><td>Normal</td></tr>\n",
"    <tr><th>682</th><td>120</td><td>RL</td><td>NaN</td><td>2887</td><td>Pave</td><td>NaN</td><td>Reg</td><td>HLS</td><td>AllPub</td><td>Inside</td><td>...</td><td>0</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>11</td><td>2008</td><td>WD</td><td>Normal</td></tr>\n",
"    <tr><th>960</th><td>20</td><td>RL</td><td>50.0</td><td>7207</td><td>Pave</td><td>NaN</td><td>IR1</td><td>Lvl</td><td>AllPub</td><td>Inside</td><td>...</td><td>0</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>2</td><td>2010</td><td>WD</td><td>Normal</td></tr>\n",
"    <tr><th>1384</th><td>50</td><td>RL</td><td>60.0</td><td>9060</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>Inside</td><td>...</td><td>0</td><td>0</td><td>NaN</td><td>MnPrv</td><td>NaN</td><td>0</td><td>10</td><td>2009</td><td>WD</td><td>Normal</td></tr>\n",
"    <tr><th>1100</th><td>30</td><td>RL</td><td>60.0</td><td>8400</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Bnk</td><td>AllPub</td><td>Inside</td><td>...</td><td>0</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>1</td><td>2009</td><td>WD</td><td>Normal</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>763</th><td>60</td><td>RL</td><td>82.0</td><td>9430</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>Inside</td><td>...</td><td>180</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>7</td><td>2009</td><td>WD</td><td>Normal</td></tr>\n",
"    <tr><th>835</th><td>20</td><td>RL</td><td>60.0</td><td>9600</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>Inside</td><td>...</td><td>0</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>2</td><td>2010</td><td>WD</td><td>Normal</td></tr>\n",
"    <tr><th>1216</th><td>90</td><td>RM</td><td>68.0</td><td>8930</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>Inside</td><td>...</td><td>0</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>4</td><td>2010</td><td>WD</td><td>Normal</td></tr>\n",
"    <tr><th>559</th><td>120</td><td>RL</td><td>NaN</td><td>3196</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>Inside</td><td>...</td><td>0</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>10</td><td>2006</td><td>WD</td><td>Normal</td></tr>\n",
"    <tr><th>684</th><td>60</td><td>RL</td><td>58.0</td><td>16770</td><td>Pave</td><td>NaN</td><td>IR2</td><td>Lvl</td><td>AllPub</td><td>CulDSac</td><td>...</td><td>0</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>6</td><td>2010</td><td>WD</td><td>Normal</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>1022 rows × 79 columns</p>\n",
"</div>
" ], "text/plain": [ " MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n", "64 60 RL NaN 9375 Pave NaN Reg \n", "682 120 RL NaN 2887 Pave NaN Reg \n", "960 20 RL 50.0 7207 Pave NaN IR1 \n", "1384 50 RL 60.0 9060 Pave NaN Reg \n", "1100 30 RL 60.0 8400 Pave NaN Reg \n", "... ... ... ... ... ... ... ... \n", "763 60 RL 82.0 9430 Pave NaN Reg \n", "835 20 RL 60.0 9600 Pave NaN Reg \n", "1216 90 RM 68.0 8930 Pave NaN Reg \n", "559 120 RL NaN 3196 Pave NaN Reg \n", "684 60 RL 58.0 16770 Pave NaN IR2 \n", "\n", " LandContour Utilities LotConfig ... ScreenPorch PoolArea PoolQC Fence \\\n", "64 Lvl AllPub Inside ... 0 0 NaN GdPrv \n", "682 HLS AllPub Inside ... 0 0 NaN NaN \n", "960 Lvl AllPub Inside ... 0 0 NaN NaN \n", "1384 Lvl AllPub Inside ... 0 0 NaN MnPrv \n", "1100 Bnk AllPub Inside ... 0 0 NaN NaN \n", "... ... ... ... ... ... ... ... ... \n", "763 Lvl AllPub Inside ... 180 0 NaN NaN \n", "835 Lvl AllPub Inside ... 0 0 NaN NaN \n", "1216 Lvl AllPub Inside ... 0 0 NaN NaN \n", "559 Lvl AllPub Inside ... 0 0 NaN NaN \n", "684 Lvl AllPub CulDSac ... 0 0 NaN NaN \n", "\n", " MiscFeature MiscVal MoSold YrSold SaleType SaleCondition \n", "64 NaN 0 2 2009 WD Normal \n", "682 NaN 0 11 2008 WD Normal \n", "960 NaN 0 2 2010 WD Normal \n", "1384 NaN 0 10 2009 WD Normal \n", "1100 NaN 0 1 2009 WD Normal \n", "... ... ... ... ... ... ... 
\n", "763 NaN 0 7 2009 WD Normal \n", "835 NaN 0 2 2010 WD Normal \n", "1216 NaN 0 4 2010 WD Normal \n", "559 NaN 0 10 2006 WD Normal \n", "684 NaN 0 6 2010 WD Normal \n", "\n", "[1022 rows x 79 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# takes a copy of the entire train set\n", "\n", "imputer.X_" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# imputes all variables.\n", "\n", "# this procedure takes a while because it is \n", "# done observation per observation.\n", "\n", "train_t = imputer.transform(X_train)\n", "test_t = imputer.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MSSubClass 0\n", "MSZoning 0\n", "LotFrontage 0\n", "LotArea 0\n", "Street 0\n", " ..\n", "MiscVal 0\n", "MoSold 0\n", "YrSold 0\n", "SaleType 0\n", "SaleCondition 0\n", "Length: 79, dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# No missing data in any variable\n", "# after the imputation.\n", "\n", "test_t.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# when using the random sample imputer, \n", "# the distribution of the variable does not change\n", "\n", "fig = plt.figure()\n", "ax = fig.add_subplot(111)\n", "X_train['LotFrontage'].plot(kind='kde', ax=ax)\n", "train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')\n", "lines, labels = ax.get_legend_handles_labels()\n", "ax.legend(lines, labels, loc='best')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "fenotebook", "language": "python", "name": "fenotebook" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 4 }