{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Missing value imputation: DropMissingData\n", "\n", "Deletes rows with missing values.\n", "\n", "DropMissingData works both with numerical and categorical variables.\n", "\n", "**For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock:**\n", "\n", "[Dean De Cock (2011) Ames, Iowa: Alternative to the Boston Housing\n", "Data as an End of Semester Regression Project, Journal of Statistics Education, Vol.19, No. 3](http://jse.amstat.org/v19n3/decock.pdf)\n", "\n", "The version of the dataset used in this notebook can be obtained from [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Version" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'1.2.0'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Make sure you are using this \n", "# Feature-engine version.\n", "\n", "import feature_engine\n", "\n", "feature_engine.__version__" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.model_selection import train_test_split\n", "\n", "from feature_engine.imputation import DropMissingData" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdMSSubClassMSZoningLotFrontageLotAreaStreetAlleyLotShapeLandContourUtilities...PoolAreaPoolQCFenceMiscFeatureMiscValMoSoldYrSoldSaleTypeSaleConditionSalePrice
0160RL65.08450PaveNaNRegLvlAllPub...0NaNNaNNaN022008WDNormal208500
1220RL80.09600PaveNaNRegLvlAllPub...0NaNNaNNaN052007WDNormal181500
2360RL68.011250PaveNaNIR1LvlAllPub...0NaNNaNNaN092008WDNormal223500
3470RL60.09550PaveNaNIR1LvlAllPub...0NaNNaNNaN022006WDAbnorml140000
4560RL84.014260PaveNaNIR1LvlAllPub...0NaNNaNNaN0122008WDNormal250000
\n", "

5 rows × 81 columns

\n", "
" ], "text/plain": [ " Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n", "0 1 60 RL 65.0 8450 Pave NaN Reg \n", "1 2 20 RL 80.0 9600 Pave NaN Reg \n", "2 3 60 RL 68.0 11250 Pave NaN IR1 \n", "3 4 70 RL 60.0 9550 Pave NaN IR1 \n", "4 5 60 RL 84.0 14260 Pave NaN IR1 \n", "\n", " LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold \\\n", "0 Lvl AllPub ... 0 NaN NaN NaN 0 2 \n", "1 Lvl AllPub ... 0 NaN NaN NaN 0 5 \n", "2 Lvl AllPub ... 0 NaN NaN NaN 0 9 \n", "3 Lvl AllPub ... 0 NaN NaN NaN 0 2 \n", "4 Lvl AllPub ... 0 NaN NaN NaN 0 12 \n", "\n", " YrSold SaleType SaleCondition SalePrice \n", "0 2008 WD Normal 208500 \n", "1 2007 WD Normal 181500 \n", "2 2008 WD Normal 223500 \n", "3 2006 WD Abnorml 140000 \n", "4 2008 WD Normal 250000 \n", "\n", "[5 rows x 81 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Download the data from Kaggle and store it\n", "# in the same folder as this notebook.\n", "\n", "data = pd.read_csv('houseprice.csv')\n", "\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((1022, 79), (438, 79))" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Separate the data into train and test sets.\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " data.drop(['Id', 'SalePrice'], axis=1),\n", " data['SalePrice'],\n", " test_size=0.3,\n", " random_state=0,\n", ")\n", "\n", "X_train.shape, X_test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Drop data based on specific variables.\n", "\n", "We can drop observations that show NA in any of a subset of variables." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Drop data when there are NA in any of the indicated variables\n", "\n", "imputer = DropMissingData(\n", " variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'],\n", " missing_only=False,\n", ")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DropMissingData(missing_only=False,\n", " variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "imputer.fit(X_train)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# variables from which observations with NA will be deleted\n", "\n", "imputer.variables_" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Alley 960\n", "MasVnrType 5\n", "LotFrontage 189\n", "MasVnrArea 5\n", "dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Number of observations with NA before the transformation\n", "\n", "X_train[imputer.variables].isna().sum()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# After the transformation the rows with NA values are \n", "# deleted form the dataframe\n", "\n", "train_t = imputer.transform(X_train)\n", "test_t = imputer.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Alley 0\n", "MasVnrType 0\n", "LotFrontage 0\n", "MasVnrArea 0\n", "dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Number of observations with NA after transformation\n", "\n", "train_t[imputer.variables].isna().sum()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1022, 79)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Shape of dataframe before transformation\n", "\n", "X_train.shape" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(59, 79)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Shape of dataframe after transformation\n", "\n", "train_t.shape" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(963, 79)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The \"return_na_data()\" method, returns a dataframe that contains\n", "# the observations with NA. \n", "\n", "# That is, the portion of the data that is dropped when\n", "# we apply the transform() method.\n", "\n", "tmp = imputer.return_na_data(X_train)\n", "\n", "tmp.shape" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "59" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# total obs - obs with NA = final dataframe shape\n", "# after the transformation\n", "\n", "1022-963" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes, it is useful to retain the observation with NA in the production environment, to log which\n", "observations are not being scored by the model for example." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Drop data when variables contain % of NA\n", "\n", "We can drop observations if they contain less than a required percentage of values in a subset of observations." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Drop data if an observation contains NA in \n", "# 2 of the 4 indicated variables (50%).\n", "\n", "imputer = DropMissingData(\n", " variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'],\n", " missing_only=False,\n", " threshold=0.5,\n", ")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DropMissingData(missing_only=False, threshold=0.5,\n", " variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "imputer.fit(X_train)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# After the transformation the rows with NA values are \n", "# deleted form the dataframe\n", "\n", "train_t = imputer.transform(X_train)\n", "test_t = imputer.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Alley 955\n", "MasVnrType 0\n", "LotFrontage 188\n", "MasVnrArea 0\n", "dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Number of observations with NA after transformation\n", "\n", "train_t[imputer.variables].isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that not all missing observations were dropped, because we required the observation to have NA in more than 1 of the variables at the time. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Automatically select all variables\n", "\n", "We can drop obserations if they show NA in any variable in the dataset.\n", "\n", "When the parameter `variables` is left to None and the parameter `missing_only` is left to True, the imputer will evaluate observations based of all variables with missing data.\n", "\n", "When the parameter `variables` is left to None and the parameter `missing_only` is switched to False, the imputer will evaluate observations based of all variables.\n", "\n", "It is good practice to use `missing_only=True` when we set `variables=None`, so that the transformer handles the imputation automatically in a meaningful way.\n", "\n", "### Automatically find variables with NA" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DropMissingData()" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find variables with NA\n", "\n", "imputer = DropMissingData(missing_only=True)\n", "\n", "imputer.fit(X_train)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['LotFrontage',\n", " 'Alley',\n", " 'MasVnrType',\n", " 'MasVnrArea',\n", " 'BsmtQual',\n", " 'BsmtCond',\n", " 'BsmtExposure',\n", " 'BsmtFinType1',\n", " 'BsmtFinType2',\n", " 'Electrical',\n", " 'FireplaceQu',\n", " 'GarageType',\n", " 'GarageYrBlt',\n", " 'GarageFinish',\n", " 'GarageQual',\n", " 'GarageCond',\n", " 'PoolQC',\n", " 'Fence',\n", " 'MiscFeature']" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# variables with NA in the train set\n", "\n", "imputer.variables_" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LotFrontage 189\n", "Alley 960\n", "MasVnrType 5\n", "MasVnrArea 5\n", "BsmtQual 24\n", "BsmtCond 24\n", "BsmtExposure 24\n", "BsmtFinType1 24\n", "BsmtFinType2 25\n", "Electrical 1\n", "FireplaceQu 478\n", "GarageType 54\n", "GarageYrBlt 54\n", "GarageFinish 54\n", "GarageQual 54\n", "GarageCond 54\n", "PoolQC 1019\n", "Fence 831\n", "MiscFeature 978\n", "dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Number of observations with NA\n", "\n", "X_train[imputer.variables_].isna().sum()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# After the transformation the rows with NA are deleted form the dataframe\n", "\n", "train_t = imputer.transform(X_train)\n", "test_t = imputer.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LotFrontage 0.0\n", "Alley 0.0\n", "MasVnrType 0.0\n", "MasVnrArea 0.0\n", "BsmtQual 0.0\n", "BsmtCond 0.0\n", "BsmtExposure 0.0\n", "BsmtFinType1 0.0\n", "BsmtFinType2 0.0\n", "Electrical 0.0\n", "FireplaceQu 0.0\n", "GarageType 0.0\n", "GarageYrBlt 0.0\n", "GarageFinish 0.0\n", "GarageQual 0.0\n", "GarageCond 0.0\n", "PoolQC 0.0\n", "Fence 0.0\n", "MiscFeature 0.0\n", "dtype: float64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Number of observations with NA after the transformation\n", "\n", "train_t[imputer.variables_].isna().sum()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0, 79)" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# in this case, all observations will be dropped\n", "# because all of them show NA at least in 1 variable\n", "\n", "train_t.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Drop rows with % of missing data\n", "\n", "Not to end up with an empty dataframe, let's drop rows that have less than 75% of the variables with values." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DropMissingData(threshold=0.75)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find variables with NA\n", "\n", "imputer = DropMissingData(\n", " missing_only=True,\n", " threshold=0.75,\n", ")\n", "\n", "imputer.fit(X_train)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "# After the transformation the rows with NA are deleted form the dataframe\n", "\n", "train_t = imputer.transform(X_train)\n", "test_t = imputer.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1022, 79)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_t.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we do have some data left." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "fenotebook", "language": "python", "name": "fenotebook" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 4 }