{ "cells": [ { "cell_type": "markdown", "id": "checked-orange", "metadata": {}, "source": [ "# OutlierTrimmer\n", "The OutlierTrimmer() removes observations with outliers from the dataset.\n", "\n", "It works only with numerical variables. You can pass a list of variables to examine.\n", "Otherwise, the OutlierTrimmer() selects all numerical variables.\n", "\n", "The OutlierTrimmer() first calculates the maximum and/or minimum values\n", "beyond which a value will be considered an outlier, and thus removed.\n", "\n", "The limits are determined using one of:\n", "\n", "- a Gaussian approximation\n", "- the inter-quartile range (IQR) proximity rule\n", "- percentiles.\n", "\n", "### Example:" ] }, { "cell_type": "code", "execution_count": 1, "id": "original-pasta", "metadata": {}, "outputs": [], "source": [ "# importing libraries\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.model_selection import train_test_split\n", "\n", "from feature_engine.outliers import OutlierTrimmer" ] }, { "cell_type": "code", "execution_count": 2, "id": "planned-programmer", "metadata": {}, "outputs": [], "source": [ "# Load the Titanic dataset from OpenML\n", "\n", "def load_titanic():\n", " data = pd.read_csv(\n", " 'https://www.openml.org/data/get_csv/16826755/phpMYEkMl')\n", " data = data.replace('?', np.nan)\n", " data['cabin'] = data['cabin'].astype(str).str[0]\n", " data['pclass'] = data['pclass'].astype('O')\n", " # assign the result back instead of calling fillna(inplace=True) on a column\n", " data['embarked'] = data['embarked'].fillna('C')\n", " data['fare'] = data['fare'].astype('float')\n", " data['fare'] = data['fare'].fillna(data['fare'].median())\n", " data['age'] = data['age'].astype('float')\n", " data['age'] = data['age'].fillna(data['age'].median())\n", " data.drop(['name', 'ticket'], axis=1, inplace=True)\n", " return data\n", "\n", "# Plot a histogram of a given numerical feature\n", "\n", "\n", "def plot_hist(data, col):\n", " plt.figure(figsize=(8, 5))\n", " plt.hist(data[col], bins=30)\n", " 
plt.title(\"Distribution of \" + col)\n", " return plt.show()" ] }, { "cell_type": "code", "execution_count": 3, "id": "objective-professor", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>pclass</th><th>survived</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th><th>boat</th><th>body</th><th>home.dest</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>675</th><td>3</td><td>0</td><td>male</td><td>21.0</td><td>0</td><td>0</td><td>7.775</td><td>n</td><td>S</td><td>NaN</td><td>NaN</td><td>Brennes, Norway New York</td></tr>\n",
"<tr><th>558</th><td>2</td><td>1</td><td>female</td><td>18.0</td><td>0</td><td>2</td><td>13.000</td><td>n</td><td>S</td><td>16</td><td>NaN</td><td>Finland / Minneapolis, MN</td></tr>\n",
"<tr><th>194</th><td>1</td><td>0</td><td>male</td><td>30.0</td><td>0</td><td>0</td><td>26.000</td><td>C</td><td>S</td><td>NaN</td><td>NaN</td><td>Brockton, MA</td></tr>\n",
"<tr><th>217</th><td>1</td><td>0</td><td>male</td><td>64.0</td><td>0</td><td>0</td><td>26.000</td><td>n</td><td>S</td><td>NaN</td><td>263</td><td>Isle of Wight, England</td></tr>\n",
"<tr><th>473</th><td>2</td><td>0</td><td>male</td><td>28.0</td><td>0</td><td>0</td><td>0.000</td><td>n</td><td>S</td><td>NaN</td><td>NaN</td><td>Belfast</td></tr>\n",
"</tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " pclass survived sex age sibsp parch fare cabin embarked boat \\\n", "675 3 0 male 21.0 0 0 7.775 n S NaN \n", "558 2 1 female 18.0 0 2 13.000 n S 16 \n", "194 1 0 male 30.0 0 0 26.000 C S NaN \n", "217 1 0 male 64.0 0 0 26.000 n S NaN \n", "473 2 0 male 28.0 0 0 0.000 n S NaN \n", "\n", " body home.dest \n", "675 NaN Brennes, Norway New York \n", "558 NaN Finland / Minneapolis, MN \n", "194 NaN Brockton, MA \n", "217 263 Isle of Wight, England \n", "473 NaN Belfast " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load the Titanic dataset\n", "data = load_titanic()\n", "data.sample(5)" ] }, { "cell_type": "code", "execution_count": 4, "id": "nervous-interference", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train data shape before removing outliers: (916, 11)\n", "test data shape before removing outliers: (393, 11)\n" ] } ], "source": [ "# let's separate the data into training and testing sets\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(data.drop('survived', axis=1),\n", " data['survived'],\n", " test_size=0.3,\n", " random_state=0)\n", "\n", "print(\"train data shape before removing outliers:\", X_train.shape)\n", "print(\"test data shape before removing outliers:\", X_test.shape)" ] }, { "cell_type": "code", "execution_count": 5, "id": "medium-chile", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Max age: 80.0\n", "Max fare: 512.3292\n", "Min age: 0.1667\n", "Min fare: 0.0\n" ] } ], "source": [ "# let's find the maximum and minimum age and fare in the Titanic dataset\n", "\n", "print(\"Max age:\", data.age.max())\n", "print(\"Max fare:\", data.fare.max())\n", "\n", "print(\"Min age:\", data.age.min())\n", "print(\"Min fare:\", data.fare.min())" ] }, { "cell_type": "code", "execution_count": 6, "id": "suburban-mills", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Histogram of the age feature before removing outliers\n", "plot_hist(data, 'age')" ] }, { "cell_type": "code", "execution_count": 7, "id": "compatible-finish", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Histogram of the fare feature before removing outliers\n", "plot_hist(data, 'fare')" ] }, { "cell_type": "markdown", "id": "weighted-palestinian", "metadata": {}, "source": [ "### Outlier trimming using Gaussian limits:\n", "The transformer finds the maximum and/or minimum values beyond which\n", "observations are removed, using the Gaussian approximation.\n", "With the default fold=3, the limits are:\n", "\n", "- right tail: mean + 3 * std\n", "- left tail: mean - 3 * std" ] }, { "cell_type": "code", "execution_count": 8, "id": "micro-knitting", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "OutlierTrimmer(variables=['age', 'fare'])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'''Parameters\n", "----------\n", "\n", "capping_method : str, default='gaussian'\n", " Method used to find the limits. Can take 'gaussian', 'iqr' or 'quantiles'.\n", " \n", "tail : str, default='right'\n", " Whether to remove outliers on the right, left or both tails of the distribution.\n", " Can take 'left', 'right' or 'both'.\n", "\n", "fold: int or float, default=3\n", " How far out to place the capping values. 
The number that multiplies\n", " the std or IQR to calculate the capping values.\n", "\n", "variables : list, default=None\n", " The variables to examine. If None, all numerical variables are selected.\n", "\n", "missing_values: string, default='raise'\n", " Whether to ignore missing values ('ignore') or raise an error\n", " when they are present ('raise').'''\n", "\n", "# remove outliers in the right tail of age and fare using the Gaussian method\n", "trimmer = OutlierTrimmer(\n", " capping_method='gaussian', tail='right', fold=3, variables=['age', 'fare'])\n", "\n", "# fit the trimmer to the training data\n", "trimmer.fit(X_train)" ] }, { "cell_type": "code", "execution_count": 9, "id": "revolutionary-giant", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'age': 67.49048447470315, 'fare': 174.78162171790441}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the upper limits learned from the training data;\n", "# observations with values above them are removed\n", "trimmer.right_tail_caps_" ] }, { "cell_type": "code", "execution_count": 10, "id": "requested-paint", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{}" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# this dictionary is empty because we selected only the right tail\n", "trimmer.left_tail_caps_" ] }, { "cell_type": "code", "execution_count": 15, "id": "extreme-contribution", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Max age: 66.0\n", "Max fare: 164.8667\n" ] } ], "source": [ "# transforming the training and testing data\n", "train_t = trimmer.transform(X_train)\n", "test_t = trimmer.transform(X_test)\n", "\n", "# let's check the new maximum age and fare in the trimmed training set\n", "print(\"Max age:\", train_t.age.max())\n", "print(\"Max fare:\", train_t.fare.max())" ] }, { "cell_type": "code", "execution_count": 12, "id": "mobile-charger", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train data shape after removing outliers: (887, 11)\n", "29 observations are removed\n", 
"\n", "test data shape after removing outliers: (376, 11)\n", "17 observations are removed\n" ] } ], "source": [ "print(\"train data shape after removing outliers:\", train_t.shape)\n", "print(f\"{X_train.shape[0] - train_t.shape[0]} observations are removed\\n\")\n", "\n", "print(\"test data shape after removing outliers:\", test_t.shape)\n", "print(f\"{X_test.shape[0] - test_t.shape[0]} observations are removed\")" ] }, { "cell_type": "markdown", "id": "duplicate-automation", "metadata": {}, "source": [ "### Gaussian approximation trimming, both tails\n", "With fold=2, the limits are mean - 2 * std and mean + 2 * std." ] }, { "cell_type": "code", "execution_count": 16, "id": "fifteen-parker", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "OutlierTrimmer(fold=2, tail='both', variables=['fare', 'age'])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# trimming the outliers at both tails using the Gaussian method\n", "trimmer = OutlierTrimmer(\n", " capping_method='gaussian', tail='both', fold=2, variables=['fare', 'age'])\n", "trimmer.fit(X_train)" ] }, { "cell_type": "code", "execution_count": 17, "id": "meaningful-kinase", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Minimum caps : {'fare': -62.30099726608475, 'age': 4.681562024142586}\n", "Maximum caps : {'fare': 127.36509792110658, 'age': 54.92869998459104}\n" ] } ], "source": [ "print(\"Minimum caps :\", trimmer.left_tail_caps_)\n", "\n", "print(\"Maximum caps :\", trimmer.right_tail_caps_)" ] }, { "cell_type": "code", "execution_count": 18, "id": "confidential-tradition", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train data shape after removing outliers: (803, 11)\n", "113 observations are removed\n", "\n", "test data shape after removing outliers: (334, 11)\n", "59 observations are removed\n" ] } ], "source": [ "# transforming the training and testing data\n", "train_t = trimmer.transform(X_train)\n", "test_t = trimmer.transform(X_test)\n", 
"\n", "print(\"train data shape after removing outliers:\", train_t.shape)\n", "print(f\"{X_train.shape[0] - train_t.shape[0]} observations are removed\\n\")\n", "\n", "print(\"test data shape after removing outliers:\", test_t.shape)\n", "print(f\"{X_test.shape[0] - test_t.shape[0]} observations are removed\")" ] }, { "cell_type": "markdown", "id": "fundamental-address", "metadata": {}, "source": [ "### Inter-quartile range, both tails\n", "The transformer finds the limits using the IQR proximity rule.\n", "With the default fold=3, the limits are:\n", "\n", "- right tail: 75th percentile + 3 * IQR\n", "- left tail: 25th percentile - 3 * IQR\n", "\n", "where IQR is the inter-quartile range: 75th percentile - 25th percentile.\n" ] }, { "cell_type": "code", "execution_count": 19, "id": "closed-knight", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "OutlierTrimmer(capping_method='iqr', tail='both', variables=['age', 'fare'])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# trimming at both tails using the IQR method\n", "trimmer = OutlierTrimmer(\n", " capping_method='iqr', tail='both', variables=['age', 'fare'])\n", "\n", "trimmer.fit(X_train)" ] }, { "cell_type": "code", "execution_count": 20, "id": "psychological-holmes", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Minimum caps : {'age': -13.0, 'fare': -62.24179999999999}\n", "Maximum caps : {'age': 71.0, 'fare': 101.4126}\n" ] } ], "source": [ "print(\"Minimum caps :\", trimmer.left_tail_caps_)\n", "\n", "print(\"Maximum caps :\", trimmer.right_tail_caps_)" ] }, { "cell_type": "code", "execution_count": 21, "id": "neither-enlargement", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train data shape after removing outliers: (857, 11)\n", "59 observations are removed\n", "\n", "test data shape after removing outliers: (365, 11)\n", "28 observations are removed\n" ] } ], "source": [ "# transforming the 
training and testing data\n", "train_t = trimmer.transform(X_train)\n", "test_t = trimmer.transform(X_test)\n", "\n", "print(\"train data shape after removing outliers:\", train_t.shape)\n", "print(f\"{X_train.shape[0] - train_t.shape[0]} observations are removed\\n\")\n", "\n", "print(\"test data shape after removing outliers:\", test_t.shape)\n", "print(f\"{X_test.shape[0] - test_t.shape[0]} observations are removed\")" ] }, { "cell_type": "markdown", "id": "robust-highland", "metadata": {}, "source": [ "### Percentiles or quantiles:\n", "The limits are given by the percentiles. With fold=0.02:\n", "\n", "- right tail: 98th percentile\n", "- left tail: 2nd percentile" ] }, { "cell_type": "code", "execution_count": 23, "id": "egyptian-northwest", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "OutlierTrimmer(capping_method='quantiles', fold=0.02, tail='both',\n", " variables=['age', 'fare'])" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# trimming at both tails using the quantiles method\n", "trimmer = OutlierTrimmer(capping_method='quantiles',\n", " tail='both', fold=0.02, variables=['age', 'fare'])\n", "\n", "trimmer.fit(X_train)" ] }, { "cell_type": "code", "execution_count": 24, "id": "banner-logistics", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Minimum caps : {'age': 2.0, 'fare': 6.44125}\n", "Maximum caps : {'age': 61.69999999999993, 'fare': 211.5}\n" ] } ], "source": [ "print(\"Minimum caps :\", trimmer.left_tail_caps_)\n", "\n", "print(\"Maximum caps :\", trimmer.right_tail_caps_)" ] }, { "cell_type": "code", "execution_count": 25, "id": "familiar-climate", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train data shape after removing outliers: (852, 11)\n", "64 observations are removed\n", "\n", "test data shape after removing outliers: (358, 11)\n", "35 observations are removed\n" ] } ], "source": [ "# transforming the training and 
testing data\n", "train_t = trimmer.transform(X_train)\n", "test_t = trimmer.transform(X_test)\n", "\n", "print(\"train data shape after removing outliers:\", train_t.shape)\n", "print(f\"{X_train.shape[0] - train_t.shape[0]} observations are removed\\n\")\n", "\n", "print(\"test data shape after removing outliers:\", test_t.shape)\n", "print(f\"{X_test.shape[0] - test_t.shape[0]} observations are removed\")" ] }, { "cell_type": "code", "execution_count": 26, "id": "usual-playlist", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Histogram of age feature after removing outliers\n", "plot_hist(train_t, 'age')" ] }, { "cell_type": "code", "execution_count": 27, "id": "yellow-group", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Histogram of fare feature after removing outliers\n", "plot_hist(train_t, 'fare')" ] }, { "cell_type": "code", "execution_count": null, "id": "unavailable-geography", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 5 }