{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Winsorizer\n", "The Winsorizer() finds the maximum and minimum capping values assuming either a Gaussian or a skewed distribution, as indicated by the user. It can cap the right tail, the left tail, or both tails of the distribution.\n", "\n", "Values beyond the capping values are replaced by the capping values themselves.\n", "\n", "The Winsorizer() works only with numerical variables. A list of variables can\n", "be indicated. Alternatively, the Winsorizer() will select all numerical\n", "variables in the training set.\n", "\n", "The Winsorizer() first calculates the capping values at the ends of the\n", "distribution. The values are determined using one of:\n", "\n", "- a Gaussian approximation,\n", "- the inter-quartile range (IQR) proximity rule,\n", "- percentiles.\n", "\n", "\n", "### Example" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Import libraries\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.model_selection import train_test_split\n", "\n", "from feature_engine.outliers import Winsorizer" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Load the Titanic dataset from OpenML\n", "\n", "def load_titanic():\n", "    data = pd.read_csv(\n", "        'https://www.openml.org/data/get_csv/16826755/phpMYEkMl')\n", "    # Missing values are encoded as '?' in the raw file\n", "    data = data.replace('?', np.nan)\n", "    # Keep only the deck letter of the cabin\n", "    data['cabin'] = data['cabin'].astype(str).str[0]\n", "    data['pclass'] = data['pclass'].astype('O')\n", "    # Assign the result back instead of calling fillna(inplace=True) on a\n", "    # column selection, which may not modify the DataFrame in recent pandas\n", "    data['embarked'] = data['embarked'].fillna('C')\n", "    data['fare'] = data['fare'].astype('float')\n", "    data['fare'] = data['fare'].fillna(data['fare'].median())\n", "    data['age'] = data['age'].astype('float')\n", "    data['age'] = data['age'].fillna(data['age'].median())\n", "    data.drop(['name', 'ticket'], axis=1, inplace=True)\n", "    return data\n", "\n", "# Plot a histogram of the given numerical feature\n", "\n", "\n", "def plot_hist(data, col):\n", "    plt.figure(figsize=(8, 5))\n", "    plt.hist(data[col], bins=30)\n", "    plt.title(\"Distribution of \" + col)\n", "    plt.show()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | pclass | survived | sex | age | sibsp | parch | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 157 | 1 | 0 | male | 28.0 | 0 | 0 | 51.8625 | E | S | NaN | NaN | Brighton, MA |
| 400 | 2 | 1 | female | 34.0 | 1 | 1 | 32.5000 | n | S | 10 | NaN | Greenport, NY |
| 546 | 2 | 1 | female | 28.0 | 0 | 0 | 13.0000 | n | S | 9 | NaN | Spain |
| 618 | 3 | 0 | male | 35.0 | 0 | 0 | 8.0500 | n | S | NaN | NaN | Lower Clapton, Middlesex or Erdington, Birmingham |
| 1208 | 3 | 0 | female | 9.0 | 3 | 2 | 27.9000 | n | S | NaN | NaN | NaN |
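
The three rules named in the introduction (Gaussian approximation, IQR proximity rule, percentiles) can be illustrated with plain pandas. This is a minimal sketch, not feature-engine's implementation: the helper names `capping_limits` and `cap_values` are hypothetical, the small `fare` series is made-up sample data, and the meaning given to `fold` for each rule is an assumption chosen for illustration.

```python
import pandas as pd


def capping_limits(series, capping_method="gaussian", fold=3.0):
    """Return (lower, upper) capping values for a numeric Series.

    Hypothetical helper; `fold` is assumed to be a multiplier for the
    'gaussian' and 'iqr' rules and a tail probability for 'quantiles'.
    """
    if capping_method == "gaussian":
        # mean +/- fold * standard deviation
        center, spread = series.mean(), series.std()
        return center - fold * spread, center + fold * spread
    if capping_method == "iqr":
        # quartiles extended by fold * inter-quartile range
        q1, q3 = series.quantile(0.25), series.quantile(0.75)
        iqr = q3 - q1
        return q1 - fold * iqr, q3 + fold * iqr
    if capping_method == "quantiles":
        # the fold-th and (1 - fold)-th percentiles
        return series.quantile(fold), series.quantile(1 - fold)
    raise ValueError(f"unknown capping_method: {capping_method!r}")


def cap_values(series, lower=None, upper=None):
    """Replace values beyond the given limits with the limits themselves."""
    return series.clip(lower=lower, upper=upper)


# Made-up fares with one extreme value on the right tail
fare = pd.Series([7.25, 8.05, 13.0, 26.0, 27.9, 32.5, 51.86, 512.33])

low, high = capping_limits(fare, capping_method="iqr", fold=1.5)
capped_right = cap_values(fare, upper=high)            # right tail only
capped_both = cap_values(fare, lower=low, upper=high)  # both tails
```

Passing only `upper` caps the right tail, only `lower` caps the left tail, and both arguments cap both ends, which mirrors the choice of tails described in the introduction.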