{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Compare the effect of different scalers on data with outliers\n\nFeature 0 (median income in a block) and feature 5 (average house occupancy) of\nthe `california_housing_dataset` have very\ndifferent scales and contain some very large outliers. These two\ncharacteristics lead to difficulties to visualize the data and, more\nimportantly, they can degrade the predictive performance of many machine\nlearning algorithms. Unscaled data can also slow down or even prevent the\nconvergence of many gradient-based estimators.\n\nIndeed many estimators are designed with the assumption that each feature takes\nvalues close to zero or more importantly that all features vary on comparable\nscales. In particular, metric-based and gradient-based estimators often assume\napproximately standardized data (centered features with unit variances). A\nnotable exception are decision tree-based estimators that are robust to\narbitrary scaling of the data.\n\nThis example uses different scalers, transformers, and normalizers to bring the\ndata within a pre-defined range.\n\nScalers are linear (or more precisely affine) transformers and differ from each\nother in the way they estimate the parameters used to shift and scale each\nfeature.\n\n:class:`~sklearn.preprocessing.QuantileTransformer` provides non-linear\ntransformations in which distances\nbetween marginal outliers and inliers are shrunk.\n:class:`~sklearn.preprocessing.PowerTransformer` provides\nnon-linear transformations in which data is mapped to a normal distribution to\nstabilize variance and minimize skewness.\n\nUnlike the previous transformations, normalization refers to a per sample\ntransformation instead of a per feature transformation.\n\nThe following code is a bit verbose, feel free to jump directly to the analysis\nof the results_.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport matplotlib as mpl\nimport numpy as np\nfrom matplotlib import cm\nfrom matplotlib import pyplot as plt\n\nfrom sklearn.datasets import fetch_california_housing\nfrom sklearn.preprocessing import (\n MaxAbsScaler,\n MinMaxScaler,\n Normalizer,\n PowerTransformer,\n QuantileTransformer,\n RobustScaler,\n StandardScaler,\n minmax_scale,\n)\n\ndataset = fetch_california_housing()\nX_full, y_full = dataset.data, dataset.target\nfeature_names = dataset.feature_names\n\nfeature_mapping = {\n \"MedInc\": \"Median income in block\",\n \"HouseAge\": \"Median house age in block\",\n \"AveRooms\": \"Average number of rooms\",\n \"AveBedrms\": \"Average number of bedrooms\",\n \"Population\": \"Block population\",\n \"AveOccup\": \"Average house occupancy\",\n \"Latitude\": \"House block latitude\",\n \"Longitude\": \"House block longitude\",\n}\n\n# Take only 2 features to make visualization easier\n# Feature MedInc has a long tail distribution.\n# Feature AveOccup has a few but very large outliers.\nfeatures = [\"MedInc\", \"AveOccup\"]\nfeatures_idx = [feature_names.index(feature) for feature in features]\nX = X_full[:, features_idx]\ndistributions = [\n (\"Unscaled data\", X),\n (\"Data after standard scaling\", StandardScaler().fit_transform(X)),\n (\"Data after min-max scaling\", MinMaxScaler().fit_transform(X)),\n (\"Data after max-abs scaling\", MaxAbsScaler().fit_transform(X)),\n (\n \"Data after robust scaling\",\n 
RobustScaler(quantile_range=(25, 75)).fit_transform(X),\n ),\n (\n \"Data after power transformation (Yeo-Johnson)\",\n PowerTransformer(method=\"yeo-johnson\").fit_transform(X),\n ),\n (\n \"Data after power transformation (Box-Cox)\",\n PowerTransformer(method=\"box-cox\").fit_transform(X),\n ),\n (\n \"Data after quantile transformation (uniform pdf)\",\n QuantileTransformer(\n output_distribution=\"uniform\", random_state=42\n ).fit_transform(X),\n ),\n (\n \"Data after quantile transformation (gaussian pdf)\",\n QuantileTransformer(\n output_distribution=\"normal\", random_state=42\n ).fit_transform(X),\n ),\n (\"Data after sample-wise L2 normalizing\", Normalizer().fit_transform(X)),\n]\n\n# scale the output between 0 and 1 for the colorbar\ny = minmax_scale(y_full)\n\n# plasma does not exist in matplotlib < 1.5\ncmap = getattr(cm, \"plasma_r\", cm.hot_r)\n\n\ndef create_axes(title, figsize=(16, 6)):\n fig = plt.figure(figsize=figsize)\n fig.suptitle(title)\n\n # define the axis for the first plot\n left, width = 0.1, 0.22\n bottom, height = 0.1, 0.7\n bottom_h = height + 0.15\n left_h = left + width + 0.02\n\n rect_scatter = [left, bottom, width, height]\n rect_histx = [left, bottom_h, width, 0.1]\n rect_histy = [left_h, bottom, 0.05, height]\n\n ax_scatter = plt.axes(rect_scatter)\n ax_histx = plt.axes(rect_histx)\n ax_histy = plt.axes(rect_histy)\n\n # define the axis for the zoomed-in plot\n left = width + left + 0.2\n left_h = left + width + 0.02\n\n rect_scatter = [left, bottom, width, height]\n rect_histx = [left, bottom_h, width, 0.1]\n rect_histy = [left_h, bottom, 0.05, height]\n\n ax_scatter_zoom = plt.axes(rect_scatter)\n ax_histx_zoom = plt.axes(rect_histx)\n ax_histy_zoom = plt.axes(rect_histy)\n\n # define the axis for the colorbar\n left, width = width + left + 0.13, 0.01\n\n rect_colorbar = [left, bottom, width, height]\n ax_colorbar = plt.axes(rect_colorbar)\n\n return (\n (ax_scatter, ax_histy, ax_histx),\n (ax_scatter_zoom, ax_histy_zoom, ax_histx_zoom),\n ax_colorbar,\n )\n\n\ndef plot_distribution(axes, X, y, hist_nbins=50, title=\"\", x0_label=\"\", x1_label=\"\"):\n ax, hist_X1, hist_X0 = axes\n\n ax.set_title(title)\n ax.set_xlabel(x0_label)\n ax.set_ylabel(x1_label)\n\n # The scatter plot\n colors = cmap(y)\n ax.scatter(X[:, 0], X[:, 1], alpha=0.5, marker=\"o\", s=5, lw=0, c=colors)\n\n # Removing the top and the right spine for aesthetics\n # make nice axis layout\n ax.spines[\"top\"].set_visible(False)\n ax.spines[\"right\"].set_visible(False)\n ax.get_xaxis().tick_bottom()\n ax.get_yaxis().tick_left()\n ax.spines[\"left\"].set_position((\"outward\", 10))\n ax.spines[\"bottom\"].set_position((\"outward\", 10))\n\n # Histogram for axis X1 (feature 5)\n hist_X1.set_ylim(ax.get_ylim())\n hist_X1.hist(\n X[:, 1], bins=hist_nbins, orientation=\"horizontal\", color=\"grey\", ec=\"grey\"\n )\n hist_X1.axis(\"off\")\n\n # Histogram for axis X0 (feature 0)\n hist_X0.set_xlim(ax.get_xlim())\n hist_X0.hist(\n X[:, 0], bins=hist_nbins, orientation=\"vertical\", color=\"grey\", ec=\"grey\"\n )\n hist_X0.axis(\"off\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Two plots will be shown for each scaler/normalizer/transformer. The left\nfigure will show a scatter plot of the full data set while the right figure\nwill exclude the extreme values considering only 99 % of the data set,\nexcluding marginal outliers. 
In addition, the marginal distributions for each\nfeature will be shown on the sides of the scatter plot.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def make_plot(item_idx):\n title, X = distributions[item_idx]\n ax_zoom_out, ax_zoom_in, ax_colorbar = create_axes(title)\n axarr = (ax_zoom_out, ax_zoom_in)\n plot_distribution(\n axarr[0],\n X,\n y,\n hist_nbins=200,\n x0_label=feature_mapping[features[0]],\n x1_label=feature_mapping[features[1]],\n title=\"Full data\",\n )\n\n # zoom-in\n zoom_in_percentile_range = (0, 99)\n cutoffs_X0 = np.percentile(X[:, 0], zoom_in_percentile_range)\n cutoffs_X1 = np.percentile(X[:, 1], zoom_in_percentile_range)\n\n non_outliers_mask = np.all(X > [cutoffs_X0[0], cutoffs_X1[0]], axis=1) & np.all(\n X < [cutoffs_X0[1], cutoffs_X1[1]], axis=1\n )\n plot_distribution(\n axarr[1],\n X[non_outliers_mask],\n y[non_outliers_mask],\n hist_nbins=50,\n x0_label=feature_mapping[features[0]],\n x1_label=feature_mapping[features[1]],\n title=\"Zoom-in\",\n )\n\n norm = mpl.colors.Normalize(y_full.min(), y_full.max())\n mpl.colorbar.ColorbarBase(\n ax_colorbar,\n cmap=cmap,\n norm=norm,\n orientation=\"vertical\",\n label=\"Color mapping for values of y\",\n )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n## Original data\n\nEach transformation is plotted showing two transformed features, with the\nleft plot showing the entire dataset, and the right plot zoomed in to show the\ndataset without the marginal outliers. A large majority of the samples are\ncompacted into a specific range, [0, 10] for the median income and [0, 6] for\nthe average house occupancy. Note that there are some marginal outliers (some\nblocks have an average occupancy of more than 1200). Therefore, a specific\npre-processing can be very beneficial depending on the application. In the\nfollowing, we present some insights into the behavior of these pre-processing\nmethods in the presence of marginal outliers.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "make_plot(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n## StandardScaler\n\n:class:`~sklearn.preprocessing.StandardScaler` removes the mean and scales\nthe data to unit variance. The scaling shrinks the range of the feature\nvalues as shown in the left figure below.\nHowever, the outliers have an influence when computing the empirical mean and\nstandard deviation. Note in particular that because the outliers on each\nfeature have different magnitudes, the spread of the transformed data on\neach feature is very different: most of the data lie in the [-2, 4] range for\nthe transformed median income feature while the same data are squeezed into the\nsmaller [-0.2, 0.2] range for the transformed average house occupancy.\n\n:class:`~sklearn.preprocessing.StandardScaler` therefore cannot guarantee\nbalanced feature scales in the\npresence of outliers.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "make_plot(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n## MinMaxScaler\n\n:class:`~sklearn.preprocessing.MinMaxScaler` rescales the data set such that\nall feature values are in\nthe range [0, 1] as shown in the right panel below. 
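\n\nAs a reminder of the mapping (a minimal sketch with made-up values, not the\nestimator's internal code), each feature is rescaled using its own minimum and\nmaximum, so a single very large outlier squeezes all the inliers towards 0:\n\n```python\nimport numpy as np\n\nx = np.array([1.0, 2.0, 3.0, 1200.0])  # toy feature with one huge outlier\nx_scaled = (x - x.min()) / (x.max() - x.min())\n# the three inliers all end up below 0.002 while the outlier maps to 1.0\n```\n\n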
However, this scaling\ncompresses all inliers into the narrow range [0, 0.005] for the transformed\naverage house occupancy.\n\nBoth :class:`~sklearn.preprocessing.StandardScaler` and\n:class:`~sklearn.preprocessing.MinMaxScaler` are very sensitive to the\npresence of outliers.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "make_plot(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n## MaxAbsScaler\n\n:class:`~sklearn.preprocessing.MaxAbsScaler` is similar to\n:class:`~sklearn.preprocessing.MinMaxScaler` except that the\nvalues are mapped to different ranges depending on whether negative\nor positive values are present. If only positive values are present, the\nrange is [0, 1]. If only negative values are present, the range is [-1, 0].\nIf both negative and positive values are present, the range is [-1, 1].\nOn positive-only data, both :class:`~sklearn.preprocessing.MinMaxScaler`\nand :class:`~sklearn.preprocessing.MaxAbsScaler` behave similarly.\n:class:`~sklearn.preprocessing.MaxAbsScaler` therefore also suffers from\nthe presence of large outliers.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "make_plot(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n## RobustScaler\n\nUnlike the previous scalers, the centering and scaling statistics of\n:class:`~sklearn.preprocessing.RobustScaler`\nare based on percentiles and are therefore not influenced by a small\nnumber of very large marginal outliers. Consequently, the resulting ranges of\nthe transformed feature values are larger than for the previous scalers and,\nmore importantly, are approximately similar: for both features most of the\ntransformed values lie in a [-2, 3] range, as seen in the zoomed-in figure.\nNote that the outliers themselves are still present in the transformed data.\nIf separate outlier clipping is desirable, a non-linear transformation is\nrequired (see below).\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "make_plot(4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n## PowerTransformer\n\n:class:`~sklearn.preprocessing.PowerTransformer` applies a power\ntransformation to each feature to make the data more Gaussian-like in order\nto stabilize variance and minimize skewness. Currently, the Yeo-Johnson\nand Box-Cox transforms are supported, and the optimal\nscaling factor is determined via maximum likelihood estimation in both\nmethods. By default, :class:`~sklearn.preprocessing.PowerTransformer` applies\nzero-mean, unit-variance normalization. Note that\nBox-Cox can only be applied to strictly positive data. Income and average\nhouse occupancy happen to be strictly positive, but if negative values are\npresent, the Yeo-Johnson transform is preferred.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "make_plot(5)\nmake_plot(6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n## QuantileTransformer (uniform output)\n\n:class:`~sklearn.preprocessing.QuantileTransformer` applies a non-linear\ntransformation such that the\nprobability density function of each feature will be mapped to a uniform\nor Gaussian distribution. 
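\n\nFor the uniform output, the transformation is essentially the empirical\ncumulative distribution function of each feature. A rough sketch of that\nintuition (a hypothetical helper, not the estimator's actual implementation,\nwhich interpolates between a fixed number of reference quantiles):\n\n```python\nimport numpy as np\n\ndef uniform_quantile_sketch(x):\n    # rank each value, then rescale the ranks to the [0, 1] interval\n    ranks = np.argsort(np.argsort(x))\n    return ranks / (len(x) - 1)\n```\n\n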
In this case, all the data, including outliers,\nwill be mapped to a uniform distribution on the range [0, 1], making\noutliers indistinguishable from inliers.\n\n:class:`~sklearn.preprocessing.RobustScaler` and\n:class:`~sklearn.preprocessing.QuantileTransformer` are robust to outliers in\nthe sense that adding or removing outliers in the training set will yield\napproximately the same transformation. But contrary to\n:class:`~sklearn.preprocessing.RobustScaler`,\n:class:`~sklearn.preprocessing.QuantileTransformer` will also automatically\ncollapse any outliers by setting them to the a priori defined range boundaries\n(0 and 1). This can result in saturation artifacts for extreme values.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "make_plot(7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## QuantileTransformer (Gaussian output)\n\nTo map to a Gaussian distribution, set the parameter\n``output_distribution='normal'``.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "make_plot(8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n## Normalizer\n\nThe :class:`~sklearn.preprocessing.Normalizer` rescales the vector for each\nsample to have unit norm,\nindependently of the distribution of the samples. This can be seen in both\nfigures below, where all samples are mapped onto the unit circle. In our\nexample, the two selected features have only positive values; therefore, the\ntransformed data lie only in the positive quadrant. This would not be the\ncase if some original features had a mix of positive and negative values.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "make_plot(9)\n\nplt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }