{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Robust covariance estimation and Mahalanobis distances relevance\n\nThis example shows covariance estimation with Mahalanobis\ndistances on Gaussian distributed data.\n\nFor Gaussian distributed data, the distance of an observation\n$x_i$ to the mode of the distribution can be computed using its\nMahalanobis distance:\n\n\\begin{align}d_{(\\mu,\\Sigma)}(x_i)^2 = (x_i - \\mu)^T\\Sigma^{-1}(x_i - \\mu)\\end{align}\n\nwhere $\\mu$ and $\\Sigma$ are the location and the covariance of\nthe underlying Gaussian distributions.\n\nIn practice, $\\mu$ and $\\Sigma$ are replaced by some\nestimates. The standard covariance maximum likelihood estimate (MLE) is very\nsensitive to the presence of outliers in the data set and therefore,\nthe downstream Mahalanobis distances also are. It would be better to\nuse a robust estimator of covariance to guarantee that the estimation is\nresistant to \"erroneous\" observations in the dataset and that the\ncalculated Mahalanobis distances accurately reflect the true\norganization of the observations.\n\nThe Minimum Covariance Determinant estimator (MCD) is a robust,\nhigh-breakdown point (i.e. it can be used to estimate the covariance\nmatrix of highly contaminated datasets, up to\n$\\frac{n_\\text{samples}-n_\\text{features}-1}{2}$ outliers)\nestimator of covariance. The idea behind the MCD is to find\n$\\frac{n_\\text{samples}+n_\\text{features}+1}{2}$\nobservations whose empirical covariance has the smallest determinant,\nyielding a \"pure\" subset of observations from which to compute\nstandards estimates of location and covariance. The MCD was introduced by\nP.J.Rousseuw in [1]_.\n\nThis example illustrates how the Mahalanobis distances are affected by\noutlying data. Observations drawn from a contaminating distribution\nare not distinguishable from the observations coming from the real,\nGaussian distribution when using standard covariance MLE based Mahalanobis\ndistances. Using MCD-based\nMahalanobis distances, the two populations become\ndistinguishable. Associated applications include outlier detection,\nobservation ranking and clustering.\n\n
<div class=\"alert alert-info\"><h4>Note</h4><p>See also `sphx_glr_auto_examples_covariance_plot_robust_vs_empirical_covariance.py`</p></div>
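\n\nBefore moving on to the example, here is another small sketch (again an illustration with assumed toy data, seed and variable names, not part of the example) of the subset size $\\frac{n_\\text{samples}+n_\\text{features}+1}{2}$ discussed above, and of the support masks that `MinCovDet` exposes:\n\n```python\nimport numpy as np\nfrom sklearn.covariance import MinCovDet\n\nrng = np.random.RandomState(42)\nX_toy = rng.randn(125, 2)  # toy data with the same shape as in this example\n\nmcd = MinCovDet(random_state=0).fit(X_toy)\n\n# size of the \"pure\" subset searched for by the raw MCD estimate\nh = (X_toy.shape[0] + X_toy.shape[1] + 1) // 2\nprint(\"expected raw support size:\", h)  # 64\n# observations used for the raw (uncorrected) estimate\nprint(\"raw support size:\", mcd.raw_support_.sum())\n# the final, re-weighted estimate usually keeps more observations\nprint(\"re-weighted support size:\", mcd.support_.sum())\n```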
\n\n.. rubric:: References\n\n.. [1] P. J. Rousseeuw. [Least median of squares regression](http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/LeastMedianOfSquares.pdf). J. Am\n Stat Ass, 79:871, 1984.\n.. [2] Wilson, E. B., & Hilferty, M. M. (1931). [The distribution of chi-square.](https://water.usgs.gov/osw/bulletin17b/Wilson_Hilferty_1931.pdf)\n Proceedings of the National Academy of Sciences of the United States\n of America, 17, 684-688.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate data\n\nFirst, we generate a dataset of 125 samples and 2 features. Both features\nare Gaussian distributed with mean of 0 but feature 1 has a standard\ndeviation equal to 2 and feature 2 has a standard deviation equal to 1. Next,\n25 samples are replaced with Gaussian outlier samples where feature 1 has\na standard deviation equal to 1 and feature 2 has a standard deviation equal\nto 7.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\n# for consistent results\nnp.random.seed(7)\n\nn_samples = 125\nn_outliers = 25\nn_features = 2\n\n# generate Gaussian data of shape (125, 2)\ngen_cov = np.eye(n_features)\ngen_cov[0, 0] = 2.0\nX = np.dot(np.random.randn(n_samples, n_features), gen_cov)\n# add some outliers\noutliers_cov = np.eye(n_features)\noutliers_cov[np.arange(1, n_features), np.arange(1, n_features)] = 7.0\nX[-n_outliers:] = np.dot(np.random.randn(n_outliers, n_features), outliers_cov)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparison of results\n\nBelow, we fit MCD and MLE based covariance estimators to our data and print\nthe estimated covariance matrices. Note that the estimated variance of\nfeature 2 is much higher with the MLE based estimator (7.5) than\nthat of the MCD robust estimator (1.2). This shows that the MCD based\nrobust estimator is much more resistant to the outlier samples, which were\ndesigned to have a much larger variance in feature 2.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nfrom sklearn.covariance import EmpiricalCovariance, MinCovDet\n\n# fit a MCD robust estimator to data\nrobust_cov = MinCovDet().fit(X)\n# fit a MLE estimator to data\nemp_cov = EmpiricalCovariance().fit(X)\nprint(\n \"Estimated covariance matrix:\\nMCD (Robust):\\n{}\\nMLE:\\n{}\".format(\n robust_cov.covariance_, emp_cov.covariance_\n )\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To better visualize the difference, we plot contours of the\nMahalanobis distances calculated by both methods. 
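\n\nAs an aside (a hedged sketch rather than part of the figure below; it reuses `robust_cov`, `X` and `n_features` from the cells above, and the 0.975 quantile is an arbitrary choice), the squared Mahalanobis distances of Gaussian data follow a chi-square distribution with `n_features` degrees of freedom, so a chi-square quantile can serve as an outlier cutoff on the robust distances:\n\n```python\nfrom scipy.stats import chi2\n\n# squared robust Mahalanobis distances of the samples\nd2_robust = robust_cov.mahalanobis(X)\n# chi-square cutoff for n_features degrees of freedom at the 97.5% level\ncutoff = chi2.ppf(0.975, df=n_features)\nprint(\"samples flagged as outliers:\", int((d2_robust > cutoff).sum()))\n```\n\n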
Notice that the robust\nMCD based Mahalanobis distances fit the inlier black points much better,\nwhereas the MLE based distances are more influenced by the outlier\nred points.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.lines as mlines\n\nfig, ax = plt.subplots(figsize=(10, 5))\n# Plot data set\ninlier_plot = ax.scatter(X[:, 0], X[:, 1], color=\"black\", label=\"inliers\")\noutlier_plot = ax.scatter(\n X[:, 0][-n_outliers:], X[:, 1][-n_outliers:], color=\"red\", label=\"outliers\"\n)\nax.set_xlim(ax.get_xlim()[0], 10.0)\nax.set_title(\"Mahalanobis distances of a contaminated data set\")\n\n# Create meshgrid of feature 1 and feature 2 values\nxx, yy = np.meshgrid(\n np.linspace(plt.xlim()[0], plt.xlim()[1], 100),\n np.linspace(plt.ylim()[0], plt.ylim()[1], 100),\n)\nzz = np.c_[xx.ravel(), yy.ravel()]\n# Calculate the MLE based Mahalanobis distances of the meshgrid\nmahal_emp_cov = emp_cov.mahalanobis(zz)\nmahal_emp_cov = mahal_emp_cov.reshape(xx.shape)\nemp_cov_contour = plt.contour(\n xx, yy, np.sqrt(mahal_emp_cov), cmap=plt.cm.PuBu_r, linestyles=\"dashed\"\n)\n# Calculate the MCD based Mahalanobis distances\nmahal_robust_cov = robust_cov.mahalanobis(zz)\nmahal_robust_cov = mahal_robust_cov.reshape(xx.shape)\nrobust_contour = ax.contour(\n xx, yy, np.sqrt(mahal_robust_cov), cmap=plt.cm.YlOrBr_r, linestyles=\"dotted\"\n)\n\n# Add legend\nax.legend(\n [\n mlines.Line2D([], [], color=\"tab:blue\", linestyle=\"dashed\"),\n mlines.Line2D([], [], color=\"tab:orange\", linestyle=\"dotted\"),\n inlier_plot,\n outlier_plot,\n ],\n [\"MLE dist\", \"MCD dist\", \"inliers\", \"outliers\"],\n loc=\"upper right\",\n borderaxespad=0,\n)\n\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we highlight the ability of MCD based Mahalanobis distances to\ndistinguish outliers. We take the cubic root of the Mahalanobis distances,\nyielding approximately normal distributions (as suggested by Wilson and\nHilferty [2]_), then plot the values of inlier and outlier samples with\nboxplots. The distribution of outlier samples is more separated from the\ndistribution of inlier samples for robust MCD based Mahalanobis distances.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fig, (ax1, ax2) = plt.subplots(1, 2)\nplt.subplots_adjust(wspace=0.6)\n\n# Calculate cubic root of MLE Mahalanobis distances for samples\nemp_mahal = emp_cov.mahalanobis(X - np.mean(X, 0)) ** (0.33)\n# Plot boxplots\nax1.boxplot([emp_mahal[:-n_outliers], emp_mahal[-n_outliers:]], widths=0.25)\n# Plot individual samples\nax1.plot(\n np.full(n_samples - n_outliers, 1.26),\n emp_mahal[:-n_outliers],\n \"+k\",\n markeredgewidth=1,\n)\nax1.plot(np.full(n_outliers, 2.26), emp_mahal[-n_outliers:], \"+k\", markeredgewidth=1)\nax1.axes.set_xticklabels((\"inliers\", \"outliers\"), size=15)\nax1.set_ylabel(r\"$\\sqrt[3]{\\rm{(Mahal. 
dist.)}}$\", size=16)\nax1.set_title(\"Using non-robust estimates\\n(Maximum Likelihood)\")\n\n# Calculate cubic root of MCD Mahalanobis distances for samples\nrobust_mahal = robust_cov.mahalanobis(X - robust_cov.location_) ** (0.33)\n# Plot boxplots\nax2.boxplot([robust_mahal[:-n_outliers], robust_mahal[-n_outliers:]], widths=0.25)\n# Plot individual samples\nax2.plot(\n np.full(n_samples - n_outliers, 1.26),\n robust_mahal[:-n_outliers],\n \"+k\",\n markeredgewidth=1,\n)\nax2.plot(np.full(n_outliers, 2.26), robust_mahal[-n_outliers:], \"+k\", markeredgewidth=1)\nax2.axes.set_xticklabels((\"inliers\", \"outliers\"), size=15)\nax2.set_ylabel(r\"$\\sqrt[3]{\\rm{(Mahal. dist.)}}$\", size=16)\nax2.set_title(\"Using robust estimates\\n(Minimum Covariance Determinant)\")\n\nplt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }