{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Robust covariance estimation and Mahalanobis distances relevance\n\nThis example shows covariance estimation with Mahalanobis\ndistances on Gaussian distributed data.\n\nFor Gaussian distributed data, the distance of an observation\n$x_i$ to the mode of the distribution can be computed using its\nMahalanobis distance:\n\n\\begin{align}d_{(\\mu,\\Sigma)}(x_i)^2 = (x_i - \\mu)^T\\Sigma^{-1}(x_i - \\mu)\\end{align}\n\nwhere $\\mu$ and $\\Sigma$ are the location and the covariance of\nthe underlying Gaussian distributions.\n\nIn practice, $\\mu$ and $\\Sigma$ are replaced by some\nestimates. The standard covariance maximum likelihood estimate (MLE) is very\nsensitive to the presence of outliers in the data set and therefore,\nthe downstream Mahalanobis distances also are. It would be better to\nuse a robust estimator of covariance to guarantee that the estimation is\nresistant to \"erroneous\" observations in the dataset and that the\ncalculated Mahalanobis distances accurately reflect the true\norganization of the observations.\n\nThe Minimum Covariance Determinant estimator (MCD) is a robust,\nhigh-breakdown point (i.e. it can be used to estimate the covariance\nmatrix of highly contaminated datasets, up to\n$\\frac{n_\\text{samples}-n_\\text{features}-1}{2}$ outliers)\nestimator of covariance. The idea behind the MCD is to find\n$\\frac{n_\\text{samples}+n_\\text{features}+1}{2}$\nobservations whose empirical covariance has the smallest determinant,\nyielding a \"pure\" subset of observations from which to compute\nstandards estimates of location and covariance. The MCD was introduced by\nP.J.Rousseuw in [1]_.\n\nThis example illustrates how the Mahalanobis distances are affected by\noutlying data. Observations drawn from a contaminating distribution\nare not distinguishable from the observations coming from the real,\nGaussian distribution when using standard covariance MLE based Mahalanobis\ndistances. Using MCD-based\nMahalanobis distances, the two populations become\ndistinguishable. Associated applications include outlier detection,\nobservation ranking and clustering.\n\n
See also `sphx_glr_auto_examples_covariance_plot_robust_vs_empirical_covariance.py`
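A minimal sketch of the comparison described above, assuming scikit-learn's `EmpiricalCovariance` and `MinCovDet` estimators (the data generation and variable names here are illustrative, not taken from this example):

```python
# Illustrative sketch: MLE- vs. MCD-based squared Mahalanobis distances
# on Gaussian data contaminated with outliers (setup is hypothetical).
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.RandomState(42)
n_samples, n_outliers, n_features = 125, 25, 2

# Inliers from a standard Gaussian; the last rows are replaced by
# outliers drawn from a shifted, wider distribution.
X = rng.randn(n_samples, n_features)
X[-n_outliers:] = 5.0 * rng.randn(n_outliers, n_features) + 10.0

mle = EmpiricalCovariance().fit(X)      # standard MLE fit
mcd = MinCovDet(random_state=0).fit(X)  # robust MCD fit

# mahalanobis() returns the squared distances d_(mu, Sigma)(x_i)^2
d2_mle = mle.mahalanobis(X)
d2_mcd = mcd.mahalanobis(X)

print("mean squared distance of outliers (MLE):", d2_mle[-n_outliers:].mean())
print("mean squared distance of outliers (MCD):", d2_mcd[-n_outliers:].mean())
```

With the robust fit, the outliers' squared distances are much larger relative to the inliers', which is what makes the two populations separable. Since the squared distances of truly Gaussian points approximately follow a $\chi^2$ distribution with $n_\text{features}$ degrees of freedom, they can also be compared against chi-squared quantiles to calibrate an outlier threshold.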