{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Robust vs Empirical covariance estimate\n\nThe usual covariance maximum likelihood estimate is very sensitive to the\npresence of outliers in the data set. In such a case, it would be better to\nuse a robust estimator of covariance to guarantee that the estimation is\nresistant to \"erroneous\" observations in the data set. [1]_, [2]_\n\n## Minimum Covariance Determinant Estimator\nThe Minimum Covariance Determinant estimator is a robust, high-breakdown point\n(i.e. it can be used to estimate the covariance matrix of highly contaminated\ndatasets, up to\n$\\frac{n_\\text{samples} - n_\\text{features}-1}{2}$ outliers) estimator of\ncovariance. The idea is to find\n$\\frac{n_\\text{samples} + n_\\text{features}+1}{2}$\nobservations whose empirical covariance has the smallest determinant, yielding\na \"pure\" subset of observations from which to compute standards estimates of\nlocation and covariance. After a correction step aiming at compensating the\nfact that the estimates were learned from only a portion of the initial data,\nwe end up with robust estimates of the data set location and covariance.\n\nThe Minimum Covariance Determinant estimator (MCD) has been introduced by\nP.J.Rousseuw in [3]_.\n\n## Evaluation\nIn this example, we compare the estimation errors that are made when using\nvarious types of location and covariance estimates on contaminated Gaussian\ndistributed data sets:\n\n- The mean and the empirical covariance of the full dataset, which break\n down as soon as there are outliers in the data set\n- The robust MCD, that has a low error provided\n $n_\\text{samples} > 5n_\\text{features}$\n- The mean and the empirical covariance of the observations that are known\n to be good ones. This can be considered as a \"perfect\" MCD estimation,\n so one can trust our implementation by comparing to this case.\n\n\n## References\n.. [1] Johanna Hardin, David M Rocke. The distribution of robust distances.\n Journal of Computational and Graphical Statistics. December 1, 2005,\n 14(4): 928-946.\n.. [2] Zoubir A., Koivunen V., Chakhchoukh Y. and Muma M. (2012). Robust\n estimation in signal processing: A tutorial-style treatment of\n fundamental concepts. IEEE Signal Processing Magazine 29(4), 61-80.\n.. [3] P. J. Rousseeuw. Least median of squares regression. 
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport matplotlib.font_manager\nimport matplotlib.pyplot as plt\nimport numpy as np\n\nfrom sklearn.covariance import EmpiricalCovariance, MinCovDet\n\n# example settings\nn_samples = 80\nn_features = 5\nrepeat = 10\n\nrange_n_outliers = np.concatenate(\n    (\n        np.linspace(0, n_samples / 8, 5),\n        np.linspace(n_samples / 8, n_samples / 2, 5)[1:-1],\n    )\n).astype(int)\n\n# definition of arrays to store results\nerr_loc_mcd = np.zeros((range_n_outliers.size, repeat))\nerr_cov_mcd = np.zeros((range_n_outliers.size, repeat))\nerr_loc_emp_full = np.zeros((range_n_outliers.size, repeat))\nerr_cov_emp_full = np.zeros((range_n_outliers.size, repeat))\nerr_loc_emp_pure = np.zeros((range_n_outliers.size, repeat))\nerr_cov_emp_pure = np.zeros((range_n_outliers.size, repeat))\n\n# computation\nfor i, n_outliers in enumerate(range_n_outliers):\n    for j in range(repeat):\n        rng = np.random.RandomState(i * j)\n\n        # generate data\n        X = rng.randn(n_samples, n_features)\n        # add some outliers\n        outliers_index = rng.permutation(n_samples)[:n_outliers]\n        outliers_offset = 10.0 * (\n            rng.randint(2, size=(n_outliers, n_features)) - 0.5\n        )\n        X[outliers_index] += outliers_offset\n        inliers_mask = np.ones(n_samples).astype(bool)\n        inliers_mask[outliers_index] = False\n\n        # fit a Minimum Covariance Determinant (MCD) robust estimator to data\n        mcd = MinCovDet().fit(X)\n        # compare raw robust estimates with the true location and covariance\n        err_loc_mcd[i, j] = np.sum(mcd.location_**2)\n        err_cov_mcd[i, j] = mcd.error_norm(np.eye(n_features))\n\n        # compare estimators learned from the full data set with true\n        # parameters\n        err_loc_emp_full[i, j] = np.sum(X.mean(0) ** 2)\n        err_cov_emp_full[i, j] = (\n            EmpiricalCovariance().fit(X).error_norm(np.eye(n_features))\n        )\n\n        # compare with an empirical covariance learned from a pure data set\n        # (i.e. \"perfect\" MCD)\n        pure_X = X[inliers_mask]\n        pure_location = pure_X.mean(0)\n        pure_emp_cov = EmpiricalCovariance().fit(pure_X)\n        err_loc_emp_pure[i, j] = np.sum(pure_location**2)\n        err_cov_emp_pure[i, j] = pure_emp_cov.error_norm(np.eye(n_features))\n\n# Display results\nfont_prop = matplotlib.font_manager.FontProperties(size=11)\nplt.subplot(2, 1, 1)\nlw = 2\nplt.errorbar(\n    range_n_outliers,\n    err_loc_mcd.mean(1),\n    yerr=err_loc_mcd.std(1) / np.sqrt(repeat),\n    label=\"Robust location\",\n    lw=lw,\n    color=\"m\",\n)\nplt.errorbar(\n    range_n_outliers,\n    err_loc_emp_full.mean(1),\n    yerr=err_loc_emp_full.std(1) / np.sqrt(repeat),\n    label=\"Full data set mean\",\n    lw=lw,\n    color=\"green\",\n)\nplt.errorbar(\n    range_n_outliers,\n    err_loc_emp_pure.mean(1),\n    yerr=err_loc_emp_pure.std(1) / np.sqrt(repeat),\n    label=\"Pure data set mean\",\n    lw=lw,\n    color=\"black\",\n)\nplt.title(\"Influence of outliers on the location estimation\")\nplt.ylabel(r\"Error ($||\\mu - \\hat{\\mu}||_2^2$)\")\nplt.legend(loc=\"upper left\", prop=font_prop)\n\nplt.subplot(2, 1, 2)\nx_size = range_n_outliers.size\nplt.errorbar(\n    range_n_outliers,\n    err_cov_mcd.mean(1),\n    yerr=err_cov_mcd.std(1),\n    label=\"Robust covariance (MCD)\",\n    color=\"m\",\n)\nplt.errorbar(\n    range_n_outliers[: (x_size // 5 + 1)],\n    err_cov_emp_full.mean(1)[: (x_size // 5 + 1)],\n    yerr=err_cov_emp_full.std(1)[: (x_size // 5 + 1)],\n    label=\"Full data set empirical covariance\",\n    color=\"green\",\n)\nplt.plot(\n    range_n_outliers[(x_size // 5) : (x_size // 2 - 1)],\n    err_cov_emp_full.mean(1)[(x_size // 5) : (x_size // 2 - 1)],\n    color=\"green\",\n    ls=\"--\",\n)\nplt.errorbar(\n    range_n_outliers,\n    err_cov_emp_pure.mean(1),\n    yerr=err_cov_emp_pure.std(1),\n    label=\"Pure data set empirical covariance\",\n    color=\"black\",\n)\nplt.title(\"Influence of outliers on the covariance estimation\")\nplt.xlabel(\"Number of outlying points\")\nplt.ylabel(\"RMSE\")\nplt.legend(loc=\"upper center\", prop=font_prop)\n\nplt.show()" ] } ],
"metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }