{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Theil-Sen Regression\n\nComputes a Theil-Sen Regression on a synthetic dataset.\n\nSee `theil_sen_regression` for more information on the regressor.\n\nCompared to the OLS (ordinary least squares) estimator, the Theil-Sen\nestimator is robust against outliers. It has a breakdown point of about 29.3%\nin case of a simple linear regression which means that it can tolerate\narbitrary corrupted data (outliers) of up to 29.3% in the two-dimensional\ncase.\n\nThe estimation of the model is done by calculating the slopes and intercepts\nof a subpopulation of all possible combinations of p subsample points. If an\nintercept is fitted, p must be greater than or equal to n_features + 1. The\nfinal slope and intercept is then defined as the spatial median of these\nslopes and intercepts.\n\nIn certain cases Theil-Sen performs better than `RANSAC\n` which is also a robust method. This is illustrated in the\nsecond example below where outliers with respect to the x-axis perturb RANSAC.\nTuning the ``residual_threshold`` parameter of RANSAC remedies this but in\ngeneral a priori knowledge about the data and the nature of the outliers is\nneeded.\nDue to the computational complexity of Theil-Sen it is recommended to use it\nonly for small problems in terms of number of samples and features. For larger\nproblems the ``max_subpopulation`` parameter restricts the magnitude of all\npossible combinations of p subsample points to a randomly chosen subset and\ntherefore also limits the runtime. Therefore, Theil-Sen is applicable to larger\nproblems with the drawback of losing some of its mathematical properties since\nit then works on a random subset.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport time\n\nimport matplotlib.pyplot as plt\nimport numpy as np\n\nfrom sklearn.linear_model import LinearRegression, RANSACRegressor, TheilSenRegressor\n\nestimators = [\n (\"OLS\", LinearRegression()),\n (\"Theil-Sen\", TheilSenRegressor(random_state=42)),\n (\"RANSAC\", RANSACRegressor(random_state=42)),\n]\ncolors = {\"OLS\": \"turquoise\", \"Theil-Sen\": \"gold\", \"RANSAC\": \"lightgreen\"}\nlw = 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Outliers only in the y direction\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "np.random.seed(0)\nn_samples = 200\n# Linear model y = 3*x + N(2, 0.1**2)\nx = np.random.randn(n_samples)\nw = 3.0\nc = 2.0\nnoise = 0.1 * np.random.randn(n_samples)\ny = w * x + c + noise\n# 10% outliers\ny[-20:] += -20 * x[-20:]\nX = x[:, np.newaxis]\n\nplt.scatter(x, y, color=\"indigo\", marker=\"x\", s=40)\nline_x = np.array([-3, 3])\nfor name, estimator in estimators:\n t0 = time.time()\n estimator.fit(X, y)\n elapsed_time = time.time() - t0\n y_pred = estimator.predict(line_x.reshape(2, 1))\n plt.plot(\n line_x,\n y_pred,\n color=colors[name],\n linewidth=lw,\n label=\"%s (fit time: %.2fs)\" % (name, elapsed_time),\n )\n\nplt.axis(\"tight\")\nplt.legend(loc=\"upper left\")\n_ = plt.title(\"Corrupt y\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Outliers in the X direction\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "np.random.seed(0)\n# Linear model y = 3*x + N(2, 0.1**2)\nx = np.random.randn(n_samples)\nnoise = 0.1 * np.random.randn(n_samples)\ny = 3 * x + 2 + noise\n# 10% outliers\nx[-20:] = 9.9\ny[-20:] += 22\nX = x[:, np.newaxis]\n\nplt.figure()\nplt.scatter(x, y, color=\"indigo\", marker=\"x\", s=40)\n\nline_x = np.array([-3, 10])\nfor name, estimator in estimators:\n t0 = time.time()\n estimator.fit(X, y)\n elapsed_time = time.time() - t0\n y_pred = estimator.predict(line_x.reshape(2, 1))\n plt.plot(\n line_x,\n y_pred,\n color=colors[name],\n linewidth=lw,\n label=\"%s (fit time: %.2fs)\" % (name, elapsed_time),\n )\n\nplt.axis(\"tight\")\nplt.legend(loc=\"upper left\")\nplt.title(\"Corrupt x\")\nplt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }