{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Theil-Sen Regression\n\nComputes a Theil-Sen Regression on a synthetic dataset.\n\nSee `theil_sen_regression` for more information on the regressor.\n\nCompared to the OLS (ordinary least squares) estimator, the Theil-Sen\nestimator is robust against outliers. It has a breakdown point of about 29.3%\nin case of a simple linear regression which means that it can tolerate\narbitrary corrupted data (outliers) of up to 29.3% in the two-dimensional\ncase.\n\nThe estimation of the model is done by calculating the slopes and intercepts\nof a subpopulation of all possible combinations of p subsample points. If an\nintercept is fitted, p must be greater than or equal to n_features + 1. The\nfinal slope and intercept is then defined as the spatial median of these\nslopes and intercepts.\n\nIn certain cases Theil-Sen performs better than `RANSAC\n<ransac_regression>` which is also a robust method. This is illustrated in the\nsecond example below where outliers with respect to the x-axis perturb RANSAC.\nTuning the ``residual_threshold`` parameter of RANSAC remedies this but in\ngeneral a priori knowledge about the data and the nature of the outliers is\nneeded.\nDue to the computational complexity of Theil-Sen it is recommended to use it\nonly for small problems in terms of number of samples and features. For larger\nproblems the ``max_subpopulation`` parameter restricts the magnitude of all\npossible combinations of p subsample points to a randomly chosen subset and\ntherefore also limits the runtime. Therefore, Theil-Sen is applicable to larger\nproblems with the drawback of losing some of its mathematical properties since\nit then works on a random subset.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport time\n\nimport matplotlib.pyplot as plt\nimport numpy as np\n\nfrom sklearn.linear_model import LinearRegression, RANSACRegressor, TheilSenRegressor\n\nestimators = [\n    (\"OLS\", LinearRegression()),\n    (\"Theil-Sen\", TheilSenRegressor(random_state=42)),\n    (\"RANSAC\", RANSACRegressor(random_state=42)),\n]\ncolors = {\"OLS\": \"turquoise\", \"Theil-Sen\": \"gold\", \"RANSAC\": \"lightgreen\"}\nlw = 2"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Outliers only in the y direction\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "np.random.seed(0)\nn_samples = 200\n# Linear model y = 3*x + N(2, 0.1**2)\nx = np.random.randn(n_samples)\nw = 3.0\nc = 2.0\nnoise = 0.1 * np.random.randn(n_samples)\ny = w * x + c + noise\n# 10% outliers\ny[-20:] += -20 * x[-20:]\nX = x[:, np.newaxis]\n\nplt.scatter(x, y, color=\"indigo\", marker=\"x\", s=40)\nline_x = np.array([-3, 3])\nfor name, estimator in estimators:\n    t0 = time.time()\n    estimator.fit(X, y)\n    elapsed_time = time.time() - t0\n    y_pred = estimator.predict(line_x.reshape(2, 1))\n    plt.plot(\n        line_x,\n        y_pred,\n        color=colors[name],\n        linewidth=lw,\n        label=\"%s (fit time: %.2fs)\" % (name, elapsed_time),\n    )\n\nplt.axis(\"tight\")\nplt.legend(loc=\"upper right\")\n_ = plt.title(\"Corrupt y\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Outliers in the X direction\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "np.random.seed(0)\n# Linear model y = 3*x + N(2, 0.1**2)\nx = np.random.randn(n_samples)\nnoise = 0.1 * np.random.randn(n_samples)\ny = 3 * x + 2 + noise\n# 10% outliers\nx[-20:] = 9.9\ny[-20:] += 22\nX = x[:, np.newaxis]\n\nplt.figure()\nplt.scatter(x, y, color=\"indigo\", marker=\"x\", s=40)\n\nline_x = np.array([-3, 10])\nfor name, estimator in estimators:\n    t0 = time.time()\n    estimator.fit(X, y)\n    elapsed_time = time.time() - t0\n    y_pred = estimator.predict(line_x.reshape(2, 1))\n    plt.plot(\n        line_x,\n        y_pred,\n        color=colors[name],\n        linewidth=lw,\n        label=\"%s (fit time: %.2fs)\" % (name, elapsed_time),\n    )\n\nplt.axis(\"tight\")\nplt.legend(loc=\"upper left\")\nplt.title(\"Corrupt x\")\nplt.show()"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.14"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}