{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Quantile regression\n\nThis example illustrates how quantile regression can predict non-trivial\nconditional quantiles.\n\nThe left figure shows the case when the error distribution is normal,\nbut has non-constant variance, i.e. with heteroscedasticity.\n\nThe right figure shows an example of an asymmetric error distribution,\nnamely the Pareto distribution.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset generation\n\nTo illustrate the behaviour of quantile regression, we will generate two\nsynthetic datasets. The true generative random processes for both datasets\nwill be composed by the same expected value with a linear relationship with a\nsingle feature `x`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nrng = np.random.RandomState(42)\nx = np.linspace(start=0, stop=10, num=100)\nX = x[:, np.newaxis]\ny_true_mean = 10 + 0.5 * x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will create two subsequent problems by changing the distribution of the\ntarget `y` while keeping the same expected value:\n\n- in the first case, a heteroscedastic Normal noise is added;\n- in the second case, an asymmetric Pareto noise is added.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y_normal = y_true_mean + rng.normal(loc=0, scale=0.5 + 0.5 * x, size=x.shape[0])\na = 5\ny_pareto = y_true_mean + 10 * (rng.pareto(a, size=x.shape[0]) - 1 / (a - 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's first visualize the datasets as well as the distribution of the\nresiduals `y - mean(y)`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\n_, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 11), sharex=\"row\", sharey=\"row\")\n\naxs[0, 0].plot(x, y_true_mean, label=\"True mean\")\naxs[0, 0].scatter(x, y_normal, color=\"black\", alpha=0.5, label=\"Observations\")\naxs[1, 0].hist(y_true_mean - y_normal, edgecolor=\"black\")\n\n\naxs[0, 1].plot(x, y_true_mean, label=\"True mean\")\naxs[0, 1].scatter(x, y_pareto, color=\"black\", alpha=0.5, label=\"Observations\")\naxs[1, 1].hist(y_true_mean - y_pareto, edgecolor=\"black\")\n\naxs[0, 0].set_title(\"Dataset with heteroscedastic Normal distributed targets\")\naxs[0, 1].set_title(\"Dataset with asymmetric Pareto distributed target\")\naxs[1, 0].set_title(\n \"Residuals distribution for heteroscedastic Normal distributed targets\"\n)\naxs[1, 1].set_title(\"Residuals distribution for asymmetric Pareto distributed target\")\naxs[0, 0].legend()\naxs[0, 1].legend()\naxs[0, 0].set_ylabel(\"y\")\naxs[1, 0].set_ylabel(\"Counts\")\naxs[0, 1].set_xlabel(\"x\")\naxs[0, 0].set_xlabel(\"x\")\naxs[1, 0].set_xlabel(\"Residuals\")\n_ = axs[1, 1].set_xlabel(\"Residuals\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the heteroscedastic Normal distributed target, we observe that the\nvariance of the noise is increasing when the value of the feature `x` is\nincreasing.\n\nWith the asymmetric Pareto distributed target, we observe that the positive\nresiduals are 
{ "cell_type": "markdown", "metadata": {}, "source": [ "These types of noisy targets make the estimation via\n:class:`~sklearn.linear_model.LinearRegression` less efficient, i.e. we need\nmore data to get stable results and, in addition, large outliers can have a\nhuge impact on the fitted coefficients. (Stated otherwise: in a setting with\nconstant variance, ordinary least squares estimators converge much faster to\nthe *true* coefficients with increasing sample size.)\n\nIn this asymmetric setting, the median or other quantiles give additional\ninsights. On top of that, median estimation is much more robust to outliers\nand heavy-tailed distributions. But note that extreme quantiles are estimated\nfrom very few data points: the 95% quantile is, more or less, estimated from\nthe 5% largest values and is therefore also somewhat sensitive to outliers.\n\nIn the remainder of this tutorial, we will show how\n:class:`~sklearn.linear_model.QuantileRegressor` can be used in practice and\ngive some intuition into the properties of the fitted models. Finally,\nwe will compare both :class:`~sklearn.linear_model.QuantileRegressor`\nand :class:`~sklearn.linear_model.LinearRegression`.\n\n## Fitting a `QuantileRegressor`\n\nIn this section, we want to estimate the conditional median as well as\na low and high quantile fixed at 5% and 95%, respectively. Thus, we will get\nthree linear models, one for each quantile.\n\nWe will use the quantiles at 5% and 95% to find the outliers in the training\nsample beyond the central 90% interval.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import QuantileRegressor\n\nquantiles = [0.05, 0.5, 0.95]\npredictions = {}\nout_bounds_predictions = np.zeros_like(y_true_mean, dtype=np.bool_)\nfor quantile in quantiles:\n    qr = QuantileRegressor(quantile=quantile, alpha=0)\n    y_pred = qr.fit(X, y_normal).predict(X)\n    predictions[quantile] = y_pred\n\n    if quantile == min(quantiles):\n        out_bounds_predictions = np.logical_or(\n            out_bounds_predictions, y_pred >= y_normal\n        )\n    elif quantile == max(quantiles):\n        out_bounds_predictions = np.logical_or(\n            out_bounds_predictions, y_pred <= y_normal\n        )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can plot the three linear models and distinguish the samples that\nare within the central 90% interval from the samples that are outside this\ninterval.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.plot(X, y_true_mean, color=\"black\", linestyle=\"dashed\", label=\"True mean\")\n\nfor quantile, y_pred in predictions.items():\n    plt.plot(X, y_pred, label=f\"Quantile: {quantile}\")\n\nplt.scatter(\n    x[out_bounds_predictions],\n    y_normal[out_bounds_predictions],\n    color=\"black\",\n    marker=\"+\",\n    alpha=0.5,\n    label=\"Outside interval\",\n)\nplt.scatter(\n    x[~out_bounds_predictions],\n    y_normal[~out_bounds_predictions],\n    color=\"black\",\n    alpha=0.5,\n    label=\"Inside interval\",\n)\n\nplt.legend()\nplt.xlabel(\"x\")\nplt.ylabel(\"y\")\n_ = plt.title(\"Quantiles of heteroscedastic Normal distributed target\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the noise is still Normally distributed, and in particular symmetric,\nthe true conditional mean and the true conditional median coincide. Indeed,\nwe see that the estimated median almost hits the true mean. We observe the\neffect of the increasing noise variance on the 5% and 95% quantiles:\nthe slopes of those quantiles are very different and the interval between\nthem becomes wider with increasing `x`.\n\nTo get additional intuition regarding the meaning of the 5% and 95%\nquantile estimators, one can count the number of samples above and below the\npredicted quantiles (represented by a cross on the above plot), considering\nthat we have a total of 100 samples. This count is sketched below.\n\n" ] },
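{ "cell_type": "markdown", "metadata": {}, "source": [ "The following minimal check performs this count, reusing the `predictions`\ncomputed above for the Normal data. By construction of the 5% and 95%\nquantile regressions, we expect roughly 5 samples on each side, although the\nexact counts can differ slightly.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Count the training samples falling below the fitted 5% quantile line and\n# above the fitted 95% quantile line. With 100 samples, roughly 5 samples are\n# expected on each side.\nn_below_low = int((y_normal < predictions[0.05]).sum())\nn_above_high = int((y_normal > predictions[0.95]).sum())\nprint(f\"Samples below the 5% quantile line: {n_below_low}\")\nprint(f\"Samples above the 95% quantile line: {n_above_high}\")" ] },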
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can repeat the same experiment using the asymmetric Pareto distributed\ntarget.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "quantiles = [0.05, 0.5, 0.95]\npredictions = {}\nout_bounds_predictions = np.zeros_like(y_true_mean, dtype=np.bool_)\nfor quantile in quantiles:\n    qr = QuantileRegressor(quantile=quantile, alpha=0)\n    y_pred = qr.fit(X, y_pareto).predict(X)\n    predictions[quantile] = y_pred\n\n    if quantile == min(quantiles):\n        out_bounds_predictions = np.logical_or(\n            out_bounds_predictions, y_pred >= y_pareto\n        )\n    elif quantile == max(quantiles):\n        out_bounds_predictions = np.logical_or(\n            out_bounds_predictions, y_pred <= y_pareto\n        )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.plot(X, y_true_mean, color=\"black\", linestyle=\"dashed\", label=\"True mean\")\n\nfor quantile, y_pred in predictions.items():\n    plt.plot(X, y_pred, label=f\"Quantile: {quantile}\")\n\nplt.scatter(\n    x[out_bounds_predictions],\n    y_pareto[out_bounds_predictions],\n    color=\"black\",\n    marker=\"+\",\n    alpha=0.5,\n    label=\"Outside interval\",\n)\nplt.scatter(\n    x[~out_bounds_predictions],\n    y_pareto[~out_bounds_predictions],\n    color=\"black\",\n    alpha=0.5,\n    label=\"Inside interval\",\n)\n\nplt.legend()\nplt.xlabel(\"x\")\nplt.ylabel(\"y\")\n_ = plt.title(\"Quantiles of asymmetric Pareto distributed target\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Due to the asymmetry of the distribution of the noise, we observe that the\ntrue mean and the estimated conditional median are different. We also observe\nthat each quantile model has different parameters to better fit the desired\nquantile. Note that ideally, all quantiles would be parallel in this case,\nwhich would become more visible with more data points or less extreme\nquantiles, e.g. 10% and 90%.\n\n## Comparing `QuantileRegressor` and `LinearRegression`\n\nIn this section, we will focus on the difference between the errors that\n:class:`~sklearn.linear_model.QuantileRegressor` and\n:class:`~sklearn.linear_model.LinearRegression` minimize.\n\nIndeed, :class:`~sklearn.linear_model.LinearRegression` is a least squares\napproach minimizing the mean squared error (MSE) between the training and\npredicted targets. In contrast,\n:class:`~sklearn.linear_model.QuantileRegressor` with `quantile=0.5`\nminimizes the mean absolute error (MAE) instead.\n\n" ] },
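{ "cell_type": "markdown", "metadata": {}, "source": [ "More precisely, :class:`~sklearn.linear_model.QuantileRegressor` minimizes the\npinball loss, which for `quantile=0.5` equals half of the absolute error, so\nminimizing it is equivalent to minimizing the MAE. The sketch below uses a few\nmade-up toy values (not taken from the datasets above) to illustrate this\nrelationship with :func:`~sklearn.metrics.mean_pinball_loss`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.metrics import mean_absolute_error, mean_pinball_loss\n\n# Toy values, only used to illustrate the link between the two losses.\ny_true_toy = np.array([1.0, 2.0, 3.0, 4.0])\ny_pred_toy = np.array([1.5, 1.5, 3.5, 5.0])\n\nmae_toy = mean_absolute_error(y_true_toy, y_pred_toy)\npinball_toy = mean_pinball_loss(y_true_toy, y_pred_toy, alpha=0.5)\nprint(f\"MAE                           = {mae_toy:.4f}\")\nprint(f\"mean pinball loss (alpha=0.5) = {pinball_toy:.4f} (= MAE / 2)\")" ] },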
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's first compute the training errors of such models in terms of mean\nsquared error and mean absolute error. We will use the asymmetric Pareto\ndistributed target to make the comparison more interesting, since the mean\nand median are not equal.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\nfrom sklearn.metrics import mean_absolute_error, mean_squared_error\n\nlinear_regression = LinearRegression()\nquantile_regression = QuantileRegressor(quantile=0.5, alpha=0)\n\ny_pred_lr = linear_regression.fit(X, y_pareto).predict(X)\ny_pred_qr = quantile_regression.fit(X, y_pareto).predict(X)\n\nprint(\n    f\"\"\"Training error (in-sample performance)\n    {linear_regression.__class__.__name__}:\n    MAE = {mean_absolute_error(y_pareto, y_pred_lr):.3f}\n    MSE = {mean_squared_error(y_pareto, y_pred_lr):.3f}\n    {quantile_regression.__class__.__name__}:\n    MAE = {mean_absolute_error(y_pareto, y_pred_qr):.3f}\n    MSE = {mean_squared_error(y_pareto, y_pred_qr):.3f}\n    \"\"\"\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the training set, we see that MAE is lower for\n:class:`~sklearn.linear_model.QuantileRegressor` than for\n:class:`~sklearn.linear_model.LinearRegression`. In contrast, MSE is\nlower for :class:`~sklearn.linear_model.LinearRegression` than for\n:class:`~sklearn.linear_model.QuantileRegressor`. These results confirm that\nMAE is the loss minimized by :class:`~sklearn.linear_model.QuantileRegressor`\nwhile MSE is the loss minimized by\n:class:`~sklearn.linear_model.LinearRegression`.\n\nWe can make a similar evaluation by looking at the test error obtained by\ncross-validation.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.model_selection import cross_validate\n\ncv_results_lr = cross_validate(\n    linear_regression,\n    X,\n    y_pareto,\n    cv=3,\n    scoring=[\"neg_mean_absolute_error\", \"neg_mean_squared_error\"],\n)\ncv_results_qr = cross_validate(\n    quantile_regression,\n    X,\n    y_pareto,\n    cv=3,\n    scoring=[\"neg_mean_absolute_error\", \"neg_mean_squared_error\"],\n)\nprint(\n    f\"\"\"Test error (cross-validated performance)\n    {linear_regression.__class__.__name__}:\n    MAE = {-cv_results_lr[\"test_neg_mean_absolute_error\"].mean():.3f}\n    MSE = {-cv_results_lr[\"test_neg_mean_squared_error\"].mean():.3f}\n    {quantile_regression.__class__.__name__}:\n    MAE = {-cv_results_qr[\"test_neg_mean_absolute_error\"].mean():.3f}\n    MSE = {-cv_results_qr[\"test_neg_mean_squared_error\"].mean():.3f}\n    \"\"\"\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We reach similar conclusions on the out-of-sample evaluation.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }