{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Effect of transforming the targets in regression model\n\nIn this example, we give an overview of\n:class:`~sklearn.compose.TransformedTargetRegressor`. We use two examples\nto illustrate the benefit of transforming the targets before learning a linear\nregression model. The first example uses synthetic data while the second\nexample is based on the Ames housing data set.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause\n\nprint(__doc__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Synthetic example\n\nA synthetic random regression dataset is generated. The targets ``y`` are\nmodified by:\n\n1. translating all targets such that all entries are\n non-negative (by adding the absolute value of the lowest ``y``) and\n2. applying an exponential function to obtain non-linear\n targets which cannot be fitted using a simple linear model.\n\nTherefore, a logarithmic (`np.log1p`) and an exponential function\n(`np.expm1`) will be used to transform the targets before training a linear\nregression model and using it for prediction.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nfrom sklearn.datasets import make_regression\n\nX, y = make_regression(n_samples=10_000, noise=100, random_state=0)\ny = np.expm1((y + abs(y.min())) / 200)\ny_trans = np.log1p(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below we plot the probability density functions of the target\nbefore and after applying the logarithmic functions.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nfrom sklearn.model_selection import train_test_split\n\nf, (ax0, ax1) = plt.subplots(1, 2)\n\nax0.hist(y, bins=100, density=True)\nax0.set_xlim([0, 2000])\nax0.set_ylabel(\"Probability\")\nax0.set_xlabel(\"Target\")\nax0.set_title(\"Target distribution\")\n\nax1.hist(y_trans, bins=100, density=True)\nax1.set_ylabel(\"Probability\")\nax1.set_xlabel(\"Target\")\nax1.set_title(\"Transformed target distribution\")\n\nf.suptitle(\"Synthetic data\", y=1.05)\nplt.tight_layout()\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At first, a linear model will be applied on the original targets. Due to the\nnon-linearity, the model trained will not be precise during\nprediction. Subsequently, a logarithmic function is used to linearize the\ntargets, allowing better prediction even with a similar linear model as\nreported by the median absolute error (MedAE).\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.metrics import median_absolute_error, r2_score\n\n\ndef compute_score(y_true, y_pred):\n return {\n \"R2\": f\"{r2_score(y_true, y_pred):.3f}\",\n \"MedAE\": f\"{median_absolute_error(y_true, y_pred):.3f}\",\n }" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.compose import TransformedTargetRegressor\nfrom sklearn.linear_model import RidgeCV\nfrom sklearn.metrics import PredictionErrorDisplay\n\nf, (ax0, ax1) = plt.subplots(1, 2, sharey=True)\n\nridge_cv = RidgeCV().fit(X_train, y_train)\ny_pred_ridge = ridge_cv.predict(X_test)\n\nridge_cv_with_trans_target = TransformedTargetRegressor(\n regressor=RidgeCV(), func=np.log1p, inverse_func=np.expm1\n).fit(X_train, y_train)\ny_pred_ridge_with_trans_target = ridge_cv_with_trans_target.predict(X_test)\n\nPredictionErrorDisplay.from_predictions(\n y_test,\n y_pred_ridge,\n kind=\"actual_vs_predicted\",\n ax=ax0,\n scatter_kwargs={\"alpha\": 0.5},\n)\nPredictionErrorDisplay.from_predictions(\n y_test,\n y_pred_ridge_with_trans_target,\n kind=\"actual_vs_predicted\",\n ax=ax1,\n scatter_kwargs={\"alpha\": 0.5},\n)\n\n# Add the score in the legend of each axis\nfor ax, y_pred in zip([ax0, ax1], [y_pred_ridge, y_pred_ridge_with_trans_target]):\n for name, score in compute_score(y_test, y_pred).items():\n ax.plot([], [], \" \", label=f\"{name}={score}\")\n ax.legend(loc=\"upper left\")\n\nax0.set_title(\"Ridge regression \\n without target transformation\")\nax1.set_title(\"Ridge regression \\n with target transformation\")\nf.suptitle(\"Synthetic data\", y=1.05)\nplt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Real-world data set\n\nIn a similar manner, the Ames housing data set is used to show the impact\nof transforming the targets before learning a model. In this example, the\ntarget to be predicted is the selling price of each house.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import fetch_openml\nfrom sklearn.preprocessing import quantile_transform\n\names = fetch_openml(name=\"house_prices\", as_frame=True)\n# Keep only numeric columns\nX = ames.data.select_dtypes(np.number)\n# Remove columns with NaN or Inf values\nX = X.drop(columns=[\"LotFrontage\", \"GarageYrBlt\", \"MasVnrArea\"])\n# Let the price be in k$\ny = ames.target / 1000\ny_trans = quantile_transform(\n y.to_frame(), n_quantiles=900, output_distribution=\"normal\", copy=True\n).squeeze()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A :class:`~sklearn.preprocessing.QuantileTransformer` is used to normalize\nthe target distribution before applying a\n:class:`~sklearn.linear_model.RidgeCV` model.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "f, (ax0, ax1) = plt.subplots(1, 2)\n\nax0.hist(y, bins=100, density=True)\nax0.set_ylabel(\"Probability\")\nax0.set_xlabel(\"Target\")\nax0.set_title(\"Target distribution\")\n\nax1.hist(y_trans, bins=100, density=True)\nax1.set_ylabel(\"Probability\")\nax1.set_xlabel(\"Target\")\nax1.set_title(\"Transformed target distribution\")\n\nf.suptitle(\"Ames housing data: selling price\", y=1.05)\nplt.tight_layout()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The effect of the transformer is weaker than on the synthetic data. However,\nthe transformation results in an increase in $R^2$ and large decrease\nof the MedAE. The residual plot (predicted target - true target vs predicted\ntarget) without target transformation takes on a curved, 'reverse smile'\nshape due to residual values that vary depending on the value of predicted\ntarget. With target transformation, the shape is more linear indicating\nbetter model fit.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.preprocessing import QuantileTransformer\n\nf, (ax0, ax1) = plt.subplots(2, 2, sharey=\"row\", figsize=(6.5, 8))\n\nridge_cv = RidgeCV().fit(X_train, y_train)\ny_pred_ridge = ridge_cv.predict(X_test)\n\nridge_cv_with_trans_target = TransformedTargetRegressor(\n regressor=RidgeCV(),\n transformer=QuantileTransformer(n_quantiles=900, output_distribution=\"normal\"),\n).fit(X_train, y_train)\ny_pred_ridge_with_trans_target = ridge_cv_with_trans_target.predict(X_test)\n\n# plot the actual vs predicted values\nPredictionErrorDisplay.from_predictions(\n y_test,\n y_pred_ridge,\n kind=\"actual_vs_predicted\",\n ax=ax0[0],\n scatter_kwargs={\"alpha\": 0.5},\n)\nPredictionErrorDisplay.from_predictions(\n y_test,\n y_pred_ridge_with_trans_target,\n kind=\"actual_vs_predicted\",\n ax=ax0[1],\n scatter_kwargs={\"alpha\": 0.5},\n)\n\n# Add the score in the legend of each axis\nfor ax, y_pred in zip([ax0[0], ax0[1]], [y_pred_ridge, y_pred_ridge_with_trans_target]):\n for name, score in compute_score(y_test, y_pred).items():\n ax.plot([], [], \" \", label=f\"{name}={score}\")\n ax.legend(loc=\"upper left\")\n\nax0[0].set_title(\"Ridge regression \\n without target transformation\")\nax0[1].set_title(\"Ridge regression \\n with target transformation\")\n\n# plot the residuals vs the predicted values\nPredictionErrorDisplay.from_predictions(\n y_test,\n y_pred_ridge,\n kind=\"residual_vs_predicted\",\n ax=ax1[0],\n scatter_kwargs={\"alpha\": 0.5},\n)\nPredictionErrorDisplay.from_predictions(\n y_test,\n y_pred_ridge_with_trans_target,\n kind=\"residual_vs_predicted\",\n ax=ax1[1],\n scatter_kwargs={\"alpha\": 0.5},\n)\nax1[0].set_title(\"Ridge regression \\n without target transformation\")\nax1[1].set_title(\"Ridge regression \\n with target transformation\")\n\nf.suptitle(\"Ames housing data: selling price\", y=1.05)\nplt.tight_layout()\nplt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }