{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Using KBinsDiscretizer to discretize continuous features\n\nThe example compares prediction result of linear regression (linear model)\nand decision tree (tree based model) with and without discretization of\nreal-valued features.\n\nAs is shown in the result before discretization, linear model is fast to\nbuild and relatively straightforward to interpret, but can only model\nlinear relationships, while decision tree can build a much more complex model\nof the data. One way to make linear model more powerful on continuous data\nis to use discretization (also known as binning). In the example, we\ndiscretize the feature and one-hot encode the transformed data. Note that if\nthe bins are not reasonably wide, there would appear to be a substantially\nincreased risk of overfitting, so the discretizer parameters should usually\nbe tuned under cross validation.\n\nAfter discretization, linear regression and decision tree make exactly the\nsame prediction. As features are constant within each bin, any model must\npredict the same value for all points within a bin. Compared with the result\nbefore discretization, linear model become much more flexible while decision\ntree gets much less flexible. Note that binning features generally has no\nbeneficial effect for tree-based models, as these models can learn to split\nup the data anywhere.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport matplotlib.pyplot as plt\nimport numpy as np\n\nfrom sklearn.linear_model import LinearRegression\nfrom sklearn.preprocessing import KBinsDiscretizer\nfrom sklearn.tree import DecisionTreeRegressor\n\n# construct the dataset\nrnd = np.random.RandomState(42)\nX = rnd.uniform(-3, 3, size=100)\ny = np.sin(X) + rnd.normal(size=len(X)) / 3\nX = X.reshape(-1, 1)\n\n# transform the dataset with KBinsDiscretizer\nenc = KBinsDiscretizer(n_bins=10, encode=\"onehot\")\nX_binned = enc.fit_transform(X)\n\n# predict with original dataset\nfig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True, figsize=(10, 4))\nline = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)\nreg = LinearRegression().fit(X, y)\nax1.plot(line, reg.predict(line), linewidth=2, color=\"green\", label=\"linear regression\")\nreg = DecisionTreeRegressor(min_samples_split=3, random_state=0).fit(X, y)\nax1.plot(line, reg.predict(line), linewidth=2, color=\"red\", label=\"decision tree\")\nax1.plot(X[:, 0], y, \"o\", c=\"k\")\nax1.legend(loc=\"best\")\nax1.set_ylabel(\"Regression output\")\nax1.set_xlabel(\"Input feature\")\nax1.set_title(\"Result before discretization\")\n\n# predict with transformed dataset\nline_binned = enc.transform(line)\nreg = LinearRegression().fit(X_binned, y)\nax2.plot(\n line,\n reg.predict(line_binned),\n linewidth=2,\n color=\"green\",\n linestyle=\"-\",\n label=\"linear regression\",\n)\nreg = DecisionTreeRegressor(min_samples_split=3, random_state=0).fit(X_binned, y)\nax2.plot(\n line,\n reg.predict(line_binned),\n linewidth=2,\n color=\"red\",\n linestyle=\":\",\n label=\"decision tree\",\n)\nax2.plot(X[:, 0], y, \"o\", c=\"k\")\nax2.vlines(enc.bin_edges_[0], *plt.gca().get_ylim(), linewidth=1, alpha=0.2)\nax2.legend(loc=\"best\")\nax2.set_xlabel(\"Input feature\")\nax2.set_title(\"Result after discretization\")\n\nplt.tight_layout()\nplt.show()" ] } ], "metadata": { "kernelspec": { "display_name": 
"Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }