{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Release Highlights for scikit-learn 0.23\n\n.. currentmodule:: sklearn\n\nWe are pleased to announce the release of scikit-learn 0.23! Many bug fixes\nand improvements were added, as well as some new key features. We detail\nbelow a few of the major features of this release. **For an exhaustive list of\nall the changes**, please refer to the `release notes `.\n\nTo install the latest version (with pip)::\n\n pip install --upgrade scikit-learn\n\nor with conda::\n\n conda install -c conda-forge scikit-learn\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generalized Linear Models, and Poisson loss for gradient boosting\nLong-awaited Generalized Linear Models with non-normal loss functions are now\navailable. In particular, three new regressors were implemented:\n:class:`~sklearn.linear_model.PoissonRegressor`,\n:class:`~sklearn.linear_model.GammaRegressor`, and\n:class:`~sklearn.linear_model.TweedieRegressor`. The Poisson regressor can be\nused to model positive integer counts, or relative frequencies. Read more in\nthe `User Guide `. Additionally,\n:class:`~sklearn.ensemble.HistGradientBoostingRegressor` supports a new\n'poisson' loss as well.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.linear_model import PoissonRegressor\nfrom sklearn.ensemble import HistGradientBoostingRegressor\n\nn_samples, n_features = 1000, 20\nrng = np.random.RandomState(0)\nX = rng.randn(n_samples, n_features)\n# positive integer target correlated with X[:, 5] with many zeros:\ny = rng.poisson(lam=np.exp(X[:, 5]) / 2)\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)\nglm = PoissonRegressor()\ngbdt = HistGradientBoostingRegressor(loss=\"poisson\", learning_rate=0.01)\nglm.fit(X_train, y_train)\ngbdt.fit(X_train, y_train)\nprint(glm.score(X_test, y_test))\nprint(gbdt.score(X_test, y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Rich visual representation of estimators\nEstimators can now be visualized in notebooks by enabling the\n`display='diagram'` option. This is particularly useful to summarise the\nstructure of pipelines and other composite estimators, with interactivity to\nprovide detail. Click on the example image below to expand Pipeline\nelements. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Rich visual representation of estimators\nEstimators can now be visualized in notebooks by enabling the\n`display='diagram'` option. This is particularly useful to summarise the\nstructure of pipelines and other composite estimators, with interactivity to\nprovide detail. Click on the example image below to expand Pipeline\nelements. See `visualizing_composite_estimators` for how you can use\nthis feature.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn import set_config\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import OneHotEncoder, StandardScaler\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.compose import make_column_transformer\nfrom sklearn.linear_model import LogisticRegression\n\nset_config(display=\"diagram\")\n\nnum_proc = make_pipeline(SimpleImputer(strategy=\"median\"), StandardScaler())\n\ncat_proc = make_pipeline(\n    SimpleImputer(strategy=\"constant\", fill_value=\"missing\"),\n    OneHotEncoder(handle_unknown=\"ignore\"),\n)\n\npreprocessor = make_column_transformer(\n    (num_proc, (\"feat1\", \"feat3\")), (cat_proc, (\"feat0\", \"feat2\"))\n)\n\nclf = make_pipeline(preprocessor, LogisticRegression())\nclf" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Scalability and stability improvements to KMeans\nThe :class:`~sklearn.cluster.KMeans` estimator was entirely re-worked, and it\nis now significantly faster and more stable. In addition, the Elkan algorithm\nis now compatible with sparse matrices. The estimator uses OpenMP-based\nparallelism instead of relying on joblib, so the `n_jobs` parameter has no\neffect anymore. For more details on how to control the number of threads,\nplease refer to our `parallelism` notes.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import scipy.sparse\nimport numpy as np\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.cluster import KMeans\nfrom sklearn.datasets import make_blobs\nfrom sklearn.metrics import completeness_score\n\nrng = np.random.RandomState(0)\nX, y = make_blobs(random_state=rng)\nX = scipy.sparse.csr_matrix(X)\nX_train, X_test, _, y_test = train_test_split(X, y, random_state=rng)\nkmeans = KMeans(n_init=\"auto\").fit(X_train)\n# completeness_score expects (labels_true, labels_pred):\nprint(completeness_score(y_test, kmeans.predict(X_test)))" ] },
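{ "cell_type": "markdown", "metadata": {}, "source": [ "As a brief sketch of thread control, not part of the original release example:\nscikit-learn depends on the `threadpoolctl` package, which can cap the OpenMP\nthread pool at runtime. Setting the `OMP_NUM_THREADS` environment variable\nbefore starting Python achieves the same effect.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from threadpoolctl import threadpool_limits\nfrom sklearn.cluster import KMeans\nfrom sklearn.datasets import make_blobs\n\nX, _ = make_blobs(random_state=0)\n\n# Cap OpenMP thread pools at 2 threads for the duration of the block.\nwith threadpool_limits(limits=2, user_api=\"openmp\"):\n    KMeans(n_clusters=3, n_init=\"auto\").fit(X)" ] },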
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Improvements to the histogram-based Gradient Boosting estimators\nVarious improvements were made to\n:class:`~sklearn.ensemble.HistGradientBoostingClassifier` and\n:class:`~sklearn.ensemble.HistGradientBoostingRegressor`. On top of the\nPoisson loss mentioned above, these estimators now support `sample weights`.\nAlso, an automatic early-stopping criterion was added: early stopping is\nenabled by default when the number of samples exceeds 10k. Finally, users can\nnow define `monotonic constraints` to constrain the predictions based on the\nvariations of specific features. In the following example, we construct a\ntarget that is generally positively correlated with the first feature, with\nsome noise. Applying monotonic constraints allows the prediction to capture\nthe global effect of the first feature, instead of fitting the noise. For a\nuse case example, see\n`sphx_glr_auto_examples_ensemble_plot_hgbt_regression.py`.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\nfrom matplotlib import pyplot as plt\n\n# plot_partial_dependence was removed in version 1.2; use\n# PartialDependenceDisplay instead.\nfrom sklearn.inspection import PartialDependenceDisplay\nfrom sklearn.ensemble import HistGradientBoostingRegressor\n\nn_samples = 500\nrng = np.random.RandomState(0)\nX = rng.randn(n_samples, 2)\nnoise = rng.normal(loc=0.0, scale=0.01, size=n_samples)\ny = 5 * X[:, 0] + np.sin(10 * np.pi * X[:, 0]) - noise\n\ngbdt_no_cst = HistGradientBoostingRegressor().fit(X, y)\n# monotonic_cst=[1, 0]: predictions must be increasing in feature 0,\n# unconstrained in feature 1.\ngbdt_cst = HistGradientBoostingRegressor(monotonic_cst=[1, 0]).fit(X, y)\n\ndisp = PartialDependenceDisplay.from_estimator(\n    gbdt_no_cst,\n    X,\n    features=[0],\n    feature_names=[\"feature 0\"],\n    line_kw={\"linewidth\": 4, \"label\": \"unconstrained\", \"color\": \"tab:blue\"},\n)\nPartialDependenceDisplay.from_estimator(\n    gbdt_cst,\n    X,\n    features=[0],\n    line_kw={\"linewidth\": 4, \"label\": \"constrained\", \"color\": \"tab:orange\"},\n    ax=disp.axes_,\n)\ndisp.axes_[0, 0].plot(\n    X[:, 0], y, \"o\", alpha=0.5, zorder=-1, label=\"samples\", color=\"tab:green\"\n)\ndisp.axes_[0, 0].set_ylim(-3, 3)\ndisp.axes_[0, 0].set_xlim(-1, 1)\nplt.legend()\nplt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Sample-weight support for Lasso and ElasticNet\nThe two linear regressors :class:`~sklearn.linear_model.Lasso` and\n:class:`~sklearn.linear_model.ElasticNet` now support sample weights, as\nshown below for both.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\nfrom sklearn.datasets import make_regression\nfrom sklearn.linear_model import ElasticNet, Lasso\nimport numpy as np\n\nn_samples, n_features = 1000, 20\nrng = np.random.RandomState(0)\nX, y = make_regression(n_samples, n_features, random_state=rng)\nsample_weight = rng.rand(n_samples)\nX_train, X_test, y_train, y_test, sw_train, sw_test = train_test_split(\n    X, y, sample_weight, random_state=rng\n)\nreg = Lasso()\nreg.fit(X_train, y_train, sample_weight=sw_train)\nprint(reg.score(X_test, y_test, sample_weight=sw_test))\n# ElasticNet accepts sample weights through the same fit API:\nreg = ElasticNet()\nreg.fit(X_train, y_train, sample_weight=sw_train)\nprint(reg.score(X_test, y_test, sample_weight=sw_test))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }