{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Release Highlights for scikit-learn 1.1\n\n.. currentmodule:: sklearn\n\nWe are pleased to announce the release of scikit-learn 1.1! Many bug fixes\nand improvements were added, as well as some new key features. We detail\nbelow a few of the major features of this release. **For an exhaustive list of\nall the changes**, please refer to the `release notes `.\n\nTo install the latest version (with pip)::\n\n pip install --upgrade scikit-learn\n\nor with conda::\n\n conda install -c conda-forge scikit-learn\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n## Quantile loss in :class:`~ensemble.HistGradientBoostingRegressor`\n:class:`~ensemble.HistGradientBoostingRegressor` can model quantiles with\n`loss=\"quantile\"` and the new parameter `quantile`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\nimport numpy as np\n\nfrom sklearn.ensemble import HistGradientBoostingRegressor\n\n# Simple regression function for X * cos(X)\nrng = np.random.RandomState(42)\nX_1d = np.linspace(0, 10, num=2000)\nX = X_1d.reshape(-1, 1)\ny = X_1d * np.cos(X_1d) + rng.normal(scale=X_1d / 3)\n\nquantiles = [0.95, 0.5, 0.05]\nparameters = dict(loss=\"quantile\", max_bins=32, max_iter=50)\nhist_quantiles = {\n f\"quantile={quantile:.2f}\": HistGradientBoostingRegressor(\n **parameters, quantile=quantile\n ).fit(X, y)\n for quantile in quantiles\n}\n\nfig, ax = plt.subplots()\nax.plot(X_1d, y, \"o\", alpha=0.5, markersize=1)\nfor quantile, hist in hist_quantiles.items():\n ax.plot(X_1d, hist.predict(X), label=quantile)\n_ = ax.legend(loc=\"lower left\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a usecase example, see\n`sphx_glr_auto_examples_ensemble_plot_hgbt_regression.py`\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `get_feature_names_out` Available in all Transformers\n:term:`get_feature_names_out` is now available in all transformers, thereby\nconcluding the implementation of\n[SLEP007](https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep007/proposal.html)_.\nThis enables :class:`~pipeline.Pipeline` to construct the output feature names for\nmore complex pipelines:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.compose import ColumnTransformer\nfrom sklearn.datasets import fetch_openml\nfrom sklearn.feature_selection import SelectKBest\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import OneHotEncoder, StandardScaler\n\nX, y = fetch_openml(\n \"titanic\", version=1, as_frame=True, return_X_y=True, parser=\"pandas\"\n)\nnumeric_features = [\"age\", \"fare\"]\nnumeric_transformer = make_pipeline(SimpleImputer(strategy=\"median\"), StandardScaler())\ncategorical_features = [\"embarked\", \"pclass\"]\n\npreprocessor = ColumnTransformer(\n [\n (\"num\", numeric_transformer, numeric_features),\n (\n \"cat\",\n OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False),\n categorical_features,\n ),\n ],\n verbose_feature_names_out=False,\n)\nlog_reg = make_pipeline(preprocessor, SelectKBest(k=7), LogisticRegression())\nlog_reg.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we slice the pipeline to include all the steps but the last one. The output\nfeature names of this pipeline slice are the features put into logistic\nregression. These names correspond directly to the coefficients in the logistic\nregression:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n\nlog_reg_input_features = log_reg[:-1].get_feature_names_out()\npd.Series(log_reg[-1].coef_.ravel(), index=log_reg_input_features).plot.bar()\nplt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grouping infrequent categories in :class:`~preprocessing.OneHotEncoder`\n:class:`~preprocessing.OneHotEncoder` supports aggregating infrequent\ncategories into a single output for each feature. The parameters to enable\nthe gathering of infrequent categories are `min_frequency` and\n`max_categories`. See the `User Guide `\nfor more details.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nfrom sklearn.preprocessing import OneHotEncoder\n\nX = np.array(\n [[\"dog\"] * 5 + [\"cat\"] * 20 + [\"rabbit\"] * 10 + [\"snake\"] * 3], dtype=object\n).T\nenc = OneHotEncoder(min_frequency=6, sparse_output=False).fit(X)\nenc.infrequent_categories_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since dog and snake are infrequent categories, they are grouped together when\ntransformed:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "encoded = enc.transform(np.array([[\"dog\"], [\"snake\"], [\"cat\"], [\"rabbit\"]]))\npd.DataFrame(encoded, columns=enc.get_feature_names_out())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performance improvements\nReductions on pairwise distances for dense float64 datasets has been refactored\nto better take advantage of non-blocking thread parallelism. For example,\n:meth:`neighbors.NearestNeighbors.kneighbors` and\n:meth:`neighbors.NearestNeighbors.radius_neighbors` can respectively be up to \u00d720 and\n\u00d75 faster than previously. In summary, the following functions and estimators\nnow benefit from improved performance:\n\n- :func:`metrics.pairwise_distances_argmin`\n- :func:`metrics.pairwise_distances_argmin_min`\n- :class:`cluster.AffinityPropagation`\n- :class:`cluster.Birch`\n- :class:`cluster.MeanShift`\n- :class:`cluster.OPTICS`\n- :class:`cluster.SpectralClustering`\n- :func:`feature_selection.mutual_info_regression`\n- :class:`neighbors.KNeighborsClassifier`\n- :class:`neighbors.KNeighborsRegressor`\n- :class:`neighbors.RadiusNeighborsClassifier`\n- :class:`neighbors.RadiusNeighborsRegressor`\n- :class:`neighbors.LocalOutlierFactor`\n- :class:`neighbors.NearestNeighbors`\n- :class:`manifold.Isomap`\n- :class:`manifold.LocallyLinearEmbedding`\n- :class:`manifold.TSNE`\n- :func:`manifold.trustworthiness`\n- :class:`semi_supervised.LabelPropagation`\n- :class:`semi_supervised.LabelSpreading`\n\nTo know more about the technical details of this work, you can read\n[this suite of blog posts](https://blog.scikit-learn.org/technical/performances/).\n\nMoreover, the computation of loss functions has been refactored using\nCython resulting in performance improvements for the following estimators:\n\n- :class:`linear_model.LogisticRegression`\n- :class:`linear_model.GammaRegressor`\n- :class:`linear_model.PoissonRegressor`\n- :class:`linear_model.TweedieRegressor`\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## :class:`~decomposition.MiniBatchNMF`: an online version of NMF\nThe new class :class:`~decomposition.MiniBatchNMF` implements a faster but\nless accurate version of non-negative matrix factorization\n(:class:`~decomposition.NMF`). :class:`~decomposition.MiniBatchNMF` divides the\ndata into mini-batches and optimizes the NMF model in an online manner by\ncycling over the mini-batches, making it better suited for large datasets. In\nparticular, it implements `partial_fit`, which can be used for online\nlearning when the data is not readily available from the start, or when the\ndata does not fit into memory.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nfrom sklearn.decomposition import MiniBatchNMF\n\nrng = np.random.RandomState(0)\nn_samples, n_features, n_components = 10, 10, 5\ntrue_W = rng.uniform(size=(n_samples, n_components))\ntrue_H = rng.uniform(size=(n_components, n_features))\nX = true_W @ true_H\n\nnmf = MiniBatchNMF(n_components=n_components, random_state=0)\n\nfor _ in range(10):\n nmf.partial_fit(X)\n\nW = nmf.transform(X)\nH = nmf.components_\nX_reconstructed = W @ H\n\nprint(\n \"relative reconstruction error: \",\n f\"{np.sum((X - X_reconstructed) ** 2) / np.sum(X**2):.5f}\",\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## :class:`~cluster.BisectingKMeans`: divide and cluster\nThe new class :class:`~cluster.BisectingKMeans` is a variant of\n:class:`~cluster.KMeans`, using divisive hierarchical clustering. Instead of\ncreating all centroids at once, centroids are picked progressively based on a\nprevious clustering: a cluster is split into two new clusters repeatedly\nuntil the target number of clusters is reached, giving a hierarchical\nstructure to the clustering.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nfrom sklearn.cluster import BisectingKMeans, KMeans\nfrom sklearn.datasets import make_blobs\n\nX, _ = make_blobs(n_samples=1000, centers=2, random_state=0)\n\nkm = KMeans(n_clusters=5, random_state=0, n_init=\"auto\").fit(X)\nbisect_km = BisectingKMeans(n_clusters=5, random_state=0).fit(X)\n\nfig, ax = plt.subplots(1, 2, figsize=(10, 5))\nax[0].scatter(X[:, 0], X[:, 1], s=10, c=km.labels_)\nax[0].scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s=20, c=\"r\")\nax[0].set_title(\"KMeans\")\n\nax[1].scatter(X[:, 0], X[:, 1], s=10, c=bisect_km.labels_)\nax[1].scatter(\n bisect_km.cluster_centers_[:, 0], bisect_km.cluster_centers_[:, 1], s=20, c=\"r\"\n)\n_ = ax[1].set_title(\"BisectingKMeans\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.14" } }, "nbformat": 4, "nbformat_minor": 0 }