{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Release Highlights for scikit-learn 1.1\n\n.. currentmodule:: sklearn\n\nWe are pleased to announce the release of scikit-learn 1.1! Many bug fixes\nand improvements were added, as well as some new key features. We detail\nbelow a few of the major features of this release. **For an exhaustive list of\nall the changes**, please refer to the `release notes `.\n\nTo install the latest version (with pip)::\n\n pip install --upgrade scikit-learn\n\nor with conda::\n\n conda install -c conda-forge scikit-learn\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n## Quantile loss in :class:`~ensemble.HistGradientBoostingRegressor`\n:class:`~ensemble.HistGradientBoostingRegressor` can model quantiles with\n`loss=\"quantile\"` and the new parameter `quantile`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.ensemble import HistGradientBoostingRegressor\nimport numpy as np\nimport matplotlib.pyplot as plt\n\n# Simple regression function for X * cos(X)\nrng = np.random.RandomState(42)\nX_1d = np.linspace(0, 10, num=2000)\nX = X_1d.reshape(-1, 1)\ny = X_1d * np.cos(X_1d) + rng.normal(scale=X_1d / 3)\n\nquantiles = [0.95, 0.5, 0.05]\nparameters = dict(loss=\"quantile\", max_bins=32, max_iter=50)\nhist_quantiles = {\n f\"quantile={quantile:.2f}\": HistGradientBoostingRegressor(\n **parameters, quantile=quantile\n ).fit(X, y)\n for quantile in quantiles\n}\n\nfig, ax = plt.subplots()\nax.plot(X_1d, y, \"o\", alpha=0.5, markersize=1)\nfor quantile, hist in hist_quantiles.items():\n ax.plot(X_1d, hist.predict(X), label=quantile)\n_ = ax.legend(loc=\"lower left\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a usecase example, see\n`sphx_glr_auto_examples_ensemble_plot_hgbt_regression.py`\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `get_feature_names_out` Available in all Transformers\n:term:`get_feature_names_out` is now available in all Transformers. This enables\n:class:`~pipeline.Pipeline` to construct the output feature names for more complex\npipelines:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.compose import ColumnTransformer\nfrom sklearn.preprocessing import OneHotEncoder, StandardScaler\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.feature_selection import SelectKBest\nfrom sklearn.datasets import fetch_openml\nfrom sklearn.linear_model import LogisticRegression\n\nX, y = fetch_openml(\n \"titanic\", version=1, as_frame=True, return_X_y=True, parser=\"pandas\"\n)\nnumeric_features = [\"age\", \"fare\"]\nnumeric_transformer = make_pipeline(SimpleImputer(strategy=\"median\"), StandardScaler())\ncategorical_features = [\"embarked\", \"pclass\"]\n\npreprocessor = ColumnTransformer(\n [\n (\"num\", numeric_transformer, numeric_features),\n (\n \"cat\",\n OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False),\n categorical_features,\n ),\n ],\n verbose_feature_names_out=False,\n)\nlog_reg = make_pipeline(preprocessor, SelectKBest(k=7), LogisticRegression())\nlog_reg.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we slice the pipeline to include all the steps but the last one. The output\nfeature names of this pipeline slice are the features put into logistic\nregression. 
These names correspond directly to the coefficients in the logistic\nregression:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n\nlog_reg_input_features = log_reg[:-1].get_feature_names_out()\npd.Series(log_reg[-1].coef_.ravel(), index=log_reg_input_features).plot.bar()\nplt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grouping infrequent categories in :class:`~preprocessing.OneHotEncoder`\n:class:`~preprocessing.OneHotEncoder` supports aggregating infrequent\ncategories into a single output for each feature. The parameters to enable\nthe grouping of infrequent categories are `min_frequency` and\n`max_categories`. See the User Guide\nfor more details.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.preprocessing import OneHotEncoder\nimport numpy as np\n\nX = np.array(\n    [[\"dog\"] * 5 + [\"cat\"] * 20 + [\"rabbit\"] * 10 + [\"snake\"] * 3], dtype=object\n).T\nenc = OneHotEncoder(min_frequency=6, sparse_output=False).fit(X)\nenc.infrequent_categories_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since dog and snake are infrequent categories, they are grouped together when\ntransformed:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "encoded = enc.transform(np.array([[\"dog\"], [\"snake\"], [\"cat\"], [\"rabbit\"]]))\npd.DataFrame(encoded, columns=enc.get_feature_names_out())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performance improvements\nReductions on pairwise distances for dense float64 datasets have been refactored\nto better take advantage of non-blocking thread parallelism. For example,\n:meth:`neighbors.NearestNeighbors.kneighbors` and\n:meth:`neighbors.NearestNeighbors.radius_neighbors` can be up to \u00d720 and\n\u00d75 faster than previously, respectively. 
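\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a rough illustration (a minimal sketch, not a rigorous benchmark: the actual\nspeed-up depends on your machine, the dataset and the number of available threads, and\nthe random data and sizes below are arbitrary choices), you can time a `kneighbors`\nquery on a dense float64 dataset:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from time import perf_counter\n\nimport numpy as np\nfrom sklearn.neighbors import NearestNeighbors\n\nrng = np.random.RandomState(0)\nX_dense = rng.uniform(size=(10_000, 50))  # dense float64 data\n\nnn = NearestNeighbors(n_neighbors=10).fit(X_dense)\n\ntic = perf_counter()\nnn.kneighbors(X_dense)\nprint(f\"kneighbors took {perf_counter() - tic:.3f} s\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "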
In summary, the following functions and estimators\nnow benefit from improved performance:\n\n- :func:`metrics.pairwise_distances_argmin`\n- :func:`metrics.pairwise_distances_argmin_min`\n- :class:`cluster.AffinityPropagation`\n- :class:`cluster.Birch`\n- :class:`cluster.MeanShift`\n- :class:`cluster.OPTICS`\n- :class:`cluster.SpectralClustering`\n- :func:`feature_selection.mutual_info_regression`\n- :class:`neighbors.KNeighborsClassifier`\n- :class:`neighbors.KNeighborsRegressor`\n- :class:`neighbors.RadiusNeighborsClassifier`\n- :class:`neighbors.RadiusNeighborsRegressor`\n- :class:`neighbors.LocalOutlierFactor`\n- :class:`neighbors.NearestNeighbors`\n- :class:`manifold.Isomap`\n- :class:`manifold.LocallyLinearEmbedding`\n- :class:`manifold.TSNE`\n- :func:`manifold.trustworthiness`\n- :class:`semi_supervised.LabelPropagation`\n- :class:`semi_supervised.LabelSpreading`\n\nTo learn more about the technical details of this work, you can read\n[this suite of blog posts](https://blog.scikit-learn.org/technical/performances/).\n\nMoreover, the computation of loss functions has been refactored using\nCython, resulting in performance improvements for the following estimators:\n\n- :class:`linear_model.LogisticRegression`\n- :class:`linear_model.GammaRegressor`\n- :class:`linear_model.PoissonRegressor`\n- :class:`linear_model.TweedieRegressor`\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## :class:`~decomposition.MiniBatchNMF`: an online version of NMF\nThe new class :class:`~decomposition.MiniBatchNMF` implements a faster but\nless accurate version of non-negative matrix factorization\n(:class:`~decomposition.NMF`). :class:`~decomposition.MiniBatchNMF` divides the\ndata into mini-batches and optimizes the NMF model in an online manner by\ncycling over the mini-batches, making it better suited for large datasets. In\nparticular, it implements `partial_fit`, which can be used for online\nlearning when the data is not readily available from the start, or when the\ndata does not fit into memory.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\nfrom sklearn.decomposition import MiniBatchNMF\n\nrng = np.random.RandomState(0)\nn_samples, n_features, n_components = 10, 10, 5\ntrue_W = rng.uniform(size=(n_samples, n_components))\ntrue_H = rng.uniform(size=(n_components, n_features))\nX = true_W @ true_H\n\nnmf = MiniBatchNMF(n_components=n_components, random_state=0)\n\nfor _ in range(10):\n    nmf.partial_fit(X)\n\nW = nmf.transform(X)\nH = nmf.components_\nX_reconstructed = W @ H\n\nprint(\n    \"relative reconstruction error:\",\n    f\"{np.sum((X - X_reconstructed) ** 2) / np.sum(X**2):.5f}\",\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## :class:`~cluster.BisectingKMeans`: divide and cluster\nThe new class :class:`~cluster.BisectingKMeans` is a variant of\n:class:`~cluster.KMeans`, using divisive hierarchical clustering. 
Instead of\ncreating all centroids at once, centroids are picked progressively based on a\nprevious clustering: a cluster is split into two new clusters repeatedly\nuntil the target number of clusters is reached, giving a hierarchical\nstructure to the clustering.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import make_blobs\nfrom sklearn.cluster import KMeans, BisectingKMeans\nimport matplotlib.pyplot as plt\n\nX, _ = make_blobs(n_samples=1000, centers=2, random_state=0)\n\nkm = KMeans(n_clusters=5, random_state=0, n_init=\"auto\").fit(X)\nbisect_km = BisectingKMeans(n_clusters=5, random_state=0).fit(X)\n\nfig, ax = plt.subplots(1, 2, figsize=(10, 5))\nax[0].scatter(X[:, 0], X[:, 1], s=10, c=km.labels_)\nax[0].scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s=20, c=\"r\")\nax[0].set_title(\"KMeans\")\n\nax[1].scatter(X[:, 0], X[:, 1], s=10, c=bisect_km.labels_)\nax[1].scatter(\n bisect_km.cluster_centers_[:, 0], bisect_km.cluster_centers_[:, 1], s=20, c=\"r\"\n)\n_ = ax[1].set_title(\"BisectingKMeans\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }