{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Release Highlights for scikit-learn 0.22\n\n.. currentmodule:: sklearn\n\nWe are pleased to announce the release of scikit-learn 0.22, which comes\nwith many bug fixes and new features! We detail below a few of the major\nfeatures of this release. For an exhaustive list of all the changes, please\nrefer to the `release notes `.\n\nTo install the latest version (with pip)::\n\n pip install --upgrade scikit-learn\n\nor with conda::\n\n conda install -c conda-forge scikit-learn\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## New plotting API\n\nA new plotting API is available for creating visualizations. This new API\nallows for quickly adjusting the visuals of a plot without involving any\nrecomputation. It is also possible to add different plots to the same\nfigure. The following example illustrates `plot_roc_curve`,\nbut other plots utilities are supported like\n`plot_partial_dependence`,\n`plot_precision_recall_curve`, and\n`plot_confusion_matrix`. Read more about this new API in the\n`User Guide `.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib\nimport matplotlib.pyplot as plt\n\nfrom sklearn.datasets import make_classification\nfrom sklearn.ensemble import RandomForestClassifier\n\n# from sklearn.metrics import plot_roc_curve\nfrom sklearn.metrics import RocCurveDisplay\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.svm import SVC\nfrom sklearn.utils.fixes import parse_version\n\nX, y = make_classification(random_state=0)\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\n\nsvc = SVC(random_state=42)\nsvc.fit(X_train, y_train)\nrfc = RandomForestClassifier(random_state=42)\nrfc.fit(X_train, y_train)\n\n# plot_roc_curve has been removed in version 1.2. From 1.2, use RocCurveDisplay instead.\n# svc_disp = plot_roc_curve(svc, X_test, y_test)\n# rfc_disp = plot_roc_curve(rfc, X_test, y_test, ax=svc_disp.ax_)\nsvc_disp = RocCurveDisplay.from_estimator(svc, X_test, y_test)\nrfc_disp = RocCurveDisplay.from_estimator(rfc, X_test, y_test, ax=svc_disp.ax_)\nrfc_disp.figure_.suptitle(\"ROC curve comparison\")\n\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stacking Classifier and Regressor\n:class:`~ensemble.StackingClassifier` and\n:class:`~ensemble.StackingRegressor`\nallow you to have a stack of estimators with a final classifier or\na regressor.\nStacked generalization consists in stacking the output of individual\nestimators and use a classifier to compute the final prediction. Stacking\nallows to use the strength of each individual estimator by using their output\nas input of a final estimator.\nBase estimators are fitted on the full ``X`` while\nthe final estimator is trained using cross-validated predictions of the\nbase estimators using ``cross_val_predict``.\n\nRead more in the `User Guide `.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import load_iris\nfrom sklearn.ensemble import StackingClassifier\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import LinearSVC\n\nX, y = load_iris(return_X_y=True)\nestimators = [\n (\"rf\", RandomForestClassifier(n_estimators=10, random_state=42)),\n (\"svr\", make_pipeline(StandardScaler(), LinearSVC(dual=\"auto\", random_state=42))),\n]\nclf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())\nX_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)\nclf.fit(X_train, y_train).score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Permutation-based feature importance\n\nThe :func:`inspection.permutation_importance` can be used to get an\nestimate of the importance of each feature, for any fitted estimator:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\nimport numpy as np\n\nfrom sklearn.datasets import make_classification\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.inspection import permutation_importance\n\nX, y = make_classification(random_state=0, n_features=5, n_informative=3)\nfeature_names = np.array([f\"x_{i}\" for i in range(X.shape[1])])\n\nrf = RandomForestClassifier(random_state=0).fit(X, y)\nresult = permutation_importance(rf, X, y, n_repeats=10, random_state=0, n_jobs=2)\n\nfig, ax = plt.subplots()\nsorted_idx = result.importances_mean.argsort()\n\n# `labels` argument in boxplot is deprecated in matplotlib 3.9 and has been\n# renamed to `tick_labels`. The following code handles this, but as a\n# scikit-learn user you probably can write simpler code by using `labels=...`\n# (matplotlib < 3.9) or `tick_labels=...` (matplotlib >= 3.9).\ntick_labels_parameter_name = (\n \"tick_labels\"\n if parse_version(matplotlib.__version__) >= parse_version(\"3.9\")\n else \"labels\"\n)\ntick_labels_dict = {tick_labels_parameter_name: feature_names[sorted_idx]}\nax.boxplot(result.importances[sorted_idx].T, vert=False, **tick_labels_dict)\nax.set_title(\"Permutation Importance of each feature\")\nax.set_ylabel(\"Features\")\nfig.tight_layout()\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Native support for missing values for gradient boosting\n\nThe :class:`ensemble.HistGradientBoostingClassifier`\nand :class:`ensemble.HistGradientBoostingRegressor` now have native\nsupport for missing values (NaNs). This means that there is no need for\nimputing data when training or predicting.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.ensemble import HistGradientBoostingClassifier\n\nX = np.array([0, 1, 2, np.nan]).reshape(-1, 1)\ny = [0, 0, 1, 1]\n\ngbdt = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)\nprint(gbdt.predict(X))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Precomputed sparse nearest neighbors graph\nMost estimators based on nearest neighbors graphs now accept precomputed\nsparse graphs as input, to reuse the same graph for multiple estimator fits.\nTo use this feature in a pipeline, one can use the `memory` parameter, along\nwith one of the two new transformers,\n:class:`neighbors.KNeighborsTransformer` and\n:class:`neighbors.RadiusNeighborsTransformer`. The precomputation\ncan also be performed by custom estimators to use alternative\nimplementations, such as approximate nearest neighbors methods.\nSee more details in the `User Guide `.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from tempfile import TemporaryDirectory\n\nfrom sklearn.manifold import Isomap\nfrom sklearn.neighbors import KNeighborsTransformer\nfrom sklearn.pipeline import make_pipeline\n\nX, y = make_classification(random_state=0)\n\nwith TemporaryDirectory(prefix=\"sklearn_cache_\") as tmpdir:\n estimator = make_pipeline(\n KNeighborsTransformer(n_neighbors=10, mode=\"distance\"),\n Isomap(n_neighbors=10, metric=\"precomputed\"),\n memory=tmpdir,\n )\n estimator.fit(X)\n\n # We can decrease the number of neighbors and the graph will not be\n # recomputed.\n estimator.set_params(isomap__n_neighbors=5)\n estimator.fit(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## KNN Based Imputation\nWe now support imputation for completing missing values using k-Nearest\nNeighbors.\n\nEach sample's missing values are imputed using the mean value from\n``n_neighbors`` nearest neighbors found in the training set. Two samples are\nclose if the features that neither is missing are close.\nBy default, a euclidean distance metric\nthat supports missing values,\n:func:`~sklearn.metrics.pairwise.nan_euclidean_distances`, is used to find the nearest\nneighbors.\n\nRead more in the `User Guide `.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.impute import KNNImputer\n\nX = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]\nimputer = KNNImputer(n_neighbors=2)\nprint(imputer.fit_transform(X))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tree pruning\n\nIt is now possible to prune most tree-based estimators once the trees are\nbuilt. The pruning is based on minimal cost-complexity. Read more in the\n`User Guide ` for details.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X, y = make_classification(random_state=0)\n\nrf = RandomForestClassifier(random_state=0, ccp_alpha=0).fit(X, y)\nprint(\n \"Average number of nodes without pruning {:.1f}\".format(\n np.mean([e.tree_.node_count for e in rf.estimators_])\n )\n)\n\nrf = RandomForestClassifier(random_state=0, ccp_alpha=0.05).fit(X, y)\nprint(\n \"Average number of nodes with pruning {:.1f}\".format(\n np.mean([e.tree_.node_count for e in rf.estimators_])\n )\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Retrieve dataframes from OpenML\n:func:`datasets.fetch_openml` can now return pandas dataframe and thus\nproperly handle datasets with heterogeneous data:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import fetch_openml\n\ntitanic = fetch_openml(\"titanic\", version=1, as_frame=True, parser=\"pandas\")\nprint(titanic.data.head()[[\"pclass\", \"embarked\"]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checking scikit-learn compatibility of an estimator\nDevelopers can check the compatibility of their scikit-learn compatible\nestimators using :func:`~utils.estimator_checks.check_estimator`. For\ninstance, the ``check_estimator(LinearSVC())`` passes.\n\nWe now provide a ``pytest`` specific decorator which allows ``pytest``\nto run all checks independently and report the checks that are failing.\n\n..note::\n This entry was slightly updated in version 0.24, where passing classes\n isn't supported anymore: pass instances instead.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\nfrom sklearn.tree import DecisionTreeRegressor\nfrom sklearn.utils.estimator_checks import parametrize_with_checks\n\n\n@parametrize_with_checks([LogisticRegression(), DecisionTreeRegressor()])\ndef test_sklearn_compatible_estimator(estimator, check):\n check(estimator)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ROC AUC now supports multiclass classification\nThe :func:`~sklearn.metrics.roc_auc_score` function can also be used in multi-class\nclassification. Two averaging strategies are currently supported: the\none-vs-one algorithm computes the average of the pairwise ROC AUC scores, and\nthe one-vs-rest algorithm computes the average of the ROC AUC scores for each\nclass against all other classes. In both cases, the multiclass ROC AUC scores\nare computed from the probability estimates that a sample belongs to a\nparticular class according to the model. The OvO and OvR algorithms support\nweighting uniformly (``average='macro'``) and weighting by the prevalence\n(``average='weighted'``).\n\nRead more in the `User Guide `.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import make_classification\nfrom sklearn.metrics import roc_auc_score\nfrom sklearn.svm import SVC\n\nX, y = make_classification(n_classes=4, n_informative=16)\nclf = SVC(decision_function_shape=\"ovo\", probability=True).fit(X, y)\nprint(roc_auc_score(y, clf.predict_proba(X), multi_class=\"ovo\"))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }