{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Release Highlights for scikit-learn 1.5\n\n.. currentmodule:: sklearn\n\nWe are pleased to announce the release of scikit-learn 1.5! Many bug fixes\nand improvements were added, as well as some key new features. Below we\ndetail the highlights of this release. **For an exhaustive list of\nall the changes**, please refer to the `release notes`.\n\nTo install the latest version (with pip)::\n\n    pip install --upgrade scikit-learn\n\nor with conda::\n\n    conda install -c conda-forge scikit-learn\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## FixedThresholdClassifier: Setting the decision threshold of a binary classifier\nAll binary classifiers in scikit-learn use a fixed decision threshold of 0.5\nto convert probability estimates (i.e. the output of `predict_proba`) into class\npredictions. However, 0.5 is almost never the desired threshold for a given\nproblem. :class:`~model_selection.FixedThresholdClassifier` allows wrapping any\nbinary classifier and setting a custom decision threshold.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import make_classification\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.metrics import ConfusionMatrixDisplay\n\n\nX, y = make_classification(n_samples=10_000, weights=[0.9, 0.1], random_state=0)\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n\nclassifier_05 = LogisticRegression(C=1e6, random_state=0).fit(X_train, y_train)\n_ = ConfusionMatrixDisplay.from_estimator(classifier_05, X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lowering the threshold, i.e. 
allowing more samples to be classified as the positive\nclass, increases the number of true positives at the cost of more false positives\n(as is well known from the concavity of the ROC curve).\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.model_selection import FixedThresholdClassifier\n\nclassifier_01 = FixedThresholdClassifier(classifier_05, threshold=0.1)\nclassifier_01.fit(X_train, y_train)\n_ = ConfusionMatrixDisplay.from_estimator(classifier_01, X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TunedThresholdClassifierCV: Tuning the decision threshold of a binary classifier\nThe decision threshold of a binary classifier can be tuned to optimize a\ngiven metric, using :class:`~model_selection.TunedThresholdClassifierCV`.\n\nIt is particularly useful to find the best decision threshold when the model\nis meant to be deployed in a specific application context where we can assign\ndifferent gains or costs for true positives, true negatives, false positives,\nand false negatives.\n\nLet's illustrate this by considering an arbitrary case where:\n\n- each true positive gains 1 unit of profit, e.g. 
euro, year of life in good\n  health, etc.;\n- true negatives gain or cost nothing;\n- each false negative costs 2;\n- each false positive costs 0.1.\n\nOur metric quantifies the average profit per sample, which is defined by the\nfollowing Python function:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix\n\n\ndef custom_score(y_observed, y_pred):\n    # With normalize=\"all\", each entry is a fraction of all samples, so the\n    # weighted sum below is an average profit per sample.\n    tn, fp, fn, tp = confusion_matrix(y_observed, y_pred, normalize=\"all\").ravel()\n    return tp - 2 * fn - 0.1 * fp\n\n\nprint(\"Untuned decision threshold: 0.5\")\nprint(f\"Custom score: {custom_score(y_test, classifier_05.predict(X_test)):.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is interesting to observe that the average gain per prediction is negative,\nwhich means that this decision system is making a loss on average.\n\nTuning the threshold to optimize this custom metric gives a smaller threshold\nthat allows more samples to be classified as the positive class. 
As a result,\nthe average gain per prediction improves.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.model_selection import TunedThresholdClassifierCV\nfrom sklearn.metrics import make_scorer\n\ncustom_scorer = make_scorer(\n    custom_score, response_method=\"predict\", greater_is_better=True\n)\n# Tune on the training split only, so that the test score below is not\n# computed on samples already seen during tuning.\ntuned_classifier = TunedThresholdClassifierCV(\n    classifier_05, cv=5, scoring=custom_scorer\n).fit(X_train, y_train)\n\nprint(f\"Tuned decision threshold: {tuned_classifier.best_threshold_:.3f}\")\nprint(f\"Custom score: {custom_score(y_test, tuned_classifier.predict(X_test)):.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe that tuning the decision threshold can turn a machine\nlearning-based system that makes a loss on average into a beneficial one.\n\nIn practice, defining a meaningful application-specific metric might involve\nmaking those costs for bad predictions and gains for good predictions depend on\nauxiliary metadata specific to each individual data point, such as the amount\nof a transaction in a fraud detection system.\n\nTo achieve this, :class:`~model_selection.TunedThresholdClassifierCV`\nleverages metadata routing support (`Metadata Routing User\nGuide`), which makes it possible to optimize complex business metrics as\ndetailed in `Post-tuning the decision threshold for cost-sensitive\nlearning`.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performance improvements in PCA\n:class:`~decomposition.PCA` has a new solver, `\"covariance_eigh\"`, which is\nup to an order of magnitude faster and more memory efficient than the other\nsolvers for datasets with many data points and few features.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import make_low_rank_matrix\nfrom sklearn.decomposition import PCA\n\nX = make_low_rank_matrix(\n    n_samples=10_000, 
n_features=100, tail_strength=0.1, random_state=0\n)\n\npca = PCA(n_components=10, svd_solver=\"covariance_eigh\").fit(X)\nprint(f\"Explained variance: {pca.explained_variance_ratio_.sum():.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The new solver also accepts sparse input data:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from scipy.sparse import random\n\nX = random(10_000, 100, format=\"csr\", random_state=0)\n\npca = PCA(n_components=10, svd_solver=\"covariance_eigh\").fit(X)\nprint(f\"Explained variance: {pca.explained_variance_ratio_.sum():.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `\"full\"` solver has also been improved to use less memory and allows\nfaster transformation. The default `svd_solver=\"auto\"` option takes\nadvantage of the new solver and is now able to select an appropriate solver\nfor sparse datasets.\n\nSimilarly to most other PCA solvers, the new `\"covariance_eigh\"` solver can leverage\nGPU computation if the input data is passed as a PyTorch or CuPy array and\nthe experimental support for `Array API` is enabled.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ColumnTransformer is subscriptable\nThe transformers of a :class:`~compose.ColumnTransformer` can now be directly\naccessed using indexing by name.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.preprocessing import StandardScaler, OneHotEncoder\n\nX = np.array([[0, 1, 2], [3, 4, 5]])\ncolumn_transformer = ColumnTransformer(\n    [(\"std_scaler\", StandardScaler(), [0]), (\"one_hot\", OneHotEncoder(), [1, 2])]\n)\n\ncolumn_transformer.fit(X)\n\nprint(column_transformer[\"std_scaler\"])\nprint(column_transformer[\"one_hot\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": 
[ "## Custom imputation strategies for the SimpleImputer\n:class:`~impute.SimpleImputer` now supports custom strategies for imputation,\nusing a callable that computes a scalar value from the non-missing values of\na column vector.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.impute import SimpleImputer\n\nX = np.array(\n    [\n        [-1.1, 1.1, 1.1],\n        [3.9, -1.2, np.nan],\n        [np.nan, 1.3, np.nan],\n        [-0.1, -1.4, -1.4],\n        [-4.9, 1.5, -1.5],\n        [np.nan, 1.6, 1.6],\n    ]\n)\n\n\ndef smallest_abs(arr):\n    \"\"\"Return the smallest absolute value of a 1D array.\"\"\"\n    return np.min(np.abs(arr))\n\n\nimputer = SimpleImputer(strategy=smallest_abs)\n\nimputer.fit_transform(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pairwise distances with non-numeric arrays\n:func:`~metrics.pairwise_distances` can now compute distances between\nnon-numeric arrays using a callable metric.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.metrics import pairwise_distances\n\nX = [\"cat\", \"dog\"]\nY = [\"cat\", \"fox\"]\n\n\ndef levenshtein_distance(x, y):\n    \"\"\"Return the Levenshtein distance between two strings.\"\"\"\n    if x == \"\" or y == \"\":\n        return max(len(x), len(y))\n    if x[0] == y[0]:\n        return levenshtein_distance(x[1:], y[1:])\n    return 1 + min(\n        levenshtein_distance(x[1:], y),\n        levenshtein_distance(x, y[1:]),\n        levenshtein_distance(x[1:], y[1:]),\n    )\n\n\npairwise_distances(X, Y, metric=levenshtein_distance)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }