{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Precision-Recall\n\nExample of Precision-Recall metric to evaluate classifier output quality.\n\nPrecision-Recall is a useful measure of success of prediction when the\nclasses are very imbalanced. In information retrieval, precision is a\nmeasure of the fraction of relevant items among actually returned items while recall\nis a measure of the fraction of items that were returned among all items that should\nhave been returned. 'Relevancy' here refers to items that are\npostively labeled, i.e., true positives and false negatives.\n\nPrecision ($P$) is defined as the number of true positives ($T_p$)\nover the number of true positives plus the number of false positives\n($F_p$).\n\n\\begin{align}P = \\frac{T_p}{T_p+F_p}\\end{align}\n\nRecall ($R$) is defined as the number of true positives ($T_p$)\nover the number of true positives plus the number of false negatives\n($F_n$).\n\n\\begin{align}R = \\frac{T_p}{T_p + F_n}\\end{align}\n\nThe precision-recall curve shows the tradeoff between precision and\nrecall for different thresholds. A high area under the curve represents\nboth high recall and high precision. High precision is achieved by having\nfew false positives in the returned results, and high recall is achieved by\nhaving few false negatives in the relevant results.\nHigh scores for both show that the classifier is returning\naccurate results (high precision), as well as returning a majority of all relevant\nresults (high recall).\n\nA system with high recall but low precision returns most of the relevant items, but\nthe proportion of returned results that are incorrectly labeled is high. A\nsystem with high precision but low recall is just the opposite, returning very\nfew of the relevant items, but most of its predicted labels are correct when compared\nto the actual labels. An ideal system with high precision and high recall will\nreturn most of the relevant items, with most results labeled correctly.\n\nThe definition of precision ($\\frac{T_p}{T_p + F_p}$) shows that lowering\nthe threshold of a classifier may increase the denominator, by increasing the\nnumber of results returned. If the threshold was previously set too high, the\nnew results may all be true positives, which will increase precision. If the\nprevious threshold was about right or too low, further lowering the threshold\nwill introduce false positives, decreasing precision.\n\nRecall is defined as $\\frac{T_p}{T_p+F_n}$, where $T_p+F_n$ does\nnot depend on the classifier threshold. Changing the classifier threshold can only\nchange the numerator, $T_p$. Lowering the classifier\nthreshold may increase recall, by increasing the number of true positive\nresults. It is also possible that lowering the threshold may leave recall\nunchanged, while the precision fluctuates. Thus, precision does not necessarily\ndecrease with recall.\n\nThe relationship between recall and precision can be observed in the\nstairstep area of the plot - at the edges of these steps a small change\nin the threshold considerably reduces precision, with only a minor gain in\nrecall.\n\n**Average precision** (AP) summarizes such a plot as the weighted mean of\nprecisions achieved at each threshold, with the increase in recall from the\nprevious threshold used as the weight:\n\n$\\text{AP} = \\sum_n (R_n - R_{n-1}) P_n$\n\nwhere $P_n$ and $R_n$ are the precision and recall at the\nnth threshold. 
A pair $(R_k, P_k)$ is referred to as an\n*operating point*.\n\nAP and the trapezoidal area under the operating points\n(:func:`sklearn.metrics.auc`) are common ways to summarize a precision-recall\ncurve, and they lead to different results. Read more in the\nUser Guide.\n\nPrecision-recall curves are typically used in binary classification to study\nthe output of a classifier. In order to extend the precision-recall curve and\naverage precision to multi-class or multi-label classification, it is necessary\nto binarize the output. One curve can be drawn per label, but one can also draw\na precision-recall curve by considering each element of the label indicator\nmatrix as a binary prediction (micro-averaging).\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>See also :func:`sklearn.metrics.average_precision_score`,\n :func:`sklearn.metrics.recall_score`,\n :func:`sklearn.metrics.precision_score`,\n :func:`sklearn.metrics.f1_score`</p></div>
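\n\nAs a small, self-contained sketch (not part of the original example), the AP sum above\ncan be reproduced directly from the operating points returned by\n:func:`sklearn.metrics.precision_recall_curve` and compared with the trapezoidal area\ncomputed by :func:`sklearn.metrics.auc`. The toy `y_true` and `y_scores` values below\nare invented purely for illustration.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nfrom sklearn.metrics import auc, average_precision_score, precision_recall_curve\n\n# Toy labels and scores, invented purely for illustration\ny_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])\ny_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.45])\n\nprecision, recall, thresholds = precision_recall_curve(y_true, y_scores)\n\n# AP = sum_n (R_n - R_{n-1}) P_n over the operating points; recall is returned in\n# decreasing order, hence the sign flip on the differences\nap_manual = -np.sum(np.diff(recall) * precision[:-1])\n\nprint(f\"AP (manual sum)       : {ap_manual:.3f}\")\nprint(f\"AP (sklearn)          : {average_precision_score(y_true, y_scores):.3f}\")\nprint(f\"Trapezoidal area (auc): {auc(recall, precision):.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two summaries generally differ: the trapezoidal rule interpolates linearly between\noperating points, which can be too optimistic, whereas AP does not interpolate.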
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## In binary classification settings\n\n### Dataset and model\n\nWe will use a Linear SVC classifier to differentiate two types of irises.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nfrom sklearn.datasets import load_iris\nfrom sklearn.model_selection import train_test_split\n\nX, y = load_iris(return_X_y=True)\n\n# Add noisy features\nrandom_state = np.random.RandomState(0)\nn_samples, n_features = X.shape\nX = np.concatenate([X, random_state.randn(n_samples, 200 * n_features)], axis=1)\n\n# Limit to the two first classes, and split into training and test\nX_train, X_test, y_train, y_test = train_test_split(\n X[y < 2], y[y < 2], test_size=0.5, random_state=random_state\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Linear SVC will expect each feature to have a similar range of values. Thus,\nwe will first scale the data using a\n:class:`~sklearn.preprocessing.StandardScaler`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import LinearSVC\n\nclassifier = make_pipeline(StandardScaler(), LinearSVC(random_state=random_state))\nclassifier.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plot the Precision-Recall curve\n\nTo plot the precision-recall curve, you should use\n:class:`~sklearn.metrics.PrecisionRecallDisplay`. Indeed, there is two\nmethods available depending if you already computed the predictions of the\nclassifier or not.\n\nLet's first plot the precision-recall curve without the classifier\npredictions. We use\n:func:`~sklearn.metrics.PrecisionRecallDisplay.from_estimator` that\ncomputes the predictions for us before plotting the curve.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.metrics import PrecisionRecallDisplay\n\ndisplay = PrecisionRecallDisplay.from_estimator(\n classifier, X_test, y_test, name=\"LinearSVC\", plot_chance_level=True, despine=True\n)\n_ = display.ax_.set_title(\"2-class Precision-Recall curve\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we already got the estimated probabilities or scores for\nour model, then we can use\n:func:`~sklearn.metrics.PrecisionRecallDisplay.from_predictions`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y_score = classifier.decision_function(X_test)\n\ndisplay = PrecisionRecallDisplay.from_predictions(\n y_test, y_score, name=\"LinearSVC\", plot_chance_level=True, despine=True\n)\n_ = display.ax_.set_title(\"2-class Precision-Recall curve\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## In multi-label settings\n\nThe precision-recall curve does not support the multilabel setting. However,\none can decide how to handle this case. 
We show such an example below.\n\n### Create multi-label data, fit, and predict\n\nWe create a multi-label dataset to illustrate precision-recall in\nmulti-label settings.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.preprocessing import label_binarize\n\n# Use label_binarize to turn the problem into a multi-label-like setting\nY = label_binarize(y, classes=[0, 1, 2])\nn_classes = Y.shape[1]\n\n# Split into training and test\nX_train, X_test, Y_train, Y_test = train_test_split(\n    X, Y, test_size=0.5, random_state=random_state\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use :class:`~sklearn.multiclass.OneVsRestClassifier` for multi-label\nprediction.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.multiclass import OneVsRestClassifier\n\nclassifier = OneVsRestClassifier(\n    make_pipeline(StandardScaler(), LinearSVC(random_state=random_state))\n)\nclassifier.fit(X_train, Y_train)\ny_score = classifier.decision_function(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The average precision score in multi-label settings\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.metrics import average_precision_score, precision_recall_curve\n\n# For each class\nprecision = dict()\nrecall = dict()\naverage_precision = dict()\nfor i in range(n_classes):\n    precision[i], recall[i], _ = precision_recall_curve(Y_test[:, i], y_score[:, i])\n    average_precision[i] = average_precision_score(Y_test[:, i], y_score[:, i])\n\n# A \"micro-average\": quantifying score on all classes jointly\nprecision[\"micro\"], recall[\"micro\"], _ = precision_recall_curve(\n    Y_test.ravel(), y_score.ravel()\n)\naverage_precision[\"micro\"] = average_precision_score(Y_test, y_score, average=\"micro\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plot the micro-averaged Precision-Recall curve\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from collections import Counter\n\ndisplay = PrecisionRecallDisplay(\n    recall=recall[\"micro\"],\n    precision=precision[\"micro\"],\n    average_precision=average_precision[\"micro\"],\n    prevalence_pos_label=Counter(Y_test.ravel())[1] / Y_test.size,\n)\ndisplay.plot(plot_chance_level=True, despine=True)\n_ = display.ax_.set_title(\"Micro-averaged over all classes\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plot Precision-Recall curve for each class and iso-f1 curves\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from itertools import cycle\n\nimport matplotlib.pyplot as plt\n\n# set up plot details\ncolors = cycle([\"navy\", \"turquoise\", \"darkorange\", \"cornflowerblue\", \"teal\"])\n\n_, ax = plt.subplots(figsize=(7, 8))\n\nf_scores = np.linspace(0.2, 0.8, num=4)\nlines, labels = [], []\nfor f_score in f_scores:\n    x = np.linspace(0.01, 1)\n    y = f_score * x / (2 * x - f_score)\n    (l,) = plt.plot(x[y >= 0], y[y >= 0], color=\"gray\", alpha=0.2)\n    plt.annotate(\"f1={0:0.1f}\".format(f_score), xy=(0.9, y[45] + 0.02))\n\ndisplay = PrecisionRecallDisplay(\n    recall=recall[\"micro\"],\n    precision=precision[\"micro\"],\n    average_precision=average_precision[\"micro\"],\n)\n
precision-recall\", color=\"gold\")\n\nfor i, color in zip(range(n_classes), colors):\n display = PrecisionRecallDisplay(\n recall=recall[i],\n precision=precision[i],\n average_precision=average_precision[i],\n )\n display.plot(\n ax=ax, name=f\"Precision-recall for class {i}\", color=color, despine=True\n )\n\n# add the legend for the iso-f1 curves\nhandles, labels = display.ax_.get_legend_handles_labels()\nhandles.extend([l])\nlabels.extend([\"iso-f1 curves\"])\n# set the legend and the axes\nax.legend(handles=handles, labels=labels, loc=\"best\")\nax.set_title(\"Extension of Precision-Recall curve to multi-class\")\n\nplt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }