{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Post-hoc tuning the cut-off point of decision function\n\nOnce a binary classifier is trained, the :term:`predict` method outputs class label\npredictions corresponding to a thresholding of either the :term:`decision_function` or\nthe :term:`predict_proba` output. The default threshold is defined as a posterior\nprobability estimate of 0.5 or a decision score of 0.0. However, this default strategy\nmay not be optimal for the task at hand.\n\nThis example shows how to use the\n:class:`~sklearn.model_selection.TunedThresholdClassifierCV` to tune the decision\nthreshold, depending on a metric of interest.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The diabetes dataset\n\nTo illustrate the tuning of the decision threshold, we will use the diabetes dataset.\nThis dataset is available on OpenML: https://www.openml.org/d/37. We use the\n:func:`~sklearn.datasets.fetch_openml` function to fetch this dataset.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import fetch_openml\n\ndiabetes = fetch_openml(data_id=37, as_frame=True, parser=\"pandas\")\ndata, target = diabetes.data, diabetes.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We look at the target to understand the type of problem we are dealing with.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "target.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that we are dealing with a binary classification problem. Since the\nlabels are not encoded as 0 and 1, we make it explicit that we consider the class\nlabeled \"tested_negative\" as the negative class (which is also the most frequent)\nand the class labeled \"tested_positive\" the positive as the positive class:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "neg_label, pos_label = target.value_counts().index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also observe that this binary problem is slightly imbalanced where we have\naround twice more samples from the negative class than from the positive class. When\nit comes to evaluation, we should consider this aspect to interpret the results.\n\n## Our vanilla classifier\n\nWe define a basic predictive model composed of a scaler followed by a logistic\nregression classifier.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import StandardScaler\n\nmodel = make_pipeline(StandardScaler(), LogisticRegression())\nmodel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We evaluate our model using cross-validation. We use the accuracy and the balanced\naccuracy to report the performance of our model. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "We evaluate our model using cross-validation. We use the accuracy and the balanced\naccuracy to report the performance of our model. The balanced accuracy is a metric\nthat is less sensitive to class imbalance and will allow us to put the accuracy\nscore in perspective.\n\nCross-validation allows us to study the variance of the decision threshold across\ndifferent splits of the data. However, the dataset is rather small, and it would be\ndetrimental to use more than 5 folds to evaluate the dispersion. Therefore, we use\na :class:`~sklearn.model_selection.RepeatedStratifiedKFold` where we apply several\nrepetitions of 5-fold cross-validation.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n\nfrom sklearn.model_selection import RepeatedStratifiedKFold, cross_validate\n\nscoring = [\"accuracy\", \"balanced_accuracy\"]\ncv_scores = [\n \"train_accuracy\",\n \"test_accuracy\",\n \"train_balanced_accuracy\",\n \"test_balanced_accuracy\",\n]\ncv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)\ncv_results_vanilla_model = pd.DataFrame(\n cross_validate(\n model,\n data,\n target,\n scoring=scoring,\n cv=cv,\n return_train_score=True,\n return_estimator=True,\n )\n)\ncv_results_vanilla_model[cv_scores].aggregate([\"mean\", \"std\"]).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our predictive model manages to capture the relationship between the data and the\ntarget. The training and testing scores are close to each other, meaning that our\npredictive model is not overfitting. We can also observe that the balanced accuracy is\nlower than the accuracy, due to the class imbalance previously mentioned.\n\nFor this classifier, we leave the decision threshold, used to convert the probability\nof the positive class into a class prediction, at its default value: 0.5. However, this\nthreshold might not be optimal. If our interest is to maximize the balanced accuracy,\nwe should select another threshold that would maximize this metric.\n\nThe :class:`~sklearn.model_selection.TunedThresholdClassifierCV` meta-estimator allows\nus to tune the decision threshold of a classifier given a metric of interest.\n\n## Tuning the decision threshold\n\nWe create a :class:`~sklearn.model_selection.TunedThresholdClassifierCV` and\nconfigure it to maximize the balanced accuracy. We evaluate the model using the same\ncross-validation strategy as before.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.model_selection import TunedThresholdClassifierCV\n\ntuned_model = TunedThresholdClassifierCV(estimator=model, scoring=\"balanced_accuracy\")\ncv_results_tuned_model = pd.DataFrame(\n cross_validate(\n tuned_model,\n data,\n target,\n scoring=scoring,\n cv=cv,\n return_train_score=True,\n return_estimator=True,\n )\n)\ncv_results_tuned_model[cv_scores].aggregate([\"mean\", \"std\"]).T" ] },
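{ "cell_type": "markdown", "metadata": {}, "source": [ "Before comparing the aggregated scores, we can also fit a single\n:class:`~sklearn.model_selection.TunedThresholdClassifierCV` on the train/test split\nfrom the earlier illustrative aside and inspect the threshold it selects through its\ninternal cross-validation.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Illustrative aside: a single fit exposes the selected threshold and the\n# associated internally cross-validated score.\ntuned_model.fit(X_train, y_train)\nprint(f\"Selected decision threshold: {tuned_model.best_threshold_:.3f}\")\nprint(f\"Internal cross-validated balanced accuracy: {tuned_model.best_score_:.3f}\")" ] },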
{ "cell_type": "markdown", "metadata": {}, "source": [ "In comparison with the vanilla model, we observe that the balanced accuracy score\nincreased. Of course, this comes at the cost of a lower accuracy score. This means that\nour model is now more sensitive to the positive class but makes more mistakes on the\nnegative class.\n\nHowever, it is important to note that this tuned predictive model is internally the\nsame model as the vanilla model: they have the same fitted coefficients.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nvanilla_model_coef = pd.DataFrame(\n [est[-1].coef_.ravel() for est in cv_results_vanilla_model[\"estimator\"]],\n columns=diabetes.feature_names,\n)\ntuned_model_coef = pd.DataFrame(\n [est.estimator_[-1].coef_.ravel() for est in cv_results_tuned_model[\"estimator\"]],\n columns=diabetes.feature_names,\n)\n\nfig, ax = plt.subplots(ncols=2, figsize=(12, 4), sharex=True, sharey=True)\nvanilla_model_coef.boxplot(ax=ax[0])\nax[0].set_ylabel(\"Coefficient value\")\nax[0].set_title(\"Vanilla model\")\ntuned_model_coef.boxplot(ax=ax[1])\nax[1].set_title(\"Tuned model\")\n_ = fig.suptitle(\"Coefficients of the predictive models\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only the decision threshold of each model was changed during the cross-validation.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "decision_threshold = pd.Series(\n [est.best_threshold_ for est in cv_results_tuned_model[\"estimator\"]],\n)\nax = decision_threshold.plot.kde()\nax.axvline(\n decision_threshold.mean(),\n color=\"k\",\n linestyle=\"--\",\n label=f\"Mean decision threshold: {decision_threshold.mean():.2f}\",\n)\nax.set_xlabel(\"Decision threshold\")\nax.legend(loc=\"upper right\")\n_ = ax.set_title(\n \"Distribution of the decision threshold \\nacross different cross-validation folds\"\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On average, a decision threshold around 0.32 maximizes the balanced accuracy, which is\ndifferent from the default decision threshold of 0.5. Thus, tuning the decision\nthreshold is particularly important when the output of the predictive model\nis used to make decisions. Besides, the metric used to tune the decision threshold\nshould be chosen carefully. Here, we used the balanced accuracy, but it might not be\nthe most appropriate metric for the problem at hand. The choice of the \"right\" metric\nis usually problem-dependent and might require some domain knowledge. Refer to the\nexample entitled\n`sphx_glr_auto_examples_model_selection_plot_cost_sensitive_learning.py`\nfor more details.\n\n" ] },
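{ "cell_type": "markdown", "metadata": {}, "source": [ "As a final illustrative aside, if we wanted to deploy the vanilla model with a\nthreshold chosen once and for all (for instance, the mean threshold found above), we\ncould wrap it in a :class:`~sklearn.model_selection.FixedThresholdClassifier` instead\nof re-tuning the threshold at every fit. The cell below reuses the train/test split\nfrom the earlier asides.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Illustrative aside: pin the decision threshold to the mean value found during\n# the cross-validation above instead of re-tuning it.\nfrom sklearn.metrics import balanced_accuracy_score\nfrom sklearn.model_selection import FixedThresholdClassifier\n\nfixed_model = FixedThresholdClassifier(\n    estimator=model, threshold=decision_threshold.mean(), pos_label=pos_label\n)\nfixed_model.fit(X_train, y_train)\nscore = balanced_accuracy_score(y_test, fixed_model.predict(X_test))\nprint(f\"Balanced accuracy on the held-out split: {score:.3f}\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }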