{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Post-tuning the decision threshold for cost-sensitive learning\n\nOnce a classifier is trained, the output of the :term:`predict` method outputs class\nlabel predictions corresponding to a thresholding of either the\n:term:`decision_function` or the :term:`predict_proba` output. For a binary classifier,\nthe default threshold is defined as a posterior probability estimate of 0.5 or a\ndecision score of 0.0.\n\nHowever, this default strategy is most likely not optimal for the task at hand.\nHere, we use the \"Statlog\" German credit dataset [1]_ to illustrate a use case.\nIn this dataset, the task is to predict whether a person has a \"good\" or \"bad\" credit.\nIn addition, a cost-matrix is provided that specifies the cost of\nmisclassification. Specifically, misclassifying a \"bad\" credit as \"good\" is five\ntimes more costly on average than misclassifying a \"good\" credit as \"bad\".\n\nWe use the :class:`~sklearn.model_selection.TunedThresholdClassifierCV` to select the\ncut-off point of the decision function that minimizes the provided business\ncost.\n\nIn the second part of the example, we further extend this approach by\nconsidering the problem of fraud detection in credit card transactions: in this\ncase, the business metric depends on the amount of each individual transaction.\n\n.. rubric :: References\n\n.. [1] \"Statlog (German Credit Data) Data Set\", UCI Machine Learning Repository,\n [Link](https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29).\n\n.. [2] [Charles Elkan, \"The Foundations of Cost-Sensitive Learning\",\n International joint conference on artificial intelligence.\n Vol. 17. No. 1. Lawrence Erlbaum Associates Ltd, 2001.](https://cseweb.ucsd.edu/~elkan/rescale.pdf)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cost-sensitive learning with constant gains and costs\n\nIn this first section, we illustrate the use of the\n:class:`~sklearn.model_selection.TunedThresholdClassifierCV` in a setting of\ncost-sensitive learning when the gains and costs associated to each entry of the\nconfusion matrix are constant. We use the problematic presented in [2]_ using the\n\"Statlog\" German credit dataset [1]_.\n\n### \"Statlog\" German credit dataset\n\nWe fetch the German credit dataset from OpenML.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import sklearn\nfrom sklearn.datasets import fetch_openml\n\nsklearn.set_config(transform_output=\"pandas\")\n\ngerman_credit = fetch_openml(data_id=31, as_frame=True, parser=\"pandas\")\nX, y = german_credit.data, german_credit.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We check the feature types available in `X`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many features are categorical and usually string-encoded. We need to encode\nthese categories when we develop our predictive model. 
Let's check the targets.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another observation is that the dataset is imbalanced. We need to be careful\nwhen evaluating our predictive model and use a family of metrics that are adapted\nto this setting.\n\nIn addition, we observe that the target is string-encoded. Some metrics\n(e.g. precision and recall) require providing the label of interest, also called\nthe \"positive label\". Here, we define that our goal is to predict whether or not\na sample is a \"bad\" credit.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pos_label, neg_label = \"bad\", \"good\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To carry out our analysis, we split our dataset using a single stratified split.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are ready to design our predictive model and the associated evaluation strategy.\n\n### Evaluation metrics\n\nIn this section, we define a set of metrics that we use later. To see\nthe effect of tuning the cut-off point, we evaluate the predictive model using\nthe Receiver Operating Characteristic (ROC) curve and the Precision-Recall curve.\nThe values reported on these plots are therefore the true positive rate (TPR),\nalso known as the recall or the sensitivity, and the false positive rate (FPR),\nwhich is one minus the specificity, for the ROC curve, and the precision and recall for\nthe Precision-Recall curve.\n\nOf these four metrics, scikit-learn does not provide a scorer for the FPR. We\ntherefore need to define a small custom function to compute it.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix\n\n\ndef fpr_score(y, y_pred, neg_label, pos_label):\n    cm = confusion_matrix(y, y_pred, labels=[neg_label, pos_label])\n    tn, fp, _, _ = cm.ravel()\n    tnr = tn / (tn + fp)\n    return 1 - tnr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As previously stated, the \"positive label\" is not defined as the value \"1\" and calling\nsome of the metrics with this non-standard value raises an error. We need to\nexplicitly indicate the \"positive label\" to the metrics.\n\nWe therefore need to define a scikit-learn scorer using\n:func:`~sklearn.metrics.make_scorer` where this information is passed. We store all\nthe custom scorers in a dictionary.
To use them, we need to pass the fitted model,\nthe data and the target on which we want to evaluate the predictive model.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.metrics import make_scorer, precision_score, recall_score\n\ntpr_score = recall_score  # TPR and recall are the same metric\nscoring = {\n    \"precision\": make_scorer(precision_score, pos_label=pos_label),\n    \"recall\": make_scorer(recall_score, pos_label=pos_label),\n    \"fpr\": make_scorer(fpr_score, neg_label=neg_label, pos_label=pos_label),\n    \"tpr\": make_scorer(tpr_score, pos_label=pos_label),\n}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition, the original research [1]_ defines a custom business metric. We\ncall a \"business metric\" any metric function that aims at quantifying how the\npredictions (correct or wrong) might impact the business value of deploying a\ngiven machine learning model in a specific application context. For our\ncredit prediction task, the authors provide a custom cost-matrix which\nencodes that classifying a \"bad\" credit as \"good\" is 5 times more costly on\naverage than the opposite: it is less costly for the financing institution to\nnot grant a credit to a potential customer that will not default (and\ntherefore miss a good customer that would have otherwise both reimbursed the\ncredit and paid interest) than to grant a credit to a customer that will\ndefault.\n\nWe define a Python function that weights the confusion matrix and returns the\noverall cost.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\n\ndef credit_gain_score(y, y_pred, neg_label, pos_label):\n    cm = confusion_matrix(y, y_pred, labels=[neg_label, pos_label])\n    # The rows of the confusion matrix hold the counts of observed classes\n    # while the columns hold counts of predicted classes.
Recall that here we\n    # consider \"bad\" as the positive class (second row and column).\n    # Scikit-learn model selection tools expect that we follow a convention\n    # that \"higher\" means \"better\", hence the following gain matrix assigns\n    # negative gains (costs) to the two kinds of prediction errors:\n    # - a gain of -1 for each false positive (\"good\" credit labeled as \"bad\"),\n    # - a gain of -5 for each false negative (\"bad\" credit labeled as \"good\").\n    # The true positives and true negatives are assigned null gains in this\n    # metric.\n    #\n    # Note that theoretically, given that our model is calibrated and our data\n    # set representative and large enough, we do not need to tune the\n    # threshold, but can safely set it to the cost ratio 1/5, as stated by Eq.\n    # (2) in the Elkan paper [2]_.\n    gain_matrix = np.array(\n        [\n            [0, -1],  # -1 gain for false positives\n            [-5, 0],  # -5 gain for false negatives\n        ]\n    )\n    return np.sum(cm * gain_matrix)\n\n\nscoring[\"credit_gain\"] = make_scorer(\n    credit_gain_score, neg_label=neg_label, pos_label=pos_label\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vanilla predictive model\n\nWe use :class:`~sklearn.ensemble.HistGradientBoostingClassifier` as a predictive model\nthat natively handles categorical features and missing values.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.ensemble import HistGradientBoostingClassifier\n\nmodel = HistGradientBoostingClassifier(\n    categorical_features=\"from_dtype\", random_state=0\n).fit(X_train, y_train)\nmodel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We evaluate the performance of our predictive model using the ROC and Precision-Recall\ncurves.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nfrom sklearn.metrics import PrecisionRecallDisplay, RocCurveDisplay\n\nfig, axs = plt.subplots(nrows=1, ncols=2, figsize=(14, 6))\n\nPrecisionRecallDisplay.from_estimator(\n    model, X_test, y_test, pos_label=pos_label, ax=axs[0], name=\"GBDT\"\n)\naxs[0].plot(\n    scoring[\"recall\"](model, X_test, y_test),\n    scoring[\"precision\"](model, X_test, y_test),\n    marker=\"o\",\n    markersize=10,\n    color=\"tab:blue\",\n    label=\"Default cut-off point at a probability of 0.5\",\n)\naxs[0].set_title(\"Precision-Recall curve\")\naxs[0].legend()\n\nRocCurveDisplay.from_estimator(\n    model,\n    X_test,\n    y_test,\n    pos_label=pos_label,\n    ax=axs[1],\n    name=\"GBDT\",\n    plot_chance_level=True,\n)\naxs[1].plot(\n    scoring[\"fpr\"](model, X_test, y_test),\n    scoring[\"tpr\"](model, X_test, y_test),\n    marker=\"o\",\n    markersize=10,\n    color=\"tab:blue\",\n    label=\"Default cut-off point at a probability of 0.5\",\n)\naxs[1].set_title(\"ROC curve\")\naxs[1].legend()\n_ = fig.suptitle(\"Evaluation of the vanilla GBDT model\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We recall that these curves give insight into the statistical performance of the\npredictive model for different cut-off points. For the Precision-Recall curve, the\nreported metrics are the precision and recall, and for the ROC curve, the reported\nmetrics are the TPR (same as recall) and FPR.\n\nHere, the different cut-off points correspond to different levels of posterior\nprobability estimates ranging between 0 and 1. By default, `model.predict` uses a\ncut-off point at a probability estimate of 0.5.
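As a quick sanity check (an illustrative addition, not part of the original example), we can verify that thresholding the estimated probability of the positive class at 0.5 recovers the output of `model.predict`:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Illustrative addition: for this binary classifier, `predict` is equivalent to\n# thresholding the estimated probability of the positive class (\"bad\") at 0.5.\nproba_bad = model.predict_proba(X_test)[:, list(model.classes_).index(pos_label)]\nmanual_pred = np.where(proba_bad >= 0.5, pos_label, neg_label)\nprint(f\"Agreement with model.predict: {np.mean(manual_pred == model.predict(X_test)):.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "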
The metrics for such a cut-off point\nare reported with the blue dot on the curves: it corresponds to the statistical\nperformance of the model when using `model.predict`.\n\nHowever, we recall that the original aim was to minimize the cost (or maximize the\ngain) as defined by the business metric. We can compute the value of the business\nmetric:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(f\"Business defined metric: {scoring['credit_gain'](model, X_test, y_test)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this stage, we don't know if any other cut-off point can lead to a greater gain. To find\nthe optimal one, we need to compute the cost-gain using the business metric for all\npossible cut-off points and choose the best. This strategy can be quite tedious to\nimplement by hand, but the\n:class:`~sklearn.model_selection.TunedThresholdClassifierCV` class is here to help us.\nIt automatically computes the cost-gain for all possible cut-off points and optimizes\nthe `scoring` metric.\n\n\n### Tuning the cut-off point\n\nWe use :class:`~sklearn.model_selection.TunedThresholdClassifierCV` to tune the\ncut-off point. We need to provide the business metric to optimize as well as the\npositive label. Internally, the optimum cut-off point is chosen such that it maximizes\nthe business metric via cross-validation. By default, a 5-fold stratified\ncross-validation is used.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.model_selection import TunedThresholdClassifierCV\n\ntuned_model = TunedThresholdClassifierCV(\n    estimator=model,\n    scoring=scoring[\"credit_gain\"],\n    store_cv_results=True,  # necessary to inspect all results\n)\ntuned_model.fit(X_train, y_train)\nprint(f\"{tuned_model.best_threshold_=:0.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We plot the ROC and Precision-Recall curves for the vanilla model and the tuned model.\nWe also plot the cut-off points that would be used by each model.
Because we are\nreusing the same code later, we define a function that generates the plots.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def plot_roc_pr_curves(vanilla_model, tuned_model, *, title):\n    fig, axs = plt.subplots(nrows=1, ncols=3, figsize=(21, 6))\n\n    linestyles = (\"dashed\", \"dotted\")\n    markerstyles = (\"o\", \">\")\n    colors = (\"tab:blue\", \"tab:orange\")\n    names = (\"Vanilla GBDT\", \"Tuned GBDT\")\n    for idx, (est, linestyle, marker, color, name) in enumerate(\n        zip((vanilla_model, tuned_model), linestyles, markerstyles, colors, names)\n    ):\n        decision_threshold = getattr(est, \"best_threshold_\", 0.5)\n        PrecisionRecallDisplay.from_estimator(\n            est,\n            X_test,\n            y_test,\n            pos_label=pos_label,\n            linestyle=linestyle,\n            color=color,\n            ax=axs[0],\n            name=name,\n        )\n        axs[0].plot(\n            scoring[\"recall\"](est, X_test, y_test),\n            scoring[\"precision\"](est, X_test, y_test),\n            marker,\n            markersize=10,\n            color=color,\n            label=f\"Cut-off point at probability of {decision_threshold:.2f}\",\n        )\n        RocCurveDisplay.from_estimator(\n            est,\n            X_test,\n            y_test,\n            pos_label=pos_label,\n            linestyle=linestyle,\n            color=color,\n            ax=axs[1],\n            name=name,\n            plot_chance_level=idx == 1,\n        )\n        axs[1].plot(\n            scoring[\"fpr\"](est, X_test, y_test),\n            scoring[\"tpr\"](est, X_test, y_test),\n            marker,\n            markersize=10,\n            color=color,\n            label=f\"Cut-off point at probability of {decision_threshold:.2f}\",\n        )\n\n    axs[0].set_title(\"Precision-Recall curve\")\n    axs[0].legend()\n    axs[1].set_title(\"ROC curve\")\n    axs[1].legend()\n\n    axs[2].plot(\n        tuned_model.cv_results_[\"thresholds\"],\n        tuned_model.cv_results_[\"scores\"],\n        color=\"tab:orange\",\n    )\n    axs[2].plot(\n        tuned_model.best_threshold_,\n        tuned_model.best_score_,\n        \"o\",\n        markersize=10,\n        color=\"tab:orange\",\n        label=\"Optimal cut-off point for the business metric\",\n    )\n    axs[2].legend()\n    axs[2].set_xlabel(\"Decision threshold (probability)\")\n    axs[2].set_ylabel(\"Objective score (using cost-matrix)\")\n    axs[2].set_title(\"Objective score as a function of the decision threshold\")\n    fig.suptitle(title)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "title = \"Comparison of the cut-off point for the vanilla and tuned GBDT model\"\nplot_roc_pr_curves(model, tuned_model, title=title)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first remark is that both classifiers have exactly the same ROC and\nPrecision-Recall curves. This is expected because, by default, the classifier is fitted\non the same training data. In a later section, we discuss in more detail the\navailable options regarding model refitting and cross-validation.\n\nThe second remark is that the cut-off points of the vanilla and tuned model are\ndifferent. To understand why the tuned model has chosen this cut-off point, we can\nlook at the right-hand side plot that plots the objective score, which is exactly\nthe same as our business metric. We see that the optimum threshold corresponds to the\nmaximum of the objective score.
This maximum is reached for a decision threshold\nmuch lower than 0.5: the tuned model enjoys a much higher recall at the cost of\na significantly lower precision: the tuned model is much more eager to\npredict the \"bad\" class label for a larger fraction of individuals.\n\nWe can now check if choosing this cut-off point leads to a better score on the testing\nset:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(f\"Business defined metric: {scoring['credit_gain'](tuned_model, X_test, y_test)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe that tuning the decision threshold improves our business gains\nby almost a factor of 2.\n\n\n### Considerations regarding model refitting and cross-validation\n\nIn the above experiment, we used the default setting of the\n:class:`~sklearn.model_selection.TunedThresholdClassifierCV`. In particular, the\ncut-off point is tuned using a 5-fold stratified cross-validation. Also, the\nunderlying predictive model is refitted on the entire training data once the cut-off\npoint is chosen.\n\nThese two strategies can be changed by providing the `refit` and `cv` parameters.\nFor instance, one could provide a fitted `estimator` and set `cv=\"prefit\"`, in which\ncase the cut-off point is found on the entire dataset provided at fitting time.\nAlso, the underlying classifier is not refitted when setting `refit=False`. Here, we\ntry such an experiment.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model.fit(X_train, y_train)\ntuned_model.set_params(cv=\"prefit\", refit=False).fit(X_train, y_train)\nprint(f\"{tuned_model.best_threshold_=:0.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we evaluate our model with the same approach as before:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "title = \"Tuned GBDT model without refitting and using the entire dataset\"\nplot_roc_pr_curves(model, tuned_model, title=title)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe that the optimum cut-off point is different from the one found\nin the previous experiment. If we look at the right-hand side plot, we\nobserve that the business gain has a large plateau of near-optimal 0 gain for a\nlarge span of decision thresholds. This behavior is symptomatic of\noverfitting. Because we disabled cross-validation, we tuned the cut-off point\non the same set as the model was trained on, and this is the reason for the\nobserved overfitting.\n\nThis option should therefore be used with caution. One needs to make sure that the\ndata provided at fitting time to the\n:class:`~sklearn.model_selection.TunedThresholdClassifierCV` is not the same as the\ndata used to train the underlying classifier. This can happen when the\nidea is just to tune the predictive model on a completely new validation set without a\ncostly complete refit.\n\nWhen cross-validation is too costly, a potential alternative is to use a\nsingle train-test split by providing a floating-point number in the range `[0, 1]` to the `cv`\nparameter. It splits the data into a training and a testing set.
Let's explore this\noption:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tuned_model.set_params(cv=0.75).fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "title = \"Tuned GBDT model without refitting and using a single train-test split\"\nplot_roc_pr_curves(model, tuned_model, title=title)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Regarding the cut-off point, we observe that the optimum is similar to the one found in the\ncross-validated case. However, be aware that a single split does not account\nfor the variability of the fit/predict process and thus we are unable to know if there\nis any variance in the cut-off point. The cross-validation averages out\nthis effect.\n\nAnother observation concerns the ROC and Precision-Recall curves of the tuned model.\nAs expected, these curves differ from those of the vanilla model, given that we\ntrained the underlying classifier on a subset of the data provided during fitting and\nreserved a validation set for tuning the cut-off point.\n\n## Cost-sensitive learning when gains and costs are not constant\n\nAs stated in [2]_, gains and costs are generally not constant in real-world problems.\nIn this section, we use a similar example as in [2]_ for the problem of\ndetecting fraud in credit card transaction records.\n\n### The credit card dataset\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "credit_card = fetch_openml(data_id=1597, as_frame=True, parser=\"pandas\")\ncredit_card.frame.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset contains information about credit card records, of which some are\nfraudulent and others are legitimate. The goal is therefore to predict whether or\nnot a credit card record is fraudulent.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "columns_to_drop = [\"Class\"]\ndata = credit_card.frame.drop(columns=columns_to_drop)\ntarget = credit_card.frame[\"Class\"].astype(int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we check the class distribution of the dataset.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "target.value_counts(normalize=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset is highly imbalanced, with fraudulent transactions representing only 0.17%\nof the data. Since we are interested in training a machine learning model, we should\nalso make sure that we have enough samples in the minority class to train the model.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "target.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe that we have around 500 samples in the minority class, which is on the low end of the number of\nsamples required to train a machine learning model.
In addition to the target\ndistribution, we check the distribution of the amount of the\nfraudulent transactions.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fraud = target == 1\namount_fraud = data[\"Amount\"][fraud]\n_, ax = plt.subplots()\nax.hist(amount_fraud, bins=30)\nax.set_title(\"Amount of fraudulent transactions\")\n_ = ax.set_xlabel(\"Amount (\u20ac)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Addressing the problem with a business metric\n\nNow, we create the business metric that depends on the amount of each transaction. We\ndefine the cost matrix similarly to [2]_. Accepting a legitimate transaction provides\na gain of 2% of the amount of the transaction. However, accepting a fraudulent\ntransaction results in a loss of the amount of the transaction. As stated in [2]_, the\ngain and loss related to refusals (of fraudulent and legitimate transactions) are not\ntrivial to define. Here, we estimate that refusing a legitimate transaction\nresults in a loss of 5\u20ac while refusing a fraudulent transaction\nresults in a gain of 50\u20ac. Therefore, we define the following function to\ncompute the total benefit of a given decision:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def business_metric(y_true, y_pred, amount):\n    mask_true_positive = (y_true == 1) & (y_pred == 1)\n    mask_true_negative = (y_true == 0) & (y_pred == 0)\n    mask_false_positive = (y_true == 0) & (y_pred == 1)\n    mask_false_negative = (y_true == 1) & (y_pred == 0)\n    fraudulent_refuse = mask_true_positive.sum() * 50\n    fraudulent_accept = -amount[mask_false_negative].sum()\n    legitimate_refuse = mask_false_positive.sum() * -5\n    legitimate_accept = (amount[mask_true_negative] * 0.02).sum()\n    return fraudulent_refuse + fraudulent_accept + legitimate_refuse + legitimate_accept" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this business metric, we create a scikit-learn scorer that, given a fitted\nclassifier and a test set, computes the business metric. In this regard, we use\nthe :func:`~sklearn.metrics.make_scorer` factory. The variable `amount` is\nadditional metadata to be passed to the scorer, and we need to use\nmetadata routing to take this information into account.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sklearn.set_config(enable_metadata_routing=True)\nbusiness_scorer = make_scorer(business_metric).set_score_request(amount=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So at this stage, we observe that the amount of the transaction is used twice: once\nas a feature to train our predictive model and once as metadata to compute\nthe business metric and thus the statistical performance of our model. When used as a\nfeature, we are only required to have a column in `data` that contains the amount of\neach transaction. To use this information as metadata, we need to have an external\nvariable that we can pass to the scorer or the model that internally routes this\nmetadata to the scorer.
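Before creating that variable, here is a small toy check of the `business_metric` function defined above (an illustrative addition, not part of the original example), to make its gains and losses concrete:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Illustrative addition: two accepted legitimate transactions of amount 100\n# each gain 2% of their amount, one accepted fraudulent transaction of amount\n# 70 loses its full amount, and one refused fraudulent transaction gains 50,\n# i.e. 2 + 2 - 70 + 50 = -16.\ntoy_true = np.array([0, 0, 1, 1])\ntoy_pred = np.array([0, 0, 0, 1])\ntoy_amount = np.array([100.0, 100.0, 70.0, 30.0])\nbusiness_metric(toy_true, toy_pred, toy_amount)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Back to the external `amount` variable mentioned above.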
So let's create this variable.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "amount = credit_card.frame[\"Amount\"].to_numpy()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n\ndata_train, data_test, target_train, target_test, amount_train, amount_test = (\n train_test_split(\n data, target, amount, stratify=target, test_size=0.5, random_state=42\n )\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first evaluate some baseline policies to serve as reference. Recall that\nclass \"0\" is the legitimate class and class \"1\" is the fraudulent class.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.dummy import DummyClassifier\n\nalways_accept_policy = DummyClassifier(strategy=\"constant\", constant=0)\nalways_accept_policy.fit(data_train, target_train)\nbenefit = business_scorer(\n always_accept_policy, data_test, target_test, amount=amount_test\n)\nprint(f\"Benefit of the 'always accept' policy: {benefit:,.2f}\u20ac\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A policy that considers all transactions as legitimate would create a profit of\naround 220,000\u20ac. We make the same evaluation for a classifier that predicts all\ntransactions as fraudulent.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "always_reject_policy = DummyClassifier(strategy=\"constant\", constant=1)\nalways_reject_policy.fit(data_train, target_train)\nbenefit = business_scorer(\n always_reject_policy, data_test, target_test, amount=amount_test\n)\nprint(f\"Benefit of the 'always reject' policy: {benefit:,.2f}\u20ac\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Such a policy would entail a catastrophic loss: around 670,000\u20ac. This is\nexpected since the vast majority of the transactions are legitimate and the\npolicy would refuse them at a non-trivial cost.\n\nA predictive model that adapts the accept/reject decisions on a per\ntransaction basis should ideally allow us to make a profit larger than the\n220,000\u20ac of the best of our constant baseline policies.\n\nWe start with a logistic regression model with the default decision threshold\nat 0.5. 
Here we tune the hyperparameter `C` of the logistic regression with a\nproper scoring rule (the log loss) to ensure that the model's probabilistic\npredictions returned by its `predict_proba` method are as accurate as\npossible, irrespective of the choice of the decision\nthreshold value.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import GridSearchCV\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import StandardScaler\n\nlogistic_regression = make_pipeline(StandardScaler(), LogisticRegression())\nparam_grid = {\"logisticregression__C\": np.logspace(-6, 6, 13)}\nmodel = GridSearchCV(logistic_regression, param_grid, scoring=\"neg_log_loss\").fit(\n    data_train, target_train\n)\nmodel" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(\n    \"Benefit of logistic regression with default threshold: \"\n    f\"{business_scorer(model, data_test, target_test, amount=amount_test):,.2f}\u20ac\"\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The business metric shows that our predictive model with a default decision\nthreshold is already winning over the baseline in terms of profit and it would already be\nbeneficial to use it to accept or reject transactions instead of\naccepting all transactions.\n\n### Tuning the decision threshold\n\nNow the question is: is our model optimal for the type of decisions that we want to make?\nUp to now, we did not optimize the decision threshold. We use the\n:class:`~sklearn.model_selection.TunedThresholdClassifierCV` to optimize the decision\ngiven our business scorer. To avoid a nested cross-validation, we will use the\nbest estimator found during the previous grid-search.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tuned_model = TunedThresholdClassifierCV(\n    estimator=model.best_estimator_,\n    scoring=business_scorer,\n    thresholds=100,\n    n_jobs=2,\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since our business scorer requires the amount of each transaction, we need to pass\nthis information in the `fit` method. The\n:class:`~sklearn.model_selection.TunedThresholdClassifierCV` is in charge of\nautomatically dispatching this metadata to the underlying scorer.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tuned_model.fit(data_train, target_train, amount=amount_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe that the tuned decision threshold is far away from the default 0.5:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(f\"Tuned decision threshold: {tuned_model.best_threshold_:.2f}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(\n    \"Benefit of logistic regression with a tuned threshold: \"\n    f\"{business_scorer(tuned_model, data_test, target_test, amount=amount_test):,.2f}\u20ac\"\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe that tuning the decision threshold increases the expected profit\nwhen deploying our model - as indicated by the business metric.
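The short recap below (an illustrative addition, not part of the original example) gathers the benefit of each policy considered so far on the same test set:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Illustrative addition: benefit of each policy considered so far, evaluated\n# with the business scorer on the test set.\nfor name, clf in [\n    (\"always accept\", always_accept_policy),\n    (\"always reject\", always_reject_policy),\n    (\"logistic regression, default threshold\", model),\n    (\"logistic regression, tuned threshold\", tuned_model),\n]:\n    benefit = business_scorer(clf, data_test, target_test, amount=amount_test)\n    print(f\"{name}: {benefit:,.2f}\u20ac\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "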
It is therefore\nvaluable, whenever possible, to optimize the decision threshold with respect\nto the business metric.\n\n### Manually setting the decision threshold instead of tuning it\n\nIn the previous example, we used the\n:class:`~sklearn.model_selection.TunedThresholdClassifierCV` to find the optimal\ndecision threshold. However, in some cases, we might have some prior knowledge about\nthe problem at hand and we might be happy to set the decision threshold manually.\n\nThe class :class:`~sklearn.model_selection.FixedThresholdClassifier` allows us to\nmanually set the decision threshold. At prediction time, it behaves as the previous\ntuned model but no search is performed during the fitting process. Note that here\nwe use :class:`~sklearn.frozen.FrozenEstimator` to wrap the predictive model to\navoid any refitting.\n\nHere, we will reuse the decision threshold found in the previous section to create a\nnew model and check that it gives the same results.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.frozen import FrozenEstimator\nfrom sklearn.model_selection import FixedThresholdClassifier\n\nmodel_fixed_threshold = FixedThresholdClassifier(\n    estimator=FrozenEstimator(model), threshold=tuned_model.best_threshold_\n)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "business_score = business_scorer(\n    model_fixed_threshold, data_test, target_test, amount=amount_test\n)\nprint(f\"Benefit of logistic regression with a tuned threshold: {business_score:,.2f}\u20ac\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe that we obtained the exact same results but the fitting process\nwas much faster since we did not perform any threshold search or model refitting.\n\nFinally, the estimate of the (average) business metric itself can be unreliable, in\nparticular when the number of data points in the minority class is very small.\nAny business impact estimated by cross-validation of a business metric on\nhistorical data (offline evaluation) should ideally be confirmed by A/B testing\non live data (online evaluation). Note however that A/B testing models is\nbeyond the scope of the scikit-learn library itself.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }