{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*
\n",
">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Performance_Estimation)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Performance_Estimation) to leverage the power of whylogs and WhyLabs together!*"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Performance Estimation - Estimating Accuracy for Binary Classification Problems"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/experimental/performance_estimation.ipynb)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Once your model is deployed, monitoring its performance plays a crucial role in ensuring the quality of your ML system. To calculate metrics such as accuracy, labels are required. However, in many cases, labels can be unavailable, partially available or come in a delayed fashion.\n",
"\n",
"In this notebook, we will show one possible way of estimating the performance of your model without having access to the labels. We will use the [Ecommerce dataset](https://whylogs.readthedocs.io/en/latest/datasets/ecommerce.html) to demonstrate the process.\n",
"\n",
"We will cover:\n",
"\n",
"- [Importance Weighting for Accuracy Estimation - Rationale](#rationale)\n",
"- [The scenario - Covariate Shift with the Ecommerce Dataset](#scenario)\n",
"- [Using whylogs to estimate accuracy](#whylogs)\n",
"- [Uploading the results to WhyLabs](#whylabs)\n",
"- [Conclusion](#conclusion)\n",
"\n",
"First, let's define the scope of this example:\n",
"\n",
"We are concerned with estimating the __accuracy__ of a __binary classification model__ for an __unlabeled target__ dataset. We will do so by leveraging a __labeled reference__, or baseline, dataset."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Importance Weighting for Accuracy Estimation \n",
"\n",
"As previously stated, the challenge is to estimate the accuracy of a model without having access to the labels.\n",
"\n",
"One way to do so is to use a labeled reference dataset. This can be your test set, or a subset of it. We can then identify ways to segment both the reference and the target datasets. For example, we might segment a dataset according to age, profession, or location. We can then calculate the reference dataset's accuracy for each segment. To finally estimate the overall accuracy of the target dataset, we can use the reference dataset's accuracy as a proxy, and weight it according to the proportion of the target dataset's segments.\n",
"\n",
"Let's see how this works in practice.\n",
"\n",
"### Reference Dataset\n",
"\n",
"Assume we have a reference dataset for which we have labels. We then segment this dataset into 4 different categories: A, B, C, and D. Since we have the labels, we can then calculate the accuracy of each segment. If the chosen segments are mutually exclusive and exhaustive, we can also calculate the reference dataset's overall accuracy by simply weighting the accuracy of each segment by its proportion in the reference dataset. So, if we have the following accuracies and proportions:\n",
"\n",
"> Mutually exclusive and exhaustive segments means that our segments don't overlap with each other, and the sum of the segments equal to the complete dataset.\n",
"\n",
"__Reference Dataset__\n",
"\n",
"| Segment | Accuracy | Proportion |\n",
"|:-------:|:--------:|:----------:|\n",
"| A | 0.92 | 32% |\n",
"| B | 0.56 | 27% |\n",
"| C | 0.67 | 16% |\n",
"| D | 0.75 | 25% |\n",
"\n",
"The overall accuracy of the reference dataset is:\n",
"\n",
"$Acc_{ref} = p_{A}*Acc_{A} + p_{B}*Acc_{B} + p_{C}*Acc_{C} + p_{D}*Acc_{D} $\n",
"\n",
"$Acc_{ref} = 0.32*0.92 + 0.27*0.56 + 0.16*0.67 + 0.25*0.75 = 0.74 $\n",
"\n",
"### Target Dataset\n",
"\n",
"Great! So, how do we use this information to estimate the accuracy of our target dataset?\n",
"\n",
"\n",
"Suppose we have the following information about our target dataset:\n",
"\n",
"__Target Dataset__\n",
"\n",
"| Segment | Accuracy | Proportion |\n",
"|:-------:|:--------:|:----------:|\n",
"| A | ? | 16% |\n",
"| B | ? | 67% |\n",
"| C | ? | 6% |\n",
"| D | ? | 11% |\n",
"\n",
"We can see that the proportions for each segment are different from the reference dataset, with a significant increase in the proportion of data belonging to segment B. We don't have the accuracy for each segment, and we want to estimate the overall accuracy of the target dataset. The intuition is that we can use the reference dataset's accuracy for each segment as a proxy, and weight those accuracies by the proportion of each segment in the target dataset. This will give us an estimate of the overall accuracy of the target dataset.\n",
"\n",
"So, if we denote $\\overline{Acc}_{target}$ as the estimated overall accuracy of the target dataset, we can calculate it as follows:\n",
"\n",
"$\\overline{Acc}_{target} = p_{A}*Acc_{ref_A} + p_{B}*Acc_{ref_B} + p_{C}*Acc_{ref_C} + p_{D}*Acc_{ref_D} $\n",
"\n",
"$\\overline{Acc}_{target} = 0.16*0.92 + 0.67*0.56 + 0.06*0.67 + 0.11*0.75 = 0.645$\n",
"\n",
"Where $Acc_{ref_A}$, $Acc_{ref_B}$, $Acc_{ref_C}$, and $Acc_{ref_D}$ are the accuracies of the reference dataset for each segment."
]
},
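{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, here is the arithmetic above expressed in plain Python. The accuracies and proportions are the illustrative values from the tables, not numbers computed from real data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative per-segment accuracies from the labeled reference dataset\n",
"reference_accuracy = {\"A\": 0.92, \"B\": 0.56, \"C\": 0.67, \"D\": 0.75}\n",
"\n",
"# Segment proportions in the reference and in the (unlabeled) target dataset\n",
"reference_proportion = {\"A\": 0.32, \"B\": 0.27, \"C\": 0.16, \"D\": 0.25}\n",
"target_proportion = {\"A\": 0.16, \"B\": 0.67, \"C\": 0.06, \"D\": 0.11}\n",
"\n",
"# Overall reference accuracy: segment accuracies weighted by reference proportions\n",
"acc_ref = sum(reference_proportion[s] * reference_accuracy[s] for s in reference_accuracy)\n",
"\n",
"# Estimated target accuracy: reference accuracies weighted by target proportions\n",
"acc_target_estimate = sum(target_proportion[s] * reference_accuracy[s] for s in reference_accuracy)\n",
"\n",
"print(f\"Reference accuracy: {acc_ref:.3f}\")                     # ~0.740\n",
"print(f\"Estimated target accuracy: {acc_target_estimate:.3f}\")  # ~0.645"
]
},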
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Considerations\n",
"\n",
"In the example above, we're considering that the main reason for the difference in performance between the reference and target datasets is due to a change in the distribution of the input data. This is known as __covariate shift__.\n",
"\n",
"That said, there are other reasons why the performance of a model can change, such as (but not limited to):\n",
"\n",
"- __Concept drift__: The relationship between the input data and the target variable changes. This is known as __concept drift__.\n",
"- __Covariate shift to unknown regions of the feature space__: Suppose our model was trained of a demographic with age 15-70. If during production we receive data from a demographic with age 0-14 or above 70, the model's performance will likely decrease.\n",
"- __Data quality issues__: Missing values, outliers, data schema changes, etc. can affect the performance of a model.\n",
"\n",
"For all of above examples, it is possible that the importance weighting approach presented here will not yield accurate estimates.\n",
"\n",
"Another important consideration is the importance of choosing the proper segments to perform the importance weighting. The segments in the target dataset must be a subset of the reference dataset, as an unseen segment will not have an associated accuracy. The segments also should ideally have high variance in training accuracies: if all segments have the same accuracy, then weighting them would not make much sense. Additionally, as stated previously, in this example we are assuming that the segments are mutually exclusive and exhaustive."
]
},
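{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration of the subset constraint above, a check along the lines of the hypothetical `check_segments` helper below could be run before any weighting is attempted. This is just a sketch for this notebook, not part of the whylogs API:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sanity check: every target segment must also exist in the reference,\n",
"# otherwise it has no reference accuracy to serve as a proxy.\n",
"def check_segments(reference_segments, target_segments):\n",
"    unseen = set(target_segments) - set(reference_segments)\n",
"    if unseen:\n",
"        raise ValueError(f\"Target segments not found in reference: {unseen}\")\n",
"\n",
"check_segments([\"A\", \"B\", \"C\", \"D\"], [\"A\", \"B\", \"C\", \"D\"])  # passes silently\n",
"# check_segments([\"A\", \"B\"], [\"A\", \"E\"])  # would raise: segment E is unseen"
]
},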
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Scenario \n",
"\n",
"Let's see how this approach works in practice with the [Ecommerce dataset](https://whylogs.readthedocs.io/en/latest/datasets/ecommerce.html).\n",
"\n",
"This dataset contrains transactions made by customers of an online store. The goal is to predict whether a product should be given a discount or not, based on the product's category, market price, product rating and sales history. It is a binary classification task, as the `output_discount` column contains a `1` if the product should be given a discount, and a `0` otherwise.\n",
"\n",
"We will segment the dataset according to the `category` column. This column contains 11 different categories such as `Beverages`, `Vegetables` or `Baby Care`.\n",
"\n",
"We want to simulate a scenario where the distribution of the input data changes, leading to changes in the model's performance. To do so, we will get data for 7 different days, and \"perturb\" the data for each day: we will pick 4 different categories for each day, and randomly subsample each category between a range of 10%-30%: that is, we will end up with 10 to 30% of the original segment size for each subsampled category.\n",
"\n",
"For the reference dataset, we will use the dataset that was originally used to test the model in an experimental (pre-deployment) stage.\n",
"\n",
"In this example, we actually have the labels for the perturbed days. We'll end the example with the plot showing the estimated accuracy vs. the real accuracy for each day."
]
},
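{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the perturbation concrete, here is a minimal sketch of what such a daily perturbation could look like with pandas. The function name `perturb_day` and its exact signature are assumptions for illustration only; the helper actually used in this example is defined further below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"rng = np.random.default_rng(42)\n",
"\n",
"# Hypothetical sketch of the daily perturbation described above\n",
"def perturb_day(df: pd.DataFrame, n_categories: int = 4) -> pd.DataFrame:\n",
"    # Pick a few categories at random for this day\n",
"    picked = rng.choice(df[\"category\"].unique(), size=n_categories, replace=False)\n",
"    parts = []\n",
"    for category, group in df.groupby(\"category\"):\n",
"        if category in picked:\n",
"            # Keep only a random 10%-30% fraction of this segment\n",
"            group = group.sample(frac=rng.uniform(0.1, 0.3))\n",
"        parts.append(group)\n",
"    return pd.concat(parts)"
]
},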
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Accuracy Estimation with whylogs \n",
"\n",
"Let's finally see how we can use whylogs to estimate the accuracy of a model.\n",
"\n",
"first, let's install whylogs and the required extras:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Note: you may need to restart the kernel to use updated packages.\n",
"%pip install 'whylogs[datasets]'"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Getting the Reference and Target Datasets\n",
"\n",
"The first thing we need is to get:\n",
"\n",
"- Reference dataset\n",
"- 7 daily \"perturbed\" target datasets\n",
"\n",
"This dataset is already in the whylogs' `datasets` module, so we'll source it from there. We'll then arrange the data into a more proper format, and then create a function to perturb each day by random subsampling a subset of our categories."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from whylogs.datasets import Ecommerce\n",
"\n",
"dataset = Ecommerce()\n",
"\n",
"baseline = dataset.get_baseline()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | product | \n", "sales_last_week | \n", "market_price | \n", "rating | \n", "category | \n", "output_discount | \n", "output_prediction | \n", "
---|---|---|---|---|---|---|---|
date | \n", "\n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " |
2023-02-28 00:00:00+00:00 | \n", "Wood - Centre Filled Bar Infused With Dark Mou... | \n", "1 | \n", "350.0 | \n", "4.500000 | \n", "Snacks and Branded Foods | \n", "0 | \n", "1 | \n", "
2023-02-28 00:00:00+00:00 | \n", "Toasted Almonds | \n", "1 | \n", "399.0 | \n", "3.944479 | \n", "Gourmet and World Food | \n", "1 | \n", "0 | \n", "
2023-02-28 00:00:00+00:00 | \n", "Instant Thai Noodles - Hot & Spicy Tomyum | \n", "1 | \n", "95.0 | \n", "3.300000 | \n", "Gourmet and World Food | \n", "0 | \n", "0 | \n", "
2023-02-28 00:00:00+00:00 | \n", "Thokku - Vathakozhambu | \n", "1 | \n", "336.0 | \n", "4.300000 | \n", "Snacks and Branded Foods | \n", "0 | \n", "1 | \n", "
2023-02-28 00:00:00+00:00 | \n", "Beetroot Powder | \n", "1 | \n", "150.0 | \n", "3.944479 | \n", "Gourmet and World Food | \n", "0 | \n", "0 | \n", "