{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ ">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*
\n", ">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Performance_Estimation)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Performance_Estimation) to leverage the power of whylogs and WhyLabs together!*" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Performance Estimation - Estimating Accuracy for Binary Classification Problems" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/experimental/performance_estimation.ipynb)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Once your model is deployed, monitoring its performance plays a crucial role in ensuring the quality of your ML system. To calculate metrics such as accuracy, labels are required. However, in many cases, labels can be unavailable, partially available or come in a delayed fashion.\n", "\n", "In this notebook, we will show one possible way of estimating the performance of your model without having access to the labels. 
We will use the [Ecommerce dataset](https://whylogs.readthedocs.io/en/latest/datasets/ecommerce.html) to demonstrate the process.\n", "\n", "We will cover:\n", "\n", "- [Importance Weighting for Accuracy Estimation - Rationale](#rationale)\n", "- [The scenario - Covariate Shift with the Ecommerce Dataset](#scenario)\n", "- [Using whylogs to estimate accuracy](#whylogs)\n", "- [Uploading the results to WhyLabs](#whylabs)\n", "- [Conclusion](#conclusion)\n", "\n", "First, let's define the scope of this example:\n", "\n", "We are concerned with estimating the __accuracy__ of a __binary classification model__ for an __unlabeled target__ dataset. We will do so by leveraging a __labeled reference__, or baseline, dataset." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Importance Weighting for Accuracy Estimation \n", "\n", "As previously stated, the challenge is to estimate the accuracy of a model without having access to the labels.\n", "\n", "One way to do so is to use a labeled reference dataset. This can be your test set, or a subset of it. We can then identify ways to segment both the reference and the target datasets. For example, we might segment a dataset according to age, profession, or location. We can then calculate the reference dataset's accuracy for each segment. To finally estimate the overall accuracy of the target dataset, we can use the reference dataset's accuracy as a proxy, and weight it according to the proportion of the target dataset's segments.\n", "\n", "Let's see how this works in practice.\n", "\n", "### Reference Dataset\n", "\n", "Assume we have a reference dataset for which we have labels. We then segment this dataset into 4 different categories: A, B, C, and D. Since we have the labels, we can then calculate the accuracy of each segment. 
If the chosen segments are mutually exclusive and exhaustive, we can also calculate the reference dataset's overall accuracy by simply weighting the accuracy of each segment by its proportion in the reference dataset.\n", "\n", "> Mutually exclusive and exhaustive means that the segments don't overlap with each other, and that together they cover the complete dataset.\n", "\n", "Suppose we have the following accuracies and proportions:\n", "\n", "__Reference Dataset__\n", "\n", "| Segment | Accuracy | Proportion |\n", "|:-------:|:--------:|:----------:|\n", "| A | 0.92 | 32% |\n", "| B | 0.56 | 27% |\n", "| C | 0.67 | 16% |\n", "| D | 0.75 | 25% |\n", "\n", "The overall accuracy of the reference dataset is:\n", "\n", "$Acc_{ref} = p_{A}*Acc_{A} + p_{B}*Acc_{B} + p_{C}*Acc_{C} + p_{D}*Acc_{D} $\n", "\n", "$Acc_{ref} = 0.32*0.92 + 0.27*0.56 + 0.16*0.67 + 0.25*0.75 \\approx 0.74 $\n", "\n", "### Target Dataset\n", "\n", "Great! So, how do we use this information to estimate the accuracy of our target dataset?\n", "\n", "\n", "Suppose we have the following information about our target dataset:\n", "\n", "__Target Dataset__\n", "\n", "| Segment | Accuracy | Proportion |\n", "|:-------:|:--------:|:----------:|\n", "| A | ? | 16% |\n", "| B | ? | 67% |\n", "| C | ? | 6% |\n", "| D | ? | 11% |\n", "\n", "We can see that the proportions for each segment are different from the reference dataset, with a significant increase in the proportion of data belonging to segment B. We don't have the accuracy for each segment, and we want to estimate the overall accuracy of the target dataset. The intuition is that we can use the reference dataset's accuracy for each segment as a proxy, and weight those accuracies by the proportion of each segment in the target dataset. 
This will give us an estimate of the overall accuracy of the target dataset.\n", "\n", "So, if we denote $\\overline{Acc}_{target}$ as the estimated overall accuracy of the target dataset, we can calculate it as follows:\n", "\n", "$\\overline{Acc}_{target} = p_{A}*Acc_{ref_A} + p_{B}*Acc_{ref_B} + p_{C}*Acc_{ref_C} + p_{D}*Acc_{ref_D} $\n", "\n", "$\\overline{Acc}_{target} = 0.16*0.92 + 0.67*0.56 + 0.06*0.67 + 0.11*0.75 \\approx 0.645$\n", "\n", "Where $Acc_{ref_A}$, $Acc_{ref_B}$, $Acc_{ref_C}$, and $Acc_{ref_D}$ are the accuracies of the reference dataset for each segment, and $p_{A}$, $p_{B}$, $p_{C}$, and $p_{D}$ are now the proportions of each segment in the target dataset." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Considerations\n", "\n", "In the example above, we're assuming that the main reason for the difference in performance between the reference and target datasets is a change in the distribution of the input data. This is known as __covariate shift__.\n", "\n", "That said, there are other reasons why the performance of a model can change, such as (but not limited to):\n", "\n", "- __Concept drift__: The relationship between the input data and the target variable changes.\n", "- __Covariate shift to unknown regions of the feature space__: Suppose our model was trained on a demographic with age 15-70. If during production we receive data from a demographic with age 0-14 or above 70, the model's performance will likely decrease.\n", "- __Data quality issues__: Missing values, outliers, data schema changes, etc. can affect the performance of a model.\n", "\n", "In all of the cases above, the importance weighting approach presented here may not yield accurate estimates.\n", "\n", "Another important consideration is the choice of segments used for the importance weighting. The segments in the target dataset must be a subset of the segments in the reference dataset, as an unseen segment will not have an associated accuracy. 
The segments should also ideally differ in their reference accuracies: if all segments have the same accuracy, the weighted estimate is identical for any mix of segments and tells us nothing about the shift. Additionally, as stated previously, in this example we are assuming that the segments are mutually exclusive and exhaustive." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## The Scenario \n", "\n", "Let's see how this approach works in practice with the [Ecommerce dataset](https://whylogs.readthedocs.io/en/latest/datasets/ecommerce.html).\n", "\n", "This dataset contains transactions made by customers of an online store. The goal is to predict whether a product should be given a discount or not, based on the product's category, market price, product rating and sales history. It is a binary classification task, as the `output_discount` column contains a `1` if the product should be given a discount, and a `0` otherwise.\n", "\n", "We will segment the dataset according to the `category` column. This column contains 11 different categories, such as `Beverages`, `Fruits and Vegetables` or `Baby Care`.\n", "\n", "We want to simulate a scenario where the distribution of the input data changes, leading to changes in the model's performance. To do so, we will get data for 7 different days, and \"perturb\" the data for each day: we will pick 3 or 4 categories for each day, and randomly subsample each of them so that we end up with 10% to 30% of the original segment size for each subsampled category.\n", "\n", "For the reference dataset, we will use the dataset that was originally used to test the model in an experimental (pre-deployment) stage.\n", "\n", "In this example, we actually have the labels for the perturbed days, so we'll end the example with a plot showing the estimated accuracy vs. the real accuracy for each day."
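] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Before turning to whylogs, the importance-weighting arithmetic from the A/B/C/D example can be reproduced in a few lines of plain Python. This is only an illustrative sketch: the dictionaries restate the two tables from the rationale section, and the helper `weighted_accuracy` is a hypothetical name that is not part of whylogs.\n", "\n", "
```python
# Per-segment accuracy measured on the labeled reference dataset (from the tables above).
reference_accuracy = {'A': 0.92, 'B': 0.56, 'C': 0.67, 'D': 0.75}

# Segment proportions in each dataset; each set of proportions sums to 1.
reference_proportion = {'A': 0.32, 'B': 0.27, 'C': 0.16, 'D': 0.25}
target_proportion = {'A': 0.16, 'B': 0.67, 'C': 0.06, 'D': 0.11}


def weighted_accuracy(accuracy_by_segment, proportion_by_segment):
    '''Weight each segment's reference accuracy by the given proportions.'''
    # A segment that is missing from the reference has no accuracy to borrow.
    missing = set(proportion_by_segment) - set(accuracy_by_segment)
    if missing:
        raise ValueError(f'segments without a reference accuracy: {sorted(missing)}')
    return sum(p * accuracy_by_segment[s] for s, p in proportion_by_segment.items())


acc_ref = weighted_accuracy(reference_accuracy, reference_proportion)
acc_target_estimate = weighted_accuracy(reference_accuracy, target_proportion)
print(f'reference accuracy: {acc_ref:.4f}')
print(f'estimated target accuracy: {acc_target_estimate:.4f}')
```
\n", "\n", "The only thing that changes between the two calls is the set of proportions, which is why the estimate can only react to shifts between segments and not to changes within a segment."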
] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Accuracy Estimation with whylogs \n", "\n", "Let's finally see how we can use whylogs to estimate the accuracy of a model.\n", "\n", "First, let's install whylogs and the required extras:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Note: you may need to restart the kernel to use updated packages.\n", "%pip install 'whylogs[datasets]'" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Getting the Reference and Target Datasets\n", "\n", "The first thing we need is to get:\n", "\n", "- Reference dataset\n", "- 7 daily \"perturbed\" target datasets\n", "\n", "This dataset is already in the whylogs `datasets` module, so we'll source it from there. We'll then arrange each batch into a single dataframe, and create a function to perturb each day by randomly subsampling a subset of our categories." ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from whylogs.datasets import Ecommerce\n", "\n", "dataset = Ecommerce()\n", "\n", "baseline = dataset.get_baseline()" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table>\n", "  <thead>\n", "    <tr><th>date</th><th>product</th><th>sales_last_week</th><th>market_price</th><th>rating</th><th>category</th><th>output_discount</th><th>output_prediction</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>2023-02-28 00:00:00+00:00</th><td>Wood - Centre Filled Bar Infused With Dark Mou...</td><td>1</td><td>350.0</td><td>4.500000</td><td>Snacks and Branded Foods</td><td>0</td><td>1</td></tr>\n", "    <tr><th>2023-02-28 00:00:00+00:00</th><td>Toasted Almonds</td><td>1</td><td>399.0</td><td>3.944479</td><td>Gourmet and World Food</td><td>1</td><td>0</td></tr>\n", "    <tr><th>2023-02-28 00:00:00+00:00</th><td>Instant Thai Noodles - Hot &amp; Spicy Tomyum</td><td>1</td><td>95.0</td><td>3.300000</td><td>Gourmet and World Food</td><td>0</td><td>0</td></tr>\n", "    <tr><th>2023-02-28 00:00:00+00:00</th><td>Thokku - Vathakozhambu</td><td>1</td><td>336.0</td><td>4.300000</td><td>Snacks and Branded Foods</td><td>0</td><td>1</td></tr>\n", "    <tr><th>2023-02-28 00:00:00+00:00</th><td>Beetroot Powder</td><td>1</td><td>150.0</td><td>3.944479</td><td>Gourmet and World Food</td><td>0</td><td>0</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " product \\\n", "date \n", "2023-02-28 00:00:00+00:00 Wood - Centre Filled Bar Infused With Dark Mou... \n", "2023-02-28 00:00:00+00:00 Toasted Almonds \n", "2023-02-28 00:00:00+00:00 Instant Thai Noodles - Hot & Spicy Tomyum \n", "2023-02-28 00:00:00+00:00 Thokku - Vathakozhambu \n", "2023-02-28 00:00:00+00:00 Beetroot Powder \n", "\n", " sales_last_week market_price rating \\\n", "date \n", "2023-02-28 00:00:00+00:00 1 350.0 4.500000 \n", "2023-02-28 00:00:00+00:00 1 399.0 3.944479 \n", "2023-02-28 00:00:00+00:00 1 95.0 3.300000 \n", "2023-02-28 00:00:00+00:00 1 336.0 4.300000 \n", "2023-02-28 00:00:00+00:00 1 150.0 3.944479 \n", "\n", " category output_discount \\\n", "date \n", "2023-02-28 00:00:00+00:00 Snacks and Branded Foods 0 \n", "2023-02-28 00:00:00+00:00 Gourmet and World Food 1 \n", "2023-02-28 00:00:00+00:00 Gourmet and World Food 0 \n", "2023-02-28 00:00:00+00:00 Snacks and Branded Foods 0 \n", "2023-02-28 00:00:00+00:00 Gourmet and World Food 0 \n", "\n", " output_prediction \n", "date \n", "2023-02-28 00:00:00+00:00 1 \n", "2023-02-28 00:00:00+00:00 0 \n", "2023-02-28 00:00:00+00:00 0 \n", "2023-02-28 00:00:00+00:00 1 \n", "2023-02-28 00:00:00+00:00 0 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "pd.options.mode.chained_assignment = None  # default='warn'\n", "\n", "def arrange_df(batch):\n", "    # copy the features so that the batch itself is left untouched\n", "    df = batch.features.copy()\n", "    df['output_discount'] = batch.target['output_discount']\n", "    df['output_prediction'] = batch.prediction['output_prediction']\n", "    return df\n", "\n", "reference_df = arrange_df(baseline)\n", "reference_df.head()" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "These are the categories that will be randomly subsampled for each day:" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "perturbations_by_day = {\n", "    0 : ['Kitchen, Garden and Pets','Beauty and 
Hygiene', 'Fruits and Vegetables','Bakery, Cakes and Dairy'],\n", "    1 : ['Snacks and Branded Foods','Beverages', 'Baby Care', 'Gourmet and World Food'],\n", "    2 : ['Beauty and Hygiene','Kitchen, Garden and Pets', 'Bakery, Cakes and Dairy','Fruits and Vegetables'],\n", "    3 : ['Foodgrains, Oil and Masala','Cleaning and Household','Eggs, Meat and Fish','Bakery, Cakes and Dairy'],\n", "    4 : ['Cleaning and Household','Gourmet and World Food','Kitchen, Garden and Pets','Beauty and Hygiene'],\n", "    5 : ['Baby Care','Bakery, Cakes and Dairy','Kitchen, Garden and Pets'],\n", "    6 : ['Beverages', 'Eggs, Meat and Fish', 'Foodgrains, Oil and Masala'],\n", "\n", "}" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "And the following function will be used to perturb the data:" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "\n", "def random_subsample_on_column(df, column, lower_pct=0.1, upper_pct=0.3, classes='all'):\n", "    \"\"\"Subsample each class in a column to a random fraction of its original size.\n", "\n", "    The fraction is sampled uniformly between lower_pct and upper_pct.\n", "    If classes is not 'all', then only subsample the classes in classes.\n", "\n", "    Args:\n", "        df (pd.DataFrame): The dataframe to subsample.\n", "        column (str): The column to subsample on.\n", "        lower_pct (float): The lower bound of the fraction to keep.\n", "        upper_pct (float): The upper bound of the fraction to keep.\n", "        classes (list or str): The classes to subsample. If 'all', then subsample all classes.\n", "\n", "    \"\"\"\n", "    if isinstance(classes, str) and classes == 'all':\n", "        class_names = df[column].unique()\n", "    elif isinstance(classes, list):\n", "        assert all(c in df[column].unique() for c in classes), \"Classes must be in the column\"\n", "        class_names = classes\n", "    else:\n", "        raise TypeError(\"classes must be 'all' or a list of class names\")\n", "    for c in class_names:\n", "        sub_df = df.loc[df[column] == c]\n", "        # number of rows of this class to keep (between lower_pct and upper_pct of its size)\n", "        n = int(len(sub_df) * (lower_pct + (upper_pct - lower_pct) * np.random.random()))\n", "        # DataFrame.append was removed in pandas 2.0, so use pd.concat instead\n", "        df = pd.concat([df.loc[df[column] != c], sub_df.sample(n=n)])\n", "    return df" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Let's get our perturbed dfs: " ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "batches = dataset.get_inference_data(number_batches=7)\n", "\n", "perturbed_dfs = []\n", "for day, batch in enumerate(batches):\n", "    unperturbed_df = arrange_df(batch)\n", "    perturbed_df = random_subsample_on_column(unperturbed_df, 'category', lower_pct=0.1, upper_pct=0.3, classes=perturbations_by_day[day])\n", "    perturbed_dfs.append(perturbed_df)" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### whylogs Profiling\n", "\n", "We'll start by profiling the reference dataset, and then the target datasets. We'll use the `category` column as the segment column, and we'll use the `output_discount` column as the target column.\n", "\n", "\n", "For each profiling process, we'll end up with 11 different whylogs profiles - one for each category. Those 11 profiles will be encapsulated in a `SegmentedResultSet`. 
Those sets are what we'll use to perform the accuracy estimation.\n" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import whylogs as why\n", "from whylogs.core.segmentation_partition import segment_on_column\n", "from whylogs.core.schema import DatasetSchema\n", "\n", "def log_dataset(df, labeled=True):\n", "    segment_column = \"category\"\n", "    segmented_schema = DatasetSchema(segments=segment_on_column(segment_column))\n", "\n", "    # Just to be sure that we're not using actual labels/metrics for the target dataset.\n", "    if labeled:\n", "        results = why.log_classification_metrics(\n", "            df,\n", "            target_column = \"output_discount\",\n", "            prediction_column = \"output_prediction\",\n", "            schema=segmented_schema,\n", "            log_full_data=True\n", "        )\n", "        return results\n", "    else:\n", "        results = why.log(df, schema=segmented_schema)\n", "        return results\n", "\n", "reference_results = log_dataset(reference_df, labeled=True)\n", "perturbed_results_list = [log_dataset(perturbed_df, labeled=False) for perturbed_df in perturbed_dfs]" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Note that for the reference data we're logging by calling `why.log_classification_metrics`, which will give us access to the performance metrics. As for the target data, since we don't have labels, we're logging by calling `why.log`, which will give us access to the counts for each segment."
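] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To make the idea of per-segment counts concrete, the sketch below derives segment proportions from a plain list of category values. It is a standard-library illustration only: the toy values are made up, and this is not how whylogs stores or reads the counts inside a segmented result set.\n", "\n", "
```python
from collections import Counter

# Toy stand-in for the category column of an unlabeled target batch (hypothetical values).
target_categories = [
    'Beverages', 'Baby Care', 'Beverages', 'Snacks and Branded Foods',
    'Beverages', 'Baby Care', 'Beverages', 'Snacks and Branded Foods',
]

# Row count per segment: the only information the weighting needs from the target data.
segment_counts = Counter(target_categories)
total_rows = sum(segment_counts.values())

# Turn the counts into proportions that sum to 1.
segment_proportions = {
    segment: count / total_rows for segment, count in segment_counts.items()
}

for segment, proportion in sorted(segment_proportions.items()):
    print(f'{segment}: {segment_counts[segment]} rows, {proportion:.0%} of the batch')
```
\n", "\n", "No labels are involved at any point, which is why the target data can be logged with `why.log` alone."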
] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Estimating accuracy with the `performance_estimation` module\n", "\n", "Once we have the result sets available for the reference and target datasets, we can use the `AccuracyEstimator` from the `performance_estimation` module to estimate the accuracy of the target dataset.\n", "\n", "Let's do it for the first day to demonstrate:" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Day 0 accuracy - estimated: 0.6387135902595148\n" ] } ], "source": [ "from whylogs.experimental.performance_estimation.estimators import AccuracyEstimator\n", "\n", "estimator = AccuracyEstimator(reference_result_set = reference_results)\n", "\n", "first_day_result = perturbed_results_list[0]\n", "estimation_result = estimator.estimate(first_day_result)\n", "\n", "print(f\"Day 0 accuracy - estimated: {estimation_result.accuracy}\")" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "When initializing the estimator, we pass our reference result set. It will leverage the performance metrics for each segment's profiles to perform the accuracy estimation.\n", "\n", "When asked for an estimation for the target dataset, it will weight the reference result accuracies by the proportion of each segment in the target dataset, and return the estimated accuracy." ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Is this a good estimate? Let's see how it compares to the real accuracy.\n", "\n", "### Plotting the estimated accuracy vs. the real accuracy\n", "\n", "\n", "To plot the accuracies for the 7 days, let's define the different types of accuracies in the plot:\n", "\n", "- Real accuracy: The real accuracy for the perturbed data. 
This would only be available if we had all the labels, which is exactly what we lack in practice.\n", "- Estimated accuracy: The accuracy estimated by the `AccuracyEstimator`.\n", "- Pre-deployment accuracy: If we had neither the labels nor any kind of estimation, our best guess would be to assume the model's performance is the same as it was during the pre-deployment stage. This is the accuracy we would use if we didn't have any other information.\n", "\n", "Let's first get a list for all 3 different types of accuracies for each day. Then we'll plot them in a line chart.\n" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Day 0 accuracy - real: 0.644542772861357, estimated: 0.6449857773314478, pre-deploy: 0.6741501885271853\n", "Day 1 accuracy - real: 0.6866051543111676, estimated: 0.6919219880068938, pre-deploy: 0.6741501885271853\n", "Day 2 accuracy - real: 0.6291828793774319, estimated: 0.6398225939212645, pre-deploy: 0.6741501885271853\n", "Day 3 accuracy - real: 0.6890386869871044, estimated: 0.684898937625555, pre-deploy: 0.6741501885271853\n", "Day 4 accuracy - real: 0.6733477789815818, estimated: 0.6626462398863927, pre-deploy: 0.6741501885271853\n", "Day 5 accuracy - real: 0.665058189043429, estimated: 0.6551432545163998, pre-deploy: 0.6741501885271853\n", "Day 6 accuracy - real: 0.6886766712141883, estimated: 0.6867237421366811, pre-deploy: 0.6741501885271853\n" ] } ], "source": [ "from whylogs.experimental.performance_estimation.estimators import AccuracyEstimator\n", "import pandas as pd\n", "\n", "pd.options.mode.chained_assignment = None  # default='warn'\n", "\n", "\n", "def calculate_real_accuracy(df):\n", "    # fraction of rows where the prediction matches the label\n", "    return float((df['output_discount'] == df['output_prediction']).mean())\n", "\n", "\n", "estimator = AccuracyEstimator(reference_result_set = reference_results)\n", "\n", "# the reference accuracy does not change from day to day, so compute it once\n", "pre_deployment_acc = calculate_real_accuracy(reference_df)\n", "\n", "pre_deployment_accs = []\n", "real_accs = []\n", "estimated_accs = []\n", "for day, (perturbed_results, perturbed_df) in enumerate(zip(perturbed_results_list, perturbed_dfs)):\n", "    real_acc = calculate_real_accuracy(perturbed_df)\n", "    estimation_result = estimator.estimate(perturbed_results)\n", "    estimated_acc = estimation_result.accuracy\n", "\n", "    pre_deployment_accs.append(pre_deployment_acc)\n", "    real_accs.append(real_acc)\n", "    estimated_accs.append(estimated_acc)\n", "    print(f\"Day {day} accuracy - real: {real_acc}, estimated: {estimated_acc}, pre-deploy: {pre_deployment_acc}\")" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now we can plot the accuracies for each day:" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Estimated MSE: 5.36355459317329e-05, Pre-deployment MSE: 0.0005099683612281597\n" ] } ], "source": [ "import matplotlib.pyplot as plt\n", "# plot pre-deployment, real and estimated accuracy\n", "plt.plot(pre_deployment_accs, label='pre-deployment')\n", "plt.plot(real_accs, label='real')\n", "plt.plot(estimated_accs, label='estimated')\n", "plt.xlabel('day')\n", "plt.ylabel('accuracy')\n", "plt.legend()\n", "plt.show()\n", "\n", "mse = np.mean((np.array(real_accs) - np.array(estimated_accs))**2)\n", "baseline_mse = np.mean((np.array(real_accs) - np.array(pre_deployment_accs))**2)\n", "\n", "print(f\"Estimated MSE: {mse}, Pre-deployment MSE: {baseline_mse}\")" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The __Estimated MSE__ is the mean squared error between Estimated and Real accuracies, while the __Pre-deployment MSE__ is the mean squared error between Pre-deployment and Real accuracies.\n", "\n", "As we can see, even though the estimates are not perfect, they are much closer to the real accuracies than the pre-deployment accuracy is. The seeds are not fixed in this example, so rerunning it will give different results. In our runs, the error for the estimated accuracies is usually roughly one order of magnitude smaller than the error for the pre-deployment accuracy (in the run above, about 5.4e-05 versus 5.1e-04). This is a sign that the accuracy estimation is working well for this kind of shift."
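] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "As a cross-check that does not depend on rerunning the notebook, the two MSE values can be recomputed from the per-day accuracies printed above. This is a plain-Python sketch: the numbers are copied from that printed output, and `mean_squared_error` is a hypothetical helper written for this illustration.\n", "\n", "
```python
# Accuracies copied from the printed output of the run above; a rerun gives other values.
real = [0.644542772861357, 0.6866051543111676, 0.6291828793774319, 0.6890386869871044,
        0.6733477789815818, 0.665058189043429, 0.6886766712141883]
estimated = [0.6449857773314478, 0.6919219880068938, 0.6398225939212645, 0.684898937625555,
             0.6626462398863927, 0.6551432545163998, 0.6867237421366811]
pre_deployment = 0.6741501885271853  # the same value is used for every day


def mean_squared_error(actual, predicted):
    '''Average of the squared differences between two equally long sequences.'''
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


mse_estimated = mean_squared_error(real, estimated)
mse_pre_deployment = mean_squared_error(real, [pre_deployment] * len(real))
ratio = mse_pre_deployment / mse_estimated

print(f'estimated MSE: {mse_estimated:.3e}')
print(f'pre-deployment MSE: {mse_pre_deployment:.3e}')
print(f'pre-deployment MSE is {ratio:.1f}x the estimated MSE')
```
\n", "\n", "Looking at the ratio rather than at the two raw values makes it easier to compare runs, since both errors move when the random subsampling changes."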
] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Uploading the results to WhyLabs \n", "\n", "Let's see how to have the estimated accuracy available in your WhyLabs dashboard.\n", "\n", "First, let's set our WhyLabs environment variables:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import getpass\n", "import os\n", "\n", "# set your org-id here - should be something like \"org-xxxx\"\n", "print(\"Enter your WhyLabs Org ID\")\n", "os.environ[\"WHYLABS_DEFAULT_ORG_ID\"] = input()\n", "\n", "# set your dataset_id (or model_id) here - should be something like \"model-xxxx\"\n", "print(\"Enter your WhyLabs Dataset ID\")\n", "os.environ[\"WHYLABS_DEFAULT_DATASET_ID\"] = input()\n", "\n", "\n", "# set your API key here\n", "print(\"Enter your WhyLabs API key\")\n", "os.environ[\"WHYLABS_API_KEY\"] = getpass.getpass()\n", "print(\"Using API Key ID: \", os.environ[\"WHYLABS_API_KEY\"][0:10])" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's iterate through the days again. 
The difference this time is that we'll write both the result set and the estimated result to WhyLabs.\n", "\n", "Just be sure to set the result set's timestamp before performing the estimation, so that the estimated result will have the same timestamp as the result set.\n" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "from datetime import datetime, timedelta, timezone\n", "from whylogs.experimental.performance_estimation.estimators import AccuracyEstimator\n", "\n", "estimator = AccuracyEstimator(reference_result_set = reference_results)\n", "\n", "for day, perturbed_results in enumerate(perturbed_results_list):\n", "    # take the current time in UTC directly, instead of relabeling local time as UTC\n", "    dataset_timestamp = datetime.now(timezone.utc) - timedelta(days=day)\n", "\n", "    perturbed_results.set_dataset_timestamp(dataset_timestamp)\n", "\n", "    # logging the data\n", "    perturbed_results.writer(\"whylabs\").write()\n", "    estimator.estimate(perturbed_results).writer(\"whylabs\").write()\n" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "You will find the estimated accuracy in the `Outputs` tab of your WhyLabs dashboard. Since we're logging a single row for each day, the `estimated median` graph will show the estimated accuracy for each day:\n", "\n", "![alt text](images/accuracy_estimation.png)\n" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> The `performance_estimation` module is still an experimental feature, and the WhyLabs support for it is still under development. It is likely that both whylogs and WhyLabs will evolve to better support the performance estimation use case."
] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion \n", "\n", "In this example, we showed how we can use whylogs to estimate the accuracy of a model when the distribution of the input data changes, and when we don't have labels available to calculate the real accuracy. We used the `AccuracyEstimator` from the `performance_estimation` module to estimate the accuracy of a model, and we compared the estimated accuracy to the real accuracy. Even when we do have the labels, or a subset of them, it can still be useful to draw estimates to help debug and analyze the root cause of possible changes in performance.\n", "\n", "It's also important to note the approach's limitations and assumptions, with regard to the root cause of changes in the performance and the importance of properly choosing the segments to perform the importance weighting.\n", "\n", "The estimator is still in beta and we are always working on improving it. We'd love to hear your feedback!" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "- [Shankar, Shreya, and Aditya Parameswaran. \"Towards Observability for Production Machine Learning Pipelines.\"](https://arxiv.org/pdf/2108.13557.pdf)\n", "- [O'Reilly event - Monitor Real-Time Machine Learning Performance](https://learning.oreilly.com/live-events/monitor-real-time-machine-learning-performance/0636920075104/0636920075102/)\n", "- [whylogs - Ecommerce dataset](https://whylogs.readthedocs.io/en/latest/datasets/ecommerce.html)" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "5dd5901cadfd4b29c2aaf95ecd29c0c3b10829ad94dcfe59437dbee391154aea" } } }, "nbformat": 4, "nbformat_minor": 2 }