{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*\n",
">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Drift_Configuration)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Drift_Configuration) to leverage the power of whylogs and WhyLabs together!*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Drift Algorithm Configuration"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/advanced/Drift_Algorithm_Configuration.ipynb)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"In whylogs, you can calculate drift scores and generate a summary drift report between two profiles, as shown in the [Notebook Profile Visualizer example](-).\n",
"\n",
"In this example, we will show you how to apply drift calculation with the default algorithm selection, and also to customize the drift calculations in two ways: by choosing the algorithm of your choosing and by changing the algorithm's internal parameters and thresholds for drift detection. We will also show you how to calculate drifts in a standalone manner, without the need to generate a visualization with the summary report."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"> Currently, whylogs supports the following drift algorithms: __Kolmogorov-Smirnov Test__, __Chi-Square Test__, and __Hellinger distance__. Stay tuned for more algorithms to be added in the future!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installing whylogs\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Note: you may need to restart the kernel to use updated packages.\n",
"%pip install whylogs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generating the Target and Reference Profiles"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we will generate two profiles, one as the target and one as the reference.\n",
"\n",
"We will use those profiles in order to calculate drift scores for each column in both profiles."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import whylogs as why\n",
"import pandas as pd\n",
"\n",
"data = {\n",
" \"animal\": [\"cat\", \"hawk\", \"snake\", \"cat\"],\n",
" \"legs\": [4, 2, 0, 4],\n",
" \"weight\": [4.3, 1.8, None, 4.1],\n",
"}\n",
"\n",
"df = pd.DataFrame(data)\n",
"\n",
"data2 = {\n",
" \"animal\": [\"cat\", \"hawk\", \"snake\", \"cat\"],\n",
" \"legs\": [13, 34, 99, 123],\n",
" \"weight\": [4.9, 13.3, None, 232.3],\n",
"}\n",
"\n",
"df2 = pd.DataFrame(data2)\n",
"\n",
"\n",
"target_view = why.log(df).profile().view()\n",
"ref_view = why.log(df2).profile().view()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Calculating Drift - Default Behavior\n",
"\n",
"You can calculate drift scores between your profiles in two ways. The first is to use `calculate_drift_scores`, which whill give you a dictionary of drift scores for each feature.\n",
"\n",
"The second is to view it integrated into the Notebook Profile Visualizer by calling `summary_drift_report`. This will give you a drift summary report in the format of an in-notebook visualization or a downloadable HTML file.\n",
"\n",
"Let's see both cases for the default behavior scenario - we won't specify any drift algorithms or parameters.\n",
"\n",
"To get a dictionary with the drift scores, you can use the `calculate_drift_scores` method:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'animal': {'algorithm': 'chi-square',\n",
" 'pvalue': 1.0,\n",
" 'statistic': 0.0,\n",
" 'thresholds': {'NO_DRIFT': (0.15, 1),\n",
" 'POSSIBLE_DRIFT': (0.05, 0.15),\n",
" 'DRIFT': (0, 0.05)},\n",
" 'drift_category': 'NO_DRIFT'},\n",
" 'legs': {'algorithm': 'ks',\n",
" 'pvalue': 0.0,\n",
" 'statistic': 1.0,\n",
" 'thresholds': {'NO_DRIFT': (0.15, 1),\n",
" 'POSSIBLE_DRIFT': (0.05, 0.15),\n",
" 'DRIFT': (0, 0.05)},\n",
" 'drift_category': 'DRIFT'},\n",
" 'weight': {'algorithm': 'ks',\n",
" 'pvalue': 0.0,\n",
" 'statistic': 1.0,\n",
" 'thresholds': {'NO_DRIFT': (0.15, 1),\n",
" 'POSSIBLE_DRIFT': (0.05, 0.15),\n",
" 'DRIFT': (0, 0.05)},\n",
" 'drift_category': 'DRIFT'}}"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from whylogs.viz.drift.column_drift_algorithms import calculate_drift_scores\n",
"\n",
"scores = calculate_drift_scores(target_view=target_view, reference_view=ref_view, with_thresholds = True)\n",
"scores"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The `scores` object is a dictionary with the drift scores for each feature with additional metadata.\n",
"\n",
"## Algorithm Selection\n",
"\n",
"We can see that the KS test was applied for both `weight` and `animal`, and `chi-squared` was applied for `animal`. The default behavior for choosing which drift algorithm to use is the following: KS is calculated if distribution metrics exists for said column. If not, Chi2 is calculated if frequent items, cardinality and count metric exists. If not, then no drift value is associated to the column.\n",
"\n",
"## Thresholds\n",
"\n",
"We can also see the thresholds defined by default for each algorithm. Each drift category contains a tuple defining a range: if the measure falls within the range, then the drift category is assigned to the column. For each range, the lower bound is inclusive, while the upper bound is exclusive, except for the maximum upper bound, which is inclusive.\n",
"\n",
"The drift categorization will use either the `pvalue` or `statistic` value, depending on the algorithm. Both KS and Chi Square tests compare the pvalue against the thresholds, while the Hellinger distance compares the `statistic` value against the thresholds."
]
},
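{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The range semantics described above can be sketched in a few lines. This is an illustration rather than the whylogs implementation: `categorize` is a hypothetical helper, and `DEFAULT_THRESHOLDS` simply copies the ranges printed in the output above. The value passed in would be the `pvalue` for KS and chi-square, or the `statistic` for the Hellinger distance.\n",
"\n",
"```python\n",
"# Ranges copied from the default output above.\n",
"DEFAULT_THRESHOLDS = {\n",
"    'NO_DRIFT': (0.15, 1),\n",
"    'POSSIBLE_DRIFT': (0.05, 0.15),\n",
"    'DRIFT': (0, 0.05),\n",
"}\n",
"\n",
"\n",
"def categorize(measure, thresholds=DEFAULT_THRESHOLDS):\n",
"    # Hypothetical helper: return the category whose range contains the measure.\n",
"    # Lower bounds are inclusive and upper bounds exclusive, except the largest upper bound.\n",
"    top = max(upper for _, upper in thresholds.values())\n",
"    for category, (lower, upper) in thresholds.items():\n",
"        if lower <= measure < upper or (upper == top and measure == top):\n",
"            return category\n",
"    # The measure falls outside every configured range.\n",
"    return None\n",
"\n",
"\n",
"boundary_category = categorize(0.05)\n",
"print(boundary_category)\n",
"```\n",
"\n",
"A value that sits exactly on a shared boundary, such as 0.05, lands in the category whose lower bound it matches."
]
},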
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also visualize this information integrated with the NotebookProfileVisualizer `summary_drift_report`:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"