{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Drift Analysis with Profile Visualizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/1.0.x/python/examples/basic/Notebook_Profile_Visualizer.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> This is a `whylogs v1` example. For the analogous feature in `v0`, please refer to [this example](https://github.com/whylabs/whylogs/blob/mainline/examples/Notebook_Profile_Visualizer.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we'll show how you can use whylogs' Notebook Profile Visualizer to compare two different sets of the same data. This includes:\n", "- __Data Drift__: Detecting feature drift between two datasets' distributions\n", "- __Data Visualization__: Comparing features' histograms and bar charts\n", "- __Summary Statistics__: Visualizing summary statistics of individual features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Drift on Wine Quality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To demonstrate the Profile Visualizer, let's use [UCI's Wine Quality Dataset](https://archive.ics.uci.edu/ml/datasets/wine+quality), frequently used for learning purposes. Classification is one possible task, where we predict the wine's quality based on its features, like pH, density and percent alcohol content.\n", "\n", "In this example, we will split the available dataset into two groups: wines with alcohol content (`alcohol` feature) below and above 11. The first group is considered our baseline (or reference) dataset, while the second will be our target dataset. 
The goal here is to induce a case of __Sample Selection Bias__, where the training sample is not representative of the population.\n", "\n", "The example used here was inspired by the article [A Primer on Data Drift](https://medium.com/data-from-the-trenches/a-primer-on-data-drift-18789ef252a6). If you're interested in more information on this use case, or the theory behind Data Drift, it's a great read!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installing Dependencies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use the Profile Visualizer, we'll install whylogs with the extra package `viz`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -q whylogs[viz] --pre" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>fixed acidity</th><th>volatile acidity</th><th>citric acid</th><th>residual sugar</th><th>chlorides</th><th>free sulfur dioxide</th><th>total sulfur dioxide</th><th>density</th><th>pH</th><th>sulphates</th><th>alcohol</th><th>quality</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>7.4</td><td>0.70</td><td>0.00</td><td>1.9</td><td>0.076</td><td>11.0</td><td>34.0</td><td>0.9978</td><td>3.51</td><td>0.56</td><td>9.4</td><td>5</td></tr>\n",
" <tr><th>1</th><td>7.8</td><td>0.88</td><td>0.00</td><td>2.6</td><td>0.098</td><td>25.0</td><td>67.0</td><td>0.9968</td><td>3.20</td><td>0.68</td><td>9.8</td><td>5</td></tr>\n",
" <tr><th>2</th><td>7.8</td><td>0.76</td><td>0.04</td><td>2.3</td><td>0.092</td><td>15.0</td><td>54.0</td><td>0.9970</td><td>3.26</td><td>0.65</td><td>9.8</td><td>5</td></tr>\n",
" <tr><th>3</th><td>11.2</td><td>0.28</td><td>0.56</td><td>1.9</td><td>0.075</td><td>17.0</td><td>60.0</td><td>0.9980</td><td>3.16</td><td>0.58</td><td>9.8</td><td>6</td></tr>\n",
" <tr><th>4</th><td>7.4</td><td>0.70</td><td>0.00</td><td>1.9</td><td>0.076</td><td>11.0</td><td>34.0</td><td>0.9978</td><td>3.51</td><td>0.56</td><td>9.4</td><td>5</td></tr>\n",
" </tbody>\n",
"</table>
" ], "text/plain": [ " fixed acidity volatile acidity citric acid residual sugar chlorides \\\n", "0 7.4 0.70 0.00 1.9 0.076 \n", "1 7.8 0.88 0.00 2.6 0.098 \n", "2 7.8 0.76 0.04 2.3 0.092 \n", "3 11.2 0.28 0.56 1.9 0.075 \n", "4 7.4 0.70 0.00 1.9 0.076 \n", "\n", " free sulfur dioxide total sulfur dioxide density pH sulphates \\\n", "0 11.0 34.0 0.9978 3.51 0.56 \n", "1 25.0 67.0 0.9968 3.20 0.68 \n", "2 15.0 54.0 0.9970 3.26 0.65 \n", "3 17.0 60.0 0.9980 3.16 0.58 \n", "4 11.0 34.0 0.9978 3.51 0.56 \n", "\n", " alcohol quality \n", "0 9.4 5 \n", "1 9.8 5 \n", "2 9.8 5 \n", "3 9.8 6 \n", "4 9.4 5 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "pd.options.mode.chained_assignment = None # Disabling false positive warning\n", "\n", "url = \"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv\"\n", "wine = pd.read_csv(url,sep=\";\")\n", "wine.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 1599 entries, 0 to 1598\n", "Data columns (total 12 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 fixed acidity 1599 non-null float64\n", " 1 volatile acidity 1599 non-null float64\n", " 2 citric acid 1599 non-null float64\n", " 3 residual sugar 1599 non-null float64\n", " 4 chlorides 1599 non-null float64\n", " 5 free sulfur dioxide 1599 non-null float64\n", " 6 total sulfur dioxide 1599 non-null float64\n", " 7 density 1599 non-null float64\n", " 8 pH 1599 non-null float64\n", " 9 sulphates 1599 non-null float64\n", " 10 alcohol 1599 non-null float64\n", " 11 quality 1599 non-null int64 \n", "dtypes: float64(11), int64(1)\n", "memory usage: 150.0 KB\n" ] } ], "source": [ "wine.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll split the wines in two groups. 
The ones with `alcohol` at or below 11 will form our reference sample, and the ones above 11 will form our target dataset." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "cond_reference = (wine['alcohol']<=11)\n", "wine_reference = wine.loc[cond_reference]\n", "\n", "cond_target = (wine['alcohol']>11)\n", "wine_target = wine.loc[cond_target]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's also add some missing values to `citric acid`, to see how this is reflected in the Profile Visualizer later on." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "ixs = wine_target.iloc[100:110].index\n", "wine_target.loc[ixs,'citric acid'] = None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `quality` feature is a numerical one, representing the wine's quality. Let's transform it into a categorical feature, where each wine is classified as good or bad. Any wine with a quality above 6.5 is good. Otherwise, it's bad." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "bins = (2, 6.5, 8)\n", "group_names = ['bad', 'good']\n", "\n", "wine_reference['quality'] = pd.cut(wine_reference['quality'], bins = bins, labels = group_names)\n", "wine_target['quality'] = pd.cut(wine_target['quality'], bins = bins, labels = group_names)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can profile our dataframes with `whylogs`.\n", "The `NotebookProfileVisualizer` accepts `profile_views` as arguments. Profile views are obtained from the profiles, and are used for visualization and merging purposes." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "import whylogs as why\n", "result = why.log(pandas=wine_target)\n", "prof_view = result.view()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "result_ref = why.log(pandas=wine_reference)\n", "prof_view_ref = result_ref.view()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's instantiate `NotebookProfileVisualizer` and set the reference and target profile views:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from whylogs.viz import NotebookProfileVisualizer\n", "\n", "visualization = NotebookProfileVisualizer()\n", "visualization.set_profiles(target_profile_view=prof_view, reference_profile_view=prof_view_ref)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we're able to generate all sorts of plots and reports.\n", "Let's take a look at some of them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Drift Summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`summary_drift_report` gives overview statistics, such as the number of observations and missing cells, along with a feature-by-feature comparison of the two profiles: each feature's distribution and a drift calculation for numerical or categorical features.\n", "\n", "The report also displays an alert for each feature's drift severity.\n", "\n", "You can also search for a specific feature, or filter by inferred type." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "visualization.summary_drift_report()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The drift values are calculated in different ways, depending on the inferred type. For fractional features, the drift detection uses the quantile values and the corresponding CDFs to calculate the approximate KS statistic.For categorical features, the drift detections uses the top frequent items summary, unique count estimate and total count estimate for each feature to calculate the estimated Chi-Squared statistic.\n", "\n", "For `alcohol`, there's an alert of severe drift, with calculated p-value of 0.00. That makes sense, since both distributions are mutually exclusive.\n", "\n", "We can also conclude some thing just by visually inspecting the distributions. We can clearly see that the \"good-to-bad\" ratio changes significantly between both profiles. That in itself is a good indicator that the alcohol content might be correlated to the wine's quality\n", "\n", "The drift value is also relevant for a number of other features. For example, the `density` also is flagged with significant drift. Let's look at this feature in particular." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Histograms and Bar charts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a general idea of both profiles, let's take a look at some of the individual features.\n", "\n", "First, let's use the `double_histogram` to check on the `density` feature." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "visualization.double_histogram(feature_name=\"density\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can visually assess that there seems to be a drift between both distributions indeed. Maybe the alcohol content plays a significant role on the wine's density.\n", "\n", "As is the case with the alcohol content, our potential model would see density values in production never before seen in the training test. We can certainly expect performance degradation during production." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the `alcohol` feature. Obviously there's a clear separation between the distributions at the value of `11`." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "visualization.double_histogram(feature_name=\"alcohol\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to the histograms, we can also plot distribution charts for categorical variables, like the `quality` feature." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "visualization.distribution_chart(feature_name=\"quality\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also look at the difference between distributions:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "visualization.difference_distribution_chart(feature_name=\"quality\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that there is 800 or so more \"bads\" in the Reference profile, and 50 or so more \"goods\" on the target profile." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Statistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `feature_statistics`, we have access to very useful statistics by passing the feature and profile name:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "visualization.feature_statistics(feature_name=\"citric acid\", profile=\"target\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks like we have 72 distinct values for `citric acid`, ranging from 0 to 0.79. We can also see the 10 missing values injected earlier." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Downloading the Visualization Output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of the previous visualizations can be downloaded in `HTML` format for further inspection. Just run:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "import os\n", "os.getcwd()\n", "visualization.write(\n", " rendered_html=visualization.feature_statistics(\n", " feature_name=\"density\", profile=\"target\"\n", " ),\n", " html_file_name=os.getcwd() + \"/example\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're downloading the constraints report here, but you can simply replace it for your preferred visualization. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009\n", "\n", "- https://www.kaggle.com/vishalyo990/prediction-of-quality-of-wine\n", "\n", "- https://medium.com/data-from-the-trenches/a-primer-on-data-drift-18789ef252a6" ] } ], "metadata": { "interpreter": { "hash": "f76ec28949fecf16b926a3fc5a03c1aa6468ee82fa5da4ce6fd607df021af5b5" }, "kernelspec": { "display_name": "Python 3.8.13 ('v1.x')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }