{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Drift Analysis with Profile Visualizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/1.0.x/python/examples/basic/Notebook_Profile_Visualizer.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> This is a `whylogs v1` example. For the analog feature in `v0`, please refer to [this example](https://github.com/whylabs/whylogs/blob/mainline/examples/Notebook_Profile_Visualizer.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we'll show how you can use whylog's Notebook Profile Visualizer to compare two different sets of the same data. This includes:\n", "- __Data Drift__: Detecting feature drift between two datasets' distributions\n", "- __Data Visualization__: Comparing feature's histograms and bar charts\n", "- __Summary Statistics__: Visualizing Summary Statistics of individual features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Drift on Wine Quality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To demonstrate the Profile Visualizer, let's use [UCI's Wine Quality Dataset](https://archive.ics.uci.edu/ml/datasets/wine+quality), frequently used for learning purposes. Classification is one possible task, where we predict the wine's quality based on its features, like pH, density and percent alcohol content.\n", "\n", "In this example, we will split the available dataset in two groups: wines with alcohol content (`alcohol` feature) below and above 11. The first group is considered our baseline (or reference) dataset, while the second will be our target dataset. The goal here is to induce a case of __Sample Selection Bias__, where the training sample is not representative of the population.\n", "\n", "The example used here was inspired by the article [A Primer on Data Drift](https://medium.com/data-from-the-trenches/a-primer-on-data-drift-18789ef252a6). If you're interested in more information on this use case, or the theory behind Data Drift, it's a great read!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installing Dependencies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use the Profile Visualizer, we'll install whylogs with the extra package `viz`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -q whylogs[viz] --pre" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | fixed acidity | \n", "volatile acidity | \n", "citric acid | \n", "residual sugar | \n", "chlorides | \n", "free sulfur dioxide | \n", "total sulfur dioxide | \n", "density | \n", "pH | \n", "sulphates | \n", "alcohol | \n", "quality | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "7.4 | \n", "0.70 | \n", "0.00 | \n", "1.9 | \n", "0.076 | \n", "11.0 | \n", "34.0 | \n", "0.9978 | \n", "3.51 | \n", "0.56 | \n", "9.4 | \n", "5 | \n", "
1 | \n", "7.8 | \n", "0.88 | \n", "0.00 | \n", "2.6 | \n", "0.098 | \n", "25.0 | \n", "67.0 | \n", "0.9968 | \n", "3.20 | \n", "0.68 | \n", "9.8 | \n", "5 | \n", "
2 | \n", "7.8 | \n", "0.76 | \n", "0.04 | \n", "2.3 | \n", "0.092 | \n", "15.0 | \n", "54.0 | \n", "0.9970 | \n", "3.26 | \n", "0.65 | \n", "9.8 | \n", "5 | \n", "
3 | \n", "11.2 | \n", "0.28 | \n", "0.56 | \n", "1.9 | \n", "0.075 | \n", "17.0 | \n", "60.0 | \n", "0.9980 | \n", "3.16 | \n", "0.58 | \n", "9.8 | \n", "6 | \n", "
4 | \n", "7.4 | \n", "0.70 | \n", "0.00 | \n", "1.9 | \n", "0.076 | \n", "11.0 | \n", "34.0 | \n", "0.9978 | \n", "3.51 | \n", "0.56 | \n", "9.4 | \n", "5 | \n", "