{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
### 🚩">
">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*\n",
">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Validation_Tutorial)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Validation_Tutorial) to leverage the power of whylogs and WhyLabs together!*"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "7V25M_P5sXXq"
},
"source": [
"# Data Validation at Scale - Detecting and Responding to Data Misbehavior\n",
"\n",
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/tutorials/Data_Validation_Tutorial.ipynb)\n",
"\n",
"In this tutorial, we'll introduce the concept of data logging and discuss how to validate data at scale by creating metric constraints and generating reports based on the data's statistical profiles using the whylogs open-source package.\n",
"\n",
"We will also walk through steps that data scientists and ML engineers can take to tailor their set of validations to fit the specific needs of their business or project, take actions when their rules fail to be met, and debug and troubleshoot cases where data fails to behave as expected.\n",
"\n",
"## Agenda\n",
"- Session 1: Introduction to Data Logging with whylogs\n",
"- Session 2: Data Validation with Metric Constraints\n",
"- Session 3: Per-value constraints with Condition Count Metrics\n",
"- Session 4: Auto-constraints generation\n",
"- Session 5: Debugging Failed Conditions\n",
"\n",
"## What is Data Validation?\n",
"\n",
"Data validation is the process of ensuring that data is accurate, complete, and consistent. It involves checking data for errors or inconsistencies, and ensuring that it meets the specified requirements or constraints. Data validation is important because it helps to ensure the integrity and quality of data, and helps to prevent errors or inaccuracies in data from propagating and causing problems downstream.\n",
"\n",
"In whylogs, you validate data by creating Metric Constraints and validating those constraints against a whylogs profile.\n"
]
},
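{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea of validating constraints against a statistical summary, rather than against every raw row, can be sketched in plain Python. This is a conceptual illustration only and does not use the whylogs API: `summarize`, the `constraints` rules, and the sample `weights` values are hypothetical names and data made up for this example.\n",
"\n",
"```python\n",
"from statistics import mean\n",
"\n",
"\n",
"def summarize(values):\n",
"    # Reduce a column to a few aggregate statistics (a toy stand-in for a profile).\n",
"    present = [v for v in values if v is not None]\n",
"    return {\n",
"        'count': len(values),\n",
"        'null_count': len(values) - len(present),\n",
"        'min': min(present),\n",
"        'max': max(present),\n",
"        'mean': mean(present),\n",
"    }\n",
"\n",
"\n",
"# Hypothetical rules: each name maps to a check on the summary, not on raw rows.\n",
"constraints = {\n",
"    'no missing values': lambda s: s['null_count'] == 0,\n",
"    'weight is positive': lambda s: s['min'] > 0,\n",
"    'mean weight below 10': lambda s: s['mean'] < 10,\n",
"}\n",
"\n",
"weights = [4.3, 1.8, None, 4.1]  # made-up sample with one missing value\n",
"summary = summarize(weights)\n",
"report = {name: check(summary) for name, check in constraints.items()}\n",
"\n",
"for name, passed in report.items():\n",
"    print(name, '->', 'PASS' if passed else 'FAIL')\n",
"```\n",
"\n",
"Because the rules only read the summary, the raw data does not need to be kept around once it has been summarized, which is what makes this style of validation practical at scale.\n"
]
},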
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "V7ewBmRCgBV6"
},
"source": [
"# Session 1 - Data Logging"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "H07RwXCldzGh"
},
"source": [
"## Installing whylogs\n",
"\n",
"To install the whylogs library, you can use the following command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "DaxL1mBIdyUZ",
"outputId": "cb97f181-4d42-4919-a6ae-62af52c461c8"
},
"outputs": [],
"source": [
"%pip install -q whylogs[viz]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "NtqFsE6Id5qp"
},
"source": [
"## Loading a Pandas DataFrame\n",
"\n",
"Before showing how we can log data, we first need the data itself. Let's create a simple Pandas DataFrame:\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "hAK8NOtsd6t4"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"data = {\n",
" \"animal\": [\"cat\", \"hawk\", \"snake\", \"cat\"],\n",
" \"legs\": [4, 2, 0, 4],\n",
" \"weight\": [4.3, 1.8, 1.3, 4.1],\n",
"}\n",
"\n",
"df = pd.DataFrame(data)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "uDRUOXn5d_LZ"
},
"source": [
"## Profiling with whylogs\n",
"\n",
"To obtain a profile of your data, you can simply use whylogs' `log` call, and navigate through the result to a specific profile with `profile()`:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 449
},
"id": "4lGjB710eDgJ",
"outputId": "93cd61ce-e4d4-417c-e317-df8e9ac39f24"
},
"outputs": [],
"source": [
"import whylogs as why\n",
"\n",
"profile = why.log(df).profile()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "XBIgI1VOeKoD"
},
"source": [
"## Analyzing Profiles\n",
"\n",
"Once you're done logging the data, you can generate a Profile View and inspect it in a Pandas Dataframe format:"
]
},
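{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of this step, the block below rebuilds the example DataFrame so that it is self-contained, logs it, and converts the resulting view to pandas. The variable names `prof_view` and `summary` are illustrative choices, and the two metric columns printed at the end are picked only to keep the output short.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"import whylogs as why\n",
"\n",
"df = pd.DataFrame(\n",
"    {\n",
"        'animal': ['cat', 'hawk', 'snake', 'cat'],\n",
"        'legs': [4, 2, 0, 4],\n",
"        'weight': [4.3, 1.8, 1.3, 4.1],\n",
"    }\n",
")\n",
"\n",
"profile = why.log(df).profile()\n",
"\n",
"# A profile view exposes the profile's metrics for inspection.\n",
"prof_view = profile.view()\n",
"\n",
"# One row per logged column, one DataFrame column per metric component.\n",
"summary = prof_view.to_pandas()\n",
"print(summary[['counts/n', 'counts/null']])\n",
"```\n",
"\n",
"Each row of the resulting DataFrame describes one column of the logged data, so the metrics can be filtered and compared with ordinary pandas operations.\n"
]
},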
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 303
},
"id": "xCvJtUi6eFnZ",
"outputId": "a81842a9-f405-48da-f20c-60b6aa614e5f"
},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | cardinality/est | \n", "cardinality/lower_1 | \n", "cardinality/upper_1 | \n", "counts/inf | \n", "counts/n | \n", "counts/nan | \n", "counts/null | \n", "distribution/max | \n", "distribution/mean | \n", "distribution/median | \n", "... | \n", "frequent_items/frequent_strings | \n", "type | \n", "types/boolean | \n", "types/fractional | \n", "types/integral | \n", "types/object | \n", "types/string | \n", "types/tensor | \n", "ints/max | \n", "ints/min | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
column | \n", "\n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " |
animal | \n", "3.0 | \n", "3.0 | \n", "3.00015 | \n", "0 | \n", "4 | \n", "0 | \n", "0 | \n", "NaN | \n", "0.000 | \n", "NaN | \n", "... | \n", "[FrequentItem(value='cat', est=2, upper=2, low... | \n", "SummaryType.COLUMN | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "4 | \n", "0 | \n", "NaN | \n", "NaN | \n", "
legs | \n", "3.0 | \n", "3.0 | \n", "3.00015 | \n", "0 | \n", "4 | \n", "0 | \n", "0 | \n", "4.0 | \n", "2.500 | \n", "4.0 | \n", "... | \n", "[FrequentItem(value='4', est=2, upper=2, lower... | \n", "SummaryType.COLUMN | \n", "0 | \n", "0 | \n", "4 | \n", "0 | \n", "0 | \n", "0 | \n", "4.0 | \n", "0.0 | \n", "
weight | \n", "4.0 | \n", "4.0 | \n", "4.00020 | \n", "0 | \n", "4 | \n", "0 | \n", "0 | \n", "4.3 | \n", "2.875 | \n", "4.1 | \n", "... | \n", "NaN | \n", "SummaryType.COLUMN | \n", "0 | \n", "4 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "NaN | \n", "NaN | \n", "
3 rows × 31 columns
\n", "\n", " | cardinality/est | \n", "cardinality/lower_1 | \n", "cardinality/upper_1 | \n", "counts/inf | \n", "counts/n | \n", "counts/nan | \n", "counts/null | \n", "distribution/max | \n", "distribution/mean | \n", "distribution/median | \n", "... | \n", "frequent_items/frequent_strings | \n", "type | \n", "types/boolean | \n", "types/fractional | \n", "types/integral | \n", "types/object | \n", "types/string | \n", "types/tensor | \n", "ints/max | \n", "ints/min | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
column | \n", "\n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " |
animal | \n", "3.0 | \n", "3.0 | \n", "3.00015 | \n", "0 | \n", "8 | \n", "0 | \n", "0 | \n", "NaN | \n", "0.000 | \n", "NaN | \n", "... | \n", "[FrequentItem(value='cat', est=4, upper=4, low... | \n", "SummaryType.COLUMN | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "8 | \n", "0 | \n", "NaN | \n", "NaN | \n", "
legs | \n", "3.0 | \n", "3.0 | \n", "3.00015 | \n", "0 | \n", "8 | \n", "0 | \n", "0 | \n", "4.0 | \n", "2.500 | \n", "4.0 | \n", "... | \n", "[FrequentItem(value='4', est=4, upper=4, lower... | \n", "SummaryType.COLUMN | \n", "0 | \n", "0 | \n", "8 | \n", "0 | \n", "0 | \n", "0 | \n", "4.0 | \n", "0.0 | \n", "
weight | \n", "4.0 | \n", "4.0 | \n", "4.00020 | \n", "0 | \n", "8 | \n", "0 | \n", "0 | \n", "4.3 | \n", "2.875 | \n", "4.1 | \n", "... | \n", "NaN | \n", "SummaryType.COLUMN | \n", "0 | \n", "8 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "NaN | \n", "NaN | \n", "
3 rows × 31 columns
\n", "\n", " | name | \n", "description | \n", "listing_url | \n", "last_review | \n", "number_of_reviews_ltm | \n", "number_of_reviews_l30d | \n", "id | \n", "latitude | \n", "longitude | \n", "availability_365 | \n", "bedrooms | \n", "reviews_per_month | \n", "room_type | \n", "price | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
17895 | \n", "COPACABANA PRINCESINHA DO MAR | \n", "na quadra da praia. pertinho da pedra do leme.... | \n", "https://www.airbnb.com/rooms/37994848 | \n", "2020-02-22 | \n", "2 | \n", "0 | \n", "37994848 | \n", "-22.96576 | \n", "-43.17784 | \n", "82 | \n", "1 | \n", "0.44 | \n", "Entire home/apt | \n", "220.0 | \n", "
5343 | \n", "*Double Room with A/C & TV – Riocentro | \n", "Welcome to Rio de Janeiro-Gated community loca... | \n", "https://www.airbnb.com/rooms/10123238 | \n", "2016-08-23 | \n", "0 | \n", "0 | \n", "10123238 | \n", "-22.95001 | \n", "-43.38205 | \n", "0 | \n", "1 | \n", "0.02 | \n", "Private room | \n", "119.0 | \n", "