{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ ">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*
\n", ">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Constraints_Suite)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Constraints_Suite) to leverage the power of whylogs and WhyLabs together!*" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Simple Constraints - Examples and Usage" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/basic/Constraints_Suite.ipynb)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> This is a `whylogs v1` example. For the analog feature in `v0`, please refer to [this example](https://github.com/whylabs/whylogs/blob/maintenance/0.7.x/examples/Constraints_Suite.ipynb)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we'll show how to define a number of simple constraints and examples on how to use them. For the basics on how to build your own set of constraints, see the example - [Data Validation with Metric Constraints](https://whylogs.readthedocs.io/en/stable/examples/advanced/Metric_Constraints.html).\n", "\n", "The constraints are listed according to the metric namespace used when defining them. For each category, we will create helper functions for simple and popular constraints. Each helper function has a brief explanation in its docstring. After defining the helper functions, we'll show a simple example on how to build the constraints out of the functions and visualize them as a report with the visualization module." 
] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> Note: The constraints shown here are still experimental and subject to further changes. Stay tuned for upgrades!" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Completeness Constraints\n", "\n", "| constraint | parameters | semantic | metric |\n", "|------------------------------|---------------------|-------------------------------------------------------|--------|\n", "| no_missing_values | column name | Checks that there are no missing values in the column. | Counts |\n", "| null_values_below_number | column name, number | Number of null values must be below given number. | Counts |\n", "| null_percentage_below_number | column name, number | Percentage of null values must be below given number. | Counts |" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Consistency Constraints\n", "\n", "| constraint | parameters | semantic | metric |\n", "|-----------------------------------|----------------------------|----------------------------------------------------------------------------------------|----------------|\n", "| greater_than_number | column name, number | Minimum value of given column must be above defined number. | Distribution |\n", "| smaller_than_number | column name, number | Maximum value of given column must be below defined number. | Distribution |\n", "| is_in_range | column name, lower, upper | Checks that all of column's values are in defined range (inclusive). | Distribution |\n", "| is_non_negative | column name | Checks if a column is non-negative. | Distribution |\n", "| n_most_common_items_in_set | column name, n, reference set | Checks if the top n most common items appear in the reference set. | Frequent Items |\n", "| frequent_strings_in_reference_set | column name, reference set | Checks if a set of variables appear in the frequent strings for a string column. 
| Frequent Items |\n", "| count_below_number | column name, number | Checks if the number of elements in a column is below given number. | Counts |\n", "| distinct_number_in_range | column name, lower, upper | Checks if number of distinct categories is between lower and upper values (inclusive). | Cardinality |\n", "| column_is_nullable_integral | column name | Check if column contains only records of specific datatype. | Types |\n", "| column_is_nullable_boolean | column name | Check if column contains only records of specific datatype. | Types |\n", "| column_is_nullable_fractional | column name | Check if column contains only records of specific datatype. | Types |\n", "| column_is_nullable_object | column name | Check if column contains only records of specific datatype. | Types |\n", "| column_is_nullable_string | column name | Check if column contains only records of specific datatype. | Types |" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Condition Count Constraints\n", "\n", "Please refer to the example [Metric Constraints with Condition Count Metrics](https://github.com/whylabs/whylogs/blob/mainline/python/examples/advanced/Metric_Constraints_with_Condition_Count_Metrics.ipynb) for examples on how to use these constraints.\n", "\n", "| constraint | parameters | semantic | metric |\n", "|------------------------|-------------------------------------|--------------------------------------------------------------------------------------|--------------|\n", "| condition_meets | column name, condition_name | Fails if condition not met at least once. 
| Condition Count |\n", "| condition_never_meets | column name, condition_name | Fails if condition is met at least once | Condition Count |\n", "| condition_count_below | column name, condition_name, max_count | Fails if condition is met more than max count | Condition Count |\n", "\n", "## Statistics Constraints\n", "\n", "| constraint | parameters | semantic | metric |\n", "|------------------------|-------------------------------------|--------------------------------------------------------------------------------------|--------------|\n", "| mean_between_range | column name, lower, upper | Mean must be between range defined by lower and upper bounds. | Distribution |\n", "| stddev_between_range | column name, lower, upper | Standard deviation must be between range defined by lower and upper bounds. | Distribution |\n", "| quantile_between_range | column name, quantile, lower, upper | Q-th quantile value must be within the range defined by lower and upper boundaries. | Distribution |" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Table of Contents\n", "\n", "- [Installing and Importing Modules](#pre)\n", "- [Distribution Metrics Constraints](#distribution)\n", "- [Frequent Items/Frequent Strings Metrics Constraints](#frequent)\n", "- [Counters Constraints](#counts)\n", "- [Cardinality Constraints](#card)\n", "- [Types Constraints](#types)\n", "- [Combined Metrics Constraints](#comb)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Installing whylogs and importing modules \n", "\n", "If you haven't already, install whylogs:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Note: you may need to restart the kernel to use updated packages.\n", "%pip install 'whylogs[viz]'" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Then, let's import the helper functions needed to define the constraints:" ] }, { 
"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from whylogs.core.constraints import ConstraintsBuilder\n", "from whylogs.core.constraints.factories import (\n", " greater_than_number,\n", " is_in_range,\n", " is_non_negative,\n", " mean_between_range,\n", " smaller_than_number,\n", " stddev_between_range,\n", " quantile_between_range\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Examples - Distribution Metrics Constraints" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import whylogs as why\n", "import pandas as pd\n", "data = {\n", " \"animal\": [\"cat\", \"hawk\", \"snake\", \"cat\", \"mosquito\"],\n", " \"legs\": [4, 2, 0, 4, 6],\n", " \"weight\": [4.3, 1.8, 1.3, 4.1, 5.5e-6],\n", "}\n", "\n", "results = why.log(pd.DataFrame(data))\n", "profile_view = results.view()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "builder = ConstraintsBuilder(dataset_profile_view=profile_view)\n", "builder.add_constraint(greater_than_number(column_name=\"weight\", number=0.14))\n", "builder.add_constraint(mean_between_range(column_name=\"weight\", lower=2, upper=3))\n", "builder.add_constraint(smaller_than_number(column_name=\"weight\", number=20.5))\n", "builder.add_constraint(stddev_between_range(column_name=\"weight\", lower=1, upper=3))\n", "builder.add_constraint(quantile_between_range(column_name=\"weight\", quantile=0.5, lower=1.5, upper=2.0))\n", "builder.add_constraint(is_in_range(column_name=\"weight\", lower=1.1, upper=3.2))\n", "builder.add_constraint(is_in_range(column_name=\"legs\", lower=0, upper=6))\n", "builder.add_constraint(is_non_negative(column_name=\"legs\"))\n", "\n", "# animal has missing distribution metrics. this will pass if skip_missing = True and fail otherwise.\n", "builder.add_constraint(\n", " quantile_between_range(\n", " column_name=\"animal\", \n", " quantile=0.5, \n", " lower=1.5, \n", " upper=2.0, \n", " skip_missing=False\n", " )\n", ")\n", "\n", "constraints = builder.build()\n", "\n", "from whylogs.viz import NotebookProfileVisualizer\n", "\n", "visualization = NotebookProfileVisualizer()\n", "visualization.constraints_report(constraints, cell_height=300)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Frequent Items/Frequent Strings Constraints " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from whylogs.core.constraints.factories import n_most_common_items_in_set, frequent_strings_in_reference_set" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Examples - Frequent Items/Frequent Strings Constraints" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import whylogs as why\n", 
"import pandas as pd\n", "data = {\n", " \"animal\": [\"cat\", \"snake\", \"snake\", \"cat\", \"mosquito\"],\n", " \"legs\": [0, 1, 2, 3, 4],\n", " \"weight\": [4.3, 1.8, 1.3, 4.1, 5.5e-6],\n", "}\n", "\n", "results = why.log(pd.DataFrame(data))\n", "profile_view = results.view()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "builder = ConstraintsBuilder(dataset_profile_view=profile_view)\n", "reference_set = {\"cat\",\"snake\"}\n", "builder.add_constraint(frequent_strings_in_reference_set(column_name=\"animal\", reference_set=reference_set))\n", "builder.add_constraint(n_most_common_items_in_set(column_name=\"animal\",n=2,reference_set=reference_set))\n", "\n", "constraints = builder.build()\n", "\n", "from whylogs.viz import NotebookProfileVisualizer\n", "\n", "visualization = NotebookProfileVisualizer()\n", "visualization.constraints_report(constraints, cell_height=300)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Counters Constraints " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from whylogs.core.constraints.factories import no_missing_values, count_below_number, null_percentage_below_number, null_values_below_number" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Examples - Counters Constraints" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "import whylogs as why\n", "import pandas as pd\n", "data = {\n", " \"animal\": [\"cat\", \"snake\", \"snake\", \"cat\", \"mosquito\"],\n", " \"legs\": [4, 2, 0, None, 6],\n", " \"weight\": [4.3, 1.8, 1.3, 4.1, 5.5e-6],\n", "}\n", "\n", "results = why.log(pd.DataFrame(data))\n", "profile_view = results.view()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "builder = ConstraintsBuilder(dataset_profile_view=profile_view)\n", "builder.add_constraint(count_below_number(column_name=\"legs\", number=10))\n", "builder.add_constraint(null_percentage_below_number(column_name=\"legs\", number=0.05))\n", "builder.add_constraint(null_values_below_number(column_name=\"legs\", number=1))\n", "builder.add_constraint(no_missing_values(column_name=\"legs\"))\n", "builder.add_constraint(no_missing_values(column_name=\"animal\"))\n", "\n", "constraints = builder.build()\n", "\n", "from whylogs.viz import NotebookProfileVisualizer\n", "\n", "visualization = NotebookProfileVisualizer()\n", "visualization.constraints_report(constraints, cell_height=300)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Cardinality Constraints " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from whylogs.core.constraints.factories import distinct_number_in_range" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Examples - Cardinality Constraints" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "import whylogs as why\n", "import pandas as pd\n", "data = {\n", " \"animal\": [\"cat\", \"snake\", \"snake\", \"cat\", \"mosquito\"],\n", " \"legs\": [4, 2, 0, None, 6],\n", " \"weight\": [4.3, 1.8, 1.3, 4.1, 5.5e-6],\n", "}\n", "\n", "results = why.log(pd.DataFrame(data))\n", "profile_view = results.view()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "builder = ConstraintsBuilder(dataset_profile_view=profile_view)\n", "builder.add_constraint(distinct_number_in_range(column_name = \"animal\", lower = 3, upper = 6))\n", "\n", "constraints = builder.build()\n", "\n", "from whylogs.viz import NotebookProfileVisualizer\n", "\n", "visualization = NotebookProfileVisualizer()\n", "visualization.constraints_report(constraints, cell_height=300)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Types Metrics " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Examples - Types Metrics" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "import whylogs as why\n", "import pandas as pd\n", "\n", "data = {\n", " \"animal\": [\"cat\", \"snake\", \"snake\", \"cat\", \"mosquito\"],\n", " \"legs\": [4, 2, 0, None, 6],\n", " \"weight\": [4.3, 1.8, 1.3, 4.1, 5.5e-6],\n", " \"flies\": [False, False, \"False\", False, True],\n", " \"obj\": [{\"a\":1}, None, {\"a\":1}, {\"a\":1}, {\"a\":1}]\n", "}\n", "df = pd.DataFrame(data)\n", "results = why.log(df)\n", "profile_view = results.view()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Check Nullable Types" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "from whylogs.core.constraints.factories import ( \n", " column_is_nullable_integral,\n", " column_is_nullable_boolean, \n", " column_is_nullable_fractional,\n", " column_is_nullable_object,\n", " column_is_nullable_string,\n", ")\n", "from whylogs.core.constraints import ConstraintsBuilder\n", "\n", "\n", "builder = ConstraintsBuilder(dataset_profile_view=profile_view)\n", "builder.add_constraint(column_is_nullable_string(column_name=\"animal\"))\n", "builder.add_constraint(column_is_nullable_integral(column_name=\"legs\"))\n", "builder.add_constraint(column_is_nullable_fractional(column_name=\"weight\"))\n", "builder.add_constraint(column_is_nullable_boolean(column_name=\"flies\"))\n", "builder.add_constraint(column_is_nullable_object(column_name=\"obj\"))\n", "\n", "constraints = builder.build()\n", "\n", "from whylogs.viz import NotebookProfileVisualizer\n", "\n", "visualization = NotebookProfileVisualizer()\n", "visualization.constraints_report(constraints, cell_height=300)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The constraints above will pass if all values are of a given type. Null values are accepted.\n", "\n", "Note that for `legs`, the constraints failed. That is because whylogs leverages __pandas' dtypes__ when it is available, and when a `None` is present, the column is considered to be `fractional`, even though the remaining values were originally integers. 
" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Combined Constraints " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Examples - Combined Metrics" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To create a constraint that checks for a non-nullable type, we combine two separate constraints:\n", "\n", "- `column is nullable datatype`\n", "- `null values below 1`" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "import whylogs as why\n", "import pandas as pd\n", "\n", "data = {\n", " \"animal\": [\"cat\", \"snake\", \"snake\", \"cat\", \"mosquito\"],\n", " \"legs\": [4, 2, 0, None, 6],\n", " \"weight\": [4.3, 1.8, 1.3, 4.1, 5.5e-6],\n", " \"flies\": [False, False, \"False\", False, True],\n", " \"obj\": [{\"a\":1}, None, {\"a\":1}, {\"a\":1}, {\"a\":1}]\n", "}\n", "df = pd.DataFrame(data)\n", "results = why.log(df)\n", "profile_view = results.view()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Check Non-nullable Types" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from whylogs.core.constraints.factories import ( \n", " column_is_nullable_integral,\n", " column_is_nullable_boolean, \n", " column_is_nullable_fractional,\n", " column_is_nullable_object,\n", " column_is_nullable_string,\n", " null_values_below_number,\n", ")\n", "from whylogs.core.constraints import ConstraintsBuilder\n", "\n", "\n", "builder = ConstraintsBuilder(dataset_profile_view=profile_view)\n", "builder.add_constraint(column_is_nullable_string(column_name=\"animal\"))\n", "builder.add_constraint(null_values_below_number(column_name=\"animal\",number=1))\n", "\n", "# The combination of these metrics makes a check of non-nullable integral\n", "builder.add_constraint(column_is_nullable_integral(column_name=\"legs\"))\n", "builder.add_constraint(null_values_below_number(column_name=\"legs\",number=1))\n", "\n", "# The combination of these metrics makes a check of non-nullable fractional\n", "builder.add_constraint(column_is_nullable_fractional(column_name=\"weight\"))\n", "builder.add_constraint(null_values_below_number(column_name=\"weight\",number=1))\n", "\n", "# The combination of these metrics makes a check of non-nullable boolean\n", "builder.add_constraint(column_is_nullable_boolean(column_name=\"flies\"))\n", "builder.add_constraint(null_values_below_number(column_name=\"flies\",number=1))\n", "\n", "# The combination of these metrics makes a check of non-nullable object\n", "builder.add_constraint(column_is_nullable_object(column_name=\"obj\"))\n", "builder.add_constraint(null_values_below_number(column_name=\"obj\",number=1))\n", "\n", "constraints = builder.build()\n", "\n", "from whylogs.viz import NotebookProfileVisualizer\n", "\n", "visualization = NotebookProfileVisualizer()\n", "visualization.constraints_report(constraints, cell_height=300)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.10 ('.venv': poetry)", 
"language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "5dd5901cadfd4b29c2aaf95ecd29c0c3b10829ad94dcfe59437dbee391154aea" } } }, "nbformat": 4, "nbformat_minor": 2 }