{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ ">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*
\n", ">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Metric_Constraints)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Metric_Constraints) to leverage the power of whylogs and WhyLabs together!*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Validation with Metric Constraints" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/advanced/Metric_Constraints.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> This is an example for whylogs versions 1.0.0 and above. If you're interested in constraints for versions <1.0.0, please see these examples: [Constraints Suite](https://github.com/whylabs/whylogs/blob/maintenance/0.7.x/examples/Constraints_Suite.ipynb), [Constraints-Distributional Measures](https://github.com/whylabs/whylogs/blob/maintenance/0.7.x/examples/Constraints_Distributional_Measures.ipynb), and [Creating Customized Constraints](https://github.com/whylabs/whylogs/blob/maintenance/0.7.x/examples/Creating_Customized_Constraints.ipynb)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Note: you may need to restart the kernel to use updated packages.\n", "%pip install 'whylogs[viz]'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Starting with the basic pandas dataframe logging, consider the following input. 
We will generate a whylogs profile view from it." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import whylogs as why\n", "\n", "data = {\n", "    \"animal\": [\"cat\", \"hawk\", \"snake\", \"cat\", \"mosquito\"],\n", "    \"legs\": [4, 2, 0, 4, 6],\n", "    \"weight\": [4.3, 1.8, 1.3, 4.1, 5.5e-6],\n", "}\n", "\n", "results = why.log(pd.DataFrame(data))\n", "profile_view = results.view()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The profile view can be displayed as a pandas dataframe whose columns are metric/component paths:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n",
"<tr><th>column</th><th>cardinality/est</th><th>cardinality/lower_1</th><th>cardinality/upper_1</th><th>counts/inf</th><th>counts/n</th><th>counts/nan</th><th>counts/null</th><th>distribution/max</th><th>distribution/mean</th><th>distribution/median</th><th>...</th><th>distribution/stddev</th><th>frequent_items/frequent_strings</th><th>type</th><th>types/boolean</th><th>types/fractional</th><th>types/integral</th><th>types/object</th><th>types/string</th><th>ints/max</th><th>ints/min</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>animal</th><td>4.0</td><td>4.0</td><td>4.00020</td><td>0</td><td>5</td><td>0</td><td>0</td><td>NaN</td><td>0.000000</td><td>NaN</td><td>...</td><td>0.000000</td><td>[FrequentItem(value='cat', est=2, upper=2, low...</td><td>SummaryType.COLUMN</td><td>0</td><td>0</td><td>0</td><td>0</td><td>5</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>legs</th><td>4.0</td><td>4.0</td><td>4.00020</td><td>0</td><td>5</td><td>0</td><td>0</td><td>6.0</td><td>3.200000</td><td>4.0</td><td>...</td><td>2.280351</td><td>[FrequentItem(value='4', est=2, upper=2, lower...</td><td>SummaryType.COLUMN</td><td>0</td><td>0</td><td>5</td><td>0</td><td>0</td><td>6.0</td><td>0.0</td></tr>\n",
"<tr><th>weight</th><td>5.0</td><td>5.0</td><td>5.00025</td><td>0</td><td>5</td><td>0</td><td>0</td><td>4.3</td><td>2.300001</td><td>1.8</td><td>...</td><td>1.856069</td><td>NaN</td><td>SummaryType.COLUMN</td><td>0</td><td>5</td><td>0</td><td>0</td><td>0</td><td>NaN</td><td>NaN</td></tr>\n",
"</tbody>\n", "</table>\n", "<p>3 rows × 30 columns</p>\n", "</div>
" ], "text/plain": [ " cardinality/est cardinality/lower_1 cardinality/upper_1 counts/inf \\\n", "column \n", "animal 4.0 4.0 4.00020 0 \n", "legs 4.0 4.0 4.00020 0 \n", "weight 5.0 5.0 5.00025 0 \n", "\n", " counts/n counts/nan counts/null distribution/max \\\n", "column \n", "animal 5 0 0 NaN \n", "legs 5 0 0 6.0 \n", "weight 5 0 0 4.3 \n", "\n", " distribution/mean distribution/median ... distribution/stddev \\\n", "column ... \n", "animal 0.000000 NaN ... 0.000000 \n", "legs 3.200000 4.0 ... 2.280351 \n", "weight 2.300001 1.8 ... 1.856069 \n", "\n", " frequent_items/frequent_strings type \\\n", "column \n", "animal [FrequentItem(value='cat', est=2, upper=2, low... SummaryType.COLUMN \n", "legs [FrequentItem(value='4', est=2, upper=2, lower... SummaryType.COLUMN \n", "weight NaN SummaryType.COLUMN \n", "\n", " types/boolean types/fractional types/integral types/object \\\n", "column \n", "animal 0 0 0 0 \n", "legs 0 0 5 0 \n", "weight 0 5 0 0 \n", "\n", " types/string ints/max ints/min \n", "column \n", "animal 5 NaN NaN \n", "legs 0 6.0 0.0 \n", "weight 0 NaN NaN \n", "\n", "[3 rows x 30 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "profile_view.to_pandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the output above, notice that we have metrics on the number of legs these animals have in the \"legs\" column.\n", "Let's say we want to define some constraints on the number of \"legs\" we expect for animals."
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "columns: dict_keys(['animal', 'legs', 'weight'])\n", "metric names: ['counts', 'types', 'distribution', 'ints', 'cardinality', 'frequent_items']\n", "here is selector at index 6: MetricsSelector(metric_name='types', column_name='legs', metrics_resolver=None) there are a total of 15\n" ] } ], "source": [ "from whylogs.core.constraints import Constraints, ConstraintsBuilder, MetricsSelector, MetricConstraint\n", "\n", "# Column profile view for the \"legs\" column\n", "column_view = profile_view.get_column(\"legs\")\n", "\n", "# The builder is bound to profile_view and generates a set of constraints\n", "# from that profile view's columns and metrics.\n", "builder = ConstraintsBuilder(profile_view)\n", "\n", "# Let's explore which columns and metrics are available in the profile view.\n", "# A metric is specified by a (column_name, metric_name) pair.\n", "# Here are the column names again:\n", "column_names = profile_view.get_columns().keys()\n", "print(f\"columns: {column_names}\")\n", "\n", "# And here are the metric names on the \"legs\" column\n", "metric_names = column_view.get_metric_names()\n", "print(f\"metric names: {metric_names}\")\n", "\n", "# For the full set of possibilities, ask the builder for all MetricsSelectors,\n", "# which cover the unique combinations of (column_name, metric_name)\n", "selectors = builder.get_metric_selectors()\n", "i = 6\n", "print(f\"here is selector at index {i}: {selectors[i]} there are a total of {len(selectors)}\")\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'mean': 3.2,\n", " 'stddev': 2.280350850198276,\n", " 'n': 5,\n", " 'max': 6.0,\n", " 'min': 0.0,\n", " 'q_01': 0.0,\n", " 'q_05': 0.0,\n", " 'q_10': 0.0,\n", " 'q_25': 2.0,\n", " 'median': 4.0,\n", " 
'q_75': 4.0,\n", " 'q_90': 6.0,\n", " 'q_95': 6.0,\n", " 'q_99': 6.0}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's say we're interested in defining a constraint on the number of \"legs\". From the output above we see\n", "# that column \"legs\" has the following metrics: [counts, types, distribution, ints, cardinality, frequent_items].\n", "# Let's look at what the distribution metric contains:\n", "distribution_values = profile_view.get_column(\"legs\").get_metric(\"distribution\").to_summary_dict()\n", "distribution_values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, let's come back to how to use the ConstraintsBuilder to add a couple of constraints." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# The builder's add_constraint() takes a MetricConstraint, which is defined by three things:\n", "# 1. A metric selector, which selects the metric and the column the constraint applies to.\n", "#    Let's choose MetricsSelector(metric_name='distribution', column_name='legs', metrics_resolver=None)\n", "# 2. An expression on the selected metric. For distribution, we can reference numeric properties such as\n", "#    max, min, stddev and others. Here we'll require animal legs < 12 (sorry, centipedes)!\n", "# 3. 
A name for this constraint, let's go with \"legs < 12\"\n", "\n", "distribution_legs = MetricsSelector(metric_name='distribution', column_name='legs')\n", "\n", "# This lambda takes in a distribution metric, which exposes convenience properties such as max and min,\n", "# but we could also call to_summary_dict() and use any of the keys we saw in 'distribution_values' above\n", "legs_under_12 = lambda x: x.max < 12\n", "\n", "constraint_name = \"legs < 12\"\n", "\n", "legs_constraint = MetricConstraint(\n", "    name=constraint_name,\n", "    condition=legs_under_12,\n", "    metric_selector=distribution_legs)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Constraints valid: True\n", "Constraints report [constraint name, pass, fail, summary]: [ReportResult(name='legs < 12', passed=1, failed=0, summary=None), ReportResult(name='legs >= 0', passed=1, failed=0, summary=None)]\n" ] } ], "source": [ "# Now that we have legs_constraint defined, we can add it to the builder:\n", "builder.add_constraint(legs_constraint)\n", "\n", "# We can add more constraints to the builder using the same pattern. For example, negative values are invalid:\n", "not_negative = lambda x: x.min >= 0\n", "builder.add_constraint(MetricConstraint(\n", "    name=\"legs >= 0\",\n", "    condition=not_negative,\n", "    metric_selector=distribution_legs\n", "))\n", "\n", "# OK, let's build these constraints\n", "constraints: Constraints = builder.build()\n", "\n", "# A Constraints object holds a collection of constraints. Call validate() to get an overall pass/fail,\n", "# or generate a report for display\n", "constraints_valid = constraints.validate()\n", "print(f\"Constraints valid: {constraints_valid}\")\n", "\n", "# A simple report of [constraint name, pass, fail, summary] can be generated like this:\n", "constraints_report = constraints.generate_constraints_report()\n", "print(f\"Constraints report [constraint name, pass, fail, 
summary]: {constraints_report}\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, let's add a few more and rebuild the constraints!" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "stddev_below_3 = lambda x: x.stddev < 3.0\n", "builder.add_constraint(MetricConstraint(\n", "    name=\"legs stddev < 3.0\",\n", "    condition=stddev_below_3,\n", "    metric_selector=distribution_legs\n", "))\n", "\n", "distribution_weight = MetricsSelector(metric_name='distribution', column_name='weight')\n", "builder.add_constraint(MetricConstraint(\n", "    name=\"weight >= 0\",\n", "    condition=not_negative,\n", "    metric_selector=distribution_weight\n", "))\n", "\n", "reasonable_constraints = builder.build()\n", "\n", "\n", "# The condition uses >= so that it matches the constraint name\n", "builder.add_constraint(MetricConstraint(\n", "    name=\"animal count >= 1000\",\n", "    condition=lambda x: x.n.value >= 1000,\n", "    metric_selector=MetricsSelector(metric_name='counts', column_name='animal')\n", "))\n", "\n", "reasonable_constraints_over_1000_rows = builder.build()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from whylogs.viz import NotebookProfileVisualizer\n", "\n", "# You can also pass the constraints to the NotebookProfileVisualizer and generate a report\n", "visualization = NotebookProfileVisualizer()\n", "visualization.constraints_report(constraints, cell_height=300)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you hover over the `Passed/Fail` icons, you can see the summary of the metric that was used to evaluate each constraint. In this case, `legs < 12` passed because the `max` metric component is __6__, which is below __12__.\n", "\n", "Similarly, `legs >= 0` passed because `min` is __0__, which is greater than or equal to __0__." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# a slightly more interesting report\n", "visualization.constraints_report(reasonable_constraints, cell_height=400)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# a failing report (because we don't have enough animals!)\n", "visualization.constraints_report(reasonable_constraints_over_1000_rows, cell_height=400)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.10 ('.venv': poetry)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "5dd5901cadfd4b29c2aaf95ecd29c0c3b10829ad94dcfe59437dbee391154aea" } } }, "nbformat": 4, "nbformat_minor": 2 }