{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Creating Metric Constraints on Condition Count Metrics" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "whylogs profiles contain summarized information about our data. This means that it's a __lossy__ process, and once we get the profiles, we don't have access anymore to the complete set of data.\n", "\n", "This makes some types of constraints impossible to be created from standard metrics itself. For example, suppose you need to check every row of a column to check that there are no textual information that matches a credit card number or email information. Or maybe you're interested in ensuring that there are no even numbers in a certain column. How do we do that if we don't have access to the complete data?\n", "\n", "The answer is that you need to define a __Condition Count Metric__ to be tracked __before__ logging your data. This metric will count the number of times the values of a given column meets a user-defined condition. 
When the profile is generated, you'll have that information to check against the constraints you'll create.\n", "\n", "In this example, you'll learn how to:\n", "- Define additional Condition Count Metrics\n", "- Define actions to be triggered whenever those conditions fail during the logging process\n", "- Use the Condition Count Metrics to create constraints against said conditions\n", "\n", "If you want more information on Condition Count Metrics, you can see [this example](https://nbviewer.org/github/whylabs/whylogs/blob/mainline/python/examples/advanced/Condition_Count_Metrics.ipynb) and also the documentation for [Data Validation](https://whylogs.readthedocs.io/en/stable/features/data_validation.html)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Installing whylogs\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Note: you may need to restart the kernel to use updated packages.\n", "%pip install whylogs" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Context\n", "\n", "Let's assume we have a DataFrame for which we wish to log standard metrics through whylogs' default logging process. Additionally, we want specific information on two columns:\n", "\n", "- `url`: Regex pattern validation: the values in this column should always start with `https://www.mydomain.com/profile`\n", "- `subscription_date`: Date format validation: the values in this column should be strings with a date format of `%Y-%m-%d`\n", "\n", "We also consider these cases to be critical, so we wish to take specific actions whenever a condition fails. 
In this example we will:\n", "\n", "- Send an alert in Slack whenever `subscription_date` fails the condition\n", "- Send an alert in Slack and pull a symbolic Andon Cord whenever `url` is not from the domain we expect\n", "\n", "Let's first create a simple DataFrame to demonstrate:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "data = {\n", " \"name\": [\"Alice\", \"Bob\", \"Charles\"],\n", " \"age\": [31, 0, 25],\n", " \"url\": [\"https://www.mydomain.com/profile/123\", \"www.wrongdomain.com\", \"http://mydomain.com/unsecure\"],\n", " \"subscription_date\": [\"2021-12-28\", \"2019-29-11\", \"04/08/2021\"],\n", " }\n", "\n", "df = pd.DataFrame(data)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In this case, both `url` and `subscription_date` have 2 values out of 3 that are not what we expect." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Defining the Relations\n", "\n", "Let's first define the relations that will actually check whether a value passes our constraint. For the date format validation, we'll use the __datetime__ module in a user-defined function. For the regex pattern matching, we will use whylogs' `Predicate` along with regular expressions, which allows us to build simple relations intuitively." 
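Before wiring these rules into whylogs, we can sanity-check them against the sample values above using only the standard library. This is a standalone sketch (the helper names `is_ymd` and `is_expected_url` are illustrative, not part of whylogs):

```python
import datetime
import re

def is_ymd(value: str) -> bool:
    # True if the value parses as a %Y-%m-%d date string.
    try:
        datetime.datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def is_expected_url(value: str) -> bool:
    # True if the value starts with the expected domain prefix.
    return re.match(r"^https://www\.mydomain\.com/profile", value) is not None

urls = ["https://www.mydomain.com/profile/123", "www.wrongdomain.com", "http://mydomain.com/unsecure"]
dates = ["2021-12-28", "2019-29-11", "04/08/2021"]

print([is_expected_url(u) for u in urls])  # [True, False, False]
print([is_ymd(d) for d in dates])          # [True, False, False]
```

As expected, only the first value of each column passes its rule, matching the "2 values out of 3" failures noted above.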
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import datetime\n", "from typing import Any\n", "from whylogs.core.relations import Predicate\n", "\n", "\n", "def date_format(x: Any) -> bool:\n", " date_format = '%Y-%m-%d'\n", " try:\n", " datetime.datetime.strptime(x, date_format)\n", " return True\n", " except ValueError:\n", " return False\n", "\n", "# matches accept a regex expression\n", "matches_domain_url = Predicate().matches(\"^https:\\/\\/www.mydomain.com\\/profile\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Defining the Actions\n", "\n", "Next, we need to define the actions that will be triggered whenever the conditions fail.\n", "\n", "We will define two placeholder functions that, in a real scenario, would execute the defined actions." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from typing import Any\n", "\n", "def pull_andon_cord(validator_name, condition_name: str, value: Any):\n", " print(\"Validator: {}\\n Condition name {} failed for value {}\".format(validator_name, condition_name, value))\n", " print(\" Pulling andon cord....\")\n", " # Do something here to respond to the constraint violation\n", " return\n", "\n", "def send_slack_alert(validator_name, condition_name: str, value: Any):\n", " print(\"Validator: {}\\n Condition name {} failed for value {}\".format(validator_name, condition_name, value))\n", " print(\" Sending slack alert....\")\n", " # Do something here to respond to the constraint violation\n", " return" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Conditions = Relations + Actions\n", "\n", "Conditions are defined by the combination of a relation and a set of actions. Now that we have both relations and actions, we can create two sets of conditions - in this example, each set contain a single condition, but we could have multiple." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from whylogs.core.metrics.condition_count_metric import Condition\n", "\n", "has_date_format = {\n", " \"Y-m-d format\": Condition(date_format, actions=[send_slack_alert]),\n", "}\n", "\n", "regex_conditions = {\"url_matches_domain\": Condition(matches_domain_url, actions=[pull_andon_cord,send_slack_alert])}\n", "\n", "ints_conditions = {\n", " \"integer_zeros\": Condition(Predicate().equals(0)),\n", "}" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Passing the conditions to the Logger\n", "\n", "Now, we need to let the logger aware of our Conditions. This can be done by creating a custom schema object that will be passed to `why.log()`.\n", "\n", "To create the schema object, we will use the __Declarative Schema__, which is an auxiliary class that will enable us to create a schema in a simple way.\n", "\n", "In this case, we want our schema to start with the default behavior (standard metrics for the default datatypes). Then, we want to add two condition count metrics based on the conditions we defined earlier and the name of the column we want to bind those conditions to. 
We can do so by calling the schema's `add_resolver_spec` method, passing a `ConditionCountMetricSpec` built from each set of conditions:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from whylogs.core.resolvers import STANDARD_RESOLVER\n", "from whylogs.core.specialized_resolvers import ConditionCountMetricSpec\n", "from whylogs.core.schema import DeclarativeSchema\n", "\n", "schema = DeclarativeSchema(STANDARD_RESOLVER)\n", "\n", "schema.add_resolver_spec(column_name=\"subscription_date\", metrics=[ConditionCountMetricSpec(has_date_format)])\n", "schema.add_resolver_spec(column_name=\"url\", metrics=[ConditionCountMetricSpec(regex_conditions)])\n", "schema.add_resolver_spec(column_name=\"age\", metrics=[ConditionCountMetricSpec(ints_conditions)])\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's pass the schema to `why.log()` and start logging our data:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Validator: condition_count\n", " Condition name url_matches_domain failed for value www.wrongdomain.com\n", " Pulling andon cord....\n", "Validator: condition_count\n", " Condition name url_matches_domain failed for value www.wrongdomain.com\n", " Sending slack alert....\n", "Validator: condition_count\n", " Condition name url_matches_domain failed for value http://mydomain.com/unsecure\n", " Pulling andon cord....\n", "Validator: condition_count\n", " Condition name url_matches_domain failed for value http://mydomain.com/unsecure\n", " Sending slack alert....\n", "Validator: condition_count\n", " Condition name Y-m-d format failed for value 2019-29-11\n", " Sending slack alert....\n", "Validator: condition_count\n", " Condition name Y-m-d format failed for value 04/08/2021\n", " Sending slack alert....\n" ] } ], "source": [ "import whylogs as why\n", "profile_view = why.log(df, schema=schema).profile().view()" ] }, { "attachments": {}, "cell_type": 
"markdown", "metadata": {}, "source": [ "You can see that during the logging process, our actions were triggered whenever the condition failed. We can see the name of the failed condition and the specific value that triggered it.\n", "\n", "We see the actions were triggered, but we also expect the Condition Count Metrics to be generated. Let's see if this is the case:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | cardinality/est | \n", "cardinality/lower_1 | \n", "cardinality/upper_1 | \n", "condition_count/integer_zeros | \n", "condition_count/total | \n", "counts/inf | \n", "counts/n | \n", "counts/nan | \n", "counts/null | \n", "distribution/max | \n", "distribution/mean | \n", "distribution/median | \n", "distribution/min | \n", "distribution/n | \n", "distribution/q_01 | \n", "distribution/q_05 | \n", "distribution/q_10 | \n", "distribution/q_25 | \n", "distribution/q_75 | \n", "distribution/q_90 | \n", "distribution/q_95 | \n", "distribution/q_99 | \n", "distribution/stddev | \n", "frequent_items/frequent_strings | \n", "ints/max | \n", "ints/min | \n", "type | \n", "types/boolean | \n", "types/fractional | \n", "types/integral | \n", "types/object | \n", "types/string | \n", "types/tensor | \n", "condition_count/Y-m-d format | \n", "condition_count/url_matches_domain | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
column | \n", "\n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " |
age | \n", "3.0 | \n", "3.0 | \n", "3.00015 | \n", "1.0 | \n", "3.0 | \n", "0 | \n", "3 | \n", "0 | \n", "0 | \n", "31.0 | \n", "18.666667 | \n", "25.0 | \n", "0.0 | \n", "3 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "31.0 | \n", "31.0 | \n", "31.0 | \n", "31.0 | \n", "16.441817 | \n", "[FrequentItem(value='25', est=1, upper=1, lowe... | \n", "31.0 | \n", "0.0 | \n", "SummaryType.COLUMN | \n", "0 | \n", "0 | \n", "3 | \n", "0 | \n", "0 | \n", "0 | \n", "NaN | \n", "NaN | \n", "
name | \n", "3.0 | \n", "3.0 | \n", "3.00015 | \n", "NaN | \n", "NaN | \n", "0 | \n", "3 | \n", "0 | \n", "0 | \n", "NaN | \n", "0.000000 | \n", "NaN | \n", "NaN | \n", "0 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "0.000000 | \n", "[FrequentItem(value='Alice', est=1, upper=1, l... | \n", "NaN | \n", "NaN | \n", "SummaryType.COLUMN | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "3 | \n", "0 | \n", "NaN | \n", "NaN | \n", "
subscription_date | \n", "3.0 | \n", "3.0 | \n", "3.00015 | \n", "NaN | \n", "3.0 | \n", "0 | \n", "3 | \n", "0 | \n", "0 | \n", "NaN | \n", "0.000000 | \n", "NaN | \n", "NaN | \n", "0 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "0.000000 | \n", "[FrequentItem(value='2019-29-11', est=1, upper... | \n", "NaN | \n", "NaN | \n", "SummaryType.COLUMN | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "3 | \n", "0 | \n", "1.0 | \n", "NaN | \n", "
url | \n", "3.0 | \n", "3.0 | \n", "3.00015 | \n", "NaN | \n", "3.0 | \n", "0 | \n", "3 | \n", "0 | \n", "0 | \n", "NaN | \n", "0.000000 | \n", "NaN | \n", "NaN | \n", "0 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "0.000000 | \n", "[FrequentItem(value='www.wrongdomain.com', est... | \n", "NaN | \n", "NaN | \n", "SummaryType.COLUMN | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "3 | \n", "0 | \n", "NaN | \n", "1.0 | \n", "