{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "_klgn5JO0oqh" }, "source": [ ">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*
\n", ">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Schema_Configuration)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Schema_Configuration) to leverage the power of whylogs and WhyLabs together!*" ] }, { "cell_type": "markdown", "metadata": { "id": "UjrYEE9H0oqj" }, "source": [ "# Schema Configuration for Tracking Metrics" ] }, { "cell_type": "markdown", "metadata": { "id": "Z6wkDgOL0oqk" }, "source": [ "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/basic/Schema_Configuration.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "id": "LPUPlTjC0oqk" }, "source": [ "When logging data, whylogs outputs certain metrics according to the column type. While whylogs provide a default behaviour, you can configure it in order to only track metrics that are important to you.\n", "\n", "In this example, we'll see how you can configure the Schema for a dataset level to control which metrics you want to calculate.\n", "We'll see how to specify metrics:\n", "\n", "1. Per data type\n", "\n", "2. Per column name\n", "\n", "\n", "But first, let's talk briefly about whylogs' data types and basic metrics." 
] }, { "cell_type": "markdown", "metadata": { "id": "_FnYGJyu0oql" }, "source": [ "## Installing whylogs" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "4nmNldIc0oql", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "d519ace0-fa87-402a-8c04-adfea326f868" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Installing collected packages: whylogs-sketching, types-urllib3, types-requests, whylabs-client, whylogs\n", "Successfully installed types-requests-2.31.0.2 types-urllib3-1.26.25.14 whylabs-client-0.5.4 whylogs-1.3.0 whylogs-sketching-3.4.1.dev3\n" ] } ], "source": [ "# Note: you may need to restart the kernel to use updated packages.\n", "%pip install whylogs" ] }, { "cell_type": "markdown", "metadata": { "id": "CknnJlhl0oqm" }, "source": [ "## whylogs DataTypes" ] }, { "cell_type": "markdown", "metadata": { "id": "iScoxXG40oqm" }, "source": [ "whylogs maps different data types, like numpy arrays, list, integers, etc. to specific whylogs data types. 
The three most important whylogs data types are:\n", "\n", "- Integral\n", "- Fractional\n", "- String" ] }, { "cell_type": "markdown", "metadata": { "id": "Dal4ykoM0oqn" }, "source": [ "Anything that doesn't end up matching the above types will have an `AnyType` type.\n", "\n", "To check which type a certain Python type is mapped to in whylogs, you can use the `StandardTypeMapper`:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "bU1Ehh6O0oqo", "outputId": "ef4d4245-0870-4785-8210-b9d696179dfc" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": {}, "execution_count": 2 } ], "source": [ "from whylogs.core.datatypes import StandardTypeMapper\n", "\n", "type_mapper = StandardTypeMapper()\n", "\n", "type_mapper(list)" ] }, { "cell_type": "markdown", "metadata": { "id": "0G5JtXSw0oqp" }, "source": [ "## Basic Metrics" ] }, { "cell_type": "markdown", "metadata": { "id": "ULgE1qA_0oqp" }, "source": [ "The standard metrics available in whylogs are grouped into __namespaces__. They are:\n", "\n", "- __counts__: Counters, such as number of samples and null values\n", "- __types__: Inferred types, such as boolean, string, or fractional\n", "- __ints__: Max and min values\n", "- __distribution__: Min, max, median, and quantile values\n", "- __cardinality__: Number of different values\n", "- __frequent_items__: Most common values\n", "- __unicode_range__: Count of characters used in string values\n", "- __condition_count__: Count how often values meet specified conditions" ] }, { "cell_type": "markdown", "metadata": { "id": "RTWtbmj60oqp" }, "source": [ "## Configuring Metrics in the Dataset Schema" ] }, { "cell_type": "markdown", "metadata": { "id": "WI1kMCg_0oqq" }, "source": [ "Now, let's see how we can control which metrics are tracked according to the column's type or column name." 
] }, { "cell_type": "markdown", "metadata": { "id": "YL9oBPgX0oqq" }, "source": [ "### Metrics per Type" ] }, { "cell_type": "markdown", "metadata": { "id": "-dho9xYo0oqq" }, "source": [ "Let's assume you're not interested in every metric listed above, and you have a performance-critical application, so you'd like to do as few calculations as possible.\n", "\n", "For example, you might only be interested in:\n", "\n", "- Counts/Types metrics for every data type\n", "- Distribution metrics for Fractional\n", "- Frequent Items for Integral\n", "\n", "Let's see how we can configure our Schema to track only the above metrics for the related types." ] }, { "cell_type": "markdown", "metadata": { "id": "6zQWkC0m0oqr" }, "source": [ "Let's create a sample dataframe to illustrate:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mlaGnadP0oqr" }, "outputs": [], "source": [ "# Install pandas if you don't have it already\n", "%pip install pandas\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "l_OiNyWx0oqr" }, "outputs": [], "source": [ "import pandas as pd\n", "d = {\"col1\": [1, 2, 3], \"col2\": [3.0, 4.0, 5.0], \"col3\": [\"a\", \"b\", \"c\"], \"col4\": [3.0, 4.0, 5.0]}\n", "df = pd.DataFrame(data=d)" ] }, { "cell_type": "markdown", "metadata": { "id": "CfmFqAaB0oqs" }, "source": [ "whylogs uses `Resolvers` in order to define how a column name or data type gets mapped to different metrics.\n", "\n", "We will create a custom Resolver class in order to customize it." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "wRHf46IA0oqs" }, "outputs": [], "source": [ "from whylogs.core.resolvers import Resolver\n", "from whylogs.core.datatypes import DataType, Fractional, Integral\n", "from typing import Dict, List\n", "from whylogs.core.metrics import StandardMetric\n", "from whylogs.core.metrics.metrics import Metric\n", "\n", "class MyCustomResolver(Resolver):\n", " \"\"\"Resolver that keeps distribution metrics for Fractional and frequent items for Integral, and counters and types metrics for all data types.\"\"\"\n", "\n", " def resolve(self, name: str, why_type: DataType, column_schema) -> Dict[str, Metric]:\n", " metrics: List[StandardMetric] = [StandardMetric.counts, StandardMetric.types]\n", " if isinstance(why_type, Fractional):\n", " metrics.append(StandardMetric.distribution)\n", " if isinstance(why_type, Integral):\n", " metrics.append(StandardMetric.frequent_items)\n", "\n", "\n", " result: Dict[str, Metric] = {}\n", " for m in metrics:\n", " result[m.name] = m.zero(column_schema.cfg)\n", " return result\n" ] }, { "cell_type": "markdown", "metadata": { "id": "pfpijk4F0oqs" }, "source": [ "In the case above, the `name` parameter is not being used, as the column name is not relevant to map the metrics, only the `why_type`.\n", "\n", "We basically initialize `metrics` with metrics of both `counts` and `types` namespaces regardless of the data type. Then, we check for the whylogs data type in order to add the desired metric namespace (`distribution` for __Fractional__ columns and `frequent_items` for __Integral__ columns)" ] }, { "cell_type": "markdown", "metadata": { "id": "O6WvUsIx0oqt" }, "source": [ "Now we can proceed with the normal process of logging a dataframe. 
Resolvers are passed to whylogs through a `Dataset Schema`, so we can pass a `DatasetSchema` object to log's `schema` parameter as follows:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 258 }, "id": "X7ohGb5E0oqt", "outputId": "41a798e0-f82a-42df-d33e-9ab5e63057c5" }, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "WARNING:whylogs.api.whylabs.session.session_manager:No session found. Call whylogs.init() to initialize a session and authenticate. See https://docs.whylabs.ai/docs/whylabs-whylogs-init for more information.\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ " counts/inf counts/n counts/nan counts/null \\\n", "column \n", "col1 0 3 0 0 \n", "col2 0 3 0 0 \n", "col3 0 3 0 0 \n", "col4 0 3 0 0 \n", "\n", " frequent_items/frequent_strings type \\\n", "column \n", "col1 [FrequentItem(value='1', est=1, upper=1, lower... SummaryType.COLUMN \n", "col2 NaN SummaryType.COLUMN \n", "col3 NaN SummaryType.COLUMN \n", "col4 NaN SummaryType.COLUMN \n", "\n", " types/boolean types/fractional types/integral types/object \\\n", "column \n", "col1 0 0 3 0 \n", "col2 0 3 0 0 \n", "col3 0 0 0 0 \n", "col4 0 3 0 0 \n", "\n", " types/string types/tensor distribution/max distribution/mean \\\n", "column \n", "col1 0 0 NaN NaN \n", "col2 0 0 5.0 4.0 \n", "col3 3 0 NaN NaN \n", "col4 0 0 5.0 4.0 \n", "\n", " distribution/median distribution/min distribution/n \\\n", "column \n", "col1 NaN NaN NaN \n", "col2 4.0 3.0 3.0 \n", "col3 NaN NaN NaN \n", "col4 4.0 3.0 3.0 \n", "\n", " distribution/q_01 distribution/q_05 distribution/q_10 \\\n", "column \n", "col1 NaN NaN NaN \n", "col2 3.0 3.0 3.0 \n", "col3 NaN NaN NaN \n", "col4 3.0 3.0 3.0 \n", "\n", " distribution/q_25 distribution/q_75 distribution/q_90 \\\n", "column \n", "col1 NaN NaN NaN \n", "col2 3.0 5.0 5.0 \n", "col3 NaN NaN NaN \n", "col4 3.0 5.0 5.0 \n", "\n", " distribution/q_95 distribution/q_99 
distribution/stddev \n", "column \n", "col1 NaN NaN NaN \n", "col2 5.0 5.0 1.0 \n", "col3 NaN NaN NaN \n", "col4 5.0 5.0 1.0 " ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
counts/infcounts/ncounts/nancounts/nullfrequent_items/frequent_stringstypetypes/booleantypes/fractionaltypes/integraltypes/objecttypes/stringtypes/tensordistribution/maxdistribution/meandistribution/mediandistribution/mindistribution/ndistribution/q_01distribution/q_05distribution/q_10distribution/q_25distribution/q_75distribution/q_90distribution/q_95distribution/q_99distribution/stddev
column
col10300[FrequentItem(value='1', est=1, upper=1, lower...SummaryType.COLUMN003000NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
col20300NaNSummaryType.COLUMN0300005.04.04.03.03.03.03.03.03.05.05.05.05.01.0
col30300NaNSummaryType.COLUMN000030NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
col40300NaNSummaryType.COLUMN0300005.04.04.03.03.03.03.03.03.05.05.05.05.01.0
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ] }, "metadata": {}, "execution_count": 6 } ], "source": [ "import whylogs as why\n", "from whylogs.core import DatasetSchema\n", "result = why.log(df, schema=DatasetSchema(resolvers=MyCustomResolver()))\n", "prof = result.profile()\n", "prof_view = prof.view()\n", "pd.set_option(\"display.max_columns\", None)\n", "prof_view.to_pandas()" ] }, { "cell_type": "markdown", "metadata": { "id": "JIJu86Am0oqt" }, "source": [ "Notice we have `counts` and `types` metrics for every type, `distribution` metrics only for `col2` and `col4` (floats) and `frequent_items` only for `col1` (ints).\n", "\n", "That's precisely what we wanted." ] }, { "cell_type": "markdown", "metadata": { "id": "zvUgPw0M0oqu" }, "source": [ "### Metrics per Column" ] }, { "cell_type": "markdown", "metadata": { "id": "fst55b2Z0oqu" }, "source": [ "Now, suppose we don't want to specify the tracked metrics per data type, and rather by each specific columns.\n", "\n", "For example, we might want to track:" ] }, { "cell_type": "markdown", "metadata": { "id": "imjPxKFe0oqu" }, "source": [ "- Count metrics for `col1`\n", "- Distribution Metrics for `col2`\n", "- Cardinality for `col3`\n", "- Distribution Metrics + Cardinality for `col4`\n" ] }, { "cell_type": "markdown", "metadata": { "id": "xyEU8lFR0oqu" }, "source": [ "The process is similar to the previous case. 
We only need to change the if clauses to check for the `name` instead of `why_type`, like this:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "hjUhBECv0oqv" }, "outputs": [], "source": [ "from whylogs.core.resolvers import Resolver\n", "from whylogs.core.datatypes import DataType, Fractional, Integral\n", "from typing import Dict, List\n", "from whylogs.core.metrics import StandardMetric\n", "from whylogs.core.metrics.metrics import Metric\n", "\n", "class MyCustomResolver(Resolver):\n", " \"\"\"Resolver that assigns counts metrics to col1, distribution to col2, cardinality to col3, and distribution plus cardinality to col4.\"\"\"\n", "\n", " def resolve(self, name: str, why_type: DataType, column_schema) -> Dict[str, Metric]:\n", " metrics = []\n", " if name == 'col1':\n", " metrics.append(StandardMetric.counts)\n", " if name == 'col2':\n", " metrics.append(StandardMetric.distribution)\n", " if name == 'col3':\n", " metrics.append(StandardMetric.cardinality)\n", " if name == 'col4':\n", " metrics.append(StandardMetric.distribution)\n", " metrics.append(StandardMetric.cardinality)\n", "\n", " result: Dict[str, Metric] = {}\n", " for m in metrics:\n", " result[m.name] = m.zero(column_schema.cfg)\n", " return result\n" ] }, { "cell_type": "markdown", "metadata": { "id": "LCyHU22O0oqv" }, "source": [ "Since there are no metrics common to all columns, we can initialize `metrics` as an empty list and then append the relevant metrics for each column.\n", "\n", "Now, we create a custom schema, just like before:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 255 }, "id": "jpTGcNNV0oqv", "outputId": "99a422ee-8dfe-4811-d137-91b6a18947a5" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " counts/inf counts/n counts/nan counts/null type \\\n", "column \n", "col1 0.0 3.0 0.0 0.0 SummaryType.COLUMN \n", "col2 NaN NaN NaN NaN 
SummaryType.COLUMN \n", "col3 NaN NaN NaN NaN SummaryType.COLUMN \n", "col4 NaN NaN NaN NaN SummaryType.COLUMN \n", "col5 NaN NaN NaN NaN SummaryType.COLUMN \n", "\n", " distribution/max distribution/mean distribution/median \\\n", "column \n", "col1 NaN NaN NaN \n", "col2 5.0 4.0 4.0 \n", "col3 NaN NaN NaN \n", "col4 5.0 4.0 4.0 \n", "col5 NaN NaN NaN \n", "\n", " distribution/min distribution/n distribution/q_01 \\\n", "column \n", "col1 NaN NaN NaN \n", "col2 3.0 3.0 3.0 \n", "col3 NaN NaN NaN \n", "col4 3.0 3.0 3.0 \n", "col5 NaN NaN NaN \n", "\n", " distribution/q_05 distribution/q_10 distribution/q_25 \\\n", "column \n", "col1 NaN NaN NaN \n", "col2 3.0 3.0 3.0 \n", "col3 NaN NaN NaN \n", "col4 3.0 3.0 3.0 \n", "col5 NaN NaN NaN \n", "\n", " distribution/q_75 distribution/q_90 distribution/q_95 \\\n", "column \n", "col1 NaN NaN NaN \n", "col2 5.0 5.0 5.0 \n", "col3 NaN NaN NaN \n", "col4 5.0 5.0 5.0 \n", "col5 NaN NaN NaN \n", "\n", " distribution/q_99 distribution/stddev cardinality/est \\\n", "column \n", "col1 NaN NaN NaN \n", "col2 5.0 1.0 NaN \n", "col3 NaN NaN 3.0 \n", "col4 5.0 1.0 3.0 \n", "col5 NaN NaN NaN \n", "\n", " cardinality/lower_1 cardinality/upper_1 \n", "column \n", "col1 NaN NaN \n", "col2 NaN NaN \n", "col3 3.0 3.00015 \n", "col4 3.0 3.00015 \n", "col5 NaN NaN " ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
counts/infcounts/ncounts/nancounts/nulltypedistribution/maxdistribution/meandistribution/mediandistribution/mindistribution/ndistribution/q_01distribution/q_05distribution/q_10distribution/q_25distribution/q_75distribution/q_90distribution/q_95distribution/q_99distribution/stddevcardinality/estcardinality/lower_1cardinality/upper_1
column
col10.03.00.00.0SummaryType.COLUMNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
col2NaNNaNNaNNaNSummaryType.COLUMN5.04.04.03.03.03.03.03.03.05.05.05.05.01.0NaNNaNNaN
col3NaNNaNNaNNaNSummaryType.COLUMNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN3.03.03.00015
col4NaNNaNNaNNaNSummaryType.COLUMN5.04.04.03.03.03.03.03.03.05.05.05.05.01.03.03.03.00015
col5NaNNaNNaNNaNSummaryType.COLUMNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ] }, "metadata": {}, "execution_count": 8 } ], "source": [ "import whylogs as why\n", "from whylogs.core import DatasetSchema\n", "df['col5'] = 0\n", "result = why.log(df, schema=DatasetSchema(resolvers=MyCustomResolver()))\n", "prof = result.profile()\n", "prof_view = prof.view()\n", "pd.set_option(\"display.max_columns\", None)\n", "prof_view.to_pandas()" ] }, { "cell_type": "markdown", "metadata": { "id": "HzEKQywx0oqw" }, "source": [ "Note that existing columns that are not specified in your custom resolver won't have any metrics tracked. In the example above, we added a `col5` column, but since we didn't link any metrics to it, all of the metrics are `NaN`s.\n", "\n", "## Declarative Schema\n", "\n", "In the previous section, we created subclasses of `Resolver` and implemented its `resolve()` method using control flow. The `DeclarativeSchema` allows us to customize the metrics present in a column by simply listing the metrics we want by data type or column name without implementing a `Resolver` subclass.\n", "\n", "### Declarative Schema Specification\n", "\n", "A `ResolverSpec` specifies a list of metrics to use for columns that match it. We can match columns by name or by type. The column name takes precedence if both are given. Each `ResolverSpec` has a list of `MetricSpec` that specify the `Metric`s (and optionally custom configurations) to apply to matching metrics. 
For example:\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 258 }, "id": "V2x8PhFUh1ep", "outputId": "65d91834-e5cd-443d-963b-a2be7a48428e" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " condition_count/above 42 condition_count/below 42 \\\n", "column \n", "col1 0.0 3.0 \n", "col2 NaN NaN \n", "col3 NaN NaN \n", "col4 NaN NaN \n", "\n", " condition_count/total distribution/max distribution/mean \\\n", "column \n", "col1 3.0 3.0 2.0 \n", "col2 NaN NaN NaN \n", "col3 3.0 NaN NaN \n", "col4 NaN NaN NaN \n", "\n", " distribution/median distribution/min distribution/n \\\n", "column \n", "col1 2.0 1.0 3.0 \n", "col2 NaN NaN NaN \n", "col3 NaN NaN NaN \n", "col4 NaN NaN NaN \n", "\n", " distribution/q_01 distribution/q_05 distribution/q_10 \\\n", "column \n", "col1 1.0 1.0 1.0 \n", "col2 NaN NaN NaN \n", "col3 NaN NaN NaN \n", "col4 NaN NaN NaN \n", "\n", " distribution/q_25 distribution/q_75 distribution/q_90 \\\n", "column \n", "col1 1.0 3.0 3.0 \n", "col2 NaN NaN NaN \n", "col3 NaN NaN NaN \n", "col4 NaN NaN NaN \n", "\n", " distribution/q_95 distribution/q_99 distribution/stddev \\\n", "column \n", "col1 3.0 3.0 1.0 \n", "col2 NaN NaN NaN \n", "col3 NaN NaN NaN \n", "col4 NaN NaN NaN \n", "\n", " type condition_count/alpha condition_count/digit \\\n", "column \n", "col1 SummaryType.COLUMN NaN NaN \n", "col2 SummaryType.COLUMN NaN NaN \n", "col3 SummaryType.COLUMN 3.0 0.0 \n", "col4 SummaryType.COLUMN NaN NaN \n", "\n", " frequent_items/frequent_strings \n", "column \n", "col1 NaN \n", "col2 NaN \n", "col3 [FrequentItem(value='c', est=1, upper=1, lower... \n", "col4 NaN " ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
condition_count/above 42condition_count/below 42condition_count/totaldistribution/maxdistribution/meandistribution/mediandistribution/mindistribution/ndistribution/q_01distribution/q_05distribution/q_10distribution/q_25distribution/q_75distribution/q_90distribution/q_95distribution/q_99distribution/stddevtypecondition_count/alphacondition_count/digitfrequent_items/frequent_strings
column
col10.03.03.03.02.02.01.03.01.01.01.01.03.03.03.03.01.0SummaryType.COLUMNNaNNaNNaN
col2NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNSummaryType.COLUMNNaNNaNNaN
col3NaNNaN3.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNSummaryType.COLUMN3.00.0[FrequentItem(value='c', est=1, upper=1, lower...
col4NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNSummaryType.COLUMNNaNNaNNaN
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ] }, "metadata": {}, "execution_count": 9 } ], "source": [ "from whylogs.core.metrics.condition_count_metric import (\n", " Condition,\n", " ConditionCountConfig,\n", " ConditionCountMetric,\n", ")\n", "from whylogs.core.relations import Predicate\n", "from whylogs.core.resolvers import COLUMN_METRICS, MetricSpec, ResolverSpec\n", "from whylogs.core.schema import DeclarativeSchema\n", "from whylogs.core.datatypes import AnyType, DataType, Fractional, Integral, String\n", "\n", "X = Predicate()\n", "\n", "\n", "schema = DeclarativeSchema(\n", " [\n", " ResolverSpec(\n", " column_name=\"col1\",\n", " metrics=[\n", " MetricSpec(StandardMetric.distribution.value),\n", " MetricSpec(\n", " ConditionCountMetric,\n", " ConditionCountConfig(\n", " conditions={\n", " \"below 42\": Condition(lambda x: x < 42),\n", " \"above 42\": Condition(lambda x: x > 42),\n", " }\n", " ),\n", " ),\n", " ],\n", " ),\n", " ResolverSpec(\n", " column_type=String,\n", " metrics=[\n", " MetricSpec(StandardMetric.frequent_items.value),\n", " MetricSpec(\n", " ConditionCountMetric,\n", " ConditionCountConfig(\n", " conditions={\n", " \"alpha\": Condition(X.matches(\"[a-zA-Z]+\")),\n", " \"digit\": Condition(X.matches(\"[0-9]+\")),\n", " }\n", " ),\n", " ),\n", " ],\n", " ),\n", " ]\n", ")\n", "\n", "d = {\"col1\": [1, 2, 3], \"col2\": [3.0, 4.0, 5.0], \"col3\": [\"a\", \"b\", \"c\"], \"col4\": [3.0, 4.0, 5.0]}\n", "df = pd.DataFrame(data=d)\n", "result = why.log(df, schema=schema)\n", "prof_view = result.profile().view()\n", "prof_view.to_pandas()" ] }, { "cell_type": "markdown", "metadata": { "id": "JiMlsvSIh2m7" }, "source": [ "We can now pass `schema` to `why.log()` to log data according to the schema. Note that we pass the `Metric` class to the the `MetricSpec` constructor, not an instance. In this example, `col1` will have a `ConditionCountMetric` that tracks how often the column entries are above or below 42. 
Any string column will track how many entries are alphabetic and how many are numeric.\n", "\n", "`whylogs.core.resolvers.COLUMN_METRICS` is a list of `MetricSpec`s for the metrics WhyLabs expects in each column. There are also some predefined `ResolverSpec` lists to cover common use cases. For example, `STANDARD_RESOLVER` specifies the same metrics as the `StandardResolver`:\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "6A4qR1lFpbjR" }, "outputs": [], "source": [ "STANDARD_RESOLVER = [\n", " ResolverSpec(\n", " column_type=Integral,\n", " metrics=COLUMN_METRICS\n", " + [\n", " MetricSpec(StandardMetric.distribution.value),\n", " MetricSpec(StandardMetric.ints.value),\n", " MetricSpec(StandardMetric.cardinality.value),\n", " MetricSpec(StandardMetric.frequent_items.value),\n", " ],\n", " ),\n", " ResolverSpec(\n", " column_type=Fractional,\n", " metrics=COLUMN_METRICS\n", " + [\n", " MetricSpec(StandardMetric.distribution.value),\n", " MetricSpec(StandardMetric.cardinality.value),\n", " ],\n", " ),\n", " ResolverSpec(\n", " column_type=String,\n", " metrics=COLUMN_METRICS\n", " + [\n", " MetricSpec(StandardMetric.unicode_range.value),\n", " MetricSpec(StandardMetric.distribution.value),\n", " MetricSpec(StandardMetric.cardinality.value),\n", " MetricSpec(StandardMetric.frequent_items.value),\n", " ],\n", " ),\n", " ResolverSpec(column_type=AnyType, metrics=COLUMN_METRICS),\n", "]" ] }, { "cell_type": "markdown", "metadata": { "id": "87x9SdHlP2DF" }, "source": [ "There are also other predefined declarations:\n", "* `LIMITED_TRACKING_RESOLVER` tracks only the metrics required by WhyLabs, plus the distribution metric for numeric columns.\n", "* `NO_FI_RESOLVER` is the same as `STANDARD_RESOLVER` but omits the frequent items metrics.\n", "* `HISTOGRAM_COUNTING_TRACKING_RESOLVER` tracks only the distribution metric for each column.\n", "\n", "These provide handy starting places if we just want to add one or two metrics to one of these standard schemas using 
the `add_resolver()` method:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 292 }, "id": "Dlwjc70uQNi-", "outputId": "007ffa75-a29f-41f1-aed8-a6e32cab82a3" }, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "WARNING:whylogs.core.resolvers:Conflicting resolvers for distribution metric in column 'col1' of type int\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ " cardinality/est cardinality/lower_1 cardinality/upper_1 \\\n", "column \n", "col1 3.0 3.0 3.00015 \n", "col2 3.0 3.0 3.00015 \n", "col3 3.0 3.0 3.00015 \n", "col4 3.0 3.0 3.00015 \n", "\n", " condition_count/above 42 condition_count/below 42 \\\n", "column \n", "col1 0.0 3.0 \n", "col2 NaN NaN \n", "col3 NaN NaN \n", "col4 NaN NaN \n", "\n", " condition_count/total counts/inf counts/n counts/nan counts/null \\\n", "column \n", "col1 3.0 0 3 0 0 \n", "col2 NaN 0 3 0 0 \n", "col3 NaN 0 3 0 0 \n", "col4 NaN 0 3 0 0 \n", "\n", " distribution/max distribution/mean distribution/median \\\n", "column \n", "col1 3.0 2.0 2.0 \n", "col2 5.0 4.0 4.0 \n", "col3 NaN 0.0 NaN \n", "col4 5.0 4.0 4.0 \n", "\n", " distribution/min distribution/n distribution/q_01 \\\n", "column \n", "col1 1.0 3 1.0 \n", "col2 3.0 3 3.0 \n", "col3 NaN 0 NaN \n", "col4 3.0 3 3.0 \n", "\n", " distribution/q_05 distribution/q_10 distribution/q_25 \\\n", "column \n", "col1 1.0 1.0 1.0 \n", "col2 3.0 3.0 3.0 \n", "col3 NaN NaN NaN \n", "col4 3.0 3.0 3.0 \n", "\n", " distribution/q_75 distribution/q_90 distribution/q_95 \\\n", "column \n", "col1 3.0 3.0 3.0 \n", "col2 5.0 5.0 5.0 \n", "col3 NaN NaN NaN \n", "col4 5.0 5.0 5.0 \n", "\n", " distribution/q_99 distribution/stddev \\\n", "column \n", "col1 3.0 1.0 \n", "col2 5.0 1.0 \n", "col3 NaN 0.0 \n", "col4 5.0 1.0 \n", "\n", " frequent_items/frequent_strings ints/max ints/min \\\n", "column \n", "col1 [FrequentItem(value='1', est=1, upper=1, lower... 
3.0 1.0 \n", "col2 NaN NaN NaN \n", "col3 [FrequentItem(value='c', est=1, upper=1, lower... NaN NaN \n", "col4 NaN NaN NaN \n", "\n", " type types/boolean types/fractional types/integral \\\n", "column \n", "col1 SummaryType.COLUMN 0 0 3 \n", "col2 SummaryType.COLUMN 0 3 0 \n", "col3 SummaryType.COLUMN 0 0 0 \n", "col4 SummaryType.COLUMN 0 3 0 \n", "\n", " types/object types/string types/tensor \n", "column \n", "col1 0 0 0 \n", "col2 0 0 0 \n", "col3 0 3 0 \n", "col4 0 0 0 " ], "text/html": [ "\n", "
\n" ] }, "metadata": {}, "execution_count": 11 } ], "source": [ "from whylogs.core.resolvers import STANDARD_RESOLVER\n", "\n", "schema = DeclarativeSchema(STANDARD_RESOLVER)\n", "extra_metric = ResolverSpec(\n", "    column_name=\"col1\",\n", "    metrics=[\n", "        MetricSpec(StandardMetric.distribution.value),\n", "        MetricSpec(\n", "            ConditionCountMetric,\n", "            ConditionCountConfig(\n", "                conditions={\n", "                    \"below 42\": Condition(lambda x: x < 42),\n", "                    \"above 42\": Condition(lambda x: x > 42),\n", "                }\n", "            ),\n", "        ),\n", "    ],\n", ")\n", "schema.add_resolver(extra_metric)\n", "\n", "result = why.log(df, schema=schema)\n", "prof_view = result.profile().view()\n", "prof_view.to_pandas()" ] }, { "cell_type": "markdown", "metadata": { "id": "JGAkb97WVxSP" }, "source": [ "This example adds a condition count metric to `col1` in addition to the usual default metrics.\n" ] }, { "cell_type": "markdown", "source": [ "### Default Resolver\n", "\n", "If you instantiate a `DeclarativeResolver` without passing it a list of `ResolverSpec`s, it will use the value of the variable `whylogs.core.resolvers.DEFAULT_RESOLVER`. Initially this has the value of `STANDARD_RESOLVER`, which matches whylogs' default behavior. You can set it to one of the other pre-defined resolver lists, or to your own custom resolver list, to customize the default resolving behavior.\n", "\n", "Similarly, there is a `whylogs.experimental.core.metrics.udf_metric.DEFAULT_UDF_RESOLVER` variable that specifies the default resolvers for the submetrics in a `UdfMetric`.\n", "\n", "## Excluding Metrics\n", "\n", "The `ResolverSpec` has an `exclude` field. If this is set to `True`, the metrics listed in the `ResolverSpec` are excluded from columns that match it. 
This can be handy for preventing sensitive information from \"leaking\" via a frequent items metric:" ], "metadata": { "id": "qXzLhIvtt0vF" } }, { "cell_type": "code", "source": [ "from whylogs.core.resolvers import DEFAULT_RESOLVER\n", "\n", "data = pd.DataFrame({\"Sensitive\": [\"private\", \"secret\"], \"Boring\": [\"normal\", \"stuff\"]})\n", "schema = DeclarativeSchema(\n", "    DEFAULT_RESOLVER + [ResolverSpec(\n", "        column_name=\"Sensitive\",\n", "        metrics=[MetricSpec(StandardMetric.frequent_items.value)],\n", "        exclude=True,\n", "    )]\n", ")\n", "result = why.log(data, schema=schema)\n", "result.profile().view().to_pandas()[\"frequent_items/frequent_strings\"]" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "laYXvD3GKQ1-", "outputId": "10635d62-c4d1-4513-d9f3-908ce022d680" }, "execution_count": 15, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "column\n", "Boring       [FrequentItem(value='normal', est=1, upper=1, ...\n", "Sensitive                                                  NaN\n", "Name: frequent_items/frequent_strings, dtype: object" ] }, "metadata": {}, "execution_count": 15 } ] }, { "cell_type": "markdown", "source": [ "The frequent items metric has been excluded from the `Sensitive` column without affecting the `DEFAULT_RESOLVER`'s treatment of other columns." ], "metadata": { "id": "RV-C3NkKNiRk" } } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3.8.10 ('.venv': poetry)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "8430e7bcc333486e417258c6fadac662061ebd166d9f3c5ccb12c1968aa41625" } } }, "nbformat": 4, "nbformat_minor": 0 }