{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Schema Configuration for Tracking Metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs-v1/blob/mainline/python/examples/basic/Schema_Configuration.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When logging data, whylogs outputs certain metrics according to the column type. While whylogs provides default behaviour, you can configure it to track only the metrics that are important to you.\n", "\n", "In this example, we'll see how you can configure the schema at the dataset level to control which metrics are calculated.\n", "We'll see how to specify metrics:\n", "\n", "1. Per data type\n", "\n", "2. Per column name\n", "\n", "\n", "But first, let's talk briefly about whylogs' data types and basic metrics." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## whylogs DataTypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "whylogs maps different data types, like numpy arrays, lists, integers, etc. to specific whylogs data types. 
The three most important whylogs data types are:\n", "\n", "- Integral\n", "- Fractional\n", "- String" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Anything that doesn't end up matching the above types will have an `AnyType` type.\n", "\n", "If you want to check which whylogs type a given Python type is mapped to, you can use the `StandardTypeMapper`:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from whylogs.core.datatypes import StandardTypeMapper\n", "\n", "type_mapper = StandardTypeMapper()\n", "\n", "type_mapper(list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The standard metrics available in whylogs are grouped in __namespaces__. They are:\n", "\n", "- __counts__: Counters, such as number of samples and null values\n", "- __types__: Inferred types, such as boolean, string or fractional\n", "- __ints__: Max and min values\n", "- __distribution__: min, max, median and quantile values\n", "- __cardinality__\n", "- __frequent_items__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configuring Metrics in the Dataset Schema" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's see how we can control which metrics are tracked according to the column's type or column name." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Metrics per Type" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's assume you're not interested in every metric listed above, and you have a performance-critical application, so you'd like to do as few calculations as possible.\n", "\n", "For example, you might only be interested in:\n", "\n", "- Counts/Types metrics for every data type\n", "- Distribution metrics for Fractional\n", "- Frequent Items for Integral\n", "\n", "Let's see how we can configure our schema to track only the above metrics for the related types." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create a sample dataframe to illustrate:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "d = {\"col1\": [1, 2, 3], \"col2\": [3.0, 4.0, 5.0], \"col3\": [\"a\", \"b\", \"c\"], \"col4\": [3.0, 4.0, 5.0]}\n", "df = pd.DataFrame(data=d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "whylogs uses `Resolvers` to define how a column name or data type gets mapped to different metrics.\n", "\n", "We need to create a custom `Resolver` class to customize this mapping."
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from whylogs.core.resolvers import Resolver\n", "from whylogs.core.datatypes import DataType, Fractional, Integral\n", "from typing import Dict, List\n", "from whylogs.core.metrics import StandardMetric\n", "from whylogs.core.metrics.metrics import Metric\n", "\n", "class MyCustomResolver(Resolver):\n", "    \"\"\"Resolver that keeps distribution metrics for Fractional and frequent items for Integral, and counters and types metrics for all data types.\"\"\"\n", "\n", "    def resolve(self, name: str, why_type: DataType, column_schema) -> Dict[str, Metric]:\n", "        # Namespaces tracked for every column, whatever its type\n", "        metrics: List[StandardMetric] = [StandardMetric.counts, StandardMetric.types]\n", "        # Type-specific namespaces\n", "        if isinstance(why_type, Fractional):\n", "            metrics.append(StandardMetric.distribution)\n", "        if isinstance(why_type, Integral):\n", "            metrics.append(StandardMetric.frequent_items)\n", "\n", "        # Create a zero-initialized metric for each selected namespace\n", "        result: Dict[str, Metric] = {}\n", "        for m in metrics:\n", "            result[m.name] = m.zero(column_schema)\n", "        return result\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the resolver above, the `name` parameter is unused: only `why_type`, not the column name, determines which metrics are tracked.\n", "\n", "We initialize `metrics` with the `counts` and `types` namespaces regardless of the data type. 
Then, we check the whylogs data type to add the desired metric namespace (`distribution` for __Fractional__ columns and `frequent_items` for __Integral__ columns)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Resolvers are passed to whylogs through a `DatasetSchema`, so we'll have to create a custom schema as well.\n", "\n", "In this case, since we're only interested in the resolvers, we can create a custom schema as follows:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from whylogs.core import DatasetSchema" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "class MyCustomSchema(DatasetSchema):\n", "    resolvers = MyCustomResolver()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can proceed with the normal process of logging a dataframe, remembering to pass our schema when making the `log` call:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " counts/n counts/null types/integral types/fractional \\\n", "column \n", "col2 3 0 0 3 \n", "col4 3 0 0 3 \n", "col3 3 0 0 0 \n", "col1 3 0 3 0 \n", "\n", " types/boolean types/string types/object distribution/mean \\\n", "column \n", "col2 0 0 0 4.0 \n", "col4 0 0 0 4.0 \n", "col3 0 3 0 NaN \n", "col1 0 0 0 NaN \n", "\n", " distribution/stddev distribution/n distribution/max \\\n", "column \n", "col2 1.0 3.0 5.0 \n", "col4 1.0 3.0 5.0 \n", "col3 NaN NaN NaN \n", "col1 NaN NaN NaN \n", "\n", " distribution/min distribution/q_10 distribution/q_25 \\\n", "column \n", "col2 3.0 3.0 3.0 \n", "col4 3.0 3.0 3.0 \n", "col3 NaN NaN NaN \n", "col1 NaN NaN NaN \n", "\n", " distribution/median distribution/q_75 distribution/q_90 \\\n", "column \n", "col2 4.0 5.0 5.0 \n", "col4 4.0 5.0 5.0 \n", "col3 NaN NaN NaN \n", "col1 NaN NaN NaN \n", "\n", " type frequent_items/frequent_strings \n", "column \n", "col2 SummaryType.COLUMN NaN \n", "col4 SummaryType.COLUMN NaN \n", "col3 SummaryType.COLUMN NaN \n", "col1 SummaryType.COLUMN [FrequentItem(value='1.000000', est=1, upper=1... " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import whylogs as why\n", "\n", "result = why.log(df, schema=MyCustomSchema())\n", "prof = result.profile()\n", "prof_view = prof.view()\n", "pd.set_option(\"display.max_columns\", None)\n", "prof_view.to_pandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice we have `counts` and `types` metrics for every type, `distribution` metrics only for `col2` and `col4` (floats) and `frequent_items` only for `col1` (ints).\n", "\n", "That's precisely what we wanted." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Metrics per Column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, suppose we don't want to specify the tracked metrics per data type, but rather for each specific column.\n", "\n", "For example, we might want to track:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Count metrics for `col1`\n", "- Distribution metrics for `col2`\n", "- Cardinality for `col3`\n", "- Distribution metrics + Cardinality for `col4`\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The process is similar to the previous case. We only need to change the `if` clauses to check `name` instead of `why_type`, like this:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from whylogs.core.resolvers import Resolver\n", "from whylogs.core.datatypes import DataType\n", "from typing import Dict, List\n", "from whylogs.core.metrics import StandardMetric\n", "from whylogs.core.metrics.metrics import Metric\n", "\n", "class MyCustomResolver(Resolver):\n", "    \"\"\"Resolver that selects metrics by column name: counts for col1, distribution for col2, cardinality for col3, and distribution plus cardinality for col4.\"\"\"\n", "\n", "    def resolve(self, name: str, why_type: DataType, column_schema) -> Dict[str, Metric]:\n", "        # No namespace is shared by all columns, so start from an empty list\n", "        metrics: List[StandardMetric] = []\n", "        if name == 'col1':\n", "            metrics.append(StandardMetric.counts)\n", "        if name == 'col2':\n", "            metrics.append(StandardMetric.distribution)\n", "        if name == 'col3':\n", "            metrics.append(StandardMetric.cardinality)\n", "        if name == 'col4':\n", "            metrics.append(StandardMetric.distribution)\n", "            metrics.append(StandardMetric.cardinality)\n", "\n", "        # Create a zero-initialized metric for each selected namespace\n", "        result: Dict[str, Metric] = {}\n", "        for m in metrics:\n", "            result[m.name] = m.zero(column_schema)\n", "        return result\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since there are no metrics common to all columns, we can 
initialize `metrics` as an empty list, and then append the relevant metrics for each column.\n", "\n", "Now, we create a custom schema, just like before:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "class MyCustomSchema(DatasetSchema):\n", "    resolvers = MyCustomResolver()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " counts/n counts/null type distribution/mean \\\n", "column \n", "col1 3.0 0.0 SummaryType.COLUMN NaN \n", "col5 NaN NaN SummaryType.COLUMN NaN \n", "col4 NaN NaN SummaryType.COLUMN 4.0 \n", "col3 NaN NaN SummaryType.COLUMN NaN \n", "col2 NaN NaN SummaryType.COLUMN 4.0 \n", "\n", " distribution/stddev distribution/n distribution/max \\\n", "column \n", "col1 NaN NaN NaN \n", "col5 NaN NaN NaN \n", "col4 1.0 3.0 5.0 \n", "col3 NaN NaN NaN \n", "col2 1.0 3.0 5.0 \n", "\n", " distribution/min distribution/q_10 distribution/q_25 \\\n", "column \n", "col1 NaN NaN NaN \n", "col5 NaN NaN NaN \n", "col4 3.0 3.0 3.0 \n", "col3 NaN NaN NaN \n", "col2 3.0 3.0 3.0 \n", "\n", " distribution/median distribution/q_75 distribution/q_90 \\\n", "column \n", "col1 NaN NaN NaN \n", "col5 NaN NaN NaN \n", "col4 4.0 5.0 5.0 \n", "col3 NaN NaN NaN \n", "col2 4.0 5.0 5.0 \n", "\n", " cardinality/est cardinality/upper_1 cardinality/lower_1 \n", "column \n", "col1 NaN NaN NaN \n", "col5 NaN NaN NaN \n", "col4 3.0 3.00015 3.0 \n", "col3 3.0 3.00015 3.0 \n", "col2 NaN NaN NaN " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import whylogs as why\n", "\n", "df['col5'] = 0\n", "result = why.log(df, schema=MyCustomSchema())\n", "prof = result.profile()\n", "prof_view = prof.view()\n", "pd.set_option(\"display.max_columns\", None)\n", "prof_view.to_pandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that existing columns that are not specified in your custom resolver won't have any metrics tracked. In the example above, we added a `col5` column, but since we didn't link any metrics to it, all of the metrics are `NaN`s." 
] } ], "metadata": { "interpreter": { "hash": "f76ec28949fecf16b926a3fc5a03c1aa6468ee82fa5da4ce6fd607df021af5b5" }, "kernelspec": { "display_name": "Python 3.8.13 ('v1.x')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }