{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# whylogs UDFs\n", "\n", "WARNING: UDF support is an experimental feature that is still evolving. For example, there was an incompatible UDF signature change in the whylogs 1.2.5 release. We may drop support for metric UDFs as other types of UDFs become able to handle the metric UDF use cases. Feedback on how UDFs should evolve is welcome.\n", "\n", "Sometimes you want to use whylogs to track values computed from your data along with the original input data. whylogs accepts input as either a Python dictionary representing a single row of data or a Pandas dataframe containing multiple rows. Both of these provide easy interfaces to add the results of user-defined functions (UDFs) to your input data. whylogs also provides its own UDF mechanism for logging computed data. It offers two advantages over the native UDF facilities: you can easily define and apply a suite of UDFs suitable for an application area (e.g., [langkit](https://whylabs.ai/safeguard-large-language-models)), and you can easily customize which metrics whylogs tracks for each UDF output. 
Let's explore the whylogs UDF APIs.\n", "\n", "## Install whylogs" ], "metadata": { "id": "3zth_nQy00Dq" } }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "YxjFQM3c0q6Z", "outputId": "dce84c9f-6dc2-4dc1-e480-f848cdc6f707" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Installing collected packages: whylogs-sketching, types-urllib3, types-requests, whylabs-client, whylogs\n", "Successfully installed types-requests-2.31.0.2 types-urllib3-1.26.25.14 whylabs-client-0.5.3 whylogs-1.2.7 whylogs-sketching-3.4.1.dev3\n" ] } ], "source": [ "%pip install whylogs" ] }, { "cell_type": "markdown", "source": [ "## Types of UDF\n", "\n", "whylogs supports four kinds of UDFs:\n", "\n", "* Dataset UDFs take one or more named columns as input and produce a new column as output.\n", "* Multioutput UDFs take one or more named columns as input and produce one or more new columns as output.\n", "* Type UDFs are applied to all columns of a specified type and produce a new column as output for each such column.\n", "* Metric UDFs can be applied to a column specified by name or type, and do not produce a column. Instead, their output is tracked by a whylogs `UdfMetric` instance attached to the input column in the dataset profile.\n", "\n", "Dataset, multioutput, and type UDFs produce their output columns before whylogs profiles the dataset. Thus the full machinery of whylogs schema specification and segmentation applies to the output columns. 
The `UdfMetric` has its own submetric schema mechanism to control the statistics tracked for metric UDF output, but since metric UDFs do not create columns, they cannot be used for segmentation.\n", "\n", "### Dataset UDFs\n", "\n", "The signature for dataset UDFs is\n", "\n", "`f(x: Union[Dict[str, List], pd.DataFrame]) -> Union[List, pd.Series]`\n", "\n", "The dataframe or dictionary contains only the columns the UDF is registered to access (see the section on registration below). `DataFrame` inputs may contain multiple rows. Dictionary inputs contain only a single row, but each column's value is presented as a list containing one value. This allows UDFs to be written using the intersection of the `DataFrame` and dictionary/list APIs to handle both cases. Performance-critical UDFs can check the type of input to provide implementations optimized for the specific input type. The returned list or series should contain one value for each input row.\n", "\n", "### Multioutput UDFs\n", "\n", "The signature for multioutput UDFs is\n", "\n", "`f(x: Union[Dict[str, List], pd.DataFrame]) -> Union[Dict[str, List], pd.DataFrame]`\n", "\n", "These are very similar to dataset UDFs. Where dataset UDFs use the UDF's name as the name of their single output column, multioutput UDFs prepend the UDF's name to the names of the columns returned by the UDF.\n", "\n", "### Type UDFs\n", "\n", "The signature for type UDFs is\n", "\n", "`f(x: Union[List, pd.Series]) -> Union[List, pd.Series]`\n", "\n", "Since type UDFs take a single column as input, the input is presented as a single-element list representing a single row of data, or as a Pandas series representing a column. Note that the column created by a type UDF will have the input column's name prepended to it to avoid name collisions.\n", "\n", "\n", "### Metric UDFs\n", "\n", "The signature for metric UDFs is\n", "\n", "`f(x: Any) -> Any`\n", "\n", "Metric UDFs receive a single value as input and produce a single value as output. 
The UDF will be invoked for each element of the column the `UdfMetric` is attached to." ], "metadata": { "id": "hcaTK-2c-sqx" } }, { "cell_type": "markdown", "source": [ "## UDF Registration\n", "\n", "The easiest way to get whylogs to invoke your UDFs is to register the UDF functions with the appropriate decorator. There's a decorator for each type of UDF. Note that using the decorators requires that you use the schema produced by `whylogs.experimental.core.udf_schema.udf_schema()`.\n", "\n", "### Dataset UDFs\n", "\n", "The `@register_dataset_udf` decorator declares dataset UDFs.\n", "```\n", "from typing import Dict, List, Union\n", "\n", "from whylogs.experimental.core.udf_schema import register_dataset_udf\n", "import pandas as pd\n", "\n", "@register_dataset_udf([\"mass\", \"volume\"])\n", "def density(data: Union[Dict[str, List], pd.DataFrame]) -> Union[List, pd.Series]:\n", "    if isinstance(data, pd.DataFrame):\n", "        return data[\"mass\"] / data[\"volume\"]\n", "    else:\n", "        return [mass / volume for mass, volume in zip(data[\"mass\"], data[\"volume\"])]\n", "```\n", "\n", "If you log a `DataFrame` (or single row via a dictionary) containing columns named `mass` and `volume`, a column named `density` will be added by applying the `density()` function before whylogs produces its profile. If either of the input columns is missing or the output column is already present, the UDF will not be invoked. Note that the code in the `else` branch works fine for `DataFrame` inputs as well, so the `isinstance` check is just an optimization.\n", "\n", "The `@register_dataset_udf` decorator has several optional arguments to customize whylogs' behavior.\n", "```\n", "def register_dataset_udf(\n", "    col_names: List[str],\n", "    udf_name: Optional[str] = None,\n", "    metrics: Optional[List[MetricSpec]] = None,\n", "    namespace: Optional[str] = None,\n", "    schema_name: str = \"\",\n", "    anti_metrics: Optional[List[Metric]] = None,\n", ")\n", "```\n", "The `col_names` argument lists the UDF's required input columns. 
The remaining arguments are optional:\n", "* `udf_name` specifies the name of the UDF's output column. It defaults to the name of the function.\n", "* `metrics` takes a list of `MetricSpec` instances (see [Schema Configuration](https://github.com/whylabs/whylogs/blob/mainline/python/examples/basic/Schema_Configuration.ipynb)) specifying the whylogs metrics to track for the column produced by the UDF. If this is omitted, the metrics are determined by the default schema or any metric specifications passed to `udf_schema()`.\n", "* `anti_metrics` is an optional list of whylogs `Metric` classes to prohibit from being attached to the UDF's output column.\n", "* `namespace`, if present, is prepended to the UDF name to help manage UDF name collisions.\n", "* `schema_name` helps manage collections of UDFs. A UDF can be registered in a specified schema. If omitted, it will be registered to the default schema. `udf_schema()` merges the UDFs registered in the requested schemas.\n", "\n", "### Multioutput UDFs\n", "\n", "The `@register_multioutput_udf` decorator declares multioutput UDFs.\n", "```\n", "from typing import Dict, List, Union\n", "\n", "from whylogs.experimental.core.udf_schema import register_multioutput_udf\n", "import pandas as pd\n", "\n", "@register_multioutput_udf([\"x\"])\n", "def powers(data: Union[Dict[str, List], pd.DataFrame]) -> Union[Dict[str, List], pd.DataFrame]:\n", "    if isinstance(data, pd.DataFrame):\n", "        result = pd.DataFrame()\n", "        result[\"xx\"] = data[\"x\"] * data[\"x\"]\n", "        result[\"xxx\"] = data[\"x\"] * data[\"x\"] * data[\"x\"]\n", "        return result\n", "    else:\n", "        result = {\"xx\": [data[\"x\"][0] * data[\"x\"][0]]}\n", "        result[\"xxx\"] = [data[\"x\"][0] * data[\"x\"][0] * data[\"x\"][0]]\n", "        return result\n", "```\n", "\n", "If you log a `DataFrame` (or single row via a dictionary) containing a column named `x`, columns named `powers.xx` and `powers.xxx`, containing the squared and cubed input column, will be added by applying the `powers()` function before whylogs produces its 
profile. If any of the input columns is missing, the UDF will not be invoked. While dataset UDFs do not execute if their output column already exists, multioutput UDFs always produce their output columns.\n", "\n", "### Type UDFs\n", "\n", "The `@register_type_udf` decorator declares type UDFs to be applied to columns of a specified type. Types can be specified as a subclass of `whylogs.core.datatypes.DataType` or a plain Python type.\n", "```\n", "from typing import List, Union\n", "\n", "from whylogs.experimental.core.udf_schema import register_type_udf\n", "from whylogs.core.datatypes import Fractional\n", "import pandas as pd\n", "\n", "@register_type_udf(Fractional)\n", "def square(input: Union[List, pd.Series]) -> Union[List, pd.Series]:\n", "    return [x * x for x in input]\n", "```\n", "The `square()` function will be applied to every floating-point column in a logged `DataFrame` or row. Each output column is named `square` prefixed with the input column's name (for example, `score.square`). In this example, we use code that works for either `DataFrame` or single row (dictionary) input.\n", "\n", "The `@register_type_udf` decorator also has optional parameters to customize its behavior:\n", "```\n", "def register_type_udf(\n", "    col_type: Type,\n", "    udf_name: Optional[str] = None,\n", "    namespace: Optional[str] = None,\n", "    schema_name: str = \"\",\n", "    type_mapper: Optional[TypeMapper] = None,\n", ")\n", "```\n", "* `col_type` is the column type the UDF should be applied to. It can be a subclass of `whylogs.core.datatypes.DataType` or a Python type. Note that the argument must be a class (a subclass of `DataType` or a Python type), not an instance.\n", "* `udf_name` specifies the suffix of the name of the UDF's output column. It defaults to the name of the function. The input column's name is the prefix.\n", "* `namespace`, if present, is prepended to the UDF name to help manage UDF name collisions.\n", "* `schema_name` helps manage collections of UDFs. A UDF can be registered in a specified schema. 
If omitted, it will be registered to the default schema. `udf_schema()` merges the UDFs registered in the requested schemas.\n", "* `type_mapper` is an instance of `whylogs.core.datatypes.TypeMapper` responsible for mapping native Python data types to a subclass of `whylogs.core.datatypes.DataType`.\n", "\n", "\n", "### Metric UDFs\n", "\n", "\n", "The `@register_metric_udf` decorator declares metric UDFs to be applied to columns specified by name or type. Types can be specified as a subclass of `whylogs.core.datatypes.DataType` or a plain Python type.\n", "```\n", "from typing import Any\n", "\n", "from whylogs.experimental.core.metrics.udf_metric import register_metric_udf\n", "from whylogs.core.datatypes import String\n", "\n", "@register_metric_udf(col_type=String)\n", "def upper(input: Any) -> Any:\n", "    return input.upper()\n", "```\n", "This will create a `UdfMetric` instance for all string columns. Note that there can only be one instance of a metric class for a column, so avoid specifying `UdfMetric` on string columns elsewhere in your schema definition.\n", "\n", "The `UdfMetric` will have a submetric named `upper` that tracks metrics according to the default submetric schema for the `upper` UDF's return type, in this case also string.\n", "\n", "The `@register_metric_udf` decorator also has optional parameters to customize its behavior:\n", "```\n", "def register_metric_udf(\n", "    col_name: Optional[str] = None,\n", "    col_type: Optional[DataType] = None,\n", "    submetric_name: Optional[str] = None,\n", "    submetric_schema: Optional[SubmetricSchema] = None,\n", "    type_mapper: Optional[TypeMapper] = None,\n", "    namespace: Optional[str] = None,\n", "    schema_name: str = \"\",\n", ")\n", "```\n", "You must specify exactly one of either `col_name` or `col_type`.\n", "`col_type` can be a subclass of `whylogs.core.datatypes.DataType` or a Python type. 
Note that the argument must be a class (a subclass of `DataType` or a Python type), not an instance.\n", "* `submetric_name` is the name of the submetric within the `UdfMetric`. It defaults to the name of the decorated function. Note that all lambdas are named \"lambda\", so omitting `submetric_name` on more than one lambda will result in name collisions. If you pass a namespace, it will be prepended to the UDF name.\n", "* `submetric_schema` allows you to specify and configure the metrics to be tracked for each metric UDF. This defaults to the `STANDARD_UDF_RESOLVER` metrics.\n", "* `type_mapper` is an instance of `whylogs.core.datatypes.TypeMapper` responsible for mapping native Python data types to a subclass of `whylogs.core.datatypes.DataType`.\n", "* `namespace`, if present, is prepended to the UDF name to help manage UDF name collisions.\n", "* `schema_name` helps manage collections of UDFs. A UDF can be registered in a specified schema. If omitted, it will be registered to the default schema. `udf_schema()` merges the UDFs registered in the requested schemas.\n", "\n", "`SubmetricSchema` is very similar to the `DeclarativeSchema` (see [Schema Configuration](https://github.com/whylabs/whylogs/blob/mainline/python/examples/basic/Schema_Configuration.ipynb)), but applies to just the submetrics within an instance of a `UdfMetric`. The default `STANDARD_UDF_RESOLVER` applies the same metrics as the `STANDARD_RESOLVER` for the dataset, except it does not include frequent items for string columns. You can customize the metrics tracked for your UDF outputs by specifying your own `submetric_schema`. 
Note that several `@register_metric_udf` decorators may apply to the same input column; you should make sure only one of the decorators is passed your submetric schema, or that they are all passed the same submetric schema.\n" ], "metadata": { "id": "GfHXYOhHRjPa" } }, { "cell_type": "markdown", "source": [ "## Examples\n", "\n", "### Logging\n", "\n", "Let's look at a full example using the UDFs defined above:" ], "metadata": { "id": "G3vOs4QNEGCI" } }, { "cell_type": "code", "source": [ "import whylogs as why\n", "from whylogs.core.datatypes import Fractional, String\n", "from whylogs.experimental.core.udf_schema import (\n", "    register_dataset_udf,\n", "    register_multioutput_udf,\n", "    register_type_udf,\n", "    udf_schema,\n", ")\n", "from whylogs.experimental.core.metrics.udf_metric import register_metric_udf\n", "\n", "from typing import Any, Dict, List, Union\n", "import pandas as pd\n", "\n", "@register_dataset_udf([\"mass\", \"volume\"])\n", "def density(data: Union[Dict[str, List], pd.DataFrame]) -> Union[List, pd.Series]:\n", "    if isinstance(data, pd.DataFrame):\n", "        return data[\"mass\"] / data[\"volume\"]\n", "    else:\n", "        return [mass / volume for mass, volume in zip(data[\"mass\"], data[\"volume\"])]\n", "\n", "\n", "@register_multioutput_udf([\"x\"])\n", "def powers(data: Union[Dict[str, List], pd.DataFrame]) -> Union[Dict[str, List], pd.DataFrame]:\n", "    if isinstance(data, pd.DataFrame):\n", "        result = pd.DataFrame()\n", "        result[\"xx\"] = data[\"x\"] * data[\"x\"]\n", "        result[\"xxx\"] = data[\"x\"] * data[\"x\"] * data[\"x\"]\n", "        return result\n", "    else:\n", "        result = {\"xx\": [data[\"x\"][0] * data[\"x\"][0]]}\n", "        result[\"xxx\"] = [data[\"x\"][0] * data[\"x\"][0] * data[\"x\"][0]]\n", "        return result\n", "\n", "\n", "@register_type_udf(Fractional)\n", "def square(input: Union[List, pd.Series]) -> Union[List, pd.Series]:\n", "    return [x * x for x in input]\n", "\n", "\n", "@register_metric_udf(col_type=String)\n", "def upper(input: Any) -> 
Any:\n", " return input.upper()\n", "\n", "\n", "df = pd.DataFrame({\n", " \"mass\": [1, 2, 3],\n", " \"volume\": [4, 5, 6],\n", " \"score\": [1.9, 4.2, 3.1],\n", " \"lower\": [\"a\", \"b\", \"c\"],\n", " \"x\": [1, 2, 3]\n", "})\n", "schema = udf_schema()\n", "result = why.log(df, schema=schema)\n", "result.view().to_pandas()\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 364 }, "id": "LVKWYEwkEXd5", "outputId": "470b8700-b22e-4912-d9e7-037922c1b694" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " cardinality/est cardinality/lower_1 cardinality/upper_1 \\\n", "column \n", "density 3.0 3.0 3.00015 \n", "lower 3.0 3.0 3.00015 \n", "mass 3.0 3.0 3.00015 \n", "score 3.0 3.0 3.00015 \n", "score.square 3.0 3.0 3.00015 \n", "volume 3.0 3.0 3.00015 \n", "\n", " counts/inf counts/n counts/nan counts/null distribution/max \\\n", "column \n", "density 0 3 0 0 0.50 \n", "lower 0 3 0 0 NaN \n", "mass 0 3 0 0 3.00 \n", "score 0 3 0 0 4.20 \n", "score.square 0 3 0 0 17.64 \n", "volume 0 3 0 0 6.00 \n", "\n", " distribution/mean distribution/median ... \\\n", "column ... \n", "density 0.383333 0.40 ... \n", "lower 0.000000 NaN ... \n", "mass 2.000000 2.00 ... \n", "score 3.066667 3.10 ... \n", "score.square 10.286667 9.61 ... \n", "volume 5.000000 5.00 ... \n", "\n", " udf/upper:distribution/stddev \\\n", "column \n", "density NaN \n", "lower 0.0 \n", "mass NaN \n", "score NaN \n", "score.square NaN \n", "volume NaN \n", "\n", " udf/upper:frequent_items/frequent_strings \\\n", "column \n", "density NaN \n", "lower [FrequentItem(value='A', est=1, upper=1, lower... 
\n", "mass NaN \n", "score NaN \n", "score.square NaN \n", "volume NaN \n", "\n", " udf/upper:types/boolean udf/upper:types/fractional \\\n", "column \n", "density NaN NaN \n", "lower 0.0 0.0 \n", "mass NaN NaN \n", "score NaN NaN \n", "score.square NaN NaN \n", "volume NaN NaN \n", "\n", " udf/upper:types/integral udf/upper:types/object \\\n", "column \n", "density NaN NaN \n", "lower 0.0 0.0 \n", "mass NaN NaN \n", "score NaN NaN \n", "score.square NaN NaN \n", "volume NaN NaN \n", "\n", " udf/upper:types/string udf/upper:types/tensor ints/max \\\n", "column \n", "density NaN NaN NaN \n", "lower 3.0 0.0 NaN \n", "mass NaN NaN 3.0 \n", "score NaN NaN NaN \n", "score.square NaN NaN NaN \n", "volume NaN NaN 6.0 \n", "\n", " ints/min \n", "column \n", "density NaN \n", "lower NaN \n", "mass 1.0 \n", "score NaN \n", "score.square NaN \n", "volume 4.0 \n", "\n", "[6 rows x 58 columns]" ], "text/html": [ "\n", "\n", "
\n", "
| column | cardinality/est | cardinality/lower_1 | cardinality/upper_1 | counts/inf | counts/n | counts/nan | counts/null | distribution/max | distribution/mean | distribution/median | ... | udf/upper:distribution/stddev | udf/upper:frequent_items/frequent_strings | udf/upper:types/boolean | udf/upper:types/fractional | udf/upper:types/integral | udf/upper:types/object | udf/upper:types/string | udf/upper:types/tensor | ints/max | ints/min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| density | 3.0 | 3.0 | 3.00015 | 0 | 3 | 0 | 0 | 0.50 | 0.383333 | 0.40 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| lower | 3.0 | 3.0 | 3.00015 | 0 | 3 | 0 | 0 | NaN | 0.000000 | NaN | ... | 0.0 | [FrequentItem(value='A', est=1, upper=1, lower... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | NaN | NaN |
| mass | 3.0 | 3.0 | 3.00015 | 0 | 3 | 0 | 0 | 3.00 | 2.000000 | 2.00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.0 | 1.0 |
| score | 3.0 | 3.0 | 3.00015 | 0 | 3 | 0 | 0 | 4.20 | 3.066667 | 3.10 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| score.square | 3.0 | 3.0 | 3.00015 | 0 | 3 | 0 | 0 | 17.64 | 10.286667 | 9.61 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| volume | 3.0 | 3.0 | 3.00015 | 0 | 3 | 0 | 0 | 6.00 | 5.000000 | 5.00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6.0 | 4.0 |

6 rows × 58 columns
\n", "\n", "
|  | mass | volume | score | lower | density | score.square |
|---|---|---|---|---|---|---|
| 0 | 1 | 4 | 1.9 | a | 0.25 | 3.61 |
| 1 | 2 | 5 | 4.2 | b | 0.40 | 17.64 |
| 2 | 3 | 6 | 3.1 | c | 0.50 | 9.61 |
\n", "