{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "xUsdUYUbNrpl" }, "source": [ ">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*
\n", ">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=String_Tracking)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=String_Tracking) to leverage the power of whylogs and WhyLabs together!*" ] }, { "cell_type": "markdown", "metadata": { "id": "Y1-M8tfxNrpn" }, "source": [ "# String Tracking - Unicode Range and String Length" ] }, { "cell_type": "markdown", "metadata": { "id": "a6cXXyVONrpo" }, "source": [ "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/advanced/String_Tracking.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "id": "WlasZRJiNrpo" }, "source": [ "By default, columns of type `str` get the following metrics when logged with whylogs:\n", "- Counts\n", "- Types\n", "- Frequent Items/Frequent Strings\n", "- Cardinality" ] }, { "cell_type": "markdown", "metadata": { "id": "ZmIMG8GKNrpp" }, "source": [ "In this example, we'll see how you can track further metrics for string columns. For each string record, we count the number of characters that fall in a given unicode range, and then generate distribution metrics, such as `mean`, `stddev` and quantile values, based on these counts. In addition to specific unicode ranges, we'll apply the same approach to the overall string length." 
] }, { "cell_type": "markdown", "metadata": { "id": "LYbbSwfMNrpp" }, "source": [ "We're interested in tracking two specific ranges of characters:\n", "- ASCII Digits (unicode range 48-57)\n", "- Lowercase Latin letters (unicode range 97-122)" ] }, { "cell_type": "markdown", "metadata": { "id": "qMVX_EKXNrpq" }, "source": [ "For more info on the unicode list of characters, check this [Wikipedia article](https://en.wikipedia.org/wiki/List_of_Unicode_characters)." ] }, { "cell_type": "markdown", "metadata": { "id": "PRGvUCCUNrpq" }, "source": [ "## Installing whylogs\n", "\n", "If you haven't already, install whylogs: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "FG-k1QJfNrpr" }, "outputs": [], "source": [ "# Note: you may need to restart the kernel to use updated packages.\n", "%pip install whylogs" ] }, { "cell_type": "markdown", "metadata": { "id": "TdfAioiRNrps" }, "source": [ "## Creating the Data" ] }, { "cell_type": "markdown", "metadata": { "id": "x7VC5LX8Nrpt" }, "source": [ "Let's create a simple dataframe to demonstrate. To better visualize how the metrics work, we'll create 3 columns:\n", "- `onlyDigits`: Column of strings that contain only digit characters\n", "- `onlyAlpha`: Column of strings that contain only Latin letters (no digits)\n", "- `mixed`: Column of strings that contain digits, letters, and other types of characters, like punctuation and symbols" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 228 }, "id": "sCjGDR4ZNrpu", "outputId": "8f8d8be0-63e6-46d5-b045-ee96ca9d2a6e" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
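<table>
<thead><tr><th></th><th>onlyDigits</th><th>onlyAlpha</th><th>mixed</th></tr></thead>
<tbody>
<tr><th>0</th><td>12</td><td>Alice</td><td>my_email_1989@gmail.com</td></tr>
<tr><th>1</th><td>83</td><td>Bob</td><td>ADK-1171</td></tr>
<tr><th>2</th><td>1</td><td>Chelsea</td><td>Copacabana 272 - Rio de Janeiro</td></tr>
<tr><th>3</th><td>992</td><td>Danny</td><td>21º C Friday - Sao Paulo, Brasil</td></tr>
<tr><th>4</th><td>7</td><td>Eddie</td><td>18127819ASW</td></tr>
</tbody>
</table>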
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " onlyDigits onlyAlpha mixed\n", "0 12 Alice my_email_1989@gmail.com\n", "1 83 Bob ADK-1171\n", "2 1 Chelsea Copacabana 272 - Rio de Janeiro\n", "3 992 Danny 21º C Friday - Sao Paulo, Brasil\n", "4 7 Eddie 18127819ASW" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import whylogs as why\n", "import pandas as pd\n", "data = {\n", " \"onlyDigits\": [\"12\", \"83\", \"1\", \"992\", \"7\"],\n", " \"onlyAlpha\": [\"Alice\", \"Bob\", \"Chelsea\", \"Danny\", \"Eddie\"],\n", " \"mixed\": [\"my_email_1989@gmail.com\",\"ADK-1171\",\"Copacabana 272 - Rio de Janeiro\",\"21º C Friday - Sao Paulo, Brasil\",\"18127819ASW\"]\n", "}\n", "df = pd.DataFrame(data)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "osBGcgKgNrpv" }, "source": [ "## Configuring the Metrics in the DatasetSchema" ] }, { "cell_type": "markdown", "metadata": { "id": "aF64GVNVNrpv" }, "source": [ "whylogs uses `Resolvers` in order to define the set of metrics tracked for a column name or data type.\n", "In this case, we'll create a custom Resolver to apply the UnicodeRangeMetric to all of the columns.\n", "\n", "> If you're interested in seeing how you can add or remove different metrics according to the column type or column name, please refer to this example on [Schema Configuration](https://whylogs.readthedocs.io/en/stable/examples/basic/Schema_Configuration.html)\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "6WtP8yHcNrpw" }, "outputs": [], "source": [ "from whylogs.core.schema import ColumnSchema, DatasetSchema\n", "from whylogs.core.metrics.unicode_range import UnicodeRangeMetric\n", "from whylogs.core.resolvers import Resolver\n", "from whylogs.core.datatypes import DataType\n", "from typing import Dict\n", "from whylogs.core.metrics import Metric, MetricConfig\n", "\n", "class UnicodeResolver(Resolver):\n", " def resolve(self, name: str, why_type: DataType, column_schema: 
ColumnSchema) -> Dict[str, Metric]:\n", " return {UnicodeRangeMetric.get_namespace(): UnicodeRangeMetric.zero(column_schema.cfg)}" ] }, { "cell_type": "markdown", "metadata": { "id": "dsAipWyKNrpw" }, "source": [ "Resolvers are passed to whylogs through a `DatasetSchema`, so we'll have to create a custom Schema as well.\n", "\n", "We'll just have to:\n", "- Pass the UnicodeResolver created previously\n", "- Pass a Metric Configuration with the desired ranges, since we want to change the default character ranges" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "bbKYEpZxNrpx" }, "outputs": [], "source": [ "config = MetricConfig(unicode_ranges={\"digits\": (48, 57), \"alpha\": (97, 122)})\n", "schema = DatasetSchema(resolvers=UnicodeResolver(), default_configs=config)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "qjQsRk6RNrpx" }, "source": [ "If no `MetricConfig` is passed, whylogs falls back to its default unicode ranges, which include emoticons, control characters and extended Latin. 
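
The bounds given in `unicode_ranges` are Unicode code points, and the values used here suggest that both ends are inclusive (48 to 57 covers `0` through `9`). As a small sketch that relies only on the built-in `ord`, the same dictionary can be derived from the boundary characters instead of hard-coded numbers. The variable names below are illustrative and are not part of whylogs.

```python
# Derive code-point bounds from the boundary characters themselves.
# Assumption: both ends of each range are inclusive.
digits = (ord('0'), ord('9'))
alpha = (ord('a'), ord('z'))

# Same shape as the dictionary passed to MetricConfig(unicode_ranges=...).
unicode_ranges = {'digits': digits, 'alpha': alpha}
print(unicode_ranges)
```

Writing the ranges this way makes it easier to check that a range covers the characters you intend before passing it to the metric configuration.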
" ] }, { "cell_type": "markdown", "metadata": { "id": "yPkSOYuqNrpx" }, "source": [ "We can now log the dataframe and pass our schema when calling `log`:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "_BTtYbFRNrpy" }, "outputs": [], "source": [ "import whylogs as why\n", "\n", "prof_results = why.log(df, schema=DatasetSchema(resolvers=UnicodeResolver(), default_configs=MetricConfig(unicode_ranges={\"digits\": (48, 57), \"alpha\": (97, 122)})))\n", "prof = prof_results.profile()" ] }, { "cell_type": "markdown", "metadata": { "id": "LFQt5RZzNrpy" }, "source": [ "## Unicode Range and String Length Metrics" ] }, { "cell_type": "markdown", "metadata": { "id": "WZvZCr68Nrpz" }, "source": [ "Let's take a look at the __Profile View__:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 270 }, "id": "qiLLsRQiNrpz", "outputId": "f3bd42ae-889f-4859-dbbe-f11b00087b68" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
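<table>
<thead><tr><th>column</th><th>type</th><th>unicode_range/UNKNOWN:cardinality/est</th><th>unicode_range/UNKNOWN:cardinality/lower_1</th><th>unicode_range/UNKNOWN:cardinality/upper_1</th><th>unicode_range/UNKNOWN:counts/n</th><th>unicode_range/UNKNOWN:counts/null</th><th>unicode_range/UNKNOWN:distribution/max</th><th>unicode_range/UNKNOWN:distribution/mean</th><th>unicode_range/UNKNOWN:distribution/median</th><th>unicode_range/UNKNOWN:distribution/min</th><th>...</th><th>unicode_range/string_length:distribution/q_95</th><th>unicode_range/string_length:distribution/q_99</th><th>unicode_range/string_length:distribution/stddev</th><th>unicode_range/string_length:ints/max</th><th>unicode_range/string_length:ints/min</th><th>unicode_range/string_length:types/boolean</th><th>unicode_range/string_length:types/fractional</th><th>unicode_range/string_length:types/integral</th><th>unicode_range/string_length:types/object</th><th>unicode_range/string_length:types/string</th></tr></thead>
<tbody>
<tr><th>mixed</th><td>SummaryType.COLUMN</td><td>5.0</td><td>5.0</td><td>5.00025</td><td>0</td><td>0</td><td>9.0</td><td>4.0</td><td>4.0</td><td>0.0</td><td>...</td><td>32.0</td><td>32.0</td><td>9.939819</td><td>-9223372036854775807</td><td>9223372036854775807</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>
<tr><th>onlyAlpha</th><td>SummaryType.COLUMN</td><td>1.0</td><td>1.0</td><td>1.00005</td><td>0</td><td>0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>7.0</td><td>7.0</td><td>1.264911</td><td>-9223372036854775807</td><td>9223372036854775807</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>
<tr><th>onlyDigits</th><td>SummaryType.COLUMN</td><td>1.0</td><td>1.0</td><td>1.00005</td><td>0</td><td>0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>3.0</td><td>3.0</td><td>0.748331</td><td>-9223372036854775807</td><td>9223372036854775807</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>
</tbody>
</table>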
\n", "

3 rows × 105 columns

\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " type unicode_range/UNKNOWN:cardinality/est \\\n", "column \n", "mixed SummaryType.COLUMN 5.0 \n", "onlyAlpha SummaryType.COLUMN 1.0 \n", "onlyDigits SummaryType.COLUMN 1.0 \n", "\n", " unicode_range/UNKNOWN:cardinality/lower_1 \\\n", "column \n", "mixed 5.0 \n", "onlyAlpha 1.0 \n", "onlyDigits 1.0 \n", "\n", " unicode_range/UNKNOWN:cardinality/upper_1 \\\n", "column \n", "mixed 5.00025 \n", "onlyAlpha 1.00005 \n", "onlyDigits 1.00005 \n", "\n", " unicode_range/UNKNOWN:counts/n unicode_range/UNKNOWN:counts/null \\\n", "column \n", "mixed 0 0 \n", "onlyAlpha 0 0 \n", "onlyDigits 0 0 \n", "\n", " unicode_range/UNKNOWN:distribution/max \\\n", "column \n", "mixed 9.0 \n", "onlyAlpha 0.0 \n", "onlyDigits 0.0 \n", "\n", " unicode_range/UNKNOWN:distribution/mean \\\n", "column \n", "mixed 4.0 \n", "onlyAlpha 0.0 \n", "onlyDigits 0.0 \n", "\n", " unicode_range/UNKNOWN:distribution/median \\\n", "column \n", "mixed 4.0 \n", "onlyAlpha 0.0 \n", "onlyDigits 0.0 \n", "\n", " unicode_range/UNKNOWN:distribution/min ... \\\n", "column ... \n", "mixed 0.0 ... \n", "onlyAlpha 0.0 ... \n", "onlyDigits 0.0 ... 
\n", "\n", " unicode_range/string_length:distribution/q_95 \\\n", "column \n", "mixed 32.0 \n", "onlyAlpha 7.0 \n", "onlyDigits 3.0 \n", "\n", " unicode_range/string_length:distribution/q_99 \\\n", "column \n", "mixed 32.0 \n", "onlyAlpha 7.0 \n", "onlyDigits 3.0 \n", "\n", " unicode_range/string_length:distribution/stddev \\\n", "column \n", "mixed 9.939819 \n", "onlyAlpha 1.264911 \n", "onlyDigits 0.748331 \n", "\n", " unicode_range/string_length:ints/max \\\n", "column \n", "mixed -9223372036854775807 \n", "onlyAlpha -9223372036854775807 \n", "onlyDigits -9223372036854775807 \n", "\n", " unicode_range/string_length:ints/min \\\n", "column \n", "mixed 9223372036854775807 \n", "onlyAlpha 9223372036854775807 \n", "onlyDigits 9223372036854775807 \n", "\n", " unicode_range/string_length:types/boolean \\\n", "column \n", "mixed 0 \n", "onlyAlpha 0 \n", "onlyDigits 0 \n", "\n", " unicode_range/string_length:types/fractional \\\n", "column \n", "mixed 0 \n", "onlyAlpha 0 \n", "onlyDigits 0 \n", "\n", " unicode_range/string_length:types/integral \\\n", "column \n", "mixed 0 \n", "onlyAlpha 0 \n", "onlyDigits 0 \n", "\n", " unicode_range/string_length:types/object \\\n", "column \n", "mixed 0 \n", "onlyAlpha 0 \n", "onlyDigits 0 \n", "\n", " unicode_range/string_length:types/string \n", "column \n", "mixed 0 \n", "onlyAlpha 0 \n", "onlyDigits 0 \n", "\n", "[3 rows x 105 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "profile_view_df = prof.view().to_pandas()\n", "profile_view_df" ] }, { "cell_type": "markdown", "metadata": { "id": "eEWc0RXvNrpz" }, "source": [ "You can see there's a lot of different metrics for each of the original dataframe's columns. In the `unicode_range` metric, we'll have additional sub metrics. 
In this case, we have metrics for:\n", "\n", "- digits: distribution metrics for characters inside the unicode digit range\n", "- alpha: distribution metrics for characters inside the unicode lowercase letter range\n", "- UNKNOWN: distribution metrics for characters that fall outside all of the predefined ranges (digits and alpha)\n", "- string_length: distribution metrics for overall string length\n", "\n", "For each of these submetrics, we have metric components such as:\n", "\n", "- mean: the calculated mean for the column\n", "- stddev: the calculated standard deviation for the column\n", "- n: the total number of records for the column\n", "- max, min: maximum and minimum values for the column\n", "- q_xx: the xx-th quantile value of the data’s distribution\n", "- median: the median for the column\n", "\n", "For instance, let's check the mean for `alpha`:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "W-KtqiTdNrp0", "outputId": "6ede40ca-ad30-454a-d0ca-c93c354f1986" }, "outputs": [ { "data": { "text/plain": [ "column\n", "mixed 12.8\n", "onlyAlpha 5.0\n", "onlyDigits 0.0\n", "Name: unicode_range/alpha:distribution/mean, dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "profile_view_df['unicode_range/alpha:distribution/mean']" ] }, { "cell_type": "markdown", "metadata": { "id": "5YAQHmVFNrp0" }, "source": [ "The values above show a mean of 0 for `onlyDigits`, which is expected, since this column contains only digits and no letters. We also have a mean of 5 for `onlyAlpha`, which coincides with the mean string length for the same column, since it contains only letters. For `mixed` the mean is 12.8, and we can indeed see that this column has a higher count of letter characters than the other columns." 
] }, { "cell_type": "markdown", "metadata": { "id": "dAZUUCvmNrp1" }, "source": [ "> You might notice that, even though we defined the range for only lowercase letters, uppercase characters are also included when calculating the metrics. That happens because the strings are lowercased during preprocessing, before the characters are counted. " ] }, { "cell_type": "markdown", "metadata": { "id": "1PHzjslDNrp1" }, "source": [ "Let's now check the `UNKNOWN` namespace: " ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "EJfG4UAlNrp1", "outputId": "b7621078-9043-4a66-9af2-61a0524a3b8d" }, "outputs": [ { "data": { "text/plain": [ "column\n", "mixed 4.0\n", "onlyAlpha 0.0\n", "onlyDigits 0.0\n", "Name: unicode_range/UNKNOWN:distribution/mean, dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "profile_view_df['unicode_range/UNKNOWN:distribution/mean']" ] }, { "cell_type": "markdown", "metadata": { "id": "sw6a8_0ANrp2" }, "source": [ "Since we have only digits and letters in `onlyDigits` and `onlyAlpha`, there are no characters outside of the defined ranges, yielding means of 0. In `mixed`, however, this value is non-zero, since there are characters such as `., -, º`, and whitespace, that are not in any of the predefined ranges." 
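, "\n", "\n", "To make these counts concrete, here is a rough pure-Python sketch of the per-string counting described above. It is not the whylogs implementation: it assumes inclusive range bounds and that each string is lowercased before counting, and the names `RANGES` and `count_ranges` are hypothetical helpers introduced only for illustration.

```python
# Ranges copied from the MetricConfig used in this notebook; bounds assumed inclusive.
RANGES = {'digits': (48, 57), 'alpha': (97, 122)}

def count_ranges(text, ranges=RANGES):
    '''Illustrative helper (not part of whylogs): per-range character counts for one string.'''
    counts = {name: 0 for name in ranges}
    counts['UNKNOWN'] = 0
    # Assumption: lowercase first, mirroring the preprocessing noted above.
    for ch in text.lower():
        for name, (lo, hi) in ranges.items():
            if lo <= ord(ch) <= hi:
                counts[name] += 1
                break
        else:
            # No configured range matched this character.
            counts['UNKNOWN'] += 1
    counts['string_length'] = len(text)
    return counts

# Values of the mixed column from the dataframe created earlier.
mixed = [
    'my_email_1989@gmail.com',
    'ADK-1171',
    'Copacabana 272 - Rio de Janeiro',
    '21º C Friday - Sao Paulo, Brasil',
    '18127819ASW',
]
per_string = [count_ranges(s) for s in mixed]
alpha_mean = sum(c['alpha'] for c in per_string) / len(per_string)
unknown_mean = sum(c['UNKNOWN'] for c in per_string) / len(per_string)
print(alpha_mean, unknown_mean)
```

On the `mixed` column this sketch gives an `alpha` mean of 12.8 and an `UNKNOWN` mean of 4.0, in line with the profile values shown above."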
] }, { "cell_type": "markdown", "metadata": { "id": "RLCYSr3ZNrp2" }, "source": [ "The last namespace, `string_length`, contains metrics for the string's length:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "AAB90l8INrp2", "outputId": "cdd63600-08f4-4488-ea90-97c1d81986ae" }, "outputs": [ { "data": { "text/plain": [ "column\n", "mixed 8.0\n", "onlyAlpha 3.0\n", "onlyDigits 1.0\n", "Name: unicode_range/string_length:distribution/min, dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "profile_view_df['unicode_range/string_length:distribution/min']" ] }, { "cell_type": "markdown", "metadata": { "id": "SxUFAOmRNrp3" }, "source": [ "The `string_length` namespace doesn't take into account any particular range. It contains aggregate metrics for the overall string length of each column. In this case, we're seeing the minimum value for the 3 columns: 1 for `onlyDigits`, 3 for `onlyAlpha` and 8 for `mixed`. Since the dataframe used here is very small, we can easily check the original data and verify that these metrics are indeed correct." ] }, { "cell_type": "markdown", "metadata": { "id": "lZqP9eNmNrp3" }, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "metadata": { "id": "CoS2IUhsNrp3" }, "source": [ "Feel free to define your own ranges of interest and combine the UnicodeRange metrics with other standard metrics as you see fit!\n", "\n", "The resulting profiles can be:\n", "- merged together\n", "- stored locally or in the cloud (AWS S3)\n", "\n", "or used for other purposes, such as:\n", "- Setting constraints for data quality validation\n", "- Visualizing and comparing profiles\n", "- Sending them to monitoring and observability platforms\n", "\n", "Be sure to check the other examples at [whylogs' Documentation](https://whylogs.readthedocs.io/en/stable/examples.html)!" 
] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3.8.13 ('v1.x')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "f76ec28949fecf16b926a3fc5a03c1aa6468ee82fa5da4ce6fd607df021af5b5" } } }, "nbformat": 4, "nbformat_minor": 0 }