{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "xUsdUYUbNrpl"
},
"source": [
">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*\n",
">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=String_Tracking)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=String_Tracking) to leverage the power of whylogs and WhyLabs together!*"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Y1-M8tfxNrpn"
},
"source": [
"# String Tracking - Unicode Range and String Length"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "a6cXXyVONrpo"
},
"source": [
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/advanced/String_Tracking.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WlasZRJiNrpo"
},
"source": [
"By default, whylogs tracks the following metrics for columns of type `str`:\n",
"- Counts\n",
"- Types\n",
"- Frequent Items/Frequent Strings\n",
"- Cardinality"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZmIMG8GKNrpp"
},
"source": [
"In this example, we'll see how to track additional metrics for string columns. For each string record, we count the number of characters that fall within a given Unicode range, and then generate distribution metrics, such as `mean`, `stddev`, and quantile values, from those counts. We'll also apply the same approach to the overall string length."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LYbbSwfMNrpp"
},
"source": [
"We're interested in tracking two specific ranges of characters:\n",
"- ASCII digits `0`-`9` (Unicode code points 48-57)\n",
"- Latin alphabet, lowercase `a`-`z` (Unicode code points 97-122)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qMVX_EKXNrpq"
},
"source": [
"For more information on Unicode characters and their code points, see this [Wikipedia article](https://en.wikipedia.org/wiki/List_of_Unicode_characters)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PRGvUCCUNrpq"
},
"source": [
"## Installing whylogs\n",
"\n",
"If you haven't already, install whylogs:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "FG-k1QJfNrpr"
},
"outputs": [],
"source": [
"# Note: you may need to restart the kernel to use updated packages.\n",
"%pip install whylogs"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TdfAioiRNrps"
},
"source": [
"## Creating the Data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "x7VC5LX8Nrpt"
},
"source": [
"Let's create a simple dataframe to demonstrate. To better visualize how the metrics work, we'll create three columns:\n",
"- `onlyDigits`: Column of strings that contain only digit characters\n",
"- `onlyAlpha`: Column of strings that contain only Latin letters (no digits)\n",
"- `mixed`: Column of strings that contain digits, letters, and other types of characters, like punctuation and symbols"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 228
},
"id": "sCjGDR4ZNrpu",
"outputId": "8f8d8be0-63e6-46d5-b045-ee96ca9d2a6e"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>onlyDigits</th><th>onlyAlpha</th><th>mixed</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>12</td><td>Alice</td><td>my_email_1989@gmail.com</td></tr>\n",
"    <tr><th>1</th><td>83</td><td>Bob</td><td>ADK-1171</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>Chelsea</td><td>Copacabana 272 - Rio de Janeiro</td></tr>\n",
"    <tr><th>3</th><td>992</td><td>Danny</td><td>21º C Friday - Sao Paulo, Brasil</td></tr>\n",
"    <tr><th>4</th><td>7</td><td>Eddie</td><td>18127819ASW</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>\n",
"<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr>\n",
"      <th></th>\n",
"      <th>type</th>\n",
"      <th>unicode_range/UNKNOWN:cardinality/est</th>\n",
"      <th>unicode_range/UNKNOWN:cardinality/lower_1</th>\n",
"      <th>unicode_range/UNKNOWN:cardinality/upper_1</th>\n",
"      <th>unicode_range/UNKNOWN:counts/n</th>\n",
"      <th>unicode_range/UNKNOWN:counts/null</th>\n",
"      <th>unicode_range/UNKNOWN:distribution/max</th>\n",
"      <th>unicode_range/UNKNOWN:distribution/mean</th>\n",
"      <th>unicode_range/UNKNOWN:distribution/median</th>\n",
"      <th>unicode_range/UNKNOWN:distribution/min</th>\n",
"      <th>...</th>\n",
"      <th>unicode_range/string_length:distribution/q_95</th>\n",
"      <th>unicode_range/string_length:distribution/q_99</th>\n",
"      <th>unicode_range/string_length:distribution/stddev</th>\n",
"      <th>unicode_range/string_length:ints/max</th>\n",
"      <th>unicode_range/string_length:ints/min</th>\n",
"      <th>unicode_range/string_length:types/boolean</th>\n",
"      <th>unicode_range/string_length:types/fractional</th>\n",
"      <th>unicode_range/string_length:types/integral</th>\n",
"      <th>unicode_range/string_length:types/object</th>\n",
"      <th>unicode_range/string_length:types/string</th>\n",
"    </tr>\n",
"    <tr><th>column</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>mixed</th><td>SummaryType.COLUMN</td><td>5.0</td><td>5.0</td><td>5.00025</td><td>0</td><td>0</td><td>9.0</td><td>4.0</td><td>4.0</td><td>0.0</td><td>...</td><td>32.0</td><td>32.0</td><td>9.939819</td><td>-9223372036854775807</td><td>9223372036854775807</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>onlyAlpha</th><td>SummaryType.COLUMN</td><td>1.0</td><td>1.0</td><td>1.00005</td><td>0</td><td>0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>7.0</td><td>7.0</td><td>1.264911</td><td>-9223372036854775807</td><td>9223372036854775807</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>onlyDigits</th><td>SummaryType.COLUMN</td><td>1.0</td><td>1.0</td><td>1.00005</td><td>0</td><td>0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>3.0</td><td>3.0</td><td>0.748331</td><td>-9223372036854775807</td><td>9223372036854775807</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>3 rows × 105 columns</p>\n",
"