{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*\n",
"\n",
">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Segments)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Segments) to leverage the power of whylogs and WhyLabs together!*"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Intro to Segmentation"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/advanced/Segments.ipynb)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes, certain subgroups of data can behave very differently from the overall dataset. When monitoring the health of a dataset, it’s often helpful to have visibility at the subgroup level to better understand how these subgroups contribute to trends in the overall dataset. whylogs supports data segmentation for this purpose.\n",
"\n",
"Data segmentation is done at the point of profiling a dataset.\n",
"\n",
"Segmentation can be done by a single feature or by multiple features simultaneously. For example, you could have a separate profile for each value of a Gender feature (\"M\" or \"F\"), and also for each combination of, say, Gender and City Code. You can further filter the segments to specific partitions you are interested in, such as Gender \"M\" with age above 18.\n",
"\n",
"In this example, we will show you several ways to segment your data, and also how to write the resulting profiles to different locations."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"\n",
"1. Segmenting on a single column\n",
"2. Segmenting on multiple columns\n",
"3. Filtering Segments\n",
"4. Writing Segmented Results to Disk\n",
"5. Sending Segmented Results to WhyLabs"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installing whylogs"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"If you don't have it installed already, install whylogs:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Note: you may need to restart the kernel to use updated packages.\n",
"%pip install whylogs"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting the Data & Defining the Segments"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's first download the data we'll be working with.\n",
"\n",
"This dataset contains transaction information for an online grocery store, such as:\n",
"\n",
"- product description\n",
"- category\n",
"- user rating\n",
"- market price\n",
"- number of items sold last week"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"| | date | product | category | rating | market_price | sales_last_week |\n",
"|---|---|---|---|---|---|---|\n",
"| 0 | 2022-08-09 00:00:00+00:00 | Wood - Centre Filled Bar Infused With Dark Mou... | Snacks and Branded Foods | 4 | 350.0 | 1 |\n",
"| 1 | 2022-08-09 00:00:00+00:00 | Toasted Almonds | Gourmet and World Food | 3 | 399.0 | 1 |\n",
"| 2 | 2022-08-09 00:00:00+00:00 | Instant Thai Noodles - Hot & Spicy Tomyum | Gourmet and World Food | 3 | 95.0 | 1 |\n",
"| 3 | 2022-08-09 00:00:00+00:00 | Thokku - Vathakozhambu | Snacks and Branded Foods | 4 | 336.0 | 1 |\n",
"| 4 | 2022-08-09 00:00:00+00:00 | Beetroot Powder | Gourmet and World Food | 3 | 150.0 | 1 |\n",
"
\n",
"| column | cardinality/est | cardinality/lower_1 | cardinality/upper_1 | counts/n | counts/null | distribution/max | distribution/mean | distribution/median | distribution/min | distribution/n | ... | distribution/stddev | frequent_items/frequent_strings | type | types/boolean | types/fractional | types/integral | types/object | types/string | ints/max | ints/min |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| category | 1.000000 | 1.0 | 1.000050 | 707 | 0 | NaN | 0.000000 | NaN | NaN | 0 | ... | 0.000000 | [FrequentItem(value='Baby Care', est=707, uppe... | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 707 | NaN | NaN |\n",
"| date | 8.000000 | 8.0 | 8.000400 | 707 | 0 | NaN | 0.000000 | NaN | NaN | 0 | ... | 0.000000 | [FrequentItem(value='2022-08-15 00:00:00+00:00... | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 707 | NaN | NaN |\n",
"| market_price | 57.000008 | 57.0 | 57.002854 | 707 | 0 | 2799.0 | 621.190948 | 299.0 | 50.0 | 707 | ... | 713.745256 | NaN | SummaryType.COLUMN | 0 | 707 | 0 | 0 | 0 | NaN | NaN |\n",
"| product | 69.000012 | 69.0 | 69.003457 | 707 | 0 | NaN | 0.000000 | NaN | NaN | 0 | ... | 0.000000 | [FrequentItem(value='Baby Powder', est=21, upp... | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 707 | NaN | NaN |\n",
"| rating | 3.000000 | 3.0 | 3.000150 | 707 | 0 | 5.0 | 3.823197 | 4.0 | 3.0 | 707 | ... | 0.500566 | [FrequentItem(value='4', est=508, upper=508, l... | SummaryType.COLUMN | 0 | 0 | 707 | 0 | 0 | 5.0 | 3.0 |\n",
"| sales_last_week | 5.000000 | 5.0 | 5.000250 | 707 | 0 | 6.0 | 1.391796 | 1.0 | 1.0 | 707 | ... | 1.003162 | [FrequentItem(value='1', est=557, upper=557, l... | SummaryType.COLUMN | 0 | 0 | 707 | 0 | 0 | 6.0 | 1.0 |\n",
"\n",
"6 rows × 28 columns
\n",
"\n",
"| column | cardinality/est | cardinality/lower_1 | cardinality/upper_1 | counts/n | counts/null | distribution/max | distribution/mean | distribution/median | distribution/min | distribution/n | ... | distribution/stddev | frequent_items/frequent_strings | type | types/boolean | types/fractional | types/integral | types/object | types/string | ints/max | ints/min |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| category | 1.000000 | 1.0 | 1.000050 | 162 | 0 | NaN | 0.000000 | NaN | NaN | 0 | ... | 0.000000 | [FrequentItem(value='Baby Care', est=162, uppe... | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 162 | NaN | NaN |\n",
"| date | 8.000000 | 8.0 | 8.000400 | 162 | 0 | NaN | 0.000000 | NaN | NaN | 0 | ... | 0.000000 | [FrequentItem(value='2022-08-15 00:00:00+00:00... | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 162 | NaN | NaN |\n",
"| market_price | 15.000001 | 15.0 | 15.000749 | 162 | 0 | 2799.0 | 649.987654 | 265.0 | 149.0 | 162 | ... | 889.494280 | NaN | SummaryType.COLUMN | 0 | 162 | 0 | 0 | 0 | NaN | NaN |\n",
"| product | 16.000001 | 16.0 | 16.000799 | 162 | 0 | NaN | 0.000000 | NaN | NaN | 0 | ... | 0.000000 | [FrequentItem(value='Baby Sipper With Pop-up S... | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 162 | NaN | NaN |\n",
"| rating | 1.000000 | 1.0 | 1.000050 | 162 | 0 | 3.0 | 3.000000 | 3.0 | 3.0 | 162 | ... | 0.000000 | [FrequentItem(value='3', est=162, upper=162, l... | SummaryType.COLUMN | 0 | 0 | 162 | 0 | 0 | 3.0 | 3.0 |\n",
"| sales_last_week | 3.000000 | 3.0 | 3.000150 | 162 | 0 | 4.0 | 1.271605 | 1.0 | 1.0 | 162 | ... | 0.705125 | [FrequentItem(value='1', est=134, upper=134, l... | SummaryType.COLUMN | 0 | 0 | 162 | 0 | 0 | 4.0 | 1.0 |\n",
"\n",
"6 rows × 28 columns
\n",
"\n",
"| column | cardinality/est | cardinality/lower_1 | cardinality/upper_1 | counts/n | counts/null | distribution/max | distribution/mean | distribution/median | distribution/min | distribution/n | ... | distribution/stddev | frequent_items/frequent_strings | type | types/boolean | types/fractional | types/integral | types/object | types/string | ints/max | ints/min |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| category | 1.000000 | 1.0 | 1.000050 | 389 | 0 | NaN | 0.000000 | NaN | NaN | 0 | ... | 0.000000 | [FrequentItem(value='Baby Care', est=389, uppe... | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 389 | NaN | NaN |\n",
"| date | 8.000000 | 8.0 | 8.000400 | 389 | 0 | NaN | 0.000000 | NaN | NaN | 0 | ... | 0.000000 | [FrequentItem(value='2022-08-12 00:00:00+00:00... | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 389 | NaN | NaN |\n",
"| market_price | 32.000002 | 32.0 | 32.001600 | 389 | 0 | 2638.0 | 809.352185 | 495.0 | 215.0 | 389 | ... | 679.345870 | NaN | SummaryType.COLUMN | 0 | 389 | 0 | 0 | 0 | NaN | NaN |\n",
"| product | 38.000003 | 38.0 | 38.001901 | 389 | 0 | NaN | 0.000000 | NaN | NaN | 0 | ... | 0.000000 | [FrequentItem(value='Baby Powder', est=21, upp... | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 389 | NaN | NaN |\n",
"| rating | 2.000000 | 2.0 | 2.000100 | 389 | 0 | 5.0 | 4.071979 | 4.0 | 4.0 | 389 | ... | 0.258787 | [FrequentItem(value='4', est=361, upper=361, l... | SummaryType.COLUMN | 0 | 0 | 389 | 0 | 0 | 5.0 | 4.0 |\n",
"| sales_last_week | 4.000000 | 4.0 | 4.000200 | 389 | 0 | 6.0 | 1.483290 | 1.0 | 1.0 | 389 | ... | 1.170009 | [FrequentItem(value='1', est=292, upper=292, l... | SummaryType.COLUMN | 0 | 0 | 389 | 0 | 0 | 6.0 | 1.0 |\n",
"\n",
"6 rows × 28 columns
\n", "