{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
### 🚩">
">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*\n",
">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Merging_Profiles)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Merging_Profiles) to leverage the power of whylogs and WhyLabs together!*"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "IJ2tqS2oh8wp"
},
"source": [
"# Merging Profiles"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/basic/Merging_Profiles.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TTP91R40h8wr"
},
"source": [
"Sometimes we may want to profile a dataset in chunks. For example, we may have our dataset distributed across multiple files or nodes, or perhaps our dataset is too large to fit in memory. Maybe we already profiled our dataset for several different date ranges and we want to see a holistic view of our data across the entire range.\n",
"\n",
"In any case, merging profiles is a solution!\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rZoJE6nYh8wr"
},
"source": [
"## Installing whylogs"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ubeZjbMzh8ws"
},
"source": [
"whylogs is made available as a Python package. You can get the latest version from PyPI with `pip install whylogs`:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "LgAFe39bh8ws"
},
"outputs": [],
"source": [
"# Note: you may need to restart the kernel to use updated packages.\n",
"%pip install whylogs"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4NnapN6Mh8wt"
},
"source": [
"## Loading a Pandas DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zAW_RioVh8wt"
},
"source": [
"Before profiling data, lets create a Pandas DataFrame from a public dataset. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 381
},
"id": "bI4RnpBoh8wt",
"outputId": "db4e9122-434f-4ef2-f6d2-25650333d135"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"row count: 945\n"
]
},
{
"data": {
"text/html": [
"\n",
"
\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>Transaction ID</th><th>Customer ID</th><th>Quantity</th><th>Item Price</th><th>Total Tax</th><th>Total Amount</th><th>Store Type</th><th>Product Category</th><th>Product Subcategory</th><th>Gender</th><th>Transaction Type</th><th>Age</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>682</th><td>T74278458640</td><td>C267835</td><td>2</td><td>63.2</td><td>13.2720</td><td>139.6720</td><td>TeleShop</td><td>Books</td><td>DIY</td><td>M</td><td>Purchase</td><td>33.0</td></tr>\n",
"<tr><th>256</th><td>T54377774372</td><td>C270496</td><td>5</td><td>75.5</td><td>39.6375</td><td>417.1375</td><td>MBR</td><td>Electronics</td><td>Audio and video</td><td>M</td><td>Purchase</td><td>23.0</td></tr>\n",
"<tr><th>67</th><td>T64030190529</td><td>C269524</td><td>5</td><td>52.5</td><td>27.5625</td><td>290.0625</td><td>e-Shop</td><td>Bags</td><td>Mens</td><td>F</td><td>Purchase</td><td>24.0</td></tr>\n",
"<tr><th>762</th><td>T18970114223</td><td>C272730</td><td>2</td><td>80.6</td><td>16.9260</td><td>178.1260</td><td>e-Shop</td><td>Home and kitchen</td><td>Kitchen</td><td>M</td><td>Purchase</td><td>39.0</td></tr>\n",
"<tr><th>94</th><td>T94404065446</td><td>C271648</td><td>5</td><td>48.0</td><td>25.2000</td><td>265.2000</td><td>e-Shop</td><td>Clothing</td><td>Kids</td><td>F</td><td>Purchase</td><td>34.0</td></tr>\n",
"<tr><th>197</th><td>T30540748600</td><td>C269603</td><td>1</td><td>127.4</td><td>13.3770</td><td>140.7770</td><td>TeleShop</td><td>Footwear</td><td>Women</td><td>M</td><td>Purchase</td><td>41.0</td></tr>\n",
"<tr><th>161</th><td>T78998671169</td><td>C270907</td><td>1</td><td>104.9</td><td>11.0145</td><td>115.9145</td><td>MBR</td><td>Books</td><td>DIY</td><td>F</td><td>Purchase</td><td>41.0</td></tr>\n",
"<tr><th>574</th><td>T19424023275</td><td>C270462</td><td>2</td><td>127.5</td><td>26.7750</td><td>281.7750</td><td>TeleShop</td><td>Electronics</td><td>Cameras</td><td>F</td><td>Purchase</td><td>38.0</td></tr>\n",
"<tr><th>583</th><td>T7986658313</td><td>C269047</td><td>3</td><td>25.9</td><td>8.1585</td><td>85.8585</td><td>e-Shop</td><td>Clothing</td><td>Women</td><td>M</td><td>Purchase</td><td>25.0</td></tr>\n",
"<tr><th>805</th><td>T36786634925</td><td>C267437</td><td>3</td><td>50.0</td><td>15.7500</td><td>165.7500</td><td>TeleShop</td><td>Books</td><td>Children</td><td>M</td><td>Purchase</td><td>35.0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>counts/n</th><th>counts/null</th><th>types/integral</th><th>types/fractional</th><th>types/boolean</th><th>types/string</th><th>types/object</th><th>cardinality/est</th><th>cardinality/upper_1</th><th>cardinality/lower_1</th><th>...</th><th>distribution/q_05</th><th>distribution/q_10</th><th>distribution/q_25</th><th>distribution/median</th><th>distribution/q_75</th><th>distribution/q_90</th><th>distribution/q_95</th><th>distribution/q_99</th><th>ints/max</th><th>ints/min</th></tr>\n",
"<tr><th>column</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>Gender</th><td>100</td><td>0</td><td>0</td><td>0</td><td>0</td><td>100</td><td>0</td><td>2.000000</td><td>2.000100</td><td>2.0</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>Total Amount</th><td>100</td><td>0</td><td>0</td><td>100</td><td>0</td><td>0</td><td>0</td><td>99.000024</td><td>99.004967</td><td>99.0</td><td>...</td><td>-153.816</td><td>8.619</td><td>66.521</td><td>216.359</td><td>321.555</td><td>580.788</td><td>642.5575</td><td>795.6</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>Customer ID</th><td>100</td><td>0</td><td>0</td><td>0</td><td>0</td><td>100</td><td>0</td><td>98.000024</td><td>98.004917</td><td>98.0</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>Item Price</th><td>100</td><td>0</td><td>0</td><td>100</td><td>0</td><td>0</td><td>0</td><td>97.000023</td><td>97.004866</td><td>97.0</td><td>...</td><td>10.000</td><td>25.700</td><td>40.700</td><td>76.800</td><td>111.100</td><td>135.200</td><td>139.8000</td><td>148.9</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>Transaction ID</th><td>100</td><td>0</td><td>0</td><td>0</td><td>0</td><td>100</td><td>0</td><td>99.000024</td><td>99.004967</td><td>99.0</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows × 28 columns</p>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>counts/n</th><th>counts/null</th><th>types/integral</th><th>types/fractional</th><th>types/boolean</th><th>types/string</th><th>types/object</th><th>cardinality/est</th><th>cardinality/upper_1</th><th>cardinality/lower_1</th><th>...</th><th>distribution/q_05</th><th>distribution/q_10</th><th>distribution/q_25</th><th>distribution/median</th><th>distribution/q_75</th><th>distribution/q_90</th><th>distribution/q_95</th><th>distribution/q_99</th><th>ints/max</th><th>ints/min</th></tr>\n",
"<tr><th>column</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>Gender</th><td>945</td><td>0</td><td>0</td><td>0</td><td>0</td><td>945</td><td>0</td><td>2.000000</td><td>2.000100</td><td>2.000000</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>Total Amount</th><td>945</td><td>0</td><td>0</td><td>945</td><td>0</td><td>0</td><td>0</td><td>844.069184</td><td>855.117588</td><td>833.289540</td><td>...</td><td>-233.376</td><td>14.365</td><td>79.0075</td><td>179.452</td><td>356.915</td><td>580.788</td><td>654.16</td><td>804.44</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>Customer ID</th><td>945</td><td>0</td><td>0</td><td>0</td><td>0</td><td>945</td><td>0</td><td>869.683985</td><td>881.067672</td><td>858.577213</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>Item Price</th><td>945</td><td>0</td><td>0</td><td>945</td><td>0</td><td>0</td><td>0</td><td>705.028228</td><td>714.256661</td><td>696.024282</td><td>...</td><td>13.800</td><td>22.300</td><td>45.0000</td><td>80.600</td><td>116.600</td><td>138.200</td><td>145.10</td><td>149.00</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>Transaction ID</th><td>945</td><td>0</td><td>0</td><td>0</td><td>0</td><td>945</td><td>0</td><td>935.275741</td><td>947.517988</td><td>923.331294</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows × 28 columns</p>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>counts/n</th><th>counts/null</th><th>types/integral</th><th>types/fractional</th><th>types/boolean</th><th>types/string</th><th>types/object</th><th>cardinality/est</th><th>cardinality/upper_1</th><th>cardinality/lower_1</th><th>...</th><th>distribution/q_05</th><th>distribution/q_10</th><th>distribution/q_25</th><th>distribution/median</th><th>distribution/q_75</th><th>distribution/q_90</th><th>distribution/q_95</th><th>distribution/q_99</th><th>ints/max</th><th>ints/min</th></tr>\n",
"<tr><th>column</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>Gender</th><td>100</td><td>0</td><td>0</td><td>0</td><td>0</td><td>100</td><td>0</td><td>2.000000</td><td>2.000100</td><td>2.0</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>Total Amount</th><td>100</td><td>0</td><td>0</td><td>100</td><td>0</td><td>0</td><td>0</td><td>99.000024</td><td>99.004967</td><td>99.0</td><td>...</td><td>-153.816</td><td>8.619</td><td>66.521</td><td>216.359</td><td>321.555</td><td>580.788</td><td>642.5575</td><td>795.6</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>Customer ID</th><td>100</td><td>0</td><td>0</td><td>0</td><td>0</td><td>100</td><td>0</td><td>98.000024</td><td>98.004917</td><td>98.0</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>Item Price</th><td>100</td><td>0</td><td>0</td><td>100</td><td>0</td><td>0</td><td>0</td><td>97.000023</td><td>97.004866</td><td>97.0</td><td>...</td><td>10.000</td><td>25.700</td><td>40.700</td><td>76.800</td><td>111.100</td><td>135.200</td><td>139.8000</td><td>148.9</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>Transaction ID</th><td>100</td><td>0</td><td>0</td><td>0</td><td>0</td><td>100</td><td>0</td><td>99.000024</td><td>99.004967</td><td>99.0</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows × 28 columns</p>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>counts/n</th><th>counts/null</th><th>types/integral</th><th>types/fractional</th><th>types/boolean</th><th>types/string</th><th>types/object</th><th>frequent_items/frequent_strings</th><th>cardinality/est</th><th>cardinality/upper_1</th><th>...</th><th>distribution/q_05</th><th>distribution/q_10</th><th>distribution/q_25</th><th>distribution/median</th><th>distribution/q_75</th><th>distribution/q_90</th><th>distribution/q_95</th><th>distribution/q_99</th><th>ints/max</th><th>ints/min</th></tr>\n",
"<tr><th>column</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>Gender</th><td>945</td><td>0</td><td>0</td><td>0</td><td>0</td><td>945</td><td>0</td><td>[FrequentItem(value='M', est=489, upper=489, l...</td><td>2.000000</td><td>2.000100</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>Total Amount</th><td>945</td><td>0</td><td>0</td><td>945</td><td>0</td><td>0</td><td>0</td><td>NaN</td><td>849.185065</td><td>860.300432</td><td>...</td><td>-233.376</td><td>14.365</td><td>78.676</td><td>178.126</td><td>357.0255</td><td>580.346</td><td>657.475</td><td>804.44</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>Customer ID</th><td>945</td><td>0</td><td>0</td><td>0</td><td>0</td><td>945</td><td>0</td><td>[FrequentItem(value='C273096', est=3, upper=2,...</td><td>858.998625</td><td>873.131713</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>Item Price</th><td>945</td><td>0</td><td>0</td><td>945</td><td>0</td><td>0</td><td>0</td><td>NaN</td><td>701.002487</td><td>710.178225</td><td>...</td><td>15.000</td><td>22.700</td><td>45.200</td><td>81.200</td><td>116.7000</td><td>138.200</td><td>145.600</td><td>149.00</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>Transaction ID</th><td>945</td><td>0</td><td>0</td><td>0</td><td>0</td><td>945</td><td>0</td><td>[FrequentItem(value='T79960195196', est=3, upp...</td><td>942.466233</td><td>957.972612</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows × 28 columns</p>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>counts/n</th><th>counts/null</th><th>types/integral</th><th>types/fractional</th><th>types/boolean</th><th>types/string</th><th>types/object</th><th>cardinality/est</th><th>cardinality/upper_1</th><th>cardinality/lower_1</th><th>...</th><th>distribution/q_05</th><th>distribution/q_10</th><th>distribution/q_25</th><th>distribution/median</th><th>distribution/q_75</th><th>distribution/q_90</th><th>distribution/q_95</th><th>distribution/q_99</th><th>ints/max</th><th>ints/min</th></tr>\n",
"<tr><th>column</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>Gender</th><td>100</td><td>0</td><td>0</td><td>0</td><td>0</td><td>100</td><td>0</td><td>2.000000</td><td>2.000100</td><td>2.0</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>Total Amount</th><td>100</td><td>0</td><td>0</td><td>100</td><td>0</td><td>0</td><td>0</td><td>99.000024</td><td>99.004967</td><td>99.0</td><td>...</td><td>-153.816</td><td>8.619</td><td>66.521</td><td>216.359</td><td>321.555</td><td>580.788</td><td>642.5575</td><td>795.6</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>Customer ID</th><td>100</td><td>0</td><td>0</td><td>0</td><td>0</td><td>100</td><td>0</td><td>98.000024</td><td>98.004917</td><td>98.0</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>Item Price</th><td>100</td><td>0</td><td>0</td><td>100</td><td>0</td><td>0</td><td>0</td><td>97.000023</td><td>97.004866</td><td>97.0</td><td>...</td><td>10.000</td><td>25.700</td><td>40.700</td><td>76.800</td><td>111.100</td><td>135.200</td><td>139.8000</td><td>148.9</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>Transaction ID</th><td>100</td><td>0</td><td>0</td><td>0</td><td>0</td><td>100</td><td>0</td><td>99.000024</td><td>99.004967</td><td>99.0</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows × 28 columns</p>
\n", "