{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ ">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*
\n", ">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Merging_Profiles)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Merging_Profiles) to leverage the power of whylogs and WhyLabs together!*" ] }, { "cell_type": "markdown", "metadata": { "id": "IJ2tqS2oh8wp" }, "source": [ "# Merging Profiles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/basic/Merging_Profiles.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "id": "TTP91R40h8wr" }, "source": [ "Sometimes we may want to profile a dataset in chunks. For example, we may have our dataset distributed across multiple files or nodes, or perhaps our dataset is too large to fit in memory. Maybe we already profiled our dataset for several different date ranges and we want to see a holistic view of our data across the entire range.\n", "\n", "In any case, merging profiles is a solution!\n" ] }, { "cell_type": "markdown", "metadata": { "id": "rZoJE6nYh8wr" }, "source": [ "## Installing whylogs" ] }, { "cell_type": "markdown", "metadata": { "id": "ubeZjbMzh8ws" }, "source": [ "whylogs is made available as a Python package. You can get the latest version from PyPI with `pip install whylogs`:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "LgAFe39bh8ws" }, "outputs": [], "source": [ "# Note: you may need to restart the kernel to use updated packages.\n", "%pip install whylogs" ] }, { "cell_type": "markdown", "metadata": { "id": "4NnapN6Mh8wt" }, "source": [ "## Loading a Pandas DataFrame" ] }, { "cell_type": "markdown", "metadata": { "id": "zAW_RioVh8wt" }, "source": [ "Before profiling data, lets create a Pandas DataFrame from a public dataset. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 381 }, "id": "bI4RnpBoh8wt", "outputId": "db4e9122-434f-4ef2-f6d2-25650333d135" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "row count: 945\n" ] }, { "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Transaction IDCustomer IDQuantityItem PriceTotal TaxTotal AmountStore TypeProduct CategoryProduct SubcategoryGenderTransaction TypeAge
682T74278458640C267835263.213.2720139.6720TeleShopBooksDIYMPurchase33.0
256T54377774372C270496575.539.6375417.1375MBRElectronicsAudio and videoMPurchase23.0
67T64030190529C269524552.527.5625290.0625e-ShopBagsMensFPurchase24.0
762T18970114223C272730280.616.9260178.1260e-ShopHome and kitchenKitchenMPurchase39.0
94T94404065446C271648548.025.2000265.2000e-ShopClothingKidsFPurchase34.0
197T30540748600C2696031127.413.3770140.7770TeleShopFootwearWomenMPurchase41.0
161T78998671169C2709071104.911.0145115.9145MBRBooksDIYFPurchase41.0
574T19424023275C2704622127.526.7750281.7750TeleShopElectronicsCamerasFPurchase38.0
583T7986658313C269047325.98.158585.8585e-ShopClothingWomenMPurchase25.0
805T36786634925C267437350.015.7500165.7500TeleShopBooksChildrenMPurchase35.0
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " Transaction ID Customer ID Quantity Item Price Total Tax Total Amount \\\n", "682 T74278458640 C267835 2 63.2 13.2720 139.6720 \n", "256 T54377774372 C270496 5 75.5 39.6375 417.1375 \n", "67 T64030190529 C269524 5 52.5 27.5625 290.0625 \n", "762 T18970114223 C272730 2 80.6 16.9260 178.1260 \n", "94 T94404065446 C271648 5 48.0 25.2000 265.2000 \n", "197 T30540748600 C269603 1 127.4 13.3770 140.7770 \n", "161 T78998671169 C270907 1 104.9 11.0145 115.9145 \n", "574 T19424023275 C270462 2 127.5 26.7750 281.7750 \n", "583 T7986658313 C269047 3 25.9 8.1585 85.8585 \n", "805 T36786634925 C267437 3 50.0 15.7500 165.7500 \n", "\n", " Store Type Product Category Product Subcategory Gender Transaction Type \\\n", "682 TeleShop Books DIY M Purchase \n", "256 MBR Electronics Audio and video M Purchase \n", "67 e-Shop Bags Mens F Purchase \n", "762 e-Shop Home and kitchen Kitchen M Purchase \n", "94 e-Shop Clothing Kids F Purchase \n", "197 TeleShop Footwear Women M Purchase \n", "161 MBR Books DIY F Purchase \n", "574 TeleShop Electronics Cameras F Purchase \n", "583 e-Shop Clothing Women M Purchase \n", "805 TeleShop Books Children M Purchase \n", "\n", " Age \n", "682 33.0 \n", "256 23.0 \n", "67 24.0 \n", "762 39.0 \n", "94 34.0 \n", "197 41.0 \n", "161 41.0 \n", "574 38.0 \n", "583 25.0 \n", "805 35.0 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df_full= pd.read_csv(\"https://whylabs-public.s3.us-west-2.amazonaws.com/datasets/tour/current.csv\")\n", "\n", "print('row count: {}'.format(df_full.shape[0]))\n", "df_full.sample(10)" ] }, { "cell_type": "markdown", "metadata": { "id": "6l8P81Q4oDcG" }, "source": [ "This dataset contains 945 rows and contains a mix of numeric and categorical features. Lets split this DataFrame into 3 chunks of different sizes." ] }, { "cell_type": "markdown", "metadata": { "id": "vCvnKAIbomkC" }, "source": [ "## Splitting the DataFrame" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "tYIlfMjmokvx", "outputId": "758d2734-b4b2-47de-b541-cea34e6be96c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Row Counts:\n", "Subset 1: 100\n", "Subset 2: 300\n", "Subset 3: 545\n" ] } ], "source": [ "df_subset1= df_full[0:100]\n", "df_subset2= df_full[100:400]\n", "df_subset3= df_full[400:]\n", "\n", "print('Row Counts:')\n", "print('Subset 1: {}'.format(df_subset1.shape[0]))\n", "print('Subset 2: {}'.format(df_subset2.shape[0]))\n", "print('Subset 3: {}'.format(df_subset3.shape[0]))" ] }, { "cell_type": "markdown", "metadata": { "id": "S7g4GV4OpLPz" }, "source": [ "## Profiling a Single Dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "X9AZ9QwYpN9R" }, "source": [ "Lets profile the first subset. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "-LPBoK0vpTIj" }, "outputs": [], "source": [ "import whylogs as why\n", "\n", "results = why.log(df_subset1)\n", "profile = results.profile()" ] }, { "cell_type": "markdown", "metadata": { "id": "iO97sPGmqBIR" }, "source": [ "The code above generates a *ProfileResultSet* instance and assigns it to the **results** variable. We then call the **profile** method on this object to generate a *DatasetProfile* instance which we assign to the **profile** variable. \n", "\n", "We can inspect our profile by generating a pandas DataFrame from it. Lets view the first few rows. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 383 }, "id": "kphCdAQmqHAY", "outputId": "04f837a2-39f8-423d-cdcc-741066553e6d" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
counts/ncounts/nulltypes/integraltypes/fractionaltypes/booleantypes/stringtypes/objectcardinality/estcardinality/upper_1cardinality/lower_1...distribution/q_05distribution/q_10distribution/q_25distribution/mediandistribution/q_75distribution/q_90distribution/q_95distribution/q_99ints/maxints/min
column
Gender100000010002.0000002.0001002.0...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
Total Amount1000010000099.00002499.00496799.0...-153.8168.61966.521216.359321.555580.788642.5575795.6NaNNaN
Customer ID1000000100098.00002498.00491798.0...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
Item Price1000010000097.00002397.00486697.0...10.00025.70040.70076.800111.100135.200139.8000148.9NaNNaN
Transaction ID1000000100099.00002499.00496799.0...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

5 rows × 28 columns

\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " counts/n counts/null types/integral types/fractional \\\n", "column \n", "Gender 100 0 0 0 \n", "Total Amount 100 0 0 100 \n", "Customer ID 100 0 0 0 \n", "Item Price 100 0 0 100 \n", "Transaction ID 100 0 0 0 \n", "\n", " types/boolean types/string types/object cardinality/est \\\n", "column \n", "Gender 0 100 0 2.000000 \n", "Total Amount 0 0 0 99.000024 \n", "Customer ID 0 100 0 98.000024 \n", "Item Price 0 0 0 97.000023 \n", "Transaction ID 0 100 0 99.000024 \n", "\n", " cardinality/upper_1 cardinality/lower_1 ... \\\n", "column ... \n", "Gender 2.000100 2.0 ... \n", "Total Amount 99.004967 99.0 ... \n", "Customer ID 98.004917 98.0 ... \n", "Item Price 97.004866 97.0 ... \n", "Transaction ID 99.004967 99.0 ... \n", "\n", " distribution/q_05 distribution/q_10 distribution/q_25 \\\n", "column \n", "Gender NaN NaN NaN \n", "Total Amount -153.816 8.619 66.521 \n", "Customer ID NaN NaN NaN \n", "Item Price 10.000 25.700 40.700 \n", "Transaction ID NaN NaN NaN \n", "\n", " distribution/median distribution/q_75 distribution/q_90 \\\n", "column \n", "Gender NaN NaN NaN \n", "Total Amount 216.359 321.555 580.788 \n", "Customer ID NaN NaN NaN \n", "Item Price 76.800 111.100 135.200 \n", "Transaction ID NaN NaN NaN \n", "\n", " distribution/q_95 distribution/q_99 ints/max ints/min \n", "column \n", "Gender NaN NaN NaN NaN \n", "Total Amount 642.5575 795.6 NaN NaN \n", "Customer ID NaN NaN NaN NaN \n", "Item Price 139.8000 148.9 NaN NaN \n", "Transaction ID NaN NaN NaN NaN \n", "\n", "[5 rows x 28 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "subset1_profile_df = profile.view().to_pandas()\n", "subset1_profile_df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "FTBfgUxYqYJ0" }, "source": [ "From the **counts/n** column, we can see that our subset of data contained 100 rows, as expected. Before we start merging new profiles, lets grab the mean of the \"Item Price\" column for another point of reference. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "id": "XQUZYX9rq39p", "outputId": "decf807d-1644-415e-b2ca-7e0ea09a09c4" }, "outputs": [ { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'Mean Item Price for Subset 1: 74.189'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"Mean Item Price for Subset 1: {}\".format(subset1_profile_df['distribution/mean'].loc['Item Price'])" ] }, { "cell_type": "markdown", "metadata": { "id": "IaKTy7I4rZBH" }, "source": [ "## Merging Profiles" ] }, { "cell_type": "markdown", "metadata": { "id": "LnZxOBYotJAT" }, "source": [ "We can call the track method on our profile to profile a new dataset and merge this with our existing profile in one step. This can be done successively for multiple subsets of data." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "o-S-Eum9sM3l" }, "outputs": [], "source": [ "profile.track(df_subset2)\n", "profile.track(df_subset3)" ] }, { "cell_type": "markdown", "metadata": { "id": "QHjBSPhBtlnY" }, "source": [ "Lets now inspect the merged profile as a Pandas DataFrame" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 383 }, "id": "dyFkZ8vdstZ_", "outputId": "e7bc0992-7129-47a7-c01b-6e4a6a3a58df" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
counts/ncounts/nulltypes/integraltypes/fractionaltypes/booleantypes/stringtypes/objectcardinality/estcardinality/upper_1cardinality/lower_1...distribution/q_05distribution/q_10distribution/q_25distribution/mediandistribution/q_75distribution/q_90distribution/q_95distribution/q_99ints/maxints/min
column
Gender945000094502.0000002.0001002.000000...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
Total Amount94500945000844.069184855.117588833.289540...-233.37614.36579.0075179.452356.915580.788654.16804.44NaNNaN
Customer ID94500009450869.683985881.067672858.577213...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
Item Price94500945000705.028228714.256661696.024282...13.80022.30045.000080.600116.600138.200145.10149.00NaNNaN
Transaction ID94500009450935.275741947.517988923.331294...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

5 rows × 28 columns

\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " counts/n counts/null types/integral types/fractional \\\n", "column \n", "Gender 945 0 0 0 \n", "Total Amount 945 0 0 945 \n", "Customer ID 945 0 0 0 \n", "Item Price 945 0 0 945 \n", "Transaction ID 945 0 0 0 \n", "\n", " types/boolean types/string types/object cardinality/est \\\n", "column \n", "Gender 0 945 0 2.000000 \n", "Total Amount 0 0 0 844.069184 \n", "Customer ID 0 945 0 869.683985 \n", "Item Price 0 0 0 705.028228 \n", "Transaction ID 0 945 0 935.275741 \n", "\n", " cardinality/upper_1 cardinality/lower_1 ... \\\n", "column ... \n", "Gender 2.000100 2.000000 ... \n", "Total Amount 855.117588 833.289540 ... \n", "Customer ID 881.067672 858.577213 ... \n", "Item Price 714.256661 696.024282 ... \n", "Transaction ID 947.517988 923.331294 ... \n", "\n", " distribution/q_05 distribution/q_10 distribution/q_25 \\\n", "column \n", "Gender NaN NaN NaN \n", "Total Amount -233.376 14.365 79.0075 \n", "Customer ID NaN NaN NaN \n", "Item Price 13.800 22.300 45.0000 \n", "Transaction ID NaN NaN NaN \n", "\n", " distribution/median distribution/q_75 distribution/q_90 \\\n", "column \n", "Gender NaN NaN NaN \n", "Total Amount 179.452 356.915 580.788 \n", "Customer ID NaN NaN NaN \n", "Item Price 80.600 116.600 138.200 \n", "Transaction ID NaN NaN NaN \n", "\n", " distribution/q_95 distribution/q_99 ints/max ints/min \n", "column \n", "Gender NaN NaN NaN NaN \n", "Total Amount 654.16 804.44 NaN NaN \n", "Customer ID NaN NaN NaN NaN \n", "Item Price 145.10 149.00 NaN NaN \n", "Transaction ID NaN NaN NaN NaN \n", "\n", "[5 rows x 28 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "full_profile_df = profile.view().to_pandas()\n", "full_profile_df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "rrOONmeztuuv" }, "source": [ "We now see that each column has a count of 945 which we expect. Lets revisit the mean of the Items Price column." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "id": "7B_0YW4iuBX0", "outputId": "081d136c-a4e5-41fe-af44-d7c87fc8beae" }, "outputs": [ { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'Mean Item Price from merged profile: 79.84814814814818'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"Mean Item Price from merged profile: {}\".format(full_profile_df['distribution/mean'].loc['Item Price'])" ] }, { "cell_type": "markdown", "metadata": { "id": "4VgGQiyGuNtU" }, "source": [ "Lets compare this with the mean we get using the the **mean** method from Pandas." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "3UWpqHP-uP98", "outputId": "5b2b6bee-9026-4029-97c7-d01b10e041f3" }, "outputs": [ { "data": { "text/plain": [ "79.84814814814808" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_full['Item Price'].mean()" ] }, { "cell_type": "markdown", "metadata": { "id": "amhK30yeuv_6" }, "source": [ "Its nearly an exact match! Note that in this example, we profiled 3 datasets of unequal sizes independently and merged together 3 profiles. This merged profile captured telemetry describing our entire dataset. \n", "\n", "This property of **mergeability** makes whylogs particularly powerful. It allows us to profile datasets which live in distributed pipeline even if our data is never together in one place at any time. \n", "\n", "Mergeability also makes it a trivial matter to roll up from hourly to daily, weekly, or monthly level views of your data. " ] }, { "cell_type": "markdown", "metadata": { "id": "gFMVcxf4h8wu" }, "source": [ "## Merging Profile Views" ] }, { "cell_type": "markdown", "metadata": { "id": "tvqfp4WEv-Vk" }, "source": [ "Another option is to merge Profile *Views*. \n", "\n", "A ProfileView object can be generated from a DatasetProfile object which allows for inspection of individiaul profiles, as well as the ability to visualize profiles using the our visualization module. \n", "\n", "This is a good option if users wish to inspect profiles of their entire dataset while maintaining the ability to inspect individual profiles of the subsets of data. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "xnf-GXRpw039" }, "outputs": [], "source": [ "results = why.log(df_subset1)\n", "profile_view1 = results.profile().view()\n", "\n", "results = why.log(df_subset2)\n", "profile_view2 = results.profile().view()\n", "\n", "results = why.log(df_subset3)\n", "profile_view3 = results.profile().view()" ] }, { "cell_type": "markdown", "metadata": { "id": "zy_gZdTbx2PV" }, "source": [ "Similar to the previous example, we find that the first profile view counted 100 rows in the subset of data it profiled." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 383 }, "id": "7XaAUFVHyEB3", "outputId": "1ecbbfc0-6c24-42b9-df7d-6f5df4c6c105" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
counts/ncounts/nulltypes/integraltypes/fractionaltypes/booleantypes/stringtypes/objectcardinality/estcardinality/upper_1cardinality/lower_1...distribution/q_05distribution/q_10distribution/q_25distribution/mediandistribution/q_75distribution/q_90distribution/q_95distribution/q_99ints/maxints/min
column
Gender100000010002.0000002.0001002.0...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
Total Amount1000010000099.00002499.00496799.0...-153.8168.61966.521216.359321.555580.788642.5575795.6NaNNaN
Customer ID1000000100098.00002498.00491798.0...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
Item Price1000010000097.00002397.00486697.0...10.00025.70040.70076.800111.100135.200139.8000148.9NaNNaN
Transaction ID1000000100099.00002499.00496799.0...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

5 rows × 28 columns

\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " counts/n counts/null types/integral types/fractional \\\n", "column \n", "Gender 100 0 0 0 \n", "Total Amount 100 0 0 100 \n", "Customer ID 100 0 0 0 \n", "Item Price 100 0 0 100 \n", "Transaction ID 100 0 0 0 \n", "\n", " types/boolean types/string types/object cardinality/est \\\n", "column \n", "Gender 0 100 0 2.000000 \n", "Total Amount 0 0 0 99.000024 \n", "Customer ID 0 100 0 98.000024 \n", "Item Price 0 0 0 97.000023 \n", "Transaction ID 0 100 0 99.000024 \n", "\n", " cardinality/upper_1 cardinality/lower_1 ... \\\n", "column ... \n", "Gender 2.000100 2.0 ... \n", "Total Amount 99.004967 99.0 ... \n", "Customer ID 98.004917 98.0 ... \n", "Item Price 97.004866 97.0 ... \n", "Transaction ID 99.004967 99.0 ... \n", "\n", " distribution/q_05 distribution/q_10 distribution/q_25 \\\n", "column \n", "Gender NaN NaN NaN \n", "Total Amount -153.816 8.619 66.521 \n", "Customer ID NaN NaN NaN \n", "Item Price 10.000 25.700 40.700 \n", "Transaction ID NaN NaN NaN \n", "\n", " distribution/median distribution/q_75 distribution/q_90 \\\n", "column \n", "Gender NaN NaN NaN \n", "Total Amount 216.359 321.555 580.788 \n", "Customer ID NaN NaN NaN \n", "Item Price 76.800 111.100 135.200 \n", "Transaction ID NaN NaN NaN \n", "\n", " distribution/q_95 distribution/q_99 ints/max ints/min \n", "column \n", "Gender NaN NaN NaN NaN \n", "Total Amount 642.5575 795.6 NaN NaN \n", "Customer ID NaN NaN NaN NaN \n", "Item Price 139.8000 148.9 NaN NaN \n", "Transaction ID NaN NaN NaN NaN \n", "\n", "[5 rows x 28 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "profile_view1.to_pandas().head()" ] }, { "cell_type": "markdown", "metadata": { "id": "JZ_J3i1QyI4Q" }, "source": [ "We can merge these ProfileView objects using the **merge** method. We assign the result to a new variable and view a few rows of the profile's DataFrame." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 400 }, "id": "aFcvdlKJyM4z", "outputId": "762016fa-5079-47dd-f5c7-a93ef42358bd" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
counts/ncounts/nulltypes/integraltypes/fractionaltypes/booleantypes/stringtypes/objectfrequent_items/frequent_stringscardinality/estcardinality/upper_1...distribution/q_05distribution/q_10distribution/q_25distribution/mediandistribution/q_75distribution/q_90distribution/q_95distribution/q_99ints/maxints/min
column
Gender94500009450[FrequentItem(value='M', est=489, upper=489, l...2.0000002.000100...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
Total Amount94500945000NaN849.185065860.300432...-233.37614.36578.676178.126357.0255580.346657.475804.44NaNNaN
Customer ID94500009450[FrequentItem(value='C273096', est=3, upper=2,...858.998625873.131713...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
Item Price94500945000NaN701.002487710.178225...15.00022.70045.20081.200116.7000138.200145.600149.00NaNNaN
Transaction ID94500009450[FrequentItem(value='T79960195196', est=3, upp...942.466233957.972612...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

5 rows × 28 columns

\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " counts/n counts/null types/integral types/fractional \\\n", "column \n", "Gender 945 0 0 0 \n", "Total Amount 945 0 0 945 \n", "Customer ID 945 0 0 0 \n", "Item Price 945 0 0 945 \n", "Transaction ID 945 0 0 0 \n", "\n", " types/boolean types/string types/object \\\n", "column \n", "Gender 0 945 0 \n", "Total Amount 0 0 0 \n", "Customer ID 0 945 0 \n", "Item Price 0 0 0 \n", "Transaction ID 0 945 0 \n", "\n", " frequent_items/frequent_strings \\\n", "column \n", "Gender [FrequentItem(value='M', est=489, upper=489, l... \n", "Total Amount NaN \n", "Customer ID [FrequentItem(value='C273096', est=3, upper=2,... \n", "Item Price NaN \n", "Transaction ID [FrequentItem(value='T79960195196', est=3, upp... \n", "\n", " cardinality/est cardinality/upper_1 ... distribution/q_05 \\\n", "column ... \n", "Gender 2.000000 2.000100 ... NaN \n", "Total Amount 849.185065 860.300432 ... -233.376 \n", "Customer ID 858.998625 873.131713 ... NaN \n", "Item Price 701.002487 710.178225 ... 15.000 \n", "Transaction ID 942.466233 957.972612 ... NaN \n", "\n", " distribution/q_10 distribution/q_25 distribution/median \\\n", "column \n", "Gender NaN NaN NaN \n", "Total Amount 14.365 78.676 178.126 \n", "Customer ID NaN NaN NaN \n", "Item Price 22.700 45.200 81.200 \n", "Transaction ID NaN NaN NaN \n", "\n", " distribution/q_75 distribution/q_90 distribution/q_95 \\\n", "column \n", "Gender NaN NaN NaN \n", "Total Amount 357.0255 580.346 657.475 \n", "Customer ID NaN NaN NaN \n", "Item Price 116.7000 138.200 145.600 \n", "Transaction ID NaN NaN NaN \n", "\n", " distribution/q_99 ints/max ints/min \n", "column \n", "Gender NaN NaN NaN \n", "Total Amount 804.44 NaN NaN \n", "Customer ID NaN NaN NaN \n", "Item Price 149.00 NaN NaN \n", "Transaction ID NaN NaN NaN \n", "\n", "[5 rows x 28 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merged_profile_view = profile_view1.merge(profile_view2).merge(profile_view3)\n", "\n", "merged_profile_view.to_pandas().head()" ] }, { "cell_type": "markdown", "metadata": { "id": "8XRzAjSSy4U2" }, "source": [ "As expected, we see 945 rows. Unlike the **track** method, the merge method doesn't update the original objects directly. In other words, we can still inspect the individual profiles views from our subsets of data. \n", "\n", "Keep in mind that the **track** method only works on *DatasetProfile* objects, while the **merge** method only operates on *DatasetProfileView* objects. " ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 383 }, "id": "RghlXHuozMpY", "outputId": "c787e26d-b46d-43ec-9181-5dd271357208" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
counts/ncounts/nulltypes/integraltypes/fractionaltypes/booleantypes/stringtypes/objectcardinality/estcardinality/upper_1cardinality/lower_1...distribution/q_05distribution/q_10distribution/q_25distribution/mediandistribution/q_75distribution/q_90distribution/q_95distribution/q_99ints/maxints/min
column
Gender100000010002.0000002.0001002.0...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
Total Amount1000010000099.00002499.00496799.0...-153.8168.61966.521216.359321.555580.788642.5575795.6NaNNaN
Customer ID1000000100098.00002498.00491798.0...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
Item Price1000010000097.00002397.00486697.0...10.00025.70040.70076.800111.100135.200139.8000148.9NaNNaN
Transaction ID1000000100099.00002499.00496799.0...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

5 rows × 28 columns

\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " counts/n counts/null types/integral types/fractional \\\n", "column \n", "Gender 100 0 0 0 \n", "Total Amount 100 0 0 100 \n", "Customer ID 100 0 0 0 \n", "Item Price 100 0 0 100 \n", "Transaction ID 100 0 0 0 \n", "\n", " types/boolean types/string types/object cardinality/est \\\n", "column \n", "Gender 0 100 0 2.000000 \n", "Total Amount 0 0 0 99.000024 \n", "Customer ID 0 100 0 98.000024 \n", "Item Price 0 0 0 97.000023 \n", "Transaction ID 0 100 0 99.000024 \n", "\n", " cardinality/upper_1 cardinality/lower_1 ... \\\n", "column ... \n", "Gender 2.000100 2.0 ... \n", "Total Amount 99.004967 99.0 ... \n", "Customer ID 98.004917 98.0 ... \n", "Item Price 97.004866 97.0 ... \n", "Transaction ID 99.004967 99.0 ... \n", "\n", " distribution/q_05 distribution/q_10 distribution/q_25 \\\n", "column \n", "Gender NaN NaN NaN \n", "Total Amount -153.816 8.619 66.521 \n", "Customer ID NaN NaN NaN \n", "Item Price 10.000 25.700 40.700 \n", "Transaction ID NaN NaN NaN \n", "\n", " distribution/median distribution/q_75 distribution/q_90 \\\n", "column \n", "Gender NaN NaN NaN \n", "Total Amount 216.359 321.555 580.788 \n", "Customer ID NaN NaN NaN \n", "Item Price 76.800 111.100 135.200 \n", "Transaction ID NaN NaN NaN \n", "\n", " distribution/q_95 distribution/q_99 ints/max ints/min \n", "column \n", "Gender NaN NaN NaN NaN \n", "Total Amount 642.5575 795.6 NaN NaN \n", "Customer ID NaN NaN NaN NaN \n", "Item Price 139.8000 148.9 NaN NaN \n", "Transaction ID NaN NaN NaN NaN \n", "\n", "[5 rows x 28 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "profile_view1.to_pandas().head()" ] }, { "cell_type": "markdown", "metadata": { "id": "1rXsivqCzv4E" }, "source": [ "## Mergeability in WhyLabs" ] }, { "cell_type": "markdown", "metadata": { "id": "3P5Giv-YzzeK" }, "source": [ "In WhyLabs, profile merging is done automatically. If you have a WhyLabs dataset with a daily batch frequency of 1 day, then any profiles uploaded during that day will automatically merged for a day-level view of your data. " ] }, { "cell_type": "markdown", "metadata": { "id": "rNlxz2TLh8wv" }, "source": [ "## What's Next?" ] }, { "cell_type": "markdown", "metadata": { "id": "hbUXVXyYh8wv" }, "source": [ "There's a lot you can do with the profiles you just created. You can take a look at our other examples at https://whylogs.readthedocs.io/en/latest/examples !" ] } ], "metadata": { "colab": { "name": "Getting_Started.ipynb", "provenance": [] }, "interpreter": { "hash": "f76ec28949fecf16b926a3fc5a03c1aa6468ee82fa5da4ce6fd607df021af5b5" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 1 }