{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "013a0fd4-31f9-4f3f-bf3a-1efc9640422f", "metadata": { "id": "013a0fd4-31f9-4f3f-bf3a-1efc9640422f" }, "source": [ "\n", "![alt text](https://whylabs-public.s3.us-west-2.amazonaws.com/assets/whylabs-logo-night-blue.svg)\n", "\n", "*Run AI with Certainty*\n", "\n", "# **Getting Started with WhyLabs** " ] }, { "attachments": {}, "cell_type": "markdown", "id": "dFKGE4P7N06M", "metadata": { "id": "dFKGE4P7N06M" }, "source": [ "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/integrations/writers/Getting_Started_with_WhyLabsV1.ipynb)\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "7iwPGGTl1Uuz", "metadata": { "id": "7iwPGGTl1Uuz" }, "source": [ "### 🚩 **Step 1: Create a WhyLabs account** \n", "To use this example notebook, first head to [WhyLabs](https://www.whylabs.ai/free) and sign up for a free account.\n", "\n", "**You can skip the onboarding code example if you are using this notebook.**\n", "\n", "As part of the onboarding workflow, you will receive an **organization ID** for your account. 
This is the identifier for your account.\n", "\n", "You'll also need to create an access token as part of the onboarding flow.\n", "\n", "#### 🔑 *If you already have a WhyLabs account* \n", "Please go to *Settings* -> *Access Tokens* to generate tokens.\n", "\n", "\n", "\n", "---\n", "\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "l8o9dKSU1X6H", "metadata": { "id": "l8o9dKSU1X6H" }, "source": [ "### 🛠 **Step 2: Install whylogs and import dependencies** \n", "To begin, run the cell below to install the **[whylogs](https://github.com/whylabs/whylogs)** library.\n", "\n", "[![License](http://img.shields.io/:license-Apache%202-blue.svg)](https://github.com/whylabs/whylogs-python/blob/mainline/LICENSE)\n", "[![PyPI version](https://badge.fury.io/py/whylogs.svg)](https://badge.fury.io/py/whylogs)\n", "[![Coverage Status](https://coveralls.io/repos/github/whylabs/whylogs/badge.svg?branch=mainline)](https://coveralls.io/github/whylabs/whylogs?branch=mainline)\n", "[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/python/black)\n", "[![CII Best Practices](https://bestpractices.coreinfrastructure.org/projects/4490/badge)](https://bestpractices.coreinfrastructure.org/projects/4490)\n", "[![PyPi Downloads](https://pepy.tech/badge/whylogs)](https://pepy.tech/project/whylogs)\n", "![CI](https://github.com/whylabs/whylogs-python/workflows/whylogs%20CI/badge.svg)\n", "[![Maintainability](https://api.codeclimate.com/v1/badges/442f6ca3dca1e583a488/maintainability)](https://codeclimate.com/github/whylabs/whylogs-python/maintainability)\n", "\n", "✅ The `whylogs` library profiles data in real time, collecting thousands of metrics from structured data, unstructured data, and ML model predictions with zero configuration.\n", "\n", "\n", "✅ This library runs locally on your machine and collects relevant metrics in dataset profiles that can be both written to disk and uploaded to the WhyLabs Platform 
for monitoring." ] }, { "cell_type": "code", "execution_count": null, "id": "ad907ce3-0c3b-49e4-86f1-eae9de934f7b", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ad907ce3-0c3b-49e4-86f1-eae9de934f7b", "jupyter": { "outputs_hidden": true }, "outputId": "cbf178c0-9028-4002-ae01-568502d30b17", "tags": [] }, "outputs": [], "source": [ "# Note: you may need to restart the kernel to use updated packages.\n", "# The following WhyLabs Platform integration example requires the latest whylogs version:\n", "%pip install whylogs" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a244145c-ea35-4ab6-b03e-cf5f864ed94c", "metadata": { "id": "a244145c-ea35-4ab6-b03e-cf5f864ed94c" }, "source": [ "### **Step 3: Load example data batches**\n", "\n", "The example data is loaded from our public S3 bucket, where we have prepared a few example CSV files." ] }, { "cell_type": "code", "execution_count": 1, "id": "b78028ea-c7cb-494f-a303-071f1c345dfc", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "b78028ea-c7cb-494f-a303-071f1c345dfc", "outputId": "6acdedee-c4fd-4377-b525-beeaec390a2e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading data from https://whylabs-public.s3.us-west-2.amazonaws.com/demo_batches/input_batch_1.csv\n", "Loading data from https://whylabs-public.s3.us-west-2.amazonaws.com/demo_batches/input_batch_2.csv\n", "Loading data from https://whylabs-public.s3.us-west-2.amazonaws.com/demo_batches/input_batch_3.csv\n", "Loading data from https://whylabs-public.s3.us-west-2.amazonaws.com/demo_batches/input_batch_4.csv\n", "Loading data from https://whylabs-public.s3.us-west-2.amazonaws.com/demo_batches/input_batch_5.csv\n", "Loading data from https://whylabs-public.s3.us-west-2.amazonaws.com/demo_batches/input_batch_6.csv\n", "Loading data from https://whylabs-public.s3.us-west-2.amazonaws.com/demo_batches/input_batch_7.csv\n" ] } ], "source": [ "import 
pandas as pd\n", "\n", "pdfs = []\n", "for i in range(1, 8):\n", " path = f\"https://whylabs-public.s3.us-west-2.amazonaws.com/demo_batches/input_batch_{i}.csv\"\n", " print(f\"Loading data from {path}\")\n", " df = pd.read_csv(path)\n", " pdfs.append(df)" ] }, { "cell_type": "code", "execution_count": 2, "id": "67b81ab4-a456-4d2d-9547-ad0d772e0aaa", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 394 }, "id": "67b81ab4-a456-4d2d-9547-ad0d772e0aaa", "outputId": "6c2268d7-43fe-4d1d-a4dc-738db5c747cd" }, "outputs": [ { "data": { "text/html": [ "
8 rows × 126 columns
" ], "text/plain": [ " Unnamed: 0 id member_id loan_amnt funded_amnt \\\n", "count 407.000000 4.070000e+02 0.0 407.000000 407.000000 \n", "mean 12548.717445 1.158631e+08 NaN 14203.746929 14203.746929 \n", "std 125.354772 1.207642e+06 NaN 9351.142374 9351.142374 \n", "min 12325.000000 1.121538e+08 NaN 1000.000000 1000.000000 \n", "25% 12442.500000 1.150769e+08 NaN 7000.000000 7000.000000 \n", "50% 12550.000000 1.157004e+08 NaN 12000.000000 12000.000000 \n", "75% 12653.500000 1.168245e+08 NaN 20000.000000 20000.000000 \n", "max 12862.000000 1.181592e+08 NaN 40000.000000 40000.000000 \n", "\n", " funded_amnt_inv int_rate installment annual_inc desc ... \\\n", "count 407.000000 407.000000 407.000000 407.000000 0.0 ... \n", "mean 14202.948403 13.514054 418.020344 78818.956069 NaN ... \n", "std 9350.997874 5.446881 271.096531 55864.939403 NaN ... \n", "min 1000.000000 5.320000 34.220000 0.000000 NaN ... \n", "25% 7000.000000 9.930000 235.580000 43325.000000 NaN ... \n", "50% 12000.000000 12.620000 357.250000 63300.000000 NaN ... \n", "75% 20000.000000 16.020000 553.515000 95000.000000 NaN ... \n", "max 40000.000000 30.990000 1417.710000 495000.000000 NaN ... 
\n", "\n", " hardship_loan_status orig_projected_additional_accrued_interest \\\n", "count 0.0 0.0 \n", "mean NaN NaN \n", "std NaN NaN \n", "min NaN NaN \n", "25% NaN NaN \n", "50% NaN NaN \n", "75% NaN NaN \n", "max NaN NaN \n", "\n", " hardship_payoff_balance_amount hardship_last_payment_amount \\\n", "count 0.0 0.0 \n", "mean NaN NaN \n", "std NaN NaN \n", "min NaN NaN \n", "25% NaN NaN \n", "50% NaN NaN \n", "75% NaN NaN \n", "max NaN NaN \n", "\n", " debt_settlement_flag_date settlement_status settlement_date \\\n", "count 0.0 0.0 0.0 \n", "mean NaN NaN NaN \n", "std NaN NaN NaN \n", "min NaN NaN NaN \n", "25% NaN NaN NaN \n", "50% NaN NaN NaN \n", "75% NaN NaN NaN \n", "max NaN NaN NaN \n", "\n", " settlement_amount settlement_percentage settlement_term \n", "count 0.0 0.0 0.0 \n", "mean NaN NaN NaN \n", "std NaN NaN NaN \n", "min NaN NaN NaN \n", "25% NaN NaN NaN \n", "50% NaN NaN NaN \n", "75% NaN NaN NaN \n", "max NaN NaN NaN \n", "\n", "[8 rows x 126 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pdfs[0].describe()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "834d6471-8490-48ea-bb52-6be31662dc97", "metadata": { "id": "834d6471-8490-48ea-bb52-6be31662dc97" }, "source": [ "### ⚙️ **Step 4: Configure whylogs** \n", "\n", "`whylogs`, by default, does not send statistics to WhyLabs.\n", "\n", "A few small setup steps are required first. 
If you don't have an access key yet, please onboard with WhyLabs and generate an API key at https://hub.whylabsapp.com/settings/access-tokens.\n", "\n", "**WhyLabs only requires whylogs profiles - your raw data never leaves your machine.**" ] }, { "cell_type": "code", "execution_count": 3, "id": "31371bc6-4ec8-4518-84a0-4718b19d1506", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "31371bc6-4ec8-4518-84a0-4718b19d1506", "outputId": "ddb7451b-5354-4318-e2eb-294ab4d4a7e2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "❓ What kind of session do you want to use?\n", " ⤷ 1. WhyLabs. Use an api key to upload to WhyLabs.\n", " ⤷ 2. WhyLabs Anonymous. Upload data anonymously to WhyLabs and get a viewing url.\n", "\n", "Initializing session with config /home/jamie/.config/whylogs/config.ini\n", "\n", "✅ Using session type: WHYLABS_ANONYMOUS\n", " ⤷ session id: \n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import whylogs as why\n", "\n", "# Create a model in the dashboard and use that model id as the default dataset id in the prompt here. It will be\n", "# saved in your whylogs config for future use. You can optionally supply reinit=True to reset your config.\n", "why.init()" ] }, { "cell_type": "markdown", "id": "49d0e0af", "metadata": {}, "source": [ "You can also run this init from the command line:\n", "\n", "```bash\n", "python -m whylogs.api.whylabs.session.why_init\n", "```\n", "\n", "You can use this to reset your config if you want to change your API key or default dataset ID."
 ] }, { "attachments": {}, "cell_type": "markdown", "id": "8d7bbb9c-9b18-4e12-bab2-88a5176eeba7", "metadata": { "id": "8d7bbb9c-9b18-4e12-bab2-88a5176eeba7" }, "source": [ "### 📬 **Step 5: Logging to WhyLabs** \n", "\n", "Ensure you have a **model ID** (also called **dataset ID**) before you start!\n", "\n", "#### Dataset Timestamp\n", "* To avoid confusion, it's recommended that you use an **[aware datetime](https://docs.python.org/3/library/datetime.html#:~:text=For%20applications%20requiring,is%20in%20effect.)** with `whylogs`\n", "* If you don't set the `dataset_timestamp` parameter, it defaults to the current UTC time\n", "* WhyLabs supports real-time visualization when the timestamp is **within the last 7 days**. Anything older than that will be picked up when we run our batch processing\n", "* **If you log two profiles for the same day with different timestamps (12:00AM vs 12:01AM), they are merged into the same batch**\n", "\n", "#### Logging Different Batches of Data\n", "* We'll give the profiles different **dates**\n", "* Create a new logger for each date. 
Note that the logger needs to be closed to flush out the data (the context manager in the example does this automatically)." ] }, { "cell_type": "code", "execution_count": 4, "id": "3a006a2b-8403-477f-a1f5-a37ea33b160c", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "3a006a2b-8403-477f-a1f5-a37ea33b160c", "outputId": "7cbbb771-2936-4fbb-b927-d23a485e29fd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "✅ Aggregated 407 rows into profile \n", "\n", "Visualize and explore this profile with one-click\n", "🔍 https://hub.whylabsapp.com/resources/model-1/profiles?profile=1725321600000&sessionToken=session-GKTK6PAd\n", "\n", "✅ Aggregated 390 rows into profile \n", "\n", "Visualize and explore this profile with one-click\n", "🔍 https://hub.whylabsapp.com/resources/model-1/profiles?profile=1725235200000&sessionToken=session-GKTK6PAd\n", "\n", "✅ Aggregated 382 rows into profile \n", "\n", "Visualize and explore this profile with one-click\n", "🔍 https://hub.whylabsapp.com/resources/model-1/profiles?profile=1725148800000&sessionToken=session-GKTK6PAd\n", "\n", "✅ Aggregated 371 rows into profile \n", "\n", "Visualize and explore this profile with one-click\n", "🔍 https://hub.whylabsapp.com/resources/model-1/profiles?profile=1725062400000&sessionToken=session-GKTK6PAd\n", "\n", "✅ Aggregated 301 rows into profile \n", "\n", "Visualize and explore this profile with one-click\n", "🔍 https://hub.whylabsapp.com/resources/model-1/profiles?profile=1724976000000&sessionToken=session-GKTK6PAd\n", "\n", "✅ Aggregated 392 rows into profile \n", "\n", "Visualize and explore this profile with one-click\n", "🔍 https://hub.whylabsapp.com/resources/model-1/profiles?profile=1724889600000&sessionToken=session-GKTK6PAd\n", "\n", "✅ Aggregated 283 rows into profile \n", "\n", "Visualize and explore this profile with one-click\n", "🔍 
https://hub.whylabsapp.com/resources/model-1/profiles?profile=1724803200000&sessionToken=session-GKTK6PAd\n" ] } ], "source": [ "import datetime\n", "\n", "import whylogs as why\n", "\n", "for i, df in enumerate(pdfs):\n", " # Walk backwards one day per batch. Each dataset has to map to a distinct date to show up\n", " # as a different batch in WhyLabs\n", " dt = datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=i)\n", "\n", " # log each day's data and set the date on the profile\n", " results = why.log(df, dataset_timestamp=dt)" ] }, { "cell_type": "code", "execution_count": 5, "id": "ScXRPmxJVjju", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "id": "ScXRPmxJVjju", "outputId": "5d9e7f45-5c0b-4bd6-d369-3f7bb94c0f65" }, "outputs": [ { "data": { "text/html": [ "To view your statistics, go to the model dashboard" ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import HTML\n", "\n", "from whylogs.api.whylabs.session.session_manager import get_current_session\n", "\n", "session = get_current_session()\n", "model_id = session.config.get_default_dataset_id()\n", "\n", "HTML(f'To view your statistics, go to the model dashboard')" ] }, { "attachments": {}, "cell_type": "markdown", "id": "b2c81d63-f420-4a36-8960-ef093b2f895f", "metadata": { "id": "b2c81d63-f420-4a36-8960-ef093b2f895f" }, "source": [ "### 📈 **Step 6: Inspect statistics in WhyLabs** \n", "\n", "WhyLabs stores the following statistics, depending on what is configured in `whylogs`:\n", "\n", "* Simple counters: boolean, null values, data types.\n", "* Summary statistics: sum, min, max, median, variance.\n", "* Unique value counter or cardinality: tracks the approximate number of unique values of your feature using the HyperLogLog algorithm.\n", "* Histograms for numerical features. 
The `whylogs` binary output can be queried with dynamic binning based on the shape of your data.\n", "* Top frequent items (default is 128). Note that this configuration affects the memory footprint, especially for text features.\n", "\n", "Notice that these statistics are organized in batches, so if you run the above cells again, you'll see the statistics change. \n", "\n", "* Now check the application to see your **statistics**.\n", "* Also, run the above cell again for the same model ID. Do you see the statistics change in WhyLabs? Especially the counters?" ] }, { "attachments": {}, "cell_type": "markdown", "id": "HO8KS8ep1Npe", "metadata": { "id": "HO8KS8ep1Npe" }, "source": [ "### **Step 7: Run WhyLabs with your data** \n", "\n", "To go further, visit our [documentation](https://docs.whylabs.ai/) for more details on everything you can do to start monitoring your ML and data pipelines.\n", "\n", "You can also join our [Community Slack Channel](http://join.slack.whylabs.ai/) for questions related to `whylogs` or [cut us a ticket](https://support.whylabs.ai/) if you encounter issues with WhyLabs onboarding.\n" ] } ], "metadata": { "colab": { "collapsed_sections": [], "provenance": [] }, "kernelspec": { "display_name": "Python 3.8.10 ('.venv': poetry)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "vscode": { "interpreter": { "hash": "d39f874c9b8a97550ecbd783714b95e79c9b905449b34f44c40e3bf053b54b41" } } }, "nbformat": 4, "nbformat_minor": 5 }