{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ ">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*
\n", ">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Getting_Started)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Getting_Started) to leverage the power of whylogs and WhyLabs together!*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting Started" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/basic/Getting_Started.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "whylogs provides a standard to log any kind of data.\n", "\n", "With whylogs, we will show how to log data, generating statistical summaries called *profiles*. These profiles can be used in a number of ways, like:\n", "\n", "* Data Visualization\n", "* Data Validation\n", "* Tracking changes in your datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of Content" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we'll explore the basics of logging data with whylogs:\n", "\n", "- Installing whylogs\n", "- Profiling data\n", "- Interacting with the profile\n", "- Writing/Reading profiles to/from disk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installing whylogs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "whylogs is made available as a Python package. You can get the latest version from PyPI with `pip install whylogs`:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Note: you may need to restart the kernel to use updated packages.\n", "%pip install whylogs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Minimal requirements:\n", "\n", "- Python 3.7+ up to Python 3.10\n", "- Windows, Linux x86_64, and MacOS 10+" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading a Pandas DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before showing how we can log data, we first need the data itself. Let's create a simple Pandas DataFrame:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "data = {\n", " \"animal\": [\"cat\", \"hawk\", \"snake\", \"cat\"],\n", " \"legs\": [4, 2, 0, 4],\n", " \"weight\": [4.3, 1.8, 1.3, 4.1],\n", "}\n", "\n", "df = pd.DataFrame(data)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Profiling with whylogs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To obtain a profile of your data, you can simply use whylogs' `log` call, and navigate through the result to a specific profile with `profile()`:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import whylogs as why\n", "\n", "results = why.log(df)\n", "profile = results.profile()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analyzing Profiles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you're done logging the data, you can generate a `Profile View` and inspect it in a Pandas Dataframe format:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cardinality/estcardinality/lower_1cardinality/upper_1counts/infcounts/ncounts/nancounts/nulldistribution/maxdistribution/meandistribution/median...frequent_items/frequent_stringstypetypes/booleantypes/fractionaltypes/integraltypes/objecttypes/stringtypes/tensorints/maxints/min
column
animal3.03.03.000150400NaN0.000NaN...[FrequentItem(value='cat', est=2, upper=2, low...SummaryType.COLUMN000040NaNNaN
legs3.03.03.0001504004.02.5004.0...[FrequentItem(value='4', est=2, upper=2, lower...SummaryType.COLUMN0040004.00.0
weight4.04.04.0002004004.32.8754.1...NaNSummaryType.COLUMN040000NaNNaN
\n", "

3 rows × 31 columns

\n", "
" ], "text/plain": [ " cardinality/est cardinality/lower_1 cardinality/upper_1 counts/inf \\\n", "column \n", "animal 3.0 3.0 3.00015 0 \n", "legs 3.0 3.0 3.00015 0 \n", "weight 4.0 4.0 4.00020 0 \n", "\n", " counts/n counts/nan counts/null distribution/max \\\n", "column \n", "animal 4 0 0 NaN \n", "legs 4 0 0 4.0 \n", "weight 4 0 0 4.3 \n", "\n", " distribution/mean distribution/median ... \\\n", "column ... \n", "animal 0.000 NaN ... \n", "legs 2.500 4.0 ... \n", "weight 2.875 4.1 ... \n", "\n", " frequent_items/frequent_strings type \\\n", "column \n", "animal [FrequentItem(value='cat', est=2, upper=2, low... SummaryType.COLUMN \n", "legs [FrequentItem(value='4', est=2, upper=2, lower... SummaryType.COLUMN \n", "weight NaN SummaryType.COLUMN \n", "\n", " types/boolean types/fractional types/integral types/object \\\n", "column \n", "animal 0 0 0 0 \n", "legs 0 0 4 0 \n", "weight 0 4 0 0 \n", "\n", " types/string types/tensor ints/max ints/min \n", "column \n", "animal 4 0 NaN NaN \n", "legs 0 0 4.0 0.0 \n", "weight 0 0 NaN NaN \n", "\n", "[3 rows x 31 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prof_view = profile.view()\n", "prof_df = prof_view.to_pandas()\n", "\n", "prof_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This will provide you with valuable statistics on a column (feature) basis, such as:\n", "\n", "- Counters, such as number of samples and null values\n", "- Inferred types, such as integral, fractional and boolean\n", "- Estimated Cardinality\n", "- Frequent Items\n", "- Distribution Metrics: min,max, median, quantile values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing to Disk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also store your profile in disk for further inspection:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "why.write(profile, \"profile.bin\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This will create a profile binary file in your local filesystem." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading from Disk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can read the profile back into memory with:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "n_prof = why.read(\"profile.bin\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Note: `write` expects a profile as parameter, while `read` returns a `Profile View`. That means that you can use the loaded profile for visualization purposes and merging, but not for further tracking and updates." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What's Next?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's a lot you can do with the profiles you just created. Keep getting your hands dirty with the following examples!" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "- Basic\n", " - [Visualizing Profiles](https://whylogs.readthedocs.io/en/stable/examples/basic/Notebook_Profile_Visualizer.html) - Compare profiles to detect distribution shifts, visualize histograms and bar charts and explore your data\n", " - [Logging Data](https://whylogs.readthedocs.io/en/stable/examples/basic/Logging_Different_Data.html) - See the different ways you can log your data with whylogs\n", " - [Inspecting Profiles](https://whylogs.readthedocs.io/en/stable/examples/basic/Inspecting_Profiles.html) - A deeper dive on the metrics generated by whylogs\n", " - [Schema Configuration for Tracking Metrics](https://whylogs.readthedocs.io/en/stable/examples/basic/Schema_Configuration.html) - Configure tracking metrics according to data type or column features\n", " - [Data Constraints](https://whylogs.readthedocs.io/en/stable/examples/advanced/Metric_Constraints.html) - Set constraints to your data to ensure its quality\n", " - [Merging Profiles](https://whylogs.readthedocs.io/en/stable/examples/basic/Merging_Profiles.html) - Merge your profiles logged across different computing instances, time periods or data segments\n", "- Integrations\n", " - [WhyLabs](https://whylogs.readthedocs.io/en/stable/examples/integrations/writers/Writing_to_WhyLabs.html) - Monitor your profiles continuously with the WhyLabs Observability Platform\n", " - [Pyspark](https://whylogs.readthedocs.io/en/stable/examples/integrations/Pyspark_Profiling.html) - Use whylogs with pyspark\n", " - [Writing Profiles](https://whylogs.readthedocs.io/en/stable/examples/integrations/writers/Writing_Profiles.html) - See different ways and locations to output your profiles\n", " - [Flask](https://whylogs.readthedocs.io/en/stable/examples/integrations/flask_streaming/flask_with_whylogs.html) - See how you can create a Flask app with whylogs and WhyLabs integration\n", " - [Feature Stores](https://whylogs.readthedocs.io/en/stable/examples/integrations/Feature_Stores_and_whylogs.html) - Learn how to log features from your Feature Store with feast and whylogs\n", " - [BigQuery](https://whylogs.readthedocs.io/en/stable/examples/integrations/BigQuery_Example.html) - Profile data queried from a Google BigQuery table\n", " - [MLflow](https://whylogs.readthedocs.io/en/stable/examples/integrations/Mlflow_Logging.html) - Log your whylogs profiles to an MLflow environment\n", "\n", "Or go to the [examples page](https://whylogs.readthedocs.io/en/stable/examples.html) for the complete list of examples!" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "5dd5901cadfd4b29c2aaf95ecd29c0c3b10829ad94dcfe59437dbee391154aea" } } }, "nbformat": 4, "nbformat_minor": 2 }