{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
### ">
">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*\n",
">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Getting_Started)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Getting_Started) to leverage the power of whylogs and WhyLabs together!*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting Started"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/basic/Getting_Started.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"whylogs provides a standard to log any kind of data.\n",
"\n",
"With whylogs, we will show how to log data, generating statistical summaries called *profiles*. These profiles can be used in a number of ways, like:\n",
"\n",
"* Data Visualization\n",
"* Data Validation\n",
"* Tracking changes in your datasets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Content"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, we'll explore the basics of logging data with whylogs:\n",
"\n",
"- Installing whylogs\n",
"- Profiling data\n",
"- Interacting with the profile\n",
"- Writing/Reading profiles to/from disk"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installing whylogs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"whylogs is made available as a Python package. You can get the latest version from PyPI with `pip install whylogs`:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Note: you may need to restart the kernel to use updated packages.\n",
"%pip install whylogs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Minimal requirements:\n",
"\n",
"- Python 3.7+ up to Python 3.10\n",
"- Windows, Linux x86_64, and MacOS 10+"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading a Pandas DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before showing how we can log data, we first need the data itself. Let's create a simple Pandas DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"data = {\n",
" \"animal\": [\"cat\", \"hawk\", \"snake\", \"cat\"],\n",
" \"legs\": [4, 2, 0, 4],\n",
" \"weight\": [4.3, 1.8, 1.3, 4.1],\n",
"}\n",
"\n",
"df = pd.DataFrame(data)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Profiling with whylogs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To obtain a profile of your data, you can simply use whylogs' `log` call, and navigate through the result to a specific profile with `profile()`:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import whylogs as why\n",
"\n",
"results = why.log(df)\n",
"profile = results.profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analyzing Profiles"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you're done logging the data, you can generate a `Profile View` and inspect it in a Pandas Dataframe format:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | cardinality/est | \n", "cardinality/lower_1 | \n", "cardinality/upper_1 | \n", "counts/inf | \n", "counts/n | \n", "counts/nan | \n", "counts/null | \n", "distribution/max | \n", "distribution/mean | \n", "distribution/median | \n", "... | \n", "frequent_items/frequent_strings | \n", "type | \n", "types/boolean | \n", "types/fractional | \n", "types/integral | \n", "types/object | \n", "types/string | \n", "types/tensor | \n", "ints/max | \n", "ints/min | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
column | \n", "\n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " |
animal | \n", "3.0 | \n", "3.0 | \n", "3.00015 | \n", "0 | \n", "4 | \n", "0 | \n", "0 | \n", "NaN | \n", "0.000 | \n", "NaN | \n", "... | \n", "[FrequentItem(value='cat', est=2, upper=2, low... | \n", "SummaryType.COLUMN | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "4 | \n", "0 | \n", "NaN | \n", "NaN | \n", "
legs | \n", "3.0 | \n", "3.0 | \n", "3.00015 | \n", "0 | \n", "4 | \n", "0 | \n", "0 | \n", "4.0 | \n", "2.500 | \n", "4.0 | \n", "... | \n", "[FrequentItem(value='4', est=2, upper=2, lower... | \n", "SummaryType.COLUMN | \n", "0 | \n", "0 | \n", "4 | \n", "0 | \n", "0 | \n", "0 | \n", "4.0 | \n", "0.0 | \n", "
weight | \n", "4.0 | \n", "4.0 | \n", "4.00020 | \n", "0 | \n", "4 | \n", "0 | \n", "0 | \n", "4.3 | \n", "2.875 | \n", "4.1 | \n", "... | \n", "NaN | \n", "SummaryType.COLUMN | \n", "0 | \n", "4 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "NaN | \n", "NaN | \n", "
3 rows × 31 columns
\n", "