{
"cells": [
{
"cell_type": "markdown",
"id": "bbfe7789",
"metadata": {},
"source": [
">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*\n",
">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Inspecting_Profiles)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Inspecting_Profiles) to leverage the power of whylogs and WhyLabs together!*"
]
},
{
"cell_type": "markdown",
"id": "wgBeKz4TmtP7",
"metadata": {
"id": "wgBeKz4TmtP7"
},
"source": [
"# Inspecting Profiles\n",
"\n",
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/basic/Inspecting_Profiles.ipynb)\n",
"\n",
"In this notebook, we'll show how you can use whylogs' Profile Viewer (`profile.view()`) to find useful statistics in a dataset.\n",
"\n",
"This includes:\n",
"\n",
"- Counters, such as number of samples and null values\n",
"- Inferred types, such as integral, fractional, boolean, and string\n",
"- Estimated cardinality\n",
"- Frequent items\n",
"- Distribution metrics: min, max, mean, median, standard deviation, and quantile values\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "eShCq4LYGae9",
"metadata": {
"id": "eShCq4LYGae9"
},
"source": [
"## Setup\n",
"We'll need the `whylogs` and `pandas` libraries for this example.\n",
"\n",
"We'll also populate a dataframe with some data to inspect.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ad907ce3-0c3b-49e4-86f1-eae9de934f7b",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ad907ce3-0c3b-49e4-86f1-eae9de934f7b",
"jupyter": {
"outputs_hidden": true
},
"outputId": "36cb94da-cb73-43d6-b26f-5e2360fe71f0",
"tags": []
},
"outputs": [],
"source": [
"# install whylogs & pandas if needed\n",
"# Note: you may need to restart the kernel to use updated packages.\n",
"%pip install whylogs\n",
"%pip install pandas "
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "8369d3a8-9bf2-4043-a45a-13838498f211",
"metadata": {
"id": "8369d3a8-9bf2-4043-a45a-13838498f211"
},
"outputs": [],
"source": [
"# import whylogs and pandas\n",
"import whylogs as why\n",
"import pandas as pd\n",
"\n",
"# Set to show all columns in dataframe\n",
"pd.set_option(\"display.max_columns\", None)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "WdF4F9FugqHq",
"metadata": {
"id": "WdF4F9FugqHq"
},
"outputs": [],
"source": [
"# create a simple test dataset\n",
"data = {\n",
"    \"animal\": [\"lion\", \"shark\", \"cat\", \"bear\", \"jellyfish\", \"kangaroo\",\n",
"               \"jellyfish\", \"jellyfish\", \"fish\"],\n",
"    \"legs\": [4, 0, 4, 4.0, None, 2, None, None, \"fins\"],\n",
"    \"weight\": [14.3, 11.8, 4.3, 30.1, 2.0, 120.0, 2.7, 2.2, 1.2],\n",
"}\n",
"\n",
"# Create dataframe with test dataset\n",
"df = pd.DataFrame(data)"
]
},
{
"cell_type": "markdown",
"id": "0nzsw8mHdzO6",
"metadata": {
"id": "0nzsw8mHdzO6"
},
"source": [
"## Log data with whylogs, create a profile, and view statistics:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "OHDz8SmCgqE6",
"metadata": {
"id": "OHDz8SmCgqE6"
},
"outputs": [],
"source": [
"# Log data with whylogs & create profile\n",
"results = why.log(pandas=df)\n",
"profile = results.profile()\n",
"\n",
"# Create profile view dataframe\n",
"prof_view = profile.view()\n",
"prof_df = prof_view.to_pandas()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e6CXme06hook",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 274
},
"id": "e6CXme06hook",
"outputId": "a5a61521-a39e-4daa-f386-bdda1252bf59"
},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th>column</th><th>counts/n</th><th>counts/null</th><th>types/integral</th><th>types/fractional</th><th>types/boolean</th><th>types/string</th><th>types/object</th><th>cardinality/est</th><th>cardinality/upper_1</th><th>cardinality/lower_1</th><th>frequent_items/frequent_strings</th><th>type</th><th>distribution/mean</th><th>distribution/stddev</th><th>distribution/n</th><th>distribution/max</th><th>distribution/min</th><th>distribution/q_10</th><th>distribution/q_25</th><th>distribution/median</th><th>distribution/q_75</th><th>distribution/q_90</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>legs</th><td>9</td><td>3</td><td>4</td><td>1</td><td>0</td><td>1</td><td>0</td><td>4.0</td><td>4.00020</td><td>4.0</td><td>[FrequentItem(value='4.000000', est=3, upper=3...</td><td>SummaryType.COLUMN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>weight</th><td>9</td><td>0</td><td>0</td><td>9</td><td>0</td><td>0</td><td>0</td><td>9.0</td><td>9.00045</td><td>9.0</td><td>NaN</td><td>SummaryType.COLUMN</td><td>20.955556</td><td>38.29749</td><td>9.0</td><td>120.0</td><td>1.2</td><td>1.2</td><td>2.2</td><td>4.3</td><td>14.3</td><td>120.0</td></tr>\n",
"<tr><th>animal</th><td>9</td><td>0</td><td>0</td><td>0</td><td>0</td><td>9</td><td>0</td><td>7.0</td><td>7.00035</td><td>7.0</td><td>[FrequentItem(value='jellyfish', est=3, upper=...</td><td>SummaryType.COLUMN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"</tbody>\n",
"</table>\n",
"