{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook, we will learn about the whylogs Python library and its output. \n",
"\n",
"# Getting Started with whylogs Profile Summaries\n",
"\n",
"We will first read sample raw data into Pandas from a file and explore that data briefly. To run whylogs, we will then import the whylogs library, initialize a logging session with whylogs, and create a profile for our data, producing a whylogs profile summary. Finally, we will explore some of the profile summary features.\n",
"\n",
"To get started, we will install the notebook's requirements and import a few standard Python libraries."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: boto3 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from -r requirements.txt (line 1)) (1.17.29)\n",
"Requirement already satisfied: certifi in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from -r requirements.txt (line 2)) (2020.12.5)\n",
"Requirement already satisfied: chardet in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from -r requirements.txt (line 3)) (4.0.0)\n",
"Requirement already satisfied: matplotlib in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from -r requirements.txt (line 4)) (3.3.4)\n",
"Requirement already satisfied: numpy in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from -r requirements.txt (line 5)) (1.20.1)\n",
"Requirement already satisfied: whylogs in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from -r requirements.txt (line 6)) (0.3.2)\n",
"Requirement already satisfied: botocore<1.21.0,>=1.20.29 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from boto3->-r requirements.txt (line 1)) (1.20.29)\n",
"Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from boto3->-r requirements.txt (line 1)) (0.10.0)\n",
"Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from boto3->-r requirements.txt (line 1)) (0.3.4)\n",
"Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from botocore<1.21.0,>=1.20.29->boto3->-r requirements.txt (line 1)) (2.8.1)\n",
"Requirement already satisfied: urllib3<1.27,>=1.25.4 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from botocore<1.21.0,>=1.20.29->boto3->-r requirements.txt (line 1)) (1.26.4)\n",
"Requirement already satisfied: six>=1.5 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.21.0,>=1.20.29->boto3->-r requirements.txt (line 1)) (1.15.0)\n",
"Requirement already satisfied: pillow>=6.2.0 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from matplotlib->-r requirements.txt (line 4)) (8.1.2)\n",
"Requirement already satisfied: cycler>=0.10 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from matplotlib->-r requirements.txt (line 4)) (0.10.0)\n",
"Requirement already satisfied: kiwisolver>=1.0.1 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from matplotlib->-r requirements.txt (line 4)) (1.3.1)\n",
"Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from matplotlib->-r requirements.txt (line 4)) (2.4.7)\n",
"Requirement already satisfied: whylabs-datasketches>=2.2.0b1 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from whylogs->-r requirements.txt (line 6)) (2.2.0b1)\n",
"Requirement already satisfied: smart-open==4.1.2 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from whylogs->-r requirements.txt (line 6)) (4.1.2)\n",
"Requirement already satisfied: click>=7.1.2 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from whylogs->-r requirements.txt (line 6)) (7.1.2)\n",
"Requirement already satisfied: scikit-learn==0.24.1 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from whylogs->-r requirements.txt (line 6)) (0.24.1)\n",
"Requirement already satisfied: tqdm==4.54.0 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from whylogs->-r requirements.txt (line 6)) (4.54.0)\n",
"Requirement already satisfied: protobuf>=3.12.2 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from whylogs->-r requirements.txt (line 6)) (3.15.6)\n",
"Requirement already satisfied: pyyaml>=5.3.1 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from whylogs->-r requirements.txt (line 6)) (5.4.1)\n",
"Requirement already satisfied: xlrd==2.0.1 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from whylogs->-r requirements.txt (line 6)) (2.0.1)\n",
"Requirement already satisfied: openpyxl==3.0.6 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from whylogs->-r requirements.txt (line 6)) (3.0.6)\n",
"Requirement already satisfied: puremagic==1.10 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from whylogs->-r requirements.txt (line 6)) (1.10)\n",
"Requirement already satisfied: pandas>1.0 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from whylogs->-r requirements.txt (line 6)) (1.2.3)\n",
"Requirement already satisfied: marshmallow>=3.7.1 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from whylogs->-r requirements.txt (line 6)) (3.10.0)\n",
"Requirement already satisfied: et-xmlfile in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from openpyxl==3.0.6->whylogs->-r requirements.txt (line 6)) (1.0.1)\n",
"Requirement already satisfied: jdcal in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from openpyxl==3.0.6->whylogs->-r requirements.txt (line 6)) (1.4.1)\n",
"Collecting argparse\n",
" Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)\n",
"Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from scikit-learn==0.24.1->whylogs->-r requirements.txt (line 6)) (2.1.0)\n",
"Requirement already satisfied: scipy>=0.19.1 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from scikit-learn==0.24.1->whylogs->-r requirements.txt (line 6)) (1.6.1)\n",
"Requirement already satisfied: joblib>=0.11 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from scikit-learn==0.24.1->whylogs->-r requirements.txt (line 6)) (1.0.1)\n",
"Requirement already satisfied: pytz>=2017.3 in /Users/andy/miniconda3/envs/demo/lib/python3.8/site-packages (from pandas>1.0->whylogs->-r requirements.txt (line 6)) (2021.1)\n",
"Installing collected packages: argparse\n",
"Successfully installed argparse-1.4.0\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install -r requirements.txt"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import warnings\n",
"warnings.simplefilter(\"ignore\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import datetime\n",
"import os.path\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"whylogs allows you to generate and store key characteristics of a growing dataset efficiently. In machine learning, datasets often consist of both input features and outputs of the model. In deployed systems, you often have a relatively static training dataset as well as a growing dataset from model input and output at inference time.\n",
"\n",
"## Downloading and exploring the raw Lending Club data\n",
"\n",
"In our case, we will download and explore a sample from the Lending Club dataset before logging a whylogs profile summary. Lending Club is a peer-to-peer lending and alternative investing website on which members can apply for personal loans and invest in personal loans to other Lending Club members. The company published a dataset with information spanning several years. This particular dataset contains only the accepted loans."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
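"Our sample input data is stored in `lending_club_demo.csv`. If the file is not in the notebook's working directory, you can prefix a line with `!` to run a shell command (e.g. `!ls`) and check where it is. To change the working directory itself, use the `%cd` magic, because `!cd` runs in a subshell and does not persist."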
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"data_file = \"lending_club_demo.csv\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's read that data file into a Pandas dataframe and look at the entries for *January 2017*.\n",
"\n",
"Each row refers to a particular loan instance, while each column refers to a variable in our dataset."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" id member_id loan_amnt funded_amnt funded_amnt_inv \\\n",
"count 3.090000e+02 0.0 309.000000 309.000000 309.000000 \n",
"mean 9.637541e+07 NaN 14511.407767 14511.407767 14506.957929 \n",
"std 1.648219e+06 NaN 9011.801950 9011.801950 9011.257397 \n",
"min 6.895309e+07 NaN 1000.000000 1000.000000 1000.000000 \n",
"25% 9.627937e+07 NaN 7500.000000 7500.000000 7500.000000 \n",
"50% 9.653771e+07 NaN 12000.000000 12000.000000 12000.000000 \n",
"75% 9.681416e+07 NaN 20000.000000 20000.000000 20000.000000 \n",
"max 9.752976e+07 NaN 40000.000000 40000.000000 40000.000000 \n",
"\n",
" int_rate installment annual_inc dti delinq_2yrs ... \\\n",
"count 309.000000 309.000000 309.000000 309.000000 309.000000 ... \n",
"mean 13.479159 446.427476 80151.667184 18.561489 0.372168 ... \n",
"std 5.168002 280.454947 51337.356187 9.955114 0.929671 ... \n",
"min 5.320000 32.930000 10000.000000 0.290000 0.000000 ... \n",
"25% 10.490000 235.260000 49680.000000 12.480000 0.000000 ... \n",
"50% 12.740000 370.480000 66000.000000 18.100000 0.000000 ... \n",
"75% 15.990000 582.260000 98000.000000 23.350000 0.000000 ... \n",
"max 30.940000 1400.690000 400000.000000 109.220000 8.000000 ... \n",
"\n",
" deferral_term hardship_amount hardship_length hardship_dpd \\\n",
"count 0.0 0.0 0.0 0.0 \n",
"mean NaN NaN NaN NaN \n",
"std NaN NaN NaN NaN \n",
"min NaN NaN NaN NaN \n",
"25% NaN NaN NaN NaN \n",
"50% NaN NaN NaN NaN \n",
"75% NaN NaN NaN NaN \n",
"max NaN NaN NaN NaN \n",
"\n",
" orig_projected_additional_accrued_interest \\\n",
"count 0.0 \n",
"mean NaN \n",
"std NaN \n",
"min NaN \n",
"25% NaN \n",
"50% NaN \n",
"75% NaN \n",
"max NaN \n",
"\n",
" hardship_payoff_balance_amount hardship_last_payment_amount \\\n",
"count 0.0 0.0 \n",
"mean NaN NaN \n",
"std NaN NaN \n",
"min NaN NaN \n",
"25% NaN NaN \n",
"50% NaN NaN \n",
"75% NaN NaN \n",
"max NaN NaN \n",
"\n",
" settlement_amount settlement_percentage settlement_term \n",
"count 0.0 0.0 0.0 \n",
"mean NaN NaN NaN \n",
"std NaN NaN NaN \n",
"min NaN NaN NaN \n",
"25% NaN NaN NaN \n",
"50% NaN NaN NaN \n",
"75% NaN NaN NaN \n",
"max NaN NaN NaN \n",
"\n",
"[8 rows x 114 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"full_data = pd.read_csv(data_file)\n",
"data = full_data[full_data['issue_d'] == 'Jan-2017']\n",
"\n",
"data.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interesting Lending Club dataset variables\n",
"\n",
"**`emp_length` (categorical, string)**:\n",
"> length of employment in years as text entries\n",
"\n",
"**`annual_inc` (numeric)**:\n",
"> the self-reported annual income provided by the borrower during registration\n",
"\n",
"**`dti` (numeric)**:\n",
"> ratio of the borrower’s total monthly debt payments on their total debt obligations (excluding mortgage and the requested LC loan) to the borrower’s self-reported monthly income\n",
"\n",
"**`issue_d` (timestamp, string)**:\n",
"> the month and year in which the loan was funded; useful for backfilling data"
]
},
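{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `issue_d` is stored as text such as `Jan-2017`, it usually needs parsing before it can serve as a timestamp for backfilling. The following is a minimal sketch on a small made-up frame (the `sample` values and the `issue_ts` column name are illustrative, not taken from the dataset), assuming English month abbreviations:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical sample; the real column comes from lending_club_demo.csv\n",
"sample = pd.DataFrame({'issue_d': ['Jan-2017', 'Feb-2017', 'Jan-2017']})\n",
"\n",
"# '%b-%Y' matches an abbreviated month name, a dash, and a four-digit year\n",
"sample['issue_ts'] = pd.to_datetime(sample['issue_d'], format='%b-%Y')\n",
"\n",
"# Count the rows that fall in each month\n",
"rows_per_month = sample.groupby('issue_ts').size()\n",
"```\n",
"\n",
"Each distinct `issue_ts` value could then be used as the dataset timestamp for one logged batch."
]
},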
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Running whylogs for logging a single dataset\n",
"\n",
"Let's import a function from whylogs that will allow us to create a logging session.\n",
"\n",
"This session can be connected with multiple writers that output the results of our profiling in JSON, a flat CSV, or binary protobuf format. These profiles can be stored locally or in an AWS S3 bucket in the cloud. Additional writing functionality will be added over time.\n",
"\n",
"Let's create a default session below."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"from whylogs import get_or_create_session\n",
"\n",
"session = get_or_create_session()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Quickly log a dataframe\n",
"\n",
"You can call `log_dataframe` on the session to quickly log a Pandas dataframe. Here we log the first 100 rows under the dataset name `demo`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"session.log_dataframe(data.head(100), 'demo')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## whylogs output\n",
"\n",
"Now that we've logged our dataset, we can see the output of the whylogs profiling process in a newly created directory. The whylogs logger creates a `whylogs-output` directory within our working directory. This directory in turn contains folders with the various summaries for our sample dataset, `demo`."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Current working directory: /Volumes/Workspace/whylogs-examples/python\n"
]
}
],
"source": [
"print(\"Current working directory:\", os.getcwd())"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"whylogs-output/demo/dataset_summary/freq_numbers/dataset_summary-batch.json\n",
"whylogs-output/demo/dataset_summary/json/dataset_summary-batch.json\n",
"whylogs-output/demo/dataset_summary/flat_table/dataset_summary-batch.csv\n",
"whylogs-output/demo/dataset_summary/histogram/dataset_summary-batch.json\n",
"whylogs-output/demo/dataset_summary/frequent_strings/dataset_summary-batch.json\n",
"whylogs-output/demo/dataset_profile/protobuf/datase_profile-batch.bin\n",
"whylogs-output/another-dataset/dataset_summary/freq_numbers/dataset_summary-1498867200000.json\n",
"whylogs-output/another-dataset/dataset_summary/freq_numbers/dataset_summary-1600732800000.json\n",
"whylogs-output/another-dataset/dataset_summary/json/dataset_summary-1498867200000.json\n",
"whylogs-output/another-dataset/dataset_summary/json/dataset_summary-1600732800000.json\n",
"whylogs-output/another-dataset/dataset_summary/flat_table/dataset_summary-1498867200000.csv\n",
"whylogs-output/another-dataset/dataset_summary/flat_table/dataset_summary-1600732800000.csv\n",
"whylogs-output/another-dataset/dataset_summary/histogram/dataset_summary-1498867200000.json\n",
"whylogs-output/another-dataset/dataset_summary/histogram/dataset_summary-1600732800000.json\n",
"whylogs-output/another-dataset/dataset_summary/frequent_strings/dataset_summary-1498867200000.json\n",
"whylogs-output/another-dataset/dataset_summary/frequent_strings/dataset_summary-1600732800000.json\n",
"whylogs-output/another-dataset/dataset_profile/protobuf/datase_profile-1498867200000.bin\n",
"whylogs-output/another-dataset/dataset_profile/protobuf/datase_profile-1600732800000.bin\n"
]
}
],
"source": [
"!find whylogs-output -type f"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using the Logger API\n",
"The Logger API can also be used to accumulate data profiles in memory. The profile stays in memory until the logger is closed, either by calling `.close()` explicitly or by leaving a `with` block; closing the logger writes the profile out."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"with session.logger(dataset_name=\"another-dataset\", dataset_timestamp=datetime.datetime(2017, 1, 1, 0, 0)) as logger:\n",
" logger.log_dataframe(data.head(100))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, the dataset timestamp is appended to each output file name as a suffix, as the new `1483228800000` entries in the listing below show."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"whylogs-output/demo/dataset_summary/freq_numbers/dataset_summary-batch.json\n",
"whylogs-output/demo/dataset_summary/json/dataset_summary-batch.json\n",
"whylogs-output/demo/dataset_summary/flat_table/dataset_summary-batch.csv\n",
"whylogs-output/demo/dataset_summary/histogram/dataset_summary-batch.json\n",
"whylogs-output/demo/dataset_summary/frequent_strings/dataset_summary-batch.json\n",
"whylogs-output/demo/dataset_profile/protobuf/datase_profile-batch.bin\n",
"whylogs-output/another-dataset/dataset_summary/freq_numbers/dataset_summary-1483228800000.json\n",
"whylogs-output/another-dataset/dataset_summary/freq_numbers/dataset_summary-1498867200000.json\n",
"whylogs-output/another-dataset/dataset_summary/freq_numbers/dataset_summary-1600732800000.json\n",
"whylogs-output/another-dataset/dataset_summary/json/dataset_summary-1483228800000.json\n",
"whylogs-output/another-dataset/dataset_summary/json/dataset_summary-1498867200000.json\n",
"whylogs-output/another-dataset/dataset_summary/json/dataset_summary-1600732800000.json\n",
"whylogs-output/another-dataset/dataset_summary/flat_table/dataset_summary-1483228800000.csv\n",
"whylogs-output/another-dataset/dataset_summary/flat_table/dataset_summary-1498867200000.csv\n",
"whylogs-output/another-dataset/dataset_summary/flat_table/dataset_summary-1600732800000.csv\n",
"whylogs-output/another-dataset/dataset_summary/histogram/dataset_summary-1483228800000.json\n",
"whylogs-output/another-dataset/dataset_summary/histogram/dataset_summary-1498867200000.json\n",
"whylogs-output/another-dataset/dataset_summary/histogram/dataset_summary-1600732800000.json\n",
"whylogs-output/another-dataset/dataset_summary/frequent_strings/dataset_summary-1483228800000.json\n",
"whylogs-output/another-dataset/dataset_summary/frequent_strings/dataset_summary-1498867200000.json\n",
"whylogs-output/another-dataset/dataset_summary/frequent_strings/dataset_summary-1600732800000.json\n",
"whylogs-output/another-dataset/dataset_profile/protobuf/datase_profile-1498867200000.bin\n",
"whylogs-output/another-dataset/dataset_profile/protobuf/datase_profile-1600732800000.bin\n",
"whylogs-output/another-dataset/dataset_profile/protobuf/datase_profile-1483228800000.bin\n"
]
}
],
"source": [
"!find whylogs-output -type f"
]
},
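{
"cell_type": "markdown",
"metadata": {},
"source": [
"The numeric suffix in these file names appears to be the dataset timestamp expressed in milliseconds since the Unix epoch, interpreted as UTC. A small sketch using only the standard library shows the conversion in both directions (the variable names are illustrative):\n",
"\n",
"```python\n",
"import datetime\n",
"\n",
"# Suffix taken from the file names listed above\n",
"suffix_ms = 1483228800000\n",
"\n",
"# Interpret the suffix as milliseconds since the Unix epoch, in UTC\n",
"as_utc = datetime.datetime.fromtimestamp(suffix_ms / 1000, tz=datetime.timezone.utc)\n",
"\n",
"# Going the other way: a UTC datetime back to epoch milliseconds\n",
"jan_2017 = datetime.datetime(2017, 1, 1, tzinfo=datetime.timezone.utc)\n",
"back_to_ms = int(jan_2017.timestamp() * 1000)\n",
"```\n",
"\n",
"Here `as_utc` is midnight on 1 January 2017 (UTC), which matches the `dataset_timestamp` passed to `session.logger` above."
]
},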
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Interacting with Dataset Profiles\n",
"\n",
"Instead of interacting with the Logger, which writes to disk, sometimes you may want to use a `DatasetProfile` object directly.\n",
"\n",
"You can use `session.new_profile` to create an empty profile:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"profile = session.new_profile(dataset_name=\"in-memory\", \n",
" dataset_timestamp=datetime.datetime(2017, 1, 1, 0, 0))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Profiling a DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"profile.track_dataframe(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This DatasetProfile object, stored in the `profile` variable, can now be referenced from Python.\n",
"\n",
"This object contains helpful information about the profile, such as the session ID, the dates associated with both the data and the session, as well as user-specified metadata and tags."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's transform the dataset profile into the flat summary form. Unlike the binary protobuf file (under `dataset_profile/protobuf`) and the hierarchical JSON file (under `dataset_summary/json`) that were written using the logger, the summary format makes it much easier to analyze and run data science processes on the data. The structure is much flatter: a table, or a single-depth dictionary, organized by variable.\n",
"\n",
"These less hierarchical formats were also created by `log_dataframe` and can be found in the `flat_table`, `histogram`, and `frequent_strings` folders under `dataset_summary`."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"summaries = profile.flat_summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's first take a look at the overall summary for the profiled dataset."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
" column count null_count bool_count \\\n",
"0 sec_app_open_act_il 309.0 0.0 0.0 \n",
"1 bc_open_to_buy 309.0 0.0 0.0 \n",
"2 mths_since_rcnt_il 309.0 0.0 0.0 \n",
"3 sec_app_collections_12_mths_ex_med 309.0 0.0 0.0 \n",
"4 chargeoff_within_12_mths 309.0 0.0 0.0 \n",
".. ... ... ... ... \n",
"145 settlement_percentage 309.0 0.0 0.0 \n",
"146 pymnt_plan 309.0 0.0 0.0 \n",
"147 total_rec_prncp 309.0 0.0 0.0 \n",
"148 all_util 309.0 0.0 0.0 \n",
"149 sec_app_mort_acc 309.0 0.0 0.0 \n",
"\n",
" numeric_count max mean min stddev \\\n",
"0 0.0 0.0 0.000000 0.0 0.000000 \n",
"1 305.0 96285.0 11781.862295 0.0 15110.810631 \n",
"2 304.0 228.0 23.013158 1.0 27.996225 \n",
"3 0.0 0.0 0.000000 0.0 0.000000 \n",
"4 309.0 1.0 0.003236 0.0 0.056888 \n",
".. ... ... ... ... ... \n",
"145 0.0 0.0 0.000000 0.0 0.000000 \n",
"146 0.0 0.0 0.000000 0.0 0.000000 \n",
"147 309.0 35000.0 5266.577896 262.7 6502.059928 \n",
"148 309.0 117.0 56.757282 2.0 21.046084 \n",
"149 0.0 0.0 0.000000 0.0 0.000000 \n",
"\n",
" nunique_numbers ... nunique_str_upper quantile_0.0000 \\\n",
"0 0.0 ... 0.0 NaN \n",
"1 302.0 ... 0.0 0.000000 \n",
"2 70.0 ... 0.0 1.000000 \n",
"3 0.0 ... 0.0 NaN \n",
"4 2.0 ... 0.0 0.000000 \n",
".. ... ... ... ... \n",
"145 0.0 ... 0.0 NaN \n",
"146 0.0 ... 1.0 NaN \n",
"147 276.0 ... 0.0 262.700012 \n",
"148 87.0 ... 0.0 2.000000 \n",
"149 0.0 ... 0.0 NaN \n",
"\n",
" quantile_0.0100 quantile_0.0500 quantile_0.2500 quantile_0.5000 \\\n",
"0 NaN NaN NaN NaN \n",
"1 10.000000 155.000000 2004.000000 6784.000000 \n",
"2 1.000000 3.000000 7.000000 14.000000 \n",
"3 NaN NaN NaN NaN \n",
"4 0.000000 0.000000 0.000000 0.000000 \n",
".. ... ... ... ... \n",
"145 NaN NaN NaN NaN \n",
"146 NaN NaN NaN NaN \n",
"147 349.440002 848.909973 1697.630005 2965.600098 \n",
"148 10.000000 18.000000 43.000000 58.000000 \n",
"149 NaN NaN NaN NaN \n",
"\n",
" quantile_0.7500 quantile_0.9500 quantile_0.9900 quantile_1.0000 \n",
"0 NaN NaN NaN NaN \n",
"1 15545.000000 43811.0 74544.0 96285.0 \n",
"2 27.000000 86.0 130.0 228.0 \n",
"3 NaN NaN NaN NaN \n",
"4 0.000000 0.0 0.0 1.0 \n",
".. ... ... ... ... \n",
"145 NaN NaN NaN NaN \n",
"146 NaN NaN NaN NaN \n",
"147 5597.330078 20000.0 35000.0 35000.0 \n",
"148 72.000000 89.0 106.0 117.0 \n",
"149 NaN NaN NaN NaN \n",
"\n",
"[150 rows x 32 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"summary = summaries['summary']\n",
"summary"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using the streaming mode\n",
"\n",
"It's convenient to call whylogs on a batch of data with a Pandas dataframe. However, in practice you might have only individual data points. In that case, `whylogs` can be called on each individual datum (Python dictionary object in this case).\n",
"\n",
"The following example shows how we can stream through individual data points by iterating over a dataframe and converting each row to a dictionary:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"profile2 = session.new_profile(dataset_name=\"in-memory\", \n",
" dataset_timestamp=datetime.datetime(2017, 1, 1, 0, 0))\n",
"for i, row in data.iterrows():\n",
" profile2.track(row.to_dict())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The counters in `profile2` are updated incrementally as each row is tracked. The two profiles can then be merged:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"merged_profile = profile.merge(profile2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Merging isn't limited to profiles built within one session: profiles from different sessions can be merged as well to get a holistic view. Here, both profiles tracked the same 309 rows, so the merged `dti` count is 618:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"309\n",
"309\n",
"618\n"
]
}
],
"source": [
"print(profile.columns['dti'].counters.count)\n",
"print(profile2.columns['dti'].counters.count)\n",
"print(merged_profile.columns['dti'].counters.count)"
]
},
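{
"cell_type": "markdown",
"metadata": {},
"source": [
"The counts add up because a profile stores summaries that can be combined without revisiting the raw data. The cell below is a minimal sketch of that idea, not whylogs code: `ToyCounter` is a hypothetical class that tracks only a count, minimum, and maximum, and the sample values are made up for illustration. The real whylogs sketches are considerably more sophisticated."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dataclasses import dataclass\n",
"\n",
"\n",
"@dataclass\n",
"class ToyCounter:\n",
"    # Toy mergeable summary (illustration only, not a whylogs class)\n",
"    count: int = 0\n",
"    minimum: float = float('inf')\n",
"    maximum: float = float('-inf')\n",
"\n",
"    def track(self, value):\n",
"        # Streaming update: fold one value into the summary\n",
"        self.count += 1\n",
"        self.minimum = min(self.minimum, value)\n",
"        self.maximum = max(self.maximum, value)\n",
"\n",
"    def merge(self, other):\n",
"        # Merging needs only the two summaries, never the raw values\n",
"        return ToyCounter(self.count + other.count,\n",
"                          min(self.minimum, other.minimum),\n",
"                          max(self.maximum, other.maximum))\n",
"\n",
"\n",
"a, b = ToyCounter(), ToyCounter()\n",
"for v in [1000.0, 12000.0, 40000.0]:  # made-up values\n",
"    a.track(v)\n",
"for v in [500.0, 7350.0]:  # made-up values\n",
"    b.track(v)\n",
"\n",
"merged = a.merge(b)\n",
"print(merged.count, merged.minimum, merged.maximum)"
]
},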
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## whylogs output\n",
"\n",
"We can see that this summary object is much smaller at **150 rows x 32 columns** than the original dataset at **1000 rows x 151 columns**. Smaller storage sizes reduce costs and make it easier for data scientists to monitor and analyze large amounts of data.\n",
"\n",
"Each row of our flat profile summary describes one variable from the original dataset, whose name appears in the column called `column`.\n",
"\n",
"We can also see a number of useful metrics as columns in our summary: descriptive statistics, type information, unique estimates and bounds, as well as specially formulated metrics like `inferred_dtype` and `dtype_fraction`.\n",
"\n",
"Let's explore the output of the whylogs profiler to check on a few of the interesting variables we mentioned earlier. For example, let's look at the `funded_amnt` variable."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 65\n",
"column funded_amnt\n",
"count 309.0\n",
"null_count 0.0\n",
"bool_count 0.0\n",
"numeric_count 309.0\n",
"max 40000.0\n",
"mean 14511.407767\n",
"min 1000.0\n",
"stddev 9011.80195\n",
"nunique_numbers 117.0\n",
"nunique_numbers_lower 117.0\n",
"nunique_numbers_upper 117.0\n",
"inferred_dtype 2.0\n",
"dtype_fraction 1.0\n",
"type_unknown_count 0.0\n",
"type_null_count 0.0\n",
"type_fractional_count 309.0\n",
"type_integral_count 0.0\n",
"type_boolean_count 0.0\n",
"type_string_count 0.0\n",
"nunique_str 0.0\n",
"nunique_str_lower 0.0\n",
"nunique_str_upper 0.0\n",
"quantile_0.0000 1000.0\n",
"quantile_0.0100 1200.0\n",
"quantile_0.0500 3200.0\n",
"quantile_0.2500 7350.0\n",
"quantile_0.5000 12000.0\n",
"quantile_0.7500 20000.0\n",
"quantile_0.9500 35000.0\n",
"quantile_0.9900 36000.0\n",
"quantile_1.0000 40000.0"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"summary[summary['column']=='funded_amnt'].T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may notice that the count for this variable was recorded as **309** records, with a minimum loan amount of **\\$1,000.00 USD** and a maximum loan amount of **\\$40,000.00 USD**.\n",
"\n",
"For numerical variables like `funded_amnt`, we can view additional information in the histograms dictionary from the profile summaries object. The variable's histogram object contains bin edges along with counts."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'bin_edges': [1000.0, 2300.0001333333335, 3600.000266666667, 4900.000400000001, 6200.000533333334, 7500.000666666667, 8800.000800000002, 10100.000933333335, 11400.001066666668, 12700.0012, 14000.001333333334, 15300.001466666668, 16600.001600000003, 17900.001733333334, 19200.00186666667, 20500.002, 21800.002133333335, 23100.00226666667, 24400.0024, 25700.002533333336, 27000.002666666667, 28300.002800000002, 29600.002933333337, 30900.003066666668, 32200.003200000003, 33500.00333333334, 34800.00346666667, 36100.003600000004, 37400.00373333334, 38700.00386666667, 40000.004], 'counts': [7, 12, 11, 34, 14, 19, 32, 8, 24, 9, 22, 14, 9, 9, 24, 7, 3, 5, 8, 2, 5, 3, 5, 3, 2, 0, 15, 0, 0, 3]}\n"
]
}
],
"source": [
"histograms = summaries['hist']\n",
"print(histograms['funded_amnt'])"
]
},
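{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the relationship between bin edges and counts concrete, the sketch below pairs each count with its left and right edge, then reports the total and the fullest bin. The helper `describe_histogram` is hypothetical (it is not part of whylogs), and the dictionary is copied from the output above so that the cell runs on its own; in this notebook you could pass `histograms['funded_amnt']` instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def describe_histogram(hist):\n",
"    # Hypothetical helper, not a whylogs API\n",
"    edges, counts = hist['bin_edges'], hist['counts']\n",
"    # n bins are delimited by n + 1 edges\n",
"    assert len(edges) == len(counts) + 1\n",
"    # Each bin becomes a (left_edge, right_edge, count) tuple\n",
"    bins = list(zip(edges[:-1], edges[1:], counts))\n",
"    fullest = max(bins, key=lambda b: b[2])\n",
"    return {'total': sum(counts), 'fullest_bin': fullest}\n",
"\n",
"\n",
"# Copied from the printed histogram above\n",
"funded_amnt_hist = {\n",
"    'bin_edges': [1000.0, 2300.0001333333335, 3600.000266666667, 4900.000400000001,\n",
"                  6200.000533333334, 7500.000666666667, 8800.000800000002, 10100.000933333335,\n",
"                  11400.001066666668, 12700.0012, 14000.001333333334, 15300.001466666668,\n",
"                  16600.001600000003, 17900.001733333334, 19200.00186666667, 20500.002,\n",
"                  21800.002133333335, 23100.00226666667, 24400.0024, 25700.002533333336,\n",
"                  27000.002666666667, 28300.002800000002, 29600.002933333337, 30900.003066666668,\n",
"                  32200.003200000003, 33500.00333333334, 34800.00346666667, 36100.003600000004,\n",
"                  37400.00373333334, 38700.00386666667, 40000.004],\n",
"    'counts': [7, 12, 11, 34, 14, 19, 32, 8, 24, 9, 22, 14, 9, 9, 24,\n",
"               7, 3, 5, 8, 2, 5, 3, 5, 3, 2, 0, 15, 0, 0, 3],\n",
"}\n",
"\n",
"result = describe_histogram(funded_amnt_hist)\n",
"print(result['total'])\n",
"print(result['fullest_bin'])"
]
},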
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For another variable, `loan_status`, a different set of metrics is more informative, because loan status is a categorical field whose values are strings.\n",
"\n",
"Let's look at a few relevant metrics for this and other string variables."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" type_string_count type_null_count nunique_str nunique_str_lower \\\n",
"138 309.0 0.0 6.0 6.0 \n",
"\n",
" nunique_str_upper \n",
"138 6.0 "
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"summary[summary['column']=='loan_status'][['type_string_count', \n",
" 'type_null_count', \n",
" 'nunique_str', \n",
" 'nunique_str_lower', \n",
" 'nunique_str_upper']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that there are **309** elements of string type, and that the unique-string estimate is **6**. The lower and upper bounds for the estimate are also **6**, meaning that this count is exact. You will see many instances of this -- DataSketches in whylogs returns exact counts for numbers as high as 400 unique values.\n",
"\n",
"Let's now explore the frequent strings object from our profile summaries."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'Current': 239, 'Fully Paid': 54, 'Charged Off': 7, 'Late (31-120 days)': 5, 'In Grace Period': 3, 'Late (16-30 days)': 1}\n"
]
}
],
"source": [
"frequent_strings = summaries['frequent_strings']\n",
"print(frequent_strings['loan_status'])"
]
},
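{
"cell_type": "markdown",
"metadata": {},
"source": [
"Raw counts are easier to compare across months once they are expressed as shares of the total. The sketch below does that for the dictionary printed above (copied in so the cell runs on its own). The helper `to_fractions` is hypothetical, and it assumes the frequent-string counts are exact, which appears to hold here because they sum to the 309 records in the profile."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def to_fractions(freq):\n",
"    # Hypothetical helper: convert {value: count} into {value: share of total}\n",
"    total = sum(freq.values())\n",
"    return {value: count / total for value, count in freq.items()}\n",
"\n",
"\n",
"# Copied from the printed frequent strings above\n",
"loan_status_counts = {'Current': 239, 'Fully Paid': 54, 'Charged Off': 7,\n",
"                      'Late (31-120 days)': 5, 'In Grace Period': 3,\n",
"                      'Late (16-30 days)': 1}\n",
"\n",
"shares = to_fractions(loan_status_counts)\n",
"for value, share in shares.items():\n",
"    print(f'{value}: {share:.1%}')"
]
},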
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Writing data to disk\n",
"\n",
"If you want to write your data out manually rather than relying on the Logger framework (which is more opinionated), you can perform your own serialization and deserialization.\n",
"\n",
"whylogs uses protobuf for efficient storage. Here's how it works:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"profile.write_protobuf(\"profile.bin\")\n",
"roundtrip = profile.read_protobuf(\"profile.bin\")"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"150"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(roundtrip.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualizing multiple datasets across time with whylogs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To use the whylogs visualization tools, we'll need to import the `ProfileVisualizer` object and use the Altair visualization framework."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"from whylogs.viz import ProfileVisualizer\n",
"\n",
"viz = ProfileVisualizer()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we've explored data for a single month, let's calculate profile summaries for a series of months. Normally, we'd expect whylogs to be operating on future data, so these new datasets would originate from data seen at inference time.\n",
"\n",
"But in special cases like this demo or diagnosing data collected prior to whylogs integration, it may be helpful to backfill with past data. Here we'll loop through subsets of data to create a list of profile summaries."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ]"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a list of data profiles\n",
"remaining_dates = ['Feb-2017', 'Mar-2017', 'Apr-2017', 'May-2017', 'Jun-2017']\n",
"\n",
"profiles = [profile] # list with original profile\n",
"for date in remaining_dates:\n",
" timestamp = datetime.datetime.strptime(date, '%b-%Y')\n",
" subset_data = full_data[full_data['issue_d']==date]\n",
" subset_prof = session.profile_dataframe(subset_data, \"demo\", dataset_timestamp=timestamp)\n",
" profiles.append(subset_prof)\n",
"\n",
"profiles"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's pass this list of profiles into the visualizer."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"viz.set_profiles(profiles)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now explore temporal visualizations of our profiles at a quick glance."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"