{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*
\n",
">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=BigQuery_Example)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=BigQuery_Example) to leverage the power of whylogs and WhyLabs together!*"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# whylogs with BigQuery\n",
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/integrations/BigQuery_Example.ipynb)\n",
"\n",
"\n",
"Hi there! 😀\n",
"In this example notebook, we will show you how to use `whylogs` with an existing BigQuery table. We will query the table with `pandas_gbq` and then profile the data with the read `pandas.DataFrame`. We will try to demonstrate some investigation scenarios you can do and also how to store this snapshot for further analysis and mergeability, to make sure you will keep track and make your ML and data pipelines more responsible.\n",
"\n",
"## Querying the data\n",
"To start off with this example notebook, you will need to open up a BigQuery sandbox environment on [this page](https://console.cloud.google.com/bigquery). This will allow us to query example datasets and also upload our own data, with their environment limitations. If you want to follow along, simply replace the `project_id` variable to your set project and you should be good to go.\n",
"\n",
"Then let's make the needed imports and also install the libraries we will use for this example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Note: you may need to restart the kernel to use updated packages.\n",
"%pip install 'whylogs[viz]'\n",
"%pip install pandas-gbq\n",
"%pip install tqdm"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"import pandas_gbq\n",
"import whylogs as why"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Inform your GCP project id and it will prompt a one-time login access on the UI and create an environment config file for you."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"project_id = \"my-project-id\" # Update here with your project"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Let's query a public dataset and keep the data small for this demo purpose. With `pandas_gbq` we will end up with a pandas.DataFrame object"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"sql = \"\"\"\n",
"SELECT pack, bottle_volume_ml, state_bottle_cost, sale_dollars, city\n",
"FROM `bigquery-public-data.iowa_liquor_sales.sales`\n",
"LIMIT 1000\n",
"\"\"\"\n",
"df = pandas_gbq.read_gbq(sql, project_id=project_id)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Profiling with whylogs\n",
"\n",
"Now let's profile this dataset with whylogs and see what we can do with profiles. The first thing will be to turn it into a pandas DataFrame and inspect what metrics we can get out of the box"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | cardinality/est | \n", "cardinality/lower_1 | \n", "cardinality/upper_1 | \n", "counts/n | \n", "counts/null | \n", "distribution/max | \n", "distribution/mean | \n", "distribution/median | \n", "distribution/min | \n", "distribution/n | \n", "... | \n", "distribution/stddev | \n", "frequent_items/frequent_strings | \n", "ints/max | \n", "ints/min | \n", "type | \n", "types/boolean | \n", "types/fractional | \n", "types/integral | \n", "types/object | \n", "types/string | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
column | \n", "\n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " |
bottle_volume_ml | \n", "21.000001 | \n", "21.000000 | \n", "21.001050 | \n", "1000 | \n", "0 | \n", "4800.00 | \n", "893.20700 | \n", "1000.00 | \n", "20.0 | \n", "1000 | \n", "... | \n", "449.242697 | \n", "[FrequentItem(value='1000.000000', est=461, up... | \n", "4800.0 | \n", "20.0 | \n", "SummaryType.COLUMN | \n", "0 | \n", "0 | \n", "1000 | \n", "0 | \n", "0 | \n", "
city | \n", "238.000140 | \n", "238.000000 | \n", "238.012023 | \n", "1000 | \n", "5 | \n", "NaN | \n", "0.00000 | \n", "NaN | \n", "NaN | \n", "0 | \n", "... | \n", "0.000000 | \n", "[FrequentItem(value='Des Moines', est=67, uppe... | \n", "NaN | \n", "NaN | \n", "SummaryType.COLUMN | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "995 | \n", "
pack | \n", "11.000000 | \n", "11.000000 | \n", "11.000549 | \n", "1000 | \n", "0 | \n", "48.00 | \n", "11.02300 | \n", "12.00 | \n", "1.0 | \n", "1000 | \n", "... | \n", "4.664786 | \n", "[FrequentItem(value='12.000000', est=587, uppe... | \n", "48.0 | \n", "1.0 | \n", "SummaryType.COLUMN | \n", "0 | \n", "0 | \n", "1000 | \n", "0 | \n", "0 | \n", "
sale_dollars | \n", "531.304754 | \n", "524.519438 | \n", "538.259242 | \n", "1000 | \n", "0 | \n", "37514.88 | \n", "599.49446 | \n", "119.96 | \n", "0.0 | \n", "1000 | \n", "... | \n", "1677.743947 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "SummaryType.COLUMN | \n", "0 | \n", "1000 | \n", "0 | \n", "0 | \n", "0 | \n", "
state_bottle_cost | \n", "259.000166 | \n", "259.000000 | \n", "259.013098 | \n", "1000 | \n", "0 | \n", "137.13 | \n", "10.52482 | \n", "7.62 | \n", "0.0 | \n", "1000 | \n", "... | \n", "8.750237 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "SummaryType.COLUMN | \n", "0 | \n", "1000 | \n", "0 | \n", "0 | \n", "0 | \n", "
5 rows × 28 columns
\n", "