{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ ">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*
\n", ">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=BigQuery_Example)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=BigQuery_Example) to leverage the power of whylogs and WhyLabs together!*" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "# whylogs with BigQuery\n", "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/integrations/BigQuery_Example.ipynb)\n", "\n", "\n", "Hi there! 😀\n", "In this example notebook, we will show you how to use `whylogs` with an existing BigQuery table. We will query the table with `pandas_gbq` and then profile the data with the read `pandas.DataFrame`. We will try to demonstrate some investigation scenarios you can do and also how to store this snapshot for further analysis and mergeability, to make sure you will keep track and make your ML and data pipelines more responsible.\n", "\n", "## Querying the data\n", "To start off with this example notebook, you will need to open up a BigQuery sandbox environment on [this page](https://console.cloud.google.com/bigquery). This will allow us to query example datasets and also upload our own data, with their environment limitations. If you want to follow along, simply replace the `project_id` variable to your set project and you should be good to go.\n", "\n", "Then let's make the needed imports and also install the libraries we will use for this example" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Note: you may need to restart the kernel to use updated packages.\n", "%pip install 'whylogs[viz]'\n", "%pip install pandas-gbq\n", "%pip install tqdm" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import pandas_gbq\n", "import whylogs as why" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "Inform your GCP project id and it will prompt a one-time login access on the UI and create an environment config file for you." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "project_id = \"my-project-id\" # Update here with your project" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "Let's query a public dataset and keep the data small for this demo purpose. With `pandas_gbq` we will end up with a pandas.DataFrame object" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "sql = \"\"\"\n", "SELECT pack, bottle_volume_ml, state_bottle_cost, sale_dollars, city\n", "FROM `bigquery-public-data.iowa_liquor_sales.sales`\n", "LIMIT 1000\n", "\"\"\"\n", "df = pandas_gbq.read_gbq(sql, project_id=project_id)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "## Profiling with whylogs\n", "\n", "Now let's profile this dataset with whylogs and see what we can do with profiles. The first thing will be to turn it into a pandas DataFrame and inspect what metrics we can get out of the box" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cardinality/estcardinality/lower_1cardinality/upper_1counts/ncounts/nulldistribution/maxdistribution/meandistribution/mediandistribution/mindistribution/n...distribution/stddevfrequent_items/frequent_stringsints/maxints/mintypetypes/booleantypes/fractionaltypes/integraltypes/objecttypes/string
column
bottle_volume_ml21.00000121.00000021.001050100004800.00893.207001000.0020.01000...449.242697[FrequentItem(value='1000.000000', est=461, up...4800.020.0SummaryType.COLUMN00100000
city238.000140238.000000238.01202310005NaN0.00000NaNNaN0...0.000000[FrequentItem(value='Des Moines', est=67, uppe...NaNNaNSummaryType.COLUMN0000995
pack11.00000011.00000011.0005491000048.0011.0230012.001.01000...4.664786[FrequentItem(value='12.000000', est=587, uppe...48.01.0SummaryType.COLUMN00100000
sale_dollars531.304754524.519438538.2592421000037514.88599.49446119.960.01000...1677.743947NaNNaNNaNSummaryType.COLUMN01000000
state_bottle_cost259.000166259.000000259.01309810000137.1310.524827.620.01000...8.750237NaNNaNNaNSummaryType.COLUMN01000000
\n", "

5 rows × 28 columns

\n", "
" ], "text/plain": [ " cardinality/est cardinality/lower_1 cardinality/upper_1 \\\n", "column \n", "bottle_volume_ml 21.000001 21.000000 21.001050 \n", "city 238.000140 238.000000 238.012023 \n", "pack 11.000000 11.000000 11.000549 \n", "sale_dollars 531.304754 524.519438 538.259242 \n", "state_bottle_cost 259.000166 259.000000 259.013098 \n", "\n", " counts/n counts/null distribution/max distribution/mean \\\n", "column \n", "bottle_volume_ml 1000 0 4800.00 893.20700 \n", "city 1000 5 NaN 0.00000 \n", "pack 1000 0 48.00 11.02300 \n", "sale_dollars 1000 0 37514.88 599.49446 \n", "state_bottle_cost 1000 0 137.13 10.52482 \n", "\n", " distribution/median distribution/min distribution/n ... \\\n", "column ... \n", "bottle_volume_ml 1000.00 20.0 1000 ... \n", "city NaN NaN 0 ... \n", "pack 12.00 1.0 1000 ... \n", "sale_dollars 119.96 0.0 1000 ... \n", "state_bottle_cost 7.62 0.0 1000 ... \n", "\n", " distribution/stddev \\\n", "column \n", "bottle_volume_ml 449.242697 \n", "city 0.000000 \n", "pack 4.664786 \n", "sale_dollars 1677.743947 \n", "state_bottle_cost 8.750237 \n", "\n", " frequent_items/frequent_strings \\\n", "column \n", "bottle_volume_ml [FrequentItem(value='1000.000000', est=461, up... \n", "city [FrequentItem(value='Des Moines', est=67, uppe... \n", "pack [FrequentItem(value='12.000000', est=587, uppe... \n", "sale_dollars NaN \n", "state_bottle_cost NaN \n", "\n", " ints/max ints/min type types/boolean \\\n", "column \n", "bottle_volume_ml 4800.0 20.0 SummaryType.COLUMN 0 \n", "city NaN NaN SummaryType.COLUMN 0 \n", "pack 48.0 1.0 SummaryType.COLUMN 0 \n", "sale_dollars NaN NaN SummaryType.COLUMN 0 \n", "state_bottle_cost NaN NaN SummaryType.COLUMN 0 \n", "\n", " types/fractional types/integral types/object types/string \n", "column \n", "bottle_volume_ml 0 1000 0 0 \n", "city 0 0 0 995 \n", "pack 0 1000 0 0 \n", "sale_dollars 1000 0 0 0 \n", "state_bottle_cost 1000 0 0 0 \n", "\n", "[5 rows x 28 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results = why.log(df)\n", "profile_view = results.view()\n", "\n", "profile_view.to_pandas()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from whylogs.viz.extensions.reports.profile_summary import ProfileSummaryReport" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ProfileSummaryReport(target_view=profile_view).report()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "This report can be very useful, since it shows us in a bit more detail the characteristics of the distributions, and also lists some of the core metric names that we can refer to quickly." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "## Constraints check\n", "\n", "You can also create a constraints suite and generate a report with passed and failed constraints, by using the same `profile_view` object." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from whylogs.core.constraints import ConstraintsBuilder\n", "from whylogs.core.constraints.factories import greater_than_number, mean_between_range" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "builder = ConstraintsBuilder(dataset_profile_view=profile_view)\n", "builder.add_constraint(\n", " greater_than_number(column_name=\"bottle_volume_ml\", number=100)\n", ")\n", "builder.add_constraint(\n", " mean_between_range(column_name=\"sale_dollars\", lower=400.0, upper=700.0)\n", ")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "constraints = builder.build()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "constraints.validate()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "[('bottle_volume_ml greater than number 100', 0, 1),\n", " ('sale_dollars mean between 400.0 and 700.0 (inclusive)', 1, 0)]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "constraints.generate_constraints_report()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "You can also pass the constraints to the `NotebookProfileVisualizer` and generate a visualization of the report\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from whylogs.viz import NotebookProfileVisualizer\n", "\n", "visualization = NotebookProfileVisualizer()\n", "visualization.constraints_report(constraints=constraints)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "## Writing the profile results\n", "\n", "In order to write the results, all you have to do is execute the following cell. You will notice that it will create a lightweight binary file, that can be further read and iterated on with `whylogs`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "results.writer(\"local\").write()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Writing to WhyLabs\n", "\n", "The same way you can write it locally, you are also able to write to other locations. Here we will demonstrate how easy it is to write to WhyLabs. To use this writer without explicitly calling your secrets, make sure you have the following environment variables defined:\n", "\n", "- WHYLABS_API_KEY\n", "- WHYLABS_DEFAULT_DATASET_ID\n", "- WHYLABS_DEFAULT_ORG_ID\n", "\n", "To learn more about writing to WhyLabs, refer to [this example](https://github.com/whylabs/whylogs/blob/mainline/python/examples/integrations/writers/Writing_to_WhyLabs.ipynb)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results.writer(\"whylabs\").write()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "Users can also benefit from our API to write created profiles to different integration locations, such as `mlflow`, `s3` and more. If you want to learn more about whylogs and find out other cool features and integrations, check out our [examples page](https://github.com/whylabs/whylogs/tree/mainline/python/examples).\n", "Happy coding! 🚀 😄" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.10 ('.venv': poetry)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "vscode": { "interpreter": { "hash": "d39f874c9b8a97550ecbd783714b95e79c9b905449b34f44c40e3bf053b54b41" } } }, "nbformat": 4, "nbformat_minor": 2 }