{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Saving Profiles to S3 \n", "---" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "from whylogs import get_or_create_session\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The autoreload extension is already loaded. To reload it, use:\n", " %reload_ext autoreload\n" ] } ], "source": [ "\n", "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a mock s3 server \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this example we will create a fake s3 server using moto lib. You should remove this section if you have you own bucket setup on aws. Make sure you have your aws configuration set. By default this mock server creates a server in region \"us-east-1\"" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "BUCKET=\"super_awesome_bucket\"" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "s3.Bucket(name='super_awesome_bucket')" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from moto import mock_s3\n", "from moto.s3.responses import DEFAULT_REGION_NAME\n", "import boto3\n", "\n", "mocks3 = mock_s3()\n", "mocks3.start()\n", "res = boto3.resource('s3', region_name=DEFAULT_REGION_NAME)\n", "res.create_bucket(Bucket=BUCKET)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can go by our usual way, load a example csv data" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"lending_club_1000.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Config File\n", "---\n", "Seting up whylogs to save your data on s3 can be in several ways. Simplest is to simply create a config file,where each data format can be saved to a specific location. As shown below " ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "CONFIG = \"\"\"\n", "project: s3_example_project\n", "pipeline: latest_results\n", "verbose: false\n", "writers:\n", "- formats:\n", " - protobuf\n", " output_path: s3://super_awesome_bucket/\n", " path_template: $name/dataset_summary\n", " filename_template: dataset_summary\n", " type: s3\n", "- formats:\n", " - flat\n", " output_path: s3://super_awesome_bucket/\n", " path_template: $name/dataset_summary\n", " filename_template: dataset_summary\n", " type: s3\n", "- formats:\n", " - json\n", " output_path: s3://super_awesome_bucket/\n", " path_template: $name/dataset_summary\n", " filename_template: dataset_summary\n", " type: s3\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "config_path=\".whylogs.yaml\"\n", "with open(\".whylogs.yaml\",\"w\") as file:\n", " file.write(CONFIG)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Checking the content:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "project: s3_example_project\r\n", "pipeline: latest_results\r\n", "verbose: false\r\n", "writers:\r\n", "- formats:\r\n", " - protobuf\r\n", " output_path: s3://super_awesome_bucket/\r\n", " path_template: $name/dataset_summary\r\n", " filename_template: dataset_summary\r\n", " type: s3\r\n", "- formats:\r\n", " - flat\r\n", " output_path: s3://super_awesome_bucket/\r\n", " path_template: $name/dataset_summary\r\n", " filename_template: dataset_summary\r\n", " type: s3\r\n", "- formats:\r\n", " - json\r\n", " output_path: s3://super_awesome_bucket/\r\n", " path_template: $name/dataset_summary\r\n", " filename_template: dataset_summary\r\n", " type: s3\r\n" ] } ], "source": [ "%cat .whylogs.yaml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have a custom name for your config file or place it in a special location you can use the helper function" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "cache: 1\n", "pipeline: latest_results\n", "project: s3_example_project\n", "verbose: false\n", "with_rotation_time: null\n", "writers:\n", "- filename_template: \n", " formats:\n", " - OutputFormat.protobuf\n", " output_path: s3://super_awesome_bucket/\n", " path_template: \n", "- filename_template: \n", " formats:\n", " - OutputFormat.flat\n", " output_path: s3://super_awesome_bucket/\n", " path_template: \n", "- filename_template: \n", " formats:\n", " - OutputFormat.json\n", " output_path: s3://super_awesome_bucket/\n", " path_template: \n", "\n" ] } ], "source": [ "from whylogs.app.session import load_config, session_from_config\n", "config = load_config(\".whylogs.yaml\")\n", "session = session_from_config(config)\n", "print(session.get_config().to_yaml())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Otherwise if the file is located in your home directory or current location you are running, you can simply run `get_or_create_session()`" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "cache: 1\n", "pipeline: latest_results\n", "project: s3_example_project\n", "verbose: false\n", "with_rotation_time: null\n", "writers:\n", "- filename_template: \n", " formats:\n", " - OutputFormat.protobuf\n", " output_path: s3://super_awesome_bucket/\n", " path_template: \n", "- filename_template: \n", " formats:\n", " - OutputFormat.flat\n", " output_path: s3://super_awesome_bucket/\n", " path_template: \n", "- filename_template: \n", " formats:\n", " - OutputFormat.json\n", " output_path: s3://super_awesome_bucket/\n", " path_template: \n", "\n" ] } ], "source": [ "session= get_or_create_session()\n", "print(session.get_config().to_yaml())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loggin Data \n", "--- \n", "The data can be save by simply closing a logger, or one a logger is out of scope." ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "with session.logger(\"dataset_test_s3\") as logger:\n", " logger.log_dataframe(df)\n" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['dataset_test_s3/dataset_summary/flat_table/dataset_summary.csv',\n", " 'dataset_test_s3/dataset_summary/freq_numbers/dataset_summary.json',\n", " 'dataset_test_s3/dataset_summary/frequent_strings/dataset_summary.json',\n", " 'dataset_test_s3/dataset_summary/histogram/dataset_summary.json',\n", " 'dataset_test_s3/dataset_summary/json/dataset_summary.json',\n", " 'dataset_test_s3/dataset_summary/protobuf/dataset_summary.bin']" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client = boto3.client('s3')\n", "objects = client.list_objects(Bucket=BUCKET)\n", "[obj[\"Key\"] for obj in objects[\"Contents\"]]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can define the configure for were the data is save through a configuration file or creating a custom writer.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Close mock s3 server " ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "mocks3.stop()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }