{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Record metadata on Kubeflow from Notebooks\n", "> Demonstration of how lineage tracking works\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- categories: [jupyter]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Lineage Tracking\n", "* This blog post first guides you through the metadata SDK API: you create a notebook and log several actions to the metadata DB. Afterwards, you can navigate to the Kubeflow UI and open the resulting lineage graph, which gives you a graphical representation of the dependencies between the objects you logged using the SDK." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install the _Kubeflow-metadata_ library" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# To use the latest published `kubeflow-metadata` library, you can run:\n", "!pip install kubeflow-metadata --user\n", "# Install other packages:\n", "!pip install pandas --user\n", "# Then restart the Notebook kernel." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas\n", "from kubeflow.metadata import metadata\n", "from datetime import datetime\n", "from uuid import uuid4" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "METADATA_STORE_HOST = \"metadata-grpc-service.kubeflow\" # default DNS of Kubeflow Metadata gRPC service.\n", "METADATA_STORE_PORT = 8080" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a new Workspace and Run in a workspace\n", "* A [Workspace](https://github.com/kubeflow/metadata/blob/25b44da29213968a2c438d24aad3656cc86d0499/sdk/python/kubeflow/metadata/metadata.py#L92) groups a set of pipeline or notebook runs, and their related artifacts and executions\n", "* [Store](https://github.com/kubeflow/metadata/blob/25b44da29213968a2c438d24aad3656cc86d0499/sdk/python/kubeflow/metadata/metadata.py#L59) is an object that provides a connection to the Metadata gRPC service\n", "* The [Run](https://github.com/kubeflow/metadata/blob/25b44da29213968a2c438d24aad3656cc86d0499/sdk/python/kubeflow/metadata/metadata.py#L227) object captures a pipeline or notebook run in a workspace" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "ws1 = metadata.Workspace(\n", " # Connect to metadata service in namespace kubeflow in k8s cluster.\n", " store=metadata.Store(grpc_host=METADATA_STORE_HOST, grpc_port=METADATA_STORE_PORT),\n", " name=\"xgboost-synthetic\",\n", " description=\"workspace for xgboost-synthetic artifacts and executions\",\n", " labels={\"n1\": \"v1\"})" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "r = metadata.Run(\n", " workspace=ws1,\n", " name=\"xgboost-synthetic-faring-run\" + datetime.utcnow().isoformat(\"T\"),\n", " description=\"a notebook run\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"## Create an execution in a run\n", "* An [Execution](https://github.com/kubeflow/metadata/blob/25b44da29213968a2c438d24aad3656cc86d0499/sdk/python/kubeflow/metadata/metadata.py#L251) is a specific instance of a run, and you can bind specific input/output artifacts to this instance. An Execution also serves as the object for logging artifacts as its inputs or outputs" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": "An execution was created with id 290\n" } ], "source": [ "exec = metadata.Execution(\n", " name=\"execution\" + datetime.utcnow().isoformat(\"T\"),\n", " workspace=ws1,\n", " run=r,\n", " description=\"execution for training xgboost-synthetic\",\n", ")\n", "print(\"An execution was created with id %s\" % exec.id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Log a data set and a model\n", "* [log_input](https://github.com/kubeflow/metadata/blob/25b44da29213968a2c438d24aad3656cc86d0499/sdk/python/kubeflow/metadata/metadata.py#L319) logs an artifact as an input of this execution. Here exec.log_input accepts an artifact object as its argument; a [DataSet](https://github.com/kubeflow/metadata/blob/25b44da29213968a2c438d24aad3656cc86d0499/sdk/python/kubeflow/metadata/metadata.py#L412) is one kind of artifact. Each artifact type has different parameters, such as name, uri, and query. 
To create a DataSet artifact, call the ready-to-use metadata.DataSet API and provide its arguments\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": "Data set id is 171 with version 'data_set_version_cbebc757-0d76-4e1e-bbd9-02b065e4c3ea'\n" } ], "source": [ "data_set_version = \"data_set_version_\" + str(uuid4())\n", "data_set = exec.log_input(\n", " metadata.DataSet(\n", " description=\"xgboost synthetic data\",\n", " name=\"synthetic-data\",\n", " owner=\"someone@kubeflow.org\",\n", " uri=\"file://path/to/dataset\",\n", " version=data_set_version,\n", " query=\"SELECT * FROM mytable\"))\n", "print(\"Data set id is {0.id} with version '{0.version}'\".format(data_set))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* [log_output](https://github.com/kubeflow/metadata/blob/25b44da29213968a2c438d24aad3656cc86d0499/sdk/python/kubeflow/metadata/metadata.py#L337) logs an artifact as an output of this execution. Here exec.log_output accepts an artifact object as its argument; a [Model](https://github.com/kubeflow/metadata/blob/25b44da29213968a2c438d24aad3656cc86d0499/sdk/python/kubeflow/metadata/metadata.py#L518) is one kind of artifact. Each artifact type has different parameters, such as name, uri, and hyperparameters. 
To create a Model artifact, call the ready-to-use metadata.Model API and provide its arguments\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": "kubeflow.metadata.metadata.Model(workspace=None, name='MNIST', description='model to recognize handwritten digits', owner='someone@kubeflow.org', uri='gcs://my-bucket/mnist', version='model_version_50b419e2-af69-4c0e-a251-78246d4c0578', model_type='neural network', training_framework={'name': 'tensorflow', 'version': 'v1.0'}, hyperparameters={'learning_rate': 0.5, 'layers': [10, 3, 1], 'early_stop': True}, labels={'mylabel': 'l1'}, id=172, create_time='2019-12-04T00:44:49.444411Z', kwargs={})\n\nModel id is 172 and version is model_version_50b419e2-af69-4c0e-a251-78246d4c0578\n" } ], "source": [ "model_version = \"model_version_\" + str(uuid4())\n", "model = exec.log_output(\n", " metadata.Model(\n", " name=\"MNIST\",\n", " description=\"model to recognize handwritten digits\",\n", " owner=\"someone@kubeflow.org\",\n", " uri=\"gcs://my-bucket/mnist\",\n", " model_type=\"neural network\",\n", " training_framework={\n", " \"name\": \"tensorflow\",\n", " \"version\": \"v1.0\"\n", " },\n", " hyperparameters={\n", " \"learning_rate\": 0.5,\n", " \"layers\": [10, 3, 1],\n", " \"early_stop\": True\n", " },\n", " version=model_version,\n", " labels={\"mylabel\": \"l1\"}))\n", "print(model)\n", "print(\"\\nModel id is {0.id} and version is {0.version}\".format(model))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Log the evaluation of a model\n", "* [Metrics](https://github.com/kubeflow/metadata/blob/25b44da29213968a2c438d24aad3656cc86d0499/sdk/python/kubeflow/metadata/metadata.py#L639) captures the evaluation metrics of a model on a data set" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": "Metrics id is 173\n" } ], "source": [ "metrics = 
exec.log_output(\n", " metadata.Metrics(\n", " name=\"MNIST-evaluation\",\n", " description=\"validating the MNIST model to recognize handwritten digits\",\n", " owner=\"someone@kubeflow.org\",\n", " uri=\"gcs://my-bucket/mnist-eval.csv\",\n", " data_set_id=str(data_set.id),\n", " model_id=str(model.id),\n", " metrics_type=metadata.Metrics.VALIDATION,\n", " values={\"accuracy\": 0.95},\n", " labels={\"mylabel\": \"l1\"}))\n", "print(\"Metrics id is %s\" % metrics.id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Add Metadata for serving the model" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found the model with id 172 and version 'model_version_50b419e2-af69-4c0e-a251-78246d4c0578'.\n" ] } ], "source": [ "serving_application = metadata.Execution(\n", " name=\"serving model\",\n", " workspace=ws1,\n", " description=\"an execution to represent model serving component\",\n", ")\n", "# Note that we use the model name, version, and uri to uniquely identify an existing model.\n", "served_model = metadata.Model(\n", " name=\"MNIST\",\n", " uri=\"gcs://my-bucket/mnist\",\n", " version=model.version,\n", ")\n", "m = serving_application.log_input(served_model)\n", "print(\"Found the model with id {0.id} and version '{0.version}'.\".format(m))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Plot the lineage graph\n", "\n", "![](images-lineage/lineage.png)\n", "\n", "* The figure above shows an example of the lineage graph from our xgboost example. Follow the steps below to try it out:\n", "\n", "1. Follow the guide to [setting up your Jupyter notebooks in Kubeflow](https://www.kubeflow.org/docs/notebooks/setup/)\n", "2. Go back to your Jupyter notebook server in the Kubeflow UI. (If you’ve moved away from the notebooks section in Kubeflow, click Notebook Servers in the left-hand navigation panel to get back there.)\n", "3. 
In the Jupyter notebook UI, click Upload and follow the prompts to upload the [xgboost example](https://github.com/kubeflow/examples/blob/master/xgboost_synthetic/build-train-deploy.ipynb) notebook.\n", "4. Click the notebook name (build-train-deploy.ipynb) to open the notebook in your Kubeflow cluster.\n", "5. Run the steps in the notebook to install and use the Metadata SDK.\n", "6. Click Artifact Store in the left-hand navigation panel on the Kubeflow UI.\n", "7. Select Pipelines -> Artifacts\n", "8. Navigate to xgboost-synthetic-traing-eval\n", "9. Click on Lineage explorer\n", "\n" ] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 2 }