{
 "cells": [
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "# Record metadata on Kubeflow from Notebooks\n",
     "> Demonstration of how lineage tracking works\n",
     "\n",
     "- toc: true \n",
     "- badges: true\n",
     "- comments: true\n",
     "- categories: [jupyter]"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "# Lineage Tracking\n",
     "* This blog post will first guide you through the metadata SDK API, to create a notebook and log several actions to the metadata DB. Afterwards, you will be able to navigate to the Kubeflow UI and the resulting lineage graph, which gives you a graphical representation of the dependencies between the objects you logged using the SDK."
    ]
   },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install the _Kubeflow-metadata_ library"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To use the latest publish `kubeflow-metadata` library, you can run:\n",
    "!pip install kubeflow-metadata --user\n",
    "# Install other packages:\n",
    "!pip install pandas --user\n",
    "# Then restart the Notebook kernel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas\n",
    "from kubeflow.metadata import metadata\n",
    "from datetime import datetime\n",
    "from uuid import uuid4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "METADATA_STORE_HOST = \"metadata-grpc-service.kubeflow\" # default DNS of Kubeflow Metadata gRPC serivce.\n",
    "METADATA_STORE_PORT = 8080"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a new Workspace and Run in a workspace\n",
    "* A [Workspace](https://github.com/kubeflow/metadata/blob/25b44da29213968a2c438d24aad3656cc86d0499/sdk/python/kubeflow/metadata/metadata.py#L92) groups a set of pipelines or notebooks runs, and their related artifacts and executions\n",
    "* [Store](https://github.com/kubeflow/metadata/blob/25b44da29213968a2c438d24aad3656cc86d0499/sdk/python/kubeflow/metadata/metadata.py#L59) is an object that provides a connection to the Metadata gRPC service\n",
    "* The [Run](https://github.com/kubeflow/metadata/blob/25b44da29213968a2c438d24aad3656cc86d0499/sdk/python/kubeflow/metadata/metadata.py#L227) object captures a pipeline or notebook run in a workspace"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "ws1 = metadata.Workspace(\n",
    "    # Connect to metadata service in namespace kubeflow in k8s cluster.\n",
    "    store=metadata.Store(grpc_host=METADATA_STORE_HOST, grpc_port=METADATA_STORE_PORT),\n",
    "    name=\"xgboost-synthetic\",\n",
    "    description=\"workspace for xgboost-synthetic artifacts and executions\",\n",
    "    labels={\"n1\": \"v1\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "r = metadata.Run(\n",
    "    workspace=ws1,\n",
    "    name=\"xgboost-synthetic-faring-run\" + datetime.utcnow().isoformat(\"T\") ,\n",
    "    description=\"a notebook run\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create an execution in a run\n",
    "* An [Execution](https://github.com/kubeflow/metadata/blob/25b44da29213968a2c438d24aad3656cc86d0499/sdk/python/kubeflow/metadata/metadata.py#L251) is a specific instance of a run, and you can bind specific input/output artifacts to this instance. Execution also serves as object for logging artifacts as its input or output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": "An execution was created with id 290\n"
    }
   ],
   "source": [
    "exec = metadata.Execution(\n",
    "    name = \"execution\" + datetime.utcnow().isoformat(\"T\") ,\n",
    "    workspace=ws1,\n",
    "    run=r,\n",
    "    description=\"execution for training xgboost-synthetic\",\n",
    ")\n",
    "print(\"An execution was created with id %s\" % exec.id)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Log a data set and a model\n",
    "* A [Log_input](https://github.com/kubeflow/metadata/blob/25b44da29213968a2c438d24aad3656cc86d0499/sdk/python/kubeflow/metadata/metadata.py#L319) log an artifact as an input of this execution. Here exec.log_input accept an artifact class as an argument, a [DataSet](https://github.com/kubeflow/metadata/blob/25b44da29213968a2c438d24aad3656cc86d0499/sdk/python/kubeflow/metadata/metadata.py#L412) is an artifact. Every artifacts has different paramenters such as name, uri, query. The way to create DataSet artifact is calling ready-to-use APIs metadata.DataSet and provide arguments\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": "Data set id is 171 with version 'data_set_version_cbebc757-0d76-4e1e-bbd9-02b065e4c3ea'\n"
    }
   ],
   "source": [
    "date_set_version = \"data_set_version_\" + str(uuid4())\n",
    "data_set = exec.log_input(\n",
    "        metadata.DataSet(\n",
    "            description=\"xgboost synthetic data\",\n",
    "            name=\"synthetic-data\",\n",
    "            owner=\"someone@kubeflow.org\",\n",
    "            uri=\"file://path/to/dataset\",\n",
    "            version=\"v1.0.0\",\n",
    "            query=\"SELECT * FROM mytable\"))\n",
    "print(\"Data set id is {0.id} with version '{0.version}'\".format(data_set))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* A [Log_output](https://github.com/kubeflow/metadata/blob/25b44da29213968a2c438d24aad3656cc86d0499/sdk/python/kubeflow/metadata/metadata.py#L337) log an artifact as a output of this execution. Here exec.log_output accept an artifact class as an argument, a [Model](https://github.com/kubeflow/metadata/blob/25b44da29213968a2c438d24aad3656cc86d0499/sdk/python/kubeflow/metadata/metadata.py#L518) is an artifact. Every artifacts has different paramenters such as name, uri, hyperparameters. The way to create Model artifact is calling ready-to-use APIs metadata.Model and provide arguments\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": "kubeflow.metadata.metadata.Model(workspace=None, name='MNIST', description='model to recognize handwritten digits', owner='someone@kubeflow.org', uri='gcs://my-bucket/mnist', version='model_version_50b419e2-af69-4c0e-a251-78246d4c0578', model_type='neural network', training_framework={'name': 'tensorflow', 'version': 'v1.0'}, hyperparameters={'learning_rate': 0.5, 'layers': [10, 3, 1], 'early_stop': True}, labels={'mylabel': 'l1'}, id=172, create_time='2019-12-04T00:44:49.444411Z', kwargs={})\n\nModel id is 172 and version is model_version_50b419e2-af69-4c0e-a251-78246d4c0578\n"
    }
   ],
   "source": [
    "model_version = \"model_version_\" + str(uuid4())\n",
    "model = exec.log_output(\n",
    "    metadata.Model(\n",
    "            name=\"MNIST\",\n",
    "            description=\"model to recognize handwritten digits\",\n",
    "            owner=\"someone@kubeflow.org\",\n",
    "            uri=\"gcs://my-bucket/mnist\",\n",
    "            model_type=\"neural network\",\n",
    "            training_framework={\n",
    "                \"name\": \"tensorflow\",\n",
    "                \"version\": \"v1.0\"\n",
    "            },\n",
    "            hyperparameters={\n",
    "                \"learning_rate\": 0.5,\n",
    "                \"layers\": [10, 3, 1],\n",
    "                \"early_stop\": True\n",
    "            },\n",
    "            version=model_version,\n",
    "            labels={\"mylabel\": \"l1\"}))\n",
    "print(model)\n",
    "print(\"\\nModel id is {0.id} and version is {0.version}\".format(model))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Log the evaluation of a model\n",
    "* [Metrics](https://github.com/kubeflow/metadata/blob/25b44da29213968a2c438d24aad3656cc86d0499/sdk/python/kubeflow/metadata/metadata.py#L639) captures an evaluation metrics of a model on a data set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": "Metrics id is 173\n"
    }
   ],
   "source": [
    "metrics = exec.log_output(\n",
    "    metadata.Metrics(\n",
    "            name=\"MNIST-evaluation\",\n",
    "            description=\"validating the MNIST model to recognize handwritten digits\",\n",
    "            owner=\"someone@kubeflow.org\",\n",
    "            uri=\"gcs://my-bucket/mnist-eval.csv\",\n",
    "            data_set_id=str(data_set.id),\n",
    "            model_id=str(model.id),\n",
    "            metrics_type=metadata.Metrics.VALIDATION,\n",
    "            values={\"accuracy\": 0.95},\n",
    "            labels={\"mylabel\": \"l1\"}))\n",
    "print(\"Metrics id is %s\" % metrics.id)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Add Metadata for serving the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found the mode with id 172 and version 'model_version_50b419e2-af69-4c0e-a251-78246d4c0578'.\n"
     ]
    }
   ],
   "source": [
    "serving_application = metadata.Execution(\n",
    "    name=\"serving model\",\n",
    "    workspace=ws1,\n",
    "    description=\"an execution to represent model serving component\",\n",
    ")\n",
    "# Noticed we use model name, version, uri to uniquely identify existing model.\n",
    "served_model = metadata.Model(\n",
    "    name=\"MNIST\",\n",
    "    uri=\"gcs://my-bucket/mnist\",\n",
    "    version=model.version,\n",
    ")\n",
    "m=serving_application.log_input(served_model)\n",
    "print(\"Found the mode with id {0.id} and version '{0.version}'.\".format(m))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Plot the lineage graph\n",
    "\n",
    "![](images-lineage/lineage.png)\n",
    "\n",
    "* The figure above shows an example of the lineage graph from our xgboost example. Follow below steps for you to try out:\n",
    "\n",
    "1. Follow the guide to [setting up your Jupyter notebooks in Kubeflow](https://www.kubeflow.org/docs/notebooks/setup/)\n",
    "2. Go back to your Jupyter notebook server in the Kubeflow UI. (If you’ve moved away from the notebooks section in Kubeflow, click Notebook Servers in the left-hand navigation panel to get back there.)\n",
    "3. In the Jupyter notebook UI, click Upload and follow the prompts to upload the [xgboost example](https://github.com/kubeflow/examples/blob/master/xgboost_synthetic/build-train-deploy.ipynb) notebook.\n",
    "4. Click the notebook name (build-train-deploy.ipynb.ipynb) to open the notebook in your Kubeflow cluster.\n",
    "5. Run the steps in the notebook to install and use the Metadata SDK.\n",
    "6. Click Artifact Store in the left-hand navigation panel on the Kubeflow UI.\n",
    "7. Select Pipelines -> Artifacts\n",
    "8. Navigate to xgboost-synthetic-traing-eval\n",
    "9. Click on Lineage explorer\n",
    "\n"
  ]
 }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}