{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "HFpLgHPVmtrt" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "L3wXps5XChOO" }, "source": [ "# Introduction\n", "\n", "[Artifacts](https://www.comet.com/site/products/artifacts-dataset-management/?utm_campaign-artifacts-launch&utm_source=colab-example&utm_medium=frontmatter) is a new tool that provides Machine Learning Teams with a convenient way to log, version, and access data from all parts of the experimentation pipeline. \n", "\n", "\n", "## What are Artifacts?\n", "\n", "We’ve built Comet Artifacts to help Machine Learning teams solve the challenges of iterating on datasets and tracking pipelines where the data generated from one experiment is fed into another experiment.\n", "\n", "An Artifact is composed of Artifact versions. Each Artifact has a name, a type, description, tags, and metadata.\n", "\n", "An Artifact version is a snapshot of files and assets, arranged in a folder-like logical structure. This snapshot can be tracked using metadata, a version number, tags, and aliases. 
A version tracks which experiments consumed it and which experiment produced it.\n", "\n", "For a more complete overview, [check out our full announcement here](https://www.comet.com/site/blog/announcing-comet-artifacts/?utm_campaign-artifacts-launch&utm_source=colab-example&utm_medium=additional-resources).\n" ] }, { "cell_type": "markdown", "metadata": { "id": "xiyFPPFi6OrT" }, "source": [ "# Setup\n", "\n", "Install Comet and initialize a Project to try out Artifacts." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9xG1W2gayTck" }, "outputs": [], "source": [ "%pip install comet_ml pandas scikit-learn joblib" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-HXASUjXyW3M" }, "outputs": [], "source": [ "import comet_ml\n", "\n", "comet_ml.init(project_name=\"guide-artifacts-demo\")" ] }, { "cell_type": "markdown", "metadata": { "id": "vLc_qnULNTS0" }, "source": [ "# Getting the Data\n", "\n", "For this example, we will use the California Housing Prices Dataset. Let's load the data and create training and test sets. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "VtsItkYN0t54" }, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.datasets import fetch_california_housing as load_data\n", "from sklearn.model_selection import train_test_split\n", "\n", "dataset = load_data()\n", "X, y = dataset.data, dataset.target\n", "featurecols = dataset.feature_names\n", "\n", "# Train-Test Split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\n", "\n", "train_df = pd.DataFrame(X_train, columns=featurecols)\n", "test_df = pd.DataFrame(X_test, columns=featurecols)\n", "\n", "train_df[\"target\"] = y_train\n", "test_df[\"target\"] = y_test" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-fDn9MNK1teM" }, "outputs": [], "source": [ "train_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "oAtIItHO1xec" }, "outputs": [], "source": [ "import os\n", "\n", "os.makedirs(\"./datasets\", exist_ok=True)\n", "\n", "train_df.to_csv(\"./datasets/train.csv\", index=False)\n", "test_df.to_csv(\"./datasets/test.csv\", index=False)" ] }, { "cell_type": "markdown", "metadata": { "id": "y0qRHLQBNYpl" }, "source": [ "# Creating an Artifact to track the Dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "LUyh2ZJ4Ur9y" }, "source": [ "Let's track our dataset with an Artifact. In order to create an Artifact, you will have to provide a name for it. You also have the option of providing additional information about the Artifact. You can provide a type string that identifies what kind of Artifact you are uploading (a model, dataset, etc). \n", "\n", "You can add alias identifiers to the Artifact, such as \"test data\" or \"staging model\". Later in this tutorial we will show you how Artifacts can be retrieved based on these aliases. 
\n", "\n", "Finally, you can attach a metadata dictionary to both the individual data assets uploaded to an Artifact as well as the Artifact itself. You can add any additional information about your Artifact in this dictionary. \n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "w7U5Q0Ou1-iP" }, "outputs": [], "source": [ "# Create a Comet Artifact\n", "artifact = comet_ml.Artifact(\n", " name=\"california\",\n", " artifact_type=\"dataset\",\n", " aliases=[\"raw\"],\n", " metadata={\"task\": \"regression\"},\n", ")\n", "\n", "# Add files to the Artifact\n", "for split, asset in zip(\n", " [\"train\", \"test\"], [\"./datasets/train.csv\", \"./datasets/test.csv\"]\n", "):\n", " artifact.add(asset, metadata={\"dataset_stage\": \"raw\", \"dataset_split\": split})\n", "\n", "experiment = comet_ml.Experiment()\n", "experiment.add_tag(\"upload\")\n", "experiment.log_artifact(artifact)\n", "\n", "experiment.end()" ] }, { "cell_type": "markdown", "metadata": { "id": "K9l11MP2Y5W3" }, "source": [ "In your Workspace, you will see an Artifacts tab where you can view the data that has been uploaded. " ] }, { "cell_type": "markdown", "metadata": { "id": "YeOo1D4TezCl" }, "source": [ "![Screenshot 2023-01-10 at 23-15-56 Comet.ml - Supercharging Machine Learning.png]()" ] }, { "cell_type": "markdown", "metadata": { "id": "lKlQUvmmZscF" }, "source": [ "Clicking on the Artifact will bring up the Version information and associated Metadata. " ] }, { "cell_type": "markdown", "metadata": { "id": "0_4q8B8ee_Ez" }, "source": [ "![Screenshot 2023-01-10 at 23-16-23 Comet.ml - Supercharging Machine Learning.png]()" ] }, { "cell_type": "markdown", "metadata": { "id": "DI34eveoNtuc" }, "source": [ "# Using an Artifact" ] }, { "cell_type": "markdown", "metadata": { "id": "G7YJYBrxNyj9" }, "source": [ "Now that we have an Artifact tracking our dataset, let's move on to using it to train a model! 
\n", "\n", "\n", "First, let's make a directory to download our Artifacts into." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "p-DvX6Pe2e0p" }, "outputs": [], "source": [ "! mkdir -p ./artifacts" ] }, { "cell_type": "markdown", "metadata": { "id": "3f47rvl9vPz4" }, "source": [ "### Download the Artifact" ] }, { "cell_type": "markdown", "metadata": { "id": "d0QtzIuzaOvL" }, "source": [ "We can fetch the Artifact we need using its name and either a version or an alias. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "89PLtHw53dX8" }, "outputs": [], "source": [ "experiment = comet_ml.Experiment()\n", "experiment.add_tag(\"train\")\n", "\n", "# Fetch the Artifact object from Comet\n", "name = \"california\"\n", "version_or_alias = \"raw\"\n", "artifact = experiment.get_artifact(name, version_or_alias=version_or_alias)\n", "\n", "# Download Artifact\n", "output_path = \"./artifacts\"\n", "artifact.download(output_path, overwrite_strategy=\"PRESERVE\")" ] }, { "cell_type": "markdown", "metadata": { "id": "vDr2VsVSRKfN" }, "source": [ "### Train a Model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UYLYqoMH6H08" }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "from joblib import dump\n", "\n", "# Load Data from Artifact\n", "train_df = pd.read_csv(\"./artifacts/train.csv\")\n", "test_df = pd.read_csv(\"./artifacts/test.csv\")\n", "\n", "y_train = train_df.pop(\"target\").values\n", "X_train = train_df.values\n", "\n", "y_test = test_df.pop(\"target\").values\n", "X_test = test_df.values\n", "\n", "# Initialize and Fit Model\n", "model = LinearRegression()\n", "model.fit(X_train, y_train)\n", "\n", "# Evaluate Model (score returns the R^2 coefficient)\n", "train_score = model.score(X_train, y_train)\n", "test_score = model.score(X_test, y_test)\n", "\n", "experiment.log_metric(\"train-score\", train_score)\n", "experiment.log_metric(\"test-score\", test_score)\n", "\n", "# Save Model\n", "model_path = 
\"./linear-model.pkl\"\n", "dump(model, model_path)" ] }, { "cell_type": "markdown", "metadata": { "id": "Whe6tZBNRV7R" }, "source": [ "# Log Model as an Artifact\n", "\n", "Let's log the model we just trained as an Artifact. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "GBL50p59RUui" }, "outputs": [], "source": [ "# Log Model as an Artifact\n", "model_artifact = comet_ml.Artifact(\n", " \"housing-model\", artifact_type=\"model\", aliases=[\"baseline\"]\n", ")\n", "model_artifact.add(model_path)\n", "experiment.log_artifact(model_artifact)" ] }, { "cell_type": "markdown", "metadata": { "id": "0_M1HZ8_dH-S" }, "source": [ "You can view the Artifacts produced and consumed by an Experiment in the \"Assets and Artifacts\" tab under Artifacts. Toggle the direction selector to filter by Input, which refers to Artifacts that were consumed, or Output, which refers to Artifacts that were produced. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "flCC_reqcFbF" }, "outputs": [], "source": [ "experiment.display(tab=\"assets\")\n", "experiment.end()" ] }, { "cell_type": "markdown", "metadata": { "id": "gQHLMYXMOCPq" }, "source": [ "# Updating an Artifact " ] }, { "cell_type": "markdown", "metadata": { "id": "6ablGuV8bNgV" }, "source": [ "Our scores on the raw dataset were not that great. Let's scale the data and update our Artifact with the scaled version. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "t1RtWsxy7uec" }, "outputs": [], "source": [ "# Scores aren't great, so let's scale the features and the target\n", "from sklearn.preprocessing import StandardScaler as Scaler\n", "\n", "experiment = comet_ml.Experiment()\n", "experiment.add_tag(\"upload\")\n", "\n", "# Fit the scalers on the training split only\n", "X_scaler = Scaler().fit(X_train)\n", "y_scaler = Scaler().fit(y_train.reshape(-1, 1))\n", "\n", "X_train_scaled = X_scaler.transform(X_train)\n", "X_test_scaled = X_scaler.transform(X_test)\n", "\n", "y_train_scaled = y_scaler.transform(y_train.reshape(-1, 1))\n", "y_test_scaled = y_scaler.transform(y_test.reshape(-1, 1))\n", "\n", "# Build the DataFrames from the scaled arrays, not the raw ones\n", "train_scaled_df = pd.DataFrame(X_train_scaled, columns=featurecols)\n", "test_scaled_df = pd.DataFrame(X_test_scaled, columns=featurecols)\n", "\n", "train_scaled_df[\"target\"] = y_train_scaled.ravel()\n", "test_scaled_df[\"target\"] = y_test_scaled.ravel()\n", "\n", "# index=False keeps the row index from being read back as a feature column\n", "train_scaled_df.to_csv(\"./datasets/train-scaled.csv\", index=False)\n", "test_scaled_df.to_csv(\"./datasets/test-scaled.csv\", index=False)\n", "\n", "# Update Artifact with Scaled Data\n", "scaled_dataset_artifact = comet_ml.Artifact(\n", " \"california\",\n", " artifact_type=\"dataset\",\n", " aliases=[\"standard-scaled\"],\n", " metadata={\"task\": \"regression\"},\n", ")\n", "# Add files to the Artifact\n", "for split, asset in zip(\n", " [\"train\", \"test\"], [\"./datasets/train-scaled.csv\", \"./datasets/test-scaled.csv\"]\n", "):\n", " scaled_dataset_artifact.add(\n", " asset, metadata={\"dataset_stage\": \"standard-scaled\", \"dataset_split\": split}\n", " )\n", "\n", "experiment.log_artifact(scaled_dataset_artifact)\n", "experiment.end()" ] }, { "cell_type": "markdown", "metadata": { "id": "Zt54EPm6OYlV" }, "source": [ "# Train a Model with the Latest Version of the Dataset Artifact" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9UuAGSYEFS7L" }, "outputs": [], "source": [ "experiment = comet_ml.Experiment()\n", "experiment.add_tag(\"train\")\n", "\n", "# Fetch the Artifact object 
from Comet\n", "name = \"california\"\n", "version_or_alias = \"standard-scaled\"\n", "artifact = experiment.get_artifact(name, version_or_alias=version_or_alias)\n", "\n", "\n", "# Download Artifact\n", "output_path = \"./artifacts\"\n", "artifact.download(output_path, overwrite_strategy=\"PRESERVE\")\n", "\n", "# Load Data from Artifact\n", "train_df = pd.read_csv(\"./artifacts/train-scaled.csv\")\n", "test_df = pd.read_csv(\"./artifacts/test-scaled.csv\")\n", "\n", "y_train = train_df.pop(\"target\").values\n", "X_train = train_df.values\n", "\n", "y_test = test_df.pop(\"target\").values\n", "X_test = test_df.values\n", "\n", "# Initialize and Fit Model\n", "model = LinearRegression()\n", "model.fit(X_train, y_train)\n", "\n", "# Evaluate Model (score returns the R^2 coefficient)\n", "train_score = model.score(X_train, y_train)\n", "test_score = model.score(X_test, y_test)\n", "\n", "experiment.log_metric(\"train-score\", train_score)\n", "experiment.log_metric(\"test-score\", test_score)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "t2Vc-NxGdFhK" }, "outputs": [], "source": [ "experiment.end()" ] }, { "cell_type": "markdown", "metadata": { "id": "_HQe28iuj5sn" }, "source": [ "Doesn't look like the scaling helped :( That is not surprising: for a linear regression with an intercept, standardizing the features and the target does not change the R^2 score. Back to the drawing board!" ] }, { "cell_type": "markdown", "metadata": { "id": "k-DU-t_xiJ_x" }, "source": [ "# Conclusion\n", "\n", "We hope you enjoyed this introductory guide to Artifacts, a simple, lightweight way to version your datasets and models while tracking the lineage of your data through your experiments. \n", "\n", "Interested in learning more about Artifacts? 
Check out the [docs](https://www.comet.ml/docs/user-interface/artifacts/?utm_campaign-artifacts-launch&utm_source=colab-example&utm_medium=additional-resources)" ] } ], "metadata": { "colab": { "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.1" } }, "nbformat": 4, "nbformat_minor": 0 }