{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Hugging Face CIFAR-100 Embeddings Example\n", "\n", "In this notebook we will see how to use a pre-trained Vision Transformer (ViT) model to collect embeddings on the CIFAR-100 dataset.\n", "\n", "![](../images/cifar100-embeddings.png)\n", "\n", "\n", "\n", "This notebook demonstrates:\n", "\n", "- Registering the `CIFAR-100` dataset from Hugging Face.\n", "- Computing image embeddings with `transformers` and reducing them to 2D with UMAP.\n", "- Adding the computed embeddings as metrics to a 3LC `Run`." ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "## Project Setup" ] }, { "cell_type": "code", "execution_count": null, "id": "2", "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "PROJECT_NAME = \"3LC Tutorials - CIFAR-100\"\n", "RUN_NAME = \"Collect Image Embeddings\"\n", "DESCRIPTION = \"Collect image embeddings from ViT model on CIFAR-100\"\n", "DEVICE = None\n", "TRAIN_DATASET_NAME = \"hf-cifar-100-train\"\n", "TEST_DATASET_NAME = \"hf-cifar-100-test\"\n", "MODEL = \"google/vit-base-patch16-224\"\n", "BATCH_SIZE = 32\n", "DOWNLOAD_PATH = \"../../transient_data\"\n", "NUM_WORKERS = 4\n", "INSTALL_DEPENDENCIES = True" ] }, { "cell_type": "code", "execution_count": null, "id": "3", "metadata": {}, "outputs": [], "source": [ "if INSTALL_DEPENDENCIES:\n", "    %pip --quiet install 3lc[umap,huggingface] \"transformers<=4.56.0\"" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "id": "5", "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "import datasets\n", "import tlc\n", "\n", "logging.getLogger(\"transformers.modeling_utils\").setLevel(logging.ERROR)  # Reduce model loading logs\n", "datasets.utils.logging.disable_progress_bar()" ] }, { "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "## Prepare the data\n", "\n", "To read the data into 3LC, we use `tlc.Table.from_hugging_face()` available under the Hugging Face integration. This returns a `Table` that works similarly to a Hugging Face `datasets.Dataset`." ] }, { "cell_type": "code", "execution_count": null, "id": "7", "metadata": {}, "outputs": [], "source": [ "cifar100_train = tlc.Table.from_hugging_face(\n", "    path=\"cifar100\",\n", "    split=\"train\",\n", "    table_name=\"train\",\n", "    project_name=PROJECT_NAME,\n", "    dataset_name=TRAIN_DATASET_NAME,\n", "    description=\"CIFAR-100 training dataset\",\n", "    if_exists=\"overwrite\",\n", ")\n", "\n", "cifar100_test = tlc.Table.from_hugging_face(\n", "    path=\"cifar100\",\n", "    split=\"test\",\n", "    table_name=\"test\",\n", "    project_name=PROJECT_NAME,\n", "    dataset_name=TEST_DATASET_NAME,\n", "    description=\"CIFAR-100 test dataset\",\n", "    if_exists=\"overwrite\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "cifar100_train[0][\"img\"]" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "## Compute the embeddings\n", "\n", "We then use the `transformers` library to compute embeddings and `umap-learn` to reduce the embeddings to two dimensions.
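The per-image embedding is taken from the hidden state of the ViT `[CLS]` token.\n", "\n", "As a minimal standalone sketch of that pattern (for illustration only; it reloads the same checkpoint named by the `MODEL` parameter above and uses a blank placeholder image instead of a CIFAR-100 sample):\n", "\n", "```python\n", "import torch\n", "from PIL import Image\n", "from transformers import ViTImageProcessor, ViTModel\n", "\n", "processor = ViTImageProcessor.from_pretrained(\"google/vit-base-patch16-224\")\n", "vit = ViTModel.from_pretrained(\"google/vit-base-patch16-224\").eval()\n", "\n", "# A blank 32x32 image stands in for a CIFAR-100 sample in this sketch.\n", "image = Image.new(\"RGB\", (32, 32))\n", "with torch.no_grad():\n", "    outputs = vit(**processor(images=image, return_tensors=\"pt\"))\n", "\n", "# The first token of the last hidden state is the [CLS] embedding, shape (1, 768).\n", "embedding = outputs.last_hidden_state[:, 0, :]\n", "```\n", "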
" ] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "import torch\n", "from torch.utils.data import DataLoader\n", "from tqdm.auto import tqdm\n", "from transformers import ViTImageProcessor, ViTModel\n", "\n", "if DEVICE is None:\n", " if torch.cuda.is_available():\n", " device = \"cuda:0\"\n", " elif torch.backends.mps.is_available():\n", " device = \"mps\"\n", " else:\n", " device = \"cpu\"\n", "else:\n", " device = DEVICE\n", "\n", "device = torch.device(device)\n", "print(f\"Using device: {device}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "feature_extractor = ViTImageProcessor.from_pretrained(MODEL)\n", "model = ViTModel.from_pretrained(MODEL).to(device)" ] }, { "cell_type": "code", "execution_count": null, "id": "12", "metadata": {}, "outputs": [], "source": [ "def extract_feature(sample):\n", " return feature_extractor(images=sample[\"img\"], return_tensors=\"pt\")" ] }, { "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [], "source": [ "def infer_on_dataset(dataset):\n", " activations = []\n", " dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, shuffle=False)\n", " for inputs in tqdm(dataloader, total=len(dataloader)):\n", " inputs[\"pixel_values\"] = inputs[\"pixel_values\"].squeeze()\n", " inputs = inputs.to(device)\n", " outputs = model(**inputs)\n", " activations.append(outputs.last_hidden_state[:, 0, :].detach().cpu())\n", "\n", " return activations" ] }, { "cell_type": "code", "execution_count": null, "id": "14", "metadata": {}, "outputs": [], "source": [ "activations = []\n", "model.eval()\n", "\n", "for dataset in (cifar100_train, cifar100_test):\n", " dataset = dataset.map(extract_feature)\n", " activations.extend(infer_on_dataset(dataset))" ] }, { "cell_type": "code", "execution_count": null, "id": "15", "metadata": {}, "outputs": [], "source": [ "activations = torch.cat(activations).numpy()\n", "activations.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "16", "metadata": {}, "outputs": [], "source": [ "import umap\n", "\n", "reducer = umap.UMAP(n_components=2)\n", "embeddings_2d = reducer.fit_transform(activations)" ] }, { "cell_type": "markdown", "id": "17", "metadata": {}, "source": [ "## Collect the embeddings as 3LC metrics\n", "\n", "In this example the metrics are contained in a `numpy.ndarray` object. We can specify the schema of this data and provide it directly to 3LC using `Run.add_metrics()`." 
] }, { "cell_type": "code", "execution_count": null, "id": "18", "metadata": {}, "outputs": [], "source": [ "run = tlc.init(\n", "    project_name=PROJECT_NAME,\n", "    run_name=RUN_NAME,\n", "    description=DESCRIPTION,\n", "    if_exists=\"overwrite\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "19", "metadata": {}, "outputs": [], "source": [ "embeddings_2d_train = embeddings_2d[: len(cifar100_train)]\n", "embeddings_2d_test = embeddings_2d[len(cifar100_train) :]" ] }, { "cell_type": "code", "execution_count": null, "id": "21", "metadata": {}, "outputs": [], "source": [ "for dataset, embeddings in ((cifar100_train, embeddings_2d_train), (cifar100_test, embeddings_2d_test)):\n", "    run.add_metrics(\n", "        {\"embeddings\": embeddings.tolist()},\n", "        column_schemas={\"embeddings\": tlc.FloatVector2Schema()},\n", "        foreign_table_url=dataset.url,\n", "    )" ] }, { "cell_type": "code", "execution_count": null, "id": "22", "metadata": {}, "outputs": [], "source": [ "run.set_status_completed()" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 5 }