{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Hugging Face CIFAR-100 Embeddings Example\n", "\n", "In this notebook we will see how to use a pre-trained Vision Transformer (ViT) model to collect embeddings on the CIFAR-100 dataset.\n", "\n", "![](../images/cifar100-embeddings.png)\n", "\n", "\n", "\n", "This notebook demonstrates:\n", "\n", "- Registering the `CIFAR-100` dataset from Hugging Face.\n", "- Computing image embeddings with `transformers` and reducing them to 2D with UMAP.\n", "- Adding the computed embeddings as metrics to a 3LC `Run`." ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "## Project Setup" ] }, { "cell_type": "code", "execution_count": null, "id": "2", "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "PROJECT_NAME = \"3LC Tutorials - CIFAR-100\"\n", "RUN_NAME = \"Collect Image Embeddings\"\n", "DESCRIPTION = \"Collect image embeddings from ViT model on CIFAR-100\"\n", "DEVICE = None\n", "TRAIN_DATASET_NAME = \"hf-cifar-100-train\"\n", "TEST_DATASET_NAME = \"hf-cifar-100-test\"\n", "MODEL = \"google/vit-base-patch16-224\"\n", "BATCH_SIZE = 32\n", "DOWNLOAD_PATH = \"../../transient_data\"\n", "NUM_WORKERS = 4\n", "INSTALL_DEPENDENCIES = True" ] }, { "cell_type": "code", "execution_count": null, "id": "3", "metadata": {}, "outputs": [], "source": [ "if INSTALL_DEPENDENCIES:\n", "    %pip --quiet install 3lc[umap,huggingface] \"transformers<=4.56.0\"" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "id": "5", "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "import datasets\n", "import tlc\n", "\n", "logging.getLogger(\"transformers.modeling_utils\").setLevel(logging.ERROR)  # Reduce model loading logs\n", "datasets.utils.logging.disable_progress_bar()" ] }, { "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "## Prepare the data\n", "\n", "To read the data into 3LC, we use `tlc.Table.from_hugging_face()` available under the Hugging Face integration. This returns a `Table` that works similarly to a Hugging Face `datasets.Dataset`." ] }, { "cell_type": "code", "execution_count": null, "id": "7", "metadata": {}, "outputs": [], "source": [ "cifar100_train = tlc.Table.from_hugging_face(\n", "    path=\"cifar100\",\n", "    split=\"train\",\n", "    table_name=\"train\",\n", "    project_name=PROJECT_NAME,\n", "    dataset_name=TRAIN_DATASET_NAME,\n", "    description=\"CIFAR-100 training dataset\",\n", "    if_exists=\"overwrite\",\n", ")\n", "\n", "cifar100_test = tlc.Table.from_hugging_face(\n", "    path=\"cifar100\",\n", "    split=\"test\",\n", "    table_name=\"test\",\n", "    project_name=PROJECT_NAME,\n", "    dataset_name=TEST_DATASET_NAME,\n", "    description=\"CIFAR-100 test dataset\",\n", "    if_exists=\"overwrite\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "cifar100_train[0][\"img\"]" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "## Compute the embeddings\n", "\n", "We then use the `transformers` library to compute embeddings and `umap-learn` to reduce the embeddings to two dimensions.
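The per-image embedding is taken from the hidden state of the ViT `[CLS]` token.\n", "\n", "As a minimal standalone sketch of that pattern (for illustration only; it reloads the same checkpoint named by the `MODEL` parameter above and uses a blank placeholder image instead of a CIFAR-100 sample):\n", "\n", "```python\n", "import torch\n", "from PIL import Image\n", "from transformers import ViTImageProcessor, ViTModel\n", "\n", "processor = ViTImageProcessor.from_pretrained(\"google/vit-base-patch16-224\")\n", "vit = ViTModel.from_pretrained(\"google/vit-base-patch16-224\").eval()\n", "\n", "# A blank 32x32 image stands in for a CIFAR-100 sample in this sketch.\n", "image = Image.new(\"RGB\", (32, 32))\n", "with torch.no_grad():\n", "    outputs = vit(**processor(images=image, return_tensors=\"pt\"))\n", "\n", "# The first token of the last hidden state is the [CLS] embedding, shape (1, 768).\n", "embedding = outputs.last_hidden_state[:, 0, :]\n", "```\n", "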
" ] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "import torch\n", "from torch.utils.data import DataLoader\n", "from tqdm.auto import tqdm\n", "from transformers import ViTImageProcessor, ViTModel\n", "\n", "if DEVICE is None:\n", " if torch.cuda.is_available():\n", " device = \"cuda:0\"\n", " elif torch.backends.mps.is_available():\n", " device = \"mps\"\n", " else:\n", " device = \"cpu\"\n", "else:\n", " device = DEVICE\n", "\n", "device = torch.device(device)\n", "print(f\"Using device: {device}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "feature_extractor = ViTImageProcessor.from_pretrained(MODEL)\n", "model = ViTModel.from_pretrained(MODEL).to(device)" ] }, { "cell_type": "code", "execution_count": null, "id": "12", "metadata": {}, "outputs": [], "source": [ "def extract_feature(sample):\n", " return feature_extractor(images=sample[\"img\"], return_tensors=\"pt\")" ] }, { "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [], "source": [ "def infer_on_dataset(dataset):\n", " activations = []\n", " dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, shuffle=False)\n", " for inputs in tqdm(dataloader, total=len(dataloader)):\n", " inputs[\"pixel_values\"] = inputs[\"pixel_values\"].squeeze()\n", " inputs = inputs.to(device)\n", " outputs = model(**inputs)\n", " activations.append(outputs.last_hidden_state[:, 0, :].detach().cpu())\n", "\n", " return activations" ] }, { "cell_type": "code", "execution_count": null, "id": "14", "metadata": {}, "outputs": [], "source": [ "activations = []\n", "model.eval()\n", "\n", "for dataset in (cifar100_train, cifar100_test):\n", " dataset = dataset.map(extract_feature)\n", " activations.extend(infer_on_dataset(dataset))" ] }, { "cell_type": "code", "execution_count": null, "id": "15", "metadata": {}, "outputs": [], "source": [ "activations = torch.cat(activations).numpy()\n", "activations.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "16", "metadata": {}, "outputs": [], "source": [ "import umap\n", "\n", "reducer = umap.UMAP(n_components=2)\n", "embeddings_2d = reducer.fit_transform(activations)" ] }, { "cell_type": "markdown", "id": "17", "metadata": {}, "source": [ "## Collect the embeddings as 3LC metrics\n", "\n", "In this example the metrics are contained in a `numpy.ndarray` object. We can specify the schema of this data and provide it directly to 3LC using `Run.add_metrics()`." 
] }, { "cell_type": "code", "execution_count": null, "id": "18", "metadata": {}, "outputs": [], "source": [ "run = tlc.init(\n", "    project_name=PROJECT_NAME,\n", "    run_name=RUN_NAME,\n", "    description=DESCRIPTION,\n", "    if_exists=\"overwrite\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "19", "metadata": {}, "outputs": [], "source": [ "embeddings_2d_train = embeddings_2d[: len(cifar100_train)]\n", "embeddings_2d_test = embeddings_2d[len(cifar100_train) :]" ] }, { "cell_type": "code", "execution_count": null, "id": "21", "metadata": {}, "outputs": [], "source": [ "for dataset, embeddings in ((cifar100_train, embeddings_2d_train), (cifar100_test, embeddings_2d_test)):\n", "    run.add_metrics(\n", "        {\"embeddings\": embeddings.tolist()},\n", "        column_schemas={\"embeddings\": tlc.FloatVector2Schema()},\n", "        foreign_table_url=dataset.url,\n", "    )" ] }, { "cell_type": "code", "execution_count": null, "id": "22", "metadata": {}, "outputs": [], "source": [ "run.set_status_completed()" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 5 }