{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Apply dimensionality reduction to multiple Tables\n",
    "\n",
    "This example shows how to use the \"producer-consumer\" pattern for re-using dimensionality reduction models across different tables.\n",
    "\n",
    "![](../images/dimensionality-reduction.jpg)\n",
    "\n",
    "<!-- Tags: [\"dimensionality-reduction\", \"cifar-10\"] -->\n",
    "\n",
    "Specifically, high-dimensional embeddings from the same model are added as new columns to the train and val splits of the CIFAR-10 dataset. With a single call, a dimensionality reduction model (PaCMAP with the default `METHOD` setting below; UMAP is also supported) is fitted on the train split embeddings and then used to transform both the train and val split embeddings. This ensures that the reduced, 3-dimensional embeddings are mapped to the same space, which is crucial for comparing embeddings across tables.\n",
    "\n",
    "The `tlc` package contains several helper functions for working with dimensionality reduction, and currently supports both the UMAP and PaCMAP algorithms. A \"producer\" table is a reduction table that fits a dimensionality reduction model to the data and saves the model for later use. A \"consumer\" table is a reduction table that uses the model from a producer table to only transform the data."
   ]
  },
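  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The producer-consumer idea can be illustrated independently of `tlc`. The following is a minimal sketch, assuming NumPy is available and using a plain PCA projection as a stand-in for UMAP or PaCMAP. The random arrays and the `transform` helper are hypothetical and are not part of the `tlc` API. The reduction is fitted once on the train embeddings, and the stored parameters are reused for the val embeddings, so both splits end up in the same space.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "train_embeddings = rng.normal(size=(100, 16))  # stand-in for train split embeddings\n",
    "val_embeddings = rng.normal(size=(20, 16))  # stand-in for val split embeddings\n",
    "\n",
    "# Producer step: fit the reduction on the train embeddings and keep its parameters.\n",
    "mean = train_embeddings.mean(axis=0)\n",
    "_, _, vt = np.linalg.svd(train_embeddings - mean, full_matrices=False)\n",
    "components = vt[:3]  # top 3 principal directions\n",
    "\n",
    "\n",
    "def transform(embeddings):\n",
    "    # Consumer step: reuse the stored parameters without re-fitting.\n",
    "    return (embeddings - mean) @ components.T\n",
    "\n",
    "\n",
    "reduced_train = transform(train_embeddings)\n",
    "reduced_val = transform(val_embeddings)\n",
    "print(reduced_train.shape, reduced_val.shape)\n",
    "```\n",
    "\n",
    "Fitting a separate model per split would instead give each split its own axes, and the reduced points would not be directly comparable."
   ]
  },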
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Project setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "PROJECT_NAME = \"3LC Tutorials - CIFAR-10\"\n",
    "MODEL_NAME = \"resnet18\"\n",
    "METHOD = \"pacmap\"\n",
    "BATCH_SIZE = 32\n",
    "DOWNLOAD_PATH = \"../../transient_data\"\n",
    "NUM_COMPONENTS = 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install 3lc[huggingface,pacmap]\n",
    "%pip install timm\n",
    "%pip install git+https://github.com/3lc-ai/3lc-examples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "import timm\n",
    "import tlc\n",
    "from torchvision.transforms import Compose, Normalize, Resize, ToTensor\n",
    "\n",
    "from tlc_tools.common import infer_torch_device\n",
    "from tlc_tools.embeddings import add_embeddings_to_table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load input Tables\n",
    "\n",
    "We will re-use the CIFAR-10 tables created in an earlier notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_table = tlc.Table.from_names(\"initial\", \"CIFAR-10-train\", PROJECT_NAME)\n",
    "val_table = tlc.Table.from_names(\"initial\", \"CIFAR-10-val\", PROJECT_NAME)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = timm.create_model(MODEL_NAME, pretrained=True, num_classes=0, cache_dir=Path(DOWNLOAD_PATH) / \"models\")\n",
    "model = model.to(infer_torch_device())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Map the table to ensure only suitably preprocessed images are passed to the model\n",
    "\n",
    "transform = Compose([Resize(256), ToTensor(), Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])\n",
    "\n",
    "\n",
    "def transformed_image(sample):\n",
    "    return transform(sample[0])\n",
    "\n",
    "\n",
    "train_table.map(transformed_image)\n",
    "\n",
    "val_table.map(transformed_image)\n",
    "\n",
    "train_table[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_table_with_embeddings = add_embeddings_to_table(table=train_table, model=model, batch_size=BATCH_SIZE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val_table_with_embeddings = add_embeddings_to_table(table=val_table, model=model, batch_size=BATCH_SIZE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Perform dimensionality reduction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fit the reduction model on the train embeddings (producer) and reuse it for val (consumer)\n",
    "url_mapping = tlc.reduce_embeddings_with_producer_consumer(\n",
    "    producer=train_table_with_embeddings,\n",
    "    consumers=[val_table_with_embeddings],\n",
    "    method=METHOD,\n",
    "    n_components=NUM_COMPONENTS,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reduced_train_table_url = url_mapping[train_table_with_embeddings.url]\n",
    "reduced_val_table_url = url_mapping[val_table_with_embeddings.url]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Reduced train table url: {reduced_train_table_url}\")\n",
    "print(f\"Reduced val table url: {reduced_val_table_url}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  },
  "test_marks": [
   "slow"
  ]
 },
 "nbformat": 4,
 "nbformat_minor": 2
}