{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Apply dimensionality reduction to multiple Tables\n", "\n", "This example shows how to use the \"producer-consumer\" pattern for re-using dimensionality reduction models across different tables.\n", "\n", "![](../images/dimensionality-reduction.jpg)\n", "\n", "Specifically, high-dimensional embeddings from the same model are added as new columns to the train and val splits of the CIFAR-10 dataset. With a single call, a dimensionality reduction model (selected by `METHOD`, PaCMAP in this notebook) is fitted on the train split embeddings and then used to transform both the train and val split embeddings. This ensures that the reduced embeddings, with `NUM_COMPONENTS` dimensions each, are mapped to the same space, which is crucial for comparing embeddings across tables.\n", "\n", "The `tlc` package contains several helper functions for working with dimensionality reduction and currently supports both the UMAP and PaCMAP algorithms. A \"producer\" table is a reduction table that fits a dimensionality reduction model to its data and saves the model for later use. A \"consumer\" table is a reduction table that only transforms its data, using the model saved by a producer table." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Project setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "PROJECT_NAME = \"3LC Tutorials - CIFAR-10\"\n", "MODEL_NAME = \"resnet18\"\n", "METHOD = \"pacmap\"\n", "BATCH_SIZE = 32\n", "DOWNLOAD_PATH = \"../../transient_data\"\n", "NUM_COMPONENTS = 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install dependencies" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install 3lc[huggingface,pacmap]\n", "%pip install timm\n", "%pip install git+https://github.com/3lc-ai/3lc-examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "import timm\n", "import tlc\n", "from torchvision.transforms import Compose, Normalize, Resize, ToTensor\n", "\n", "from tlc_tools.common import infer_torch_device\n", "from tlc_tools.embeddings import add_embeddings_to_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load input Tables\n", "\n", "We will re-use the CIFAR-10 tables created in an earlier notebook." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_table = tlc.Table.from_names(\"initial\", \"CIFAR-10-train\", PROJECT_NAME)\n", "val_table = tlc.Table.from_names(\"initial\", \"CIFAR-10-val\", PROJECT_NAME)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = timm.create_model(MODEL_NAME, pretrained=True, num_classes=0, cache_dir=Path(DOWNLOAD_PATH) / \"models\")\n", "model = model.to(infer_torch_device())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Map the table to ensure only suitably preprocessed images are passed to the model\n", "\n", "transform = Compose([Resize(256), ToTensor(), Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])\n", "\n", "\n", "def transformed_image(sample):\n", "    return transform(sample[0])\n", "\n", "\n", "train_table.map(transformed_image)\n", "\n", "val_table.map(transformed_image)\n", "\n", "train_table[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_table_with_embeddings = add_embeddings_to_table(table=train_table, model=model, batch_size=BATCH_SIZE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val_table_with_embeddings = add_embeddings_to_table(table=val_table, model=model, batch_size=BATCH_SIZE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Perform dimensionality reduction" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Fit the reduction model on the train split (producer) and reuse it to transform the val split (consumer)\n", "url_mapping = tlc.reduce_embeddings_with_producer_consumer(\n", "    producer=train_table_with_embeddings,\n", "    consumers=[val_table_with_embeddings],\n", "    method=METHOD,\n", "    n_components=NUM_COMPONENTS,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, 
"outputs": [], "source": [ "reduced_train_table_url = url_mapping[train_table_with_embeddings.url]\n", "reduced_val_table_url = url_mapping[val_table_with_embeddings.url]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Reduced train table url: {reduced_train_table_url}\")\n", "print(f\"Reduced val table url: {reduced_val_table_url}\")" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" }, "test_marks": [ "slow" ] }, "nbformat": 4, "nbformat_minor": 2 }