{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Add uniqueness metrics to Table\n",
    "\n",
    "This notebook computes the global image metrics \"uniqueness\" and \"diversity\" for each image in a dataset containing images and labels.\n",
    "\n",
    "![](../images/add-classification-metrics.png)\n",
    "\n",
    "<!-- Tags: [\"classification\", \"metrics\", \"cifar-10\", \"data-curation\"] -->\n",
    "\n",
    "The uniqueness metric quantifies how distinct an image is compared to others within its label group. It is based on the average pairwise distance between the image's embedding and all other embeddings of the same label. Higher uniqueness scores indicate that an image is less similar to others in its group, which can help identify outliers or rare examples within a class.\n",
    "\n",
    "The diversity metric, on the other hand, ranks images within a label group based on their proximity to the cluster center of that group. Images closer to the cluster center receive lower diversity scores, while those further away are ranked higher. This metric highlights how representative or central an image is to its class, helping to identify examples that are more typical versus those that are more unusual. Together, these metrics provide valuable insights into the structure and variability of the dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install 3lc\n",
    "%pip install git+https://github.com/3lc-ai/3lc-examples.git\n",
    "%pip install transformers\n",
    "%pip install pacmap\n",
    "%pip install joblib"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import tlc\n",
    "\n",
    "from tlc_tools import add_columns_to_table\n",
    "from tlc_tools.embeddings import add_embeddings_to_table\n",
    "from tlc_tools.metrics import diversity, uniqueness"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load input table\n",
    "\n",
    "We will reuse the CIFAR-10 subset table from the notebook [1-weight-coreset.ipynb](./1-weight-coreset.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "table = tlc.Table.from_names(\"balanced-subset\", \"CIFAR-10-train\", \"3LC Tutorials - CIFAR-10\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Add embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "table.map(lambda x: x[0])  # Only return the image\n",
    "\n",
    "# In order to compute the \"uniqueness\" and \"diversity\" metrics, we need embeddings.\n",
    "# This line adds default embeddings from a Visual Image Transformer model.\n",
    "added_embeddings_table = add_embeddings_to_table(table)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reduced_table = tlc.reduce_embeddings(added_embeddings_table, method=\"pacmap\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reduced_table.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compute metrics and add to Table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "label_column_pyarrow = reduced_table.get_column(\"Label\")\n",
    "label_column = np.array(label_column_pyarrow.to_numpy(zero_copy_only=False).tolist(), dtype=np.int32)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "embeddings_column_pyarrow = reduced_table.get_column(\"embedding_pacmap\")\n",
    "embeddings_column = np.array(embeddings_column_pyarrow.to_numpy(zero_copy_only=False).tolist(), dtype=np.float32)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diversity = diversity(label_column, embeddings_column)\n",
    "uniqueness = uniqueness(label_column, embeddings_column)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "added_columns_table = add_columns_to_table(\n",
    "    reduced_table,\n",
    "    columns={\n",
    "        \"image_diversity\": diversity,\n",
    "        \"image_uniqueness\": uniqueness,\n",
    "    },\n",
    "    output_table_name=\"added-classification-metrics\",\n",
    "    description=\"Added uniqueness and diversity metrics based on embeddings\",\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  },
  "test_marks": [
   "slow"
  ]
 },
 "nbformat": 4,
 "nbformat_minor": 2
}