{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 3: Smart Sample Selection\n", "\n", "Random sampling wastes labels on redundant near-duplicates. This step uses **diversity-based selection** to pick high-value scenes that cover your data distribution efficiently.\n", "\n", "We'll use **ZCore (Zero-Shot Coreset Selection)** to score samples based on:\n", "- **Coverage**: How much of the embedding space does this sample represent?\n", "- **Redundancy**: How many near-duplicates exist?\n", "\n", "High ZCore score = valuable for labeling. Low score = redundant, skip it.\n", "\n", "> **Note:** For grouped datasets, we compute embeddings on the **left camera slice** and select at the **group level** (scene)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "import fiftyone as fo\nimport fiftyone.brain as fob\nimport numpy as np\nfrom fiftyone import ViewField as F\n\ndataset = fo.load_dataset(\"annotation_tutorial\")\npool = dataset.load_saved_view(\"pool\")\n\n# Get pool groups (scenes)\npool_groups = pool.distinct(\"group.id\")\nprint(f\"Pool: {len(pool_groups)} groups (scenes)\")\ntotal_samples = sum(len(pool.select_group_slices([s])) for s in dataset.group_slices)\nprint(f\"Pool: {total_samples} total samples (all slices)\")" }, { "cell_type": "markdown", "metadata": {}, "source": "## Compute Embeddings on Left Camera Slice\n\nFor diversity selection, we need embeddings. We compute them on the **left camera images** since that is our primary 2D annotation target.\n\n> **Dependencies:** This step requires `torch` and `umap-learn`. 
Install them if needed:\n> ```bash\n> pip install torch torchvision umap-learn\n> ```" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get left camera slice from pool\n", "pool_left = pool.select_group_slices([\"left\"])\n", "\n", "print(f\"Left camera samples in pool: {len(pool_left)}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compute embeddings (takes a few minutes)\n", "fob.compute_visualization(\n", " pool_left,\n", " embeddings=\"embeddings\",\n", " brain_key=\"img_viz\",\n", " verbose=True\n", ")\n", "\n", "print(\"Embeddings computed on left camera slice.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ZCore: Zero-Shot Coreset Selection\n", "\n", "ZCore scores each sample by iteratively:\n", "1. Sampling random points in embedding space\n", "2. Finding the nearest data point (coverage bonus)\n", "3. Penalizing nearby neighbors (redundancy penalty)\n", "\n", "The result: samples covering unique regions score high; redundant samples score low." 
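, "\n", "\n", "As a toy illustration of the redundancy penalty (a sketch, separate from the implementation below): penalty weights are proportional to `1 / d**redund_exp` over the neighbor distances `d`, normalized to sum to 1, so the closest near-duplicates absorb most of the penalty:\n", "\n", "```python\n", "import numpy as np\n", "\n", "# Hypothetical neighbor distances from a covered point\n", "nn_dist = np.array([0.1, 0.2, 0.4])\n", "\n", "# Inverse-distance penalty with redund_exp=4 (the value used below)\n", "dist_penalty = 1 / (nn_dist**4 + 1e-8)\n", "dist_penalty /= dist_penalty.sum()\n", "print(dist_penalty.round(3))  # [0.938 0.059 0.004]\n", "```"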
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def zcore_score(embeddings, n_sample=10000, sample_dim=2, redund_nn=100, redund_exp=4, seed=42):\n", " \"\"\"\n", " Compute ZCore scores for coverage-based sample selection.\n", " \n", " Reference implementation from https://github.com/voxel51/zcore\n", " \n", " Args:\n", " embeddings: np.array of shape (n_samples, embedding_dim)\n", " n_sample: Number of random samples to draw\n", " sample_dim: Number of dimensions to sample at a time\n", " redund_nn: Number of nearest neighbors for redundancy penalty\n", " redund_exp: Exponent for distance-based redundancy penalty\n", " seed: Random seed for reproducibility\n", " \n", " Returns:\n", " Normalized scores (0-1) where higher = more valuable for labeling\n", " \"\"\"\n", " np.random.seed(seed)\n", " \n", " n = len(embeddings)\n", " n_dim = embeddings.shape[1]\n", " \n", " emb_min = np.min(embeddings, axis=0)\n", " emb_max = np.max(embeddings, axis=0)\n", " emb_med = np.median(embeddings, axis=0)\n", " \n", " scores = np.random.uniform(0, 1, n)\n", " \n", " for i in range(n_sample):\n", " if i % 2000 == 0:\n", " print(f\" ZCore progress: {i}/{n_sample}\")\n", " \n", " dim = np.random.choice(n_dim, min(sample_dim, n_dim), replace=False)\n", " sample = np.random.triangular(emb_min[dim], emb_med[dim], emb_max[dim])\n", " \n", " embed_dist = np.sum(np.abs(embeddings[:, dim] - sample), axis=1)\n", " idx = np.argmin(embed_dist)\n", " scores[idx] += 1\n", " \n", " cover_sample = embeddings[idx, dim]\n", " nn_dist = np.sum(np.abs(embeddings[:, dim] - cover_sample), axis=1)\n", " nn = np.argsort(nn_dist)[1:]\n", " \n", " if nn_dist[nn[0]] == 0:\n", " scores[nn[0]] -= 1\n", " else:\n", " nn = nn[:redund_nn]\n", " dist_penalty = 1 / (nn_dist[nn] ** redund_exp + 1e-8)\n", " dist_penalty /= np.sum(dist_penalty)\n", " scores[nn] -= dist_penalty\n", " \n", " scores = (scores - np.min(scores)) / (np.max(scores) - np.min(scores) + 1e-8)\n", 
" return scores.astype(np.float32)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get embeddings from left camera samples\n", "pool_left_samples = list(pool_left)\n", "embeddings = np.array([s.embeddings for s in pool_left_samples if s.embeddings is not None])\n", "valid_samples = [s for s in pool_left_samples if s.embeddings is not None]\n", "\n", "print(f\"Computing ZCore for {len(embeddings)} samples...\")\n", "print(f\"Embedding dimension: {embeddings.shape[1]}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compute ZCore scores\n", "scores = zcore_score(\n", " embeddings,\n", " n_sample=5000,\n", " sample_dim=2,\n", " redund_nn=50,\n", " redund_exp=4,\n", " seed=42\n", ")\n", "\n", "print(f\"\\nZCore scores computed!\")\n", "print(f\"Score range: {scores.min():.3f} - {scores.max():.3f}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Add ZCore scores to the left camera samples\n", "for sample, score in zip(valid_samples, scores):\n", " sample[\"zcore\"] = float(score)\n", " sample.save()\n", "\n", "print(f\"Added 'zcore' field to {len(valid_samples)} left camera samples\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Select at the Group Level\n", "\n", "We computed scores on individual samples (left camera), but we need to select **groups** (scenes). Each group includes all slices (left, right, pcd).\n", "\n", "Selection strategy: Use the ZCore score from the left camera sample to rank groups." 
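, "\n", "\n", "Since each scene contributes exactly one left-camera sample, this is a one-to-one mapping. If a slice ever contributed multiple scored samples per group, a natural (hypothetical) aggregation is the max:\n", "\n", "```python\n", "# Sketch: max-aggregate per-sample scores into group scores\n", "# (assumes samples carry the zcore field computed above)\n", "from collections import defaultdict\n", "\n", "agg_scores = defaultdict(float)\n", "for sample in valid_samples:\n", "    agg_scores[sample.group.id] = max(agg_scores[sample.group.id], sample.zcore)\n", "```"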
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Build group_id -> zcore mapping\n", "group_scores = {}\n", "for sample in valid_samples:\n", " group_id = sample.group.id\n", " group_scores[group_id] = sample.zcore\n", "\n", "# Sort groups by ZCore score\n", "sorted_groups = sorted(group_scores.items(), key=lambda x: x[1], reverse=True)\n", "\n", "print(f\"Groups with ZCore scores: {len(sorted_groups)}\")\n", "print(f\"Top 5 groups by ZCore:\")\n", "for gid, score in sorted_groups[:5]:\n", " print(f\" {gid[:8]}...: {score:.3f}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Select top groups for batch v0\n", "# ~25% of pool groups, minimum 20\n", "batch_size = max(20, int(0.25 * len(sorted_groups)))\n", "selected_group_ids = [gid for gid, _ in sorted_groups[:batch_size]]\n", "\n", "print(f\"Selected {len(selected_group_ids)} groups for Batch v0\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "# Tag ALL samples in selected groups (all slices)\n# Must iterate all slices since F(\"group.id\").is_in() does not work on grouped datasets\nfor slice_name in dataset.group_slices:\n view = dataset.select_group_slices([slice_name])\n for sample in view.iter_samples(autosave=True):\n if sample.group.id in selected_group_ids:\n sample.tags.append(\"batch:v0\")\n sample.tags.append(\"to_annotate\")\n sample[\"annotation_status\"] = \"selected\"\n\n# Save as view\ndataset.save_view(\"batch_v0\", dataset.match_tags(\"batch:v0\"))\n\n# Count results\nbatch_v0_view = dataset.load_saved_view(\"batch_v0\")\nn_groups = len(batch_v0_view.distinct(\"group.id\"))\nn_samples = sum(len(batch_v0_view.select_group_slices([s])) for s in dataset.group_slices)\n\nprint(f\"\\nBatch v0:\")\nprint(f\" Groups: {n_groups}\")\nprint(f\" Total samples (all slices): {n_samples}\")" }, { "cell_type": "code", "execution_count": null, "metadata": {}, 
"outputs": [], "source": [ "# Verify: check sample counts per slice\n", "batch_view = dataset.load_saved_view(\"batch_v0\")\n", "for slice_name in dataset.group_slices:\n", " slice_count = len(batch_view.select_group_slices([slice_name]))\n", " print(f\" {slice_name}: {slice_count} samples\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Visualize in the App\n", "session = fo.launch_app(dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": "In the App:\n1. Open the **Embeddings** panel to see the 2D projection\n2. Color by `zcore` to see score distribution\n3. Filter by `batch:v0` tag to see selected groups\n4. Verify high-ZCore samples are spread across clusters (good coverage)\n\n![Embeddings panel with ZCore scores](https://cdn.voxel51.com/getting_started_annotation/notebook3/embeddings_zcore.webp)" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why Diversity Sampling Beats Random\n", "\n", "| Method | What it does | Result |\n", "|--------|-------------|--------|\n", "| **Random** | Picks samples uniformly | Over-samples dense regions, misses rare cases |\n", "| **ZCore** | Balances coverage vs redundancy | Maximizes diversity, fewer wasted labels |\n", "\n", "Research shows diversity-based selection can significantly reduce labeling requirements while maintaining model performance." 
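, "\n", "\n", "A self-contained toy comparison (using greedy farthest-point selection as a simple stand-in for a diversity criterion; this is an illustration, not the ZCore pipeline):\n", "\n", "```python\n", "import numpy as np\n", "\n", "rng = np.random.default_rng(0)\n", "# Toy embeddings: 95 near-duplicates in one cluster plus 5 rare outliers\n", "X = np.vstack([rng.normal(0, 0.1, (95, 2)), rng.uniform(-3, 3, (5, 2))])\n", "\n", "def mean_dist_to_selection(idx):\n", "    # Mean distance from each point to its nearest selected point (lower = better coverage)\n", "    d = np.linalg.norm(X[:, None] - X[idx][None], axis=-1)\n", "    return d.min(axis=1).mean()\n", "\n", "# Random pick of 6 points vs. greedy farthest-point pick of 6 points\n", "rand_idx = rng.choice(len(X), 6, replace=False)\n", "fp_idx = [0]\n", "for _ in range(5):\n", "    d = np.linalg.norm(X[:, None] - X[fp_idx][None], axis=-1).min(axis=1)\n", "    fp_idx.append(int(d.argmax()))\n", "\n", "# The diversity pick typically yields a lower mean distance (better coverage)\n", "print(f\"random:    {mean_dist_to_selection(rand_idx):.3f}\")\n", "print(f\"diversity: {mean_dist_to_selection(np.array(fp_idx)):.3f}\")\n", "```"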
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "You selected a diverse batch using ZCore:\n", "- Computed embeddings on **left camera slice**\n", "- Ran ZCore to score coverage vs redundancy\n", "- Selected top-scoring **groups** (scenes)\n", "- Tagged all slices (left, right, pcd) for annotation\n", "\n", "**Artifacts:**\n", "- `embeddings` field on left camera samples\n", "- `zcore` field with selection scores\n", "- `batch_v0` saved view (all slices for selected groups)\n", "- Tags: `batch:v0`, `to_annotate`\n", "\n", "**Next:** Step 4 - 2D Annotation + QA" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.9.0" } }, "nbformat": 4, "nbformat_minor": 4 }