{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 2: Setup Data Splits\n",
    "\n",
    "Before iterating on annotations, you need proper data splits. Without them, you'll contaminate your evaluation and build a model that only looks good on paper.\n",
    "\n",
    "This step uses the **quickstart-groups** dataset (KITTI multimodal data with left/right cameras and point clouds) and creates:\n",
    "\n",
    "- **Test set (15%)** - Frozen. Never used for selection or training. Final evaluation only.\n",
    "- **Validation set (15%)** - For iteration decisions. Used to evaluate between training rounds.\n",
    "- **Golden QA set (5%)** - Small, heavily reviewed. Detects label drift.\n",
    "- **Pool (65%)** - Active learning pool. All new labels come from here.\n",
    "\n",
    "> **Critical:** Splits are created at the **group level** (scene), not sample level. This ensures the same scene stays together across all slices (left, right, pcd), preventing data leakage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import fiftyone as fo\n",
    "import fiftyone.zoo as foz\n",
    "import random\n",
    "\n",
    "DATASET_NAME = \"annotation_tutorial\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load or Create the Dataset\n",
    "\n",
    "We clone `quickstart-groups` to a persistent working dataset. This keeps your annotations separate from the zoo dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load or create the dataset (idempotent - safe to rerun)\n",
    "if DATASET_NAME in fo.list_datasets():\n",
    "    print(f\"Loading existing dataset: {DATASET_NAME}\")\n",
    "    dataset = fo.load_dataset(DATASET_NAME)\n",
    "    \n",
    "    # Check if splits already exist\n",
    "    existing_views = dataset.list_saved_views()\n",
    "    if \"pool\" in existing_views:\n",
    "        print(\"Splits already exist. Skipping creation.\")\n",
    "        SPLITS_EXIST = True\n",
    "    else:\n",
    "        SPLITS_EXIST = False\n",
    "else:\n",
    "    print(f\"Creating new dataset: {DATASET_NAME}\")\n",
    "    source = foz.load_zoo_dataset(\"quickstart-groups\")\n",
    "    dataset = source.clone(DATASET_NAME)\n",
    "    dataset.persistent = True\n",
    "    SPLITS_EXIST = False\n",
    "\n",
    "print(f\"\\nDataset: {dataset.name}\")\n",
    "print(f\"Media type: {dataset.media_type}\")\n",
    "print(f\"Group slices: {dataset.group_slices}\")\n",
    "print(f\"Default slice: {dataset.default_group_slice}\")\n",
    "print(f\"Num groups (scenes): {len(dataset.distinct('group.id'))}\")\n",
    "total_samples = sum(len(dataset.select_group_slices([s])) for s in dataset.group_slices)\n",
    "print(f\"Num samples (all slices): {total_samples}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Understand the Grouped Structure\n",
    "\n",
    "The `quickstart-groups` dataset is a **grouped dataset** from KITTI:\n",
    "\n",
    "| Slice | Content | Purpose |\n",
    "|-------|---------|--------|\n",
    "| `left` | Left camera images | 2D detection annotation |\n",
    "| `right` | Right camera images | Stereo pair (optional use) |\n",
    "| `pcd` | Point cloud data | 3D cuboid annotation |\n",
    "\n",
    "Each **group** represents one scene/frame with synchronized data across all sensors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Explore a single group\n",
    "group_ids = dataset.distinct(\"group.id\")\n",
    "example_group_id = group_ids[0]\n",
    "\n",
    "print(f\"Example group ID: {example_group_id}\")\n",
    "print(\"\\nSamples in this group:\")\n",
    "\n",
    "example_group = dataset.get_group(example_group_id)\n",
    "for slice_name, sample in example_group.items():\n",
    "    # sample.filename is the basename of sample.filepath; using it avoids\n",
    "    # nested double quotes inside the f-string (a syntax error before Python 3.12)\n",
    "    print(f\"  {slice_name}: {sample.filename}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Splits at the Group Level\n",
    "\n",
    "**Why group-level splits?**\n",
    "\n",
    "If we split at the sample level, slices of the same scene could land in different splits: for example, the left image in the pool and the right image in the test set. This causes data leakage, because the model \"sees\" scenes at training time that reappear in evaluation.\n",
    "\n",
    "By splitting at the **group level**, we ensure:\n",
    "- All slices from the same scene stay together\n",
    "- No information leaks between splits"
   ]
  },
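  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assign each group (scene) to a split, then tag ALL samples in the group\n",
    "# Must iterate all slices since grouped datasets segment by slice\n",
    "if not SPLITS_EXIST:\n",
    "    # Shuffle group IDs with a fixed (arbitrary) seed so the splits are reproducible\n",
    "    group_ids = sorted(dataset.distinct(\"group.id\"))\n",
    "    random.Random(51).shuffle(group_ids)\n",
    "    \n",
    "    # 15% test, 15% val, 5% golden QA; the remaining ~65% is the pool\n",
    "    n_test = int(0.15 * len(group_ids))\n",
    "    n_val = int(0.15 * len(group_ids))\n",
    "    n_golden = int(0.05 * len(group_ids))\n",
    "    \n",
    "    test_groups = group_ids[:n_test]\n",
    "    val_groups = group_ids[n_test:n_test + n_val]\n",
    "    golden_groups = group_ids[n_test + n_val:n_test + n_val + n_golden]\n",
    "    pool_groups = group_ids[n_test + n_val + n_golden:]\n",
    "    \n",
    "    # Build group-to-split mapping\n",
    "    group_to_split = {}\n",
    "    for gid in test_groups:\n",
    "        group_to_split[gid] = \"split:test\"\n",
    "    for gid in val_groups:\n",
    "        group_to_split[gid] = \"split:val\"\n",
    "    for gid in golden_groups:\n",
    "        group_to_split[gid] = \"split:golden\"\n",
    "    for gid in pool_groups:\n",
    "        group_to_split[gid] = \"split:pool\"\n",
    "    \n",
    "    # Tag samples across ALL slices\n",
    "    for slice_name in dataset.group_slices:\n",
    "        view = dataset.select_group_slices([slice_name])\n",
    "        for sample in view.iter_samples(autosave=True):\n",
    "            split_tag = group_to_split.get(sample.group.id)\n",
    "            # Skip tags that are already present so an interrupted run can be repeated\n",
    "            if split_tag and split_tag not in sample.tags:\n",
    "                sample.tags.append(split_tag)\n",
    "    \n",
    "    # Save views for easy access (overwrite=True keeps reruns safe)\n",
    "    dataset.save_view(\"test_set\", dataset.match_tags(\"split:test\"), overwrite=True)\n",
    "    dataset.save_view(\"val_set\", dataset.match_tags(\"split:val\"), overwrite=True)\n",
    "    dataset.save_view(\"golden_qa\", dataset.match_tags(\"split:golden\"), overwrite=True)\n",
    "    dataset.save_view(\"pool\", dataset.match_tags(\"split:pool\"), overwrite=True)\n",
    "    \n",
    "    print(\"Splits created and saved as views.\")\n",
    "else:\n",
    "    print(\"Using existing splits.\")"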
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add annotation tracking field (idempotent)\n",
    "if \"annotation_status\" not in dataset.get_field_schema():\n",
    "    dataset.add_sample_field(\"annotation_status\", fo.StringField)\n",
    "    # On a grouped dataset, count() and set_values() only cover the active slice,\n",
    "    # so initialize the field slice by slice\n",
    "    for slice_name in dataset.group_slices:\n",
    "        slice_view = dataset.select_group_slices([slice_name])\n",
    "        slice_view.set_values(\"annotation_status\", [\"unlabeled\"] * len(slice_view))\n",
    "    print(\"Added annotation_status field (all samples start as 'unlabeled')\")\n",
    "else:\n",
    "    print(\"annotation_status field already exists.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Verify setup\n",
    "print(\"Saved views:\", dataset.list_saved_views())\n",
    "print()\n",
    "\n",
    "for view_name in [\"test_set\", \"val_set\", \"golden_qa\", \"pool\"]:\n",
    "    view = dataset.load_saved_view(view_name)\n",
    "    # Count unique groups in view\n",
    "    n_groups = len(view.distinct(\"group.id\"))\n",
    "    # len(view) only counts the active slice, so sum over every slice\n",
    "    n_samples = sum(len(view.select_group_slices([s])) for s in dataset.group_slices)\n",
    "    print(f\"{view_name}: {n_groups} groups, {n_samples} samples (all slices)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Launch the App\n",
    "\n",
    "Explore your grouped dataset in the App. Notice:\n",
    "- The **group mode** shows synchronized samples\n",
    "- Use the **slice selector** to switch between left, right, and pcd\n",
    "- Filter by split tags to see each partition"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "session = fo.launch_app(dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "You created four data splits with clear purposes:\n",
    "- Test (frozen), Val (iteration), Golden (QA), Pool (labeling source)\n",
    "- **Splits are at the group level** - same scene = same split across all slices\n",
    "\n",
    "**Artifacts:**\n",
    "- `annotation_tutorial` dataset (persistent clone of quickstart-groups)\n",
    "- Split tags: `split:test`, `split:val`, `split:golden`, `split:pool`\n",
    "- Saved views: `test_set`, `val_set`, `golden_qa`, `pool`\n",
    "- `annotation_status` field for tracking progress\n",
    "\n",
    "**Next:** Step 3 - Smart Sample Selection"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}