{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 2: Set Up Data Splits\n", "\n", "Before iterating on annotations, you need proper data splits. Without them, you'll contaminate your evaluation and build a model that only looks good on paper.\n", "\n", "This step uses the **quickstart-groups** dataset (KITTI multimodal data with left/right cameras and point clouds) and creates:\n", "\n", "- **Test set (15%)** - Frozen. Never used for selection or training. Final evaluation only.\n", "- **Validation set (15%)** - For iteration decisions. Used to evaluate between training rounds.\n", "- **Golden QA set (5%)** - Small, heavily reviewed. Detects label drift.\n", "- **Pool (65%)** - Active learning pool. All new labels come from here.\n", "\n", "> **Critical:** Splits are created at the **group level** (scene), not sample level. This ensures the same scene stays together across all slices (left, right, pcd), preventing data leakage." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import fiftyone as fo\n", "import fiftyone.zoo as foz\n", "import random\n", "\n", "DATASET_NAME = \"annotation_tutorial\"" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Load or Create the Dataset\n", "\n", "We clone `quickstart-groups` to a persistent working dataset. This keeps your annotations separate from the zoo dataset." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load or create the dataset (idempotent - safe to rerun)\n", "if DATASET_NAME in fo.list_datasets():\n", "    print(f\"Loading existing dataset: {DATASET_NAME}\")\n", "    dataset = fo.load_dataset(DATASET_NAME)\n", "\n", "    # Check if splits already exist\n", "    existing_views = dataset.list_saved_views()\n", "    if \"pool\" in existing_views:\n", "        print(\"Splits already exist. Skipping creation.\")\n", "        SPLITS_EXIST = True\n", "    else:\n", "        SPLITS_EXIST = False\n", "else:\n", "    print(f\"Creating new dataset: {DATASET_NAME}\")\n", "    source = foz.load_zoo_dataset(\"quickstart-groups\")\n", "    dataset = source.clone(DATASET_NAME)\n", "    dataset.persistent = True\n", "    SPLITS_EXIST = False\n", "\n", "print(f\"\\nDataset: {dataset.name}\")\n", "print(f\"Media type: {dataset.media_type}\")\n", "print(f\"Group slices: {dataset.group_slices}\")\n", "print(f\"Default slice: {dataset.default_group_slice}\")\n", "print(f\"Num groups (scenes): {len(dataset.distinct('group.id'))}\")\n", "# len() on a grouped dataset only counts the active slice, so sum over every slice\n", "total_samples = sum(len(dataset.select_group_slices([s])) for s in dataset.group_slices)\n", "print(f\"Num samples (all slices): {total_samples}\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Understand the Grouped Structure\n", "\n", "The `quickstart-groups` dataset is a **grouped dataset** from KITTI:\n", "\n", "| Slice | Content | Purpose |\n", "|-------|---------|---------|\n", "| `left` | Left camera images | 2D detection annotation |\n", "| `right` | Right camera images | Stereo pair (optional use) |\n", "| `pcd` | Point cloud data | 3D cuboid annotation |\n", "\n", "Each **group** represents one scene/frame with synchronized data across all sensors."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "# Explore a single group\n", "group_ids = dataset.distinct(\"group.id\")\n", "example_group_id = group_ids[0]\n", "\n", "print(f\"Example group ID: {example_group_id}\")\n", "print(\"\\nSamples in this group:\")\n", "\n", "example_group = dataset.get_group(example_group_id)\n", "for slice_name, sample in example_group.items():\n", "    # os.path.basename avoids nesting double quotes inside the f-string,\n", "    # which is a syntax error before Python 3.12\n", "    print(f\"  {slice_name}: {os.path.basename(sample.filepath)}\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Create Splits at the Group Level\n", "\n", "**Why group-level splits?**\n", "\n", "If we split at the sample level, the same scene could end up in both train and test (just different slices). This causes data leakage - the model \"sees\" scenes at training time that appear in evaluation.\n", "\n", "By splitting at the **group level**, we ensure:\n", "- All slices from the same scene stay together\n", "- No information leaks between splits" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Assign whole groups to splits, then tag ALL samples in each group\n", "# Must iterate all slices since grouped datasets segment by slice\n", "if not SPLITS_EXIST:\n", "    # Shuffle group IDs reproducibly (the seed value is an arbitrary choice)\n", "    group_ids = sorted(dataset.distinct(\"group.id\"))\n", "    random.Random(51).shuffle(group_ids)\n", "\n", "    # 15% test, 15% val, 5% golden; the remaining ~65% form the pool\n", "    n_test = round(0.15 * len(group_ids))\n", "    n_val = round(0.15 * len(group_ids))\n", "    n_golden = round(0.05 * len(group_ids))\n", "    test_groups = group_ids[:n_test]\n", "    val_groups = group_ids[n_test:n_test + n_val]\n", "    golden_groups = group_ids[n_test + n_val:n_test + n_val + n_golden]\n", "    pool_groups = group_ids[n_test + n_val + n_golden:]\n", "\n", "    # Build group-to-split mapping\n", "    group_to_split = {}\n", "    for gid in test_groups:\n", "        group_to_split[gid] = \"split:test\"\n", "    for gid in val_groups:\n", "        group_to_split[gid] = \"split:val\"\n", "    for gid in golden_groups:\n", "        group_to_split[gid] = \"split:golden\"\n", "    for gid in pool_groups:\n", "        group_to_split[gid] = \"split:pool\"\n", "\n", "    # Tag samples across ALL slices (skip tags that are already present)\n", "    for slice_name in dataset.group_slices:\n", "        view = dataset.select_group_slices([slice_name])\n", "        for sample in view.iter_samples(autosave=True):\n", "            split_tag = group_to_split.get(sample.group.id)\n", "            if split_tag and split_tag not in sample.tags:\n", "                sample.tags.append(split_tag)\n", "\n", "    # Save views for easy access\n", "    dataset.save_view(\"test_set\", dataset.match_tags(\"split:test\"))\n", "    dataset.save_view(\"val_set\", dataset.match_tags(\"split:val\"))\n", "    dataset.save_view(\"golden_qa\", dataset.match_tags(\"split:golden\"))\n", "    dataset.save_view(\"pool\", dataset.match_tags(\"split:pool\"))\n", "\n", "    print(\"Splits created and saved as views.\")\n", "else:\n", "    print(\"Using existing splits.\")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Add annotation tracking field (idempotent)\n", "if \"annotation_status\" not in dataset.get_field_schema():\n", "    dataset.add_sample_field(\"annotation_status\", fo.StringField)\n", "    # On a grouped dataset, count() and set_values() only cover the active slice,\n", "    # so populate the field one slice at a time\n", "    for slice_name in dataset.group_slices:\n", "        slice_view = dataset.select_group_slices([slice_name])\n", "        slice_view.set_values(\"annotation_status\", [\"unlabeled\"] * len(slice_view))\n", "    print(\"Added annotation_status field (all samples start as 'unlabeled')\")\n", "else:\n", "    print(\"annotation_status field already exists.\")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Verify setup\n", "print(\"Saved views:\", dataset.list_saved_views())\n", "print()\n", "\n", "for view_name in [\"test_set\", \"val_set\", \"golden_qa\", \"pool\"]:\n", "    view = dataset.load_saved_view(view_name)\n", "    # Count unique groups in view\n", "    n_groups = len(view.distinct(\"group.id\"))\n", "    # len(view) only counts the active slice, so sum over every slice\n", "    n_samples = sum(len(view.select_group_slices([s])) for s in dataset.group_slices)\n", "    print(f\"{view_name}: {n_groups} groups, {n_samples} samples (all slices)\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Launch the App\n", "\n", "Explore your grouped dataset in the App. Notice:\n", "- The **group mode** shows synchronized samples\n", "- Use the **slice selector** to switch between left, right, and pcd\n", "- Filter by split tags to see each partition" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "session = fo.launch_app(dataset)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "You created four data splits with clear purposes:\n", "- Test (frozen), Val (iteration), Golden (QA), Pool (labeling source)\n", "- **Splits are at the group level** - same scene = same split across all slices\n", "\n", "**Artifacts:**\n", "- `annotation_tutorial` dataset (persistent clone of quickstart-groups)\n", "- Split tags: `split:test`, `split:val`, `split:golden`, `split:pool`\n", "- Saved views: `test_set`, `val_set`, `golden_qa`, `pool`\n", "- `annotation_status` field for tracking progress\n", "\n", "**Next:** Step 3 - Smart Sample Selection" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.9.0" } }, "nbformat": 4, "nbformat_minor": 4 }