{ "cells": [ { "cell_type": "markdown", "id": "730d412d", "metadata": {}, "source": [ "# Step 3: Using SAM 2\n", "\n", "**Segment Anything 2 (SAM 2)** is a powerful segmentation model released in July 2024 that pushes the boundaries of image and video segmentation. It brings new capabilities to computer vision applications, including the ability to generate precise masks and track objects across frames in videos using just simple prompts.\n", "\n", "In this notebook, you'll learn how to:\n", "- Understand the key innovations in SAM 2\n", "- Apply SAM 2 to image datasets using bounding boxes, keypoints, or no prompts at all\n", "- Leverage SAM 2’s video segmentation and mask tracking capabilities with a single-frame prompt\n" ] }, { "cell_type": "markdown", "id": "8dbb5b2f", "metadata": {}, "source": [ "## What is SAM 2?\n", "\n", "SAM 2 is the next generation of the Segment Anything Model, originally introduced by Meta in 2023. While SAM was designed for zero-shot segmentation on still images, SAM 2 adds robust video segmentation and tracking capabilities. With just a bounding box or a set of keypoints on a single frame, SAM 2 can segment and track objects across entire video sequences." ] }, { "cell_type": "markdown", "id": "9ac21217", "metadata": {}, "source": [ "## Using SAM 2 for Images\n", "\n", "SAM 2 integrates directly with the FiftyOne Model Zoo, allowing you to apply segmentation to image datasets with minimal code. Whether you're working with ground truth bounding boxes, keypoints, or want to explore automatic mask generation, FiftyOne makes the process seamless." ] }, { "cell_type": "code", "execution_count": null, "id": "fc9580f6", "metadata": {}, "outputs": [], "source": [ "import fiftyone as fo\n", "import fiftyone.zoo as foz\n", "\n", "# Load dataset\n", "dataset = foz.load_zoo_dataset(\"quickstart\", max_samples=25, shuffle=True, seed=51)\n", "\n", "# Load SAM 2 image model\n", "model = foz.load_zoo_model(\"segment-anything-2-hiera-tiny-image-torch\")\n", "\n", "# Prompt with bounding boxes\n", "dataset.apply_model(model, label_field=\"segmentations\", prompt_field=\"ground_truth\")\n", "\n", "# Launch app to view segmentations\n", "session = fo.launch_app(dataset)" ] }, { "cell_type": "markdown", "id": "1a7cc583", "metadata": {}, "source": [ "![sam2_inference](https://cdn.voxel51.com/getting_started_segmentation/notebook3/sam2_inference.webp)\n" ] }, { "cell_type": "markdown", "id": "67ae8bf1", "metadata": {}, "source": [ "## Using a custom segmentation dataset\n", "\n", "We will use a segmenation dataset with coffee beans, this is a FiftyOne Dataset. ```pjramg/my_colombian_coffe_FO```\n" ] }, { "cell_type": "code", "execution_count": null, "id": "19c7120f", "metadata": {}, "outputs": [], "source": [ "import fiftyone as fo # base library and app\n", "import fiftyone.utils.huggingface as fouh # Hugging Face integration\n", "dataset_ = fouh.load_from_hub(\"pjramg/my_colombian_coffe_FO\", persistent=True, overwrite=True)\n", "\n", "# Define the new dataset name\n", "dataset_name = \"coffee_FO_SAM2\"\n", "\n", "# Check if the dataset exists\n", "if dataset_name in fo.list_datasets():\n", " print(f\"Dataset '{dataset_name}' exists. Loading...\")\n", " dataset = fo.load_dataset(dataset_name)\n", "else:\n", " print(f\"Dataset '{dataset_name}' does not exist. Creating a new one...\")\n", " # Clone the dataset with a new name and make it persistent\n", " dataset = dataset_.clone(dataset_name, persistent=True)" ] }, { "cell_type": "markdown", "id": "6bfd92b7", "metadata": {}, "source": [ "### Prompting with ground truth information in the 100 unique samples in the dataset\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "b39652f9", "metadata": {}, "outputs": [], "source": [ "import fiftyone.brain as fob\n", "\n", "results = fob.compute_similarity(dataset, brain_key=\"img_sim2\")\n", "results.find_unique(100)" ] }, { "cell_type": "code", "execution_count": null, "id": "efe9823d", "metadata": {}, "outputs": [], "source": [ "unique_view = dataset.select(results.unique_ids)\n", "session.view = unique_view" ] }, { "cell_type": "markdown", "id": "7274544e", "metadata": {}, "source": [ "![unique_100](https://cdn.voxel51.com/getting_started_segmentation/notebook3/unique_100.webp)\n" ] }, { "cell_type": "markdown", "id": "c7e91b30", "metadata": {}, "source": [ "### Apply SAM2 just the 100 unique samples\n", "\n", "SAM 2 can also segment entire images without needing any bounding boxes or keypoints. This zero-input mode is useful for generating segmentation masks for general visual analysis or bootstrapping annotation workflows." ] }, { "cell_type": "code", "execution_count": null, "id": "dc1fd9d3", "metadata": {}, "outputs": [], "source": [ "import fiftyone.zoo as foz\n", "model = foz.load_zoo_model(\"segment-anything-2-hiera-tiny-image-torch\")\n", "\n", "# Full automatic segmentations\n", "unique_view.apply_model(model, label_field=\"sam2_results\")" ] }, { "cell_type": "markdown", "id": "9fb8fcaa", "metadata": {}, "source": [ "![unique_100_sam](https://cdn.voxel51.com/getting_started_segmentation/notebook3/unique_100_sam.webp)\n" ] }, { "cell_type": "markdown", "id": "03601790", "metadata": {}, "source": [ "In case you run out of memory, you can free up GPU space by clearing the cache with:" ] }, { "cell_type": "code", "execution_count": null, "id": "03875f91", "metadata": {}, "outputs": [], "source": [ "import torch\n", "torch.cuda.empty_cache()" ] }, { "cell_type": "markdown", "id": "2fe07ff2", "metadata": {}, "source": [ "## Bonus with SAM2" ] }, { "cell_type": "markdown", "id": "536ab8bf", "metadata": {}, "source": [ "### Prompting with Keypoints\n", "\n", "Keypoint prompts are a great alternative to bounding boxes when working with articulated objects like people. Here, we filter images to include only people, generate keypoints using a keypoint model, and then use those keypoints to prompt SAM 2 for segmentation." ] }, { "cell_type": "code", "execution_count": null, "id": "eae16d65", "metadata": {}, "outputs": [], "source": [ "from fiftyone import ViewField as F\n", "\n", "# Filter persons only\n", "dataset = foz.load_zoo_dataset(\"quickstart\")\n", "dataset = dataset.filter_labels(\"ground_truth\", F(\"label\") == \"person\")\n", "\n", "# Apply keypoint detection\n", "kp_model = foz.load_zoo_model(\"keypoint-rcnn-resnet50-fpn-coco-torch\")\n", "dataset.default_skeleton = kp_model.skeleton\n", "dataset.apply_model(kp_model, label_field=\"gt_keypoints\")\n", "session = fo.launch_app(dataset)" ] }, { "cell_type": "markdown", "id": "b11990f9", "metadata": {}, "source": [ "![key_points](https://cdn.voxel51.com/getting_started_segmentation/notebook3/key_points.webp)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "e6aa233a", "metadata": {}, "outputs": [], "source": [ "# Apply SAM 2 with keypoints\n", "model = foz.load_zoo_model(\"segment-anything-2-hiera-tiny-image-torch\")\n", "dataset.apply_model(model, label_field=\"segmentations\", prompt_field=\"gt_keypoints_keypoints\")\n", "session = fo.launch_app(dataset)" ] }, { "cell_type": "markdown", "id": "8cc0cc02", "metadata": {}, "source": [ "![key_points_sam](https://cdn.voxel51.com/getting_started_segmentation/notebook3/key_points_sam.webp)\n" ] }, { "cell_type": "markdown", "id": "cf6a3045", "metadata": {}, "source": [ "## Using SAM 2 for Video\n", "\n", "SAM 2 brings game-changing capabilities to video understanding. It can track segmentations across frames from a single bounding box or keypoint prompt provided on the first frame. With this, you can propagate high-quality segmentation masks through entire sequences automatically." ] }, { "cell_type": "code", "execution_count": null, "id": "9f51f6af", "metadata": {}, "outputs": [], "source": [ "dataset = foz.load_zoo_dataset(\"quickstart-video\", max_samples=2)\n", "from fiftyone import ViewField as F\n", "\n", "# Remove boxes after first frame\n", "(\n", " dataset\n", " .match_frames(F(\"frame_number\") > 1)\n", " .set_field(\"frames.detections\", None)\n", " .save()\n", ")\n", "session = fo.launch_app(dataset)" ] }, { "cell_type": "code", "execution_count": null, "id": "a6ca5e33", "metadata": {}, "outputs": [], "source": [ "# Apply video model with first-frame prompt\n", "model = foz.load_zoo_model(\"segment-anything-2-hiera-tiny-video-torch\")\n", "dataset.apply_model(model, label_field=\"segmentations\", prompt_field=\"frames.detections\")\n", "session = fo.launch_app(dataset)" ] }, { "cell_type": "markdown", "id": "100c8a76", "metadata": {}, "source": [ "![sam2_video](https://cdn.voxel51.com/getting_started_segmentation/notebook3/sam2_video.webp)" ] }, { "cell_type": "markdown", "id": "652223b9", "metadata": {}, "source": [ "## Available SAM 2 Models in FiftyOne\n", "\n", "**Image Models:**\n", "- `segment-anything-2-hiera-tiny-image-torch`\n", "- `segment-anything-2-hiera-small-image-torch`\n", "- `segment-anything-2-hiera-base-plus-image-torch`\n", "- `segment-anything-2-hiera-large-image-torch`\n", "\n", "**Video Models:**\n", "- `segment-anything-2-hiera-tiny-video-torch`\n", "- `segment-anything-2-hiera-small-video-torch`\n", "- `segment-anything-2-hiera-base-plus-video-torch`\n", "- `segment-anything-2-hiera-large-video-torch`" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }