---
name: fiftyone-model-evaluation
description: Evaluate model predictions against ground truth using COCO, Open Images, or custom protocols. Use when computing mAP, precision, recall, confusion matrices, or analyzing TP/FP/FN examples for detection, classification, segmentation, or regression tasks.
---

# Evaluate Model Predictions in FiftyOne

## Key Directives

**ALWAYS follow these rules:**

### 1. Check if dataset exists and has required fields

```python
list_datasets()
set_context(dataset_name="my-dataset")
dataset_summary(name="my-dataset")
```

Verify the dataset has both **prediction** and **ground truth** fields of compatible types.

### 2. Install evaluation plugin if not available

```python
list_plugins()

# If @voxel51/evaluation is not listed:
download_plugin(url_or_repo="voxel51/fiftyone-plugins", plugin_names=["@voxel51/evaluation"])
enable_plugin(plugin_name="@voxel51/evaluation")
```

### 3. Ask user for evaluation parameters

Always confirm with the user:
- Prediction field name
- Ground truth field name
- Evaluation key (unique identifier for this evaluation)
- Evaluation method (coco, open-images, simple, top-k, binary)
- Whether to compute mAP (for detection tasks)

### 4. Launch App for evaluation operators

```python
launch_app(dataset_name="my-dataset")
```

### 5. Close app when done

```python
close_app()
```

## Workflow

### Step 1: Verify Dataset and Fields

```python
list_datasets()
set_context(dataset_name="my-dataset")
dataset_summary(name="my-dataset")
```

Review:
- Sample count
- Available label fields and their types
- Identify the prediction field (model outputs)
- Identify the ground truth field (annotations)

**Label Types and Compatible Evaluations:**

| Label Type | Evaluation Method | Supported Methods |
|------------|-------------------|-------------------|
| `Detections` | `evaluate_detections()` | coco, open-images |
| `Polylines` | `evaluate_detections()` | coco, open-images |
| `Keypoints` | `evaluate_detections()` | coco, open-images |
| `TemporalDetections` | `evaluate_detections()` | activitynet |
| `Classification` | `evaluate_classifications()` | simple, top-k, binary |
| `Segmentation` | `evaluate_segmentations()` | simple |
| `Regression` | `evaluate_regressions()` | simple |

### Step 2: Ensure Evaluation Plugin is Installed

```python
list_plugins()
```

If `@voxel51/evaluation` is not in the list:

```python
download_plugin(
    url_or_repo="voxel51/fiftyone-plugins",
    plugin_names=["@voxel51/evaluation"]
)
enable_plugin(plugin_name="@voxel51/evaluation")
```

### Step 3: Launch App

```python
launch_app(dataset_name="my-dataset")
```

### Step 4: Run Evaluation

Ask user for:
- Prediction field (`pred_field`)
- Ground truth field (`gt_field`)
- Evaluation key (`eval_key`) - must be a unique identifier
- Evaluation method

```python
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval",
        "method": "coco",
        "iou": 0.5,
        "compute_mAP": True
    }
)
```

### Step 5: View Results

After evaluation, the dataset will have new fields:
- `{eval_key}_tp` - True positive count per sample
- `{eval_key}_fp` - False positive count per sample
- `{eval_key}_fn` - False negative count per sample

**View only samples with false positives:**

```python
set_view(filters={"eval_fp": {"$gt": 0}})
```

**Use the Model Evaluation Panel in the App** to interactively explore:
- Summary metrics (mAP, precision, recall)
- Confusion matrices
- Per-class performance
- Scenario analysis
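To surface the worst offenders first, the per-sample counts can also be sorted with the Python SDK (a minimal sketch, assuming the evaluation key `eval` on the same dataset):

```python
import fiftyone as fo

dataset = fo.load_dataset("my-dataset")

# Sort samples by false positive count, highest first
worst_fp_view = dataset.sort_by("eval_fp", reverse=True)

# Inspect the worst samples in the App
session = fo.launch_app(view=worst_fp_view)
```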
### Step 6: View Evaluation Patches (TP/FP/FN)

To examine individual true positives, false positives, and false negatives, guide users to the Python SDK:

```python
import fiftyone as fo
from fiftyone import ViewField as F

dataset = fo.load_dataset("my-dataset")

# Convert to evaluation patches view
eval_patches = dataset.to_evaluation_patches("eval")

# Count by type
print(eval_patches.count_values("type"))
# Output: {'fn': 246, 'fp': 4131, 'tp': 986}

# View only false positives
fp_view = eval_patches.match(F("type") == "fp")
session = fo.launch_app(view=fp_view)
```

### Step 7: Clean Up

```python
close_app()
```

## Evaluation Types

### Detection Evaluation

For `Detections`, `Polylines`, and `Keypoints` labels.

**COCO-style (default):**

```python
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_coco",
        "method": "coco",
        "iou": 0.5,
        "classwise": True,
        "compute_mAP": True
    }
)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `iou` | float | 0.5 | IoU threshold for matching |
| `classwise` | bool | true | Only match objects with the same class |
| `compute_mAP` | bool | false | Compute mAP, mAR, and PR curves |
| `use_masks` | bool | false | Use instance masks for IoU (if available) |
| `iscrowd` | string | null | Attribute name for crowd annotations |
| `iou_threshs` | string | null | Comma-separated IoU thresholds for mAP |
| `max_preds` | int | null | Max predictions per sample for mAP |

**Open Images-style:**

```python
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_oi",
        "method": "open-images",
        "iou": 0.5
    }
)
```

Supports additional parameters:
- `pos_label_field`: Classifications specifying which classes should be evaluated
- `neg_label_field`: Classifications specifying which classes should NOT be evaluated

**ActivityNet-style (temporal):**

For `TemporalDetections` in video datasets:

```python
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_temporal",
        "method": "activitynet",
        "compute_mAP": True
    }
)
```

### Classification Evaluation

For `Classification` labels.

**Simple (default):**

```python
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_cls",
        "method": "simple"
    }
)
```

The per-sample field `{eval_key}` stores a boolean indicating whether the prediction was correct.

**Top-k:**

Requires predictions with a `logits` field:

```python
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_topk",
        "method": "top-k",
        "k": 5
    }
)
```

**Binary:**

For binary classifiers:

```python
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_binary",
        "method": "binary"
    }
)
```

The per-sample field `{eval_key}` stores "tp", "fp", "tn", or "fn".
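The binary protocol is also available through the Python SDK, which can additionally plot a precision-recall curve (a sketch; the field names and the `"other"`/`"cat"` class names are assumptions):

```python
import fiftyone as fo

dataset = fo.load_dataset("my-classification-dataset")

# classes=[neg_label, pos_label] tells the binary protocol which class is "positive"
results = dataset.evaluate_classifications(
    "predictions",
    gt_field="ground_truth",
    eval_key="eval_binary",
    method="binary",
    classes=["other", "cat"],
)

results.print_report()

# Precision-recall curve for the positive class
plot = results.plot_pr_curve()
plot.show()
```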
### Segmentation Evaluation

For `Segmentation` labels.

```python
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_seg",
        "method": "simple",
        "bandwidth": 5  # Optional: evaluate only boundary pixels
    }
)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `bandwidth` | int | null | Pixels along contours to evaluate (null = entire mask) |
| `average` | string | "micro" | Averaging strategy: micro, macro, weighted, samples |

Per-sample fields:
- `{eval_key}_accuracy`
- `{eval_key}_precision`
- `{eval_key}_recall`

### Regression Evaluation

For `Regression` labels.

```python
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_reg",
        "method": "simple",
        "metric": "squared_error"  # or "absolute_error"
    }
)
```

The per-sample field `{eval_key}` stores the error value.

Metrics available:
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- Median Absolute Error
- R² Score
- Explained Variance Score
- Max Error

## Managing Evaluations

### List Existing Evaluations

```python
execute_operator(
    operator_uri="@voxel51/evaluation/get_evaluation_info",
    params={
        "eval_key": "eval"
    }
)
```

### Load Evaluation View

Load the exact view on which an evaluation was performed:

```python
execute_operator(
    operator_uri="@voxel51/evaluation/load_evaluation_view",
    params={
        "eval_key": "eval",
        "select_fields": False
    }
)
```

### Rename Evaluation

```python
execute_operator(
    operator_uri="@voxel51/evaluation/rename_evaluation",
    params={
        "eval_key": "eval",
        "new_eval_key": "eval_v2"
    }
)
```

### Delete Evaluation

```python
execute_operator(
    operator_uri="@voxel51/evaluation/delete_evaluation",
    params={
        "eval_key": "eval"
    }
)
```

## Common Use Cases

### Use Case 1: Evaluate Object Detection Model

```python
# Verify dataset has detection fields
set_context(dataset_name="my-dataset")
dataset_summary(name="my-dataset")

# Launch app
launch_app(dataset_name="my-dataset")

# Run COCO-style evaluation with mAP
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval",
        "method": "coco",
        "iou": 0.5,
        "compute_mAP": True
    }
)

# View samples with the most false positives
set_view(filters={"eval_fp": {"$gt": 5}})
```

### Use Case 2: Compare Two Detection Models

```python
set_context(dataset_name="my-dataset")
launch_app(dataset_name="my-dataset")

# Evaluate first model
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "model_a_predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_model_a",
        "method": "coco",
        "compute_mAP": True
    }
)

# Evaluate second model
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "model_b_predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_model_b",
        "method": "coco",
        "compute_mAP": True
    }
)

# Use the Model Evaluation Panel to compare results
```

### Use Case 3: Evaluate Classification Model

```python
set_context(dataset_name="my-classification-dataset")
launch_app(dataset_name="my-classification-dataset")

# Simple classification evaluation
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_cls",
        "method": "simple"
    }
)

# View misclassified samples
set_view(filters={"eval_cls": False})
```
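The misclassified-sample view can also be built with the Python SDK when more filtering control is needed (a sketch, assuming `eval_cls` holds the per-sample boolean from a simple evaluation and `predictions` is a `Classification` field):

```python
import fiftyone as fo
from fiftyone import ViewField as F

dataset = fo.load_dataset("my-classification-dataset")

# Keep only samples whose prediction did not match the ground truth
mistakes = dataset.match(~F("eval_cls"))

# Optionally surface the most confident mistakes first
mistakes = mistakes.sort_by("predictions.confidence", reverse=True)

session = fo.launch_app(view=mistakes)
```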
### Use Case 4: Evaluate at Different IoU Thresholds

```python
set_context(dataset_name="my-dataset")
launch_app(dataset_name="my-dataset")

# Strict evaluation (IoU 0.75)
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_strict",
        "method": "coco",
        "iou": 0.75,
        "compute_mAP": True
    }
)

# Lenient evaluation (IoU 0.25)
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_lenient",
        "method": "coco",
        "iou": 0.25,
        "compute_mAP": True
    }
)
```

### Use Case 5: Evaluate Segmentation Model

```python
set_context(dataset_name="my-segmentation-dataset")
launch_app(dataset_name="my-segmentation-dataset")

# Full mask evaluation
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_seg",
        "method": "simple"
    }
)

# Boundary-only evaluation (5 pixel bandwidth)
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_seg_boundary",
        "method": "simple",
        "bandwidth": 5
    }
)
```

## Python SDK Alternative

For more control over evaluation and access to full results, guide users to the Python SDK:

```python
import fiftyone as fo
from fiftyone import ViewField as F

# Load dataset
dataset = fo.load_dataset("my-dataset")

# Evaluate detections
results = dataset.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    eval_key="eval",
    method="coco",
    iou=0.5,
    compute_mAP=True,
)

# Print classification report
results.print_report()

# Get mAP value
print(f"mAP: {results.mAP():.3f}")

# Plot confusion matrix (interactive)
plot = results.plot_confusion_matrix()
plot.show()

# Plot precision-recall curves
plot = results.plot_pr_curves(classes=["person", "car", "dog"])
plot.show()

# Convert to evaluation patches to view TP/FP/FN
eval_patches = dataset.to_evaluation_patches("eval")
print(eval_patches.count_values("type"))

# View false positives in the App
fp_view = eval_patches.match(F("type") == "fp")
session = fo.launch_app(view=fp_view)
```

**Python SDK evaluation methods:**
- `dataset.evaluate_detections()` - Object detection
- `dataset.evaluate_classifications()` - Classification
- `dataset.evaluate_segmentations()` - Semantic segmentation
- `dataset.evaluate_regressions()` - Regression

**Results object methods:**
- `results.print_report()` - Print classification report
- `results.print_metrics()` - Print aggregate metrics
- `results.mAP()` - Get mAP value (detection only)
- `results.mAR()` - Get mAR value (detection only)
- `results.plot_confusion_matrix()` - Interactive confusion matrix
- `results.plot_pr_curves()` - Precision-recall curves
- `results.plot_results()` - Scatter plot (regression only)
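Evaluation runs can also be listed, reloaded, and deleted directly from the SDK (a minimal sketch, assuming an existing run with key `eval`):

```python
import fiftyone as fo

dataset = fo.load_dataset("my-dataset")

# List all evaluation runs stored on the dataset
print(dataset.list_evaluations())

# Reload the results object for an existing run
results = dataset.load_evaluation_results("eval")
results.print_report()

# Remove a run and its per-sample fields when no longer needed
dataset.delete_evaluation("eval")
```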
## Troubleshooting

**Error: "No suitable label fields"**
- The dataset must have label fields of compatible types
- Use `dataset_summary()` to see available fields and types

**Error: "No suitable ground truth fields"**
- The ground truth field must be the same type as the prediction field
- You cannot compare `Detections` predictions with `Classification` ground truth

**Error: "Evaluation key already exists"**
- Each evaluation must have a unique key
- Delete the existing evaluation or use a different key name

**Error: "Plugin not found"**
- Install the evaluation plugin:

```python
download_plugin(url_or_repo="voxel51/fiftyone-plugins", plugin_names=["@voxel51/evaluation"])
enable_plugin(plugin_name="@voxel51/evaluation")
```

**mAP is not computed**
- Set `"compute_mAP": True` in params
- mAP requires multiple predictions per image to be meaningful

**Evaluation is slow**
- Large datasets take time
- Consider evaluating a filtered view first
- Use delegated execution for background processing

## Best Practices

1. **Use descriptive eval_keys** - `eval_yolov8_coco`, `eval_resnet_topk5`
2. **Don't overwrite evaluations** - Use unique keys for each evaluation run
3. **Compare at the same IoU** - When comparing models, use consistent IoU thresholds
4. **Check field types first** - Ensure prediction and ground truth fields are compatible
5. **Use the Model Evaluation Panel** - Interactive exploration is easier than scripting
6. **Examine patches** - Use `to_evaluation_patches()` to understand errors

## Resources

- [FiftyOne Evaluation Guide](https://docs.voxel51.com/user_guide/evaluation.html)
- [Detection Evaluation](https://docs.voxel51.com/user_guide/evaluation.html#detection-evaluation)
- [Classification Evaluation](https://docs.voxel51.com/user_guide/evaluation.html#classification-evaluation)
- [Segmentation Evaluation](https://docs.voxel51.com/user_guide/evaluation.html#semantic-segmentation-evaluation)
- [Model Evaluation Panel](https://docs.voxel51.com/user_guide/app.html#app-model-evaluation-panel)
- [Evaluation Plugin](https://github.com/voxel51/fiftyone-plugins/tree/main/plugins/evaluation)