---
name: huggingface-vision-trainer
description: Trains and fine-tunes vision models for object detection (D-FINE, RT-DETR v2, DETR, YOLOS), image classification (timm models — MobileNetV3, MobileViT, ResNet, ViT/DINOv3 — plus any Transformers classifier), and SAM/SAM2 segmentation using Hugging Face Transformers on Hugging Face Jobs cloud GPUs. Covers COCO-format dataset preparation, Albumentations augmentation, mAP/mAR evaluation, accuracy metrics, SAM segmentation with bbox/point prompts, DiceCE loss, hardware selection, cost estimation, Trackio monitoring, and Hub persistence. Use when users mention training object detection, image classification, SAM, SAM2, segmentation, image matting, DETR, D-FINE, RT-DETR, ViT, timm, MobileNet, ResNet, bounding box models, or fine-tuning vision models on Hugging Face Jobs.
---

# Vision Model Training on Hugging Face Jobs

Train object detection, image classification, and SAM/SAM2 segmentation models on managed cloud GPUs. No local GPU setup required—results are automatically saved to the Hugging Face Hub.

## When to Use This Skill

Use this skill when users want to:

- Fine-tune object detection models (D-FINE, RT-DETR v2, DETR, YOLOS) on cloud GPUs or locally
- Fine-tune image classification models (timm: MobileNetV3, MobileViT, ResNet, ViT/DINOv3, or any Transformers classifier) on cloud GPUs or locally
- Fine-tune SAM or SAM2 models for segmentation / image matting using bbox or point prompts
- Train bounding-box detectors on custom datasets
- Train image classifiers on custom datasets
- Train segmentation models on custom mask datasets with prompts
- Run vision training jobs on Hugging Face Jobs infrastructure
- Ensure trained vision models are permanently saved to the Hub

## Related Skills

- **`hugging-face-jobs`** — General HF Jobs infrastructure: token authentication, hardware flavors, timeout management, cost estimation, secrets, environment variables, scheduled jobs, and result persistence. **Refer to the Jobs skill for any non-training-specific Jobs questions** (e.g., "how do secrets work?", "what hardware is available?", "how do I pass tokens?").
- **`hugging-face-model-trainer`** — TRL-based language model training (SFT, DPO, GRPO). Use that skill for text/language model fine-tuning.

## Local Script Execution

Helper scripts use PEP 723 inline dependencies.
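Each script declares its dependencies in a PEP 723 header at the top of the file, so `uv` can resolve them on the fly. A minimal sketch of what such a header looks like (illustrative only — the actual dependency lists vary per script):

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "datasets",
#     "transformers",
# ]
# ///

# ...script body follows...
```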
Run them with `uv run`:

```bash
uv run scripts/dataset_inspector.py --dataset username/dataset-name --split train
uv run scripts/estimate_cost.py --help
```

## Prerequisites Checklist

Before starting any training job, verify:

### Account & Authentication

- Hugging Face Account with [Pro](https://hf.co/pro), [Team](https://hf.co/enterprise), or [Enterprise](https://hf.co/enterprise) plan (Jobs require a paid plan)
- Authenticated login: check with `hf_whoami()` (tool) or `hf auth whoami` (terminal)
- Token has **write** permissions
- **MUST pass token in job secrets** — see directive #2 below for syntax (MCP tool vs Python API)

### Dataset Requirements — Object Detection

- Dataset must exist on the Hub
- Annotations must use the `objects` column with `bbox`, `category` (and optionally `area`) sub-fields
- Bboxes can be in **xywh (COCO)** or **xyxy (Pascal VOC)** format — auto-detected and converted
- Categories can be **integers or strings** — strings are auto-remapped to integer IDs
- `image_id` column is **optional** — generated automatically if missing
- **ALWAYS validate unknown datasets** before GPU training (see Dataset Validation section)

### Dataset Requirements — Image Classification

- Dataset must exist on the Hub
- Must have an **`image` column** (PIL images) and a **`label` column** (integer class IDs or strings)
- The label column can be `ClassLabel` type (with names) or plain integers/strings — strings are auto-remapped
- Common column names auto-detected: `label`, `labels`, `class`, `fine_label`
- **ALWAYS validate unknown datasets** before GPU training (see Dataset Validation section)

### Dataset Requirements — SAM/SAM2 Segmentation

- Dataset must exist on the Hub
- Must have an **`image` column** (PIL images) and a **`mask` column** (binary ground-truth segmentation mask)
- Must have a **prompt** — either:
  - A **`prompt` column** with JSON containing `{"bbox": [x0,y0,x1,y1]}` or `{"point": [x,y]}`
  - OR a dedicated **`bbox`** column with `[x0,y0,x1,y1]` values
  - OR a dedicated **`point`** column with `[x,y]` or `[[x,y],...]` values
- Bboxes should be in **xyxy** format (absolute pixel coordinates)
- Example dataset: `merve/MicroMat-mini` (image matting with bbox prompts)
- **ALWAYS validate unknown datasets** before GPU training (see Dataset Validation section)

### Critical Settings

- **Timeout must exceed expected training time** — the default 30 min is TOO SHORT. See directive #5 for recommended values.
- **Hub push must be enabled** — `push_to_hub=True`, `hub_model_id="username/model-name"`, token in `secrets`

## Dataset Validation

**Validate dataset format BEFORE launching GPU training to prevent the #1 cause of training failures: format mismatches.**

**ALWAYS validate for** unknown/custom datasets or any dataset you haven't trained with before. **Skip for** `cppe-5` (the default in the training script).
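Before running the full inspector, a quick local peek at the dataset schema can catch obvious column mismatches. A minimal sketch (assumes the `datasets` library is installed and the dataset is accessible; `username/dataset-name` is a placeholder):

```python
from datasets import load_dataset

# Stream a single example to check column names without downloading the whole dataset
ds = load_dataset("username/dataset-name", split="train", streaming=True)
example = next(iter(ds))
print(list(example.keys()))  # expect e.g. `image` plus `objects` / `label` / `mask`
print(example)
```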
### Running the Inspector

**Option 1: Via HF Jobs (recommended — avoids local SSL/dependency issues):**

```python
hf_jobs("uv", {
    "script": "path/to/dataset_inspector.py",
    "script_args": ["--dataset", "username/dataset-name", "--split", "train"]
})
```

**Option 2: Locally:**

```bash
uv run scripts/dataset_inspector.py --dataset username/dataset-name --split train
```

**Option 3: Via `HfApi().run_uv_job()` (if hf_jobs MCP unavailable):**

```python
from huggingface_hub import HfApi

api = HfApi()
api.run_uv_job(
    script="scripts/dataset_inspector.py",
    script_args=["--dataset", "username/dataset-name", "--split", "train"],
    flavor="cpu-basic",
    timeout=300,
)
```

### Reading Results

- **`✓ READY`** — Dataset is compatible, use directly
- **`✗ NEEDS FORMATTING`** — Needs preprocessing (mapping code provided in output)

## Automatic Bbox Preprocessing

The object detection training script (`scripts/object_detection_training.py`) automatically handles bbox format detection (xyxy→xywh conversion), bbox sanitization, `image_id` generation, string category→integer remapping, and dataset truncation. **No manual preprocessing needed** — just ensure the dataset has `objects.bbox` and `objects.category` columns.

## Training workflow

Copy this checklist and track progress:

```
Training Progress:
- [ ] Step 1: Verify prerequisites (account, token, dataset)
- [ ] Step 2: Validate dataset format (run dataset_inspector.py)
- [ ] Step 3: Ask user about dataset size and validation split
- [ ] Step 4: Prepare training script (OD: scripts/object_detection_training.py, IC: scripts/image_classification_training.py, SAM: scripts/sam_segmentation_training.py)
- [ ] Step 5: Save script locally, submit job, and report details
```

**Step 1: Verify prerequisites**

Follow the Prerequisites Checklist above.

**Step 2: Validate dataset**

Run the dataset inspector BEFORE spending GPU time. See the "Dataset Validation" section above.

**Step 3: Ask user preferences**

ALWAYS use the AskUserQuestion tool with option-style format:

```python
AskUserQuestion({
    "questions": [
        {
            "question": "Do you want to run a quick test with a subset of the data first?",
            "header": "Dataset Size",
            "options": [
                {"label": "Quick test run (10% of data)", "description": "Faster, cheaper (~30-60 min, ~$2-5) to validate setup"},
                {"label": "Full dataset (Recommended)", "description": "Complete training for best model quality"}
            ],
            "multiSelect": False
        },
        {
            "question": "Do you want to create a validation split from the training data?",
            "header": "Split data",
            "options": [
                {"label": "Yes (Recommended)", "description": "Automatically split 15% of training data for validation"},
                {"label": "No", "description": "Use existing validation split from dataset"}
            ],
            "multiSelect": False
        },
        {
            "question": "Which GPU hardware do you want to use?",
            "header": "Hardware Flavor",
            "options": [
                {"label": "t4-small ($0.40/hr)", "description": "1x T4, 16 GB VRAM — sufficient for all OD models under 100M params"},
                {"label": "l4x1 ($0.80/hr)", "description": "1x L4, 24 GB VRAM — more headroom for large images or batch sizes"},
                {"label": "a10g-large ($1.50/hr)", "description": "1x A10G, 24 GB VRAM — faster training, more CPU/RAM"},
                {"label": "a100-large ($2.50/hr)", "description": "1x A100, 80 GB VRAM — fastest, for very large datasets or image sizes"}
            ],
            "multiSelect": False
        }
    ]
})
```

**Step 4: Prepare training script**

For object detection, use [scripts/object_detection_training.py](scripts/object_detection_training.py) as the production-ready template.
For image classification, use [scripts/image_classification_training.py](scripts/image_classification_training.py). For SAM/SAM2 segmentation, use [scripts/sam_segmentation_training.py](scripts/sam_segmentation_training.py).

All scripts use `HfArgumentParser` — all configuration is passed via CLI arguments in `script_args`, NOT by editing Python variables.

For timm model details, see [references/timm_trainer.md](references/timm_trainer.md). For SAM2 training details, see [references/finetune_sam2_trainer.md](references/finetune_sam2_trainer.md).

**Step 5: Save script, submit job, and report**

1. **Save the script locally** to `submitted_jobs/` in the workspace root (create if needed) with a descriptive name like `training__.py`. Tell the user the path.
2. **Submit** using the `hf_jobs` MCP tool (preferred) or `HfApi().run_uv_job()` — see directive #1 for both methods. Pass all config via `script_args`.
3. **Report** the job ID (from the `.id` attribute), monitoring URL, Trackio dashboard (`https://huggingface.co/spaces/{username}/trackio`), expected time, and estimated cost.
4. **Wait for user** to request status checks — don't poll automatically. Training jobs run asynchronously and can take hours.

## Critical directives

These rules prevent common failures. Follow them exactly.

### 1. Job submission: `hf_jobs` MCP tool vs Python API

**`hf_jobs()` is an MCP tool, NOT a Python function.** Do NOT try to import it from `huggingface_hub`. Call it as a tool:

```
hf_jobs("uv", {"script": training_script_content, "flavor": "a10g-large", "timeout": "4h", "secrets": {"HF_TOKEN": "$HF_TOKEN"}})
```

**If the `hf_jobs` MCP tool is unavailable**, use the Python API directly:

```python
from huggingface_hub import HfApi, get_token

api = HfApi()
job_info = api.run_uv_job(
    script="path/to/training_script.py",   # file PATH, NOT content
    script_args=["--dataset_name", "cppe-5", ...],
    flavor="a10g-large",
    timeout=14400,                          # seconds (4 hours)
    env={"PYTHONUNBUFFERED": "1"},
    secrets={"HF_TOKEN": get_token()},      # MUST use get_token(), NOT "$HF_TOKEN"
)
print(f"Job ID: {job_info.id}")
```

**Critical differences between the two methods:**

| | `hf_jobs` MCP tool | `HfApi().run_uv_job()` |
|---|---|---|
| `script` param | Python code string or URL (NOT local paths) | File path to `.py` file (NOT content) |
| Token in secrets | `"$HF_TOKEN"` (auto-replaced) | `get_token()` (actual token value) |
| Timeout format | String (`"4h"`) | Seconds (`14400`) |

**Rules for both methods:**

- The training script MUST include PEP 723 inline metadata with dependencies
- Do NOT use `image` or `command` parameters (those belong to `run_job()`, not `run_uv_job()`)

### 2. Authentication via job secrets + explicit hub_token injection

**Job config** MUST include the token in secrets — syntax depends on submission method (see table above).

**Training script requirement:** The Transformers `Trainer` calls `create_repo(token=self.args.hub_token)` during `__init__()` when `push_to_hub=True`. The training script MUST inject `HF_TOKEN` into `training_args.hub_token` AFTER parsing args but BEFORE creating the `Trainer`. The template `scripts/object_detection_training.py` already includes this:

```python
hf_token = os.environ.get("HF_TOKEN")
if training_args.push_to_hub and not training_args.hub_token:
    if hf_token:
        training_args.hub_token = hf_token
```

If you write a custom script, you MUST include this token injection before the `Trainer(...)` call.

- Do NOT call `login()` in custom scripts unless replicating the full pattern from `scripts/object_detection_training.py`
- Do NOT rely on implicit token resolution (`hub_token=None`) — unreliable in Jobs
- See the `hugging-face-jobs` skill → *Token Usage Guide* for full details
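In a custom script it can also help to fail fast when the secret is missing, before any GPU time is spent. A minimal, hypothetical helper (not part of the template; `require_hub_token` is an illustrative name):

```python
import os
import sys


def require_hub_token(push_to_hub: bool, hub_token: str | None) -> str | None:
    """Return a usable Hub token, or abort before training so the failure is cheap."""
    token = hub_token or os.environ.get("HF_TOKEN")
    if push_to_hub and not token:
        sys.exit("push_to_hub is set but no HF_TOKEN secret or --hub_token was provided; "
                 "Hub push would fail with 401.")
    return token
```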
### 3. JobInfo attribute

Access the job identifier using `.id` (NOT `.job_id` or `.name` — these don't exist):

```python
job_info = api.run_uv_job(...)  # or hf_jobs("uv", {...})
job_id = job_info.id            # Correct -- returns a string like "687fb701029421ae5549d998"
```

### 4. Required training flags and HfArgumentParser boolean syntax

`scripts/object_detection_training.py` uses `HfArgumentParser` — all config is passed via `script_args`. Boolean arguments have two syntaxes:

- **`bool` fields** (e.g., `push_to_hub`, `do_train`): use as bare flags (`--push_to_hub`) or negate with the `--no_` prefix (`--no_remove_unused_columns`)
- **`Optional[bool]` fields** (e.g., `greater_is_better`): MUST pass an explicit value (`--greater_is_better True`). Bare `--greater_is_better` causes `error: expected one argument`

Required flags for object detection:

```
--no_remove_unused_columns     # MUST: preserves image column for pixel_values
--no_eval_do_concat_batches    # MUST: images have different numbers of target boxes
--push_to_hub                  # MUST: environment is ephemeral
--hub_model_id username/model-name
--metric_for_best_model eval_map
--greater_is_better True       # MUST pass "True" explicitly (Optional[bool])
--do_train
--do_eval
```

Required flags for image classification:

```
--no_remove_unused_columns     # MUST: preserves image column for pixel_values
--push_to_hub                  # MUST: environment is ephemeral
--hub_model_id username/model-name
--metric_for_best_model eval_accuracy
--greater_is_better True       # MUST pass "True" explicitly (Optional[bool])
--do_train
--do_eval
```

Required flags for SAM/SAM2 segmentation:

```
--remove_unused_columns False  # MUST: preserves input_boxes/input_points
--push_to_hub                  # MUST: environment is ephemeral
--hub_model_id username/model-name
--do_train
--prompt_type bbox             # or "point"
--dataloader_pin_memory False  # MUST: avoids pin_memory issues with custom collator
```

### 5. Timeout management

The default 30 min is TOO SHORT for object detection. Set a minimum of 2-4 hours. Add a 30% buffer for model loading, preprocessing, and Hub push.

| Scenario | Timeout |
|----------|---------|
| Quick test (100-200 images, 5-10 epochs) | 1h |
| Development (500-1K images, 15-20 epochs) | 2-3h |
| Production (1K-5K images, 30 epochs) | 4-6h |
| Large dataset (5K+ images) | 6-12h |
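When submitting via `HfApi().run_uv_job()`, the timeout is given in seconds, so it helps to convert the estimate and add the buffer explicitly. A back-of-the-envelope sketch (the 3-hour expected runtime is an assumption for illustration):

```python
# Expected wall-clock training time plus ~30% buffer, in the seconds run_uv_job() expects
expected_hours = 3
timeout_seconds = int(expected_hours * 3600 * 1.3)
print(timeout_seconds)  # 14040 — round up, e.g. pass timeout=14400 (4h)
```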
### 6. Trackio monitoring

Trackio is **always enabled** in the object detection training script — it calls `trackio.init()` and `trackio.finish()` automatically. No need to pass `--report_to trackio`. The project name is taken from `--output_dir` and the run name from `--run_name`. For image classification, pass `--report_to trackio` via `script_args` (a standard `TrainingArguments` field).

Dashboard at: `https://huggingface.co/spaces/{username}/trackio`

## Model & hardware selection

### Recommended object detection models

| Model | Params | Use case |
|-------|--------|----------|
| `ustc-community/dfine-small-coco` | 10.4M | Best starting point — fast, cheap, SOTA quality |
| `PekingU/rtdetr_v2_r18vd` | 20.2M | Lightweight real-time detector |
| `ustc-community/dfine-large-coco` | 31.4M | Higher accuracy, still efficient |
| `PekingU/rtdetr_v2_r50vd` | 43M | Strong real-time baseline |
| `ustc-community/dfine-xlarge-obj365` | 63.5M | Best accuracy (pretrained on Objects365) |
| `PekingU/rtdetr_v2_r101vd` | 76M | Largest RT-DETR v2 variant |

Start with `ustc-community/dfine-small-coco` for fast iteration. Move to D-FINE Large or RT-DETR v2 R50 for better accuracy.

### Recommended image classification models

All `timm/` models work out of the box via `AutoModelForImageClassification` (loaded as `TimmWrapperForImageClassification`). See [references/timm_trainer.md](references/timm_trainer.md) for details.

| Model | Params | Use case |
|-------|--------|----------|
| `timm/mobilenetv3_small_100.lamb_in1k` | 2.5M | Ultra-lightweight — mobile/edge, fastest training |
| `timm/mobilevit_s.cvnets_in1k` | 5.6M | Mobile transformer — good accuracy/speed trade-off |
| `timm/resnet50.a1_in1k` | 25.6M | Strong CNN baseline — reliable, well-studied |
| `timm/vit_base_patch16_dinov3.lvd1689m` | 86.6M | Best accuracy — DINOv3 self-supervised ViT |

Start with `timm/mobilenetv3_small_100.lamb_in1k` for fast iteration. Move to `timm/resnet50.a1_in1k` or `timm/vit_base_patch16_dinov3.lvd1689m` for better accuracy.

### Recommended SAM/SAM2 segmentation models

| Model | Params | Use case |
|-------|--------|----------|
| `facebook/sam2.1-hiera-tiny` | 38.9M | Fastest SAM2 — good for quick experiments |
| `facebook/sam2.1-hiera-small` | 46.0M | Best starting point — good quality/speed balance |
| `facebook/sam2.1-hiera-base-plus` | 80.8M | Higher capacity for complex segmentation |
| `facebook/sam2.1-hiera-large` | 224.4M | Best SAM2 accuracy — requires more VRAM |
| `facebook/sam-vit-base` | 93.7M | Original SAM — ViT-B backbone |
| `facebook/sam-vit-large` | 312.3M | Original SAM — ViT-L backbone |
| `facebook/sam-vit-huge` | 641.1M | Original SAM — ViT-H, best SAM v1 accuracy |

Start with `facebook/sam2.1-hiera-small` for fast iteration. SAM2 models are generally more efficient than SAM v1 at similar quality. Only the mask decoder is trained by default (vision and prompt encoders are frozen).

### Hardware recommendation

All recommended OD and IC models are under 100M params — **`t4-small` (16 GB VRAM, $0.40/hr) is sufficient for all of them.** Image classification models are generally smaller and faster than object detection models — `t4-small` handles even ViT-Base comfortably.

For SAM2 models up to `hiera-base-plus`, `t4-small` is sufficient since only the mask decoder is trained. For `sam2.1-hiera-large` or SAM v1 models, use `l4x1` or `a10g-large`.

Only upgrade if you hit OOM from large batch sizes — reduce the batch size first before switching hardware. Common upgrade path: `t4-small` → `l4x1` ($0.80/hr, 24 GB) → `a10g-large` ($1.50/hr, 24 GB).

For the full hardware flavor list, refer to the `hugging-face-jobs` skill. For cost estimation, run `scripts/estimate_cost.py`.
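A rough cost check before submitting is just the hourly rate times the expected wall-clock time. Simple arithmetic using the rates above (the 4-hour runtime is an assumption for illustration):

```python
rate_per_hour = 0.40   # t4-small, from the hardware options above
expected_hours = 4     # assumed full run including preprocessing and Hub push
print(f"~${rate_per_hour * expected_hours:.2f}")  # ~$1.60
```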
## Quick start — Object Detection

The `script_args` below are the same for both submission methods. See directive #1 for the critical differences between them.

```python
OD_SCRIPT_ARGS = [
    "--model_name_or_path", "ustc-community/dfine-small-coco",
    "--dataset_name", "cppe-5",
    "--image_square_size", "640",
    "--output_dir", "dfine_finetuned",
    "--num_train_epochs", "30",
    "--per_device_train_batch_size", "8",
    "--learning_rate", "5e-5",
    "--eval_strategy", "epoch",
    "--save_strategy", "epoch",
    "--save_total_limit", "2",
    "--load_best_model_at_end",
    "--metric_for_best_model", "eval_map",
    "--greater_is_better", "True",
    "--no_remove_unused_columns",
    "--no_eval_do_concat_batches",
    "--push_to_hub",
    "--hub_model_id", "username/model-name",
    "--do_train",
    "--do_eval",
]
```

```python
from huggingface_hub import HfApi, get_token

api = HfApi()
job_info = api.run_uv_job(
    script="scripts/object_detection_training.py",
    script_args=OD_SCRIPT_ARGS,
    flavor="t4-small",
    timeout=14400,
    env={"PYTHONUNBUFFERED": "1"},
    secrets={"HF_TOKEN": get_token()},
)
print(f"Job ID: {job_info.id}")
```

### Key OD `script_args`

- `--model_name_or_path` — recommended: `"ustc-community/dfine-small-coco"` (see model table above)
- `--dataset_name` — the Hub dataset ID
- `--image_square_size` — 480 (fast iteration) or 800 (better accuracy)
- `--hub_model_id` — `"username/model-name"` for Hub persistence
- `--num_train_epochs` — 30 typical for convergence
- `--train_val_split` — fraction to split for validation (default 0.15), set if dataset lacks a validation split
- `--max_train_samples` — truncate training set (useful for quick test runs, e.g. `"785"` for ~10% of a 7.8K dataset)
- `--max_eval_samples` — truncate evaluation set

## Quick start — Image Classification

```python
IC_SCRIPT_ARGS = [
    "--model_name_or_path", "timm/mobilenetv3_small_100.lamb_in1k",
    "--dataset_name", "ethz/food101",
    "--output_dir", "food101_classifier",
    "--num_train_epochs", "5",
    "--per_device_train_batch_size", "32",
    "--per_device_eval_batch_size", "32",
    "--learning_rate", "5e-5",
    "--eval_strategy", "epoch",
    "--save_strategy", "epoch",
    "--save_total_limit", "2",
    "--load_best_model_at_end",
    "--metric_for_best_model", "eval_accuracy",
    "--greater_is_better", "True",
    "--no_remove_unused_columns",
    "--push_to_hub",
    "--hub_model_id", "username/food101-classifier",
    "--do_train",
    "--do_eval",
]
```

```python
from huggingface_hub import HfApi, get_token

api = HfApi()
job_info = api.run_uv_job(
    script="scripts/image_classification_training.py",
    script_args=IC_SCRIPT_ARGS,
    flavor="t4-small",
    timeout=7200,
    env={"PYTHONUNBUFFERED": "1"},
    secrets={"HF_TOKEN": get_token()},
)
print(f"Job ID: {job_info.id}")
```

### Key IC `script_args`

- `--model_name_or_path` — any `timm/` model or Transformers classification model (see model table above)
- `--dataset_name` — the Hub dataset ID
- `--image_column_name` — column containing PIL images (default: `"image"`)
- `--label_column_name` — column containing class labels (default: `"label"`)
- `--hub_model_id` — `"username/model-name"` for Hub persistence
- `--num_train_epochs` — 3-5 typical for classification (fewer than OD)
- `--per_device_train_batch_size` — 16-64 (classification models use less memory than OD)
- `--train_val_split` — fraction to split for validation (default 0.15), set if dataset lacks a validation split
- `--max_train_samples` / `--max_eval_samples` — truncate for quick tests
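Once a classification job finishes and the model is on the Hub, it can be sanity-checked with a standard Transformers pipeline. A minimal sketch (the repo id is the `--hub_model_id` from the example above; `test.jpg` is a placeholder image path):

```python
from transformers import pipeline

# Load the fine-tuned classifier straight from the Hub
clf = pipeline("image-classification", model="username/food101-classifier")
print(clf("test.jpg")[:3])  # top predicted labels with scores
```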
"--per_device_train_batch_size", "4", "--learning_rate", "1e-5", "--logging_steps", "1", "--save_strategy", "epoch", "--save_total_limit", "2", "--remove_unused_columns", "False", "--dataloader_pin_memory", "False", "--push_to_hub", "--hub_model_id", "username/sam2-finetuned", "--do_train", "--report_to", "trackio", ] ``` ```python from huggingface_hub import HfApi, get_token api = HfApi() job_info = api.run_uv_job( script="scripts/sam_segmentation_training.py", script_args=SAM_SCRIPT_ARGS, flavor="t4-small", timeout=7200, env={"PYTHONUNBUFFERED": "1"}, secrets={"HF_TOKEN": get_token()}, ) print(f"Job ID: {job_info.id}") ``` ### Key SAM `script_args` - `--model_name_or_path` — SAM or SAM2 model (see model table above); auto-detects SAM vs SAM2 - `--dataset_name` — the Hub dataset ID (e.g., `"merve/MicroMat-mini"`) - `--prompt_type` — `"bbox"` or `"point"` — type of prompt in the dataset - `--prompt_column_name` — column with JSON-encoded prompts (default: `"prompt"`) - `--bbox_column_name` — dedicated bbox column (alternative to JSON prompt column) - `--point_column_name` — dedicated point column (alternative to JSON prompt column) - `--mask_column_name` — column with ground-truth masks (default: `"mask"`) - `--hub_model_id` — `"username/model-name"` for Hub persistence - `--num_train_epochs` — 20-30 typical for SAM fine-tuning - `--per_device_train_batch_size` — 2-4 (SAM models use significant memory) - `--freeze_vision_encoder` / `--freeze_prompt_encoder` — freeze encoder weights (default: both frozen, only mask decoder trains) - `--train_val_split` — fraction to split for validation (default 0.1) ## Checking job status **MCP tool (if available):** ``` hf_jobs("ps") # List all jobs hf_jobs("logs", {"job_id": "your-job-id"}) # View logs hf_jobs("inspect", {"job_id": "your-job-id"}) # Job details ``` **Python API fallback:** ```python from huggingface_hub import HfApi api = HfApi() api.list_jobs() # List all jobs api.get_job_logs(job_id="your-job-id") # View logs api.get_job(job_id="your-job-id") # Job details ``` ## Common failure modes ### OOM (CUDA out of memory) Reduce `per_device_train_batch_size` (try 4, then 2), reduce `IMAGE_SIZE`, or upgrade hardware. ### Dataset format errors Run `scripts/dataset_inspector.py` first. The training script auto-detects xyxy vs xywh, converts string categories to integer IDs, and adds `image_id` if missing. Ensure `objects.bbox` contains 4-value coordinate lists in absolute pixels and `objects.category` contains either integer IDs or string labels. ### Hub push failures (401) Verify: (1) job secrets include token (see directive #2), (2) script sets `training_args.hub_token` BEFORE creating the `Trainer`, (3) `push_to_hub=True` is set, (4) correct `hub_model_id`, (5) token has write permissions. ### Job timeout Increase timeout (see directive #5 table), reduce epochs/dataset, or use checkpoint strategy with `hub_strategy="every_save"`. ### KeyError: 'test' (missing test split) The object detection training script handles this gracefully — it falls back to the `validation` split. Ensure you're using the latest `scripts/object_detection_training.py`. ### Single-class dataset: "iteration over a 0-d tensor" `torchmetrics.MeanAveragePrecision` returns scalar (0-d) tensors for per-class metrics when there's only one class. The template `scripts/object_detection_training.py` handles this by calling `.unsqueeze(0)` on these tensors. Ensure you're using the latest template. 
### Poor detection performance (mAP < 0.15)

Increase epochs (30-50), ensure 500+ images, check per-class mAP for imbalanced classes, try different learning rates (1e-5 to 1e-4), increase image size.

For comprehensive troubleshooting, see [references/reliability_principles.md](references/reliability_principles.md).

## Reference files

- [scripts/object_detection_training.py](scripts/object_detection_training.py) — Production-ready object detection training script
- [scripts/image_classification_training.py](scripts/image_classification_training.py) — Production-ready image classification training script (supports timm models)
- [scripts/sam_segmentation_training.py](scripts/sam_segmentation_training.py) — Production-ready SAM/SAM2 segmentation training script (bbox & point prompts)
- [scripts/dataset_inspector.py](scripts/dataset_inspector.py) — Validate dataset format for OD, classification, and SAM segmentation
- [scripts/estimate_cost.py](scripts/estimate_cost.py) — Estimate training costs for any vision model (includes SAM/SAM2)
- [references/object_detection_training_notebook.md](references/object_detection_training_notebook.md) — Object detection training workflow, augmentation strategies, and training patterns
- [references/image_classification_training_notebook.md](references/image_classification_training_notebook.md) — Image classification training workflow with ViT, preprocessing, and evaluation
- [references/finetune_sam2_trainer.md](references/finetune_sam2_trainer.md) — SAM2 fine-tuning walkthrough with the MicroMat dataset, DiceCE loss, and Trainer integration
- [references/timm_trainer.md](references/timm_trainer.md) — Using timm models with the HF Trainer (TimmWrapper, transforms, full example)
- [references/hub_saving.md](references/hub_saving.md) — Detailed Hub persistence guide and verification checklist
- [references/reliability_principles.md](references/reliability_principles.md) — Failure prevention principles from production experience

## External links

- [Transformers Object Detection Guide](https://huggingface.co/docs/transformers/tasks/object_detection)
- [Transformers Image Classification Guide](https://huggingface.co/docs/transformers/tasks/image_classification)
- [DETR Model Documentation](https://huggingface.co/docs/transformers/model_doc/detr)
- [ViT Model Documentation](https://huggingface.co/docs/transformers/model_doc/vit)
- [HF Jobs Guide](https://huggingface.co/docs/huggingface_hub/guides/jobs) — Main Jobs documentation
- [HF Jobs Configuration](https://huggingface.co/docs/hub/en/jobs-configuration) — Hardware, secrets, timeouts, namespaces
- [HF Jobs CLI Reference](https://huggingface.co/docs/huggingface_hub/guides/cli#hf-jobs) — Command line interface
- [Object Detection Models](https://huggingface.co/models?pipeline_tag=object-detection)
- [Image Classification Models](https://huggingface.co/models?pipeline_tag=image-classification)
- [SAM2 Model Documentation](https://huggingface.co/docs/transformers/model_doc/sam2)
- [SAM Model Documentation](https://huggingface.co/docs/transformers/model_doc/sam)
- [Object Detection Datasets](https://huggingface.co/datasets?task_categories=task_categories:object-detection)
- [Image Classification Datasets](https://huggingface.co/datasets?task_categories=task_categories:image-classification)