---
name: coordexp-infer-eval-workflow
description: Use when launching, repairing, auditing, or summarizing CoordExp inference, confidence scoring, duplicate-control, COCO/LVIS-proxy evaluation, Oracle-K analysis, or benchmark artifact provenance.
---

# CoordExp Infer Eval Workflow

Use the repo's YAML-first production path. Do not invent one-off CLI flags when an existing config already captures the run. When running interactively, wrap noisy commands with `rtk` if useful, but keep `PYTHONPATH=.` and `conda run -n ms python`.

## Primary References

- canonical docs:
  - `docs/eval/WORKFLOW.md`
  - `docs/eval/CONTRACT.md`
  - `docs/ARTIFACTS.md`
- reusable runtime seams:
  - `src/infer/pipeline.py::run_pipeline`
  - `src/infer/engine.py::InferenceEngine.infer`
  - `src/infer/artifacts.py::build_infer_summary_payload`
  - `src/eval/detection.py::evaluate_and_save`
  - `src/eval/artifacts.py`
- standard infer entrypoint: `scripts/run_infer.py`
- confidence scoring: `scripts/postop_confidence.py`
- single-view evaluation: `scripts/evaluate_detection.py`
- one-run proxy bundle evaluation: `scripts/evaluate_proxy_detection_bundle.py`

## Default Flow

```text
config + checkpoint + input JSONL
  -> inference -> gt_vs_pred.jsonl
  -> score materialization -> gt_vs_pred_scored.jsonl
  -> evaluation and optional duplicate-control guard
  -> metrics.json / metrics_guarded.json / per_image.json / matches.jsonl / summaries
```

Score materialization depends on the coordinate surface:

- `bbox_format: xyxy` with coord tokens: run the confidence post-op.
- raw-text `xyxy` norm1000: run the confidence post-op with numeric-text span alignment; set `infer.mode: text` and `infer.pred_coord_mode: norm1000`, and do not rely on `auto`.
- `bbox_format: cxcy_logw_logh` or `cxcywh`: do not run the confidence post-op; use the unified pipeline's deterministic constant-score compatibility artifact, and only for checkpoints trained on that serialization.

## Core Commands

Always run from the repo root. Use YAML configs, not ad hoc shell overrides, for stable workflows.

Inference:

```bash
PYTHONPATH=. conda run -n ms python scripts/run_infer.py \
  --config <path/to/infer.yaml>
```

Confidence post-op:

```bash
PYTHONPATH=. conda run -n ms python scripts/postop_confidence.py \
  --config <path/to/postop.yaml>
```

Single evaluation:

```bash
PYTHONPATH=. conda run -n ms python scripts/evaluate_detection.py \
  --config <path/to/eval.yaml>
```

Oracle-K analysis:

```bash
PYTHONPATH=. conda run -n ms python scripts/evaluate_oracle_k.py \
  --config <path/to/oracle_k.yaml>
```

One-run proxy bundle evaluation:

```bash
PYTHONPATH=. conda run -n ms python scripts/evaluate_proxy_detection_bundle.py \
  --config <path/to/proxy_bundle.yaml>
```
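The three production steps compose in sequence. A minimal end-to-end sketch of the default flow, assuming hypothetical config paths; reuse existing YAMLs under `configs/infer/`, `configs/postop/`, and `configs/eval/` rather than writing new ones:

```bash
# Minimal sketch of the default flow; the three config paths below are
# hypothetical placeholders, not real files in the repo.
set -e

# 1) inference -> gt_vs_pred.jsonl
PYTHONPATH=. conda run -n ms python scripts/run_infer.py \
  --config configs/infer/my_run.yaml

# 2) score materialization -> gt_vs_pred_scored.jsonl
#    (coord-token xyxy or raw-text norm1000 surfaces only; cxcywh-family
#    checkpoints skip this and use the constant-score compatibility artifact)
PYTHONPATH=. conda run -n ms python scripts/postop_confidence.py \
  --config configs/postop/my_run.yaml

# 3) evaluation -> metrics.json / per_image.json
#    (plus guarded companions when duplicate_control.enabled: true)
PYTHONPATH=. conda run -n ms python scripts/evaluate_detection.py \
  --config configs/eval/my_run.yaml
```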
## COCO + LVIS Proxy Workflow

For COCO runs trained with LVIS proxy supervision:

1. Infer once.
2. Score once.
3. Evaluate three GT views from the same scored artifact.

The standard view labels are:

- `coco_real`: original COCO GT only; this is the benchmark-aligned headline.
- `coco_real_strict`: COCO GT plus strict same-extent LVIS proxies.
- `coco_real_strict_plausible`: broad analysis view; useful for recall, least comparable to standard COCO.

Use existing configs under `configs/infer/`, `configs/postop/`, `configs/eval/`, and `configs/bench/`. Only reach for old concrete configs such as the COCO-1024 `val_200_lvis_proxy_*` set when the user is working on that exact historical run:

- `configs/infer/coco_1024/val_200_lvis_proxy_merged.yaml`
- `configs/postop/coco_1024/val_200_lvis_proxy_merged.yaml`
- `configs/eval/coco_1024/val_200_lvis_proxy_bundle.yaml`

Expected bundle outputs:

- `/eval_coco_real/`
- `/eval_coco_real_strict/`
- `/eval_coco_real_strict_plausible/`
- `/proxy_eval_bundle_summary.json`

## What To Verify

Before launch:

- infer config points to the intended `gt_jsonl`
- infer config uses the intended checkpoint or adapter shorthand
- root image directories resolve from config/provenance, especially when running from `temp/` or sharded work dirs
- prompt controls match training when required:
  - `infer.prompt_variant`
  - `infer.object_field_order`
  - `infer.object_ordering`
- coordinate surface is intentional:
  - `infer.mode`
  - `infer.pred_coord_mode`
  - `infer.bbox_format`
- benchmark scope is explicit:
  - dataset path
  - slice such as `val200`, `limit=200`, or full-val
  - decoding knobs and GPU launch shape when reporting timing

After inference:

- `/summary.json` exists
- `/gt_vs_pred.jsonl` exists
- `/resolved_config.json` exists when using the YAML pipeline
- `resolved_config.path` is present next to `gt_vs_pred.jsonl` when downstream jobs need to recover the authoritative config
- prompt/order settings in `summary.json` and `resolved_config.json` match the config

After confidence post-op:

- `/confidence_postop_summary.json` exists
- `/pred_confidence.jsonl` exists for confidence-scored paths
- `/gt_vs_pred_scored.jsonl` exists

After non-canonical constant-score compatibility scoring:

- `/gt_vs_pred_scored.jsonl` exists
- do not expect `pred_confidence.jsonl` or `confidence_postop_summary.json`

After evaluation:

- raw metrics exist:
  - `metrics.json`
  - `per_image.json`
- guarded companions exist when `duplicate_control.enabled: true`:
  - `metrics_guarded.json`
  - `per_image_guarded.json`
  - `duplicate_guard_report.json`
- bundle summary exists when using proxy bundle eval:
  - `/proxy_eval_bundle_summary.json`
  - each expected eval directory exists
  - each eval directory contains `metrics.json`
- use the summary JSON as the default source for reporting cross-view metrics

For long or sharded runs:

- verify top-level merged artifacts and summaries, not just per-shard logs
- rerun only failed or missing shards when possible
- preserve canonical image roots or rewrite them explicitly before launching from scratch space

## Reporting Guidance

When the user asks for performance:

- report `bbox_AP`, `bbox_AP50`, `bbox_AP75`, and the main F1-ish metric if available
- lead with `coco_real`
- clearly label `strict` and `strict_plausible` as additive proxy views
- state scope before comparing numbers: `val200`, `limit=200`, first-200, full-val, proxy view, raw-text vs coord-token, checkpoint id, and repetition penalty if relevant
- mention GT counts, kept/total prediction counts, or scorer repairs when they materially explain score shifts
- compare throughput only across compatible launch shapes; GPU count differences can make timing non-comparable even when accuracy is comparable
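For cross-view reporting, a hedged helper sketch: it checks the expected bundle layout, then prints the headline numbers per view, leading with `coco_real`. It assumes `jq` is available, `BUNDLE_DIR` is a hypothetical stand-in for the real proxy-bundle output root, and each view's `metrics.json` exposes keys under the names used in the list above; verify against the actual artifact before relying on it.

```bash
# Hypothetical bundle root; substitute the real proxy-bundle output directory.
BUNDLE_DIR=outputs/proxy_bundle_example

# The bundle summary JSON is the default reporting source; this sketch falls
# back to the per-view metrics.json files.
test -f "$BUNDLE_DIR/proxy_eval_bundle_summary.json" || echo "missing bundle summary"

# Lead with coco_real; label the strict views as additive proxy views.
for view in coco_real coco_real_strict coco_real_strict_plausible; do
  m="$BUNDLE_DIR/eval_${view}/metrics.json"
  [ -f "$m" ] || { echo "missing: $m"; continue; }
  # Assumes metrics.json carries bbox_AP / bbox_AP50 / bbox_AP75 keys.
  printf '%-28s ' "$view"
  jq -r '[.bbox_AP, .bbox_AP50, .bbox_AP75] | @tsv' "$m"
done
```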
## Failure Modes

- If `metrics: both` on a COCO proxy artifact seems to trigger LVIS-federated assumptions, inspect `src/eval/detection.py` routing and verify dataset-policy detection before trusting the output.
- If eval behavior is unclear, inspect `src/eval/detection.py::EvalOptions` and `src/eval/detection.py::evaluate_and_save`; `scripts/evaluate_detection.py` is a wrapper.
- If infer behavior is unclear, inspect `src/infer/pipeline.py::run_pipeline` and `src/infer/engine.py::InferenceEngine.infer`; `scripts/run_infer.py` is a wrapper.
- If images cannot be re-opened for visualization from derived artifacts, inspect `provenance.source_jsonl_dir` in the canonical visualization resource.
- If proxy-expanded GT counts look wrong, validate `metadata.coordexp_proxy_supervision.object_supervision` and the proxy-tier split before blaming the evaluator.
- If raw-text predictions appear to collapse after scoring, verify the confidence post-op used the numeric-text path instead of coord-token geometry alignment.
- If a non-canonical bbox-format result looks strong or weak, confirm the checkpoint was trained against that exact serialization before treating it as evidence.

## Avoid

- Do not re-run inference when only evaluation views changed; re-run only the eval step (see the sketch below).
- Do not compare proxy-expanded numbers against standard COCO baselines without labeling them.
- Do not compare `val200` or `limit=200` against full-val without saying so.
- Do not treat guarded metrics as a replacement for raw model-output inspection; guarded artifacts are additive post-op views.
- Do not override config semantics with ad hoc shell flags unless the user explicitly asks for a one-off debug run.
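For the first item above, a minimal sketch of an eval-only rerun, assuming the scored artifact already exists in a hypothetical `RUN_DIR` and that a real view config exists or is derived under `configs/eval/`:

```bash
# Hypothetical run directory holding the existing scored artifact.
RUN_DIR=outputs/infer_example

# Inference and scoring are unchanged, so only the eval step reruns.
test -f "$RUN_DIR/gt_vs_pred_scored.jsonl" || { echo "score step missing"; exit 1; }

PYTHONPATH=. conda run -n ms python scripts/evaluate_detection.py \
  --config configs/eval/my_run_new_view.yaml   # hypothetical view config
```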