---
name: explore-dnn-model
description: Manual invocation only; use only when the user explicitly requests `explore-dnn-model` by name. Explore how to run a given DNN model checkpoint in the current Python environment by locating weights + upstream source code, resolving dependencies with user confirmation, running reproducible experiments under `tmp/`, and producing reports about I/O contracts, timing, and profiling.
---

# Explore DNN Model

## Minimum Required Inputs (Hard Requirement)

To use this skill, the user must provide:

- A model checkpoint / model file(s) as a **local** file or directory path (it may be outside the workspace).

If the user provides only the checkpoint path (no model name, repo link, or source code), proceed by:

1) Attempting to identify the model name/family from the checkpoint file/dir itself (filenames, adjacent configs/README, embedded metadata, `state_dict` key patterns, etc.).
2) Searching for the implementation in the workspace and/or alongside the checkpoint directory (e.g., nearby Python packages, inference scripts, config files).
3) If still not found, using the best-guess model name/family to search online for the canonical implementation, then cloning the upstream source into `tmp//refs/` for investigation (prefer shallow clone; record URL + commit/tag used).

## Goals

This skill has three goals:

1) Verify that the given DNN model can work (inference or training; default focus is **inference**) in the *current* Python environment of the workspace.
2) Determine how to use it (inference or training; default is **inference**) by reading the upstream source code and producing minimal, reproducible runs.
3) Produce two reports:
   - **Experiment report** (programmatic): generated from `tmp//outputs/` with minimal/no reasoning.
   - **Stakeholder report** (agent-written): generated by the agent from the experiment report + outputs/logs, with deeper analysis and recommendations.

The reports cover:

- Input and output contracts (formats, shapes, dtypes, preprocessing/postprocessing)
- Benchmarks and performance profiling (latency/throughput/memory, device details)
- User-provided metrics/targets (e.g., accuracy, mAP, IoU, F1, latency budget), and whether/how they are met

Before changing anything, detect how the environment is managed by checking for:

- `pixi.toml` and/or `pyproject.toml` (Pixi-managed project)
- `.venv/` (venv-managed project)
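A minimal detection sketch, assuming the check runs from the workspace root (`detect_env_manager` is an illustrative helper name, not part of the skill):

```python
from pathlib import Path

def detect_env_manager(workspace: Path = Path(".")) -> list[str]:
    """Illustrative helper: report which environment managers appear to be in use."""
    managers = []
    # Pixi projects carry a pixi.toml and/or a pyproject.toml manifest; a bare
    # pyproject.toml may still warrant a look inside for a [tool.pixi] table.
    if (workspace / "pixi.toml").is_file() or (workspace / "pyproject.toml").is_file():
        managers.append("pixi")
    # A .venv/ directory at the workspace root indicates a venv-managed project.
    if (workspace / ".venv").is_dir():
        managers.append("venv")
    return managers

if __name__ == "__main__":
    found = detect_env_manager()
    # If both show up, step 0 of the workflow asks the user which one is "current".
    print(f"Detected environment manager(s): {found or ['none']}")
```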
## Dependency Policy (Ask Once, Then Apply)

If any dependency is missing:

- Do **not** install it automatically *without user confirmation*.
- List the missing packages (and versions/constraints if known) and ask the developer how to proceed.
- Provide clear options, let the developer choose, then proceed with the chosen approach.
- Once the developer confirms an approach, apply it for **all** newly required packages (no need to ask approval per package).

### Version Strategy

- First attempt: use the **latest versions** resolved by the selected package manager (`pixi`, `pip`, `uv`).
- If that fails (import/runtime errors, incompatibilities): fall back to the **specific versions/constraints** documented by the model’s upstream source code or docs.

### Preferred Options (in order)

**Pixi-managed env**

- Ask the user to choose one:
  - Modify the current Pixi environment by adding deps to the relevant manifest (`pixi.toml` / `pyproject.toml`).
  - Create a new Pixi environment specifically to test this model.
- Then use `pixi install`/`pixi run ...` to execute.
- Prefer **PyPI** packages over **conda-forge** when both are available.
- Avoid direct `pip install ...` into the Pixi environment unless the developer explicitly requests it.

**`.venv`-managed env**

Ask the user to choose one:

- Install deps via `pip` (or `uv pip`) into the current `.venv`.
- Create a new venv specifically for this model (keeps the repo venv clean).

## Inputs to Collect (ask if missing)

- Model name and/or upstream repo link and/or source code path (optional but speeds up identification)
- Model task/modality if unclear (classification/detection/segmentation/embedding/audio/video/etc.)
- Checkpoint path (file/dir) and format (`.pt`, `.pth`, `.onnx`, `.engine`, etc.)
- Any known I/O contract details (expected resolution, channel order, normalization, label mapping), if the user has them
- CPU-only requirement (only if the user explicitly requests CPU-only)
- Optional: user-provided metrics/targets to evaluate (quality and/or performance)

Notes:

- Determine framework/runtime automatically from checkpoint type + upstream code/docs + what’s available in the current Python environment.
- If hardware is unspecified, default to using hardware acceleration when available (CUDA GPU, ROCm GPU, Apple MPS, etc.). Use CPU-only only if the user requested it.
- If unspecified, the default objective is to confirm the model runs end-to-end from input → output (prefer real inputs found in the workspace; synthesize as a fallback) and record end-to-end timing.

## Core Workflow

### 0) Confirm artifacts and pick the target environment

- Confirm the minimum required inputs are present:
  - Checkpoint/model path is accessible locally (file/dir exists). It may be outside the workspace.
  - If model name/repo/source path is not provided, start by inferring it from the checkpoint and nearby files; if needed, locate it online and clone into `tmp//refs/`.
- Detect environment type:
  - If both Pixi and `.venv` exist, ask the user which one should be treated as the “current” environment for this exploration.
- Device default:
  - If the user did not request CPU-only, use hardware acceleration when available (CUDA/ROCm/MPS/etc.).

### 1) Locate and read the upstream source code/docs

- First try to find the implementation locally:
  - Search the workspace and the checkpoint directory for source code, inference scripts, configs, and docs.
  - Prefer local source if it appears to be the canonical/official implementation for the checkpoint.
- If local source is not available or is clearly incomplete, use online search to find the canonical implementation:
  - Official GitHub repo, paper, model card, or vendor docs.
  - Check out the upstream repo under `tmp//refs/` using a shallow clone (`--depth=1`), pinning a tag/commit when possible.
- Download/check out the relevant source code (pin a tag/commit when possible) and identify:
  - The exact inference entrypoints (scripts/modules), model class, preprocessing, postprocessing, and label mapping.
  - Any config files required to construct the model (YAML/JSON/TOML).
- Do not “guess” preprocessing/postprocessing: confirm from code and/or reference examples.

### 2) Derive required dependencies

Before running the model or changing the environment, determine the minimal dependencies required to run the model by using (in priority order):

- Upstream source code (setup files, `requirements*.txt`, `pyproject.toml`, import graph).
- Upstream docs/model card (pinned versions, known-good combos).
- Checkpoint type (e.g., `.onnx` implies ONNX Runtime; `.pt`/`.pth` implies PyTorch; `.engine` implies TensorRT); a minimal sketch of this heuristic follows this list.
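A sketch of the checkpoint-type heuristic, paired with an availability check against the current environment (`guess_runtime` and `is_importable` are hypothetical helper names; upstream code and docs remain authoritative):

```python
import importlib.util
from pathlib import Path

# Heuristic only: .pt/.pth files can also hold TorchScript or arbitrary pickles,
# so the guess should be confirmed against the upstream implementation.
SUFFIX_TO_RUNTIME = {
    ".onnx": "onnxruntime",
    ".pt": "torch",
    ".pth": "torch",
    ".engine": "tensorrt",
}

def guess_runtime(checkpoint_path: str) -> str:
    """Hypothetical helper: map a checkpoint suffix to a candidate runtime package."""
    return SUFFIX_TO_RUNTIME.get(Path(checkpoint_path).suffix.lower(), "unknown")

def is_importable(package: str) -> bool:
    """Check whether the package resolves in the current Python environment."""
    return importlib.util.find_spec(package) is not None

if __name__ == "__main__":
    runtime = guess_runtime("weights/best.pt")  # illustrative path, not a real artifact
    print(f"Candidate runtime: {runtime}; available here: {is_importable(runtime)}")
```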
Make a concise dependency list covering:

- Runtime/framework (e.g., `torch`, `onnxruntime`, `opencv-python`)
- Model-specific libs (e.g., `ultralytics`, `timm`, `transformers`, `mmengine`, etc.)
- Utility deps used by the official inference path (e.g., `numpy`, `Pillow`, `pyyaml`)
- Optional acceleration deps (CUDA/TensorRT) separated from the CPU baseline

### 3) Resolve missing dependencies (with user choice)

- Check whether each required dependency is available in the current environment.
- If anything is missing, ask the user which path to take:
  - **Pixi:** modify current manifest to add deps, or create a new Pixi env for this model.
  - **Venv:** install into current `.venv`, or create a new venv for this model.
- After the user confirms, apply the decision for all required packages (no per-package prompts).
- Use the **Version Strategy** above (latest first; fall back to pinned versions if needed).
- After dependency changes, run a quick smoke test:
  - Imports for the core runtime stack
  - Minimal “load model” path (without a full benchmark yet)

### 4) Ensure the checkpoint exists locally

- Do **not** download checkpoints automatically.
- Developers must provide checkpoints/model files (local file/dir paths).
- If the checkpoint is missing or only a URL is provided, ask the developer to download it and provide the local path.
- If the developer wants a conventional location, prefer `checkpoints/` (gitignored).
- Record provenance in a short note (based on what the developer provides):
  - Claimed source URL(s) or repo, version/commit/tag (if known), file size, and (if feasible) SHA256.

### 5) Create an experiment workspace under `tmp/`

Default experiment directory: `/tmp/-