---
name: run-experiment
description: Deploy and run ML experiments on local or remote GPU servers. Use when user says "run experiment", "deploy to server", "跑实验", or needs to launch training jobs.
argument-hint: [experiment-description]
allowed-tools: Bash(*), Read, Grep, Glob, Edit, Write, Agent
---

# Run Experiment

Deploy and run ML experiment: $ARGUMENTS

## Workflow

### Step 1: Detect Environment

Read the project's `CLAUDE.md` to determine the experiment environment:

- **Local GPU**: Look for local CUDA/MPS setup info
- **Remote server**: Look for SSH alias, conda env, code directory

If no server info is found in `CLAUDE.md`, ask the user.

### Step 2: Pre-flight Check

Check GPU availability on the target machine:

**Remote:**

```bash
ssh <server> nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader
```

**Local:**

```bash
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader
# or for Mac MPS:
python -c "import torch; print('MPS available:', torch.backends.mps.is_available())"
```

A GPU is considered free when `memory.used` is below 500 MiB.

### Step 3: Sync Code (Remote Only)

Check the project's `CLAUDE.md` for a `code_sync` setting. If not specified, default to `rsync`.

#### Option A: rsync (default)

Only sync necessary files — NOT data, checkpoints, or large files:

```bash
# --include='*/' is required so rsync descends into subdirectories;
# without it, --exclude='*' prunes every directory at the top level.
rsync -avz --prune-empty-dirs --include='*/' --include='*.py' --exclude='*' <local_dir>/ <server>:<remote_dir>/
```

#### Option B: git (when `code_sync: git` is set in CLAUDE.md)

Push local changes to the remote repo, then pull on the server:

```bash
# 1. Push from local
git add -A && git commit -m "sync: experiment deployment" && git push

# 2. Pull on server
ssh <server> "cd <remote_dir> && git pull"
```

Benefits: version-tracked, multi-server sync with one push, no rsync include/exclude rules needed.

### Step 3.5: W&B Integration (when `wandb: true` in CLAUDE.md)

**Skip this step entirely if `wandb` is not set or is `false` in CLAUDE.md.**

Before deploying, ensure the experiment scripts have W&B logging:
1. **Check if wandb is already in the script** — look for `import wandb` or `wandb.init`. If present, skip to Step 4.

2. **If not present, add W&B logging** to the training script:

   ```python
   import wandb

   wandb.init(project=WANDB_PROJECT, name=EXP_NAME, config={...hyperparams...})

   # Inside training loop:
   wandb.log({"train/loss": loss, "train/lr": lr, "step": step})

   # After eval:
   wandb.log({"eval/loss": eval_loss, "eval/ppl": ppl, "eval/accuracy": acc})

   # At end:
   wandb.finish()
   ```

3. **Metrics to log** (add whichever apply to the experiment):
   - `train/loss` — training loss per step
   - `train/lr` — learning rate
   - `eval/loss`, `eval/ppl`, `eval/accuracy` — eval metrics per epoch
   - `gpu/memory_used` — GPU memory (via `torch.cuda.max_memory_allocated()`)
   - `speed/samples_per_sec` — throughput
   - Any custom metrics the experiment already computes

4. **Verify wandb login on the target machine:**

   ```bash
   ssh <server> "wandb status"   # should show logged in
   # If not logged in:
   ssh <server> "wandb login <api_key>"
   ```

> The W&B project name and API key come from `CLAUDE.md` (see example below). The experiment name is auto-generated from the script name + timestamp.

### Step 4: Deploy

#### Remote (via SSH + screen)

For each experiment, create a dedicated screen session with GPU binding:

```bash
ssh <server> "screen -dmS <exp_name> bash -c '\
  eval \"\$(<conda_path>/conda shell.bash hook)\" && \
  conda activate <env> && \
  CUDA_VISIBLE_DEVICES=<gpu_id> python
```
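For reference, a hypothetical sketch of the `CLAUDE.md` section this command reads. The field names `code_sync` and `wandb` appear in the steps above; the remaining keys and all values are assumptions chosen to illustrate the layout, not a prescribed schema:

```markdown
## Experiment Environment
- server: gpu-box            <!-- SSH alias from ~/.ssh/config -->
- conda_env: myexp
- remote_dir: ~/code/myexp
- code_sync: git             <!-- or omit to default to rsync -->
- wandb: true
- wandb_project: myexp-runs
```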
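The free-GPU rule from Step 2 (`memory.used` below 500 MiB) can be automated rather than eyeballed. A minimal sketch, assuming the `nvidia-smi` CSV output shown in the pre-flight check; the function name `free_gpus` and the 500 MiB threshold default are illustrative, not part of this command:

```python
def free_gpus(csv_output: str, threshold_mib: int = 500) -> list[int]:
    """Return indices of free GPUs from the output of
    `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader`."""
    free = []
    for line in csv_output.strip().splitlines():
        index, used, _total = [field.strip() for field in line.split(",")]
        used_mib = int(used.split()[0])  # "312 MiB" -> 312
        if used_mib < threshold_mib:
            free.append(int(index))
    return free

# Example with captured nvidia-smi output:
sample = "0, 312 MiB, 81920 MiB\n1, 64512 MiB, 81920 MiB"
print(free_gpus(sample))  # -> [0]
```

Running this over the SSH output from Step 2 yields the GPU ids to plug into `CUDA_VISIBLE_DEVICES` in Step 4.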