# Orbit Examples

Launchable training recipes:

- `high_precision/`: BF16 and other high-precision training launchers.
- `low_precision/`: int4, fp8, and nvfp4 training launchers.

Each launcher is an independent bash entrypoint that defines its argument
arrays inline. The only shared code is the `scripts/lib/` utilities for CUDA
setup, private Ray lifecycle, W&B handling, eval toggles, and checkpoint preflight.
To change a recipe value (batch size, learning rate, etc.), edit the launcher
file directly.
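
As a rough illustration of that layout, the fragment below sketches how a launcher might pair inline defaults with environment overrides. The flag names and default values are hypothetical and not taken from a real Orbit launcher; only the pattern (a `${VAR:-default}` fallback feeding an inline argument array) is the point.

```bash
# Hypothetical launcher fragment: flag names and defaults are illustrative
# assumptions, not copied from a real Orbit launcher.

# Recipe values live in the launcher; an exported env var takes precedence.
ROLLOUT_BATCH_SIZE="${ROLLOUT_BATCH_SIZE:-32}"
LR="${LR:-1e-5}"

# Argument arrays are defined inline, one array per concern.
ROLLOUT_ARGS=(--rollout-batch-size "${ROLLOUT_BATCH_SIZE}")
OPTIMIZER_ARGS=(--lr "${LR}")

# Print the assembled argv, one token per line, instead of launching anything.
printf '%s\n' "${ROLLOUT_ARGS[@]}" "${OPTIMIZER_ARGS[@]}"
```

Keeping each value next to the array that consumes it is what makes a launcher readable on its own, at the cost of some repetition across recipes.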

## Running

```bash
bash examples/low_precision/run-qwen3-4b-int4-math-oft.sh
```

Cross-cutting orchestration knobs can still be overridden inline:

```bash
ENABLE_WANDB=0 bash examples/low_precision/run-qwen3-4b-int4-math-oft.sh
```

### Providing site paths

Release launchers do not assume a specific cluster filesystem layout. Provide
the site-specific paths through environment variables when launching:

| Variable | Required | Meaning |
|---|---:|---|
| `ORBIT_VENV` | Usually | Python environment containing Orbit, Megatron-Bridge, Megatron-LM, SGLang, and CUDA Python packages. |
| `CUDA_HOME` / `ORBIT_CUDA_HOME_DEFAULT` | Usually | CUDA toolkit root. |
| `TRAIN_JSONL` | Yes | Training prompt JSONL. |
| `HF_CKPT` | Yes | HuggingFace checkpoint directory. For low-precision recipes this is the quantized HF checkpoint. |
| `MEGATRON_LOAD` | Yes | Megatron torch distributed checkpoint root. `latest_checkpointed_iteration.txt` is resolved automatically. |
| `TEST_JSONL` | If eval is enabled | Eval JSONL. Set `DISABLE_EVAL=1` to skip eval. |
| `SAVE_DIR` | No | Output checkpoint directory. Defaults under `orbit_ckpts/`. |
| `RUN_LOG` | No | Launcher stdout/stderr log path. Defaults under `logs/`. |
| `STAGE_HF_CKPT_TO` | No | Optional node-local copy destination for `HF_CKPT`. |
| `STAGE_MEGATRON_CKPT_TO` | No | Optional node-local copy destination for `MEGATRON_LOAD`. |
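
A quick way to catch a missing path before launching is to check the required variables up front. The helper below is a minimal sketch; `check_required_vars` is a hypothetical name, not a function shipped in `scripts/lib/`.

```bash
# Report every listed variable that is unset or empty.
# Returns 0 when all are present, 1 otherwise.
check_required_vars() {
  local missing=0 name
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [ -z "${!name:-}" ]; then
      echo "missing required variable: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# TEST_JSONL is only needed when eval is enabled.
required=(TRAIN_JSONL HF_CKPT MEGATRON_LOAD)
if [ "${DISABLE_EVAL:-0}" != "1" ]; then
  required+=(TEST_JSONL)
fi
check_required_vars "${required[@]}" || echo "set the variables above before launching" >&2
```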

Example:

```bash
cd /path/to/orbit

export ORBIT_VENV=/path/to/orbit/.venv
export CUDA_HOME=/path/to/cuda-13.2
source examples/load_cuda13_2_orbit_env.sh

TRAIN_JSONL=/path/to/data/math/train.jsonl \
TEST_JSONL=/path/to/data/math/test.jsonl \
HF_CKPT=/path/to/hf/Qwen3-4B-Instruct-2507-quantized.w4a16 \
MEGATRON_LOAD=/path/to/megatron/checkpoints/Qwen3-4B-Instruct-2507-w4a16 \
ENABLE_WANDB=0 \
bash examples/low_precision/run-qwen3-4b-int4-math-oft.sh
```

To inspect the final Python argv without starting Ray or loading the model:

```bash
ORBIT_DRY_RUN_ARGV=1 \
TRAIN_JSONL=/path/to/data/math/train.jsonl \
TEST_JSONL=/path/to/data/math/test.jsonl \
HF_CKPT=/path/to/hf/Qwen3-4B-Instruct-2507-quantized.w4a16 \
MEGATRON_LOAD=/path/to/megatron/checkpoints/Qwen3-4B-Instruct-2507-w4a16 \
ENABLE_WANDB=0 \
bash examples/low_precision/run-qwen3-4b-int4-math-oft.sh
```

For a one-step smoke run, override the recipe schedule:

```bash
NUM_ROLLOUT=1 TOTAL_EPOCHS=1 TRAIN_ROWS=1 \
ROLLOUT_BATCH_SIZE=1 N_SAMPLES_PER_PROMPT=1 GLOBAL_BATCH_SIZE=1 \
DISABLE_EVAL=1 ENABLE_WANDB=0 \
TRAIN_JSONL=/path/to/data/math/train.jsonl \
HF_CKPT=/path/to/hf/Qwen3-4B-Instruct-2507-quantized.w4a16 \
MEGATRON_LOAD=/path/to/megatron/checkpoints/Qwen3-4B-Instruct-2507-w4a16 \
bash examples/low_precision/run-qwen3-4b-int4-math-oft.sh
```

The same overrides apply to other launchers, for example the BF16 recipe
(site-path variables omitted for brevity):

```bash
NUM_ROLLOUT=1 TOTAL_EPOCHS=1 TRAIN_ROWS=1 \
ROLLOUT_BATCH_SIZE=1 N_SAMPLES_PER_PROMPT=1 GLOBAL_BATCH_SIZE=1 \
DISABLE_EVAL=1 ENABLE_WANDB=0 \
    bash examples/high_precision/run-qwen3-4b-instruct-2507-bf16-math-oft.sh
```

## Async PEFT double buffering

Async PEFT runs can opt into two-slot adapter double buffering for distributed
SGLang rollout engines. This is controlled by `ADAPTER_DOUBLE_BUFFER=1` in the
launchers, which passes `--adapter-double-buffer` to Orbit.

Requirements:

- Use an async, non-colocated launcher: `ORBIT_ENTRYPOINT=train_async.py` and
  `ORBIT_COLOCATE=0`.
- Use PEFT: `PEFT_METHOD=oft` or `PEFT_METHOD=lora`.
- Do not enable it for colocated/sync IPC runs; colocated behavior is
  intentionally unchanged.

Example:

```bash
ADAPTER_DOUBLE_BUFFER=1 \
    bash examples/high_precision/run-qwen3-4b-instruct-2507-bf16-math-oft-async.sh
```

The default is off. In the current benchmark results, OFT async benefits from
double buffering on the tested 4B config, while LoRA async is roughly neutral.
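
The requirements listed in this section can be expressed as a small guard. This is a sketch only: `can_double_buffer` is a hypothetical helper, and treating an unset `ORBIT_COLOCATE` as "not eligible" is an assumption.

```bash
# Return 0 when the documented requirements for double buffering are met.
can_double_buffer() {
  local entrypoint="${ORBIT_ENTRYPOINT:-}"
  local colocate="${ORBIT_COLOCATE:-}"
  local peft="${PEFT_METHOD:-}"

  # Async, non-colocated launcher only.
  [ "${entrypoint}" = "train_async.py" ] || return 1
  [ "${colocate}" = "0" ] || return 1

  # PEFT must be enabled with one of the supported methods.
  case "${peft}" in
    oft|lora) return 0 ;;
    *) return 1 ;;
  esac
}

if [ "${ADAPTER_DOUBLE_BUFFER:-0}" = "1" ] && ! can_double_buffer; then
  echo "ADAPTER_DOUBLE_BUFFER=1 needs an async, non-colocated PEFT launcher" >&2
fi
```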

## Env-knob reference

> **Note:** Recipe values (batch sizes, learning rates, parallelism degrees,
> checkpoint paths, etc.) are now inlined directly in each launcher file.
> Edit the launcher to change them. The table below documents only
> cross-cutting orchestration knobs that are read by `scripts/lib/` helpers
> or that override launcher behaviour at call-time (e.g. `ENABLE_WANDB`,
> `DISABLE_EVAL`, `ORBIT_DRY_RUN_ARGV`). Knobs like `ROLLOUT_BATCH_SIZE`,
> `LR`, `OFT_BLOCK_SIZE`, etc. are still valid environment overrides but
> their defaults now live in each launcher, not in a shared env-knob ladder.

Knobs are grouped by domain. Where a default is listed, it is the
launcher-level fallback; each leaf launcher may pin its own values for the recipe.

### Recipe identity

| Knob | Meaning |
|---|---|
| `LAUNCHER_NAME` | Human-readable run name; used in W&B group and log path. |
| `PRECISION_PROFILE` | One of `bf16`, `fp8`, `int4`, `nvfp4`. Selects the preflight checks and quantization defaults applied by `scripts/lib/preflight.sh`. |
| `REQUIRE_MEGATRON_LOAD` | `1` = refuse to start without a valid Megatron distributed checkpoint at `MEGATRON_LOAD`. `0` = allow HF-only startup. |

### Model spec

| Knob | Meaning |
|---|---|
| `MODEL_ARGS_FILE` | Path to the per-model arg shim under `orbit_plugins/model_args/`. Sets `MODEL_ARGS=(...)` consumed by Megatron-Bridge. |
| `MODEL_ARGS_ROTARY_BASE` | Rotary base (theta). Qwen3 family: `1e6`. Qwen3-Instruct-2507: `5e6`. |

### Data

| Knob | Meaning |
|---|---|
| `DATASET` | Dataset folder name under `DATA_ROOT`. |
| `DATA_ROOT` | Base directory holding `<dataset>/{train,test}.jsonl`. |
| `TRAIN_JSONL` / `TEST_JSONL` | Direct overrides for the train/test JSONL paths. Slice syntax `path@[:N]` is supported. |
| `PROMPT_FILE` | Prompt file consumed by the runtime parity preflight. |
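
To make the slice syntax concrete, the sketch below splits a `path@[:N]` spec into its path and row limit. `parse_jsonl_spec` is a hypothetical helper, and reading `[:N]` as "the first N rows" (as in a Python slice) is an assumption about how the trainer interprets it.

```bash
# Split "path@[:N]" into a path line and a row-limit line.
# A plain path without a slice prints only the path.
parse_jsonl_spec() {
  local spec="$1"
  local re='^(.+)@\[:([0-9]+)\]$'
  if [[ "${spec}" =~ ${re} ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
  else
    printf '%s\n' "${spec}"
  fi
}

parse_jsonl_spec '/path/to/data/math/train.jsonl@[:8]'
```

Under that reading, `head -n 8 /path/to/data/math/train.jsonl` previews the rows the slice would keep.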

### Checkpoints

| Knob | Meaning |
|---|---|
| `HF_CKPT` | HuggingFace source checkpoint; the trainer reads its tokenizer and config from here. |
| `MEGATRON_LOAD` | Megatron distributed checkpoint root. The trainer resolves `latest_checkpointed_iteration.txt` to `iter_*` automatically. |
| `LOAD_CKPT` | What `--load` actually receives. Defaults to `MEGATRON_LOAD`, then falls back to `HF_CKPT`. |
| `SAVE_DIR` | Output checkpoint root. Created if missing. |
| `SAVE_INTERVAL` | Save (and retain) every N rollout steps. |
| `STAGE_HF_CKPT_TO` | If set, rsync `HF_CKPT` to this path on the local node before training. Useful for slow shared filesystems. |
| `STAGE_MEGATRON_CKPT_TO` | Same as above, for the Megatron checkpoint. |
| `FORCE_STAGE_HF_CKPT` / `FORCE_STAGE_MEGATRON_CKPT` | `1` = always re-rsync; `0` = reuse the staged copy when it looks valid. |
| `RUN_LOG` | tee target for the launcher's stdout/stderr. |
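
For debugging a checkpoint root by hand, the resolution of `latest_checkpointed_iteration.txt` can be approximated as follows. This sketch assumes the usual Megatron-LM layout (a tracker file holding an iteration number or `release`, and zero-padded `iter_NNNNNNN` directories); `resolve_megatron_iter_dir` is a hypothetical name, not the trainer's own code.

```bash
# Print the checkpoint directory that the tracker file under <root> points to.
resolve_megatron_iter_dir() {
  local root="$1" tracker iter
  tracker="${root}/latest_checkpointed_iteration.txt"
  if [ ! -f "${tracker}" ]; then
    echo "no tracker file under ${root}" >&2
    return 1
  fi
  iter="$(tr -d '[:space:]' < "${tracker}")"
  case "${iter}" in
    release)
      echo "${root}/release" ;;
    ''|*[!0-9]*)
      echo "unexpected tracker content: ${iter}" >&2
      return 1 ;;
    *)
      # 10# forces base 10 so a zero-padded value is not read as octal.
      printf '%s/iter_%07d\n' "${root}" "$((10#${iter}))" ;;
  esac
}
```

Calling `resolve_megatron_iter_dir "${MEGATRON_LOAD}"` and listing the printed directory is a cheap way to confirm the checkpoint is where the launcher expects it.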

### Resources

| Knob | Meaning |
|---|---|
| `GPUS_PER_NODE` | Number of GPUs the actor occupies. |
| `RAY_NUM_CPUS` | CPU resource handed to the Ray head. |

### Parallelism

| Knob | Meaning |
|---|---|
| `TENSOR_MODEL_PARALLEL_SIZE` | TP degree. |
| `PIPELINE_MODEL_PARALLEL_SIZE` | PP degree. |
| `CONTEXT_PARALLEL_SIZE` | CP degree (splits each sequence across GPUs for long contexts; distinct from `SEQUENCE_PARALLEL`). |
| `EXPERT_MODEL_PARALLEL_SIZE` | EP degree for MoE models. |
| `EXPERT_TENSOR_PARALLEL_SIZE` | TP within each expert. |
| `SEQUENCE_PARALLEL` | Truthy = pass `--sequence-parallel` to Megatron. |
| `MAX_TOKENS_PER_GPU` | Dynamic micro-batch budget per GPU; controls memory headroom vs. throughput. |
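
Several knobs in this reference are described as "truthy". The sketch below shows one way such a toggle could turn into a flag. Which values count as truthy is an assumption (the real `scripts/lib/` helpers may accept a different set), and `is_truthy` is a hypothetical name.

```bash
# Assumption: "truthy" means 1/true/yes/on, case-insensitive.
is_truthy() {
  case "$(printf '%s' "${1:-}" | tr '[:upper:]' '[:lower:]')" in
    1|true|yes|on) return 0 ;;
    *) return 1 ;;
  esac
}

# Always pass the TP degree; append the boolean flag only when requested.
PARALLEL_ARGS=(--tensor-model-parallel-size "${TENSOR_MODEL_PARALLEL_SIZE:-1}")
if is_truthy "${SEQUENCE_PARALLEL:-0}"; then
  PARALLEL_ARGS+=(--sequence-parallel)
fi

printf '%s\n' "${PARALLEL_ARGS[@]}"
```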

### Training schedule

| Knob | Meaning |
|---|---|
| `ROLLOUT_BATCH_SIZE` | Prompts sampled per rollout step. |
| `N_SAMPLES_PER_PROMPT` | Completions generated per prompt (group size for GRPO/GSPO). |
| `ROLLOUT_MAX_PROMPT_LEN` | Hard cap on prompt token count. |
| `ROLLOUT_MAX_RESPONSE_LEN` | Max generated tokens per completion. |
| `ROLLOUT_MAX_CONTEXT_LEN` | Combined prompt+response cap. |
| `ROLLOUT_TEMPERATURE` | Sampling temperature for rollouts. |
| `GLOBAL_BATCH_SIZE` | Total prompts × samples consumed per gradient step. |
| `TOTAL_EPOCHS` | Passes over the prompt dataset. |
| `NUM_ROLLOUT` | Override the auto-computed rollout-step count. |
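
The batch-size knobs interact, so it can help to work the arithmetic before editing a launcher. The values below are illustrative rather than recipe defaults, and the even split of one rollout step into gradient steps is an assumption about how the trainer consumes samples.

```bash
# Illustrative values only; real defaults live in each launcher.
ROLLOUT_BATCH_SIZE=32
N_SAMPLES_PER_PROMPT=8
GLOBAL_BATCH_SIZE=64

# Every prompt yields N_SAMPLES_PER_PROMPT completions.
samples_per_rollout=$((ROLLOUT_BATCH_SIZE * N_SAMPLES_PER_PROMPT))

# Assumption: one rollout step is split evenly into gradient steps.
if ((samples_per_rollout % GLOBAL_BATCH_SIZE != 0)); then
  echo "GLOBAL_BATCH_SIZE does not divide ${samples_per_rollout}" >&2
fi
grad_steps_per_rollout=$((samples_per_rollout / GLOBAL_BATCH_SIZE))

echo "samples per rollout step: ${samples_per_rollout}"
echo "gradient steps per rollout step: ${grad_steps_per_rollout}"
```

With these numbers, one rollout step produces 256 samples and feeds 4 gradient steps.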

### Optimizer

| Knob | Meaning |
|---|---|
| `LR` | Adam learning rate. |
| `LR_DECAY_STYLE` | One of `constant`, `cosine`, `linear`, ... |
| `WEIGHT_DECAY` | AdamW weight decay. |
| `ADAM_BETA1` / `ADAM_BETA2` | Adam betas. |
| `OPTIMIZER_CPU_OFFLOAD` | Truthy = offload optimizer state to host RAM, with overlapped D2H/H2D. |
| `USE_PRECISION_AWARE_OPTIMIZER` | Truthy = use Megatron's precision-aware optimizer. |

### RL

| Knob | Meaning |
|---|---|
| `ADVANTAGE_ESTIMATOR` | One of `grpo` (default), `gspo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline`, `on_policy_distillation`. |
| `USE_KL_LOSS` | Truthy = add KL-to-reference loss term. PEFT KL is computed with adapters disabled, so no separate ref-load is needed. |
| `KL_LOSS_COEF` | KL term weight. |
| `KL_LOSS_TYPE` | `low_var_kl`, `kl`, etc. |
| `ENTROPY_COEF` | Entropy bonus weight. |
| `EPS_CLIP` / `EPS_CLIP_HIGH` | PPO/GRPO clipping ranges. |
| `GAMMA` / `LAMBD` | PPO discount + GAE lambda (PPO only). |
| `DISABLE_GRPO_STD_NORMALIZATION` | Truthy = skip per-group std normalization in GRPO/GSPO. |

### Rollout backend (SGLang)

| Knob | Meaning |
|---|---|
| `ROLLOUT_NUM_GPUS_PER_ENGINE` | GPUs per SGLang engine instance. |
| `SGLANG_MEM_FRACTION_STATIC` | Fraction of GPU memory reserved for static allocation (model weights plus the KV cache pool). Lower it if the engine runs out of memory. |
| `SGLANG_CONTEXT_LENGTH` | Per-engine max context length. |
| `SGLANG_MAX_RUNNING_REQUESTS` | Concurrency cap. |
| `SGLANG_MAX_PREFILL_TOKENS` / `SGLANG_CHUNKED_PREFILL_SIZE` | Prefill scheduling knobs. |
| `SGLANG_ATTENTION_BACKEND` | `flashinfer` (default for INT4), `triton`, `fa3`, ... |
| `SGLANG_QUANTIZATION` | Engine-level quant scheme (`compressed-tensors`, etc.). |
| `SGLANG_EXPERT_PARALLEL_SIZE` | EP within the SGLang engine. |
| `SGLANG_DISABLE_CUDA_GRAPH` / `SGLANG_ENFORCE_EAGER` | Disable CUDA graph capture (debugging). |
| `SGLANG_FP8_GEMM_BACKEND` | FP8 GEMM kernel selection (FP8 recipes). |
| `SGLANG_TORCHAO_CONFIG` | Path to a torchao quant config (if applicable). |

### Eval

| Knob | Meaning |
|---|---|
| `DISABLE_EVAL` | `1` skips eval entirely. |
| `SKIP_EVAL_BEFORE_TRAIN` | `1` skips the step-0 eval (saves a cold-start pass). |
| `EVAL_INTERVAL` | Eval every N rollout steps. |
| `N_SAMPLES_PER_EVAL_PROMPT` | Completions per eval prompt. |
| `EVAL_MAX_RESPONSE_LEN` | Eval generation cap. |
| `EVAL_TOP_K` | Eval sampling top-k (`1` = greedy). |
| `EVAL_GENERATE_MAX_CONCURRENCY` | Concurrency cap for eval generation. |

### PEFT (OFT / LoRA)

| Knob | Meaning |
|---|---|
| `PEFT_METHOD` | `oft` or `lora`. |
| `TARGET_MODULES` | Comma-separated Megatron module names to wrap (e.g. `linear_qkv,linear_proj,linear_fc1,linear_fc2`) or `all-linear`. |
| `OFT_BLOCK_SIZE` | Rotation block size. Larger = more parameters, more expressive. Must divide each linear's hidden dim. |
| `OFT_EPS` | Numerical stability constant for OFT. |
| `OFT_COFT` | `1` enables constrained OFT (COFT), which bounds how far the learned rotation may deviate from identity. |
| `OFT_BLOCK_SHARE` | `1` ties the rotation matrix across blocks. |
| `LORA_RANK` / `LORA_ALPHA` / `LORA_DROPOUT` | LoRA hyperparameters. |
| `ADAPTER_DOUBLE_BUFFER` | `1` enables `--adapter-double-buffer` for async distributed PEFT rollout engines. Default is `0`. |
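
Because `OFT_BLOCK_SIZE` must divide each wrapped linear's dimension, a divisibility check before launch can save a failed start. The helper name and the dimensions in the example call are hypothetical; substitute the sizes from the model's config.

```bash
# Succeed only when block_size evenly divides every listed dimension.
check_oft_block_size() {
  local block_size="$1" dim
  shift
  for dim in "$@"; do
    if ((dim % block_size != 0)); then
      echo "block size ${block_size} does not divide ${dim}" >&2
      return 1
    fi
  done
  return 0
}

# Illustrative dimensions, not read from any real checkpoint.
check_oft_block_size 32 4096 11008 && echo "block size 32 is compatible"
```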

### Quantization (FP8)

| Knob | Meaning |
|---|---|
| `MEGATRON_OFT_FP8_ACTIVATION_QUANT` | Activation quant scheme for OFT under FP8 (`w8a8`, `none`). |
| `MEGATRON_KEEP_NATIVE_FP8_WEIGHTS` | Keep FP8 weights in their native dtype across save/load. |
| `MEGATRON_ACTIVATION_FP8` | Forward-pass activation FP8 mode. |
| `ROLLOUT_QUANTIZATION` | Quant scheme used by the rollout engine (`fp8`, ...). |

### Quantization (INT4)

| Knob | Meaning |
|---|---|
| `OPEN_TRAINING_INT4_FAKE_QAT_FLAG` | `1` = train with fake-quant straight-through; `0` = train in full precision and requantize at save-time. |
| `OPEN_TRAINING_INT4_GROUP_SIZE` | Group size for W4A16 (must match the HF checkpoint). |

### Parity preflight gates

| Knob | Meaning |
|---|---|
| `SKIP_FP8_CHECKPOINT_PREFLIGHT` / `SKIP_INT4_CHECKPOINT_PREFLIGHT` / `SKIP_NVFP4_CHECKPOINT_PREFLIGHT` | `1` skips the checkpoint parity preflight (use only for command-path smoke checks). |
| `RUN_FP8_RUNTIME_PREFLIGHT` / `RUN_INT4_RUNTIME_PREFLIGHT` | `1` runs the step-0 logprob parity check between Megatron and SGLang. |
| `MAX_LOGPROB_ABS_DIFF` | Per-token logprob tolerance for the runtime parity check. |
| `MIN_TOPK_AGREEMENT` | Minimum top-k token agreement fraction. |

### Megatron / TE flags

| Knob | Meaning |
|---|---|
| `MEGATRON_PATH` | Path to the Megatron-LM checkout used for `PYTHONPATH`. |
| `ATTENTION_BACKEND` | `flash`, `fused`, `unfused`. |
| `ATTENTION_SOFTMAX_IN_FP32` | Truthy = pass `--attention-softmax-in-fp32`. |
| `TRUST_REMOTE_CODE` | Truthy = allow custom HF model code. |
| `OFFLOAD_TRAIN` / `OFFLOAD_TRAIN_ASYNC` / `OFFLOAD_ROLLOUT` | Memory offload toggles. Each has an explicit `--offload-...` / `--no-offload-...` form. |
| `UPDATE_WEIGHT_BUFFER_SIZE` | Optional byte size for weight-update buckets; larger values reduce adapter sync chunk count at the cost of more temporary memory. |
| `CUDA_GRAPH_IMPL` / `CUDA_GRAPH_SCOPE` | CUDA graph capture controls. |
| `USE_TE_RNG_TRACKER` | Truthy = use TransformerEngine's RNG tracker. |
| `SKIP_NAN_CHECK_IN_LOSS_AND_GRAD` | Truthy = pass `--no-check-for-nan-in-loss-and-grad`. |

### Ray lifecycle

| Knob | Meaning |
|---|---|
| `ORBIT_RAY_LIFECYCLE` | `private` (default; reserves its own port range and tmp dir) or `legacy` (clobbers any existing Ray cluster on the node). |
| `RAY_TEMP_DIR` | Override the Ray tmp dir. |
| `RAY_HEAD_PORT` | Override the head port (private lifecycle picks one in `24000-29999` if unset). |
| `RAY_WORKER_PORT_RANGE_SIZE` | Width of the worker port range. |
| `MASTER_ADDR` | Defaults to `127.0.0.1` for single-node runs. |
| `ORBIT_PORT_LOCK_DIR` | Lock-file directory used to serialize port reservation across concurrent launchers. |
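
The idea behind lock-file port reservation can be sketched with `mkdir`, which is atomic on a local filesystem. This is an illustration under assumptions, not the actual `scripts/lib/` logic: `reserve_head_port` and the `port-<N>.lock` naming are hypothetical, and the sketch does not check whether the port is already bound by another process.

```bash
# Try to claim a port in 24000-29999 by creating a per-port lock directory.
# mkdir fails if the directory exists, so two launchers cannot claim the same port.
reserve_head_port() {
  local lock_dir="$1" attempts="${2:-50}" port i
  mkdir -p "${lock_dir}"
  for ((i = 0; i < attempts; i++)); do
    port=$((24000 + RANDOM % 6000))
    if mkdir "${lock_dir}/port-${port}.lock" 2>/dev/null; then
      echo "${port}"
      return 0
    fi
  done
  echo "could not reserve a port after ${attempts} attempts" >&2
  return 1
}
```

A caller would remove the lock directory once the run exits, for example from a `trap ... EXIT` handler.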

### Logging / observability

| Knob | Meaning |
|---|---|
| `ENABLE_WANDB` | `auto` = enabled iff `$HOME/.wandb_key` exists; `0` disables. |
| `WANDB_PROJECT` / `WANDB_GROUP` | W&B project + group names. |
| `ORBIT_LAUNCHER_XTRACE` | `1` enables `set -x` for the launcher (very noisy; for debugging only). |
| `ORBIT_DEBUG_MODE` | `rollout` runs only the rollout phase; `train` runs only the train phase. |
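
The `auto` behavior of `ENABLE_WANDB` amounts to a file-existence check. The sketch below mirrors the table entry; `resolve_wandb` is a hypothetical name, and treating any value other than `auto` or `0` as "on" is an assumption.

```bash
# Map an ENABLE_WANDB value to 1 (on) or 0 (off).
# The key path is a parameter so the logic can be exercised without touching $HOME.
resolve_wandb() {
  local mode="${1:-auto}" key_file="${2:-${HOME}/.wandb_key}"
  case "${mode}" in
    0) echo 0 ;;
    auto)
      if [ -f "${key_file}" ]; then echo 1; else echo 0; fi ;;
    *) echo 1 ;;  # assumption: any other value forces W&B on
  esac
}

echo "W&B enabled: $(resolve_wandb "${ENABLE_WANDB:-auto}")"
```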

### Tool-script knobs

These are read by `scripts/lib/tool_env.sh` (sourced by every leaf launcher
and standalone tool/parity script):

| Knob | Meaning |
|---|---|
| `ORBIT_LOAD_CUDA_MODULES` | `0` skips `module load cuda/13.2 nccl` (use when `module` is unavailable). |
| `CUDA_HOME` | Override the auto-detected CUDA root. |
| `MEGATRON_BRIDGE_ROOT` | If set, prepended to `PYTHONPATH` before `ORBIT_ROOT`. |
| `DEFAULT_OUTPUT_ROOT` | Base directory for `default_output_path` (used by conversion scripts). |
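
The `PYTHONPATH` ordering described for `MEGATRON_BRIDGE_ROOT` can be pictured as below. `build_pythonpath` is a hypothetical helper written for illustration; `tool_env.sh` may assemble the path differently.

```bash
# Build a PYTHONPATH with the bridge root (if any) ahead of the Orbit root,
# followed by whatever PYTHONPATH was already set.
build_pythonpath() {
  local orbit_root="$1" bridge_root="${2:-}" existing="${3:-}" result
  result="${orbit_root}"
  if [ -n "${bridge_root}" ]; then
    result="${bridge_root}:${result}"
  fi
  if [ -n "${existing}" ]; then
    result="${result}:${existing}"
  fi
  echo "${result}"
}

build_pythonpath /path/to/orbit "${MEGATRON_BRIDGE_ROOT:-}" "${PYTHONPATH:-}"
```

Placing the bridge root first means a local Megatron-Bridge checkout shadows any copy installed in the environment.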