---
name: torchforge-rl-training
description: Provides guidance for PyTorch-native agentic RL using torchforge, Meta's library separating infra from algorithms. Use when you want clean RL abstractions, easy algorithm experimentation, or scalable training with Monarch and TorchTitan.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Reinforcement Learning, PyTorch, GRPO, SFT, Monarch, TorchTitan, Meta]
dependencies: [torch>=2.9.0, torchtitan>=0.2.0, vllm, monarch]
---

# torchforge: PyTorch-Native Agentic RL Library

torchforge is Meta's PyTorch-native RL library that separates infrastructure concerns from algorithm concerns. It enables rapid RL research by letting you focus on algorithms while it handles distributed training, inference, and weight sync automatically.

## When to Use torchforge

**Choose torchforge when you need:**

- Clean separation between RL algorithms and infrastructure
- PyTorch-native abstractions (no Ray dependency)
- Easy algorithm experimentation (GRPO, DAPO, SAPO in ~100 lines)
- Scalable training with the Monarch actor system
- Integration with TorchTitan for model parallelism

**Consider alternatives when:**

- You need production-ready stability → use **miles** or **verl**
- You want Megatron-native training → use **slime**
- You need stable APIs: torchforge is experimental and its APIs may change

## Key Features

- **Algorithm isolation**: Implement RL algorithms without touching infrastructure
- **Scalability**: From a single GPU to thousands via Monarch
- **Modern stack**: TorchTitan (training), vLLM (inference), TorchStore (weight sync)
- **Loss functions**: GRPO, DAPO, CISPO, GSPO, SAPO built-in

## Architecture Overview

```
┌─────────────────────────────────────────────────────────┐
│ Application Layer (Your Code)                            │
│ - Define reward models, loss functions, sampling         │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Forge API Layer                                          │
│ - Episode, Group dataclasses                             │
│ - Service interfaces (async/await)                       │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Distributed Services (Monarch)                           │
│ ├── Trainer (TorchTitan FSDP)                            │
│ ├── Generator (vLLM inference)                           │
│ ├── Reference Model (frozen KL baseline)                 │
│ └── Reward Actors (compute rewards)                      │
└─────────────────────────────────────────────────────────┘
```

## Installation

```bash
# Create environment
conda create -n forge python=3.12
conda activate forge

# Install (handles PyTorch nightly + dependencies)
./scripts/install.sh

# Verify
python -c "import torch, forge, vllm; print('OK')"
```

### ROCm Installation

```bash
./scripts/install_rocm.sh
```

## Quick Start

### SFT Training (2+ GPUs)

```bash
python -m apps.sft.main --config apps/sft/llama3_8b.yaml
```

### GRPO Training (3+ GPUs)

```bash
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
```

---

## Workflow 1: GRPO Training for Math Reasoning

Use this workflow for training reasoning models with group-relative advantages (a minimal sketch of the advantage computation follows the checklist below).

### Prerequisites Checklist

- [ ] 3+ GPUs (GPU0: trainer, GPU1: ref_model, GPU2: generator)
- [ ] Model from HuggingFace Hub
- [ ] Training dataset (GSM8K, MATH, etc.)
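GRPO scores each sampled response against the other responses drawn for the same prompt. The snippet below is a minimal, illustrative sketch of that group-relative normalization, not torchforge's internal implementation; the function name, epsilon, and standard-deviation conventions are assumptions.

```python
import torch

def compute_group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each response's reward against the other responses
    sampled for the same prompt (one row per prompt)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each (cf. `grpo.n_samples` below),
# with binary correctness rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
advantages = compute_group_advantages(rewards)
# First row: correct answers get about +0.87, wrong ones about -0.87.
# Second row: the lone correct answer gets about +1.5, the rest about -0.5.
```

Because advantages are computed within each group, a prompt where every sample fails (or every sample succeeds) contributes no learning signal, which is why the number of samples per prompt matters in the configuration below.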
### Step 1: Create Configuration

```yaml
# config/grpo_math.yaml
model: "Qwen/Qwen2.5-7B-Instruct"

dataset:
  path: "openai/gsm8k"
  split: "train"
  streaming: true

training:
  batch_size: 4
  learning_rate: 1e-6
  seq_len: 4096
  dtype: bfloat16
  gradient_accumulation_steps: 4

grpo:
  n_samples: 8       # Responses per prompt
  clip_low: 0.2
  clip_high: 0.28
  beta: 0.1          # KL penalty coefficient
  temperature: 0.7

services:
  generator:
    procs: 1
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 1
    num_replicas: 1
    with_gpus: true
  ref_model:
    procs: 1
    num_replicas: 1
    with_gpus: true
```

### Step 2: Define Reward Function

```python
# rewards.py
import re

# Built-in reward functions live in forge.data.rewards
from forge.data.rewards import MathReward, ThinkingReward

# Or define your own reward function
class CustomMathReward:
    def __call__(self, prompt: str, response: str, target: str) -> float:
        # Extract the \boxed{...} answer from the response
        match = re.search(r'\\boxed{([^}]+)}', response)
        if not match:
            return 0.0
        answer = match.group(1).strip()
        return 1.0 if answer == target else 0.0
```

### Step 3: Launch Training

```bash
python -m apps.grpo.main --config config/grpo_math.yaml
```

### Step 4: Monitor Progress

- [ ] Check W&B dashboard for loss curves
- [ ] Verify entropy is decreasing gradually (policy becoming more deterministic, but not collapsing to zero)
- [ ] Monitor KL divergence (should stay bounded)

---

## Workflow 2: Custom Loss Function

Use this workflow to implement new RL algorithms.

### Step 1: Create Loss Class

```python
# src/forge/losses/custom_loss.py
import torch
import torch.nn as nn

class CustomLoss(nn.Module):
    def __init__(self, clip_range: float = 0.2, beta: float = 0.1):
        super().__init__()
        self.clip_range = clip_range
        self.beta = beta

    def forward(
        self,
        logprobs: torch.Tensor,
        ref_logprobs: torch.Tensor,
        advantages: torch.Tensor,
        padding_mask: torch.Tensor,
    ) -> torch.Tensor:
        # Compute importance ratio
        ratio = torch.exp(logprobs - ref_logprobs)

        # Clipped policy gradient
        clipped_ratio = torch.clamp(
            ratio, 1 - self.clip_range, 1 + self.clip_range
        )
        pg_loss = -torch.min(ratio * advantages, clipped_ratio * advantages)

        # KL penalty
        kl = ref_logprobs - logprobs

        # Apply mask and aggregate over valid tokens
        masked_loss = (pg_loss + self.beta * kl) * padding_mask
        loss = masked_loss.sum() / padding_mask.sum()
        return loss
```

### Step 2: Integrate into Application

```python
# apps/custom/main.py
from forge.losses.custom_loss import CustomLoss

loss_fn = CustomLoss(clip_range=0.2, beta=0.1)

# In training loop
loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask,
)
```

---

## Workflow 3: Multi-GPU Distributed Training

Use this workflow for scaling to multiple GPUs or nodes. A rough GPU-budget sketch follows; the full configuration comes after it.
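Before writing a distributed configuration, it helps to sanity-check the GPU budget. The helper below is a rough, illustrative calculation assuming the usual convention that tensor-, pipeline-, and data-parallel (shard) degrees multiply to give the GPUs needed per replica; the function name is hypothetical, not a torchforge API.

```python
def gpus_per_replica(tensor_parallel: int = 1,
                     pipeline_parallel: int = 1,
                     data_parallel_shard: int = 1) -> int:
    """Rough GPU count for one replica of a service, assuming the
    parallelism degrees compose multiplicatively."""
    return tensor_parallel * pipeline_parallel * data_parallel_shard

# Example: TP=2, PP=1, DP-shard=2 needs 4 GPUs per replica;
# multiply by `num_replicas` for the total a service will request.
print(gpus_per_replica(tensor_parallel=2, pipeline_parallel=1, data_parallel_shard=2))  # 4
```

If the estimate exceeds the GPUs actually available, scale a parallelism degree down or fall back to the single-GPU-per-service layout from Workflow 1.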
### Distributed Configuration

```yaml
# config/distributed.yaml
model: "meta-llama/Meta-Llama-3.1-8B-Instruct"

parallelism:
  tensor_parallel_degree: 2       # Split model across GPUs
  pipeline_parallel_degree: 1
  data_parallel_shard_degree: 2

services:
  generator:
    procs: 2          # 2 processes for TP=2
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 2
    num_replicas: 1
    with_gpus: true
```

### Launch with SLURM

```bash
# Submit job
sbatch --nodes=2 --gpus-per-node=8 run_grpo.sh
```

### Launch Locally (Multi-GPU)

```bash
# 8 GPU setup
python -m apps.grpo.main \
  --config config/distributed.yaml \
  --trainer.procs 4 \
  --generator.procs 4
```

---

## Core API Reference

### Training Batch Format

torchforge uses dictionary-based batches for training:

```python
# inputs: list of dicts with torch.Tensor values
inputs = [{"tokens": torch.Tensor}]

# targets: list of dicts with training signals
targets = [{
    "response": torch.Tensor,
    "ref_logprobs": torch.Tensor,
    "advantages": torch.Tensor,
    "padding_mask": torch.Tensor,
}]

# train_step returns loss as float
loss = trainer.train_step(inputs, targets)
```

### Completion

Generated output from vLLM:

```python
from dataclasses import dataclass

@dataclass
class Completion:
    text: str               # Generated text
    token_ids: list[int]    # Token IDs
    logprobs: list[float]   # Log probabilities
    metadata: dict          # Custom metadata
```

---

## Built-in Loss Functions

Loss functions are in the `forge.losses` module.

### SimpleGRPOLoss

```python
from forge.losses import SimpleGRPOLoss, ReinforceLoss

# SimpleGRPOLoss for GRPO training
loss_fn = SimpleGRPOLoss(beta=0.1)

# Forward pass
loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask,
)
```

### ReinforceLoss

```python
from forge.losses.reinforce_loss import ReinforceLoss

# With optional importance ratio clipping
loss_fn = ReinforceLoss(clip_ratio=0.2)
```

---

## Common Issues and Solutions

### Issue: Not Enough GPUs

**Symptoms**: "Insufficient GPU resources" error

**Solutions**:

```yaml
# Reduce service requirements
services:
  generator:
    procs: 1
    with_gpus: true
  trainer:
    procs: 1
    with_gpus: true
  # Remove ref_model (uses generator weights)
```

Or use CPU for the reference model:

```yaml
ref_model:
  with_gpus: false
```

### Issue: OOM During Generation

**Symptoms**: CUDA OOM in vLLM

**Solutions**:

```yaml
# Reduce the number of responses per prompt
grpo:
  n_samples: 4      # Reduce from 8

# Or reduce sequence length
training:
  seq_len: 2048
```

### Issue: Slow Weight Sync

**Symptoms**: Long pauses between training and generation

**Solutions**:

```bash
# Enable RDMA (if available)
export TORCHSTORE_USE_RDMA=1
```

```yaml
# Or reduce sync frequency
training:
  sync_interval: 10    # Sync every 10 steps
```

### Issue: Policy Collapse

**Symptoms**: Entropy drops to zero, reward stops improving

**Solutions**:

```yaml
# Increase KL penalty
grpo:
  beta: 0.2    # Increase from 0.1

# Or add entropy bonus
training:
  entropy_coef: 0.01
```

A minimal sketch for estimating entropy and KL from sampled logprobs is given in the appendix at the end of this document.

---

## Resources

- **Documentation**: https://meta-pytorch.org/torchforge
- **GitHub**: https://github.com/meta-pytorch/torchforge
- **Discord**: https://discord.gg/YsTYBh6PD9
- **TorchTitan**: https://github.com/pytorch/torchtitan
- **Monarch**: https://github.com/meta-pytorch/monarch
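## Appendix: Entropy and KL Monitoring (Sketch)

Step 4 of Workflow 1 and the "Policy Collapse" issue both recommend watching entropy and KL divergence. The sketch below shows one way to compute masked Monte Carlo estimates from the sampled-token logprobs; the function name and the exact estimators are illustrative assumptions, not part of the torchforge API.

```python
import torch

def policy_stats(logprobs: torch.Tensor,
                 ref_logprobs: torch.Tensor,
                 padding_mask: torch.Tensor) -> dict[str, float]:
    """Masked Monte Carlo estimates of policy entropy and KL(policy || ref)
    from the logprobs of the tokens that were actually sampled."""
    n_tokens = padding_mask.sum().clamp(min=1)
    # Entropy proxy: average negative logprob of the sampled tokens
    entropy = (-logprobs * padding_mask).sum() / n_tokens
    # KL estimate: E_policy[log p_policy - log p_ref] over sampled tokens
    kl = ((logprobs - ref_logprobs) * padding_mask).sum() / n_tokens
    return {"entropy": entropy.item(), "kl": kl.item()}

# In the training loop (names follow the batch format in the API reference):
# stats = policy_stats(logprobs, ref_logprobs, padding_mask)
# If entropy trends toward zero or KL grows without bound, raise `grpo.beta`
# or add an entropy bonus as described under "Issue: Policy Collapse".
```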