---
name: practitioner-guide-multi-turn-agentic-rl
title: "A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning"
version: 0.0.2
engine: skillxiv-v0.0.2-claude-opus-4.6
license: MIT
url: "https://arxiv.org/abs/2510.01132"
keywords: [multi-turn RL, agent training, policy optimization, reward engineering, environment design]
description: "Train LLM agents via multi-turn reinforcement learning by systematically optimizing environment complexity, reward signals, and policy initialization. Use curriculum learning, dense verified rewards, and domain-specific SFT for reliable agent convergence across TextWorld, ALFWorld, and SWE-Gym benchmarks."
---

# A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning

## Core Concept

Multi-turn agentic RL trains language models as interactive agents through reinforcement learning across extended task sequences. The critical insight is that performance depends on coordinated design choices across three pillars: environment specification, reward formulation, and policy initialization, not on isolated optimization of any single component.

## Architecture Overview

- **POMDP Formulation**: Agents generate natural-language commands that are executed at episode boundaries (marked by dedicated command tokens), creating multi-step trajectories with sparse rewards
- **Token-Level Credit Assignment**: TD errors and GAE advantages flow through all trajectory tokens even though rewards appear only at completion, enabling value bootstrapping
- **Multi-Domain Evaluation**: TextWorld (navigation), ALFWorld (household tasks), and SWE-Gym (code generation) reveal algorithm-environment interactions
- **Curriculum-Aware Training**: Skill transfer from simple to complex environments outperforms single-complexity training on harder tasks

## Implementation Steps

### 1. Environment Design with Curriculum Learning

Start with reduced-complexity environments to establish foundational agent behaviors before increasing difficulty. The paper identifies three independent complexity dimensions: spatial (room count), object (entity count), and solution (quest length).

```python
# Configure environment complexity levels
environments = {
    'simple': {'rooms': 2, 'objects': 3, 'quest_length': 2},
    'medium': {'rooms': 4, 'objects': 8, 'quest_length': 4},
    'hard': {'rooms': 8, 'objects': 12, 'quest_length': 6}
}

# Curriculum strategy: begin on simple, evaluate on complex
training_curriculum = ['simple', 'medium']  # 50% on each
evaluation_curriculum = ['hard']

# Object complexity proves harder than spatial in LLM domains;
# prioritize the object-manipulation curriculum
```

### 2. Reward Engineering for Multi-turn Tasks

Dense, verified rewards significantly accelerate training compared to sparse binary signals. Prioritize execution-based feedback (unit tests, environment validation) over learned reward models.

```python
# Sparse reward baseline: signal only at the episode boundary
def sparse_reward(episode_success):
    return 1.0 if episode_success else 0.0

# Dense reward: ratio-based per-turn signal
def dense_reward(partial_progress, total_steps):
    # Intermediate progress rewards smooth learning
    completion_ratio = partial_progress / total_steps
    return completion_ratio  # signal every step

# Verified reward > model-based reward
def verified_reward(episode_output, unit_tests):
    # run_tests is a placeholder for an execution harness that returns
    # the subset of unit_tests the output passes
    passed_tests = run_tests(episode_output)
    return len(passed_tests) / len(unit_tests)  # ratio-verified
```

Benchmark results show dense rewards achieve 58% success vs. 41% for sparse rewards (TextWorld, PPO).
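
To make the per-turn signal concrete, the sketch below attaches the ratio-based dense reward to a multi-turn rollout. `ToyEnv`, its milestone bookkeeping, and the `policy` callable are illustrative assumptions, not APIs from the paper or from any of the benchmarks.

```python
# Minimal sketch: per-turn dense rewards on a multi-turn rollout.
# ToyEnv is a hypothetical stand-in for a TextWorld/ALFWorld-style
# environment; real environments expose progress differently.

class ToyEnv:
    """Toy environment that counts completed quest milestones."""

    def __init__(self, quest_length=4):
        self.quest_length = quest_length
        self.milestones_hit = 0

    def step(self, command):
        # Pretend every non-empty command completes one milestone.
        if command.strip():
            self.milestones_hit = min(self.milestones_hit + 1, self.quest_length)
        done = self.milestones_hit == self.quest_length
        return self.milestones_hit, done


def rollout_with_dense_rewards(env, policy, max_turns=8):
    """Collect (command, reward) pairs; the reward each turn is the
    completion ratio, mirroring dense_reward above."""
    trajectory = []
    for _ in range(max_turns):
        command = policy()                     # agent emits a natural-language command
        progress, done = env.step(command)
        reward = progress / env.quest_length   # dense ratio-based signal every turn
        trajectory.append((command, reward))
        if done:
            break
    return trajectory


if __name__ == "__main__":
    traj = rollout_with_dense_rewards(ToyEnv(quest_length=4), policy=lambda: "go north")
    print(traj)  # rewards rise 0.25 -> 1.0 as milestones complete
```

The same loop degenerates to the sparse baseline by emitting a reward only when `done` is true.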
For coding tasks, ratio-verified execution tests (22% success) likewise outperform learned CodeRM models (7.2%).

### 3. Policy Initialization with SFT-RL Trade-off

Under a fixed compute budget, balance supervised fine-tuning (SFT) for behavioral grounding against RL for task adaptation. The optimal allocation preserves domain knowledge while leaving enough budget for RL learning.

```python
# Compute budget: 1000 units (one SFT demo costs 10x one RL episode)
# Optimal strategy: 60 SFT demos + 400 RL episodes
sft_budget = 60   # demonstrations
rl_budget = 400   # episodes
total_cost = sft_budget * 10 + rl_budget * 1  # 1000 units

# Mixed SFT+RL: 85% in-domain, 59% generalization
# vs. 100% SFT alone: 95% / 55%; pure RL: 54% in-domain

# Critical: domain-specific SFT is essential for stability;
# cross-domain initialization causes rapid policy collapse
```

### 4. Algorithm and Hyperparameter Selection

Different domains favor different algorithms: PPO excels on state-based tasks (88% on TextWorld), while GRPO handles sparse-reward coding efficiently. Hyperparameters matter significantly.

```python
# Algorithm selection per domain
algorithms = {
    'navigation': 'PPO',   # state-based, dense observations
    'coding': 'GRPO',      # sparse rewards, sparse examples
    'household': 'PPO'     # mixed-horizon, intermediate signals
}

# Optimal hyperparameter set (from systematic ablation)
hyperparams = {
    'kl_coeff': 0.01,      # higher values (>0.001) improve stability
    'temperature': 0.75,   # 0.7-1.0 range balances exploration
    'actor_lr': 1e-6,
    'critic_lr': 1e-5,
    'gamma': 1.0,          # no discounting for episode rewards
    'batch_size': 256,
    'gae_lambda': 0.95
}

# Token-level credit assignment
def gae_advantage(rewards, values, gamma=1.0, lambda_=0.95):
    # TD error:  δ_t^i = r_t^i + γ V(h_t^{i+1}) - V(h_t^i)
    # GAE:       Â_t^i = Σ_l (γλ)^l δ_t^{i+l}
    # `values` must hold one bootstrap entry more than `rewards`,
    # i.e. len(values) == len(rewards) + 1.
    # Advantages are applied to all tokens despite sparse rewards.
    advantages = []
    for t in range(len(rewards)):
        advantage = 0.0
        for l in range(len(rewards) - t):
            td_error = rewards[t + l] + gamma * values[t + l + 1] - values[t + l]
            advantage += (gamma * lambda_) ** l * td_error
        advantages.append(advantage)
    return advantages
```

## Practical Guidance

**Environment Complexity**: Start training on simpler environments to develop reusable skills that generalize. Object manipulation is a harder curriculum than spatial navigation. Evaluate on the hardest environment to assess true generalization (see the scheduler sketch after these notes).

**Reward Signal Priority**: (1) verified execution rewards > (2) ratio-based intermediate signals > (3) model-based proxies. Skip expensive reward-model training; direct environment feedback works better.

**Policy Stability**: Use domain-specific SFT initialization (60 demos under the 1000-unit budget) rather than pure RL, which collapses rapidly. γ = 1.0 avoids discounting long-horizon episode rewards.

**Model Scale Trade-offs**: Models of 7B+ parameters handle complex reasoning and long horizons reliably. 1.5B models work on simpler tasks but plateau around 55% success. Use smaller models only for constrained deployments.
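
As a minimal sketch of the curriculum guidance above, the scheduler below promotes training from the `simple` to the `medium` level once a rolling success rate clears a threshold, holding `hard` out for evaluation. The class name, the 0.7 threshold, and the window size are illustrative choices, not values from the paper.

```python
# Minimal sketch: success-rate-driven curriculum over the complexity
# levels from Step 1. Threshold and window size are assumptions.
from collections import deque

TRAIN_LEVELS = ["simple", "medium"]   # "hard" is held out for evaluation


class CurriculumScheduler:
    def __init__(self, window=50, threshold=0.7):
        self.level_idx = 0
        self.recent = deque(maxlen=window)  # rolling window of episode outcomes
        self.threshold = threshold

    @property
    def level(self):
        return TRAIN_LEVELS[self.level_idx]

    def record(self, success):
        """Log an episode outcome; promote to the next level once the
        rolling success rate reaches the threshold."""
        self.recent.append(bool(success))
        full = len(self.recent) == self.recent.maxlen
        rate = sum(self.recent) / len(self.recent)
        if full and rate >= self.threshold and self.level_idx < len(TRAIN_LEVELS) - 1:
            self.level_idx += 1
            self.recent.clear()


if __name__ == "__main__":
    sched = CurriculumScheduler(window=10, threshold=0.7)
    for episode in range(30):
        # Placeholder outcome; in practice this comes from an RL rollout
        # on the environment configured by sched.level.
        sched.record(success=(episode % 4 != 0))
        print(episode, sched.level)
```

In a full training loop, `sched.level` would index into the `environments` dict from Step 1 to pick the configuration for the next batch of episodes.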
## When to Use / When NOT to Use

**Use When**:

- Training LLM agents on extended interactive tasks (TextWorld, ALFWorld, SWE-Gym style problems)
- You have access to verified reward signals (unit tests, environment APIs, ground-truth checkers)
- The task has multi-step structure with intermediate milestones
- Deployment requires robust agent behavior across diverse scenarios

**NOT For**:

- Single-turn tasks or pure language generation (standard RLHF is sufficient)
- Domains without reliable reward signals, or where a learned reward model is unavoidable
- Extremely limited compute budgets (below the ~1000-unit equivalent)
- Tasks requiring perfect safety guarantees without verification infrastructure

## Reference

This skill encodes findings from "A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning" (arXiv:2510.01132), which systematizes empirical ablations across TextWorld, ALFWorld, and SWE-Gym. Code and reproducibility details are available via the veRL framework.