--- name: verltool-agentic-rl-tool-use title: "VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use" version: 0.0.2 engine: skillxiv-v0.0.2-claude-opus-4.6 license: MIT url: "https://arxiv.org/abs/2509.01055" keywords: [reinforcement learning, tool use, agentic AI, multi-turn reasoning, modular framework] description: "Train agents to leverage external tools across domains using VerlTool's unified RL framework. Coordinate code execution, search, SQL queries, and vision utilities in multi-turn interactions without domain-specific redesign. 2× faster asynchronous rollouts on mathematical reasoning, knowledge QA, and software engineering tasks." --- ## Train Agentic Systems to Solve Complex Tasks with Tools **Outcome:** Build agents that iteratively reason, call external tools, observe results, and adapt across diverse problem domains using a single unified framework. ### Problem Context Existing approaches to agentic AI fragment tool use into domain-specific systems. A knowledge QA system handles search differently than a code execution system; SQL agents use different APIs than visual reasoning systems. When researchers want agents to solve multi-domain tasks—or when practitioners need to add new tool capabilities—the cost of integration is high: custom pipelines, reimplemented coordination logic, and repeated infrastructure investment. Single-turn language models also cannot naturally handle the sequential decision-making required for tool use: acting, observing results from that action, and choosing the next action based on new information. Reinforcement learning (RL) can optimize these multi-turn trajectories, but extending RL frameworks to support diverse tools requires careful trajectory representation, observation tokenization, and reward alignment across modalities. VerlTool solves this by introducing a unified, modular framework that extends Reinforcement Learning with Verifiable Rewards (RLVR) to multi-turn agentic settings with tool use. ### Core Concept VerlTool treats tool use in RL as a trajectory of alternating actions and observations: the agent selects an action (tool call with arguments), the tool executes and returns an observation, and this cycle repeats until the task is solved. The key insight is that observations in tool-use trajectories are environment-generated facts outside the agent's control—they should not influence the policy gradient during training, only provide information for subsequent decisions. The framework is "holistic" because it unifies: - Multiple tool types (code, search, SQL, vision) under one API - Multiple modalities (text, images, video) in observation tokens - Multi-turn RL training with asynchronous execution for efficiency - Upstream compatibility with VeRL for seamless maintenance Instead of building separate agents for code execution, search, and SQL, teams develop a single agent that learns to compose any combination of these tools effectively. ### Architecture Overview VerlTool splits into two primary subsystems that communicate via standardized APIs: - **VeRL Workflow (Training Side):** Orchestrates policy training, reward computation, and model updates. Inherits VeRL as a submodule to stay aligned with upstream improvements and avoid duplicating core RL logic. - **Tool Server (Execution Side):** Manages execution of tool calls. Processes rollouts asynchronously on a trajectory-by-trajectory basis instead of enforcing synchronous batch alignment, eliminating idle waiting and achieving approximately 2× speedup during rollout phases. **Modular Tool Registration:** Each tool implements a common BaseTool interface with methods for parsing actions, managing environment state, and executing operations. New tools are added by creating lightweight Python definition files without modifying training code. This design separates concerns and enables domain experts to contribute tools without deep RL knowledge. **Tokenization Strategy:** A critical detail for stability: action and observation strings are tokenized separately, then concatenated. This prevents boundary-related token mismatches that can cause training instability during multi-turn rollouts. The framework ensures consistent token alignment across all turns. **Observation Masking in Policy Optimization:** Since observations are environment-generated and off-policy with respect to the model being trained, the framework masks observation tokens during gradient computation. Only action tokens and the final reward contribute to policy updates. This prevents the model from learning spurious correlations with stale observation data and maintains RL stability. ### Implementation #### 1. Set Up VerlTool Modular Architecture Begin by inheriting VeRL and structuring the tool server as a separate component. This ensures training logic remains decoupled from tool execution infrastructure. ```python # verltool/config.py # Configuration for dual-component architecture from typing import Dict, List, Optional from dataclasses import dataclass, field from verl.config import BaseConfig @dataclass class ToolServerConfig(BaseConfig): """Configuration for asynchronous tool execution server.""" host: str = "localhost" port: int = 8888 max_workers: int = 32 # Async worker threads for tool calls timeout_seconds: int = 300 enable_async_rollout: bool = True # Critical for 2× speedup @dataclass class VerlToolConfig(BaseConfig): """Top-level VerlTool configuration.""" verl_config: Dict = field(default_factory=dict) # Inherited VeRL settings tool_server: ToolServerConfig = field(default_factory=ToolServerConfig) modalities: List[str] = field(default_factory=lambda: ["text"]) # text, image, video observation_masking_enabled: bool = True # Mask off-policy observations tokenization_mode: str = "separate" # Separate action/observation tokens ``` #### 2. Implement Unified Tool Interface Define the BaseTool abstraction that all tools inherit from. This enables dynamic registration and consistent handling of diverse tools. ```python # verltool/tools/base.py # Unified tool interface for consistent behavior across domains from abc import ABC, abstractmethod from typing import Any, Dict, Optional, Union from dataclasses import dataclass @dataclass class ToolAction: """Structured representation of a tool invocation.""" tool_name: str arguments: Dict[str, Any] timestamp: Optional[float] = None @dataclass class ToolObservation: """Result returned by tool execution.""" content: Union[str, bytes] modality: str = "text" # text, image, video token_count: Optional[int] = None error: Optional[str] = None class BaseTool(ABC): """Base class for all tools in VerlTool ecosystem.""" def __init__(self, name: str, modalities: List[str] = None): self.name = name self.modalities = modalities or ["text"] self.state = {} # Maintains context across turns @abstractmethod def parse_action(self, action_string: str) -> ToolAction: """Convert raw action string to structured ToolAction.""" pass @abstractmethod def execute(self, action: ToolAction) -> ToolObservation: """Execute the tool and return observation.""" pass def update_state(self, observation: ToolObservation): """Update internal state based on execution result.""" self.state['last_result'] = observation def reset(self): """Clear state for new trajectory.""" self.state = {} ``` #### 3. Register Tools Dynamically Create a registry that discovers and manages tools without hardcoding them into training logic. ```python # verltool/tools/registry.py # Dynamic tool registration system for extensibility from typing import Dict, Type, Optional from verltool.tools.base import BaseTool import importlib import os class ToolRegistry: """Registry for tool discovery and management.""" def __init__(self, tool_dir: str = "verltool/tools"): self.tools: Dict[str, BaseTool] = {} self.tool_dir = tool_dir def register(self, tool_class: Type[BaseTool]): """Explicitly register a tool class.""" instance = tool_class() self.tools[instance.name] = instance return self def discover_from_directory(self): """Auto-discover tools from tool_dir by importing modules.""" for filename in os.listdir(self.tool_dir): if filename.endswith('_tool.py') and not filename.startswith('_'): module_name = filename[:-3] try: module = importlib.import_module(f'verltool.tools.{module_name}') # Assume each module exports TOOL_CLASS if hasattr(module, 'TOOL_CLASS'): self.register(module.TOOL_CLASS) except Exception as e: print(f"Warning: Could not load tool {module_name}: {e}") return self def get_tool(self, name: str) -> Optional[BaseTool]: """Retrieve a tool by name.""" return self.tools.get(name) def list_tools(self) -> Dict[str, BaseTool]: """Return all registered tools.""" return self.tools.copy() ``` #### 4. Implement Code Execution Tool Demonstrate a concrete tool implementation for executing Python code, a common requirement in reasoning and software engineering tasks. ```python # verltool/tools/code_execution_tool.py # Code execution tool with sandboxed environment support from verltool.tools.base import BaseTool, ToolAction, ToolObservation from typing import Dict, Any import subprocess import tempfile import os class CodeExecutionTool(BaseTool): """Execute Python code and capture output.""" def __init__(self, timeout_seconds: int = 30): super().__init__( name="code_execution", modalities=["text"] ) self.timeout = timeout_seconds self.execution_history = [] def parse_action(self, action_string: str) -> ToolAction: """Extract Python code from action string.""" # Expected format: python_code_here start = action_string.find('') end = action_string.find('') if start == -1 or end == -1: return ToolAction( tool_name=self.name, arguments={"code": action_string} ) code = action_string[start + 6:end].strip() return ToolAction( tool_name=self.name, arguments={"code": code} ) def execute(self, action: ToolAction) -> ToolObservation: """Run code in subprocess and capture output.""" code = action.arguments.get("code", "") if not code: return ToolObservation( content="Error: No code provided", modality="text", error="empty_code" ) try: with tempfile.NamedTemporaryFile( mode='w', suffix='.py', delete=False ) as f: f.write(code) temp_file = f.name result = subprocess.run( ['python', temp_file], capture_output=True, text=True, timeout=self.timeout ) output = result.stdout if result.stderr: output += "\nStderr:\n" + result.stderr self.execution_history.append({ "code": code, "output": output, "returncode": result.returncode }) return ToolObservation( content=output or "(No output)", modality="text", error=None if result.returncode == 0 else "execution_error" ) except subprocess.TimeoutExpired: return ToolObservation( content=f"Error: Code execution timeout (>{self.timeout}s)", modality="text", error="timeout" ) except Exception as e: return ToolObservation( content=f"Error: {str(e)}", modality="text", error="execution_failed" ) finally: if os.path.exists(temp_file): os.remove(temp_file) TOOL_CLASS = CodeExecutionTool ``` #### 5. Build Multi-Turn Trajectory Handling with Observation Masking Implement trajectory construction that alternates actions and observations, with proper masking for off-policy observations. ```python # verltool/training/trajectory.py # Multi-turn trajectory representation with observation masking from typing import List, Optional, Dict, Any from dataclasses import dataclass, field import torch @dataclass class MultiTurnTrajectory: """Represents a complete multi-turn agent trajectory.""" actions: List[str] = field(default_factory=list) # Agent actions/tool calls observations: List[str] = field(default_factory=list) # Environment observations action_tokens: List[torch.Tensor] = field(default_factory=list) observation_tokens: List[torch.Tensor] = field(default_factory=list) observation_masks: List[bool] = field(default_factory=list) # True = mask out reward: float = 0.0 episode_done: bool = False def add_turn( self, action: str, observation: str, action_token_ids: torch.Tensor, observation_token_ids: torch.Tensor, mask_observation: bool = True ): """Add a single action-observation pair to trajectory.""" self.actions.append(action) self.observations.append(observation) self.action_tokens.append(action_token_ids) self.observation_tokens.append(observation_token_ids) # Observation tokens are off-policy and should be masked self.observation_masks.append(mask_observation) def get_concatenated_tokens(self) -> torch.Tensor: """Interleave action and observation tokens.""" sequence = [] for i in range(len(self.actions)): sequence.append(self.action_tokens[i]) if i < len(self.observation_tokens): sequence.append(self.observation_tokens[i]) return torch.cat(sequence, dim=0) def get_loss_mask(self) -> torch.Tensor: """Create mask: True where loss should be computed, False otherwise.""" mask = [] for i in range(len(self.actions)): # Action tokens contribute to loss mask.append(torch.ones_like(self.action_tokens[i], dtype=torch.bool)) # Observation tokens are masked (do not contribute to loss) if i < len(self.observation_masks): is_masked = self.observation_masks[i] obs_mask = torch.zeros_like( self.observation_tokens[i], dtype=torch.bool ) if is_masked else torch.ones_like( self.observation_tokens[i], dtype=torch.bool ) mask.append(obs_mask) return torch.cat(mask, dim=0) class TrajectoryBuilder: """Construct trajectories from agent rollouts.""" def __init__(self, tokenizer): self.tokenizer = tokenizer def build_from_rollout( self, actions: List[str], observations: List[str], reward: float, mask_all_observations: bool = True ) -> MultiTurnTrajectory: """Convert rollout data into a trajectory with proper tokenization.""" trajectory = MultiTurnTrajectory(reward=reward) for i, action in enumerate(actions): # Tokenize separately to avoid boundary issues action_tokens = self.tokenizer.encode(action) action_tensor = torch.tensor(action_tokens) if i < len(observations): obs = observations[i] obs_tokens = self.tokenizer.encode(obs) obs_tensor = torch.tensor(obs_tokens) trajectory.add_turn( action=action, observation=obs, action_token_ids=action_tensor, observation_token_ids=obs_tensor, mask_observation=mask_all_observations ) return trajectory ``` #### 6. Implement Asynchronous Tool Server for 2× Speedup Deploy tool execution asynchronously so rollouts don't block waiting for tool completion. ```python # verltool/server/async_tool_server.py # Asynchronous tool execution with trajectory-level batching import asyncio from typing import List, Dict, Any, Callable from concurrent.futures import ThreadPoolExecutor import queue from verltool.tools.registry import ToolRegistry from verltool.tools.base import ToolAction, ToolObservation class AsyncToolServer: """Non-blocking tool execution server for efficient rollouts.""" def __init__(self, tool_registry: ToolRegistry, max_workers: int = 32): self.registry = tool_registry self.executor = ThreadPoolExecutor(max_workers=max_workers) self.pending_tasks: Dict[str, asyncio.Future] = {} self.task_counter = 0 def execute_tool_async( self, tool_name: str, action_string: str ) -> str: """Submit tool execution without blocking, return task ID.""" tool = self.registry.get_tool(tool_name) if not tool: return None task_id = f"task_{self.task_counter}" self.task_counter += 1 def _run(): action = tool.parse_action(action_string) observation = tool.execute(action) tool.update_state(observation) return observation future = asyncio.get_event_loop().run_in_executor( self.executor, _run ) self.pending_tasks[task_id] = future return task_id async def wait_for_result(self, task_id: str, timeout: int = 300) -> ToolObservation: """Await tool completion and retrieve result.""" if task_id not in self.pending_tasks: raise ValueError(f"Unknown task: {task_id}") try: result = await asyncio.wait_for( self.pending_tasks[task_id], timeout=timeout ) del self.pending_tasks[task_id] return result except asyncio.TimeoutError: return ToolObservation( content=f"Timeout waiting for {task_id}", modality="text", error="timeout" ) async def process_trajectory_batch( self, trajectories: List[Dict[str, Any]] ) -> List[List[ToolObservation]]: """Execute all tool calls in a batch of trajectories concurrently.""" all_observations = [] for traj in trajectories: tool_calls = traj.get("tool_calls", []) tasks = [ self.execute_tool_async(call["tool"], call["action"]) for call in tool_calls ] observations = [] for task_id in tasks: result = await self.wait_for_result(task_id) observations.append(result) all_observations.append(observations) return all_observations ``` #### 7. Configure Reward and Loss Computation Define how rewards are computed for tool-use trajectories and how the loss respects observation masking. ```python # verltool/training/reward.py # Reward computation for multi-turn agentic trajectories from typing import Optional, Dict, Any import torch import torch.nn as nn class VerlToolRewardComputer: """Compute rewards from verifiable task outcomes.""" def __init__(self, reward_fn: callable, use_per_step_rewards: bool = False): """ reward_fn: Function taking (final_output, expected_output) -> float use_per_step_rewards: If True, grant intermediate rewards for progress """ self.reward_fn = reward_fn self.use_per_step_rewards = use_per_step_rewards def compute( self, final_output: str, expected_output: Optional[str] = None, intermediate_outputs: Optional[Dict[int, str]] = None ) -> float: """ Compute scalar reward for trajectory. Follows RLVR principle: reward only depends on verifiable task outcome. """ if expected_output is None: # Fallback: treat any valid output as success return 1.0 if final_output else 0.0 base_reward = self.reward_fn(final_output, expected_output) if self.use_per_step_rewards and intermediate_outputs: # Optional: add small bonuses for intermediate progress bonus = 0.0 for step, output in intermediate_outputs.items(): if output and step < len(intermediate_outputs) - 1: bonus += 0.05 # Small step reward return min(base_reward + bonus, 1.0) return base_reward class PolicyLoss(nn.Module): """Compute policy gradient loss with observation masking.""" def __init__(self, model: nn.Module): super().__init__() self.model = model def forward( self, trajectory, # MultiTurnTrajectory instance logits: torch.Tensor ) -> torch.Tensor: """ Compute policy loss, masking observation tokens. logits: Model output logits for entire sequence """ loss_mask = trajectory.get_loss_mask() # Bool tensor # Shift logits for language modeling objective shift_logits = logits[:-1] target_tokens = trajectory.get_concatenated_tokens()[1:] # Cross-entropy loss only on unmasked (action) tokens ce_loss = torch.nn.functional.cross_entropy( shift_logits.view(-1, logits.shape[-1]), target_tokens.view(-1), reduction='none' ) # Apply mask to loss ce_loss = ce_loss * loss_mask[1:].float() # Weight by trajectory reward trajectory_reward = max(trajectory.reward, 0.0) weighted_loss = ce_loss.mean() * trajectory_reward return weighted_loss ``` ### Practical Guidance **Hyperparameter Configuration:** | Parameter | Domain | Recommended | Rationale | |-----------|--------|-------------|-----------| | `max_turns` | Math/SQL | 8–12 | Allows sufficient tool calls without excessive sequences | | `max_turns` | Code execution | 5–8 | Shorter: limited debugging iterations needed | | `max_turns` | Web search | 4–6 | Search queries return quick results | | `observation_masking` | All | True | Prevents off-policy observation leakage into gradients | | `async_workers` | All | 32–64 | Balances concurrency; 2× speedup with ≥32 | | `tokenization_mode` | All | "separate" | Avoids token boundary mismatches across turns | | `timeout_seconds` | Code/search | 30–60 | Prevents runaway tool calls; adjust by domain | | `learning_rate` | All | 1e-5 to 5e-5 | Tool-use RL requires careful tuning | **When to Use VerlTool:** - Multi-domain agents where code, search, SQL, and vision tools must coexist - Organizations with shared RL infrastructure that multiple teams extend - Tasks requiring multi-turn reasoning where outcomes are verifiable (rewards are computable) - Scenarios demanding fast rollout throughput (async execution provides 2× improvement) - Environments where observation tokens are abundant and masking prevents noise in gradients **When NOT to Use VerlTool:** - Single-task systems where domain-specific optimization is critical (specialized agents may outperform) - Scenarios where tool observations directly determine rewards and policy (observation masking would discard signal) - Real-time systems with strict latency budgets (asynchronous design adds queueing overhead) - Environments with non-verifiable rewards (RLVR extension assumes ground-truth task outcomes) - Teams without RL expertise managing training infrastructure (requires careful reward engineering and convergence tuning) **Pitfalls to Avoid:** 1. **Neglecting Tool Timeout Configuration:** Set realistic timeouts per tool. Code execution might need 60s; search should timeout faster. Timeouts that are too long create training bottlenecks; too short causes spurious failures. 2. **Improper Reward Alignment:** Verify that reward functions reflect the actual task goal. Weak reward signals lead to high variance in RL and poor convergence. Use verification functions that match your evaluation metric. 3. **Forgetting to Reset Tool State:** Each trajectory should start fresh. Stale state from previous episodes corrupts observations. Always call `tool.reset()` between trajectories. 4. **Mixing Observation Masking Strategies:** Once you enable observation masking, apply it consistently across training. Inconsistent masking destabilizes gradients and causes sudden performance drops. 5. **Inadequate Batch Size for Async Server:** The speedup assumes enough concurrent trajectories to saturate workers. Small batch sizes (< 8) negate async benefits. Use batch sizes ≥ 16 when enabling async rollout. 6. **Skipping Tool Registry Discovery:** Hardcoding tools into training loops defeats modularity. Use the registry's `discover_from_directory()` method to enable true plug-and-play tool integration. ### Reference **Paper:** VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use. Jiang, D., Lu, Y., et al. (2025). arXiv:2509.01055 [cs.AI]. **Full Citation:** https://arxiv.org/abs/2509.01055 **Key Contributions:** The framework demonstrates that unified, modular tool-use RL matches or exceeds domain-specific systems across six diverse benchmarks (mathematical reasoning at 62.2%, knowledge QA at 45.9%, SQL generation matching SkyRL-SQL, visual reasoning at 82.7%, web search at 34.0% GAIA accuracy, and software engineering at 19.5% SWE-Verified). Its 2× rollout speedup via async execution and observation masking strategy provide both efficiency and training stability for multi-turn agentic RL.