--- name: deep-agent-reasoning title: "DeepAgent: A General Reasoning Agent with Scalable Toolsets" version: 0.0.2 engine: skillxiv-v0.0.2-claude-opus-4.6 license: MIT url: "https://arxiv.org/abs/2510.21618" keywords: [Agent, Reasoning, Tool Learning, RL, Memory Management] description: "Enables autonomous reasoning agents to discover and invoke tools efficiently through end-to-end training. Uses autonomous memory folding to compress interaction history and ToolPO to learn general-purpose tool use, applicable across diverse benchmarks from QA to web automation." --- # DeepAgent: Unified Autonomous Reasoning with Tool Learning Existing reasoning agents struggle with two key limitations: they accumulate errors across long-horizon tasks through verbose interaction histories, and they require task-specific tool interfaces rather than learning generalizable tool use patterns. DeepAgent solves this by integrating autonomous thinking, tool discovery, and action execution into a single end-to-end reasoning process. The system combines memory compression with learned tool invocation, enabling agents to handle complex multi-step tasks efficiently. ## Core Concept DeepAgent operates through three integrated mechanisms: - **Autonomous Memory Folding**: Compresses past interactions into structured episodic, working, and tool memories, reducing error propagation - **ToolPO (Tool Policy Optimization)**: An end-to-end RL strategy using simulated APIs and fine-grained tool-call advantage attribution - **Tool Retrieval**: Handles both labeled-tool and open-set discovery scenarios ## Architecture Overview - Memory compression captures essential interaction patterns without verbose history - Tool-call advantage attribution isolates credit signals to tool invocation tokens - Memory types (episodic, working, tool) serve different reasoning stages - End-to-end training enables discovery of effective tool combinations ## Implementation Steps The memory folding mechanism selectively summarizes interactions at each step. Rather than maintaining full conversation history, compress past state and actions into dense representations: ```python class MemoryFolder: def fold_interaction(self, history, current_state): # Compress episodic memory: factual outcomes from past steps episodic = self.compress_facts(history) # Working memory: intermediate reasoning state working = self.compress_reasoning(current_state) # Tool memory: effective tool patterns tools = self.extract_tool_patterns(history) return {episodic, working, tools} def compress_facts(self, history): # Extract key outcomes and state changes return [fact for fact in history if is_critical(fact)] def extract_tool_patterns(self, history): # Track which tools succeeded in which contexts return {(context, goal): tool for context, goal, tool in history} ``` ToolPO applies advantage attribution at the token level for tool calls. Rather than assigning credit to entire generation steps, focus reward signals on the tokens that invoke tools: ```python class ToolPO: def compute_advantage(self, trajectory, reward): # Identify tool-call tokens in the generation tool_tokens = [idx for idx, token in enumerate(trajectory) if is_tool_invocation(token)] # Assign advantage only to tool-invocation tokens advantage = {} for idx in tool_tokens: # Fine-grained credit based on outcome advantage[idx] = compute_token_advantage(trajectory, idx, reward) return advantage ``` ## Practical Guidance | Aspect | Recommendation | |--------|-----------------| | Memory compression ratio | 4:1 to 8:1 (reduce interaction sequences by 75-87%) | | Tool-call token weighting | 2-5x higher than other tokens during RL training | | Episodic memory retention | Keep last N=10 critical facts per domain | | Simulated API complexity | Match target environment sophistication | **When to use DeepAgent:** - Multi-step reasoning tasks requiring tool invocation - Long-horizon problems where error accumulation matters - Scenarios with large, diverse tool libraries to explore **When NOT to use:** - Single-step tasks without tool requirements - Domains with strictly defined tool interfaces (use API-specific agents) - Real-time systems where memory compression adds latency **Common pitfalls:** - Over-compressing memory and losing critical context - Under-weighting tool-specific advantage signals - Insufficient diversity in simulated API trajectories during training Reference: [DeepAgent on arXiv](https://arxiv.org/abs/2510.21618)