---
name: hive-test
description: Iterative agent testing with session recovery. Execute, analyze, fix, resume from checkpoints. Use when testing an agent, debugging test failures, or verifying fixes without re-running from scratch.
---

# Agent Testing

Test agents iteratively: execute, analyze failures, fix, resume from checkpoint, repeat.

## When to Use

- Testing a newly built agent against its goal
- Debugging a failing agent iteratively
- Verifying fixes without re-running expensive early nodes
- Running final regression tests before deployment

## Prerequisites

1. Agent package at `exports/{agent_name}/` (built with `/hive-create`)
2. Credentials configured (`/hive-credentials`)
3. `ANTHROPIC_API_KEY` set (or appropriate LLM provider key)

**Path distinction** (critical — don't confuse these):

- `exports/{agent_name}/` — agent source code (edit here)
- `~/.hive/agents/{agent_name}/` — runtime data: sessions, checkpoints, logs (read here)

---

## The Iterative Test Loop

This is the core workflow. Don't re-run the entire agent when a late node fails — analyze, fix, and resume from the last clean checkpoint.

```
┌──────────────────────────────────────┐
│ PHASE 1: Generate Test Scenarios     │
│ Goal → synthetic test inputs + tests │
└──────────────┬───────────────────────┘
               ↓
┌──────────────────────────────────────┐
│ PHASE 2: Execute                     │◄────────────────┐
│ Run agent (CLI or pytest)            │                 │
└──────────────┬───────────────────────┘                 │
               ↓                                         │
     Pass? ──yes──► PHASE 6: Final Verification          │
       │                                                 │
       no                                                │
       ↓                                                 │
┌──────────────────────────────────────┐                 │
│ PHASE 3: Analyze                     │                 │
│ Session + runtime logs + checkpoints │                 │
└──────────────┬───────────────────────┘                 │
               ↓                                         │
┌──────────────────────────────────────┐                 │
│ PHASE 4: Fix                         │                 │
│ Prompt / code / graph / goal         │                 │
└──────────────┬───────────────────────┘                 │
               ↓                                         │
┌──────────────────────────────────────┐                 │
│ PHASE 5: Recover & Resume            │─────────────────┘
│ Checkpoint resume OR fresh re-run    │
└──────────────────────────────────────┘
```

---

### Phase 1: Generate Test Scenarios

Create synthetic tests from the agent's goal, constraints, and success criteria.

#### Step 1a: Read the goal

```python
# Read goal from agent.py
Read(file_path="exports/{agent_name}/agent.py")
# Extract the Goal definition and convert to JSON string
```

#### Step 1b: Get test guidelines

```python
# Get constraint test guidelines
generate_constraint_tests(
    goal_id="your-goal-id",
    goal_json='{"id": "...", "constraints": [...]}',
    agent_path="exports/{agent_name}"
)

# Get success criteria test guidelines
generate_success_tests(
    goal_id="your-goal-id",
    goal_json='{"id": "...", "success_criteria": [...]}',
    node_names="intake,research,review,report",
    tool_names="web_search,web_scrape",
    agent_path="exports/{agent_name}"
)
```

These return `file_header`, `test_template`, `constraints_formatted`/`success_criteria_formatted`, and `test_guidelines`. They do NOT generate test code — you write the tests.
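If you are assembling `goal_json` by hand from the Goal you read in Step 1a, a minimal sketch (the field names mirror the examples used throughout this skill; the specific constraint and criterion values are hypothetical placeholders):

```python
# Sketch: serialize the Goal into the goal_json string expected by the generate_* tools.
# Keys follow the examples in this document; the values below are placeholders.
import json

goal = {
    "id": "your-goal-id",
    "constraints": [
        {"id": "cite-real-sources", "description": "Only cite verifiable sources"},  # hypothetical
    ],
    "success_criteria": [
        {"id": "source-diversity", "target": ">=5"},
    ],
}
goal_json = json.dumps(goal)  # pass this string as goal_json above
```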
#### Step 1c: Write tests

```python
Write(
    file_path=result["output_file"],
    content=result["file_header"] + "\n\n" + your_test_code
)
```

#### Test writing rules

- Every test MUST be `async` with `@pytest.mark.asyncio`
- Every test MUST accept the `runner, auto_responder, mock_mode` fixtures
- Use `await auto_responder.start()` before running, `await auto_responder.stop()` in `finally`
- Use `await runner.run(input_dict)` — this goes through AgentRunner → AgentRuntime → ExecutionStream
- Access output via `result.output.get("key")` — NEVER `result.output["key"]`
- `result.success=True` means no exception, NOT goal achieved — always check the output
- Write 8-15 tests total, not 30+
- Each real test costs ~3 seconds + LLM tokens
- NEVER use `default_agent.run()` — it bypasses the runtime (no sessions, no logs, client-facing nodes hang)

#### Step 1d: Check existing tests

Before generating, check if tests already exist:

```python
list_tests(
    goal_id="your-goal-id",
    agent_path="exports/{agent_name}"
)
```

---

### Phase 2: Execute

Two execution paths — use the right one for your situation.

#### Iterative debugging (for complex agents)

Run the agent via the CLI. This creates sessions with checkpoints at `~/.hive/agents/{agent_name}/sessions/`:

```bash
uv run hive run exports/{agent_name} --input '{"query": "test topic"}'
```

Sessions and checkpoints are saved automatically.

**Client-facing nodes**: Agents with `client_facing=True` nodes (interactive conversation) work in headless mode when run from a real terminal — the agent streams output to stdout and reads user input from stdin via a `>>> ` prompt. In non-interactive shells (like Claude Code's Bash tool), client-facing nodes will hang because there is no stdin. For testing interactive agents from Claude Code, use `run_tests` with mock mode or have the user run the agent manually in their terminal.

#### Automated regression (for CI or final verification)

Use the `run_tests` MCP tool to run all pytest tests:

```python
run_tests(
    goal_id="your-goal-id",
    agent_path="exports/{agent_name}"
)
```

Returns structured results:

```json
{
  "overall_passed": false,
  "summary": {"total": 12, "passed": 10, "failed": 2, "pass_rate": "83.3%"},
  "test_results": [{"test_name": "test_success_source_diversity", "status": "failed"}],
  "failures": [{"test_name": "test_success_source_diversity", "details": "..."}]
}
```

**Options:**

```python
# Run only constraint tests
run_tests(goal_id, agent_path, test_types='["constraint"]')

# Stop on first failure
run_tests(goal_id, agent_path, fail_fast=True)

# Parallel execution
run_tests(goal_id, agent_path, parallel=4)
```

**Note:** `run_tests` uses `AgentRunner` with `tmp_path` storage, so sessions are isolated per test run. For checkpoint-based recovery with persistent sessions, use CLI execution. Use `run_tests` for quick regression checks and final verification.

---

### Phase 3: Analyze Failures

When a test fails, drill down systematically. Don't guess — use the tools.

#### Step 3a: Get error category

```python
debug_test(
    goal_id="your-goal-id",
    test_name="test_success_source_diversity",
    agent_path="exports/{agent_name}"
)
```

Returns the error category (`IMPLEMENTATION_ERROR`, `ASSERTION_FAILURE`, `TIMEOUT`, `IMPORT_ERROR`, `API_ERROR`) plus the full traceback and suggestions.

#### Step 3b: Find the failed session

```python
list_agent_sessions(
    agent_work_dir="~/.hive/agents/{agent_name}",
    status="failed",
    limit=5
)
```

Returns a session list with IDs, timestamps, current_node (where it failed), and execution_quality.
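To chain the analysis steps together, pick the most recent failed session and reuse its ID in Steps 3c-3g — a sketch (the shape of the returned entries, e.g. a `session_id` field and newest-first ordering, is an assumption; check the actual tool output):

```python
# Sketch: grab the newest failed session ID for the inspection steps below.
# The entry shape ("session_id" key, newest-first ordering) is assumed.
sessions = list_agent_sessions(
    agent_work_dir="~/.hive/agents/{agent_name}",
    status="failed",
    limit=5
)
failed_session_id = sessions[0]["session_id"]
```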
#### Step 3c: Inspect session state

```python
get_agent_session_state(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345"
)
```

Returns the execution path, which node was current, step count, and timestamps — but excludes memory values (to avoid context bloat). Shows `memory_keys` and `memory_size` instead.

#### Step 3d: Examine runtime logs (L2/L3)

```python
# L2: Per-node success/failure, retry counts
query_runtime_log_details(
    agent_work_dir="~/.hive/agents/{agent_name}",
    run_id="session_20260209_143022_abc12345",
    needs_attention_only=True
)

# L3: Exact LLM responses, tool call inputs/outputs
query_runtime_log_raw(
    agent_work_dir="~/.hive/agents/{agent_name}",
    run_id="session_20260209_143022_abc12345",
    node_id="research"
)
```

#### Step 3e: Inspect memory data

```python
# See what data a node actually produced
get_agent_session_memory(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    key="research_results"
)
```

#### Step 3f: Find recovery points

```python
list_agent_checkpoints(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    is_clean="true"
)
```

Returns checkpoint summaries with IDs, types (`node_start`, `node_complete`), which node, and the `is_clean` flag. Clean checkpoints are safe resume points.

#### Step 3g: Compare checkpoints (optional)

To understand what changed between two points in execution:

```python
compare_agent_checkpoints(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    checkpoint_id_before="cp_node_complete_research_143030",
    checkpoint_id_after="cp_node_complete_review_143115"
)
```

Returns the memory diff (added/removed/changed keys) and the execution path diff.

---

### Phase 4: Fix Based on Root Cause

Use the analysis from Phase 3 to determine what to fix and where.
| Root Cause | What to Fix | Where to Edit |
|------------|-------------|---------------|
| **Prompt issue** — LLM produces wrong output format, misses instructions | Node `system_prompt` | `exports/{agent}/nodes/__init__.py` |
| **Code bug** — TypeError, KeyError, logic error in Python | Agent code | `exports/{agent}/agent.py`, `nodes/__init__.py` |
| **Graph issue** — wrong routing, missing edge, bad condition_expr | Edges, node config | `exports/{agent}/agent.py` |
| **Tool issue** — MCP tool fails, wrong config, missing credential | Tool config | `exports/{agent}/mcp_servers.json`, `/hive-credentials` |
| **Goal issue** — success criteria too strict/vague, wrong constraints | Goal definition | `exports/{agent}/agent.py` (goal section) |
| **Test issue** — test expectations don't match actual agent behavior | Test code | `exports/{agent}/tests/test_*.py` |

#### Fix strategies by error category

**IMPLEMENTATION_ERROR** (TypeError, AttributeError, KeyError):

```python
# Read the failing code
Read(file_path="exports/{agent_name}/nodes/__init__.py")

# Fix the bug
Edit(
    file_path="exports/{agent_name}/nodes/__init__.py",
    old_string="results.get('videos')",
    new_string="(results or {}).get('videos', [])"
)
```

**ASSERTION_FAILURE** (test assertions fail but the agent ran successfully):

- Check if the agent's output is actually wrong → fix the prompt
- Check if the test's expectations are unrealistic → fix the test
- Use `get_agent_session_memory` to see what the agent actually produced

**TIMEOUT / STALL** (agent runs too long):

- Check `node_visit_counts` for feedback loops hitting max_node_visits
- Check L3 logs for tool calls that hang
- Reduce `max_iterations` in loop_config or fix the prompt to converge faster

**API_ERROR** (connection, rate limit, auth):

- Verify credentials with `/hive-credentials`
- Check the MCP server configuration

---

### Phase 5: Recover & Resume

After fixing the agent, decide whether to resume or re-run.

#### When to resume from checkpoint

Resume when ALL of these are true:

- The fix is to a node that comes AFTER existing clean checkpoints
- Clean checkpoints exist (from a CLI execution with checkpointing)
- The early nodes are expensive (web scraping, API calls, long LLM chains)

```bash
# Resume from the last clean checkpoint before the failing node
uv run hive run exports/{agent_name} \
  --resume-session session_20260209_143022_abc12345 \
  --checkpoint cp_node_complete_research_143030
```

This skips all nodes before the checkpoint and only re-runs the fixed node onward.

#### When to re-run from scratch

Re-run when ANY of these are true:

- The fix is to the entry node or an early node
- No checkpoints exist (e.g., the agent was run via `run_tests`)
- The agent is fast (2-3 nodes, completes in seconds)
- You changed the graph structure (added/removed nodes/edges)

```bash
uv run hive run exports/{agent_name} --input '{"query": "test topic"}'
```

#### Inspecting a checkpoint before resuming

```python
get_agent_checkpoint(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    checkpoint_id="cp_node_complete_research_143030"
)
```

Returns the full checkpoint: shared_memory snapshot, execution_path, current_node, next_node, is_clean.

#### Loop back to Phase 2

After resuming or re-running, check if the fix worked. If not, go back to Phase 3.
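Before moving on to Phase 6, it can help to confirm the fix directly by re-inspecting the fixed node's output in the new or resumed session — a sketch using the Phase 3 tools (the session ID and memory key are placeholders for your agent):

```python
# Sketch: verify that the fixed node now produces what the goal expects.
# Session ID and memory key are placeholders.
get_agent_session_state(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345"
)

get_agent_session_memory(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    key="research_results"
)
# If the output still misses the success criterion, return to Phase 3.
```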
---

### Phase 6: Final Verification

Once the iterative fix loop converges (the agent produces correct output), run the full automated test suite:

```python
run_tests(
    goal_id="your-goal-id",
    agent_path="exports/{agent_name}"
)
```

All tests should pass. If not, repeat the loop for the remaining failures.

---

## Credential Requirements

**CRITICAL: Testing requires ALL credentials the agent depends on.** This includes both the LLM API key AND any tool-specific credentials (HubSpot, Brave Search, etc.).

### Prerequisites

Before running agent tests, you MUST collect ALL required credentials from the user.

**Step 1: LLM API Key (always required)**

```bash
export ANTHROPIC_API_KEY="your-key-here"
```

**Step 2: Tool-specific credentials (depends on the agent's tools)**

Inspect the agent's `mcp_servers.json` and tool configuration to determine which tools the agent uses, then check for all required credentials:

```python
from aden_tools.credentials import CredentialManager, CREDENTIAL_SPECS

creds = CredentialManager()

# Determine which tools the agent uses (from agent.json or mcp_servers.json)
agent_tools = [...]  # e.g., ["hubspot_search_contacts", "web_search", ...]

# Find all missing credentials for those tools
missing = creds.get_missing_for_tools(agent_tools)
```

Common tool credentials:

| Tool | Env Var | Help URL |
|------|---------|----------|
| HubSpot CRM | `HUBSPOT_ACCESS_TOKEN` | https://developers.hubspot.com/docs/api/private-apps |
| Brave Search | `BRAVE_SEARCH_API_KEY` | https://brave.com/search/api/ |
| Google Search | `GOOGLE_SEARCH_API_KEY` + `GOOGLE_SEARCH_CX` | https://developers.google.com/custom-search |

**Why ALL credentials are required:**

- Tests need to execute the agent's LLM nodes to validate behavior
- Tools with missing credentials will return error dicts instead of real data
- Mock mode bypasses everything, providing no confidence in real-world performance

### Mock Mode Limitations

Mock mode (`--mock` flag or `MOCK_MODE=1`) is **ONLY for structure validation**:

- Validates graph structure (nodes, edges, connections)
- Validates that `AgentRunner.load()` succeeds and the agent is importable
- Does NOT execute event_loop agents — MockLLMProvider never calls `set_output`, so event_loop nodes loop forever
- Does NOT test LLM reasoning, content quality, or constraint validation
- Does NOT test real API integrations or tool use

**Bottom line:** If you're testing whether an agent achieves its goal, you MUST use real credentials.

### Enforcing Credentials in Tests

When writing tests, **ALWAYS include credential checks**:

```python
import os
import pytest
from aden_tools.credentials import CredentialManager

pytestmark = pytest.mark.skipif(
    not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
    reason="API key required for real testing. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
)


@pytest.fixture(scope="session", autouse=True)
def check_credentials():
    """Ensure ALL required credentials are set for real testing."""
    creds = CredentialManager()
    mock_mode = os.environ.get("MOCK_MODE")

    if not creds.is_available("anthropic"):
        if mock_mode:
            print("\nRunning in MOCK MODE - structure validation only")
        else:
            pytest.fail(
                "\nANTHROPIC_API_KEY not set!\n"
                "Set API key: export ANTHROPIC_API_KEY='your-key-here'\n"
                "Or run structure validation: MOCK_MODE=1 pytest exports/{agent}/tests/"
            )

    if not mock_mode:
        agent_tools = []  # Update per agent
        missing = creds.get_missing_for_tools(agent_tools)
        if missing:
            lines = ["\nMissing tool credentials!"]
            for name in missing:
                spec = creds.specs.get(name)
                if spec:
                    lines.append(f"  {spec.env_var} - {spec.description}")
            pytest.fail("\n".join(lines))
```

### User Communication

When the user asks to test an agent, **ALWAYS check for ALL credentials first**:

1. **Identify the agent's tools** from `mcp_servers.json`
2. **Check ALL required credentials** using `CredentialManager`
3. **Ask the user to provide any missing credentials** before proceeding
4. Collect ALL missing credentials in a single prompt — not one at a time

---

## Safe Test Patterns

### OutputCleaner

The framework automatically validates and cleans node outputs using a fast LLM at edge traversal time. Tests should still use safe patterns because OutputCleaner may not catch all issues.

### Safe Access (REQUIRED)

```python
# UNSAFE - will crash on missing keys
approval = result.output["approval_decision"]
category = result.output["analysis"]["category"]

# SAFE - use .get() with defaults
output = result.output or {}
approval = output.get("approval_decision", "UNKNOWN")

# SAFE - type check before operations
analysis = output.get("analysis", {})
if isinstance(analysis, dict):
    category = analysis.get("category", "unknown")

# SAFE - handle JSON parsing trap (LLM response as string)
import json

recommendation = output.get("recommendation", "{}")
if isinstance(recommendation, str):
    try:
        parsed = json.loads(recommendation)
        if isinstance(parsed, dict):
            approval = parsed.get("approval_decision", "UNKNOWN")
    except json.JSONDecodeError:
        approval = "UNKNOWN"
elif isinstance(recommendation, dict):
    approval = recommendation.get("approval_decision", "UNKNOWN")

# SAFE - type check before iteration
items = output.get("items", [])
if isinstance(items, list):
    for item in items:
        ...
```

### Helper Functions for conftest.py

```python
import json
import re

import pytest


def _parse_json_from_output(result, key):
    """Parse JSON from agent output (framework may store full LLM response as string)."""
    response_text = result.output.get(key, "")
    try:
        json_text = re.sub(r'```json\s*|\s*```', '', response_text).strip()
        return json.loads(json_text)
    except (json.JSONDecodeError, AttributeError, TypeError):
        return result.output.get(key)


def safe_get_nested(result, key_path, default=None):
    """Safely get nested value from result.output."""
    output = result.output or {}
    current = output
    for key in key_path:
        if isinstance(current, dict):
            current = current.get(key)
        elif isinstance(current, str):
            try:
                json_text = re.sub(r'```json\s*|\s*```', '', current).strip()
                parsed = json.loads(json_text)
                if isinstance(parsed, dict):
                    current = parsed.get(key)
                else:
                    return default
            except json.JSONDecodeError:
                return default
        else:
            return default
    return current if current is not None else default


# Make available in tests
pytest.parse_json_from_output = _parse_json_from_output
pytest.safe_get_nested = safe_get_nested
```

### ExecutionResult Fields

**`result.success=True` means NO exception, NOT goal achieved.**

```python
# WRONG
assert result.success

# RIGHT
assert result.success, f"Agent failed: {result.error}"
output = result.output or {}
approval = output.get("approval_decision")
assert approval == "APPROVED", f"Expected APPROVED, got {approval}"
```

All fields:

- `success: bool` — Completed without exception (NOT goal achieved!)
- `output: dict` — Complete memory snapshot (may contain raw strings)
- `error: str | None` — Error message if failed
- `steps_executed: int` — Number of nodes executed
- `total_tokens: int` — Cumulative token usage
- `total_latency_ms: int` — Total execution time
- `path: list[str]` — Node IDs traversed (may repeat in feedback loops)
- `paused_at: str | None` — Node ID if paused
- `session_state: dict` — State for resuming
- `node_visit_counts: dict[str, int]` — Visit counts per node (feedback loop testing)
- `execution_quality: str` — "clean", "degraded", or "failed"

### Test Count Guidance

**Write 8-15 tests, not 30+**

- 2-3 tests per success criterion
- 1 happy path test
- 1 boundary/edge case test
- 1 error handling test (optional)

Each real test costs ~3 seconds + LLM tokens. 12 tests = ~36 seconds, ~$0.12.
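Beyond `success` and `output`, the metadata fields above can back useful assertions. A sketch combining a few of them (the node name and step-count threshold are placeholders for your agent):

```python
@pytest.mark.asyncio
async def test_execution_metadata(runner, auto_responder, mock_mode):
    """Sketch: assert on execution metadata, not just result.success."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": "test topic"})
    finally:
        await auto_responder.stop()

    assert result.success, f"Agent failed: {result.error}"
    assert result.execution_quality in ("clean", "degraded"), result.execution_quality
    assert "research" in result.path, "Expected the research node to run"  # node name is a placeholder
    assert result.steps_executed >= 3, "Expected the full pipeline to execute"  # threshold is a placeholder
```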
---

## Test Patterns

### Happy Path

```python
@pytest.mark.asyncio
async def test_happy_path(runner, auto_responder, mock_mode):
    """Test normal successful execution."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": "python tutorials"})
    finally:
        await auto_responder.stop()

    assert result.success, f"Agent failed: {result.error}"
    output = result.output or {}
    assert output.get("report"), "No report produced"
```

### Boundary Condition

```python
@pytest.mark.asyncio
async def test_minimum_sources(runner, auto_responder, mock_mode):
    """Test at minimum source threshold."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": "niche topic"})
    finally:
        await auto_responder.stop()

    assert result.success, f"Agent failed: {result.error}"
    output = result.output or {}
    sources = output.get("sources", [])
    if isinstance(sources, list):
        assert len(sources) >= 3, f"Expected >= 3 sources, got {len(sources)}"
```

### Error Handling

```python
@pytest.mark.asyncio
async def test_empty_input(runner, auto_responder, mock_mode):
    """Test graceful handling of empty input."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": ""})
    finally:
        await auto_responder.stop()

    # Agent should either fail gracefully or produce an error message
    output = result.output or {}
    assert not result.success or output.get("error"), "Should handle empty input"
```

### Feedback Loop

```python
@pytest.mark.asyncio
async def test_feedback_loop_terminates(runner, auto_responder, mock_mode):
    """Test that feedback loops don't run forever."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": "test"})
    finally:
        await auto_responder.stop()

    visits = result.node_visit_counts or {}
    for node_id, count in visits.items():
        assert count <= 5, f"Node {node_id} visited {count} times — possible infinite loop"
```

---

## MCP Tool Reference

### Phase 1: Test Generation

```python
# Check existing tests
list_tests(goal_id, agent_path)

# Get constraint test guidelines (returns templates, NOT generated tests)
generate_constraint_tests(goal_id, goal_json, agent_path)
# Returns: output_file, file_header, test_template, constraints_formatted, test_guidelines

# Get success criteria test guidelines
generate_success_tests(goal_id, goal_json, node_names, tool_names, agent_path)
# Returns: output_file, file_header, test_template, success_criteria_formatted, test_guidelines
```

### Phase 2: Execution

```python
# Automated regression (no checkpoints, fresh runs)
run_tests(goal_id, agent_path, test_types='["all"]', parallel=-1, fail_fast=False)

# Run only specific test types
run_tests(goal_id, agent_path, test_types='["constraint"]')
run_tests(goal_id, agent_path, test_types='["success"]')
```

```bash
# Iterative debugging with checkpoints (via CLI)
uv run hive run exports/{agent_name} --input '{"query": "test"}'
```

### Phase 3: Analysis

```python
# Debug a specific failed test
debug_test(goal_id, test_name, agent_path)

# Find failed sessions
list_agent_sessions(agent_work_dir, status="failed", limit=5)

# Inspect session state (excludes memory values)
get_agent_session_state(agent_work_dir, session_id)

# Inspect memory data
get_agent_session_memory(agent_work_dir, session_id, key="research_results")

# Runtime logs: L1 summaries
query_runtime_logs(agent_work_dir, status="needs_attention")

# Runtime logs: L2 per-node details
query_runtime_log_details(agent_work_dir, run_id, needs_attention_only=True)

# Runtime logs: L3 tool/LLM raw data
query_runtime_log_raw(agent_work_dir, run_id, node_id="research")
# Find clean checkpoints
list_agent_checkpoints(agent_work_dir, session_id, is_clean="true")

# Compare checkpoints (memory diff)
compare_agent_checkpoints(agent_work_dir, session_id, cp_before, cp_after)
```

### Phase 5: Recovery

```python
# Inspect checkpoint before resuming
get_agent_checkpoint(agent_work_dir, session_id, checkpoint_id)
# Empty checkpoint_id = latest checkpoint
```

```bash
# Resume from checkpoint via CLI (headless)
uv run hive run exports/{agent_name} \
  --resume-session {session_id} --checkpoint {checkpoint_id}
```

---

## Anti-Patterns

| Don't | Do Instead |
|-------|------------|
| Use `default_agent.run()` in tests | Use `runner.run()` with `auto_responder` fixtures (goes through AgentRuntime) |
| Re-run entire agent when a late node fails | Resume from last clean checkpoint |
| Treat `result.success` as goal achieved | Check `result.output` for actual criteria |
| Access `result.output["key"]` directly | Use `result.output.get("key")` |
| Fix random things hoping tests pass | Analyze L2/L3 logs to find root cause first |
| Write 30+ tests | Write 8-15 focused tests |
| Skip credential check | Use `/hive-credentials` before testing |
| Confuse `exports/` with `~/.hive/agents/` | Code in `exports/`, runtime data in `~/.hive/` |
| Use `run_tests` for iterative debugging | Use headless CLI with checkpoints for iterative debugging |
| Use headless CLI for final regression | Use `run_tests` for automated regression |
| Use `--tui` from Claude Code | Use headless `run` command — TUI hangs in non-interactive shells |
| Test client-facing nodes from Claude Code | Use mock mode, or have the user run the agent in their terminal |
| Run tests without reading goal first | Always understand the goal before writing tests |
| Skip Phase 3 analysis and guess | Use session + log tools to identify root cause |

---

## Example Walkthrough: Deep Research Agent

A complete iteration showing the test loop for an agent with nodes `intake → research → review → report`.

### Phase 1: Generate tests

```python
# Read the goal
Read(file_path="exports/deep_research_agent/agent.py")

# Get success criteria test guidelines
result = generate_success_tests(
    goal_id="rigorous-interactive-research",
    goal_json='{"id": "rigorous-interactive-research", "success_criteria": [{"id": "source-diversity", "target": ">=5"}, {"id": "citation-coverage", "target": "100%"}, {"id": "report-completeness", "target": "90%"}]}',
    node_names="intake,research,review,report",
    tool_names="web_search,web_scrape",
    agent_path="exports/deep_research_agent"
)

# Write tests
Write(
    file_path=result["output_file"],
    content=result["file_header"] + "\n\n" + test_code
)
```

### Phase 2: First execution

```python
run_tests(
    goal_id="rigorous-interactive-research",
    agent_path="exports/deep_research_agent",
    fail_fast=True
)
```

Result: `test_success_source_diversity` fails — the agent only found 2 sources instead of 5.
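For context, the failing test asserts the `source-diversity` criterion (`>=5`) against the agent's output — roughly like this sketch (the actual generated test and the `sources` output key may differ):

```python
@pytest.mark.asyncio
async def test_success_source_diversity(runner, auto_responder, mock_mode):
    """Success criterion: the report draws on at least 5 distinct sources."""
    await auto_responder.start()
    try:
        result = await runner.run({"topic": "test topic"})
    finally:
        await auto_responder.stop()

    assert result.success, f"Agent failed: {result.error}"
    output = result.output or {}
    sources = output.get("sources", [])  # output key is an assumption
    assert isinstance(sources, list) and len(sources) >= 5, \
        f"Expected >= 5 distinct sources, got {sources}"
```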
### Phase 3: Analyze

```python
# Debug the failing test
debug_test(
    goal_id="rigorous-interactive-research",
    test_name="test_success_source_diversity",
    agent_path="exports/deep_research_agent"
)
# → ASSERTION_FAILURE: Expected >= 5 sources, got 2

# Find the session
list_agent_sessions(
    agent_work_dir="~/.hive/agents/deep_research_agent",
    status="completed",
    limit=1
)
# → session_20260209_150000_abc12345

# See what the research node produced
get_agent_session_memory(
    agent_work_dir="~/.hive/agents/deep_research_agent",
    session_id="session_20260209_150000_abc12345",
    key="research_results"
)
# → Only 2 web_search calls made, each returned 1 source

# Check the LLM's behavior in the research node
query_runtime_log_raw(
    agent_work_dir="~/.hive/agents/deep_research_agent",
    run_id="session_20260209_150000_abc12345",
    node_id="research"
)
# → LLM called web_search only twice, then called set_output
```

Root cause: the research node's prompt doesn't tell the LLM to search for at least 5 diverse sources. It stops after the first couple of searches.

### Phase 4: Fix the prompt

```python
Read(file_path="exports/deep_research_agent/nodes/__init__.py")

Edit(
    file_path="exports/deep_research_agent/nodes/__init__.py",
    old_string='system_prompt="Search for information on the user\'s topic."',
    new_string='system_prompt="Search for information on the user\'s topic. You MUST find at least 5 diverse, authoritative sources. Use multiple different search queries to ensure source diversity. Do not stop searching until you have at least 5 distinct sources."'
)
```

### Phase 5: Resume from checkpoint

For this example, the fix is to the `research` node. If we had run via the CLI with checkpointing, we could resume from the checkpoint after `intake` to skip re-running intake:

```python
# Check if a clean checkpoint exists after intake
list_agent_checkpoints(
    agent_work_dir="~/.hive/agents/deep_research_agent",
    session_id="session_20260209_150000_abc12345",
    is_clean="true"
)
# → cp_node_complete_intake_150005
```

```bash
# Resume from after intake, re-run research with the fixed prompt
uv run hive run exports/deep_research_agent \
  --resume-session session_20260209_150000_abc12345 \
  --checkpoint cp_node_complete_intake_150005
```

Or for this simple case (intake is fast), just re-run:

```bash
uv run hive run exports/deep_research_agent --input '{"topic": "test"}'
```

### Phase 6: Final verification

```python
run_tests(
    goal_id="rigorous-interactive-research",
    agent_path="exports/deep_research_agent"
)
# → All 12 tests pass
```

---

## Test File Structure

```
exports/{agent_name}/
├── agent.py                      ← Agent to test (goal, nodes, edges)
├── nodes/__init__.py             ← Node implementations (prompts, config)
├── config.py                     ← Agent configuration
├── mcp_servers.json              ← Tool server config
└── tests/
    ├── conftest.py               ← Shared fixtures + safe access helpers
    ├── test_constraints.py       ← Constraint tests
    ├── test_success_criteria.py  ← Success criteria tests
    └── test_edge_cases.py        ← Edge case tests
```

## Integration with Other Skills

| Scenario | From | To | Action |
|----------|------|----|--------|
| Agent built, ready to test | `/hive-create` | `/hive-test` | Generate tests, start loop |
| Prompt fix needed | `/hive-test` Phase 4 | Direct edit | Edit `nodes/__init__.py`, resume |
| Goal definition wrong | `/hive-test` Phase 4 | `/hive-create` | Update goal, may need rebuild |
| Missing credentials | `/hive-test` Phase 3 | `/hive-credentials` | Set up credentials |
| Complex runtime failure | `/hive-test` Phase 3 | `/hive-debugger` | Deep L1/L2/L3 analysis |
| All tests pass | `/hive-test` Phase 6 | Done | Agent validated |