---
name: hive-debugger
type: utility
description: Interactive debugging companion for Hive agents - identifies runtime issues and proposes solutions
version: 1.0.0
requires:
  - hive-concepts
tags:
  - debugging
  - runtime-logs
  - agent-development
---

# Hive Debugger

An interactive debugging companion that helps developers identify and fix runtime issues in Hive agents. The debugger analyzes runtime logs at three levels (L1/L2/L3), categorizes issues, and provides actionable fix recommendations.

## When to Use This Skill

Use `/hive-debugger` when:

- Your agent is failing or producing unexpected results
- You need to understand why a specific node is retrying repeatedly
- Tool calls are failing and you need to identify the root cause
- Agent execution is stalled or taking too long
- You want to monitor agent behavior in real-time during development

This skill works alongside agents running in TUI mode and provides supervisor-level insights into execution behavior.

### Forever-Alive Agent Awareness

Some agents use `terminal_nodes=[]` (the "forever-alive" pattern), meaning they loop indefinitely and never enter a "completed" execution state. For these agents:

- Sessions with status "in_progress" or "paused" are **normal**, not failures
- High step counts, long durations, and many node visits are expected behavior
- The agent stops only when the user explicitly exits — there is no graph-driven completion
- Debug focus should be on **quality of individual node visits and iterations**, not whether the session reached a terminal state
- Conversation memory accumulates across loops — watch for context overflow and stale data issues

**How to identify forever-alive agents:** Check `agent.py` or `agent.json` for `terminal_nodes=[]` (empty list). If empty, the agent is forever-alive.

---

## Prerequisites

Before using this skill, ensure:

1. You have an exported agent in `exports/{agent_name}/`
2. The agent has been run at least once (logs exist)
3. Runtime logging is enabled (the default in the Hive framework)
4. You have access to the agent's working directory at `~/.hive/agents/{agent_name}/`

---

## Workflow

### Stage 1: Setup & Context Gathering

**Objective:** Understand the agent being debugged

**What to do:**

1. **Ask the developer which agent needs debugging:**
   - Get the agent name (e.g., "deep_research_agent", "twitter_outreach")
   - Confirm the agent exists in `exports/{agent_name}/`

2. **Determine the agent working directory:**
   - Calculate: `~/.hive/agents/{agent_name}/`
   - Verify this directory exists and contains session logs

3. **Read the agent configuration:**
   - Read file: `exports/{agent_name}/agent.json`
   - Extract goal information from the JSON:
     - `goal.id` - The goal identifier
     - `goal.success_criteria` - What success looks like
     - `goal.constraints` - Rules the agent must follow
   - Extract graph information:
     - List of node IDs from `graph.nodes`
     - List of edges from `graph.edges`

4. **Store context for the debugging session:**
   - agent_name
   - agent_work_dir (e.g., `/home/user/.hive/agents/deep_research_agent`)
   - goal_id
   - success_criteria
   - constraints
   - node_ids

**Example:**

```
Developer: "My deep_research_agent keeps failing"

You: "I'll help debug deep_research_agent. Let me gather context..."

[Read exports/deep_research_agent/agent.json]

Context gathered:
- Agent: deep_research_agent
- Goal: deep-research
- Working Directory: /home/user/.hive/agents/deep_research_agent
- Success Criteria: ["Produce a comprehensive research report with cited sources"]
- Constraints: ["Must cite all sources", "Must cover multiple perspectives"]
- Nodes: ["intake", "research", "analysis", "report-writer"]
```
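Steps 1-4 are mechanical enough to script. A minimal sketch, assuming the `agent.json` schema described above; the top-level location of `terminal_nodes` and the `id` field on node entries are assumptions to verify against your export:

```python
import json
from pathlib import Path

def gather_context(agent_name: str) -> dict:
    """Stage 1 sketch: read the exported agent.json and collect the
    context fields used throughout this skill."""
    config = json.loads(Path("exports", agent_name, "agent.json").read_text())
    goal = config.get("goal", {})
    graph = config.get("graph", {})
    return {
        "agent_name": agent_name,
        "agent_work_dir": str(Path.home() / ".hive" / "agents" / agent_name),
        "goal_id": goal.get("id"),
        "success_criteria": goal.get("success_criteria", []),
        "constraints": goal.get("constraints", []),
        # Assumes each entry in graph.nodes carries an "id" field.
        "node_ids": [node["id"] for node in graph.get("nodes", [])],
        # terminal_nodes == [] marks a forever-alive agent; where this
        # field lives in the export is an assumption to confirm.
        "forever_alive": config.get("terminal_nodes") == [],
    }
```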
---

### Stage 2: Mode Selection

**Objective:** Choose the debugging approach that best fits the situation

**What to do:**

Ask the developer which debugging mode they want to use. Use AskUserQuestion with these options:

1. **Real-time Monitoring Mode**
   - Description: Monitor an active TUI session continuously, poll logs every 5-10 seconds, alert on new issues immediately
   - Best for: Live debugging sessions where you want to catch issues as they happen
   - Note: Requires the agent to be currently running

2. **Post-Mortem Analysis Mode**
   - Description: Analyze completed or failed runs in detail, deep dive into a specific session
   - Best for: Understanding why a past execution failed
   - Note: The most common mode for debugging

3. **Historical Trends Mode**
   - Description: Analyze patterns across multiple runs, identify recurring issues
   - Best for: Finding systemic problems that happen repeatedly
   - Note: Useful for agents that have run many times

**Implementation:**
```
Use AskUserQuestion to present these options and let the developer choose.
Store the selected mode for the session.
```

---

### Stage 3: Triage (L1 Analysis)

**Objective:** Identify which sessions need attention

**What to do:**

1. **Query high-level run summaries** using the MCP tool:
   ```
   query_runtime_logs(
       agent_work_dir="{agent_work_dir}",
       status="needs_attention",
       limit=20
   )
   ```

2. **Analyze the results:**
   - Look for runs with `needs_attention: true`
   - Check `attention_summary.categories` for issue types
   - Note the `run_id` of problematic sessions
   - Check the `status` field: "degraded", "failure", "in_progress"
   - **For forever-alive agents:** Sessions with status "in_progress" or "paused" are normal — these agents never reach "completed". Only flag sessions with `needs_attention: true` or actual error indicators (tool failures, retry loops, missing outputs). High step counts alone do not indicate a problem.

3. **Know the attention flag triggers:** From `runtime_logger.py`, runs are flagged when:
   - retry_count > 3
   - escalate_count > 2
   - latency_ms > 60000
   - tokens_used > 100000
   - total_steps > 20

4. **Present findings to the developer:**
   - Summarize how many runs need attention
   - List the most recent problematic runs
   - Show attention categories for each
   - Ask which run they want to investigate (if multiple)

**Example Output:**
```
Found 2 runs needing attention:

1. session_20260206_115718_e22339c5 (30 minutes ago)
   Status: degraded
   Categories: missing_outputs, retry_loops

2. session_20260206_103422_9f8d1b2a (2 hours ago)
   Status: failure
   Categories: tool_failures, high_latency

Which run would you like to investigate?
```

---

### Stage 4: Diagnosis (L2 Analysis)

**Objective:** Identify which nodes failed and what patterns exist

**What to do:**

1. **Query per-node details** using the MCP tool:
   ```
   query_runtime_log_details(
       agent_work_dir="{agent_work_dir}",
       run_id="{selected_run_id}",
       needs_attention_only=True
   )
   ```

2.
**Categorize issues** using the Issue Taxonomy: **10 Issue Categories:** | Category | Detection Pattern | Meaning | |----------|------------------|---------| | **Missing Outputs** | `exit_status != "success"`, `attention_reasons` contains "missing_outputs" | Node didn't call set_output with required keys | | **Tool Errors** | `tool_error_count > 0`, `attention_reasons` contains "tool_failures" | Tool calls failed (API errors, timeouts, auth issues) | | **Retry Loops** | `retry_count > 3`, `verdict_counts.RETRY > 5` | Judge repeatedly rejecting outputs | | **Guard Failures** | `guard_reject_count > 0` | Output validation failed (wrong types, missing keys) | | **Stalled Execution** | `total_steps > 20`, `verdict_counts.CONTINUE > 10` | EventLoopNode not making progress. **Caveat:** Forever-alive agents may legitimately have high step counts — check if agent is blocked at a client-facing node (normal) vs genuinely stuck in a loop | | **High Latency** | `latency_ms > 60000`, `avg_step_latency > 5000` | Slow tool calls or LLM responses | | **Client-Facing Issues** | `client_input_requested` but no `user_input_received` | Premature set_output before user input | | **Edge Routing Errors** | `exit_status == "no_valid_edge"`, `attention_reasons` contains "routing_issue" | No edges match current state | | **Memory/Context Issues** | `tokens_used > 100000`, `context_overflow_count > 0` | Conversation history too long | | **Constraint Violations** | Compare output against goal constraints | Agent violated goal-level rules | **Forever-Alive Agent Caveat:** If the agent uses `terminal_nodes=[]`, sessions will never reach "completed" status. This is by design. When debugging these agents, focus on: - Whether individual node visits succeed (not whether the graph "finishes") - Quality of each loop iteration — are outputs improving or degrading across loops? - Whether client-facing nodes are correctly blocking for user input - Memory accumulation issues: stale data from previous loops, context overflow across many iterations - Conversation compaction behavior: is the conversation growing unbounded? 3. **Analyze each flagged node:** - Node ID and name - Exit status - Retry count - Verdict distribution (ACCEPT/RETRY/ESCALATE/CONTINUE) - Attention reasons - Total steps executed 4. **Present diagnosis to developer:** - List problematic nodes - Categorize each issue - Highlight the most severe problems - Show evidence (retry counts, error types) **Example Output:** ``` Diagnosis for session_20260206_115718_e22339c5: Problem Node: research ├─ Exit Status: escalate ├─ Retry Count: 5 (HIGH) ├─ Verdict Counts: {RETRY: 5, ESCALATE: 1} ├─ Attention Reasons: ["high_retry_count", "missing_outputs"] ├─ Total Steps: 8 └─ Categories: Missing Outputs + Retry Loops Root Issue: The research node is stuck in a retry loop because it's not setting required outputs. ``` --- ### Stage 5: Root Cause Analysis (L3 Analysis) **Objective:** Understand exactly what went wrong by examining detailed logs **What to do:** 1. **Query detailed tool/LLM logs** using the MCP tool: ``` query_runtime_log_raw( agent_work_dir="{agent_work_dir}", run_id="{run_id}", node_id="{problem_node_id}" ) ``` 2. 
**Analyze based on issue category:** **For Missing Outputs:** - Check `step.tool_calls` for set_output usage - Look for conditional logic that skipped set_output - Check if LLM is calling other tools instead **For Tool Errors:** - Check `step.tool_results` for error messages - Identify error types: rate limits, auth failures, timeouts, network errors - Note which specific tool is failing **For Retry Loops:** - Check `step.verdict_feedback` from judge - Look for repeated failure reasons - Identify if it's the same issue every time **For Guard Failures:** - Check `step.guard_results` for validation errors - Identify missing keys or type mismatches - Compare actual output to expected schema **For Stalled Execution:** - Check `step.llm_response_text` for repetition - Look for LLM stuck in same action loop - Check if tool calls are succeeding but not progressing 3. **Extract evidence:** - Specific error messages - Tool call arguments and results - LLM response text - Judge feedback - Step-by-step progression 4. **Formulate root cause explanation:** - Clearly state what is happening - Explain why it's happening - Show evidence from logs **Example Output:** ``` Root Cause Analysis for research: Step-by-step breakdown: Step 3: - Tool Call: web_search(query="latest AI regulations 2026") - Result: Found relevant articles and sources - Verdict: RETRY - Feedback: "Missing required output 'research_findings'. You found sources but didn't call set_output." Step 4: - Tool Call: web_search(query="AI regulation policy 2026") - Result: Found additional policy information - Verdict: RETRY - Feedback: "Still missing 'research_findings'. Use set_output to save your findings." Steps 5-7: Similar pattern continues... ROOT CAUSE: The node is successfully finding research sources via web_search, but the LLM is not calling set_output to save the results. It keeps searching for more information instead of completing the task. ``` --- ### Stage 6: Fix Recommendations **Objective:** Provide actionable solutions the developer can implement **What to do:** Based on the issue category identified, provide specific fix recommendations using these templates: #### Template 1: Missing Outputs (Client-Facing Nodes) ```markdown ## Issue: Premature set_output in Client-Facing Node **Root Cause:** Node called set_output before receiving user input **Fix:** Use STEP 1/STEP 2 prompt pattern **File to edit:** `exports/{agent_name}/nodes/{node_name}.py` **Changes:** 1. Update the system_prompt to include explicit step guidance: ```python system_prompt = """ STEP 1: Analyze the user input and decide what action to take. DO NOT call set_output in this step. STEP 2: After receiving feedback or completing analysis, ONLY THEN call set_output with your results. """ ``` 2. 
If some inputs are optional (like feedback on retry edges), add nullable_output_keys:

```python
nullable_output_keys=["feedback"]
```

**Verification:**
- Run the agent with test input
- Verify the client-facing node waits for user input before calling set_output
```

#### Template 2: Retry Loops

```markdown
## Issue: Judge Repeatedly Rejecting Outputs

**Root Cause:** {Insert specific reason from verdict_feedback}

**Fix Options:**

**Option A - If outputs are actually correct:** Adjust judge evaluation rules
- File: `exports/{agent_name}/agent.json`
- Update the `evaluation_rules` section to accept the current output format
- Example: If the judge expects a list but gets a string, update the rule to accept both

**Option B - If the prompt is ambiguous:** Clarify node instructions
- File: `exports/{agent_name}/nodes/{node_name}.py`
- Make the system_prompt more explicit about output format and requirements
- Add examples of correct outputs

**Option C - If the tool is unreliable:** Add retry logic with fallback
- Consider using alternative tools
- Add a manual fallback option
- Update the prompt to handle tool failures gracefully

**Verification:**
- Run the node with test input
- Confirm the judge accepts the output on the first try
- Check that retry_count stays at 0
```

#### Template 3: Tool Errors

```markdown
## Issue: {tool_name} Failing with {error_type}

**Root Cause:** {Insert specific error message from logs}

**Fix Strategy:**

**If API rate limit:**
1. Add exponential backoff in tool retry logic
2. Reduce API call frequency
3. Consider caching results

**If auth failure:**
1. Check credentials using:
```bash
/hive-credentials --agent {agent_name}
```
2. Verify API key environment variables
3. Update `mcp_servers.json` if needed

**If timeout:**
1. Increase the timeout in `mcp_servers.json`:
```json
{
  "timeout_ms": 60000
}
```
2. Consider using faster alternative tools
3. Break large requests into smaller chunks

**Verification:**
- Test the tool call manually
- Confirm a successful response
- Monitor for recurring errors
```

#### Template 4: Edge Routing Errors

```markdown
## Issue: No Valid Edge from Node {node_id}

**Root Cause:** No edge condition matched the current state

**File to edit:** `exports/{agent_name}/agent.json`

**Analysis:**
- Current node output: {show actual output keys}
- Existing edge conditions: {list edge conditions}
- Why no match: {explain the mismatch}

**Fix:** Add the missing edge to the graph:
```json
{
  "edge_id": "{node_id}_to_{target_node}",
  "source": "{node_id}",
  "target": "{target_node}",
  "condition": "on_success"
}
```

**Alternative:** Update an existing edge condition to cover this case

**Verification:**
- Run the agent with the same input
- Verify the edge is traversed successfully
- Check that execution continues to the next node
```

#### Template 5: Stalled Execution

```markdown
## Issue: EventLoopNode Not Making Progress

**Root Cause:** {Insert analysis - e.g., "LLM repeating same failed action"}

**File to edit:** `exports/{agent_name}/nodes/{node_name}.py`

**Fix:** Update the system_prompt to guide the LLM out of loops

**Add this guidance:**
```python
system_prompt = """
{existing prompt}

IMPORTANT: If a tool call fails multiple times:
1. Try an alternative approach or different tool
2. If no alternatives work, call set_output with partial results
3. DO NOT retry the same failed action more than 3 times

Progress is more important than perfection. Move forward even with incomplete data.
"""
```

**Additional fix:** Cap node visits to prevent infinite loops

```python
# In node configuration
max_node_visits=3  # Prevent getting stuck
```

**Verification:**
- Run the node with the same input that caused the stall
- Verify it exits after reasonable attempts (< 10 steps)
- Confirm it eventually calls set_output
```

#### Template 6: Checkpoint Recovery (Post-Fix Resume)

```markdown
## Recovery Strategy: Resume from Last Clean Checkpoint

**Situation:** You've fixed the issue, but the failed session is stuck mid-execution

**Solution:** Resume execution from a checkpoint before the failure

### Option A: Auto-Resume from Latest Checkpoint (Recommended)

Use CLI arguments to auto-resume when launching the TUI:

```bash
PYTHONPATH=core:exports python -m {agent_name} --tui \
  --resume-session {session_id}
```

This will:
- Load session state from `state.json`
- Continue from where it paused/failed
- Apply your fixes immediately

### Option B: Resume from Specific Checkpoint (Time-Travel)

If you need to go back to an earlier point:

```bash
PYTHONPATH=core:exports python -m {agent_name} --tui \
  --resume-session {session_id} \
  --checkpoint {checkpoint_id}
```

Example:

```bash
PYTHONPATH=core:exports python -m deep_research_agent --tui \
  --resume-session session_20260208_143022_abc12345 \
  --checkpoint cp_node_complete_intake_143030
```

### Option C: Use TUI Commands

Alternatively, launch the TUI normally and use commands:

```bash
# Launch TUI
PYTHONPATH=core:exports python -m {agent_name} --tui

# In TUI, use commands:
/resume {session_id}                    # Resume from session state
/recover {session_id} {checkpoint_id}   # Recover from specific checkpoint
```

### When to Use Each Option:

**Use `/resume` (or --resume-session) when:**
- You fixed credentials and want to retry
- The agent paused and you want to continue
- The agent failed and you want to retry from the last state

**Use `/recover` (or --resume-session + --checkpoint) when:**
- You need to go back to an earlier checkpoint
- You want to try a different path from a specific point
- Debugging requires time-travel to an earlier state

### Find Available Checkpoints:

Use MCP tools to programmatically find and inspect checkpoints:

```
# List all sessions to find the failed one
list_agent_sessions(agent_work_dir="~/.hive/agents/{agent_name}", status="failed")

# Inspect session state
get_agent_session_state(agent_work_dir="~/.hive/agents/{agent_name}", session_id="{session_id}")

# Find clean checkpoints to resume from
list_agent_checkpoints(agent_work_dir="~/.hive/agents/{agent_name}", session_id="{session_id}", is_clean="true")

# Compare checkpoints to understand what changed
compare_agent_checkpoints(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="{session_id}",
    checkpoint_id_before="cp_node_complete_intake_143030",
    checkpoint_id_after="cp_node_complete_research_143115"
)

# Inspect memory at a specific checkpoint
get_agent_checkpoint(agent_work_dir="~/.hive/agents/{agent_name}", session_id="{session_id}", checkpoint_id="cp_node_complete_intake_143030")
```

Or in the TUI:

```bash
/sessions {session_id}
```

**Verification:**
- Use `--resume-session` to test your fix immediately
- No need to re-run from the beginning
- The session continues with your code changes applied
```

**Selecting the right template:**
- Match the issue category from Stage 4
- Customize with specific details from Stage 5
- Include actual error messages and code snippets
- Provide file paths and line numbers when possible
- **Always include recovery commands** (Template 6) after providing fix recommendations
---

### Stage 7: Verification Support

**Objective:** Help the developer confirm their fixes work

**What to do:**

1. **Suggest appropriate tests based on fix type:**

   **For node-level fixes:**
   ```bash
   # Use hive-test to run goal-based tests
   /hive-test --agent {agent_name} --goal {goal_id}

   # Or run specific test scenarios
   /hive-test --agent {agent_name} --scenario {specific_input}
   ```

   **For quick manual tests:**
   ```bash
   # Launch the interactive TUI dashboard
   hive tui
   ```
   Then use the arrow keys to select the agent from the list and press Enter to run it.

2. **Provide MCP tool queries to validate the fix:**

   **Check if the issue is resolved:**
   ```
   query_runtime_logs(
       agent_work_dir="~/.hive/agents/{agent_name}",
       status="needs_attention",
       limit=5
   )
   # Should show 0 results if fully fixed
   ```

   **Verify specific node behavior:**
   ```
   query_runtime_log_details(
       agent_work_dir="~/.hive/agents/{agent_name}",
       run_id="{new_run_id}",
       node_id="{fixed_node_id}"
   )
   # Should show exit_status="success", retry_count=0
   ```

3. **Monitor for regression:**
   - Run the agent multiple times
   - Check for similar issues reappearing
   - Verify the fix works across different inputs

4. **Provide a verification checklist:**
   ```
   Verification Checklist:
   □ Applied recommended fix to code
   □ Ran agent with test input
   □ Checked runtime logs show no attention flags
   □ Verified specific node completes successfully
   □ Tested with multiple inputs
   □ No regression of original issue
   □ Agent meets success criteria
   ```

**Example interaction:**
```
Developer: "I applied the fix to research. How do I verify it works?"

You: "Great! Let's verify the fix with these steps:

1. Launch the TUI dashboard:
   hive tui
   Then select your agent from the list and press Enter to run it.

2. After it completes, check the logs:
   [Use query_runtime_logs to check for attention flags]

3. Verify the specific node:
   [Use query_runtime_log_details for research]

Expected results:
- No 'needs_attention' flags
- research shows exit_status='success'
- retry_count should be 0

Let me know when you've run it and I'll help check the logs!"
```
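If you prefer a scriptable check over manual queries, the session log layout (see Storage Locations at the end of this document) makes a quick scan possible. A minimal sketch; the JSON summary location and the `needs_attention` field name are assumptions about the log schema, so prefer the MCP tools when in doubt:

```python
import json
from pathlib import Path

def flagged_sessions(agent_name: str) -> list[str]:
    """Scan per-session logs for attention flags. The summary file layout
    and field name are assumptions; the MCP tools are the supported path
    and also know about the old storage layout."""
    sessions = Path.home() / ".hive" / "agents" / agent_name / "sessions"
    flagged = []
    for summary in sessions.glob("session_*/logs/*.json"):
        try:
            data = json.loads(summary.read_text())
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and data.get("needs_attention"):
            # sessions/<session_id>/logs/<file>.json -> session_id
            flagged.append(summary.parent.parent.name)
    return sorted(set(flagged))
```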
---

## MCP Tool Usage Guide

### Three Levels of Observability

**L1: query_runtime_logs** - Session-level summaries
- **When to use:** Initial triage, identifying problematic runs, monitoring trends
- **Returns:** List of runs with status, attention flags, timestamps
- **Example:**
  ```
  query_runtime_logs(
      agent_work_dir="/home/user/.hive/agents/deep_research_agent",
      status="needs_attention",
      limit=20
  )
  ```

**L2: query_runtime_log_details** - Node-level details
- **When to use:** Diagnosing which nodes failed, understanding retry patterns
- **Returns:** Per-node completion details, retry counts, verdicts
- **Example:**
  ```
  query_runtime_log_details(
      agent_work_dir="/home/user/.hive/agents/deep_research_agent",
      run_id="session_20260206_115718_e22339c5",
      needs_attention_only=True
  )
  ```

**L3: query_runtime_log_raw** - Step-level details
- **When to use:** Root cause analysis, understanding exact failures
- **Returns:** Full tool calls, LLM responses, judge feedback
- **Example:**
  ```
  query_runtime_log_raw(
      agent_work_dir="/home/user/.hive/agents/deep_research_agent",
      run_id="session_20260206_115718_e22339c5",
      node_id="research"
  )
  ```

### Session & Checkpoint Tools

**list_agent_sessions** - Browse sessions with filtering
- **When to use:** Finding resumable sessions, identifying failed sessions, Stage 3 triage
- **Returns:** Session list with status, timestamps, is_resumable, current_node, quality
- **Example:**
  ```
  list_agent_sessions(
      agent_work_dir="/home/user/.hive/agents/twitter_outreach",
      status="failed",
      limit=10
  )
  ```

**get_agent_session_state** - Load full session state (excludes memory values)
- **When to use:** Inspecting session progress, checking is_resumable, examining the path
- **Returns:** Full state with memory_keys/memory_size instead of memory values
- **Example:**
  ```
  get_agent_session_state(
      agent_work_dir="/home/user/.hive/agents/twitter_outreach",
      session_id="session_20260208_143022_abc12345"
  )
  ```

**get_agent_session_memory** - Get memory contents from a session
- **When to use:** Stage 5 root cause analysis, inspecting produced data
- **Returns:** All memory keys+values, or a single key's value
- **Example:**
  ```
  get_agent_session_memory(
      agent_work_dir="/home/user/.hive/agents/twitter_outreach",
      session_id="session_20260208_143022_abc12345",
      key="twitter_handles"
  )
  ```

**list_agent_checkpoints** - List checkpoints for a session
- **When to use:** Stage 6 recovery, finding clean checkpoints to resume from
- **Returns:** Checkpoint summaries with type, node, clean status
- **Example:**
  ```
  list_agent_checkpoints(
      agent_work_dir="/home/user/.hive/agents/twitter_outreach",
      session_id="session_20260208_143022_abc12345",
      is_clean="true"
  )
  ```

**get_agent_checkpoint** - Load a specific checkpoint with full state
- **When to use:** Inspecting exact state at a checkpoint, comparing to current state
- **Returns:** Full checkpoint: memory snapshot, execution path, metrics
- **Example:**
  ```
  get_agent_checkpoint(
      agent_work_dir="/home/user/.hive/agents/twitter_outreach",
      session_id="session_20260208_143022_abc12345",
      checkpoint_id="cp_node_complete_intake_143030"
  )
  ```

**compare_agent_checkpoints** - Diff memory between two checkpoints
- **When to use:** Understanding data flow, finding where state diverged
- **Returns:** Memory diff (added/removed/changed keys) + execution path diff
- **Example:**
  ```
  compare_agent_checkpoints(
      agent_work_dir="/home/user/.hive/agents/twitter_outreach",
      session_id="session_20260208_143022_abc12345",
      checkpoint_id_before="cp_node_complete_intake_143030",
      checkpoint_id_after="cp_node_complete_research_143115"
  )
  ```

### Query Patterns

**Pattern 1: Top-Down Investigation** (Most common)
```
1. L1: Find problematic runs
2. L2: Identify failing nodes
3. L3: Analyze specific failures
```

**Pattern 2: Node-Specific Debugging**
```
1. L2: Get details for specific node across all runs
2. L3: Deep dive into worst failures
```

**Pattern 3: Real-time Monitoring**
```
Loop every 10 seconds:
1. L1: Check for new needs_attention runs
2. If found: Alert and drill into L2
```

**Pattern 4: Session State + Checkpoint Recovery**
```
1. list_agent_sessions: Find failed/paused sessions
2. get_agent_session_state: Check is_resumable, see execution path
3. get_agent_session_memory: Inspect what data was produced
4. list_agent_checkpoints: Find clean checkpoints before failure
5. compare_agent_checkpoints: Understand what changed between checkpoints
6. Recommend resume command with specific checkpoint
```
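Pattern 3 can be automated with a small polling loop. A minimal sketch; the `query_runtime_logs` parameter stands in for a callable bound to the MCP tool of the same name, and how you bind it depends on your MCP client:

```python
import time
from typing import Callable

def monitor_new_issues(query_runtime_logs: Callable,
                       agent_work_dir: str,
                       poll_seconds: float = 10.0) -> None:
    """Pattern 3 sketch: poll L1 summaries and alert on newly flagged runs."""
    seen: set[str] = set()
    while True:
        runs = query_runtime_logs(agent_work_dir=agent_work_dir,
                                  status="needs_attention", limit=20)
        for run in runs:
            run_id = run["run_id"]
            if run_id not in seen:
                seen.add(run_id)
                # Alert, then drill into L2 with query_runtime_log_details
                print(f"New flagged run: {run_id}")
        time.sleep(poll_seconds)
```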
---

## Complete Example Walkthrough

### Scenario: deep_research_agent Failing

**Developer says:** "My deep_research_agent keeps failing. It's not saving research findings correctly."

#### Stage 1: Setup

```
You: "I'll help debug deep_research_agent. Let me gather context..."

[Read exports/deep_research_agent/agent.json]

Context:
- Agent: deep_research_agent
- Goal: deep-research
- Working Dir: ~/.hive/agents/deep_research_agent
- Success Criteria: ["Produce a comprehensive research report with cited sources"]
- Constraints: ["Must cite all sources", "Must cover multiple perspectives"]
- Nodes: intake, research, analysis, report-writer
```

#### Stage 2: Mode Selection

```
You: "Which debugging approach would you prefer?"

[Present 3 mode options using AskUserQuestion]

Developer selects: "Post-Mortem Analysis Mode"
```

#### Stage 3: Triage (L1)

```
You: "Querying runtime logs for problematic runs..."

[Call query_runtime_logs with status="needs_attention"]

Results:
- Found 1 run needing attention
- session_20260206_115718_e22339c5 (30 min ago)
- Status: degraded
- Categories: missing_outputs, retry_loops
```

#### Stage 4: Diagnosis (L2)

```
You: "Analyzing node-level details..."

[Call query_runtime_log_details for the session]

Results:
- Problem Node: research
- Exit Status: escalate
- Retry Count: 5
- Verdict Counts: {RETRY: 5, ESCALATE: 1}
- Attention Reasons: ["high_retry_count", "missing_outputs"]
- Total Steps: 8

Diagnosis: research has retry loop + missing outputs issue
Category: Missing Outputs + Retry Loops
```

#### Stage 5: Root Cause (L3)

```
You: "Drilling into detailed logs for research..."

[Call query_runtime_log_raw for node research]

Evidence:

Step 3: web_search("latest AI regulations 2026") → Found articles
  Verdict: RETRY
  Feedback: "Missing required output 'research_findings'. You found sources but didn't call set_output."

Step 4: web_search("AI regulation policy 2026") → Found more info
  Verdict: RETRY
  Feedback: "Still missing 'research_findings'. Use set_output to save your findings."

Steps 5-7: Same pattern repeats...

ROOT CAUSE: The node successfully finds sources but the LLM doesn't call set_output. It keeps searching instead of completing.
```
#### Stage 6: Fix Recommendation

```
You: "Here's the fix for this issue:

## Issue: research Node Not Setting Outputs

**Root Cause:** Node successfully finds research sources but doesn't call set_output
**Category:** Missing Outputs

**File to edit:** exports/deep_research_agent/nodes/research.py

**Changes:** Update the system_prompt to emphasize the output requirement:

```python
system_prompt = """
You are the Research node for the deep research agent.

Your task:
1. Search for relevant sources on the research topic
2. Extract key findings from each source
3. **CRITICAL: Call set_output(research_findings=[...]) immediately after gathering sufficient findings**

IMPORTANT: You MUST call set_output even if you found findings on the first search.
Do not continue searching indefinitely - set output and finish once you have enough data.

Output format:
{
  "research_findings": [{"source": "...", "summary": "...", "url": "..."}, ...]
}
"""
```

**Why this works:** The explicit instruction to call set_output immediately prevents the LLM from continuing to search after finding sufficient sources.
```

#### Stage 7: Verification

```
Developer: "I updated the prompt. How do I test it?"

You: "Let's verify the fix:

1. Launch the TUI dashboard:
   hive tui
   Then select your agent and press Enter to run it.

2. I'll check the logs after it runs..."

[Developer runs agent]

Developer: "It completed! Can you check if it worked?"

You: "Checking now..."

[Call query_runtime_log_details for the new run]

Results:
- Node: research
- Exit Status: success ✓
- Retry Count: 0 ✓
- Total Steps: 2 ✓
- No attention flags ✓

"Perfect! The fix worked. The research node now:
- Completes successfully on first try
- No retries needed
- Calls set_output properly

Your agent should now work correctly!"
```

---

## Tips for Effective Debugging

1. **Always start with L1 logs** - Don't jump straight to detailed logs
2. **Focus on attention flags** - They highlight the real issues
3. **Compare verdict_feedback across steps** - Patterns reveal root causes
4. **Check tool error messages carefully** - They often contain the exact problem
5. **Consider the agent's goal** - Fixes should align with success criteria
6. **Test fixes immediately** - Quick verification prevents wasted effort
7. **Look for patterns across multiple runs** - One-time failures might be transient

## Common Pitfalls to Avoid

1. **Don't recommend code you haven't verified exists** - Always read files first
2. **Don't assume tool capabilities** - Check MCP server configs
3. **Don't ignore edge conditions** - Missing edges cause routing failures
4. **Don't overlook judge configuration** - Mismatched expectations cause retry loops
5. **Don't forget nullable_output_keys** - Optional inputs need explicit marking
6. **Don't diagnose "in_progress" as a failure for forever-alive agents** - Agents with `terminal_nodes=[]` are designed to never enter the "completed" state. This is intentional. Focus on the quality of individual node visits, not session completion status
7. **Don't ignore conversation memory issues in long-running sessions** - In continuous conversation mode, history grows across node transitions and loop iterations. Watch for context overflow (tokens_used > 100K), stale data from previous loops affecting edge conditions, and compaction failures that cause the LLM to lose important context
8. **Don't confuse "waiting for user" with "stalled"** - Client-facing nodes in forever-alive agents block for user input by design. A session paused at a client-facing node is working correctly, not stalled

---

## Storage Locations Reference

**New unified storage (default):**
- Logs: `~/.hive/agents/{agent_name}/sessions/session_YYYYMMDD_HHMMSS_{uuid}/logs/`
- State: `~/.hive/agents/{agent_name}/sessions/{session_id}/state.json`
- Conversations: `~/.hive/agents/{agent_name}/sessions/{session_id}/conversations/`

**Old storage (deprecated, still supported):**
- Logs: `~/.hive/agents/{agent_name}/runtime_logs/runs/{run_id}/`

The MCP tools automatically check both locations.
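For manual spelunking across both layouts, a small path helper can mirror that fallback. A sketch only; the MCP tools already do this for you:

```python
from pathlib import Path

def runtime_log_dirs(agent_name: str, run_id: str) -> list[Path]:
    """Return candidate log directories for a run, newest layout first,
    mirroring the fallback behavior described above."""
    base = Path.home() / ".hive" / "agents" / agent_name
    return [
        base / "sessions" / run_id / "logs",      # new unified storage
        base / "runtime_logs" / "runs" / run_id,  # old deprecated layout
    ]
```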
---

**Remember:** Your role is to be a debugging companion and thought partner. Guide the developer through the investigation, explain what you find, and provide actionable fixes. Don't just report errors - help the developer understand and solve them.