---
name: multi-agent-observability
description: Build observability interfaces for multi-agent systems. Use when monitoring multi-agent execution, tracking agent metrics, implementing logging for parallel agents, or debugging agent workflows.
allowed-tools: Read, Grep, Glob
---

# Multi-Agent Observability Skill

Build observability interfaces for monitoring and measuring multi-agent systems.

## Purpose

Guide the design and implementation of observability layers that provide real-time visibility into multi-agent execution.

## When to Use

- Designing monitoring for agent fleets
- Building metrics dashboards
- Implementing logging architecture
- Creating cost tracking systems

## Prerequisites

- Understanding of the Three Pillars (@three-pillars-orchestration.md)
- Familiarity with results-oriented patterns (@results-oriented-engineering.md)
- Access to Claude Agent SDK documentation

## SDK Requirement

> **Implementation Note**: Full observability requires the Claude Agent SDK with custom MCP tools and UI components. This skill provides the design patterns, not a finished implementation.

## The Critical Principle

> "If you can't measure it, you can't improve it. If you can't measure it, you can't scale it."

## What to Observe

### Per-Agent Metrics

| Metric | Purpose | How to Track |
| --- | --- | --- |
| Status | Know state | Agent state enum |
| Context usage | Token consumption | API response |
| Cost | Financial impact | API usage data |
| Tool calls | What it's doing | Hook logging |
| Results | Output verification | Result parsing |
| Duration | Execution time | Timestamps |

### Aggregate Metrics

| Metric | Purpose | Calculation |
| --- | --- | --- |
| Total agents | Scale | Count of agents created in the run |
| Total duration | End-to-end time | Last end time minus first start time |
| Total cost | Financial total | Sum of per-agent costs |
| Success rate | Reliability | Agents completed without error / total agents |
| Coverage | Scope | Count of distinct files touched |
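
The aggregate calculations in the table can be sketched as one function over per-agent records. This is a minimal sketch under stated assumptions: `AgentRecord`, its field names, and `aggregate` are hypothetical and not part of any SDK, and "total agents" is taken here to mean every agent in the run rather than only those currently active.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRecord:
    """Hypothetical per-agent record; field names are illustrative."""
    agent_id: str
    started_at: float   # seconds since epoch
    ended_at: float     # seconds since epoch
    cost_usd: float
    succeeded: bool
    files_touched: set[str] = field(default_factory=set)


def aggregate(records: list[AgentRecord]) -> dict:
    """Compute aggregate metrics from a list of finished agents."""
    if not records:
        return {"total_agents": 0, "total_duration_s": 0.0,
                "total_cost_usd": 0.0, "success_rate": 0.0, "coverage": 0}
    first_start = min(r.started_at for r in records)
    last_end = max(r.ended_at for r in records)
    covered = set().union(*(r.files_touched for r in records))
    return {
        "total_agents": len(records),
        # Wall-clock span, not the sum of per-agent durations,
        # because parallel agents overlap in time.
        "total_duration_s": last_end - first_start,
        "total_cost_usd": round(sum(r.cost_usd for r in records), 4),
        "success_rate": sum(r.succeeded for r in records) / len(records),
        "coverage": len(covered),
    }
```

Measuring duration as a span rather than a sum matters for parallel fleets: two scouts that each run for 12 seconds side by side contribute roughly 12 seconds of end-to-end time, not 24.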

## Observability Components

### 1. Agent Cards

Real-time status for each agent:

```text
┌─────────────────────────────────────┐
│ scout_1                 [EXECUTING] │
├─────────────────────────────────────┤
│ Template: scout-fast                │
│ Model: haiku                        │
│ Context: 12,500 / 100,000 tokens    │
│ Cost: $0.05                         │
│ Duration: 45s                       │
│ Tool calls: 15                      │
└─────────────────────────────────────┘
```

**Required fields**:

- Agent ID and template
- Status (idle, executing, complete, error)
- Model being used
- Context usage (current / max)
- Running cost
- Execution duration
- Tool call count
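
The required fields map naturally onto a small view model. The sketch below assumes hypothetical names (`AgentStatus`, `AgentCard`, `render`) and renders plain text without the box borders; treat it as one possible shape, not a prescribed one.

```python
from dataclasses import dataclass
from enum import Enum


class AgentStatus(Enum):
    IDLE = "idle"
    EXECUTING = "executing"
    COMPLETE = "complete"
    ERROR = "error"


@dataclass
class AgentCard:
    """Hypothetical view model holding the required card fields."""
    agent_id: str
    template: str
    status: AgentStatus
    model: str
    context_used: int
    context_max: int
    cost_usd: float
    duration_s: int
    tool_calls: int

    def render(self) -> str:
        """Return a plain-text card, one field per line."""
        return "\n".join([
            f"{self.agent_id} [{self.status.value.upper()}]",
            f"Template: {self.template}",
            f"Model: {self.model}",
            f"Context: {self.context_used:,} / {self.context_max:,} tokens",
            f"Cost: ${self.cost_usd:.2f}",
            f"Duration: {self.duration_s}s",
            f"Tool calls: {self.tool_calls}",
        ])
```

Keeping status as an enum rather than a free-form string means a typo such as `"excuting"` fails at construction time instead of silently producing an unknown state in the UI.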

### 2. Event Stream

Real-time log of all activities:

```text
[10:30:00] scout_1 created (template: scout-fast)
[10:30:01] scout_1 commanded: "Analyze auth module"
[10:30:05] scout_1 Read: src/auth/login.ts
[10:30:08] scout_1 Grep: "password" in src/auth/
[10:30:15] scout_1 completed (duration: 14s)
[10:30:16] scout_1 deleted
```

**Event types**:

- Agent lifecycle (create, delete)
- Commands sent
- Tool calls
- Status changes
- Errors
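
One way to model these event types and produce the stream lines shown above is sketched here. `EventType`, `StreamEvent`, and `render_stream` are hypothetical names. Sorting by timestamp is an assumption that helps when parallel agents report events out of order.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class EventType(Enum):
    LIFECYCLE = "lifecycle"   # create, delete
    COMMAND = "command"
    TOOL_CALL = "tool_call"
    STATUS = "status"
    ERROR = "error"


@dataclass(frozen=True)
class StreamEvent:
    timestamp: datetime
    agent_id: str
    kind: EventType
    message: str

    def line(self) -> str:
        """Format as '[HH:MM:SS] agent_id message'."""
        return f"[{self.timestamp:%H:%M:%S}] {self.agent_id} {self.message}"


def render_stream(events, kinds=None):
    """Return display lines in chronological order.

    If `kinds` is given, only events of those types are included.
    """
    selected = [e for e in events if kinds is None or e.kind in kinds]
    return [e.line() for e in sorted(selected, key=lambda e: e.timestamp)]
```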

### 3. Cost Tracking

Track spend per agent and total:

```text
Cost Summary
────────────────────────────────────
scout_1 (haiku)      $0.05
scout_2 (haiku)      $0.04
builder_1 (sonnet)   $0.35
reviewer_1 (sonnet)  $0.12
────────────────────────────────────
Total                $0.56
Budget remaining     $4.44 (89%)
```

**Cost components**:

- Input tokens
- Output tokens
- Per-agent breakdown
- Running total
- Budget tracking
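
The components above can be combined into a small running ledger. This is a sketch only: `CostTracker` and its methods are hypothetical, and per-token prices are passed in by the caller because they vary by model and change over time.

```python
from collections import defaultdict


class CostTracker:
    """Hypothetical running cost ledger with a fixed budget."""

    def __init__(self, budget_usd: float) -> None:
        self.budget_usd = budget_usd
        self.per_agent: dict[str, float] = defaultdict(float)

    def record(self, agent_id: str, input_tokens: int, output_tokens: int,
               input_price: float, output_price: float) -> float:
        """Add one API response's cost; prices are USD per token."""
        cost = input_tokens * input_price + output_tokens * output_price
        self.per_agent[agent_id] += cost
        return cost

    @property
    def total(self) -> float:
        """Running total across all agents."""
        return sum(self.per_agent.values())

    @property
    def remaining(self) -> float:
        """Budget left; negative once the budget is exceeded."""
        return self.budget_usd - self.total

    def over_budget(self) -> bool:
        return self.remaining < 0
```

An orchestrator could check `over_budget()` after each `record()` call and stop launching new agents once it returns `True`.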

### 4. Result Inspector

View consumed and produced assets:

```text
Agent: builder_1

Consumed Assets:
├── Scout report (summary)
├── src/auth/middleware.ts
└── package.json

Produced Assets:
├── src/auth/rate-limit.ts (created)
├── src/auth/middleware.ts (modified)
└── tests/rate-limit.test.ts (created)

Summary: "Implemented rate limiting middleware"
Status: completed
```
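
Consumed and produced assets can often be derived from the tool-call log rather than reported by the agent itself. The sketch below assumes tool calls arrive as `(tool_name, path)` pairs, that `Read` marks a consumed file, and that `Write` or `Edit` marks a produced one. `AgentResult` and `collect_assets` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class AgentResult:
    """Hypothetical record of what one agent consumed and produced."""
    agent_id: str
    consumed: list[str] = field(default_factory=list)
    # Maps path -> "created" or "modified"
    produced: dict[str, str] = field(default_factory=dict)


def collect_assets(agent_id: str, tool_calls: list[tuple[str, str]],
                   existing_files: set[str]) -> AgentResult:
    """Classify file paths from (tool_name, path) pairs.

    `existing_files` is the set of paths present before the agent ran,
    used to tell a created file from a modified one.
    """
    result = AgentResult(agent_id)
    for tool, path in tool_calls:
        if tool == "Read" and path not in result.consumed:
            result.consumed.append(path)
        elif tool in ("Write", "Edit"):
            result.produced[path] = (
                "modified" if path in existing_files else "created"
            )
    return result
```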

### 5. Log Viewer

Filterable activity history:

```text
Filters: [agent: all] [level: all] [tool: all]

10:30:00 INFO  scout_1   Created from template
10:30:01 INFO  scout_1   Received command
10:30:05 DEBUG scout_1   Read: src/auth/login.ts (1,200 tokens)
10:30:08 DEBUG scout_1   Grep: found 5 matches
10:30:12 WARN  scout_1   Context at 80% capacity
10:30:15 INFO  scout_1   Completed successfully
```
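
The three filters in the mock-up reduce to a single predicate over log entries. In this sketch `LogEntry`, `LEVELS`, and `filter_logs` are hypothetical, and the level filter is treated as a minimum severity, which is a design assumption rather than something the mock-up specifies.

```python
from dataclasses import dataclass
from typing import Optional

# Severity ordering used for the minimum-level filter
LEVELS = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}


@dataclass(frozen=True)
class LogEntry:
    time: str
    level: str
    agent_id: str
    message: str
    tool: Optional[str] = None   # set only for tool-call entries


def filter_logs(entries, agent=None, min_level="DEBUG", tool=None):
    """Return entries matching every filter; None means 'all'."""
    threshold = LEVELS[min_level]
    return [
        e for e in entries
        if (agent is None or e.agent_id == agent)
        and LEVELS[e.level] >= threshold
        and (tool is None or e.tool == tool)
    ]
```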

## Implementation Patterns

### Logging Architecture

```python
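from dataclasses import dataclass, field
from datetime import datetime, timezone

# Event record
@dataclass
class AgentEvent:
    agent_id: str
    event_type: str  # create, command, tool, status, error
    details: dict = field(default_factory=dict)
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Log collector. `db`, `ws`, and `metrics` are placeholders for your own
# storage client, WebSocket hub, and metrics aggregator.
def log_event(event: AgentEvent, db, ws, metrics) -> None:
    # Store to database
    db.events.insert(event)
    # Emit to WebSocket
    ws.broadcast(event)
    # Update metrics
    metrics.update(event)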
```

### Real-Time Updates

```python
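import asyncio

# Async generator that polls once per `interval` seconds; a WebSocket handler
# can forward each yielded status to clients. `agent_active` and
# `get_agent_status` are placeholders for your own agent registry.
async def agent_status_stream(agent_id, interval=1.0):
    while agent_active(agent_id):
        yield get_agent_status(agent_id)
        await asyncio.sleep(interval)
    # Emit the terminal state (complete or error) after the agent stops
    yield get_agent_status(agent_id)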
```

### Cost Calculation

```python
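# Prices are USD per token for the agent's model; pass in current values
# from your provider's price list rather than hard-coding them.
def calculate_cost(usage, input_price, output_price):
    input_cost = usage.input_tokens * input_price
    output_cost = usage.output_tokens * output_price
    return input_cost + output_cost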
```

## UI Components

### Minimal CLI View

```text
Orchestration: Add rate limiting
────────────────────────────────────
Agents: 1 active | 2 complete | 0 error
Cost: $0.56 / $5.00 budget
Progress: ████████░░ 80%

[scout_1] ✓ complete (14s)
[scout_2] ✓ complete (12s)
[builder] ⚡ executing (45s)
```
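
The summary and progress lines of this view can be produced with two small helpers. `progress_bar` and `status_line` are hypothetical names, and how the progress fraction is computed (for example, completed agents over planned agents) is left to the caller.

```python
def progress_bar(fraction: float, width: int = 10) -> str:
    """Render a fixed-width bar followed by a percentage."""
    fraction = min(max(fraction, 0.0), 1.0)   # clamp to [0, 1]
    filled = round(fraction * width)
    return "█" * filled + "░" * (width - filled) + f" {fraction:.0%}"


def status_line(statuses: dict[str, str]) -> str:
    """Summarize agent states; values are 'executing', 'complete', or 'error'."""
    values = list(statuses.values())
    active = values.count("executing")
    complete = values.count("complete")
    error = values.count("error")
    return f"Agents: {active} active | {complete} complete | {error} error"
```

Deriving the counts from the same status map that drives the per-agent lines keeps the header and the list from drifting apart.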

### Rich Dashboard View

```text
┌─────────────────────────────────────────────────────────────┐
│                    Orchestration Dashboard                    │
├─────────────────────────────────────────────────────────────┤
│ Task: Add rate limiting to authentication                    │
│ Started: 10:30:00 | Duration: 2m 15s | Cost: $0.56          │
├─────────────────────────────────────────────────────────────┤
│ Agent Fleet                          │ Event Stream          │
│ ┌─────────────────────────────────┐  │ [10:32:15] builder   │
│ │ scout_1        [✓ complete]    │  │   Write: rate-limit  │
│ │ scout_2        [✓ complete]    │  │ [10:32:10] builder   │
│ │ builder        [⚡ executing]   │  │   Read: middleware   │
│ │ reviewer       [○ pending]     │  │ [10:30:15] scout_2   │
│ └─────────────────────────────────┘  │   completed          │
├─────────────────────────────────────────────────────────────┤
│ Cost Breakdown     │ Results Summary                         │
│ haiku:  $0.09     │ Files read: 8                           │
│ sonnet: $0.47     │ Files written: 3                        │
│ Total:  $0.56     │ Tests: 5/5 passing                      │
└─────────────────────────────────────────────────────────────┘
```

## Design Checklist

- [ ] Per-agent metrics defined
- [ ] Aggregate metrics calculated
- [ ] Event logging implemented
- [ ] Real-time updates via WebSocket
- [ ] Cost tracking per agent
- [ ] Result inspection available
- [ ] Log filtering supported
- [ ] UI components designed

## Output Format

When designing observability, provide:

```markdown
## Observability Design

### Metrics

**Per-Agent:**
[List with tracking method]

**Aggregate:**
[List with calculation]

### Components

**Agent Cards:** [fields and update frequency]
**Event Stream:** [event types and storage]
**Cost Tracking:** [breakdown and budgets]
**Result Inspector:** [consumed/produced format]
**Log Viewer:** [filters and retention]

### Implementation

**Logging:** [architecture]
**Real-Time:** [WebSocket design]
**Storage:** [database schema]
**UI:** [component specifications]
```

## Anti-Patterns

| Anti-Pattern | Problem | Solution |
| --- | --- | --- |
| No metrics | Flying blind | Track the per-agent and aggregate metrics listed above |
| Delayed updates | Stale status | Real-time WebSocket |
| No cost tracking | Budget overruns | Per-agent costs |
| Missing logs | Can't debug | Log all events |
| No aggregation | Can't summarize | Calculate totals |

## Cross-References

- @three-pillars-orchestration.md - Observability pillar
- @results-oriented-engineering.md - Result patterns
- @agent-lifecycle-crud.md - Agent state tracking
- @orchestrator-design skill - System architecture

## Version History

- **v1.0.0** (2025-12-26): Initial release

---

## Last Updated

**Date:** 2025-12-26
**Model:** claude-opus-4-5-20251101