---
name: witness-observer
description: PMU observation pattern for agent health monitoring. Runs patrol loops, detects stalled agents, sends nudges, and escalates persistent failures to the mayor. Never modifies agent work or resolves conflicts.
type: skill
category: orchestration
status: stable
origin: tibsfox
modified: false
first_seen: 2026-03-06
first_path: .claude/skills/witness-observer/SKILL.md
superseded_by: null
---

# Witness Observer

Per-rig observer that monitors polecat health and reports anomalies. The witness is the PMU (Performance Monitoring Unit) of the Gastown chipset -- it watches execution units for stalls, detects degraded performance, and raises alerts without interfering with computation.

The witness is strictly read-only with respect to agent work. It observes and reports; it never modifies.

## Activation Triggers

This skill activates when:

- The agent is assigned to monitor a rig's worker agents
- Multiple polecats are running and health monitoring is needed
- Stall detection is required for long-running work items
- The mayor needs a supervisory agent to watch active polecats

## Core Capabilities

### Patrol Loop

The witness runs a periodic patrol that checks all active agents in its rig for health indicators.

**Patrol cycle:**

```
SCAN             EVALUATE         ACT                 WAIT
  |                 |               |                   |
  v                 v               v                   v
list agents  ->  check each  ->  nudge/escalate  ->  sleep interval
(active ones)    for stalls      if needed           (default 5 min)
```

**Implementation:**

```typescript
const state = new StateManager({ stateDir: '.chipset/state/' });
const patrolInterval = 5 * 60 * 1000;  // 5 minutes (configurable)
const stallThreshold = 30 * 60 * 1000; // 30 minutes (configurable)

async function patrol(): Promise<void> {
  // Get all agents that should be working
  const agents = await state.listAgents({ role: 'polecat' });
  const active = agents.filter(a => a.status === 'active');

  for (const agent of active) {
    // Skip agents with no hooked work or an inactive hook
    const hook = await state.getHook(agent.id);
    if (!hook || hook.status !== 'active') continue;

    // Check last activity timestamp
    const lastActivity = new Date(hook.lastActivity).getTime();
    const elapsed = Date.now() - lastActivity;

    if (elapsed > stallThreshold) {
      await handleStall(agent, hook, elapsed);
    }
  }
}
```

### Stall Detection

A stall is detected when an agent has hooked work but has not updated its activity timestamp within the threshold period (default 30 minutes).

**Stall indicators:**

| Indicator | What It Means |
|-----------|---------------|
| Hook active, no activity for 30+ min | Agent may be stuck, crashed, or idle |
| Agent status is 'active' but hook timestamp stale | Session may have ended without cleanup |
| Multiple consecutive patrol cycles with no change | Persistent stall, needs escalation |

**Stall classification:**

```typescript
type StallSeverity = 'warning' | 'alert' | 'critical';

function classifyStall(elapsed: number, nudgesSent: number): StallSeverity {
  if (nudgesSent >= 2) return 'critical';       // Nudged twice, still stalled
  if (elapsed > 60 * 60 * 1000) return 'alert'; // Over 1 hour
  return 'warning';                             // Under 1 hour, fewer than 2 nudges
}
```

### Nudge Protocol

When a stall is detected, the witness follows a graduated escalation protocol.

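
To see how the graduated protocol plays out over time, the sketch below replays successive patrol cycles for one agent that never responds, using the default 5-minute interval and 30-minute threshold together with the `classifyStall` rules from Stall Detection. The `simulateStall` helper and the `CycleOutcome` shape are hypothetical, introduced only for illustration, and the sketch assumes one nudge is recorded per `warning` cycle.

```typescript
type StallSeverity = 'warning' | 'alert' | 'critical';

const PATROL_INTERVAL_MS = 5 * 60 * 1000;  // default patrol interval
const STALL_THRESHOLD_MS = 30 * 60 * 1000; // default stall threshold

// Same rules as classifyStall in the Stall Detection section.
function classifyStall(elapsed: number, nudgesSent: number): StallSeverity {
  if (nudgesSent >= 2) return 'critical';
  if (elapsed > 60 * 60 * 1000) return 'alert';
  return 'warning';
}

// Hypothetical record of what one patrol cycle decided.
interface CycleOutcome {
  elapsedMinutes: number;
  severity: StallSeverity;
  action: 'nudge' | 'escalate';
}

// Hypothetical helper: replay patrol cycles for an agent that never responds.
function simulateStall(firstElapsedMs: number, cycles: number): CycleOutcome[] {
  const outcomes: CycleOutcome[] = [];
  let nudgesSent = 0;

  for (let i = 0; i < cycles; i++) {
    const elapsed = firstElapsedMs + i * PATROL_INTERVAL_MS;
    if (elapsed <= STALL_THRESHOLD_MS) continue; // not yet a stall

    const severity = classifyStall(elapsed, nudgesSent);
    const action: CycleOutcome['action'] =
      severity === 'warning' ? 'nudge' : 'escalate';
    if (action === 'nudge') nudgesSent++;

    outcomes.push({
      elapsedMinutes: Math.floor(elapsed / 60000),
      severity,
      action,
    });
  }
  return outcomes;
}

// Agent first seen 31 minutes idle, observed over three patrol cycles.
const timeline = simulateStall(31 * 60 * 1000, 3);
```

Under these assumptions the agent is nudged at 31 and 36 minutes and escalated as `critical` at 41 minutes. The `alert` branch is only reached when more than an hour has elapsed before two nudges were recorded, for instance if the witness started patrolling late.
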
**Step 1 -- Send nudge to stalled agent:**

```typescript
async function handleStall(
  agent: AgentIdentity,
  hook: HookState,
  elapsed: number
): Promise<void> {
  const severity = classifyStall(elapsed, getNudgeCount(agent.id));

  if (severity === 'warning') {
    // Nudge: ask the agent whether it is still working
    const nudge: AgentMessage = {
      from: witnessId,
      to: agent.id,
      channel: 'nudge',
      payload: `HEALTH_CHECK: no activity for ${Math.floor(elapsed / 60000)}m on ${hook.workItem?.beadId}`,
      timestamp: new Date().toISOString(),
      durable: false,
    };
    // Write nudge file, then record that a nudge was sent
    recordNudge(agent.id);
    return;
  }

  // 'alert' or 'critical': escalate to mayor
  await escalateToMayor(agent, hook, severity, elapsed);
}
```

**Step 2 -- Wait for response (next patrol cycle):**

If the agent responds to the nudge (updates its hook activity timestamp or sends mail), the stall is resolved. No further action is needed.

**Step 3 -- Escalate if unresolved:**

```typescript
async function escalateToMayor(
  agent: AgentIdentity,
  hook: HookState,
  severity: StallSeverity,
  elapsed: number
): Promise<void> {
  const escalation: AgentMessage = {
    from: witnessId,
    to: 'mayor',
    channel: 'mail',
    payload: `STALL_${severity.toUpperCase()}: ${agent.id} idle ${Math.floor(elapsed / 60000)}m on ${hook.workItem?.beadId}`,
    timestamp: new Date().toISOString(),
    durable: true,
  };
  // Write escalation to .chipset/state/mail/mayor/{timestamp}-{witnessId}.json
}
```

### Health Reporting

The witness provides aggregate health summaries when queried by the mayor.

```typescript
interface RigHealthReport {
  rigName: string;
  timestamp: string;
  totalAgents: number;
  activeAgents: number;
  stalledAgents: number;
  idleAgents: number;
  terminatedAgents: number;
  stalledDetails: Array<{
    agentId: string;
    beadId: string;
    stalledMinutes: number;
    nudgesSent: number;
  }>;
}
```

## Communication Protocol

### Messages the Witness SENDS

| Channel | Target | Purpose | Durability |
|---------|--------|---------|------------|
| `nudge` | Stalled polecats | "Are you still working?" health check | Non-durable |
| `mail` | Mayor | Stall escalations (`alert`, `critical`) | Durable |
| `mail` | Mayor | Health report summaries | Durable |

### Messages the Witness RECEIVES

| Channel | Source | Content |
|---------|--------|---------|
| `mail` | Mayor | Instructions (adjust thresholds, focus on specific agent) |
| `mail` | Polecats | Status responses to nudges |

## Error Handling

### False Positive Stalls

If an agent is working but updates are slow (large commits, long test runs), the witness may detect a false positive. The nudge protocol handles this: the agent responds to the nudge, and the witness records the response as activity.

### Witness Restart

If the witness itself restarts, it resumes patrol from scratch. It reads current agent and hook state from the filesystem -- there is no witness-specific state that needs recovery. The patrol loop is stateless between cycles.

### Unresponsive Agent

If an agent does not respond to two nudges across two patrol cycles, the witness sends a `critical` escalation to the mayor. The mayor decides whether to terminate and replace the agent.

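
The false-positive and unresponsive-agent rules above both depend on per-agent nudge bookkeeping: a response clears the count, and two unanswered nudges lead to a `critical` escalation. A minimal sketch of that bookkeeping follows, assuming counts are held in memory and compared against the hook's last-activity time. The `NudgeTracker` class, `observeActivity`, and `isUnresponsive` are hypothetical names; `recordNudge` and `getNudgeCount` mirror the helpers called in `handleStall`, whose actual implementation is not shown in this document.

```typescript
// Hypothetical in-memory bookkeeping; not part of StateManager.
interface NudgeRecord {
  count: number;       // nudges sent since the last observed activity
  lastNudgeAt: number; // epoch milliseconds of the most recent nudge
}

class NudgeTracker {
  private records = new Map<string, NudgeRecord>();

  // Increment the nudge count for an agent and remember when it was sent.
  recordNudge(agentId: string, now: number): void {
    const prev = this.records.get(agentId);
    this.records.set(agentId, {
      count: (prev?.count ?? 0) + 1,
      lastNudgeAt: now,
    });
  }

  getNudgeCount(agentId: string): number {
    return this.records.get(agentId)?.count ?? 0;
  }

  // Call once per patrol cycle with the hook's last-activity time.
  // Activity newer than the last nudge counts as a response and clears the record.
  observeActivity(agentId: string, lastActivityAt: number): boolean {
    const rec = this.records.get(agentId);
    if (rec && lastActivityAt > rec.lastNudgeAt) {
      this.records.delete(agentId);
      return true;
    }
    return false;
  }

  // Two unanswered nudges mark the agent as unresponsive.
  isUnresponsive(agentId: string): boolean {
    return this.getNudgeCount(agentId) >= 2;
  }
}

// Usage: two nudges go unanswered, then the agent finally reports activity.
const tracker = new NudgeTracker();
tracker.recordNudge('polecat-1', 1000);
tracker.recordNudge('polecat-1', 2000);
const unresponsive = tracker.isUnresponsive('polecat-1');
const responded = tracker.observeActivity('polecat-1', 2500);
const countAfterResponse = tracker.getNudgeCount('polecat-1');
```

In this sketch a restarted witness begins with an empty map, so nudge counts start again from zero and nothing has to be recovered from disk.
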
## Boundary: What the Witness Does NOT Do

The witness NEVER:

- **Modifies agent work** -- does not edit files, change branches, or alter code
- **Resolves conflicts** -- conflict resolution is outside the observer's scope
- **Terminates agents** -- only the mayor can terminate; the witness recommends
- **Reassigns work** -- hook management belongs to the mayor
- **Changes agent status** -- the witness reads status but does not write it (except its own)
- **Runs tests or builds** -- the witness observes; it does not validate output quality

The witness is a sensor. It detects anomalies and reports them. It does not act on them.

## Integration with Other Gastown Skills

| Skill | Relationship |
|-------|--------------|
| `mayor-coordinator` | Witness reports stalls and health TO mayor |
| `polecat-worker` | Witness monitors polecat health, sends nudges |
| `refinery-merge` | Witness can observe refinery queue depth and merge failures |
| `beads-state` | Witness reads state via StateManager (read-only) |

## References

- `references/gastown-origin.md` -- How this pattern derives from Gastown's witness.go patrol
- `references/boundaries.md` -- Read-only constraints and observation-only scope