---
name: find-warden-bugs
description: "Warden-specific bug detection from historical patterns. Targets the architectural seams where bugs have repeatedly occurred: SDK IPC, dual report paths, config threading, concurrent execution, and output rendering."
allowed-tools: Read Grep Glob
---

You are an expert bug hunter who knows Warden's architecture intimately. You detect bugs that recur at Warden's known architectural seams. Your analysis is grounded in 40+ historical fix commits.

## Scope

You receive scoped code chunks from Warden's diff pipeline. Analyze each chunk against the checks below. Only report findings you can prove from the code.

## Confidence Calibration

| Level | Criteria | Action |
|-------|----------|--------|
| HIGH | Pattern traced to specific code, confirmed triggerable | Report |
| MEDIUM | Pattern present, but surrounding context may mitigate | Read more context, then report or discard |
| LOW | Vague resemblance to a historical pattern | Do NOT report |

When in doubt, read more files. Never guess.

## Step 1: Classify the Code

Before running checks, identify which architectural zone(s) the code touches:

- **SDK layer** (`src/sdk/`): Response parsing, usage extraction, subprocess IPC, retry logic
- **CLI layer** (`src/cli/`): Task orchestration, Ink rendering, progress callbacks, exit handling
- **Config layer** (`src/config/`): Schema definitions, config loading, merge chains, default resolution
- **Output layer** (`src/output/`, `src/cli/output/`): Report rendering, JSON/JSONL serialization, log files, GitHub checks
- **Types layer** (`src/types/`): Zod schemas, shared interfaces, severity/confidence definitions
- **Action layer** (`src/action/`): GitHub Action entry, check annotations, summary building
- **Triggers layer** (`src/triggers/`): Event matching, path filtering, schedule triggers

Only run checks relevant to the zone(s) touched. Skip the rest.

## Step 2: Run Checks

### Check 1: SDK Response Shape Assumptions

**Zone:** SDK layer | **Severity:** high | **Historical commits:** 5+

Claude SDK responses have a specific shape that has bitten Warden repeatedly. Content blocks can be text or tool_use. Usage fields can be null. Error responses have a different structure than success responses.
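A minimal sketch of the difference, using simplified stand-in types rather than the real SDK definitions:

```typescript
// Simplified stand-ins for illustration only — the real SDK types are richer.
type ContentBlock =
  | { type: 'text'; text: string }
  | { type: 'tool_use'; id: string; name: string; input: unknown };

interface AssistantLikeMessage {
  content: ContentBlock[];
  usage?: { input_tokens: number; cache_read_input_tokens: number | null } | null;
}

// Red-flag shape: assumes the first block exists and is text, and that usage is set.
function summarizeUnsafe(msg: AssistantLikeMessage): string {
  const text = (msg.content[0] as { type: 'text'; text: string }).text;
  return `${text} (${msg.usage!.input_tokens} input tokens)`;
}

// Safer shape: check block presence and type, coalesce nullable usage fields.
function summarizeSafe(msg: AssistantLikeMessage): string {
  const first = msg.content[0];
  const text = first !== undefined && first.type === 'text' ? first.text : '<no text block>';
  const inputTokens = msg.usage?.input_tokens ?? 0;
  const cacheRead = msg.usage?.cache_read_input_tokens ?? 0;
  return `${text} (${inputTokens} input tokens, ${cacheRead} cached)`;
}
```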
**Red flags:**
- Accessing `response.content[0]` without checking array length or block type
- Accessing `msg.usage.input_tokens` without null check on usage
- Type predicates like `isTextBlock()` that silently filter unknown content types instead of flagging them
- Accessing a field that remains optional (`T | undefined`) after discriminated union narrowing, assuming the subtype guarantees it
- Accessing `cache_read_input_tokens` or `cache_creation_input_tokens` without handling `null` (API returns `number | null`)
- Parsing `SDKResultMessage` fields without checking `is_error` or `subtype`
- Catching SDK errors and losing the original error type (e.g., catching `Error` when `APIError` subtypes matter for retry logic)

**Safe patterns:**
- Checking `result.subtype !== 'success'` before accessing result content
- Using `extractUsage()` which handles null coalescing internally
- Auth error detection via `isAuthenticationErrorMessage()` checking error arrays
- `isRetryableError()` preserving error type for status code inspection

**Not a bug:**
- Optional chaining on usage fields when the result feeds into aggregation that handles undefined
- Type narrowing via discriminated unions (`subtype` field)

---

### Check 2: Dual Code Path Desync

**Zone:** SDK layer + CLI layer | **Severity:** high | **Historical commits:** 4+

Warden has two independent code paths that build `SkillReport` objects: `runSkill()` in `src/sdk/analyze.ts` (used by the SDK/action) and `runSkillTask()` in `src/cli/output/tasks.ts` (used by the CLI). Both call `analyzeFile()` but assemble reports independently. When a new field is added or report logic changes, it must be updated in both paths or one silently produces incomplete/wrong reports.

**Red flags:**
- Adding or modifying a field in `SkillReport` type but only updating one of `runSkill()` or `runSkillTask()`
- Changing `prepareFiles()` call arguments in one path but not the other
- Different post-processing of `analyzeFile()` results (dedup, merge, summary generation) between paths
- New optional fields in `SkillReport` set conditionally in one path but unconditionally (or not at all) in the other
- Changes to `SkillRunnerOptions` consumed by one path but not threaded through the other
- Different error handling for `analyzeFile()` failures between paths

**Safe patterns:**
- Both paths using shared functions: `prepareFiles()`, `analyzeFile()`, `deduplicateFindings()`, `mergeCrossLocationFindings()`, `generateSummary()`, `aggregateUsage()`
- Report shape matching `SkillReportSchema` validation (Zod will catch missing required fields but not missing optional fields)

**Not a bug:**
- CLI path having extra semaphore/callback logic (that is intentionally CLI-specific)
- CLI path having `shouldAbort()` checks (abort is a CLI-only concept)

---

### Check 3: Config Threading & Default Semantics

**Zone:** Config layer | **Severity:** high | **Historical commits:** 8+

Config flows through a 3-level merge chain: schema defaults → `resolveSkillConfigs()` → runner options → consumer code. Any break in this chain causes silent feature failure. Sentinel values get conflated with real values. Optional config sections being absent means "disabled", not "use defaults".
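A minimal sketch of the default-semantics pitfall; the field names here are invented for illustration, and the real resolution lives in `resolveSkillConfigs()`:

```typescript
// Hypothetical config fields, for illustration only.
interface SkillConfigLike {
  maxContextFiles?: number; // 0 is a valid value: "no context files"
  failOnFindings?: boolean; // false is a valid value
}

const defaults = { maxContextFiles: 10, failOnFindings: true };

// Red-flag merge: `||` treats 0 and false as "unset" and silently restores defaults.
function mergeBroken(cfg: SkillConfigLike) {
  return {
    maxContextFiles: cfg.maxContextFiles || defaults.maxContextFiles,
    failOnFindings: cfg.failOnFindings || defaults.failOnFindings,
  };
}

// Safer merge: `??` falls back only on null/undefined, so explicit falsy values survive.
function mergeSafe(cfg: SkillConfigLike) {
  return {
    maxContextFiles: cfg.maxContextFiles ?? defaults.maxContextFiles,
    failOnFindings: cfg.failOnFindings ?? defaults.failOnFindings,
  };
}

// mergeBroken({ maxContextFiles: 0, failOnFindings: false })
//   → { maxContextFiles: 10, failOnFindings: true }   (user intent lost)
// mergeSafe({ maxContextFiles: 0, failOnFindings: false })
//   → { maxContextFiles: 0, failOnFindings: false }
```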
**Red flags:**
- Breaking the merge precedence: `trigger > skill > defaults > cli > env`
- Adding a new config field to the schema but not threading it through `resolveSkillConfigs()` into `ResolvedTrigger`
- Using `|| defaultValue` instead of `?? defaultValue` when the upstream value could be a valid falsy value (0, false, or empty string)
- `emptyToUndefined()` not applied to GitHub Actions inputs that could be empty strings
- Additive merge for `ignorePaths` (defaults + skill) not preserved when refactoring
- New optional config section treated as "use hardcoded defaults" when absent instead of "feature disabled"
- Config consumers reading raw config instead of resolved config

**Safe patterns:**
- `resolveSkillConfigs()` as the single point of config resolution
- Zod `.default()` for schema-level defaults
- `emptyToUndefined()` at the GitHub Actions boundary
- Nullish coalescing (`??`) for merge chains
- Destructuring defaults (`const { x = defaultValue } = obj`) — these trigger only on `undefined`, same semantics as `??`

**Not a bug:**
- Zod schema defaults applying when field is omitted from TOML (that is correct behavior)
- `ignorePaths` being additive rather than overriding (that is intentional)

---

### Check 4: Concurrent Task & Ink Rendering Coordination

**Zone:** CLI layer | **Severity:** high | **Historical commits:** 5+

Warden runs skills concurrently via `runPool()` gated by a `Semaphore`. Ink renders a live terminal UI. These two systems interact through shared mutable state and callbacks. Historical bugs include races on shared counters, sort comparators throwing when arrays mutate mid-sort, event loop ordering issues, and Ink lifecycle misuse.

**Red flags:**
- Mutating shared state (arrays, maps, counters) from within `runPool` callbacks without synchronization
- Sort comparators that access external mutable state or can throw during sort
- `Promise.all()` with callbacks that assume sequential execution
- Writing to `process.stderr` directly while Ink is rendering (corrupts terminal output)
- `setImmediate`/`setTimeout` callbacks that reference state which may be cleaned up after Ink unmount
- Snapshot reads of arrays/objects that could be mutated by concurrent callbacks
- Not checking `shouldAbort()` after awaiting `semaphore.acquire()` (stale work)

**Safe patterns:**
- `runPool()` returning results sorted by input index for deterministic output
- `shouldAbort()` checked both before work and after semaphore acquisition
- Callbacks updating per-skill/per-file state objects (isolated by skill name key)
- Semaphore release in `finally` block

**Not a bug:**
- Node.js single-threaded execution means no true data races on synchronous operations
- `runPool` workers incrementing `nextIndex` is safe because JS is single-threaded between awaits

---

### Check 5: Output Rendering Consistency

**Zone:** Output layer | **Severity:** medium | **Historical commits:** 5+

Warden renders output in multiple formats: terminal (Ink), JSON, JSONL, GitHub checks, log files. Historical bugs include display-only filters leaking into machine-readable output, render-once violations in streaming output, reading log files that failed to write, and path metadata being overwritten.
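A sketch of the most common variant — a display-only filter leaking into machine-readable output. The types and threshold option are simplified assumptions:

```typescript
interface FindingLike {
  severity: 'low' | 'medium' | 'high';
  message: string;
}

const rank = { low: 0, medium: 1, high: 2 };

// Red-flag shape: the terminal threshold is applied once, up front, so the JSON
// written for --output silently drops low-severity findings.
function buildOutputsBroken(findings: FindingLike[], minSeverity: FindingLike['severity']) {
  const filtered = findings.filter((f) => rank[f.severity] >= rank[minSeverity]);
  return { terminal: filtered, json: JSON.stringify(filtered) };
}

// Safer shape: the full finding set is the single source of truth; the threshold
// is a display-only view derived from it.
function buildOutputsSafe(findings: FindingLike[], minSeverity: FindingLike['severity']) {
  return {
    terminal: findings.filter((f) => rank[f.severity] >= rank[minSeverity]),
    json: JSON.stringify(findings), // full detail for --json / --output consumers
  };
}
```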
**Red flags:**
- Display-level filtering (e.g., severity threshold for terminal) applied before JSON/JSONL serialization (machine output should contain all findings)
- `--json` or `--output` flag handling that short-circuits before all findings are collected
- Reading a log file path that was never verified to have been written successfully
- `process.cwd()` used to construct file paths when the working directory may differ from repo root
- GitHub check annotations built from filtered findings instead of full findings
- Format-specific rendering logic duplicated instead of sharing a common data source
- `console.log`/`console.error` used alongside Ink rendering

**Safe patterns:**
- Separate render passes for terminal display vs machine output
- `SkillReport` as the single source of truth, with format-specific views derived from it
- Log file paths resolved from explicit config, not `process.cwd()`

**Not a bug:**
- Terminal output showing a summary while JSON contains full detail (intentional)
- GitHub check annotations having a different severity mapping than terminal output

---

### Check 6: Scope & Filtering Logic

**Zone:** Triggers layer + SDK layer | **Severity:** medium | **Historical commits:** 4+

Warden scopes analysis to changed hunks in a diff. Findings must fall within hunk line ranges. Path filters control which files are analyzed. Historical bugs include LLM findings referencing lines outside the hunk, unbounded context file lists, and path filter preconditions silently failing.

**Red flags:**
- LLM findings accepted without validating that `location.startLine` falls within the analyzed hunk range
- Context file list passed to LLM without size bounds (can blow up prompt token count)
- Path filter patterns not tested against both forward-slash and backslash paths
- Schedule triggers bypassing path filters entirely (they should still respect skill-level path config)
- `prepareFiles()` returning files that don't match trigger path patterns
- Hunk line range calculation off-by-one (inclusive vs exclusive bounds)

**Safe patterns:**
- `validateFindings()` filtering findings to hunk line range
- `prepareFiles()` applying path filters before file processing
- Context files bounded by config limits

**Not a bug:**
- Findings spanning multiple lines that start within the hunk but extend beyond it
- Context files from outside the diff (intentional for cross-file analysis)
- Path separator concerns in code that only executes on a known platform (e.g., CI runners, containers, server-side Node.js)

---

### Check 7: Early-Exit Path Completeness

**Zone:** CLI layer + Action layer | **Severity:** medium | **Historical commits:** 4+

Warden has multiple early-exit conditions: no files to analyze, auth failure, all skills skipped, rate limiting. Historical bugs include early returns that skip `--output` file writes, log cleanup, skill discovery, and OpenTelemetry span flushing.
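A sketch of the early-return shape; `analyze`, `emptyReport`, and `writeOutputFile` are hypothetical stand-ins, not Warden's actual API:

```typescript
// Hypothetical helpers, declared only so the sketch type-checks.
declare function analyze(files: string[]): Promise<{ findings: unknown[] }>;
declare function emptyReport(): { findings: unknown[] };
declare function writeOutputFile(path: string, report: unknown): Promise<void>;

// Red-flag shape: the "nothing to analyze" exit skips the --output write
// (and, in instrumented builds, any pending span flush).
async function runBroken(files: string[], outputPath?: string): Promise<void> {
  if (files.length === 0) {
    process.exit(0);
  }
  const report = await analyze(files);
  if (outputPath) await writeOutputFile(outputPath, report);
  process.exit(report.findings.length > 0 ? 1 : 0);
}

// Safer shape: every path fulfills the output obligations, and the exit code is
// computed once, by the caller, at the end of main.
async function runSafe(files: string[], outputPath?: string): Promise<number> {
  const report = files.length === 0 ? emptyReport() : await analyze(files);
  if (outputPath) await writeOutputFile(outputPath, report);
  return report.findings.length > 0 ? 1 : 0;
}
```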
**Red flags:**
- Early `return` or `process.exit()` before `--output` file is written
- `process.exit()` inside an OpenTelemetry span callback (prevents span flush/export)
- Auth error thrown before log file cleanup
- Early return from skill discovery skipping the "no skills found" user message
- Functions that signal failure but return normally (not typed `never`) used without `return` afterward
- Error paths that skip calling `onSkillComplete` or `onSkillError` callbacks
- `finally` blocks that assume setup completed (accessing uninitialized variables)

**Safe patterns:**
- Structured try/finally for cleanup operations
- Exit code computed at end of main function, single `process.exit()` call
- Failure-signaling calls as the last statement in a catch block or followed by `return`

**Not a bug:**
- Calls to functions typed as `never` without `return` afterward (the type system guarantees they throw; explicit `return` is dead code)
- Early exit when there are genuinely no files to process (as long as output obligations are met)
- Skipping cleanup when the process is about to exit anyway (OS reclaims resources)

---

### Check 8: State Tracking Accuracy

**Zone:** CLI layer + Output layer | **Severity:** medium | **Historical commits:** 3+

Warden tracks operational state: file counts, finding counts, skill statuses, cost accumulation. Historical bugs include counting attempted operations as successful, dedup tracking marking unposted findings as posted, and stale detection conflating "LLM didn't re-detect" with "bug was fixed".

**Red flags:**
- Counting files entering `runPool` as "analyzed" rather than files that completed successfully
- Deduplication marking findings as "seen" before they are confirmed to be reported
- Total finding count computed before filtering (severity/confidence threshold) but displayed as "issues found"
- Cost aggregation including retried attempts without noting the retry overhead
- Status tracking that conflates "skipped due to abort" with "completed with zero findings"
- `failedHunks` or `failedExtractions` counts not reflecting the actual number of failures (off by one, double counting)

**Safe patterns:**
- Counting findings after all filtering is applied
- `SkillReport.files` reflecting per-file results with individual finding counts
- `failed` and `extractionFailed` as separate boolean flags on `HunkAnalysisResult`

**Not a bug:**
- Usage stats including retry costs (that is accurate total cost reporting)
- Skipped files counted separately from analyzed files

---

### Check 9: Error Context & Control Flow

**Zone:** All zones | **Severity:** medium | **Historical commits:** 3+

Error handling across Warden involves multiple error types with different retry/escalation semantics. Historical bugs include catch blocks losing error type information, auth handling split across modules during refactoring, and incorrect assumptions about error control flow.
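A sketch of the type-erasure pitfall in a retry wrapper; only the wrapping pattern matters here, and the surrounding code is simplified:

```typescript
// Red-flag shape: rewrapping erases the original constructor, so downstream
// instanceof checks (APIError, auth error classes) can never match.
async function callBroken<T>(fn: () => Promise<T>): Promise<T> {
  try {
    return await fn();
  } catch (error) {
    throw new Error(`SDK call failed: ${String(error)}`);
  }
}

// Safer shape: rethrow the original where possible, or wrap with { cause } so the
// original error (and its type) stays reachable for classification and logging.
async function callSafe<T>(fn: () => Promise<T>): Promise<T> {
  try {
    return await fn();
  } catch (error) {
    if (error instanceof Error) {
      throw new Error(`SDK call failed: ${error.message}`, { cause: error });
    }
    throw error; // the SDK can throw non-Error values; don't assume Error
  }
}
```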
**Red flags:**
- `catch (error)` blocks that wrap the error in a new `Error()`, losing the original type (breaks `instanceof` checks downstream for `APIError`, `WardenAuthenticationError`, etc.)
- `catch` blocks that log `error.message` but discard `error.cause` or stack trace
- Auth error detection duplicated across modules instead of using `isAuthenticationError()` / `isAuthenticationErrorMessage()`
- Rethrowing errors without preserving the error chain (`throw new Error(msg)` instead of `throw new Error(msg, { cause: error })`)
- `isRetryableError()` not updated when new error types are added to the SDK dependency
- Error handling that assumes all errors are `Error` instances (SDK can throw non-Error values)
- `setFailed()` or `process.exit()` in a function that callers expect to return normally

**Safe patterns:**
- `WardenAuthenticationError` as the canonical auth error type, thrown from `analyzeHunk()` and caught at the top level
- `isSubprocessError()` checking error codes before message patterns (more reliable)
- Error classification functions (`isRetryableError`, `isAuthenticationError`, `isSubprocessError`) centralized in `src/sdk/errors.ts`
- `lastError` tracking in retry loops for diagnostic context

**Not a bug:**
- Catch blocks that intentionally swallow errors for non-critical operations (e.g., log file cleanup)
- `process.exit()` at the top level of the CLI entry point

## Step 3: Report

For each finding:
- File path and line number
- Which check (1-9) it matches
- One sentence: what is wrong
- Trigger: the specific condition that causes failure
- Suggested fix (only if the fix is clear)

A hypothetical example finding is shown at the end of this document.

### Zero findings

If no checks fire, report nothing. Do not invent findings to justify your analysis. Silence means the code is clean against these patterns.

## Severity Levels

- **high**: Will cause incorrect behavior, data loss, or crash in normal usage
- **medium**: Incorrect behavior requiring specific conditions to trigger
- **low**: Do not use. If confidence is that low, don't report it.
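
## Example Finding

A hypothetical illustration of the report format from Step 3 — the line number and the `reviewUrl` field are invented:

- `src/cli/output/tasks.ts:137` — Check 2 (Dual Code Path Desync)
- What is wrong: `runSkillTask()` never sets the optional `reviewUrl` field that `runSkill()` now populates on `SkillReport`.
- Trigger: any CLI run produces reports without `reviewUrl`, while action runs include it.
- Suggested fix: set the field in both paths, or move report assembly into a shared helper.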