---
name: scaffold-review
description: Analyze conversation history, find gaps and drift in AGENTS/CLAUDE instructions and skills, propose and apply targeted improvements.
user-invocable: true
---

# Scaffold Review

Analyze recent agent conversation history to find what's broken, stale, or missing in your scaffolding (AGENTS/CLAUDE instructions, skills, project configs). Propose changes, apply them, and record what you did.

The goal is **convergence**: each run brings the scaffold closer to how the user actually works.

---

## Step 1: Load State

Resolve the active agent home:

```bash
if [ -n "${AGENT_HOME:-}" ]; then
  :
elif [ -n "${CODEX_HOME:-}" ] || [ -n "${CODEX_THREAD_ID:-}" ] || [ -n "${CODEX_CI:-}" ]; then
  AGENT_HOME="${CODEX_HOME:-$HOME/.codex}"
elif [ -n "${CLAUDE_HOME:-}" ] || [ -n "${CLAUDECODE:-}" ] || [ -n "${CLAUDE_CODE:-}" ]; then
  AGENT_HOME="${CLAUDE_HOME:-$HOME/.claude}"
elif [ -d "$HOME/.codex/sessions" ] && [ ! -d "$HOME/.claude/projects" ]; then
  AGENT_HOME="$HOME/.codex"
elif [ -d "$HOME/.claude/projects" ] && [ ! -d "$HOME/.codex/sessions" ]; then
  AGENT_HOME="$HOME/.claude"
else
  echo "Unable to infer AGENT_HOME. Set AGENT_HOME explicitly." >&2
  exit 1
fi
```

Read the review ledger (memory of prior runs):

!`cat "$AGENT_HOME/scaffold-review-ledger.json" 2>/dev/null || echo '{"runs": [], "deferred": [], "trends": []}'`

Find conversations since last run (or last 14 days if first run), with sizes for budgeting:

!`if [ -d "$AGENT_HOME/projects" ]; then find "$AGENT_HOME/projects" -name '*.jsonl' -not -path '*/subagents/*' -mtime -14 -size +10k -exec ls -lh {} \; ; elif [ -d "$AGENT_HOME/sessions" ]; then find "$AGENT_HOME/sessions" -name '*.jsonl' -mtime -14 -size +10k -exec ls -lh {} \; ; fi | awk '{print $5, $9}' | sort -k2`

**Budget check:** If total JSONL exceeds 5MB, split the corpus across agents rather than having each read everything.

---

## Step 2: Extract Signals

Use **3 focused analyzers** in parallel. For Codex, these are parallel shell/Python extraction passes over JSONL, not separate agent sessions.

### Agent 1: Corrections & Friction

Scan user messages for:

- Explicit corrections ("no, I meant...", "that's not what I asked", "actually...")
- Behavioral directives ("don't do X", "always do Y")
- Frustration markers (short messages after long assistant responses, re-prompting the same thing)
- User doing something manually after the assistant offered to do it (trust failure)

For each correction, answer: **Is there scaffold guidance for this? Was it followed? Was it wrong?**

Output: list of corrections with root cause (missing guidance / stale guidance / buried guidance / wrong guidance).

### Agent 2: Usage Patterns & Drift

From assistant tool-call records, extract:

- **File access heatmap**: top 20 files by Read/Edit frequency (see the sketch after this section). Compare against what AGENTS/CLAUDE instructions reference.
- **Command frequency**: top commands by prefix (git, python, cargo, etc.)
- **Skill invocation rates**: which skills are used, which are never used
- **New tools/patterns**: anything in recent conversations but not older ones
- **Dead references**: paths in AGENTS/CLAUDE instructions that no longer appear in conversations

Output: frequency tables + list of stale/missing references.
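The extraction snippets under Analyzer Rules below target Codex-style logs; for the Read/Edit heatmap on Claude-style logs, here is a minimal sketch, assuming records carry a top-level `message.content` list whose `tool_use` blocks hold a `file_path` input. Verify these key names against a real log line before trusting the counts.

```python
import glob
import json
import os
from collections import Counter

# Hedged sketch: top-20 file-access heatmap from Claude-style JSONL logs.
# The message.content / tool_use / input.file_path shape is an assumption.
home = os.path.expanduser(os.environ.get("AGENT_HOME", "~/.claude"))
heat = Counter()
for path in glob.glob(f"{home}/projects/**/*.jsonl", recursive=True):
    for line in open(path, errors="ignore"):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated/partial lines
        msg = obj.get("message")
        content = msg.get("content") if isinstance(msg, dict) else None
        if not isinstance(content, list):
            continue
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use" \
                    and block.get("name") in ("Read", "Edit"):
                target = block.get("input", {}).get("file_path")
                if target:
                    heat[target] += 1
for target, count in heat.most_common(20):
    print(count, target)
```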
### Agent 3: Workflow & Structure

Look at multi-step patterns:

- Repeated sequences across conversations (e.g., kill server -> launch -> health check -> benchmark -> kill)
- Session preambles: first 3 user messages from each conversation. If the user explains the same thing across 2+ sessions, that's a scaffold gap.
- Things that disappeared: commands/files/patterns that used to appear but don't anymore

Classify patterns by stability:

- **Crystallized** (5+ conversations): codify into a skill or AGENTS/CLAUDE instructions
- **Stable** (3-4): add as guidance, keep watching
- **Emerging** (2): note as a trend, don't codify yet

Output: pattern list with stability ratings + gap analysis.

### Analyzer Rules (all analyzers)

1. **Never read a full JSONL file.** Use `head -c 50000` or targeted grep extraction:

```bash
# Codex user messages
python3 - <<'PY'
import json
for line in open("file.jsonl", errors="ignore"):
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip truncated/partial lines
    if obj.get("type") != "response_item":
        continue
    payload = obj.get("payload", {})
    if payload.get("type") != "message" or payload.get("role") != "user":
        continue
    parts = [block.get("text", "") for block in payload.get("content", [])
             if block.get("type") == "input_text"]
    text = " ".join(parts)
    if text:
        print(text[:200])
PY

# Codex tool usage counts
grep '"type":"function_call"' file.jsonl | grep -o '"name":"[^"]*"' | sort | uniq -c | sort -rn | head -20

# Command prefixes from exec_command calls
python3 - <<'PY'
import json
from collections import Counter
counts = Counter()
for line in open("file.jsonl", errors="ignore"):
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip truncated/partial lines
    if obj.get("type") != "response_item":
        continue
    payload = obj.get("payload", {})
    if payload.get("type") != "function_call" or payload.get("name") != "exec_command":
        continue
    args = json.loads(payload.get("arguments", "{}"))
    cmd = args.get("cmd", "").strip().splitlines()
    if cmd:
        counts[cmd[0].split()[0]] += 1
for name, count in counts.most_common(20):
    print(count, name)
PY
```

2. **Max 15-20 conversations per analyzer.** Sample by recency if there are more.
3. **Return structured findings in <300 lines.** Conclusions, not data dumps.

### Codex-Specific Notes

- Codex session logs usually store user, assistant, and tool activity under `response_item.payload`.
- Commentary and final answers are both assistant messages; use `payload.phase` when you need to separate progress updates from final responses.
- Tool calls appear as `payload.type == "function_call"` with JSON-encoded `arguments`.
- `write_stdin` polling loops are common in remote or long-running jobs; treat them as one workflow, not separate tasks.

---

## Step 3: Synthesize & Compare

After all agents report, read the current scaffold:

- `"$AGENT_HOME/CLAUDE.md"` (with `AGENTS.md` symlink for Codex)
- All `"$AGENT_HOME/skills/*/SKILL.md"`
- Project-specific `CLAUDE.md` (or `AGENTS.md` symlink) files (find via conversation paths; see the sketch below)
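A minimal sketch for building that inventory. The per-record `cwd` field used to recover project roots is an assumption -- confirm it actually appears in your logs before relying on this:

```python
import glob
import os
import re
from collections import Counter

# Hedged sketch: list global scaffold files, then probe the most common
# conversation working directories for project-level instruction files.
# Assumes each session's JSONL embeds a "cwd" field somewhere near the top.
home = os.path.expanduser(os.environ.get("AGENT_HOME", "~/.claude"))
files = [p for p in (f"{home}/CLAUDE.md", f"{home}/AGENTS.md") if os.path.isfile(p)]
files += glob.glob(f"{home}/skills/*/SKILL.md")

cwds = Counter()
for path in glob.glob(f"{home}/projects/**/*.jsonl", recursive=True) \
        + glob.glob(f"{home}/sessions/**/*.jsonl", recursive=True):
    for line in open(path, errors="ignore"):
        m = re.search(r'"cwd"\s*:\s*"([^"]+)"', line)
        if m:
            cwds[m.group(1)] += 1
            break  # one cwd per session is enough

for cwd, _ in cwds.most_common(10):
    for name in ("CLAUDE.md", "AGENTS.md"):
        candidate = os.path.join(cwd, name)
        if os.path.isfile(candidate):
            files.append(candidate)

print("\n".join(dict.fromkeys(files)))  # de-duplicated, order preserved
```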
Cross-reference agent findings against the scaffold. Classify each finding:

| Status       | Meaning                              | Action           |
| ------------ | ------------------------------------ | ---------------- |
| **Conflict** | Scaffold says X, user corrects to Y  | Fix immediately  |
| **Stale**    | Scaffold references dead path/tool   | Update or remove |
| **Gap**      | Repeated pattern, scaffold is silent | Add content      |
| **Buried**   | Info exists but in wrong place       | Reorganize       |
| **Dead**     | Skill/section never used             | Remove           |

Also compare against ledger trends:

- **Confirmed** (seen 3+ runs): should have prominent scaffold placement
- **Emerging** (seen 2 runs): note, don't act yet
- **Reversed** (was trending, stopped): investigate why -- did the scaffold fix work, or did the user give up?

---

## Step 4: Propose Changes

Organize proposals by type:

### Tier 1: Corrections

Things the user explicitly corrected. Highest confidence -- apply unless vetoed.

### Tier 2: Structural

Reorganizations: sections that should be split into skills, skills that overlap and should merge, info in the wrong file.

### Tier 3: New Content

Workflows, paths, and patterns that belong in the scaffold but aren't there yet. Apply the necessity test: would this have prevented a specific observed failure? If the assistant would get it right without the guidance, don't add it.

### Tier 4: Deletions

Stale content, unused skills, dead references. Show evidence of staleness.

### Tier 5: New Skills

Only if a crystallized workflow (5+ conversations) would clearly benefit from being a dedicated skill. Don't create skills speculatively.

For each proposal, include:

- **Evidence**: which conversations, what frequency
- **Current state**: what the scaffold says now (quote it)
- **Proposed change**: the exact edit
- **Confidence**: high / medium / low

Present all proposals to the user before applying.

---

## Step 5: Apply & Record

For approved changes:

1. Apply all edits
2. Re-read modified files to check for internal consistency
3. Update the ledger with a run entry:

```json
{
  "timestamp": "",
  "conversations_analyzed": "",
  "proposals": [
    {
      "description": "...",
      "tier": "<1-5>",
      "status": "applied|deferred|rejected",
      "confidence": "high|medium|low"
    }
  ],
  "trends_updated": ["..."]
}
```

Write the full ledger back (prior runs plus this entry, not just the new entry):

```bash
cat > "$AGENT_HOME/scaffold-review-ledger.json" << 'EOF'
<the complete updated ledger JSON: {"runs": [...], "deferred": [...], "trends": [...]}>
EOF
```

For deferred proposals, record the reason so a future run can reassess.

---

## Conversation JSONL Format

Records differ by agent implementation. For Codex, the common shape is:

**User / assistant messages:**

```json
{
  "type": "response_item",
  "payload": {
    "type": "message",
    "role": "user|assistant|developer",
    "content": [
      { "type": "input_text|output_text", "text": "..." }
    ],
    "phase": "commentary|final"
  }
}
```

**Tool calls:**

```json
{
  "type": "response_item",
  "payload": {
    "type": "function_call",
    "name": "exec_command",
    "arguments": "{\"cmd\":\"...\"}"
  }
}
```

Claude-style records may still appear in older logs or other agent homes. Prefer the Codex schema when `~/.codex/sessions` is the source.
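Before running the heavier analyzers on an unfamiliar file, it helps to confirm which schema you are looking at. A minimal sketch that tallies record shapes in one session file, assuming only the Codex shapes documented above:

```python
import json
import sys
from collections import Counter

# Minimal sketch: count payload shapes in one session JSONL file.
# Usage: python3 tally.py path/to/session.jsonl
kinds = Counter()
for line in open(sys.argv[1], errors="ignore"):
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip truncated/partial lines
    if obj.get("type") != "response_item":
        kinds["other"] += 1
        continue
    payload = obj.get("payload", {})
    if payload.get("type") == "message":
        kinds["message/" + str(payload.get("role"))] += 1
    elif payload.get("type") == "function_call":
        kinds["call/" + str(payload.get("name"))] += 1
for kind, count in kinds.most_common():
    print(count, kind)
```

If `message/user` and `call/exec_command` dominate, the Codex extraction snippets above apply as written; a file that is mostly `other` records likely needs Claude-style handling instead.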