---
name: bare-eval
compatibility: "Claude Code 2.1.148+"
description: "Run isolated eval and grading calls using CC 2.1.81 --bare mode. Constructs claude -p --bare invocations for skill evaluation, trigger testing, and LLM grading without plugin/hook interference. Use when running eval pipelines, grading skill outputs, benchmarking prompt quality, or testing trigger accuracy in isolation."
tags: [eval, bare, grading, pipeline, testing, ci]
version: 1.0.0
author: OrchestKit
user-invocable: false
complexity: medium
context: inherit
persuasion-type: discipline
effort: low
---

# Bare Eval: Isolated Evaluation Calls

Run `claude -p --bare` for fast, clean eval/grading without plugin overhead.

**CC 2.1.81 required.** The `--bare` flag skips hooks, LSP, plugin sync, and skill directory walks.

## When to Use

- Grading skill outputs against assertions
- Trigger classification (which skill matches a prompt)
- Description optimization iterations
- Any scripted `-p` call that doesn't need plugins

## When NOT to Use

- Testing skill routing (needs `--plugin-dir`)
- Testing agent orchestration (needs full plugin context)
- Interactive sessions

## Prerequisites

```bash
# --bare requires ANTHROPIC_API_KEY (OAuth/keychain disabled)
export ANTHROPIC_API_KEY="sk-ant-..."

# Verify CC version
claude --version  # Must be >= 2.1.81
```

## Quick Reference

| Call Type | Command Pattern |
|-----------|----------------|
| Grading | `claude -p "$prompt" --bare --max-turns 1 --output-format text` |
| Trigger | `claude -p "$prompt" --bare --json-schema "$schema" --output-format json` |
| Streaming grade | `claude -p "$prompt" --bare --max-turns 1 --output-format stream-json` |
| Optimize | `echo "$prompt" \| claude -p --bare --max-turns 1 --output-format text` |
| Force-skill | `claude -p "$prompt" --bare --print --append-system-prompt "$content"` |
| @-file in prompt | `claude -p "grade @fixtures/case-1.md against rubric" --bare` (CC 2.1.113 Remote Control autocomplete) |

### `--output-format stream-json`

Newline-delimited JSON events (one per token/tool-call). This lets a runner score partial output or abort early on a failing probe without waiting for the full response.

```bash
claude -p "$prompt" --bare --max-turns 1 --output-format stream-json \
  | while IFS= read -r line; do
      # line is a single JSON event; inspect $.type == "content_block_delta"
      jq -r 'select(.type == "content_block_delta") | .delta.text' <<< "$line"
    done
```

Use `stream-json` over `json` when:

- grading long outputs and you want incremental scoring,
- piping into another CLI step-by-step (e.g. `ork:eval-runner`),
- you need per-token timing data alongside the content.

## Invocation Patterns

Load detailed patterns and examples:

```
Read("${CLAUDE_SKILL_DIR}/references/invocation-patterns.md")
```

## Grading Schemas

JSON schemas for structured eval output:

```
Read("${CLAUDE_SKILL_DIR}/references/grading-schemas.md")
```

## Pipeline Integration

OrchestKit's eval scripts (`npm run eval:skill`) auto-detect bare mode:

```bash
# eval-common.sh detects ANTHROPIC_API_KEY → sets BARE_MODE=true
# Scripts add --bare to all non-plugin calls automatically
```

**Bare calls:** Trigger classification, force-skill, baseline, all grading.
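As a rough illustration of the auto-detection described above, the logic might look like the sketch below. The function names `detect_bare_mode` and `bare_flag_for`, and the `plugin` / `non-plugin` labels, are hypothetical; the real `eval-common.sh` may be structured differently.

```bash
# Hypothetical sketch of bare-mode detection; the real eval-common.sh may differ.

# Set BARE_MODE from the environment. --bare needs an API key because
# OAuth/keychain auth is disabled in that mode.
detect_bare_mode() {
  if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    BARE_MODE=true
  else
    BARE_MODE=false
  fi
}

# Print "--bare" for calls that do not need plugin context.
# $1 is a hypothetical label: "plugin" or "non-plugin".
bare_flag_for() {
  if [ "${BARE_MODE:-false}" = true ] && [ "$1" != "plugin" ]; then
    printf '%s\n' "--bare"
  fi
}
```

A caller could then write `claude -p "$prompt" $(bare_flag_for non-plugin) --max-turns 1`, leaving the substitution unquoted so that an empty result adds no argument.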

**Never bare:** `run_with_skill` (needs plugin context for routing tests).

### CC 2.1.119: `--print` honors agent `tools:` / `disallowedTools:` (M122)

Before CC 2.1.119, `--print` mode ran with the full default tool set regardless of the agent's frontmatter `tools:` and `disallowedTools:`. Bare-eval grading was effectively ungated: graders could call any tool they wanted, even if the agent definition restricted them. **As of 2.1.119, `--print` enforces the agent's declared tool surface.**

Implications for eval design:

| Consequence | Action |
|---|---|
| Eval graders that relied on unrestricted tool access may now fail | Audit grader prompts for tools they actually need; whitelist explicitly via the agent's `tools:` frontmatter |
| Eval results match interactive runs | Reproducibility improves: grading reflects what the model can actually do, not what it could do in an unsandboxed `--print` |
| `--agent` also honors `permissionMode` in `--print` | Permission-gated tools (Bash, Edit) require either `permissionMode: acceptEdits` or explicit allowlists in the agent definition |

Migration test:

```bash
# Run an eval against an agent with a deliberately tight tools: list.
# Graders that previously called Read/Bash freely will now fail unless those
# tools are declared on the agent.
claude -p "$prompt" --bare --print --agent grader-test
```

If the grader fails with a "tool not permitted" error, add the required tool to the agent's `tools:` frontmatter and re-run.

### CC 2.1.121: `CLAUDE_CODE_FORK_SUBAGENT=1` for grader determinism (#1545)

Before CC 2.1.121, the env var only worked in interactive sessions. As of 2.1.121, **non-interactive paths (`claude -p`, SDK) honor it too**, so each grader invocation gets a fresh forked subagent context.
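
For a one-off grader call outside the shared eval scripts, the variable can be set for a single invocation. The wrapper below is a minimal sketch that assumes `claude` is on `PATH` and `ANTHROPIC_API_KEY` is already exported; `run_forked_grader` is a hypothetical helper name, not part of OrchestKit.

```bash
# Hypothetical helper: run one bare grading call with forking enabled.
# The env var is scoped to this single command, so it does not leak into
# the rest of the calling script.
run_forked_grader() {
  CLAUDE_CODE_FORK_SUBAGENT=1 claude -p "$1" --bare --max-turns 1 --output-format text
}

# Example (not executed here):
#   run_forked_grader "$prompt"
```
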

**The cross-eval state-leak problem this fixes:** Without forking, sequential `claude -p --bare` graders inherit harness state:

| Inherited | Symptom |
|---|---|
| memory MCP query cache | grader sees stale hit from previous run; same fixture grades differently |
| `.claude/chain/*.json` on disk | grader for "implement" thinks "explore" already ran (file is from previous test) |
| ToolSearch deferred-tool cache | first grader's MCP loads bleed into next grader's tool registry |
| model picker pref | grader N inherits `--model=opus` from grader N-1 |

This produced a ~5–10% retry rate and non-reproducible scores: the eval baseline drifted between runs, and engineers chased phantom regressions.

**Fix:** `tests/evals/scripts/lib/eval-common.sh` exports `CLAUDE_CODE_FORK_SUBAGENT=1`, so every script that sources it (run-trigger-eval, run-quality-eval, run-agent-eval, optimize-description, etc.) gets forked graders automatically. The CI workflow `.github/workflows/orchestkit-eval.yml` also sets it at the workflow level. Older CC silently ignores the env var (no-op).

**Determinism contract:** running the same grader on the same fixture twice in a row produces the **same score**. Verified by `tests/evals/scripts/test-grader-determinism.sh`.

## Performance

| Scenario | Without --bare | With --bare | Savings |
|----------|---------------|-------------|---------|
| Single grading call | ~3-5s startup | ~0.5-1s | 2-4x |
| Trigger (per prompt) | ~3-5s | ~0.5-1s | 2-4x |
| Full eval (50 calls) | ~150-250s overhead | ~25-50s | 3-5x |

## Rules

```
Read("${CLAUDE_SKILL_DIR}/rules/_sections.md")
```

## Troubleshooting

```
Read("${CLAUDE_SKILL_DIR}/references/troubleshooting.md")
```

## Related

- `eval:skill` npm script: unified skill evaluation runner
- `eval:trigger`: trigger accuracy testing
- `eval:quality`: A/B quality comparison
- `optimize-description.sh`: iterative description improvement
- Version compatibility: `doctor/references/version-compatibility.md`