---
name: bare-eval
compatibility: "Claude Code 2.1.148+"
description: "Run isolated eval and grading calls using CC 2.1.81 --bare mode. Constructs claude -p --bare invocations for skill evaluation, trigger testing, and LLM grading without plugin/hook interference. Use when running eval pipelines, grading skill outputs, benchmarking prompt quality, or testing trigger accuracy in isolation."
tags: [eval, bare, grading, pipeline, testing, ci]
version: 1.0.0
author: OrchestKit
user-invocable: false
complexity: medium
context: inherit
persuasion-type: discipline
effort: low
---

# Bare Eval — Isolated Evaluation Calls

Run `claude -p --bare` for fast, clean eval/grading without plugin overhead.

**Requires CC 2.1.81 or later.** The `--bare` flag skips hooks, LSP, plugin sync, and skill directory walks.

## When to Use

- Grading skill outputs against assertions
- Trigger classification (which skill matches a prompt)
- Description optimization iterations
- Any scripted `-p` call that doesn't need plugins

## When NOT to Use

- Testing skill routing (needs `--plugin-dir`)
- Testing agent orchestration (needs full plugin context)
- Interactive sessions

## Prerequisites

```bash
# --bare requires ANTHROPIC_API_KEY (OAuth/keychain disabled)
export ANTHROPIC_API_KEY="sk-ant-..."

# Verify CC version
claude --version  # Must be >= 2.1.81
```
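
The two checks above can be scripted as a preflight. This is a minimal sketch, not part of OrchestKit: `version_ge` and `check_bare_prereqs` are hypothetical helper names, and it assumes that `claude --version` prints a dotted `N.N.N` version string and that `sort -V` is available.

```bash
# Hypothetical preflight helpers; names are illustrative, not OrchestKit APIs.

# Succeeds when version $1 >= version $2 (relies on `sort -V` ordering).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Returns 0 when both --bare prerequisites hold; prints the reason otherwise.
check_bare_prereqs() {
  local min_version="2.1.81" installed
  if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
    echo "ANTHROPIC_API_KEY is not set" >&2
    return 1
  fi
  # Assumption: the version appears as N.N.N somewhere in the output.
  installed="$(claude --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)"
  if [ -z "$installed" ] || ! version_ge "$installed" "$min_version"; then
    echo "claude ${installed:-not found}; need >= $min_version" >&2
    return 1
  fi
}
```

Call `check_bare_prereqs || exit 1` at the top of an eval script to fail fast before any `--bare` call is attempted.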

## Quick Reference

| Call Type | Command Pattern |
|-----------|----------------|
| Grading | `claude -p "$prompt" --bare --max-turns 1 --output-format text` |
| Trigger | `claude -p "$prompt" --bare --json-schema "$schema" --output-format json` |
| Streaming grade | `claude -p "$prompt" --bare --max-turns 1 --output-format stream-json` |
| Optimize | `echo "$prompt" \| claude -p --bare --max-turns 1 --output-format text` |
| Force-skill | `claude -p "$prompt" --bare --print --append-system-prompt "$content"` |
| @-file in prompt | `claude -p "grade @fixtures/case-1.md against rubric" --bare` (CC 2.1.113 Remote Control autocomplete) |

### `--output-format stream-json`

Newline-delimited JSON events (one per token/tool-call) — lets a runner score partial output or abort early on a failing probe without waiting for the full response.

```bash
claude -p "$prompt" --bare --max-turns 1 --output-format stream-json \
  | while IFS= read -r line; do
      # line is a single JSON event; inspect $.type == "content_block_delta"
      # `// empty` drops deltas that carry no text instead of printing "null"
      jq -r 'select(.type == "content_block_delta") | .delta.text // empty' <<< "$line"
    done
```

Use `stream-json` over `json` when:
- grading long outputs and you want incremental scoring,
- piping into another CLI step-by-step (e.g. `ork:eval-runner`),
- you need per-token timing data alongside the content.
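
The early-abort case can be sketched as a small stdin filter. This is an illustration under stated assumptions: `scan_stream` is a hypothetical helper, and the failing probe is assumed to appear as a literal marker string inside an event line. A real runner would more likely parse each event with `jq`.

```bash
# Hypothetical helper: reads newline-delimited events from stdin and stops
# at the first line containing the literal marker string in $1.
scan_stream() {
  local marker="$1" line count=0
  while IFS= read -r line; do
    count=$((count + 1))
    case "$line" in
      # Quoting "$marker" makes the match literal, not a glob pattern.
      *"$marker"*)
        echo "abort after $count events"
        return 1
        ;;
    esac
  done
  echo "completed after $count events"
}
```

Appended to the streaming command from this section (`... --output-format stream-json | scan_stream "ASSERTION_FAILED"`, where the marker is made up for the example), the function returns as soon as the marker shows up, so the runner does not wait for the rest of the response.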

## Invocation Patterns

Load detailed patterns and examples:

```
Read("${CLAUDE_SKILL_DIR}/references/invocation-patterns.md")
```

## Grading Schemas

JSON schemas for structured eval output:

```
Read("${CLAUDE_SKILL_DIR}/references/grading-schemas.md")
```

## Pipeline Integration

OrchestKit's eval scripts (`npm run eval:skill`) auto-detect bare mode:

```bash
# eval-common.sh detects ANTHROPIC_API_KEY → sets BARE_MODE=true
# Scripts add --bare to all non-plugin calls automatically
```
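
A minimal sketch of what that detection could look like. It is an assumption about the shape of the logic, not the actual contents of `eval-common.sh`: `detect_bare_mode`, `build_claude_args`, and the `CLAUDE_ARGS` array are hypothetical names.

```bash
# Hypothetical sketch of bare-mode detection; not the real eval-common.sh.

# Sets BARE_MODE=true when an API key is present, false otherwise.
detect_bare_mode() {
  if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    BARE_MODE=true
  else
    BARE_MODE=false
  fi
}

# Fills the CLAUDE_ARGS array for one call.
# $1 = call kind ("plugin" for calls that need plugin context), $2 = prompt.
build_claude_args() {
  local call_kind="$1" prompt="$2"
  CLAUDE_ARGS=(-p "$prompt")
  if [ "${BARE_MODE:-false}" = true ] && [ "$call_kind" != "plugin" ]; then
    CLAUDE_ARGS+=(--bare)
  fi
}
```

A script would then run `detect_bare_mode`, call `build_claude_args grading "$prompt"`, and invoke `claude "${CLAUDE_ARGS[@]}"`. Keeping the arguments in an array preserves prompts that contain spaces or newlines.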

- **Bare calls:** trigger classification, force-skill, baseline, all grading.
- **Never bare:** `run_with_skill` (needs plugin context for routing tests).

### CC 2.1.119: `--print` honors agent `tools:` / `disallowedTools:` (M122)

Before CC 2.1.119, `--print` mode ran with the full default tool set regardless of the agent's frontmatter `tools:` and `disallowedTools:`. Bare-eval grading was effectively ungated — graders could call any tool they wanted, even if the agent definition restricted them.

**As of 2.1.119, `--print` enforces the agent's declared tool surface.** Implications for eval design:

| Consequence | Action |
|---|---|
| Eval graders that relied on unrestricted tool access may now fail | Audit grader prompts for tools they actually need; whitelist explicitly via the agent's `tools:` frontmatter |
| Eval results match interactive runs | Reproducibility improves — grading what the model can actually do, not what it could do in an unsandboxed `--print` |
| `--agent <name>` also honors `permissionMode` in `--print` | Permission-gated tools (Bash, Edit) require either `permissionMode: acceptEdits` or explicit allowlists in the agent definition |

Migration test:

```bash
# Run an eval against an agent with a deliberately tight tools: list.
# Graders that previously called Read/Bash freely will now fail unless those
# tools are declared on the agent.
claude -p "$prompt" --bare --print --agent grader-test
```

If the grader fails with a "tool not permitted" error, add the required tool to the agent's `tools:` frontmatter and re-run.
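
For illustration, a tightly scoped `grader-test` definition could be generated as shown here. This is a sketch under assumptions: `write_grader_agent` is a hypothetical helper, the frontmatter layout (`name`, `description`, comma-separated `tools:`) and the grader instructions are assumed, and the target directory should be wherever your agent definitions actually live.

```bash
# Hypothetical helper: writes a minimal agent file with an explicit tools: list.
# $1 = directory holding agent definitions, $2 = comma-separated tool names.
write_grader_agent() {
  local agents_dir="$1" tools="$2"
  mkdir -p "$agents_dir"
  cat > "$agents_dir/grader-test.md" <<EOF
---
name: grader-test
description: Grades eval output against assertions
tools: $tools
---
You are a strict grader. Reply with PASS or FAIL and a one-line reason.
EOF
}
```

For example, `write_grader_agent "$agents_dir" "Read, Grep"` declares only the two tools the grader needs. Re-running the migration test then shows whether the grader still reaches for anything outside that list.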

### CC 2.1.121: `CLAUDE_CODE_FORK_SUBAGENT=1` for grader determinism (#1545)

Before CC 2.1.121, `CLAUDE_CODE_FORK_SUBAGENT` only took effect in interactive sessions. As of 2.1.121, **non-interactive paths (`claude -p`, SDK) honor it too**, so each grader invocation gets a fresh forked subagent context.

**The cross-eval state-leak problem this fixes:**

Without forking, sequential `claude -p --bare` graders inherit harness state:

| Inherited | Symptom |
|---|---|
| memory MCP query cache | grader sees stale hit from previous run; same fixture grades differently |
| `.claude/chain/*.json` on disk | grader for "implement" thinks "explore" already ran (file is from previous test) |
| ToolSearch deferred-tool cache | first grader's MCP loads bleed into next grader's tool registry |
| model picker pref | grader N inherits `--model=opus` from grader N-1 |

This produced a ~5–10% retry rate and non-reproducible scores: the eval baseline drifted between runs, and engineers chased phantom regressions.

**Fix:** `tests/evals/scripts/lib/eval-common.sh` exports `CLAUDE_CODE_FORK_SUBAGENT=1`, so every script that sources it (run-trigger-eval, run-quality-eval, run-agent-eval, optimize-description, etc.) gets forked graders automatically. The CI workflow `.github/workflows/orchestkit-eval.yml` also sets it at the workflow level. Older CC silently ignores the env var (no-op).

**Determinism contract:** running the same grader on the same fixture twice in a row produces the **same score**. Verified by `tests/evals/scripts/test-grader-determinism.sh`.
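
The contract can be checked with a run-twice comparison. The sketch below is an assumption about how such a check might look, not the contents of `test-grader-determinism.sh`; `check_determinism` is a hypothetical helper that accepts any command.

```bash
# Same setting that eval-common.sh exports, per the fix described above.
export CLAUDE_CODE_FORK_SUBAGENT=1

# Hypothetical helper: runs the given command twice and compares stdout.
# Returns 0 on identical output, 1 on drift, 2 if the command itself fails.
check_determinism() {
  local first second
  first="$("$@")" || return 2
  second="$("$@")" || return 2
  if [ "$first" = "$second" ]; then
    echo "deterministic"
  else
    echo "drift"
    return 1
  fi
}
```

Usage: `check_determinism claude -p "$prompt" --bare --max-turns 1 --output-format text`. This compares raw output text, so a grader that phrases the same score differently would need its score extracted before comparing.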

## Performance

| Scenario | Without --bare | With --bare | Speedup |
|----------|---------------|-------------|---------|
| Single grading call | ~3-5s startup | ~0.5-1s | 2-4x |
| Trigger (per prompt) | ~3-5s | ~0.5-1s | 2-4x |
| Full eval (50 calls) | ~150-250s overhead | ~25-50s | 3-5x |

## Rules

```
Read("${CLAUDE_SKILL_DIR}/rules/_sections.md")
```

## Troubleshooting

```
Read("${CLAUDE_SKILL_DIR}/references/troubleshooting.md")
```

## Related

- `eval:skill` npm script — unified skill evaluation runner
- `eval:trigger` — trigger accuracy testing
- `eval:quality` — A/B quality comparison
- `optimize-description.sh` — iterative description improvement
- Version compatibility: `doctor/references/version-compatibility.md`