--- name: acp-harness description: Unified ACP client and evaluation harness. Connect to ACP-compatible agents programmatically, capture full trajectories (tools, thoughts, plans), and pipe to downstream analysis tools. compatibility: Requires plaited/acp, Bun >= 1.2.9 --- # ACP Harness ## Purpose This skill provides a **unified toolkit for ACP client usage and agent evaluation**, optimized for TypeScript/JavaScript projects using Bun. 1. **ACP Client API** - Headless programmatic access to ACP-compatible agents 2. **Evaluation Harness** - Run prompts against agents and capture full trajectories **Use this when:** - Comparing skills across different agents (Claude Code, Cursor, OpenCode, Amp, Goose, Factory) - Evaluating built-in tools vs MCP servers vs skills for the same task - Generating training data with full trajectory capture - Running regression tests in CI/CD pipelines - Building multi-agent applications on a headless transport layer ## Foundation Use Cases The harness is a **foundation layer** - it captures trajectories; scoring happens downstream. ```mermaid flowchart LR Harness["ACP Harness"] -->|"trajectories"| Scoring["Braintrust / Custom Script"] Scoring -->|"scores"| Decision["Informed Choices"] ``` | Use Case | Harness Provides | You Build | |----------|------------------|-----------| | **Cross-agent skill eval** | Same prompts → multiple agents → trajectories | Scoring pipeline (Braintrust, custom) | | **Tool comparison** | Trajectory with tool/skill attribution | Diff analysis, preference data | | **Training data** | Structured I/O with tool calls, plans, thoughts | SFT/DPO formatting for world-agent | | **Regression testing** | Deterministic prompt → trajectory capture | CI integration, golden file comparison | | **Multi-agent apps** | `createACPClient` transport layer | Session management, UI, agent switching | ### Agents Supporting Skills Skills can be installed across multiple agents, enabling cross-agent comparison: | Agent | Skills Directory | Install Command | |-------|------------------|-----------------| | Claude Code | `.claude/skills/` | `./install-workshop.sh --agent claude` | | Cursor | `.claude/skills/` | `./install-workshop.sh --agent cursor` | | OpenCode | `.opencode/skill/` | `./install-workshop.sh --agent opencode` | | Amp | `.agents/skills/` | `./install-workshop.sh --agent amp` | | Goose | `.claude/skills/` | `./install-workshop.sh --agent goose` | | Factory | `.factory/skills/` | `./install-workshop.sh --agent factory` | ### Example: Comparing Built-in vs Skill ```bash # Run same prompt with built-in tool bun scripts/run-harness.ts prompts.jsonl \ --agent claude-code-acp \ -o results-builtin.jsonl # Run same prompt with custom skill installed bun scripts/run-harness.ts prompts.jsonl \ --agent claude-code-acp \ --cwd /project/with/typescript-lsp-skill \ -o results-skill.jsonl # Compare trajectories - which used better tools? faster? more accurate? diff <(jq '.toolCalls' results-builtin.jsonl) <(jq '.toolCalls' results-skill.jsonl) ``` ## Execution Environment **Recommendation:** Run evaluations in Docker containers for consistent, isolated execution. ```bash # Build and run with Docker Compose docker compose -f docker-compose.acp.yml run --rm acp-harness # Or build directly docker build -f Dockerfile.acp -t acp-harness . docker run --rm -e ANTHROPIC_API_KEY acp-harness ``` Docker provides: - Consistent environment across local and CI - Filesystem isolation without app-level sandboxing - Reproducible results for training data generation See [assets/](assets/) for example container configurations: - `Dockerfile.acp` - Base container with Bun and git - `docker-compose.acp.yml` - Compose file with volume mounts for results ## Non-Goals This harness is optimized for TypeScript/JavaScript projects using Bun. It is **not** designed for: - **Python projects** - Use [SWE-bench](https://github.com/SWE-bench/SWE-bench), [Braintrust Python SDK](https://www.braintrust.dev/) - **Academic model benchmarking** - Use [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) - **IDE integrations** - Use Copilot Evaluation Harness - **SaaS observability** - Use Braintrust, Langfuse platforms directly ## Quick Reference | Resource | Description | |----------|-------------| | `scripts/run-harness.ts` | Execute prompts against agent, capture trajectories | | [client-api.md](references/client-api.md) | `createACPClient` configuration, helpers | | [output-formats.md](references/output-formats.md) | JSONL schemas, format options | | [downstream.md](references/downstream.md) | Integration patterns (Braintrust, jq, LLM-as-judge) | | [llm-judge-templates.md](references/llm-judge-templates.md) | Evaluation prompt templates | ## Evaluation Workflow ```mermaid flowchart LR Prompts["prompts.jsonl"] --> Harness["run-harness.ts"] Agent["ACP Agent"] --> Harness Harness --> Summary["summary.jsonl"] Harness --> Judge["results.md + results.full.jsonl"] Summary --> JQ["jq analysis"] Judge --> LLM["LLM-as-judge"] ``` 1. **Prepare** - Create `prompts.jsonl` with evaluation cases 2. **Execute** - Run harness against target agent 3. **Capture** - Trajectories streamed to output files 4. **Analyze** - Pipe to downstream tools for scoring ## Harness Script ### Basic Usage ```bash bun scripts/run-harness.ts --agent [options] ``` ### Arguments | Flag | Description | Default | |------|-------------|---------| | `prompts.jsonl` | Input file with evaluation prompts | Required | | `-a, --agent` | ACP agent command | `"claude-code-acp"` | | `-o, --output` | Output file/path | stdout | | `-c, --cwd` | Working directory for agent | current | | `-t, --timeout` | Request timeout in ms | `60000` | | `-f, --format` | Output format: `summary`, `judge` | `summary` | | `--progress` | Show progress to stderr | false | | `--append` | Append to output file | false | | `--mcp-server` | MCP server config JSON (repeatable) | none | ### Examples ```bash # Summary format (default) - minimal JSONL bun scripts/run-harness.ts prompts.jsonl -o results.jsonl # Judge format - creates two files for two-tier evaluation bun scripts/run-harness.ts prompts.jsonl --format judge -o results # Creates: results.md (summary with step IDs) + results.full.jsonl (complete trajectory) # With MCP server (stdio transport) bun scripts/run-harness.ts prompts.jsonl \ --mcp-server '{"type":"stdio","name":"fs","command":["mcp-filesystem","/data"]}' # With MCP server (HTTP transport) bun scripts/run-harness.ts prompts.jsonl \ --mcp-server '{"type":"http","name":"api","url":"http://localhost:3000"}' # Different agent (Droid ACP adapter) bun scripts/run-harness.ts prompts.jsonl --agent droid-acp -o results.jsonl # Stream with progress bun scripts/run-harness.ts prompts.jsonl --progress -o results.jsonl ``` ## Input Format Each line in `prompts.jsonl`: ```jsonl {"id":"test-001","input":"Create a primary button","expected":"should contain ) ``` 3. [MESSAGE] I created the button... [->test-001-step-3] **Output:** I created the button template with primary styling. **Metadata:** category=ui, agent=claude-code-acp **Status:** passed **Duration:** 1234ms --- ``` **`.full.jsonl`** - Complete trajectory with step IDs for correlation: ```jsonl {"id":"test-001","input":"...","output":"...","trajectory":[{"type":"thought","content":"...","timestamp":100,"stepId":"test-001-step-1"},{"type":"tool_call","name":"Write","status":"completed","input":{...},"output":{...},"duration":234,"stepId":"test-001-step-2"}],...} ``` **Usage patterns by judge context window:** | Judge Model | Strategy | |-------------|----------| | Gemini (1M+ tokens) | Feed `results.full.jsonl` directly | | Claude/GPT-4 (128-200k) | Use `results.full.jsonl` for most runs | | Smaller models | Use `results.md`, retrieve specific steps by ID as needed | ## Programmatic Usage ```typescript import { createACPClient, createPrompt, summarizeResponse } from 'plaited/acp' // Requires: npm install -g @zed-industries/claude-code-acp const client = createACPClient({ command: ['claude-code-acp'], cwd: '/path/to/project', }) await client.connect() const session = await client.createSession() const { updates } = await client.promptSync( session.id, createPrompt('Create a button with hover state') ) // Full trajectory is in updates const summary = summarizeResponse(updates) console.log({ text: summary.text, toolCalls: summary.completedToolCalls, hasErrors: summary.hasErrors }) await client.disconnect() ``` See [client-api.md](references/client-api.md) for complete API documentation. ## Downstream Integration The harness outputs standard JSONL that pipes to any tool: ```bash # Filter with jq cat results.jsonl | jq 'select(.metadata.category == "ui")' # Count tool usage cat results.jsonl | jq -s 'map(.toolCalls | length) | add' # Feed full trajectory to Gemini (large context) cat results.full.jsonl | your-gemini-judge.ts ``` See [downstream.md](references/downstream.md) for integration patterns with Braintrust, Gemini, and custom scorers. ## Evaluation Targets | Target | How to Evaluate | |--------|-----------------| | **Agent capability** | Direct prompts, analyze trajectory quality | | **Skills** | Set `--cwd` to project with skill, test skill-specific prompts | | **MCP Servers** | Use `--mcp-server` flag, verify tool usage in trajectory | | **Behavioral programs** | Analyze trajectory for bThread coordination patterns | ### Skill Evaluation ```bash bun scripts/run-harness.ts skill-prompts.jsonl \ --cwd /project/with/skill \ -o results.jsonl ``` ### MCP Server Evaluation ```bash bun scripts/run-harness.ts mcp-prompts.jsonl \ --mcp-server '{"type":"stdio","name":"fs","command":["mcp-filesystem"]}' \ -o results.jsonl ``` ## Related - **plaited/acp** - Core ACP client module - **world-agent** - Training workflow, trajectory generation - **plaited-behavioral-core** - bThread patterns for agent coordination