# E2E test cases

End-to-end scenarios that wire the **real** worker + supervisor + TUI modules together the way
the daemon does (worker metric sink → supervisor SQLite → control API), using a fake Copilot
provider so no live network/token is needed. Spec: [`copilot-reverse.e2e.test.ts`](./copilot-reverse.e2e.test.ts).

**Policy: every code update must keep all e2e cases green.** Run with `npm run test:e2e`
(`npm test` also runs them — the suite is included in the default vitest run).

| ID | Scenario | Expected result |
|----|----------|-----------------|
| EP-01 | Anthropic `POST /v1/messages` with `stream: true` | SSE contains `message_start`, a `text` delta, and `message_stop` |
| EP-02 | OpenAI `POST /v1/chat/completions` | `choices[0].message.content` is the assistant text |
| EP-03 | `POST /v1/messages/count_tokens` | `200` with `input_tokens > 0` (lets clients time auto-compaction) |
| EP-04 | `/v1/messages` carrying an Anthropic **server-side tool** (`web_search_20250305`) | request completes `200` — the built-in tool is dropped, so the client never hangs waiting for a `tool_result` |
| EP-05 | A failing provider stream | worker emits `event: error` (not a silent close) **and** the supervisor records the failure with its message, visible at `GET /api/requests` / dashboard |
| EP-06 | `GET /` on the supervisor | `200` HTML dashboard page |
| EP-07 | `/logs` slash command | lists recent request errors with their messages |
| EP-08 | `/dashboard` and `/report` slash commands | open the dashboard URL and a prefilled GitHub issue URL in the browser |
| EP-09 | `/reset-claude` after `setup` wrote config | removes exactly the `ANTHROPIC_*` keys copilot-reverse added, preserving the rest |
| EP-10 | two concurrent Anthropic streams | each `message_start` carries a UNIQUE message id (no dedupe-to-first) |
| EP-11 | Anthropic stream | `message_start` seeds a non-zero `input_tokens` estimate (context bar not stuck at 0%) |
| EP-12 | provider returns usage | `message_delta` reports `input_tokens` (prompt − cached), `output_tokens`, `cache_read_input_tokens` |
| EP-13 | OpenAI stream with usage | a usage chunk with `total_tokens` is emitted before `[DONE]` |
| EP-14 | OpenAI stream fails mid-flight | an `error` chunk is emitted, not a silent close |
| EP-15 | dated Anthropic id (`claude-opus-4-8-20251101`) | fuzzy-matched to the available Copilot model |
| EP-16 | `claude-opus-4.8[1m]` request | the `[1m]` suffix is stripped before forwarding |
| EP-17 | Anthropic image block | round-trips through the proxy as image content (vision) |
| EP-18 | mixed text+tool stream | text@0, tool@1, `stop_reason=tool_use` |
| EP-19 | non-stream tool_use response | maps to Anthropic `tool_use` content |
| EP-20 | OpenAI assistant tool_call + tool result | both reach the provider as canonical blocks |
| EP-21 | failed request | error message persists in `request_log`, queryable via `/api/requests` |
| EP-22 | control API | exposes status, doctor, requests endpoints |
| EP-23 | fresh db | round-trips a recorded request (migration-safe schema) |
| EP-24 | `setup-claude` global | HUD status reports configured (user scope); `[1m]` + window written |
| EP-25 | `setup-codex` | writes native `~/.codex/config.toml` with `model_context_window` |
| EP-26 | reset after 1M setup | removes every key including the 1M-window keys |
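
As an illustration of how a case such as EP-01 might assert on the stream, the sketch below parses a raw SSE body and checks for the three frames the table names. `parseSse` and `hasCompleteAnthropicStream` are hypothetical helpers, not functions from the spec file; the `content_block_delta` / `text_delta` names assume Anthropic's public streaming format.

```typescript
// Hypothetical helpers for illustration only; the real spec may assert differently.
interface SseFrame {
  event: string;
  data: string;
}

// Split a raw SSE body into frames. Frames are separated by a blank line.
function parseSse(raw: string): SseFrame[] {
  const frames: SseFrame[] = [];
  for (const block of raw.replace(/\r\n/g, "\n").split("\n\n")) {
    let event = "message"; // SSE default when no `event:` field is present
    const dataLines: string[] = [];
    for (const line of block.split("\n")) {
      if (line.startsWith("event:")) {
        event = line.slice(6).trim();
      } else if (line.startsWith("data:")) {
        // Drop the single optional space after the colon.
        dataLines.push(line.slice(5).replace(/^ /, ""));
      }
    }
    if (dataLines.length > 0) {
      frames.push({ event, data: dataLines.join("\n") });
    }
  }
  return frames;
}

// EP-01 style check: message_start, then a text delta, then message_stop.
function hasCompleteAnthropicStream(raw: string): boolean {
  const frames = parseSse(raw);
  const names = frames.map((f) => f.event);
  const start = names.indexOf("message_start");
  const stop = names.lastIndexOf("message_stop");
  const textDelta = frames.findIndex((f) => {
    if (f.event !== "content_block_delta") return false;
    const parsed = JSON.parse(f.data) as { delta?: { type?: string } };
    return parsed.delta?.type === "text_delta";
  });
  return start !== -1 && textDelta > start && stop > textDelta;
}
```

A test could feed the worker's response text to `hasCompleteAnthropicStream` and expect `true`; a stream that closes silently after `message_start` would return `false`.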

## Codex `/responses` (EP-27 … EP-38)

The OpenAI Responses API, exercised end-to-end through a booted worker (Codex speaks only this API). The cases are
hermetic: fake provider, no network. Spec: the same `copilot-reverse.e2e.test.ts`, under `describe("E2E: Codex /responses")`.

| ID | Scenario | Expected result |
|----|----------|-----------------|
| EP-27 | non-stream `/openai/responses` | `object:"response"`, `status:"completed"`, an `output_text` message item, `usage` totals |
| EP-28 | streaming `/openai/responses` | ordered `response.created → output_item.added → content_part.added → output_text.delta → …done → response.completed`, monotonic `sequence_number`, final `usage` |
| EP-29 | streaming tool call | `function_call` item + `function_call_arguments.delta/.done`, args reassemble |
| EP-30 | non-stream tool call | maps to a `function_call` output item with `call_id`/`name`/`arguments` |
| EP-31 | prior `function_call` + `function_call_output` in `input` | round-trips to the provider as `tool_use` + `tool_result` |
| EP-32 | `input_image` content part | round-trips to the provider as an image block |
| EP-33 | `instructions` | becomes a `system` message |
| EP-34 | hosted `web_search` tool + a function tool | `web_search` passes through as a hostedTool; the function tool is kept |
| EP-35 | expired token | `401` with `error.type:"error"` (login hint) |
| EP-36 | mid-stream failure | a `data: {"type":"error"}` frame, not a silent close |
| EP-37 | a `/responses` request | recorded in the supervisor `request_log` with `endpoint:"/openai/responses"` |
| EP-38 | `gpt-4o[1m]` model | the `[1m]` suffix is stripped before forwarding |
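
The ordering and `sequence_number` expectations in EP-28 can be expressed as a small checker. The sketch below is an assumption about how such a check might look, not the code in the spec: `checkResponsesStream` is a hypothetical name, the full `response.*` event names assume OpenAI's published naming (the table abbreviates them), and "monotonic" is interpreted here as strictly increasing.

```typescript
// Hypothetical checker for illustration; not taken from the spec file.
interface ResponsesEvent {
  type: string;
  sequence_number: number;
}

// Milestones that must appear in this relative order (other events may sit between them).
const MILESTONES = [
  "response.created",
  "response.output_item.added",
  "response.content_part.added",
  "response.output_text.delta",
  "response.completed",
];

// Returns a list of problems; an empty list means the stream looks well-formed.
function checkResponsesStream(events: ResponsesEvent[]): string[] {
  const problems: string[] = [];

  // 1. sequence_number must strictly increase from one event to the next.
  let previous = Number.NEGATIVE_INFINITY;
  for (const e of events) {
    if (e.sequence_number <= previous) {
      problems.push(`sequence_number not increasing at ${e.type}`);
    }
    previous = e.sequence_number;
  }

  // 2. Each milestone must occur after the previous one.
  let cursor = -1;
  for (const milestone of MILESTONES) {
    const idx = events.findIndex((e, i) => i > cursor && e.type === milestone);
    if (idx === -1) {
      problems.push(`missing or out of order: ${milestone}`);
    } else {
      cursor = idx;
    }
  }

  // 3. The stream must terminate with response.completed.
  if (events[events.length - 1]?.type !== "response.completed") {
    problems.push("stream did not end with response.completed");
  }
  return problems;
}
```

Returning a list of problems rather than a boolean keeps a failing test readable: the assertion message can name the first event that broke the contract.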

## What each case protects (regressions it would catch)

- **EP-01/EP-02** — core proxy translation (OpenAI/Anthropic ⇄ Copilot canonical), streaming framing.
- **EP-03** — the count_tokens endpoint Claude Code relies on; missing → 404 → compaction mis-times.
- **EP-04** — the "infinite hang on server-side tools" class of bug (agent-maestro #163/#150).
- **EP-05** — the headline fixes: no silent stream-close, and request-error capture end-to-end.
- **EP-06** — dashboard route stays mounted.
- **EP-07/EP-08/EP-09** — TUI command wiring: logs/error visibility, dashboard/report, config reset.
- **EP-10/EP-11/EP-12/EP-13** — the usage/id fixes: unique message id (different asks → different answers), non-zero context bar, real token usage.
- **EP-15/EP-16** — model resolution (fuzzy match + 1M `[1m]` suffix).
- **EP-17** — vision passthrough (images not dropped).
- **EP-18/EP-19/EP-20** — tool-call translation in both directions.
- **EP-24/EP-25/EP-26** — the full setup→status→reset lifecycle for both clients.

## Live integration tests (opt-in, real Copilot)

[`copilot-live.integration.test.ts`](./copilot-live.integration.test.ts) hits the REAL Copilot
endpoints end-to-end (GitHub token exchange → worker → adapter → api.githubcopilot.com). It is
**not** part of `npm test` — run it with `npm run test:integration`. Every case auto-skips when no
GitHub login is on disk (so CI stays hermetic). Coverage: token exchange, model discovery (incl. a
real 1M-window model), OpenAI completion, Anthropic streaming with **different questions → different
answers** (the unique-id regression guard), real `message_delta` usage, and count_tokens.
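
The `message_delta` usage that EP-12 and the live suite check comes down to one subtraction: reported input tokens are the prompt tokens minus the cached ones. The sketch below shows that arithmetic under stated assumptions: `toAnthropicUsage` is a hypothetical name, and the provider-side shape (`prompt_tokens`, `completion_tokens`, `prompt_tokens_details.cached_tokens`) assumes an OpenAI-style usage object, which the real adapter may not use.

```typescript
// Assumed OpenAI-style usage object coming back from the provider.
interface ProviderUsage {
  prompt_tokens: number;
  completion_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

// Fields EP-12 expects on the Anthropic `message_delta` event.
interface AnthropicUsage {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens: number;
}

// Hypothetical mapper: input_tokens = prompt - cached, never below zero.
function toAnthropicUsage(usage: ProviderUsage): AnthropicUsage {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return {
    input_tokens: Math.max(0, usage.prompt_tokens - cached),
    output_tokens: usage.completion_tokens,
    cache_read_input_tokens: cached,
  };
}
```

For example, a provider report of 1200 prompt tokens with 1000 cached and 50 completion tokens would map to `input_tokens: 200`, `output_tokens: 50`, `cache_read_input_tokens: 1000`.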

## Real CLI Docker e2e (opt-in, real `claude` + `codex`)

The fullest test: the **actual `claude` and `codex` CLIs** drive the **real worker daemon** inside a
Linux container, with a real GitHub token (and optional WebIQ key) mounted. See
[`docker/README.md`](./docker/README.md) — built/run via `e2e/docker/Dockerfile.cli` + `cli-e2e.sh`,
not part of `npm test`. It writes a markdown report after each run. Checks:

| Scenario | Path | Passes when |
|----------|------|-------------|
| `codex exec` | `/openai/responses` | the model returns `CODEX_OK` |
| `claude -p` | `/anthropic/v1/messages` | the model returns `CLAUDE_OK` |
| `claude` web search | gateway `web_search` loop → WebIQ | a grounded answer (a Rust `1.x` version), no error |
| codex multi-line | `/openai/responses` | two-line reply preserved (`LINE_ONE`/`LINE_TWO`) |
| claude constrained | `/anthropic/v1/messages` | `6*7` → `42` |
| `[1m]` model id | resolveModel strip | `gpt-4o[1m]` still answers `ONEM_OK` |
| model discovery | `/anthropic/v1/models` | picker gets dashed `claude-opus-4-8[1m]`, no dotted ids leak |
| canonical opus | `/anthropic/v1/messages` | `claude-opus-4-8[1m]` resolves to Copilot opus + answers `OPUS_OK` |
| setup default model | `claudeCopilotReverseEnv` | the default ANTHROPIC_MODEL is dashed `claude-opus-4-8[1m]` + answers `DEFAULT_OK` |
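
Several rows above depend on model-id normalization: the `[1m]` suffix is stripped, dashed ids map to Copilot's dotted ids, and (per EP-15) a dated id still resolves. A minimal sketch of that idea follows. `resolveModelSketch` and its helpers are hypothetical and deliberately simpler than whatever the real `resolveModel` does; in particular, the exact-match-after-normalization rule stands in for the project's fuzzy matching.

```typescript
// Remove a trailing 1M-window marker such as "[1m]".
function stripWindowSuffix(id: string): string {
  return id.replace(/\[1m\]$/i, "");
}

// Remove a trailing 8-digit date stamp such as "-20251101".
function stripDateSuffix(id: string): string {
  return id.replace(/-\d{8}$/, "");
}

// Turn dotted version numbers into dashed ones: "4.8" -> "4-8".
function toDashed(id: string): string {
  return id.replace(/(\d)\.(\d)/g, "$1-$2");
}

// Hypothetical resolver: normalize the requested id, then look for an
// available model whose normalized id is identical.
function resolveModelSketch(requested: string, available: string[]): string | undefined {
  const key = toDashed(stripDateSuffix(stripWindowSuffix(requested))).toLowerCase();
  return available.find((model) => toDashed(model).toLowerCase() === key);
}
```

With `available = ["claude-opus-4.8", "gpt-4o"]`, the requests `claude-opus-4-8[1m]`, `claude-opus-4-8-20251101`, and `gpt-4o[1m]` would each resolve to an existing entry, while an unknown id yields `undefined`.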

## HTTP edge-case Docker e2e (hermetic — no real Copilot)

Boots the **real** worker (:7891) + supervisor (:7890) and drives them over HTTP on a dummy token, so
error paths, supervision lifecycle, and the crash-guard regression run without a real token or quota.
It is built from `e2e/docker/Dockerfile.http` + `http-e2e.mjs` and runs on every CI push. Checks: malformed JSON→400,
a payload over 20 MB→413, unknown route→404, models/healthz/count_tokens shapes, status/doctor/requests/dashboard,
restart recovery, dead-socket broadcast churn survival, and a deterministic `EventBus` isolation guard
that **fails on a reverted PR #8** (throwing subscriber must not escape `emit`). It also checks model
discovery: `/anthropic/v1/models` advertises Claude families as dashed canonical ids + display + a
`[1m]` badge (`claude-opus-4-8[1m]`) and never leaks Copilot's dotted ids — so Claude Code's native
picker lights up. Real round-trips run only when a real token is mounted.
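
The isolation property that guard protects (a throwing subscriber must not escape `emit`) can be illustrated with a tiny bus. This is a sketch of the property only: `IsolatedEventBus` is a hypothetical class, not the project's `EventBus`, and collecting errors in an array is an assumption made so the example can be inspected.

```typescript
type Listener<T> = (payload: T) => void;

// Hypothetical bus showing the property under test, not the real implementation.
class IsolatedEventBus<T> {
  private readonly listeners = new Set<Listener<T>>();
  // Errors thrown by subscribers are kept for inspection instead of rethrown.
  readonly errors: unknown[] = [];

  // Register a listener; the returned function unsubscribes it.
  subscribe(listener: Listener<T>): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  // Deliver to every listener. Returns how many listeners completed without throwing.
  emit(payload: T): number {
    let delivered = 0;
    // Copy first so a listener that unsubscribes mid-emit cannot disturb iteration.
    for (const listener of Array.from(this.listeners)) {
      try {
        listener(payload);
        delivered += 1;
      } catch (err) {
        // Isolate the failure: one bad subscriber must not block the others.
        this.errors.push(err);
      }
    }
    return delivered;
  }
}
```

A deterministic guard in this style subscribes one listener that always throws and one that records payloads, calls `emit`, and asserts that the call returns normally and the second listener still received the event. Removing the `try`/`catch` makes that assertion fail.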

This black-box path caught two bugs nothing else did: a Codex tool-translation `400` (a `custom`/
`tool_search` tool forwarded nameless → Copilot rejects → "stream closed before response.completed"),
and empty terminal Responses events (`output_*.done` carried no text → Codex rendered nothing).