# E2E results Latest run of the end-to-end suite. Regenerate after every code change with `npm run test:e2e` and update this file (paste the summary). - **2026-06-29 (canonical model ids)** — `/anthropic/v1/models` now maps Copilot's dotted ids to the dashed canonical ids Claude Code's native picker recognises (`claude-opus-4.8` → `claude-opus-4-8[1m]`, friendly display, `[1m]` for opus 4.6/4.7/4.8 + sonnet 4.6); inbound `resolveModel` strips `[1m]` and fuzzy-maps back to the dotted Copilot id. setup's default ANTHROPIC_MODEL also goes through the canonical map (was dotted `claude-opus-4.8[1m]` → picker couldn't match → stuck on "Opus 4 1M"). New `model-canonical` unit + router round-trip; HTTP e2e asserts no dotted ids leak + opus carries `[1m]`; CLI e2e drives the real `claude` through `claude-opus-4-8[1m]` AND the setup-default model. Suite **433 passed**, HTTP e2e **18 passed**, CLI e2e **all passed**, build clean. - **2026-06-29 (native /model picker)** — setup writes CLAUDE_CODE_ENABLE_GATEWAY_MODEL_DISCOVERY=1 so Claude Code populates /model from our /anthropic/v1/models; ANTHROPIC_MODEL[1m] stays the 1M default and does NOT lock the picker (verified vs docs). Reset clears the new key. Picker lists claude* ids; CC >=2.1.129. Project suite **422 passed**, e2e **43 passed**, build clean. - **2026-06-29 (TUI UX batch)** — Status shows per-scope client model (not just a check); chat setup tools require scope+model (ask before write); added recent_errors/metrics chat tools for command parity; /report body restructured to the #5 template; heartbeat interval configurable; versioned "what's new" banner shown ~3 launches. Project suite **421 passed**, e2e **43 passed**, build clean. - **2026-06-29 (runaway → /report)** — A guard trip now finishes 200 but tags the metric with a runaway reason (repetition/max_output/deadline) on all three backends. `recentErrors`/`aggregate` treat any tagged 200 as an error, and `/report` builds a "Stream runaways" section + a dedicated title so users file a prefilled issue (like #5) when a model degenerates. New unit tests in report/metrics-agg/anthropic-server. Project suite **417 passed**, build clean. - **2026-06-29 (stream runaway guards)** — Added `RunawayGuard` (repetition + max-output cap) + a 120s wall-clock deadline to all three streaming backends (Anthropic, OpenAI chat, OpenAI Responses). A degenerate upstream ("code\ncode\ncode…", never stops) is now cut to a clean `max_tokens` turn instead of freezing the session. Covered by `tests/core/stream-guard.test.ts` + an anthropic-server integration test (5000-delta source, asserts `emitted < 5000` and `stop_reason: max_tokens`). Full e2e: **43 passed / 0 failed**, project suite 414 passed, build clean. - **2026-06-29 (Docker e2e matrix + CI wiring)** — Added a third Docker e2e driver: a hermetic HTTP edge-case matrix (`e2e/docker/Dockerfile.http` + `http-e2e.mjs`) that boots the **real** worker + supervisor on a dummy token and drives them over HTTP — 15 checks covering proxy errors (malformed JSON→400, >20mb→413, 404, models/healthz/count_tokens), supervision lifecycle (status/doctor/ requests/dashboard/restart-recovery), and a deterministic `EventBus` isolation guard that **fails on a reverted PR #8** (verified: 15/15 with the guard, 2 fail without). Extended `cli-e2e.sh` with codex multi-line, claude constrained-numeric, and a `[1m]` model-id round-trip (now 7 real-CLI checks, all pass with a token mounted). New CI `docker-e2e` job runs http-e2e hermetically every push and gates the real-CLI run on a `CR_GH_TOKEN` secret. Hermetic suite green: `npm test` → **406 passed**, build clean. (HTTP edge cases are quota-free; only the golden round-trips + cli-e2e need a real token.) - **2026-06-29 (crash-guard hardening + code-review fixes)** — Guarded the synchronous throw sites that could kill the single-process TUI+supervisor app (a dead-socket SSE write inside `EventBus.emit`, a corrupt/locked `creds.json` read on the heartbeat tick), and added a process-level `unhandledRejection`/`uncaughtException` backstop that logs to `~/.copilot-reverse/crash.log` and keeps the TUI alive. A `/code-review` pass surfaced six follow-ups, all fixed: `CopilotTokenStore` now reads its GitHub token through a provider (a transient null can't poison the store for the session — it raises a clean 401 and recovers on the next read, instead of sending `authorization: token null`); the backstop uses a shared, size-capped/rotating crash logger and a 5-in-10s circuit breaker that exits for a clean restart on a storm; the heartbeat's defensive catch now logs instead of swallowing silently; the signed-out gate checks token-file existence (`hasGhTokenFile`) so a transient lock no longer reads as "signed out"; and the SSE route subscribes before the hello frame. Full suite green: `npm test` → **406 passed** (57 files), tsc build clean. Both Docker e2e suites pass against real services with a token (+ WebIQ key) mounted: heartbeat container (`e2e/docker/Dockerfile`) → **all 4 states passed** (connected/signed-out/expired/HTTP, real GitHub 401); real CLI container (`Dockerfile.cli`) → **4/4 passed** (`codex exec → /openai/responses → CODEX_OK`, `claude -p → /anthropic → CLAUDE_OK`, `claude` gateway web search → grounded "1.96.0", no tool leak) — confirming the worker's plain-string `CopilotTokenStore` path still works under the new union-typed constructor. - **2026-06-28 (real CLI Docker e2e + 2 Codex bugs fixed)** — Added a true black-box e2e: the actual `claude` and `codex` CLIs run inside a `node:22` container against the real worker daemon (`e2e/docker/Dockerfile.cli` + `cli-e2e.sh`), with a real token (+ WebIQ key) mounted. All three checks pass: `codex exec → /openai/responses → CODEX_OK`, `claude -p → /anthropic → CLAUDE_OK`, and `claude` gateway web search → a grounded answer ("1.96.0"). The driver writes a markdown report after each run (`/out/report.md`). This path **caught two bugs nothing else did**: (1) Codex sends `custom`/`tool_search` tools that the inbound translator forwarded nameless → Copilot 400 "Missing required parameter: tools[N].name" → "stream closed before response.completed"; fixed by keeping `custom` tools as named tools and allow-listing only nameless hosted tools (`web_search`). (2) The `ResponsesSSE` terminal events carried empty text, so Codex completed the turn but rendered nothing; fixed by replaying the accumulated text in `output_text.done` / `content_part.done` / `output_item.done` and populating `response.completed.response.output`. Also added 12 hermetic in-process Codex `/responses` cases (EP-27…EP-38) as fast regression. No secrets are committed — the only key in any file is the placeholder `copilot-reverse-local`; real creds/keys mount at runtime and report artifacts are gitignored. Full suite green: `npm test` → **394 passed** (56 files), tsc build clean, real CLI Docker e2e → **all passed**. - **2026-06-28 (GitHub-token heartbeat)** — Added a supervisor-side heartbeat that periodically (every 60s, plus once ~2s after boot) checks whether the stored GitHub token still exchanges for a Copilot token, and surfaces the result via a new optional `github` field on `/api/status`. The TUI's existing 2s status poll drives a live footer badge (`github ✓` / `✗ /login`), so an expired/revoked login shows within ~60s instead of only on the next failed request or a manual `/status`. Key design: a classifying `probeGithubAuth` distinguishes a definitive 401/403 (→ expired) from a transient timeout/5xx/network error (→ keep last-known-good), so a single GitHub hiccup never flips the badge; `nextGithubStatus` is a pure, sticky reducer. `signed-out` (no token) stays distinct from `expired`. New files: `src/supervisor/github-heartbeat.ts`, `tests/supervisor/github-heartbeat.test.ts`. **Real Linux container e2e** (`e2e/docker/`): a Docker image boots the real control API + a real `GithubHeartbeat.start()`, listens on a real port, and drives it over real HTTP (`GET /api/status`) against the real GitHub API — all four states pass, including `connected` (real token → real Copilot token exchange, mounted read-only) and `expired` (bad token → a real GitHub 401). Only the probe interval is shortened (now injectable) for the test. Full suite green: `npm test` → **378 passed** (56 files), `npm run test:e2e` → **31 passed** (4 files), container e2e → **all passed**, tsc build clean. - **2026-06-26 (default to WebIQ; disable slow borrow)** — gpt-5-mini (the Claude "borrow" backend model) is badly congested on Copilot's `/responses`: repeated `503 "high demand"` and 20s–7min stalls measured live (one user search ran 437s). Same native search on gpt-5.4-mini/gpt-5.4/gpt-5.5 is ~4s, and WebIQ is sub-second — so borrow is gated behind `COPILOT_WEB_SEARCH_ENABLED` (now `false`) and web search routes through WebIQ. `resolveWebSearchBackend(mode, hasKey)` → `copilot | webiq | unavailable` centralises the policy; the runner returns a fixed "run /webiq …" message (with the profile URL) when unavailable. Also: `borrowSearch` gained a 30s timeout (the missing one is why a stalled search hung the turn for minutes) and runs at `reasoning.effort:"low"` (~5-6x faster when it is used). status card + HUD show the resolved backend incl. unavailable. Codex's native `/responses` web_search is untouched. Verified live: Claude `web_search` routes through WebIQ end-to-end (`end_turn`, grounded Rust 1.96.0 answer, no tool_use leak); no-key path returns the `/webiq` guidance. Full suite green: `npm test` → **356 passed** (55 files), `npm run test:e2e` → **31 passed** (4 files), tsc build clean. - **2026-06-26 (Phase 2: web-search backends)** — Web search now works out of the box for both clients, with no key. **Claude path**: the gateway borrows gpt-5-mini's native web_search internally (`borrow-search.ts`), extracts url_citation sources, and feeds them back — verified live through the worker's `/anthropic/v1/messages` with a `web_search_20250305` tool: returned a fresh grounded answer (Rust 1.96.0), `stop_reason: end_turn`, and NO tool_use block leaked to the client. **Codex path**: the hosted `web_search` tool is passed through both `/responses` translators (new `CanonicalRequest.hostedTools`) so Copilot runs it server-side — verified live: gpt-5.5 returns output items `reasoning/web_search_call/message` with citations. **Backend routing**: gateway runner picks per call — default "copilot" borrow, or "webiq" (forced WebIQ) when enabled; mode persisted in webiq.json, read lazily (no restart). **Commands**: removed `/web-search-support`; added hidden `/webiq` + `/webiq clean`; status card + HUD show the active backend. Full suite green: `npm test` → **350 passed** (55 files), `npm run test:e2e` → **31 passed** (4 files), tsc build clean. - **2026-06-26 (outbound /responses routing)** — Newer Copilot models are served ONLY on /responses: live probe confirms `gpt-5.5` (and `gpt-5.3-codex`, `gpt-5.4-mini`, `mai-code-1-flash-internal`) report `supported_endpoints: ["/responses","ws:/responses"]` with no `/chat/completions`. Added the outbound translation module (`src/providers/copilot/responses-upstream.ts`), routed the adapter by `supported_endpoints` (with a `/chat` 400 `unsupported_api_for_model` → `/responses` retry net), and wired `fetchModelEndpoints` into the worker (lazy, same source as the model list). Verified live end-to-end against real Copilot: adapter `complete()`/`stream()` on gpt-5.5 hit /responses (`ROUTED_OK` / `STREAM_OK`); and through the worker over HTTP, `POST /openai/responses` non-stream returned `status: completed` + `output_text: E2E_RESPONSES_OK`, stream ran `response.created … response.completed`. gpt-4o still routes to /chat/completions. Full suite green: `npm test` → **329 passed** (54 files), `npm run test:e2e` → **31 passed** (4 files), tsc build clean. - **2026-06-26 (split antml: sentinel)** — Fixed a recurrence of the text-emitted-tool-call bug: the `ToolCallExtractor` split-sentinel hold-back (`PREFIX_TOKENS`) listed only the bare `` forms — the two namespaced slots were duplicated bare copies. Copilot streams Claude's `antml:`-namespaced tool call token by token, so an opening `` block (seen live as a `TaskUpdate` printed instead of executed). Derive the namespaced variants from the bare tokens so both are held back; added split-at-every-offset regression tests. Affects both outbound paths (Anthropic + the new `/responses` SSE) since they share the extractor. Full suite green: `npm test` → **312 passed** (53 files), `npm run test:e2e` → **31 passed** (4 files), tsc build clean. - **2026-06-26 (Codex /responses)** — Fixed Codex startup: new Codex removed `wire_api = "chat"` (codex#7782) and requires `"responses"`, which makes it POST `{base_url}/responses`. Implemented the OpenAI Responses API at `POST /openai/responses` (new `src/core/responses-inbound.ts`: item-centric request → canonical, canonical → response object, and a stateful SSE emitter with `sequence_number` and the `response.created → output_item.added → content_part.added → output_text.delta → … → response.completed` event sequence). Codex config now writes `wire_api = "responses"` (base_url stays `…/openai`; Codex appends `/responses`). Verified live end-to-end against real Copilot: non-stream returns a completed `response` object with an `output_text` item; stream begins `response.created` and ends `response.completed` with `output_text.delta` in between. Full suite green: `npm test` → **294 passed** (52 files), `npm run test:e2e` → **31 passed**, tsc build clean. - **2026-06-26** — Restored web_search / web_fetch for Claude Code through the gateway. The inbound translator now converts Anthropic's server-side web tools to function tools (instead of dropping them); a capped agentic loop in the Anthropic endpoint runs those tools internally against Microsoft Web IQ (`api.microsoft.ai`) and feeds results back, so the client only ever sees the final grounded answer (transparent). A new `/web-search-support` command stores the WebIQ key (data dir or `WEBIQ_API_KEY`, read lazily — no worker restart). Also added a startup status card (GitHub login state — connected/expired/signed-out, since the device-flow token has no real expiry — plus web search readiness, worker, and configured clients) and a live `web ✓/✗` indicator in the now two-line HUD footer. Verified live end-to-end (real Copilot + real WebIQ): a `web_search` request returned `stop_reason: end_turn` with a fresh answer and no tool_use block leaked to the client. Full suite green: `npm test` → **281 passed** (51 files), `npm run test:e2e` → **31 passed** (4 files), tsc build clean. - **2026-06-25** — Fixed the `/login` deadlock: the slash command buffered the device code behind a blocking token poll, so the Repl showed nothing and the user could never authorize. Split device login into `beginDeviceLogin` (returns the code immediately) + `complete()` (polls), and gave the TUI a non-blocking `/login` branch that renders the verification URL + code first, then a completion card. Also hardened the failure path: auth errors (e.g. `incorrect_device_code`) render a clean error card instead of crashing the process, and a double Enter no longer starts two device-code flows. Added a pre-flight auth gate on chat: a signed-out or expired-login message is blocked immediately with a "run /login" hint instead of hanging until the 120s turn timeout. Full suite green: `npm test` → **242 passed** (47 files), `npm run test:e2e` → **31 passed** (4 files), tsc build clean. - **2026-06-23** — Added `ToolCallExtractor` (recovers text-emitted ``/`` tool calls into structured tool chunks) plus changeset-driven auto-release (build-time version injection, release workflow). Full suite green: `npm test` → **236 passed** (47 files), e2e all green, tsc build clean. - **Date:** 2026-06-22 22:14 CST - **Runner:** vitest 2.1.9 · Node v25.6.0 - **Command:** `npm run test:e2e` - **Outcome:** ✅ 30 passed / 0 failed (4 files) The main suite (`copilot-reverse.e2e.test.ts`) covers EP-01 … EP-26 — see [`cases.md`](./cases.md) for the catalog. Plus the pre-existing smoke specs under `tests/e2e/` (m1-smoke ×2, m1a-smoke, anthropic-mixed-stream). ``` Test Files 4 passed (4) Tests 30 passed (30) ``` ## Full project suite `npm test` (unit + integration + e2e): **215 passed**, tsc build clean. ## Live integration (opt-in) `npm run test:integration` against the real Copilot API (local login required): **7 passed** (token exchange, model discovery incl. 1M window, OpenAI completion, Anthropic streaming with distinct answers, real message_delta usage, count_tokens). Auto-skips with no login.