# Reactive Agents — Roadmap

> **Last updated:** 2026-06-17 (v0.12.0 "Durable & Honest" published to npm — durable exec A–E + HITL, typed structured output, memory default-OFF, effect-free hooks)
> **The open-source agent framework built for control, not magic.**

This roadmap is the public-facing milestone tracker. The internal authoritative direction lives in `wiki/Decisions/2026-06-10-roadmap-realignment-v0.12-v1.0.md` + `wiki/Architecture/Specs/05-DESIGN-NORTH-STAR.md`. When they disagree, the decision doc wins and this doc is out of date — open an issue.

**Live board:** [github.com/users/tylerjrbuell/projects/1](https://github.com/users/tylerjrbuell/projects/1)

---

## Where we are today (v0.12.0 published)

- **v0.12.0** on npm (2026-06-17): all 35 packages, "Durable & Honest" — crash-resumable runs, durable human-in-the-loop approval gates, typed structured output across every provider, and honesty defaults (memory off, grounding opt-in, debrief off the critical path).
- **v0.11 line shipped:** Compose API (7 chokepoints, killswitches), Snapshot/Replay, `rax-diagnose`, OTel exporter (`@reactive-agents/observe`), `create-reactive-agent`, code-action strategy, skill persistence, gateway chat, Cortex studio with parameterized runs.
- **Architecture:** canonical kernel (capability-grouped, acyclic, single termination owner, unified `project()` context assembly). Independent audit grade: structurally healthy.

**What we learned (and publish honestly):**
- Heavy strategies (reflexion / tree-of-thought / plan-execute) show **no quality lift over the reactive kernel** on our internal benches, at 3–15× cost on local models. We document them as frontier/niche options, not headline features.
- The defensible, externally-demanded strengths are: **agents that actually work on local models** (per-model calibration, 4-stage tool-call healing, tier-adaptive context) and the **local flight recorder** (deterministic replay + diagnosis CLI, no SaaS attached).

---

## v0.12 — "Durable & Honest" ✅ Released 2026-06-17

**Goal:** close the one 2026 table-stakes gap, and make the surface match the quality of the internals. One migration event for users.

| Track | Contents | Status |
|---|---|---|
| **Durable execution** | Crash-resume: opt-in `.withDurableRuns()`, SQLite RunStore, checkpoint-every-N, `agent.resume(runId)`. Durable human-in-the-loop: approval requests survive process death (`approve`/`deny` from a new process). Acceptance: SIGKILL mid-run → resume → identical output. | **Shipped A–E** (checkpoint seam + state codec, SQLite RunStore + `.withDurableRuns()`, `resumeRun()`, durable HITL via `.withApprovalPolicy`/`approveRun`/`denyRun` on both `run()` and `runStream()` paths, plus Cortex UI: launch durable runs + approval gates from the config panel with an app-wide approval-toast prompt). |
| **DX wave** | Effect-free lifecycle hooks (no `Effect.succeed` required), builder consolidation (observability 5 methods → 1; one canonical hook route; documented config precedence), plain-Error mapping at promise boundaries. | **Shipped** — effect-free hooks (`.withHook()` plain fns) + observability consolidation: `.withObservability({ cortex, telemetry, logging, tracing, health, audit, costs })` fans out to one canonical route (dedicated methods retained, last-call-wins precedence documented). |
| **Memory default OFF** | Breaking-ish: stateless by default, one-line `.withMemory()` opt-in. No more surprise SQLite writes in CI. | **Shipped** (memory default-OFF; `balanced()`/`intelligent()` opt in explicitly). |
| **Cost honesty** | Tier-aware debrief synthesis (skip/template on local — the single largest per-run overhead), meta-tool prompt audit per tier. | **Shipped** — debrief forked off the critical path (~46% faster `run()`) + LLM debrief synthesis now skipped on the local tier (deterministic fallback kept; local synth failed ~52% at ~825 tok/~6s). |
| **Strategy honesty** | Adaptive routing defaults to reactive on local tier; heavy strategies documented per the parity data. | **Shipped** — adaptive defaults to reactive on the local tier (heavy-strategy parity; skips the analysis LLM call) + #195 closed (Compose hooks/killswitches/calibration thread through all 5 strategies, single+batch emit). |

Design spec: `wiki/Architecture/Design-Specs/2026-06-10-durable-execution.md`. **v0.12 milestone complete** — all five durable phases + the DX/honesty tracks shipped in v0.12.0 (2026-06-17). Slipped during 2026-06-16 triage to later lines: #188/#47/#35 → v0.13, #43 → v0.14.

---

## v0.13 — "Receipts" (public launch)

**Goal:** prove the differentiated claims reproducibly, then make noise.

- **Public local-model bench:** same task suite, qwen/llama 7–14B via Ollama — Reactive Agents vs Mastra vs LangGraph.js vs raw AI SDK. First-attempt success + token cost. Model + provider + date pinned, ≥3 seed variance, raw traces published. Stop-the-line rule: if external delta >15% from internal, fix the harness — not the result.
- **Flight recorder, front and center:** "record once, debug forever" — deterministic offline replay of the full provider interaction + `rax-diagnose` root-cause CLI. No observability SaaS required.
- **Cost governance demo:** budget + watchdog + approval killswitches composing on one agent, with our own measured overhead published.
- **Infra:** OIDC trusted publishing (no npm token rotation).
- **Show-HN / launch happens here** — with evidence, not adjectives.

---

## v0.14 — "Compounding"

**Goal:** the capability bet — agents that get better across runs.

- Progress recitation and cross-run experience-reuse, behind ablation gates.
- Sequenced after v0.13 deliberately: lift gets measured on the public bench (≥3 percentage-point rule, ≤15% token overhead), making results publishable rather than anecdotal.

---

## v1.0 — Polish & Release

- Every prior milestone gate re-run on the integrated codebase.
- `README.md` states only validated claims; vision pillar artifact table complete — each of the 8 pillars cites a file, bench number, or doc.
- Snapshot/replay determinism re-validated.
- This doc rewritten: what shipped, what was deferred, what was killed and why.

---

## Strategic positioning

The framework's defensible value, per empirical evidence:

- **Local-first reliability** — per-model runtime calibration, capability-signal tool routing, 4-stage healing pipeline, tier-adaptive context. The same agent code runs on a 4B Ollama model and a frontier model.
- **Control** — every harness primitive is developer-overridable via `.compose(harness)`; killswitches; `RunHandle` pause/resume/stop/terminate; durable resume (v0.12).
- **Observability** — default-on traces/metrics/logs, OTel exporter, deterministic replay + diagnosis CLI — all local, no SaaS coupling.
- **Honesty** — we publish our own overhead numbers and negative results (heavy-strategy parity, falsified optimizations). Claims scope per `01-RESEARCH-DISCIPLINE.md` Rule 11.

What we do **not** yet have:
- Public third-party benchmark (v0.13 gate)
- Production case studies / named users

---

## How to track progress

- **Decision record** — `wiki/Decisions/2026-06-10-roadmap-realignment-v0.12-v1.0.md` (why this sequencing)
- **Hot cache** (`wiki/Hot.md`) — current working state
- **Implementation plans** — `wiki/Planning/Implementation-Plans/`
- **Evidence artifacts** — `wiki/Research/Harness-Reports/`
- **Release tags** are cut by CI from `main`; CHANGELOG entries are authoritative

*Roadmap is rewritten on major releases or strategic shifts — not per commit.*