# The Autonomy Standard (`autonomy.ir.v1`) > **This is the single spec doc.** It defines the standard a profile is written in: the **actor** > unit, the **four catalogs** (capabilities, trigger-param sources, task lifecycle, config), the > **Runner** control seam, and **handoffs** (choreography over tasks + the human seam). It was > consolidated from six separate docs — AUTONOMY-IR, CAPABILITIES, TRIGGER-PARAMS, TASK-LIFECYCLE, > RUNNER, HANDOFFS — which are now the sections below. Cross-references between them are intra-doc > anchors. Strategy/why lives in `docs/VISION.md`; the rules in `docs/CONSTITUTION.md`. **Sections** 1. [The IR](#the-ir) — the actor model, the four slots, the four catalogs, conformance. 2. [Capabilities](#capabilities) — the authority model + the merge boundary. 3. [Trigger params](#trigger-params) — the cross-substrate param-source vocabulary. 4. [Task lifecycle](#task-lifecycle) — the state vocabulary the orchestrator reads. 5. [The Runner](#the-runner) — the universal actor-control seam. 6. [Handoffs](#handoffs) — choreography over tasks + the human seam. --- ## The IR `autonomy.ir.v1` — the standard. > **Status:** finalized model; the codebase is being aligned to it. The terms `workflow`, > `launch`, `run`, `raw`, `steps`, `box.model`, `skill|script`, `commit|propose` are **retired** — see > "What this replaces" at the end. If the code still shows them, the code is mid-migration, not the spec. > > **Actor model (current):** the one unit is the **actor** (`kind: agent | human`). Triggers are > `cron` | `event` | `dispatch` — the two PORTABLE kinds are `cron` (time) and `dispatch` (on-demand via > the Runner, [§The Runner](#the-runner)); `event` is the substrate-native escape hatch. There is **no `task:` > trigger**: a task is a work ITEM whose lifecycle state is a property the orchestrator READS when deciding > what to dispatch ([§Task lifecycle](#task-lifecycle)), never a trigger the substrate watches. ### The shape of the whole thing The IR is a **standard**. It concretely defines one unit (the *agent*) and three catalogs (capabilities, trigger-param sources, config keys). A **substrate** (`gh-actions`, `local`, …) — the agent **runner**, NOT the code host — is a **partial implementation** of that standard, realizing the subset it supports its own way. (`github` is a back-compat alias for `gh-actions`; it conflated the runner with the github code host — prefer `gh-actions`. Runner ⟂ code host: see §"Runner vs code host" and `docs/CODE_HOST_RESOURCES.md`.) **Conformance** reports the support matrix. It is exactly the relationship a web standard has to browsers: the spec is complete and concrete; each implementation supports part of it; a profile using a feature works to the degree its target supports it. ``` IR (the standard) — what exists, precisely. Never how. ↓ compile(profile, substrate) substrate (an impl) — how, for the subset it supports. Declares the rest unsupported. ↓ installation — what runs. conformance — the support matrix across substrates. ``` The core (the standard) **only validates spec-validity and wires** — it never interprets what a capability *does*, where a trigger param is *sourced*, or what a config key *means*. The substrate is the only thing that knows codex, `gh`, PRs, termfleet. ### The one unit: an actor (agent or human) There is no `workflow`, no `launch`/`run`/`raw`. There is one concept — an **actor** — with exactly four slots. An actor has a **kind**: `agent` (a machine participant) or `human` (a person). The four slots are identical for both; `kind` (the **role**) is **intrinsic and declared in the profile**, while *realization* (how the role is filled — a script, a model, a real person, or a **simulator** in test) is the substrate's/environment's choice. `kind: agent` is the default, so existing profiles are unchanged. The slots: ```yaml schema: autonomy.ir.v1 targets: [gh-actions, local] actors: developer: # kind: agent (the default) behavior: skills/developer # what it does — a SKILL (prose); run as a credentialed job capabilities: [code:propose, tasks:converse] # its authority (the standard's capability catalog) triggers: # when it fires; three forms — cron | event | dispatch - { dispatch: true, params: { ISSUE: subject.ref } } # portable: launched on demand by the orchestrator - { event: issue_comment } # substrate-native escape hatch timeout: 30 # a run-time bound (minutes) — the only non-capability field maintainer: # kind: human — a person; intrinsic, not a substrate choice kind: human behavior: humans/maintainer-review # the task spec the person is handed (situation / decision / result) capabilities: [tasks:converse, code:review] triggers: [{ dispatch: true }] # engaged on demand when the orchestrator routes work to a person planner: behavior: skills/planner capabilities: [tasks:author, tasks:converse] triggers: [{ cron: "17 6 * * *" }] policy: { box: {} } # governance (merge/risk/…); substrate + agents read what they know resources: [docs/standards/code.md] # verbatim files; the standard never interprets them ``` | slot | what it is | who reads it | |---|---|---| | **behavior** | what the actor does — a SKILL (prose); a `kind: human` actor's is the task spec a person is handed | the substrate *realizes* it: `kind: agent` → a credentialed job runs the skill via a model; `kind: human` → a real person (prod) or a simulator (test) | | **capabilities** | the actor's authority — from the capability catalog ([§Capabilities](#capabilities)); realized as the agent's own scoped token | the substrate realizes each as a permission on that token | | **triggers** | when it fires + the **params** it forwards. Three forms: `cron` (time), the portable `dispatch` (on-demand via the Runner — [§The Runner](#the-runner)), and the substrate-native `event` | the substrate's trigger executor; `cron` and `dispatch` are portable, `event` is carried | | **timeout** | optional run-time bound (minutes) — the only non-capability field | the substrate's job timeout | An actor also carries a **kind** (`agent` | `human`, default `agent`) — a discriminator, not a fifth slot. `kind` (the role) is the profile's; *realization* (how the role is filled — model/person/simulator) is the substrate's (see Kind/realization below). `policy` (global governance) and `resources` (verbatim files) sit at the top level. That is the entire IR. ### The four catalogs (the standard's concrete vocabulary) A profile depends **only** on these named vocabularies, never on a substrate's raw shapes. New entries are added to a catalog first, then implemented by substrates — purely additive, never a restructure. 1. **Capabilities** ([§Capabilities](#capabilities)) — the agent's *authority*, over three nouns (code · tasks · agent): `code:propose` · `code:review` · `code:merge` (gate-only) · `tasks:author` · `tasks:converse` · `agent:launch|list|update|cancel`. A capability IS a grant on the agent's own scoped token. 2. **Trigger param sources** ([§Trigger params](#trigger-params)) — what a trigger can forward to the agent: `subject.ref` · `subject.actor` · `subject.text` · `trigger.kind`. A trigger declares `params: { OPAQUE_NAME: source }`; the substrate resolves the source from its firing context. 3. **Agent fields** (no opaque config box) — the only non-capability field is `timeout`: | field | meaning | github | local | |---|---|---|---| | `timeout` | minutes before kill | job `timeout-minutes` | runner kill-after | There is **no** `config` box. Everything the box once carried is now either a capability (authority), substrate-DERIVED (the workflow filename = `.yml`; the model endpoint is provisioned for every skill agent), or simply gone (the trust/credential knobs — trust is the capability/permission split, below). The model budget is the bounded mint (a substrate concern, not an IR field). Substrate-specific github knobs (`workflowFile`/`persistCredentials`/`permissions`/`env`/`concurrency`) were leaks and are removed: a github permission set is *computed* from capabilities, never written in the IR. 4. **Task lifecycle** ([§Task lifecycle](#task-lifecycle)) — the states a work item can be in (`open` · `ready` · `working` · `in-review` · `input-required` · `blocked` · `done` · `rejected`). This is vocabulary the **orchestrator reads** off a work item to decide what to dispatch — **not** a trigger. A task's state is a property; the PM (on `cron`) reads it and `launch`es the matching worker (a `dispatch` agent). A **handoff** (a seam, [§Handoffs](#handoffs)) is a typed edge over these states — an upstream actor's work produces a transition the orchestrator observes and acts on. No substrate watches task state. ### Kind, realization, trust, review — orthogonal axes (only `kind` is in the IR) This is the distinction that took the longest to get right, so it is stated explicitly: - **Actor kind** (`agent` vs `human`) — the **role**: intrinsic, declared in the profile (the one of these that *is* an IR field — the `kind` discriminator). You cannot turn a human *role* into a permanent script; that would be a *different org design*. `kind` says *who* the actor is — a different axis from realization (*how* the role is filled). - **Realization** (how the role is filled) — the **substrate's/environment's choice**, not in the IR. For `kind: agent`: a credentialed job runs the skill via a model. For `kind: human`: a **real person** in production, or a **simulator** in a testbed — *same profile, different environment*. Filling the same role differently per environment is what makes an org with human actors **testable** ([§Handoffs](#handoffs)). - **Safety** (can a hijacked agent do harm?) — the **capability/permission split**, not mediation. The agent acts directly with a token scoped to its capabilities; the one irreversible power — merge — is withheld from every agent (`code:review` = bless via status, `code:propose` = push, never both), so no agent can land unreviewed code ([§Capabilities](#capabilities)). There is no credential-less job, no bundle, no trusted publisher. Safety is capabilities + budget — not an IR trust field. - **Change review** (does the resulting change get reviewed before merge?) — the `code:review` status + branch protection (`ci` + `agent-review` required); native auto-merge lands it. ### The substrate: a trigger executor + a runner, over a box A substrate factors into two implementables over one shared environment: 1. **Trigger executor** — fires an agent when its triggers say so and forwards the declared `params`. Decides *when*. Only `cron` is portable; events are carried and fired where supported. 2. **Runner** — runs agents and manages their lifecycle (the Runner contract, below), launching each into a box. over 3. **The box** — the environment an agent runs in: POSIX fs + shell + git + **a model endpoint** + the installed files. The model endpoint is **always** part of the box (a deterministic agent simply never calls it) — there is no "does it get a model" knob. On **local** the two are separate (the loop fires; termfleet runs). On **gh-actions** one platform fills both (Actions `on:` fires; the Actions job runs). An agent never sees the trigger executor — from its seat there is only the runner and the box. **Substrate config lives in the box, not the engine.** A substrate reads its installation config from `policy.box.` (e.g. `policy.box.gh-actions`: the model-proxy host, OIDC audience, model, bot git identity — keyed by the runner name, not the github code host). The engine bakes in **no** org identity — a profile supplies these and the compiler emits them as the install's `vars.*` defaults. Likewise `policy.box.risk.human_required_paths` is materialized **verbatim** for the human-approval gate to enforce: the substrate *carries* policy, it never authors or augments it. **Runner vs code host.** The substrate is the agent *runner* — where the fleet executes — and it is orthogonal to the **code host** (github: where the repo lives and `ci` / `security` / `deploy` run). A local-substrate org still has a github code host, so those CI/security/deploy workflows are **code-host resources** carried by the profile (constant across runners, like the standards docs); only the per-agent workflows are *generated* by the runner substrate. See `docs/CODE_HOST_RESOURCES.md`. #### The Runner contract The runner knows only **agents and their lifecycle** — no work, issues, or domain. `launch` carries **opaque params** through to the agent (the system never interprets them; the agent's tooling does). ```ts type SessionStatus = 'running' | 'paused' | 'cancelled' | 'done' | 'failed'; type LaunchParams = Record; // opaque pass-through interface Session { id: string; agent: string; status: SessionStatus; ref?: string; params?: LaunchParams } interface Runner { launch(agent: string, params?: LaunchParams): Session; // C get(id: string): Session | undefined; // R (one) list(): Session[]; // R (running) update(id: string, patch: { status?: SessionStatus }): boolean; // U cancel(id: string): boolean; // D } ``` The `agent:*` capability axis **is** this contract — an agent with `agent:launch` may launch others; the operator always holds the full contract over a running agent (the control plane). Full detail in [§The Runner](#the-runner). **Isolation is requested explicitly, and the runner stays code-host-blind.** A worker that produces an isolated change is launched with a `--branch ` runner-control param; the runner runs it in that branch's own workspace (a local runner: a git worktree; a gh-actions runner: the job's fresh checkout, which makes `--branch` a no-op). No `--branch` ⇒ the trunk workspace. The runner derives this from neither a capability nor the work item — the caller (the PM) names the branch — and it injects **no** code-host identity: an agent that needs its repo or PR resolves them through its own code-host tool (e.g. `gh api repos/{owner}/{repo}/…`, which `gh` fills from the remote). So the runner never names a code host, on any substrate. ### What is NOT an agent Not everything in an installation is an IR agent. Three kinds sit outside the standard: - **Repo-owned files** (`ci.yml`, README, package.json, docs) — `resources`. The IR never models repo CI. - **Substrate infrastructure** (github's model-proxy admin, the injected runtime, the control handler) — provided by the substrate, not declared in the profile. - **Agents** (developer, planner, pm, reviewer, preflight, …) — the IR. This is also why there is no `raw`: ingest maps a recognized agent to an agent, and anything else is a repo-owned resource — never an escape hatch in the IR. ### Conformance — the support matrix A substrate implements a **core** contract (required — any core-conformant substrate runs any IR) and an **expanded** set it advertises. `scripts/autonomy-conformance.ts` drives the real runner against its real backend and reports `supported`/`unsupported` per feature — extended, under this model, to the whole standard (capabilities, param sources, config keys), not just Runner ops. `compile` warns when a profile uses a feature its target does not support: **partial support is first-class, not failure.** **Runner core (MUST):** `launch` / `list` / `cancel`, ids received (not invented), params passed verbatim. **Trigger core (MUST):** fire `cron` and launch the agent. PM-on-cron is the universal dispatcher, so cron alone yields a working fleet. **Expanded (MAY):** `get`/`update`; enforce `maxConcurrent`/`timeout`/model-bounds/permissions; isolation; event triggers; the operator control surface; the merge boundary (the capability/permission split + native auto-merge). Honored where present, declared unsupported where not. ### Validation — by running it for real, not by tests There are **no unit tests** of behavior. The only real confidence is running the actual app, with real AI, on a real project. The conformance battery is the one deterministic harness (the substrate seam is mechanical). Live-proven to date: the github agent wrapper (privilege-separated, OIDC-minted, trust boundary intact) running a real codex agent end-to-end and opening a PR from a declared trigger param — work resolved purely from `subject.ref`, no implicit event reach-in. ### What this replaces (and why) | retired | replaced by | why | |---|---|---| | `workflow` as a separate noun | the **agent** (carries its own triggers) | "the system's entire knowledge is agents" | | `launch` vs `run` | one agent; execution is the substrate's choice | the split manufactured leaks (issue-driven, always-publisher) | | `raw` | agent, or a repo-owned resource | the IR is a standard; non-agents are files, not escape hatches | | `steps` / an "ABI" of work/change/model | nothing — that logic lives in the agent's behavior | the IR must not know issues, PRs, or models | | `box.model` / `skill` vs `script` | nothing — the box always has a model; execution is the substrate's | those leaked the box's execution model into the IR | | `commit` / `propose` on capabilities | trust = substrate security (derived); review = policy | capabilities are pure authority | | `agent` as the sole unit | the **actor** (kinds: `agent`, `human`) | a person is a first-class participant, not negative space | Every future change is *filling in the standard* — a new capability, param source, config key, or task-lifecycle state, plus a substrate realization. The four-slot **actor** (genus; kinds `agent` and `human`) and the standard/implementation/conformance split are the invariant. --- ## Capabilities The agent authority model. A profile **declares** what each agent may do as substrate-agnostic capabilities. A capability is **not** an instruction handed to a mediator — it **is a grant on the agent's own credential**. The substrate mints the agent a credential scoped to exactly its capabilities, and the agent does its own reads and its own writes **in-process**. Capabilities never name a substrate's resources (no `issue`, `pr`, `branch`, `workflow`); they name only the universal things an agent acts on. There are exactly **two guards**, and nothing else: - **capabilities** — *what* the agent may do (its scoped credential). - **budget** — *how much* it may spend (the bounded model token). ### The three nouns An autonomy agent acts on exactly three things: | noun | what it is | github | local (sketch) | |---|---|---|---| | **code** | the codebase under version control | the repo / branches | the working tree | | **tasks** | units of work + their discussion | issues | a work-store | | **agent** | the other agents + their lifecycle | workflow runs | the loop queue | (`merge` is the tell: it's a version-control operation, so the noun is honestly **code**, not a vague "artifact." That's substrate-agnostic at the *git* level — github and local are both git repos — not a github leak.) ### The capabilities | capability | meaning | github realization | |---|---|---| | `code:propose` | propose a change (write a feature branch, open a PR, queue auto-merge, dispatch CI) | `contents: write` + `pull-requests: write` + `actions: write` | | `code:review` | **bless** a change for merge (post the verdict that gates landing) | `statuses: write` (posts the `agent-review` status) | | `code:merge` | **land** a reviewed change onto the default branch | **never granted to anyone** — landing is native auto-merge (see the merge boundary) | | `tasks:author` | create / update / label / set state of work | `issues: write` | | `tasks:converse` | post comments / verdicts on work and changes | `issues: write` (comment scope) | | `agent:launch` | start another agent | `actions: write` (dispatch) | | `agent:list` | observe running agents | `actions: read` | | `agent:update` | pause / resume / retry another agent | control plane | | `agent:cancel` | stop another agent | `actions: write` + control plane | `observe` (read the code and tasks) is **baseline** — every agent has it; it is not a declared capability. Reads are bounded by the agent's sandbox + budget, never by a permission. The `agent:*` axis is exactly the **Runner contract** (`core/runner.ts`: launch / list / update / cancel over sessions) — the substrate-agnostic definition of the agent lifecycle ([§The Runner](#the-runner)). **Scope (optional).** A capability may carry a resource scope: `code:propose@roadmap` = "propose changes to roadmap files only" (the strategist's governance constraint, expressed as a scoped capability, not a deterministic guard script). The constitution's `human_required_paths` / `topics` are the global complement — the region **no** capability may ever reach. ### The trust model: agents are credentialed; only merge is gated The agent runs with a credential **scoped to its capabilities** and acts directly. There is no credential-less job, no bundle, no trusted publisher mediating its output. The one threat that justifies a boundary is **prompt injection** via untrusted input (issue bodies, fetched pages, fork diffs) — and that justifies a boundary only for the **irreversible, default-branch-affecting** power: > **The single hard boundary: an agent can never merge.** `code:merge` is never grantable to an agent. Everything an agent *can* do is recoverable, because the substrate is configured so a hijacked agent cannot reach `main`, workflows, or secrets: - **branch protection** on the default branch blocks direct push (a feature-branch PR is the only way in); - the github **`workflows` permission** is never granted, so `.github/workflows` can't be edited even with `contents: write`; - merging requires **two status checks — `ci` + `agent-review`** — and no single agent can produce both (see below); - the install holds **no secrets** (the model token is OIDC-minted and bounded — that is the budget guard). So a fully-hijacked `code:propose` + `tasks:converse` agent can, at worst, push junk to a feature branch or post a bad comment — both reverted in seconds, neither touching `main`. (This is a deliberate, small relaxation of the old "agent holds nothing" model; the cost is recoverable feature-branch/comment noise, the gain is that agents are real agents instead of envelopes passed to a mediator.) #### The merge boundary — a permission split, no app, no merge job Landing on the default branch is gated by **two non-overlapping permission sets**, so no single agent can land unreviewed code — and the merge itself is **GitHub native auto-merge**, not a token or an app: - **`code:review` = `statuses: write`** — the authority to *bless* a merge (post the `agent-review` verdict status). The reviewer holds this and **not** `contents: write`, so it can certify but cannot merge. - **`code:propose` = `contents: write`** — the authority to push a branch / open a PR / queue auto-merge. Proposers hold this and **not** `statuses: write`, so they can push but cannot self-certify a review. Branch protection requires `ci` + `agent-review` (0 approvals — so there's no self-approval problem and no app is needed). The proposer enables auto-merge when it opens the PR; **GitHub** lands it the instant both statuses are green. Consequences: - a hijacked **proposer** can't post `agent-review` → can never land anything the reviewer didn't bless; - a hijacked **reviewer** has no `contents: write` → can't merge or push at all; - **no agent holds `code:merge`** — it isn't a token capability; the platform performs the merge. `code:review` (bless) and the merge (perform) are deliberately separated; no agent holds both. That split — not a dedicated app or a trusted gate job — *is* the merge boundary. #### The deploy boundary — the merge boundary's sibling, at the production edge Merge guards what reaches `main`; **deploy** guards what reaches *production*, and the same principle holds one step out: **no agent deploys.** Deploy is not an agent job and not a capability any agent holds — it is a human-promoted, gated effect realized by the **code host** (github CI), independent of which substrate runs the agents (a local-substrate org still deploys via its github repo — deploy is a code-host concern, not a runner one). The realization: - deploy fires only on a **human-cut promotion tag** (e.g. `deploy-v*`), restricted by repo ruleset to admins — the fleet's `contents: write` cannot create it; - the deploy job runs in a **required-reviewer environment** (a maintainer approves each deployment; admin-bypass off); - the deploy *workflow itself is a code-host resource* carried by the profile (like `ci.yml` — not engine output, see `docs/CODE_HOST_RESOURCES.md`), and the worst case is bounded **outside** the trust loop (provider-side spend caps + instant rollback), since the agents are funded by what they could deploy. So merge and deploy are the two production boundaries: an agent may propose code and an agent may bless a review, but **no agent lands on `main` and no agent ships to production** — each requires the human/native gate. ### The agent lifecycle (what replaced prepare / interpret) ``` provide → skill → effect (substrate hands the (judges; emits result = (the agent's own scoped trigger's subject in) intent in capability terms) actions — direct, in-process) ``` - **provide** — the substrate materializes the trigger's *subject* (a PR's diff+checks, an issue, …) into the sandbox. Generic; the only variable is which subject, declared by the trigger. - **skill** — does the work and emits its typed `result`. - **effect** — the agent invokes its own capabilities directly. There is no merge step to route to: the reviewer posts `agent-review` (`code:review`), the proposer queued auto-merge, and GitHub lands it. There are no `prepare` / `interpret` scripts and no `config` hooks: "input gathering" is `provide`; "acting on the result" is the agent using its own capabilities. ### github realization — the mapping is the whole story The github substrate computes the agent job's `permissions:` block straight from its capabilities (the table above). That is the entire realization: a normally-credentialed job, scoped. No wrapper of trusted jobs around a credential-less core — the agent IS the job. There is no merge gate job and no app: landing is native auto-merge gated by the `ci` + `agent-review` required checks, and the permission split keeps any one agent from satisfying both. `config.permissions` does not exist; gh permission blocks are *computed*, never written in the IR. ### The OA agents, declared in this model | agent | capabilities | |---|---| | pm | `tasks:author`, `tasks:converse`, `agent:launch` | | developer | `code:propose`, `tasks:converse` | | reviewer | `code:review`, `tasks:converse` (posts `agent-review`; no `contents` → cannot merge) | | strategy_reviewer | `code:review`, `tasks:converse` (blesses a roadmap proposal; cannot merge) | | planner | `tasks:author`, `tasks:converse` | | strategist | `code:propose@roadmap`, `agent:launch` | No agent holds `code:merge`. ### What is NOT a capability - **Observation** — baseline; reads are bounded by the sandbox + budget, not a permission. - **Model access / budget** — the bounded model token (the budget guard), provisioned by the substrate; the IR declares the `budget`, not the credential. - **Trust mediation** — no longer a concept. The old "untrusted agent → bundle → trusted publisher" design is replaced by scoped credentials + the merge boundary (the `code:review` / `code:propose` permission split + native auto-merge). There is no trusted mediator, no merge gate job, and no app. --- ## Trigger params The cross-substrate contract. A trigger fires an agent and **forwards params to it** — the producing end of the Runner contract's `launch(agent, params)` (opaque `LaunchParams`). This is how an agent learns *what to act on* — explicitly, from declared config, **never** by reaching into a substrate's implicit event context. ### The shape In the IR, a trigger may declare `params`: ```yaml triggers: - event: issues config: { types: [labeled] } params: { ISSUE: subject.ref } # opaque name -> documented source ``` - **Param name** (`ISSUE`) — the profile's choice. **The core never interprets it**; it only wires it through to `launch(agent, params)`. - **Source** (`subject.ref`) — drawn from the **documented vocabulary below**. Every substrate MUST be able to resolve each documented source from its own firing context. The agent receives the resolved params (github: as job env; local: as `AUTONOMY_FORWARD` env) and its **tooling** interprets them (`gh` on github, ztrack on local). The substrate's own runtime may also use a resolved source for its realization (github fetches the `subject.ref` work item to bundle/PR it) — but that is the substrate reading the *documented* source, not implicit event magic. ### The source vocabulary (every substrate must implement these) | source | meaning | github resolves from | local resolves from | |---|---|---|---| | `subject.ref` | id of the work item that fired the trigger | `event.issue.number` / `event.inputs.issue_number` / `event.pull_request.number` | work-store item id | | `subject.actor` | who initiated it | `event.sender.login` / `github.actor` | requester | | `subject.actorRole` | the actor's authority over the project (for gating privileged commands); empty if N/A | `event.comment.author_association` (OWNER/MEMBER/COLLABORATOR/…) | requester's role | | `subject.text` | the text that fired it (comment/body); empty if N/A | `event.comment.body` / `event.issue.body` | queued message | | `trigger.kind` | why it fired | `event.action` / `event_name` | queue event kind | A source a substrate cannot resolve for a given trigger resolves to empty — the agent's tooling decides what to do with that. New sources are added here first, then implemented by each substrate; profiles depend only on this vocabulary, never on a substrate's raw event shape. ### How github realizes it (reference) `compileGithub` unions an agent's declared trigger params, resolves each source via the table above into the `setup` and agent job env (keyed by the opaque param name), and the agent fetches its work item from the `subject.ref` param via `gh` — replacing the old implicit `$GITHUB_EVENT_PATH` reach-in. The run id is deterministic per run, so no params are threaded between jobs. --- ## Task lifecycle The cross-substrate state vocabulary. `tasks` is one of the three nouns ([§Capabilities](#capabilities)). The IR already models *authority over* tasks (capabilities) and *triggers on* tasks (events), but it did **not** model the **state** of a task. This catalog adds that — a small, portable set of lifecycle states — so a trigger or a handoff can name *the state a task is in* without reaching into a substrate's raw events or label strings. It is a catalog, peer to [§Capabilities](#capabilities) and [§Trigger params](#trigger-params): purely additive. The state lives in the profile's **tracker** (a github issue, a ztrack item) and is **read by the orchestrator** — it is not realized per substrate, because no substrate watches it. ### The states | state | meaning | |---|---| | `open` | created, not yet triaged | | `ready` | triaged, ready for an actor to work | | `working` | an actor is acting on it | | `in-review` | a change is proposed, awaiting review | | `input-required` | blocked awaiting input from a named party (see below) | | `blocked` | cannot proceed (policy / repeated failure / budget) | | `done` | completed | | `rejected` | terminal, not done (duplicate / spam / wontfix / failed) | A profile maps these to its tracker's own state names (e.g. ztrack's `Ready` / `In Progress` / `In Review` / `Done`); the orchestrator reads them there. `input-required` carries a **from** in the seam (who must supply the input): `requester` (OA's existing `needs-info`) or `maintainer` (OA's existing `human-required`). The state is portable; the *who* is part of the handoff payload, not a separate state. These are not new inventions — they consolidate vocabulary OA already uses (`needs-info`, `human-required`, `agent-blocked`, and the stop-states in `ROADMAP.md`) into a portable set. ### How the orchestrator uses it A task's state is a **property the orchestrator reads** — it is **not** a trigger. There is no `task:` trigger and no substrate that watches task state ([§The Runner](#the-runner)). The dispatcher (the PM, on `cron`) reads each work item's state off the tracker and `launch`es the matching worker (a `dispatch` actor) through the Runner: ``` PM tick (cron) → read board → issue is `ready` → launch the developer (dispatch) with the item as --ref ``` This is why the lifecycle is portable without any substrate machinery: the only primitives the substrate must provide are `cron` (time) and `launch` (the Runner) — both universal. The substrate-native `event:` form remains as an escape hatch (partial-support is first-class — see [§The IR](#the-ir)). ### How a handoff uses it A handoff (a **seam**, see [§Handoffs](#handoffs)) is a typed edge over this lifecycle: an upstream actor's work *produces* a state transition; the orchestrator *reads* it and dispatches the downstream actor. The lifecycle is the shared vocabulary that makes the producing and consuming ends name the same thing. New states are added here first. A profile's skills depend only on this vocabulary (mapped to their tracker's own states, e.g. ztrack), never on a substrate's labels or event names. ### Done is verified, not presumed A task — agent or human — reaches `done` only when its **acceptance criteria (AC)** are *verified*, by a **deterministic check and/or an AI-judge check** (the reviewer agent is OA's existing AI-judge for agent work). There is **no `presumed-done` transition**: an elapsed timer or a sent notification never makes a task `done`. A triggered task with no verified result is `pending` (or `blocked`/`failed`/escalated), never `done` — otherwise a task with no result is silently counted complete. This applies to humans too: a human is an untrusted, opaque actor (like a model agent), so the *claim* "I did it" is validated by a check on the **effect**, not taken on faith. The check verifies the effect; it cannot verify diligence (a human can rubber-stamp, an agent can be right by luck) — that residue is covered by **accountability** (an attributable, on-record decision), not verification. A human touchpoint is therefore exactly one of two things: - **Verified task** — has an AC + check (deterministic and/or judge). Reliable outcome → it *can block* the flow (resume on verified done) → it *counts* as human work → it *reduces autonomy* (the org waited on a person). The resolution must be an **explicit, authorized act** (e.g. an `/agent approve` command gated by `subject.actorRole`, or a native review) — a closed loop, not a value inferred from prose. - **Notification** — no AC (`presumed-done`). No reliable outcome ⇒ it is **fire-and-forget**: it *must* be non-blocking, must *not* be counted as completed work, and does *not* reduce autonomy. This is a legitimate, declared mode — "as good as a notification" — you just may not pretend it is more. The **ask type** decides which is required: `inform` → notification (no AC); `do` / `decide` / `approve` → outcome required → AC + check mandatory. The forbidden middle is a task that gates or counts on a human but has no AC — that fabricates completion. (The lifecycle rule: an unanswered handoff is `humanPending`, and a flow is `complete` only when no handoff is left unresolved.) --- ## The Runner The universal actor-control seam. The Runner is the one seam through which the system **runs, lists, and stops actors**. It is the agent graph's control plane: an orchestrator (e.g. the PM) never reaches into `gh`, `termfleet`, Slack, or a person directly — it calls `run` / `list` / `stop` and the Runner realizes them. This is peer to [§Capabilities](#capabilities) (authority over the nouns) and [§Task lifecycle](#task-lifecycle) (the work-item state vocabulary). ### The nouns, kept distinct - **task** — a *work item* (a github issue, a ztrack item). Its lifecycle (`ready`/`in-review`/… — [§Task lifecycle](#task-lifecycle)) is a **property of the task**, read by the orchestrator. It is **not** a trigger. - **actor** — `agent | human` (the IR unit, [§The IR](#the-ir)). What *does* the work. - **action / run** — *an actor working a task*. The thing you observe "right now" ("a `develop` agent is running on #5"; "a `maintainer` approval is pending on #3"). An action is what `list` returns. ### The interface (`packages/core/src/runner.ts`) All verbs are **async** (return `Promise`s) — a backend may talk to a provider over the network: ``` launch(agent, params?) -> Promise # C — start/engage an action; returns a Session get(id) -> Promise # R — one list() -> Promise # R — in-flight update(id, {status}) -> Promise # U — apply a status transition cancel(id) -> Promise # D — stop / retract ``` `Session = { id, agent, status, ref?, params? }`; `params` is opaque pass-through (the runner never interprets it). Agent realizations may additionally stream logs (a `watch`); human realizations cannot. Completion is **not the runner's call to invent** — `done` is reached only by an `update` carrying a **verified** result (an AC + a deterministic and/or AI-judge check), never presumed from a timer or a sent notification ([§Task lifecycle](#task-lifecycle), "done is verified, not presumed"). Until then a session is `running`; it may end `cancelled` or `failed`, never silently `done`. ### One interface, realized by (actor kind × substrate) | realization | launch | list | update / cancel | watch | |---|---|---|---|---| | **agent × github** | `gh workflow run` (workflow_dispatch) | `gh run list` | `gh run cancel` | run logs | | **agent × local** | termfleet SDK `createAgentWindow` | `snapshot().windows` | `closeWindow` | tail | | **human × any** | **engage** (record the action; an optional black-box backend notifies a person) | **in-flight asks** | `update` = apply the verified resolution / `cancel` = retract | **— none —** | The orchestrator calls the same verbs regardless of kind; the actor's `kind` selects the realization (and, for agents, the substrate selects the backend). ### The human runner is a black box You cannot *execute* or *watch* a person, so the human realization is the Runner's degenerate twin: it implements the same `launch`/`get`/`list`/`update`/`cancel` but **has no `watch`** — you can't look over someone's shoulder; the only progress you ever observe is the **completion boundary**, applied via `update(id, {status:'done'})` by an authorized verified act (e.g. `/agent approve` gated by `actorRole`, a native review, or supplied info a check validates). The no-op floor (`HumanRunner` with no `engage`) is pure **bookkeeping**: `launch` records the parked action and it stays `running` forever — it never sets `done` itself. How the ask is delivered and the reply detected — Slack, github issue comments, email, an agentic notifier — is an **opaque, swappable `engage` backend**; the runner only ever exposes the five verbs + session status, never the channel. ### Consequences - **`dispatch`** is the IR trigger meaning "this actor is invoked on demand through the Runner" (vs the autonomous `cron`/`event` triggers); `kind` picks agent-execution vs human-engagement. - There is **no `task:` trigger** — `task` is the work item; a lifecycle state is its property, which the orchestrator reads when deciding what to `launch`. - There is **no separate ledger or steward** — the "ledger" is `list()`; the orchestrator (the PM) is the single place that launches agents, engages humans, and resumes on either's verified completion, applying capacity / retry / backpressure uniformly because every dispatch flows through it. ### Status - One `Runner` contract (`packages/core/src/runner.ts`): `launch`/`get`/`list`/`update`/`cancel`. - Agent realizations: `ExecRunner` (reference), Termfleet (local), Github — built + conformance-tested. - Human realization: `HumanRunner` — **built** as the no-op (bookkeeping) floor that conforms to the same contract; a notifying `engage` backend and PM wiring are the `actor-model-human-handoffs` next steps. --- ## Handoffs How actors trigger each other (and humans). > **Status:** design note feeding H1 (`docs/VISION.md`). Grounded in established prior art, not invented. > Defines how participant-to-participant handoff works in OA, and what is missing to make it explicit, > typed, and substrate-neutral. Companion to [§The IR](#the-ir), [§Task lifecycle](#task-lifecycle), > [§Capabilities](#capabilities), and [§Trigger params](#trigger-params). ### The core fact: actors don't trigger each other — they trigger `tasks` In OA an actor doesn't call another actor. It changes the state of a **task**, and the next actor's trigger fires on that change. The profile already works this way: PM labels an issue → `developer`'s trigger fires; `developer` opens a PR → `reviewer`'s trigger fires. Nobody named the developer or the reviewer; the task state change is the handoff. This is a named, well-studied model — **choreography** (no central conductor; each participant reacts to state others leave), implemented as a **blackboard** (participants coordinate through shared state, never by calling each other), whose formal semantics are a **Petri net** (a token in a place *enables* the next transition). The task — the issue/PR and its lifecycle — is the token. The work-store you don't need to invent is just `tasks`, made stateful ([§Task lifecycle](#task-lifecycle)). Consequence: **OA needs no agent-to-agent messaging, no orchestrator, no new protocol.** Handoffs flow through the shared, visible `tasks` state, which is also the audit trail. `agent:launch` (the Runner contract) remains the *orchestration escape hatch* for a direct, named transfer — used sparingly, as PM already does (a visible command comment for audit + a direct dispatch for reliable delivery). ### The two axes (the whole design space) Every system studied — microservices, actors, Kanban, workflow engines, classical MAS, LLM frameworks — sits in a 2×2: - **Axis A — who decides the next actor: orchestration vs choreography.** Central command to a *named* target (visible flow, but coupling / a "god service" risk) vs decentralized reaction to *state/events* (decoupled, but the flow is implicit). *(Richardson, Newman, Fowler.)* OA is choreography by default — and choreography's usual weakness (invisible flow) does not bite, because the flow lives in visible `tasks` and **typing the seams makes the otherwise-implicit graph explicit.** - **Axis B — who initiates: push vs pull.** Upstream shoves work down (no backpressure; queues blow up as utilization → 1, per Little's Law / Kingman) vs downstream *claims* when it has capacity, gated by a token. **Pull is the only mode with intrinsic backpressure.** *(Hopp & Spearman, TPS/Kanban.)* OA's `maxConcurrent` + `max_open_agent_prs` + PM-sweeps-when-capacity-allows is already pull + WIP limits — lean into it as the stability mechanism. Four archetypes fall out; OA uses two: **choreography + token-enabled** (default, over `tasks`) and **directed command** (the `agent:launch` escape hatch). For *content*, OA follows the declarative lineage: **typed task, not whole-context** (Contract Net's announcement, A2A's `Task`, LangGraph's `Command(update, goto)`) — not the LLM-framework habit of shipping the whole transcript. ### The unit: an actor (agent or human) A handoff target may be a machine or a person, so the one unit is an **actor** with a `kind` ([§The IR](#the-ir)). The four slots are identical for both: | slot | `kind: agent` | `kind: human` | |---|---|---| | behavior | script / skill | a task spec for a person (situation / decision / result) | | capabilities | artifact/tasks/agent authority | *same vocabulary* — what the person may do | | triggers | cron / event / `dispatch` | *same* — the person is `dispatch`ed (engaged on demand) when the orchestrator routes work to them | | config | timeout, model, … | assignee/candidates, escalation, sla, decision (RACI) | | realization | substrate runs it (deterministic / model-interpreted) | a real person (prod) or a **simulator** (test); notifies + escalates + blocks until the token is redeemed | `kind` (the **role**) is **intrinsic and declared** in the profile; *realization* (how the role is filled — script, model, real person, or a **simulator** in test) is the substrate's/environment's choice. (This corrects the earlier "human-interpreted = a third execution mode" framing — `human` is a kind; person/simulator are realizations of it.) ### The seam: a typed edge over the lifecycle A **seam** is the typed handoff between an upstream actor's output and a downstream actor, mediated by the orchestrator reading the task lifecycle: ``` upstream actor ──produces──▶ task enters state S ──orchestrator reads S──▶ dispatches the downstream actor ``` The seam carries a typed payload — validated by the structured-handoff research (clinical SBAR / I-PASS): - **in** — what the upstream presents (situation + background + assessment). - **decision** — what is asked (the RACI/DACI type: do-the-work / decide / approve / consult / inform). - **out** — what is returned to resume, **with receiver confirmation** (I-PASS "synthesis by receiver": the handoff is not complete until the receiver confirms — a closed loop). For agent→agent, the producing side is often implicit today (the behavior sets the state); typing it explicitly is later work (the seam graph that the twin reads). For the human seam it is essential. ### The human seam = the same seam + four affordances Triggering a human is triggering an actor **plus** the things humans need because they have unbounded latency and no polling loop. Each maps to a clean piece of prior art: 1. **Durable, indefinite pause + redeem handle** — the flow blocks until the human redeems a token. *Analog: AWS Step Functions `waitForTaskToken` / Temporal Signals.* 2. **A worklist they pull from + a push path** — offer-to-many → claim. *Analog: BPM candidate-group + claim / Camunda external-task fetch-and-lock; van der Aalst resource patterns (offer vs allocate, push vs pull).* 3. **An escalation policy** — notify → ack-or-timeout → escalate → rotate to whoever is on-call now → repeat. Two axes: escalate if not **acknowledged**, re-trigger if not **resolved**. *Analog: PagerDuty / Opsgenie escalation policies.* 4. **A structured payload + closed loop** — the seam's `in`/`decision`/`out` + receiver confirmation. *Analog: SBAR / I-PASS; RACI/DACI for the decision type.* The seam is identical; the `kind: human` realization adds these. A2A's `input-required` task state is the model for "needs more info mid-task," including when that info comes from a person. ### Testing actors / simulating humans An org with human actors must be testable **without** real humans — otherwise the Bench leg can't run on any realistic org (autonomy ratio < 100% always). So human simulation is a *precondition* for Bench, not a convenience. The actor model makes it work because the **seam is the substitution boundary**: anything that honors a seam can fill the role. The *same profile* runs in production with people and in a testbed with **simulators** — only the substrate's *realization* of `kind: human` actors differs (realization is the substrate's/environment's concern, not the profile's). Three properties make a human actor simulatable, and the design already provides them: - **A typed, machine-producible payload.** A simulator must consume `in` and produce `out`. A free-form prose handoff is not simulatable — so testability is an *independent* reason the seam payload is typed (`in`/`decision`/`out`), not just human-readable. - **A redeem handle decoupled from identity.** The flow blocks on a token *anyone holding it* can redeem (the Step Functions `waitForTaskToken` model). A person or a simulator resumes the flow identically. - **Realization supplied by the environment.** The testbed is a test realization of the substrate; it supplies simulators for `kind: human` actors via a testbed-level fixtures file (`actor → simulated behavior`). The profile stays environment-agnostic. Human simulators come in tiers, by use: - **Fixture** — deterministic ("maintainer approves after 1h unless the diff touches workflows"); for reproducible proof/unit tests. - **Distributional** — samples latency + approve/reject from a distribution; feeds the **twin** (which needs distributions, not averages). - **Model-roleplay** — a model plays the role per a persona/rubric; for rich bench scenarios. Simulators are **calibrated from real human-seam measurements** (H3), and the twin↔testbed division applies to humans too: the **simulator is the cheap screen; the real human in dogfood is the ground truth** that calibrates it. Two cautions: an *optimistic/uncalibrated* human sim yields fitness numbers that don't reflect reality (same trap as averages-not-distributions); and optimizing an org against a predictable simulator invites **Goodhart** (designs that exploit the sim). So: sims for screening, real-human dogfood for truth. This is distinct from hand-driving the autonomy: a deterministic simulator *substitutes a human input* so the autonomy runs unattended and reproducibly — it does not drive the autonomy. A deterministic sim is in fact *better* for measurement validity than a real operator, which would contaminate the run. ### What changes (and what doesn't) **Doesn't change** — the mechanism is already right: choreography through `tasks`; `agent:launch` as the escape hatch; `maxConcurrent`/`max_open_agent_prs` as pull/WIP backpressure; the four-slot unit, the capability model, the trust/wrapper split. **The real delta** — the substrate-coupled, untyped part is the *trigger* (today it names raw github events: `event: issues`, `pull_request_target`). The changes make the handoff edge portable, typed, and give the human edge a declared consumer: 1. **`tasks` lifecycle catalog** — state vocabulary the orchestrator reads ([§Task lifecycle](#task-lifecycle)). *Additive.* 2. **The `dispatch` trigger form** — an actor invoked on demand through the Runner (`{ dispatch: true; params? }`); `cron` + `dispatch` are the portable kinds, `event:` stays as the escape hatch. The orchestrator (PM) reads a task's state and `launch`es the matching actor — no substrate watches task state. 3. **The actor model** — `kind: agent | human` on the unit; `kind` declared, not inferred. The `kind: human` realization (worklist + escalation + durable-pause + payload) is the main new substrate build. 4. **Migrate `human_required` from a risk flag to a declared consumer** — `policy.box.risk.*` stays the *producer rule* that transitions a task to `human-required`; add a `maintainer` actor of `kind: human` that the orchestrator `dispatch`es when it reads that state, whose `out` (approve/reject, confirmed) resumes the merge gate. The side-effect becomes an explicit, typed handoff. **Later (H4, the twin):** declare the *producing* side too, so the seam graph is explicit and measurable. ### The forks 1. Keep `event:` as a substrate-native escape hatch (recommended — partial support is first-class), with `cron` + `dispatch` as the portable path. 2. Payload *content* lives in the human behavior spec (opaque, like every behavior); only the `decision` *type* surfaces as a config key so the substrate routes approve vs consult vs inform. 3. *Holding* and reconciling task state (H2) and the producing-side seam graph (H4) are separate, later horizons — H1 needs only the lifecycle *vocabulary* + the `dispatch` trigger + the `kind: human` actor. ### Incremental proof The `maintainer` actor ships as a `dispatch` `kind: human` actor: the orchestrator (PM) reads the `human-required` state off a task and `launch`es the maintainer — the same portable seam on every substrate (the `human` realization — worklist + escalation + durable-pause — is the new build). No github-label-watching trigger is involved; task state is a property the orchestrator reads, not an event.