# ATM CLI Architecture

## 1. Overview

The current target architecture keeps the ATM CLI surface, but moves durable mail and roster ownership to SQLite and reintroduces one tightly-bounded singleton daemon runtime for routing, notification, transport, and runtime health/state queries.

The current workspace contains:

- `atm-core`: reusable service library
- `atm`: CLI binary

The Phase Q target workspace remains intentionally small and adds:

- `atm-daemon`: daemon runtime binary / transport host
- `atm-rusqlite`: first concrete SQLite store implementation

The CLI stays thin. Product logic moves into `atm-core`.

The retained command surface is:

- `send`
- `read`
- `ack`
- `clear`
- `log`
- `doctor`
- `teams`
- `members`

## 1.1 Documentation Structure

Documentation structure is governed by [`documentation-guidelines.md`](./documentation-guidelines.md).

This file owns product architecture. Crate-local architectural detail is being moved into:

- [`docs/atm/architecture.md`](./atm/architecture.md)
- [`docs/atm-core/architecture.md`](./atm-core/architecture.md)
- [`docs/atm-daemon/architecture.md`](./atm-daemon/architecture.md)
- [`docs/atm-rusqlite/architecture.md`](./atm-rusqlite/architecture.md)

Phase-Q supersession note:

- earlier daemon-free architecture statements in this file are historical from the prior rewrite line
- for the current mail/runtime target architecture, Section 21 is authoritative

## 2. Crate Boundaries

The post-Q product runtime is implemented by four crates:

- `atm-core`
- `atm`
- `atm-daemon`
- `atm-rusqlite`

Product-level boundary rules:

- `atm-core` owns ATM business logic and the strict I/O boundaries that Phase Q routes through a daemon runtime.
- `atm` owns CLI parsing, dispatch, rendering, and bootstrap.
- `atm-daemon` owns runtime composition, transport adapters, singleton enforcement, and live-status runtime state.
- `atm-rusqlite` owns the first concrete SQLite implementation of the durable store boundaries.
- `atm-core` must not own clap or terminal-formatting concerns.
- `atm` must not own mailbox, workflow, log-query, or doctor business logic.
- `atm-daemon` must not become a second business-logic crate.
- `atm-rusqlite` must not absorb workflow or command logic; it implements store contracts only.

Crate-local boundary detail is owned by:

- [`docs/atm-core/architecture.md`](./atm-core/architecture.md)
- [`docs/atm/architecture.md`](./atm/architecture.md)
- [`docs/atm-daemon/architecture.md`](./atm-daemon/architecture.md)
- [`docs/atm-rusqlite/architecture.md`](./atm-rusqlite/architecture.md)

### 2.3 Release Publication Boundary

The `1.0` retained-surface release is a source-repo replacement of the old `agent-team-mail` CLI/core publication path, not a new public package family.

Architectural rules:

- this repo becomes the source of truth for publishing:
  - `agent-team-mail`
  - `agent-team-mail-core`
- this repo does not publish its retained CLI/core release under the crate names `atm` or `atm-core`
- crate identity continuity for downstream users is preserved by package-name replacement while keeping the CLI binary name `atm`
- historical parity channels remain:
  - crates.io
  - GitHub Releases
  - Homebrew
- `winget` is not part of historical parity, but it is required in the new release architecture because Windows installation must be first-class for `1.0` without Rust tooling or manual archive extraction

Release-process ownership rules:

- release automation is repo-owned infrastructure, not ad hoc operator procedure
- the new repo must own:
  - release artifact manifest
  - preflight workflow
  - release workflow
  - release-gate script/helpers
  - release inventory generation and verification
  - Homebrew formula update automation
  - `winget` manifest/update automation and verification
- the `publisher` agent instructions are part of the release-control surface and must be ported into this repo with source-of-truth paths updated to the new repo layout and retained crate list

Release infrastructure notes:

- Homebrew continues to use the shared `randlee/homebrew-tap` repository and existing `Formula/agent-team-mail.rb` / `Formula/atm.rb` formulas
- `HOMEBREW_TAP_TOKEN` is a required secret for the `atm-core` repo before the ported Homebrew update automation can run successfully
- `winget` uses the same `randlee` publisher namespace proven in `claude-history`; the retained CLI package ID for this repo is `randlee.agent-team-mail`
- the ported `winget` flow uses the default GitHub workflow token and does not introduce a separate `winget`-specific secret requirement
- the release workflow should use `vedantmgoyal2009/winget-releaser@v2` against the Windows ZIP release asset and its SHA256 rather than inventing repo-specific manifest plumbing first
- the initial `winget` manifest submission is a one-time manual bootstrap action; recurring releases are workflow-driven after the package exists in `microsoft/winget-pkgs`
- release verification must treat `winget` submission success and manifest generation as the immediate release signal because Microsoft review normally delays public installability by 1-2 days

Schema ownership references:

- Claude Code-native message schema: [`claude-code-message-schema.md`](./claude-code-message-schema.md)
- ATM additive/interpreted message schema: [`atm-message-schema.md`](./atm-message-schema.md)
- legacy ATM read-compatibility schema: [`legacy-atm-message-schema.md`](./legacy-atm-message-schema.md)
- `sc-observability` schema ownership pointer: [`sc-observability-schema.md`](./sc-observability-schema.md)
- ATM-owned error-code registry: [`atm-error-codes.md`](./atm-error-codes.md)
- schema enforcement models: `tools/schema_models/claude_code_message_schema.py`, `tools/schema_models/atm_message_schema.py`, and `tools/schema_models/legacy_atm_message_schema.py`

### 2.4 Shared Observability Boundary

`atm-core` must not import `sc-observability` directly.
Instead, `atm-core` defines a sealed `ObservabilityPort` boundary plus ATM-owned event and query models. `atm` implements that port using `sc-observability`.

ATM still owns:

- ATM-specific event naming
- ATM-specific structured fields
- mapping CLI filters to shared query/follow APIs
- ATM doctor projections over shared health models
- ATM-owned config semantics for baseline roster, alias resolution, and runtime-identity precedence

`sc-observability` should own as much generic functionality as possible:

- emission
- record storage and retention policy
- historical query
- follow/tail
- severity filtering
- structured field filtering
- runtime health reporting

Phase K delivered the ATM-side integration work. Phase L now governs the remaining release-hardening, boundary cleanup, and validation needed before initial release.

Initial retained-command integration scope:

- `sc-observability-types`
- `sc-observability`

Deferred from the initial retained-command integration scope:

- `sc-observe`
- `sc-observability-otlp`

The controlling ATM-side implementation design is:

- [`docs/atm-core/design/sc-observability-integration.md`](./atm-core/design/sc-observability-integration.md)

## 3. Module Layout

Detailed crate/module layout is owned by the crate-level docs:

- [`docs/atm-core/modules/`](./atm-core/modules/)
- [`docs/atm/commands/`](./atm/commands/)

Product-level constraints that remain relevant here:

- no plugin framework
- no daemon client
- no runtime spawning layer
- no separate `tail` command in the initial rewrite
- no separate `status` command in the initial rewrite
- the retained release-critical team recovery surface is limited to:
  - `teams`
  - `members`
  - `teams add-member`
  - `teams backup`
  - `teams restore`
- broader historical team lifecycle/orchestration commands remain out of scope

Supersession note:

- `no daemon client` and `no runtime spawning layer` describe the pre-Phase-Q retained CLI/runtime line only
- the Phase Q target architecture in §21 supersedes those constraints with:
  - one explicit daemon runtime
  - no hidden direct SQLite fallback
  - one explicit daemon auto-start path when the daemon is absent

## 4. Core Types

### 4.1 Semantic Newtypes

Per `rust-best-practices`, validated primitives and semantic ids should not remain as raw `String` values across the service boundary.

Required public newtypes:

- `TeamName`
- `AgentName`
- `IdentityName`
- `MessageKey`
- `MessageId`
- `MessageBody`
- `MessageSummary`
- `IsoTimestamp`
- `MailAddress`
- `TaskId`

Required resource/config wrappers:

- `ConnectionCap`
- `QueueDepth`
- `RetryBudget`
- `BusyTimeout`
- `RequestDeadline`
- `HomeDir`
- `AbsolutePath`
- `LogFieldKey`
- `LogFieldValue`

These are required to reduce repeated validation and remove stringly typed command paths.
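
As a minimal sketch of the newtype pattern described above, the following shows one way `TeamName` could validate once at construction and then travel as a distinct type. The validation rule (non-empty, ASCII alphanumerics plus `-` and `_`), the `NameError` type, and the `parse`/`as_str` method names are assumptions made for illustration; this document does not specify them.

```rust
use std::fmt;

/// Hypothetical validation error; the real ATM error type is not specified here.
#[derive(Debug, PartialEq, Eq)]
pub enum NameError {
    Empty,
    InvalidChar(char),
}

/// Validated team name. The inner field is private, so a `TeamName` can only
/// be obtained through `parse`, which runs the validation exactly once.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct TeamName(String);

impl TeamName {
    /// Assumed rule for illustration only: non-empty, ASCII alphanumerics,
    /// `-`, and `_`.
    pub fn parse(raw: &str) -> Result<Self, NameError> {
        if raw.is_empty() {
            return Err(NameError::Empty);
        }
        if let Some(bad) = raw
            .chars()
            .find(|c| !(c.is_ascii_alphanumeric() || *c == '-' || *c == '_'))
        {
            return Err(NameError::InvalidChar(bad));
        }
        Ok(TeamName(raw.to_owned()))
    }

    /// Borrow the validated value for rendering or path construction.
    pub fn as_str(&self) -> &str {
        &self.0
    }
}

impl fmt::Display for TeamName {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        f.write_str(&self.0)
    }
}
```

Under these assumptions, a service signature that takes `TeamName` instead of `String` cannot receive an unvalidated value, which is the "reduce repeated validation" benefit the section names.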

### 4.2 Workflow State And Display Types

Canonical axis enums:

```rust
pub enum ReadState {
    Unread,
    Read,
}

pub enum AckState {
    NoAckRequired,
    PendingAck,
    Acknowledged,
}

pub enum MessageClass {
    Unread,
    PendingAck,
    Acknowledged,
    Read,
}
```

Display bucket enum:

```rust
pub enum DisplayBucket {
    Unread,
    PendingAck,
    History,
}
```

Selection enum:

```rust
pub enum ReadSelection {
    Actionable,
    UnreadOnly,
    PendingAckOnly,
    ActionableWithHistory,
    All,
}
```

Ack activation mode:

```rust
pub enum AckActivationMode {
    PromoteDisplayedUnread,
    ReadOnly,
}
```

Display mapping is fixed:

- `MessageClass::Unread` -> `DisplayBucket::Unread`
- `MessageClass::PendingAck` -> `DisplayBucket::PendingAck`
- `MessageClass::Acknowledged` -> `DisplayBucket::History`
- `MessageClass::Read` -> `DisplayBucket::History`

### 4.3 Typestate Transition Model

Per `rust-best-practices`, legal workflow transitions should be encoded in the type system inside the core pipeline.

Private marker states:

```rust
pub struct UnreadReadState;
pub struct ReadReadState;
pub struct NoAckState;
pub struct PendingAckState;
pub struct AcknowledgedAckState;

pub struct StoredMessage<R, A> {
    // persisted fields + read-state marker + ack-state marker
}

impl StoredMessage<UnreadReadState, NoAckState> {
    pub fn display_without_ack(self) -> StoredMessage<ReadReadState, NoAckState>;
    pub fn display_and_require_ack(
        self,
        at: IsoTimestamp,
    ) -> StoredMessage<ReadReadState, PendingAckState>;
}

impl StoredMessage<UnreadReadState, PendingAckState> {
    pub fn mark_read_pending_ack(self) -> StoredMessage<ReadReadState, PendingAckState>;
}

impl StoredMessage<ReadReadState, PendingAckState> {
    pub fn acknowledge(
        self,
        at: IsoTimestamp,
    ) -> StoredMessage<ReadReadState, AcknowledgedAckState>;
}
```

There is no inverse transition on either axis.

The public axis enums and `MessageClass` are for reporting and filtering. The typestate markers enforce legal transitions inside `atm-core`.

### 4.4 Log Query Types

Log query types should remain generic enough to map onto shared `sc-observability` APIs.

Required public types:

```rust
use std::collections::BTreeMap;

pub enum LogMode {
    Snapshot,
    Tail,
}

pub enum LogLevelFilter {
    Trace,
    Debug,
    Info,
    Warn,
    Error,
}

pub struct LogFieldMatch {
    pub key: LogFieldKey,
    pub value: LogFieldValue,
}

pub struct LogFieldMap(BTreeMap<LogFieldKey, LogFieldValue>);

pub struct AtmJsonNumber(String);

pub enum LogFieldValue {
    Null,
    Bool(bool),
    String(String),
    Number(AtmJsonNumber),
    Array(Vec<LogFieldValue>),
    Object(LogFieldMap),
}
```

Architectural rules:

- `LogFieldKey` replaces raw field-name strings at the public observability boundary
- `AtmJsonNumber` replaces raw numeric `serde_json` values at the public observability boundary
- `LogFieldValue` and `LogFieldMap` replace raw `serde_json::Value` / `Map` in `LogFieldMatch` and `AtmLogRecord`
- these ATM-owned types must serialize to the same JSON shape the CLI exposes today; the boundary cleanup is a Rust API cleanup, not a CLI wire-format redesign
- conversion to and from raw `serde_json` values remains centralized inside `atm-core`

### 4.5 Observability Construction Contract

`CliObservability` (atm crate) should expose one structured construction path for initial release, and `CliObservabilityOptions` is also owned by the `atm` crate:

```rust
pub struct CliObservabilityOptions {
    pub stderr_logs: bool,
}

impl CliObservability {
    pub fn new(home_dir: &Path, options: CliObservabilityOptions) -> Result;
}
```

Architectural rules:

- the top-level `init(stderr_logs)` helper may remain as a CLI convenience, but it should delegate to `CliObservability::new(...)`
- dynamic dispatch via `Box` remains acceptable for initial release
- the current sealed-trait pattern remains acceptable for initial release
- `DoctorCommand` injectability is explicitly deferred unless implementation surfaces a concrete need

### 4.6 Identity And Alias Projection

ATM must distinguish canonical routing identity from the Claude-facing sender projection.

Architectural rules:

- runtime identity resolves from explicit CLI override, hook identity, or `ATM_IDENTITY`, not repo-local `[atm].identity`
- ATM-owned aliases are input shorthands that resolve to canonical member names
- same-team messages keep current canonical sender projection behavior
- cross-team messages may project an alias-friendly sender in the persisted `from` field for Claude-facing ergonomics
- whenever cross-team alias projection is used, ATM must also persist canonical sender identity in `metadata.atm.fromIdentity`
- self-send checks, target validation, routing, and audit logic must use the canonical sender identity rather than the display-oriented `from` projection
- ATM-owned post-send hooks are best-effort recipient-scoped helpers, not part of the atomic send boundary
- the hook runs only after a successful non-`dry-run` send or ack; it fires after both `atm send` and `atm ack`
- Phase Q addition: the retained hook contract now includes `atm ack` reply writes as hook-producing outbound messages, not only `atm send`
- each `[[atm.post_send_hooks]]` rule binds one recipient selector and one command argv
- `recipient = "*"` acts as a wildcard match for all recipients
- multiple matching rules all execute, in config order
- relative post-send-hook paths resolve from the discovered `.atm.toml` directory and execute with that same directory as the working directory
- bare executable names use normal `PATH` lookup
- the hook receives inherited environment plus one ATM-owned JSON payload in `ATM_POST_SEND`
- the payload includes `from`, `to`, `sender`, `recipient`, `team`, `message_id`, `requires_ack`, `is_ack` (bool), and optional `task_id`
- Phase Q addition: `is_ack` is the explicit send-vs-ack discriminator for daemon-owned hook evaluation and downstream nudge logic
- the hook may optionally emit one structured result object on stdout with a declared log level, message, and optional structured fields; ATM parses it on a best-effort basis for post-send diagnostics
- absent or invalid hook-result stdout is ignored rather than treated as hook failure
- recipient non-match is silent
- retired flat hook keys and `[atm].post_send_hook_members` are configuration errors, not compatibility aliases
- hook-decision logging must preserve sender, recipient, matched rule selector, and final execution outcome for troubleshooting
- hook failure or timeout never rolls back a successful send

## 5. Persisted Schema

### 5.1 Team Config

The rewrite reuses the existing team config schema where feasible. Only a small subset is required by the retained surface:

- member roster
- enough member metadata to preserve round-trips when present
- bridge remote host configuration needed for origin-file merge when present

ATM config and team-launch config are distinct concerns:

- ATM-owned config uses the `[atm]` section of `.atm.toml`
- launcher-owned sections such as `[rmux]` and future `[scmux]` remain outside the `atm-core` runtime config boundary and are ignored by ATM
- `[atm].team_members` is the ATM-owned baseline roster for doctor/orchestration checks
- `[atm].aliases` is the ATM-owned shorthand map for canonical agent names
- `[[atm.post_send_hooks]]` is the ATM-owned best-effort post-send automation surface
- retired flat hook keys and `[atm].post_send_hook_members` must fail fast with migration guidance
- `[atm].identity` is obsolete in the retained multi-agent model and must not participate in runtime identity resolution

Team config loading must follow a narrow-scope recovery policy:

- compatibility-only schema drift may use deterministic defaults at the schema boundary
- malformed member records should be isolated at member scope only when the remaining roster is still trustworthy
- missing `config.json` is a distinct `missing-document` condition, not a parse error
- root-document corruption or invalid root structure remains a command error
- identity and routing fields must never be guessed to keep commands running

Diagnostics for team config failures must preserve:

- failure class when known
- file path
- member or collection scope when known
- parser line and column when available
- original parser cause for operator repair

### 5.1.1 Deprecated `[atm].identity`

`[atm].identity` remains parse-compatible only as an obsolete migration field. It is no longer part of runtime sender or actor resolution.

Current runtime contract:

- runtime identity resolves from explicit CLI override when supported, then hook identity, then `ATM_IDENTITY`
- if no runtime identity source is available, the command fails with `ATM_IDENTITY_UNAVAILABLE`
- `[atm].identity` is ignored for runtime resolution even when still present in `.atm.toml`

Deprecation and migration contract:

- `atm doctor` reports stale `[atm].identity` with `ATM_WARNING_IDENTITY_DRIFT`
- operator migration path is: remove `[atm].identity` and set `ATM_IDENTITY` in the active agent environment instead
- keeping the obsolete key temporarily is tolerated for migration diagnostics only; it must not change runtime behavior

Sample operator-facing repair cases live in [`persisted-data-repair.md`](./persisted-data-repair.md).
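
The precedence in the runtime contract above can be sketched as a small pure function. This is an illustration under stated assumptions, not the `atm-core` implementation: `IdentitySources`, `IdentitySource`, `ResolvedIdentity`, and both function names are hypothetical, and the error and warning codes are modeled as plain string constants.

```rust
/// Hypothetical bundle of the identity inputs visible to one command.
#[derive(Debug, Default, Clone, Copy)]
pub struct IdentitySources<'a> {
    pub cli_override: Option<&'a str>,
    pub hook_identity: Option<&'a str>,
    /// Value of the `ATM_IDENTITY` environment variable, if set.
    pub env_identity: Option<&'a str>,
    /// Obsolete `[atm].identity` from `.atm.toml`; never used for resolution.
    pub config_identity: Option<&'a str>,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IdentitySource {
    CliOverride,
    Hook,
    Environment,
}

#[derive(Debug, PartialEq, Eq)]
pub struct ResolvedIdentity {
    pub name: String,
    pub source: IdentitySource,
}

pub const ATM_IDENTITY_UNAVAILABLE: &str = "ATM_IDENTITY_UNAVAILABLE";
pub const ATM_WARNING_IDENTITY_DRIFT: &str = "ATM_WARNING_IDENTITY_DRIFT";

/// Walk the sources in contract order: CLI override, hook identity, then
/// `ATM_IDENTITY`. The obsolete config key is deliberately never consulted.
pub fn resolve_runtime_identity(
    sources: &IdentitySources,
) -> Result<ResolvedIdentity, &'static str> {
    let candidates = [
        (sources.cli_override, IdentitySource::CliOverride),
        (sources.hook_identity, IdentitySource::Hook),
        (sources.env_identity, IdentitySource::Environment),
    ];
    candidates
        .iter()
        .find_map(|&(value, source)| {
            value.map(|name| ResolvedIdentity {
                name: name.to_owned(),
                source,
            })
        })
        .ok_or(ATM_IDENTITY_UNAVAILABLE)
}

/// Doctor-side check: a lingering `[atm].identity` only produces a warning.
pub fn doctor_identity_warning(sources: &IdentitySources) -> Option<&'static str> {
    sources.config_identity.map(|_| ATM_WARNING_IDENTITY_DRIFT)
}
```

Keeping the obsolete key out of `candidates` entirely, rather than placing it last, is what makes the "must not change runtime behavior" rule hold in this sketch: a stale value can surface only through the doctor warning.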

### 5.2 Inbox Message

Current persisted inbox superset may contain:

- Claude-native baseline fields:
  - `from`
  - `text`
  - `timestamp`
  - `read`
  - `summary`
  - optional producer field `color`
- legacy ATM top-level additive fields such as:
  - `source_team`
  - `message_id`
  - `pendingAckAt`
  - `acknowledgedAt`
  - `acknowledgesMessageId`
- shared/de facto interpreted fields such as:
  - `taskId`
- forward metadata container:
  - `metadata`
- unknown fields

Schema ownership split:

- Claude-native baseline fields are documented in [`claude-code-message-schema.md`](./claude-code-message-schema.md)
- legacy ATM top-level additive compatibility fields are documented in [`legacy-atm-message-schema.md`](./legacy-atm-message-schema.md)
- forward ATM machine-readable schema is documented in [`atm-message-schema.md`](./atm-message-schema.md)

Forward architectural rules:

- new ATM-only machine-readable data belongs in `metadata.atm`
- legacy top-level ATM fields remain read-compatible but are deprecated for new write behavior
- forward ATM-authored alert metadata, including legacy `atmAlertKind` and `missingConfigPath`, belongs under `metadata.atm` as `metadata.atm.alertKind` and `metadata.atm.missingConfigPath`
- cross-team alias projection stores canonical sender identity in `metadata.atm.fromIdentity`
- ATM may enrich a Claude-native stored message by adding `metadata.atm` without rewriting the native Claude fields
- the current live design still uses a shared inbox surface; a separate ATM-native inbox is intentionally deferred to a later architecture phase

Current compatibility rule:

- the current runtime send/alert write path may continue writing legacy top-level alert fields during the compatibility period
- the `metadata.atm` alert placement defined above is the forward architectural target and must not be partially implemented without the corresponding migration sprint and tests
- the owning design rationale for this migration remains [`atm-core/design/dedup-metadata-schema.md`](./atm-core/design/dedup-metadata-schema.md) §2.2 and §3.3

File-ownership rule:

- Claude-owned inbox content is not an ATM-owned source of truth for ATM-local workflow durability
- ATM may still have legacy compatibility write paths on the shared inbox surface, but those paths must be documented as compatibility behavior rather than a general pattern to copy
- ATM-owned machine state should converge on ATM-owned sidecars or equivalent ATM-owned persisted state when stronger write guarantees are required
- mailbox-local ATM workflow state now lives in the ATM-owned sidecar family at `.claude/teams//.atm-state/workflow/.json`
- `read`, `ack`, and `clear` project mailbox display state by joining Claude-owned inbox records with the ATM-owned workflow sidecar
- messages without a stable ATM identity remain compatibility-only and may still use the legacy inbox-local workflow fields until a later enrichment phase lands

Canonical read and ack axes are derived from persisted fields and not serialized separately.

Invariant:

- legacy top-level `message_id` values may be UUID or absent
- forward ATM metadata `messageId` values must be ULID
- write-path schema enforcement must reject placing ULID identifiers in the legacy top-level `message_id` slot and must reject placing UUID identifiers in forward `metadata.atm.messageId`
- read-path validation failure for those ATM-owned fields must log a warning, treat the malformed ATM-owned field as absent for ATM semantics, and continue processing the message when the Claude-native envelope remains usable
- when ATM authors a new ULID `messageId`, the persisted message `timestamp` must be derived from that ULID creation time so identifier ordering and timestamp ordering are aligned
- legacy or externally imported records may still lack ATM machine identifiers
- such records must be preserved as-is until enriched

## 6. Public Service APIs

### 6.1 Send Service

Supersession note:

- the API shape in this section remains relevant
- the file-append-first ordering details below are compatibility-line behavior for the pre-Phase-Q runtime
- the authoritative Phase Q send ordering is defined in §21 as: `SQLite commit -> Claude export / remote daemon handoff`

Public entrypoint:

`send::send_mail(request: SendRequest, observability: &dyn ObservabilityPort) -> Result`

`SendRequest` contains:

- home directory
- current directory
- sender override
- target address input
- team override
- message source
- summary override
- requires-ack flag
- optional task id
- dry-run flag

`SendMessageSource` variants:

- inline text
- stdin text
- file reference

`SendOutcome` fields:

| Field | Type | Description |
| --- | --- | --- |
| `action` | `&'static str` | Stable send action marker. |
| `team` | `String` | Resolved target team. |
| `agent` | `String` | Resolved target recipient. |
| `sender` | `String` | Resolved sender identity. |
| `outcome` | `&'static str` | Delivery result such as `sent` or `dry_run`. |
| `message_id` | `Uuid` | ATM-authored UUID v4 for the send operation. |
| `requires_ack` | `bool` | Whether the message requires acknowledgement. |
| `task_id` | `Option` | Optional task identifier persisted on the message. |
| `summary` | `Option` | Generated or caller-supplied summary text. |
| `message` | `Option` | Rendered message body for dry-run output. |
| `warnings` | `Vec` | Actionable degraded-mode warnings surfaced when send succeeds under a permitted fallback condition. |
| `dry_run` | `bool` | Whether the send was executed as a dry run. |

The file-reference path may be rewritten through the file policy layer.

The CLI JSON output mirrors the current contract.

Normal send JSON output includes:

- `action = "send"`
- `team`
- `agent`
- `outcome`
- `message_id`
- `requires_ack`
- `task_id`
- `warnings` when send completed in a degraded but permitted mode

Dry-run send JSON output includes:

- `action = "send"`
- `agent`
- `team`
- `message`
- `dry_run = true`
- `requires_ack`
- `task_id`
- `warnings` when dry-run surfaces degraded send conditions

Send ordering rules:

- resolve target address, team existence, and agent membership as one address-resolution stage before mailbox path selection
- enter the atomic append boundary before final inbox mutation
- validate message text inside the atomic append boundary
- current legacy top-level `message_id` generation remains supported for live compatibility
- forward metadata schema generation must create the ATM ULID `messageId` first and derive the persisted message `timestamp` from it
- perform duplicate suppression and final append inside the same atomic append boundary

#### 6.1.1 Idle-Notification Lifecycle

- message classification first attempts to parse the persisted `text` field as JSON and treat the message as an idle notification when the parsed object has `type == "idle_notification"`
- if parsing fails, or `type` differs, the message is classified as a normal message
- when a newly appended message is classified as an idle notification, the mailbox append boundary removes any older unread idle notification from the same sender in the same inbox before appending the new record
- `atm clear --idle-only` remains manual backlog cleanup, not the primary lifecycle path

Deferred follow-on work:

- read-time auto-purge of displayed idle notifications
- daemon-side idle-notification removal behavior

#### 6.1.2 Task-Assignment Classification

- classification uses the same text-field JSON detection pattern and treats a message as a task assignment when the parsed object has `type == "task_assignment"`
- because the Claude Code schema is fixed, classification must populate `extra["task_id"]` and `extra["priority"]` from the parsed text-field JSON rather than extending `MessageEnvelope` with new top-level fields
- final field naming and task-subsystem semantics remain coordinated with the future `arch-ctask` task subsystem design; see `atm-core` issue `#17`
- task-assignment extraction remains deferred until the `arch-ctask` subsystem is defined

Missing-team-config fallback is limited to `send`:

- fallback applies only when `config.json` is missing and the target inbox already exists
- malformed `config.json` remains a command error
- fallback must surface an actionable sender warning
- fallback may send a best-effort repair notice to `team-lead`
- repair notices must be deduplicated by unresolved condition so repeated sends do not flood inboxes

### 6.2 Read Service

Public entrypoint:

`read::read_mail(query: ReadQuery, observability: &dyn ObservabilityPort) -> Result`

`ReadQuery` contains:

- home directory
- current directory
- actor override
- optional target address
- team override
- selection_mode
- seen_state_filter
- seen_state_update
- ack_activation_mode
- limit
- sender_filter
- timestamp_filter
- optional timeout

`seen_state_filter` is false when `--no-since-last-seen` is set. `--all` bypasses this filter regardless of the stored value.

`seen_state_update` is false when `--no-update-seen` is set.
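
The flag semantics above can be sketched as a small mapping from CLI switches to the two `ReadQuery` booleans. `SeenStateFlags`, `SeenStatePlan`, and both method names are hypothetical, and the sketch assumes that `--all` corresponds to `ReadSelection::All`, which matches the read pipeline's "apply seen-state filter unless selection is `All`" stage.

```rust
/// Selection modes as listed in §4.2.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReadSelection {
    Actionable,
    UnreadOnly,
    PendingAckOnly,
    ActionableWithHistory,
    All,
}

/// Hypothetical holder for the two CLI switches named above.
#[derive(Debug, Clone, Copy, Default)]
pub struct SeenStateFlags {
    pub no_since_last_seen: bool,
    pub no_update_seen: bool,
}

/// The two seen-state values carried on `ReadQuery`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SeenStatePlan {
    pub seen_state_filter: bool,
    pub seen_state_update: bool,
}

impl SeenStatePlan {
    /// Both values default to true and are switched off by their `--no-*` flag.
    pub fn from_flags(flags: SeenStateFlags) -> Self {
        SeenStatePlan {
            seen_state_filter: !flags.no_since_last_seen,
            seen_state_update: !flags.no_update_seen,
        }
    }

    /// The stored filter value is ignored when the selection is `All`.
    pub fn filter_applies(self, selection: ReadSelection) -> bool {
        self.seen_state_filter && selection != ReadSelection::All
    }
}
```

Keeping the `All` bypass in `filter_applies` rather than in `from_flags` leaves the stored `seen_state_filter` value untouched, in line with "regardless of the stored value".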

Timeout rule:

- if the requested selection is already non-empty after filtering and selection-mode application, return immediately
- otherwise wait for a newly eligible message until the timeout expires

`ReadOutcome` contains:

- action
- resolved team
- resolved agent
- selection_mode
- history_collapsed
- mutation_applied
- messages
- bucket_counts

`ReadOutcome.bucket_counts` exposes:

- unread
- pending_ack
- history

Read deduplication rule:

- collapse multiple entries with the same non-null `message_id` to the most recent entry before bucket selection and output rendering
- when timestamps tie, keep the later encountered inbox record

Read/enrichment rule:

- when a message needs ATM workflow semantics but lacks ATM-owned machine metadata, ATM may enrich the original stored message additively
- enrichment must be idempotent and must not rewrite native Claude fields except for the explicitly documented cross-team alias projection carve-out on `from`, which also requires canonical sender identity in `metadata.atm.fromIdentity`

The read service derives `MessageClass` from `(ReadState, AckState)` and applies display-bucket selection to the derived class, not to raw persisted fields.

For merged inbox surfaces, any displayed-message mutation must be written back to the physical inbox file that contributed the displayed record. The merged view is a read projection, not a synthetic write target.
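
The read deduplication rule above can be illustrated with a single pass over the loaded records. This is a sketch under assumptions: `InboxRecord` is a hypothetical reduced record, `timestamp` is modeled as an integer rather than the persisted ISO timestamp, and the output keeps each surviving record at the position where its `message_id` first appeared (the architecture sorts later in the pipeline, so this ordering choice is illustrative only).

```rust
use std::collections::HashMap;

/// Hypothetical reduced inbox record for illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct InboxRecord {
    pub message_id: Option<String>,
    /// Assumed integer timestamp; the persisted field is an ISO timestamp.
    pub timestamp: u64,
    pub text: String,
}

/// Collapse records sharing a non-null `message_id` to the most recent one.
/// On a timestamp tie the later encountered record wins, hence `>=`.
/// Records without a `message_id` are never collapsed.
pub fn collapse_duplicates(records: Vec<InboxRecord>) -> Vec<InboxRecord> {
    let mut out: Vec<InboxRecord> = Vec::new();
    // Maps each message id to the slot in `out` holding its current winner.
    let mut winner: HashMap<String, usize> = HashMap::new();

    for rec in records {
        let id = match rec.message_id.clone() {
            Some(id) => id,
            None => {
                out.push(rec);
                continue;
            }
        };
        match winner.get(&id).copied() {
            Some(slot) => {
                if rec.timestamp >= out[slot].timestamp {
                    out[slot] = rec;
                }
            }
            None => {
                winner.insert(id, out.len());
                out.push(rec);
            }
        }
    }
    out
}
```

Running the collapse before bucket selection, as the rule requires, means bucket counts are computed over one record per message rather than over every physical copy in the merged surface.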

The CLI JSON output mirrors the current contract:

- `action`
- `team`
- `agent`
- `messages`
- `count`
- `bucket_counts`
- `history_collapsed`

### 6.3 Ack Service

Public entrypoint:

`ack::ack_mail(request: AckRequest, observability: &dyn ObservabilityPort) -> Result`

`AckRequest` contains:

- home directory
- current directory
- actor override
- team override
- source message id
- reply body

`AckOutcome` contains:

- action
- resolved team
- resolved agent
- source message id
- optional task id from the acknowledged message
- reply target
- reply message id
- reply text
- warnings: Vec
- Phase Q addition: `warnings` carries best-effort post-send-hook diagnostics for `atm ack` without changing the successful acknowledgement state

The ack service is responsible for the legal transition from `(Read, PendingAck)` to `(Read, Acknowledged)` plus the reply append.

When the source message came from an origin inbox file in the merged surface, the acknowledgement writeback must update that source file atomically rather than projecting the change onto a different inbox file.

### 6.4 Clear Service

Public entrypoint:

`clear::clear_mail(query: ClearQuery, observability: &dyn ObservabilityPort) -> Result`

`ClearQuery` contains:

- home directory
- current directory
- actor override
- optional target address
- team override
- optional age filter
- idle-only flag
- dry-run flag

`ClearOutcome` contains:

- action
- resolved team
- resolved agent
- removed total
- remaining total
- removal counters by class

Clear eligibility is computed from the two-axis model:

- clearable: `(Read, NoAckRequired)` and `(Read, Acknowledged)`
- non-clearable: every other combination

### 6.5 Observability Boundary

The observability boundary is a sealed `ObservabilityPort` (or equivalent injected interface) defined in `atm-core` and implemented in `atm`.
It is responsible for:

- command lifecycle emission
- log query
- log tail/follow
- observability health projection

The retained boundary must remain ATM-owned and must not leak shared `sc-observability` types directly into `atm-core` public APIs.

`atm-core` owns the ATM-specific event and query vocabulary needed for ATM’s messaging workflows, retained-log query/follow, and doctor readiness. `atm` owns the concrete `sc-observability` integration and CLI-facing routing decisions such as `--stderr-logs`.

Future hook- or `schooks`-driven observability orchestration remains out of scope for the initial ATM release and must not be inferred from this boundary.

### 6.6 Log Service

Public entrypoints:

- `ObservabilityPort::query(query: AtmLogQuery) -> Result`
- `ObservabilityPort::follow(query: AtmLogQuery) -> Result`

ATM CLI surfaces such as `atm log snapshot`, `atm log filter`, and `atm log tail` consume those boundary methods directly rather than routing through a separate `log::query_logs(...)` or `log::tail_logs(...)` wrapper.

`AtmLogQuery` contains:

- mode
- level filters
- field matches
- time window
- limit

`AtmLogSnapshot` contains:

- returned records
- truncation flag when the shared query source truncates results

`LogTailSession` is an owning stateful object that yields matching records from the shared observability follow API without exposing a public callback trait.

Ordering rules:

- snapshot queries return newest-first records before CLI output limits are rendered
- tail sessions yield records in follow arrival order

ATM must not parse daemon log files directly in this service.
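
One way to picture the "owning stateful object ... without exposing a public callback trait" shape of `LogTailSession` is a pull-based iterator that owns its record source. The sketch below is an assumption-heavy illustration: it substitutes a standard-library channel for the shared follow API, and `Level`, `LogRecord`, `open`, and the minimum-severity filter semantics are hypothetical rather than taken from `atm-core` or `sc-observability`.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical severity scale, ordered from least to most severe.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Trace,
    Debug,
    Info,
    Warn,
    Error,
}

/// Hypothetical reduced log record.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LogRecord {
    pub level: Level,
    pub message: String,
}

/// Owning tail session: the caller pulls records instead of registering a
/// callback. The channel stands in for the shared follow source.
pub struct LogTailSession {
    source: Receiver<LogRecord>,
    min_level: Level,
}

impl LogTailSession {
    /// Returns the producer handle (stand-in for the follow source) and the
    /// session that consumes it.
    pub fn open(min_level: Level) -> (Sender<LogRecord>, LogTailSession) {
        let (tx, rx) = channel();
        (tx, LogTailSession { source: rx, min_level })
    }
}

impl Iterator for LogTailSession {
    type Item = LogRecord;

    /// Blocks until a matching record arrives; ends when the source closes.
    /// Records are yielded in arrival order, never re-sorted.
    fn next(&mut self) -> Option<LogRecord> {
        loop {
            let record = self.source.recv().ok()?;
            if record.level >= self.min_level {
                return Some(record);
            }
        }
    }
}
```

Because the session implements `Iterator`, a CLI loop such as `for record in session { ... }` needs no callback trait, and dropping the session releases the follow source.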

### 6.7 Doctor Service

Public entrypoint:

`doctor::run_doctor(query: DoctorQuery, observability: &dyn ObservabilityPort) -> Result`

`DoctorQuery` contains:

- home directory
- current directory
- team override

`DoctorReport` contains:

- summary
- findings
- recommendations
- environment override visibility
- current team member roster from `config.json`
- observability health

`DoctorFinding` contains:

- severity
- code
- message
- remediation

The report model should reuse the current doctor command’s severity/finding structure where useful, but in the Phase Q target architecture it must include daemon/runtime checks rather than assuming a daemon-free local-only model.

Roster output rules:

- show all current `config.json` members in doctor output
- show baseline `[atm].team_members` first
- show `team-lead` first among the baseline members when present
- show extra runtime members after the baseline set
- snapshot `~/.claude/teams/*/inboxes/*.lock` at doctor start and end; any lock path present in both snapshots is stale and should surface as `ATM_WARNING_STALE_MAILBOX_LOCK` with `rm -f ` recovery guidance

### 6.8 Team Recovery Services

The retained release-critical local team surface is intentionally narrow.
ATM-owned public entrypoints should cover: - local team discovery - local member listing - local `add-member` - local team backup - local team restore Architectural rules: - these services are local file/config/inbox operations; they must not depend on daemon orchestration or runtime spawning - `teams` list is discovery-oriented and should remain deterministic over the ATM home directory - `add-member` is the retained local roster-repair path and must reject duplicates before mutating config - `backup` snapshots current team config, inboxes, and the ATM team task bucket into a timestamped snapshot directory - inbox backup excludes transient mailbox `*.lock` sentinels, dotfiles, and restore markers - `restore` is a local recovery path and must: - preserve the current team-lead entry and `leadSessionId` - restore only missing non-lead members - clear runtime-only restored-member state before persistence - restore non-lead inboxes from the chosen snapshot - sweep stale mailbox `*.lock` sentinels before restored inbox files are copied in - treat stale-sentinel sweep as result-bearing: if that cleanup hits a read-only-filesystem failure, restore must stop and surface `MailboxLockReadOnlyFilesystem` instead of warning and continuing - recompute `.highwatermark` from the maximum restored task id - support a dry-run path without making changes - Claude Code project task-list restoration remains separate from the retained ATM team backup/restore surface ### 6.9 Members Service The retained `members` surface is a local roster inspection service. Architectural rules: - it must succeed without daemon or hook-only state - it must load the roster from local team config - it should order members deterministically, with `team-lead` first when present - it may surface persisted member metadata already present in config - later hook/session enrichment may be layered on without changing the base local verification purpose of the command ## 7. 
Read Pipeline Supersession note: - the stage list below describes the retained file-backed line - the Phase Q target pipeline is `ingest/reconcile -> SQLite projection -> optional state mutation -> return outcome` - once Phase Q lands, SQLite projection rather than merged file truth becomes authoritative for `read` The read pipeline stages are: 1. resolve actor and target inbox 2. build the hostname registry for configured origin inboxes 3. load mailbox records from the merged inbox surface 4. collapse duplicate `message_id` entries to the newest visible record 5. classify read axis, ack axis, and derived message class 6. apply sender and timestamp filters 7. apply seen-state filter unless selection is `All` 8. map derived message class to display bucket and apply selection mode 9. wait if `timeout` is set and the current selection is empty 10. sort newest-first and apply limit 11. apply legal read-axis and ack-axis transitions for displayed messages 12. persist state changes atomically 13. update seen-state when enabled 14. return outcome This ordering is part of the architecture contract. ## 8. Ack Pipeline The ack pipeline stages are: 1. resolve actor identity and own inbox 2. load the merged inbox surface and locate the source message 3. classify the source message into read and ack axes 4. require pending acknowledgement before mutation 5. resolve the reply target inbox from the source envelope 6. atomically apply the ack transition and append the reply 7. emit command lifecycle records 8. return outcome This stage list describes the pre-Phase-Q compatibility line. The Phase Q target pipeline is superseded by the SQLite SSOT and daemon-boundary design in Section 21. ## 9. Clear Pipeline The clear pipeline stages are: 1. resolve actor identity and target inbox 2. load the persisted inbox surface 3. classify each message into read axis and ack axis 4. compute clear eligibility from the two-axis model plus pending-ack override 5. 
apply optional age and idle-only filters 6. atomically persist the kept set when not in dry-run mode 7. emit command lifecycle records 8. return outcome This stage list describes the pre-Phase-Q compatibility line. The Phase Q target pipeline is superseded by the SQLite SSOT and daemon-boundary design in Section 21. ## 10. Log Pipeline The log pipeline stages are: 1. resolve the injected observability port implementation 2. map CLI filters into shared query/follow filters 3. query or follow records through the observability port 4. project ATM-owned record fields for CLI rendering 5. return records to the CLI layer Shared `sc-observability` should own record storage, filtering, and follow mechanics. ATM should own only ATM-specific query defaults and field projections. ## 11. Doctor Pipeline The doctor pipeline stages are: 1. resolve config and environment overrides 2. resolve effective team and identity inputs 3. inspect ATM config for obsolete fields such as `[atm].identity` 4. verify local team/mailbox/config paths 5. verify hook identity availability 6. compare baseline `[atm].team_members` against `config.json.members` 7. verify observability initialization and health 8. verify observability query readiness for `atm log` 9. assemble findings, recommendations, and ordered roster output 10. render report ## 12. 
Mailbox Storage Supersession note: - this section describes the retained mailbox/file-storage line - Phase Q supersedes it as the target architecture with SQLite durable truth and Claude inbox files as compatibility ingress/export only - any mailbox-lock or file-truth rule in this section is transitional unless restated in §21 The mailbox layer owns: - tolerant reads - atomic append - duplicate suppression - conflict merge - origin-inbox merge - atomic workflow-state updates - atomic clear-set replacement - sender-scoped idle-notification dedup inside the atomic append boundary The mailbox layer does not own selection policy, display buckets, output formatting, log query behavior, or doctor diagnostics. ## 13. Identity And File Policy ### 13.1 Hook Matching When `ATM_POST_SEND` is set for a configured post-send hook, the payload must contain: - `sender` - `recipient` - `team` - `from` - `to` - `message_id` - `requires_ack` - `is_ack` - optional `task_id` when present - optional `recipient_pane_id` when ATM already knows the authoritative pane mapping for the recipient The post-send hook runs only after a successful outbound mailbox write from `atm send` or `atm ack`. It executes once when recipient matching succeeds, uses `is_ack = false` for `atm send` and `is_ack = true` for `atm ack`, may optionally emit one structured stdout result for observability, and never rolls back a successful message write on failure or timeout. Phase Q hook-note: - once roster and pane mapping truth move to SQLite, the send path should place the authoritative recipient pane id into `ATM_POST_SEND.recipient_pane_id` - post-send hook implementations should prefer that payload field over local file rediscovery when it is present Supported structured hook-result levels remain: - `debug` - `info` - `warn` - `error` ### 13.2 Identity Resolution Hook-file identity is retained because it is a current non-daemon convenience path for send/read identity resolution. 
Only hook identity resolution is required for the rewrite. Session-resolution paths that exist only to bridge runtime/daemon ambiguity are not required. Repo-local config identity is not retained as a runtime fallback. In the multi-agent model, runtime identity must come from explicit CLI override, hook identity, or `ATM_IDENTITY`. An obsolete `[atm].identity` field may be diagnosed by doctor, but it must not control sender/actor resolution. ### 13.3 File Policy The current `send --file` behavior is retained: - inspect Claude settings permissions when available - if the referenced file is allowed, send a direct file reference - otherwise copy to ATM share storage and rewrite the message body accordingly ## 14. Observability `atm-core::observability` defines ATM event/query models plus the sealed `ObservabilityPort` boundary. `atm` provides the concrete `sc-observability` implementation and injects it into core services. Initialization: - `atm` initializes logging once at process startup - `atm` constructs the concrete observability port after startup initialization - logging failures degrade to best-effort behavior for explicit mail commands Required ATM event classes: - command start - command success - command failure - mailbox record skipped Required ATM event fields: - command - team - actor - target - task id - outcome - error class when applicable - stable error code when applicable - message count when applicable - transition count when applicable For explicit observability consumer commands: - `atm log` depends on shared query/follow APIs - `atm doctor` depends on shared health APIs - failures in those consumer paths are command errors, not silently dropped events ### 14.1 Concrete Integration Shape The retained implementation uses an ATM-owned emit/query/follow/health boundary that projects shared observability behavior into ATM-owned types: - ATM-owned `AtmLogQuery` - ATM-owned `AtmLogRecord` - ATM-owned `AtmLogSnapshot` - ATM-owned 
`AtmObservabilityHealth` - an ATM-owned synchronous `LogTailSession` Required boundary responsibilities: - `ObservabilityPort::emit(...)` - `ObservabilityPort::query(...)` - `ObservabilityPort::follow(...)` - `ObservabilityPort::health(...)` The exact ATM-owned projected types and object-safe follow-session split are defined in: - [`docs/atm-core/design/sc-observability-integration.md`](./atm-core/design/sc-observability-integration.md) Initial-release boundary rulings: - this boundary is intentionally ATM-local; it does not attempt to model future hook-driven or `schooks`-orchestrated observability concerns - the health contract remains intentionally closed at: - `Healthy` - `Degraded` - `Unavailable` - public ATM observability projections must not expose raw `serde_json::Value` / `Map` directly ### 14.2 Shared Crate Usage Rules Implementation rules: - `atm-core` remains concrete-crate-neutral and consumes only the injected boundary - `atm` initializes the shared logger exactly once per process - the shared file sink is the authoritative retained log store for `atm log` - the shared console sink remains opt-in so it does not contaminate normal command output - the initial-release dependency is the published crates.io version `sc-observability = "1.0.0"` ### 14.3 Failure Diagnostic Rules Required diagnostic behavior: - CLI bootstrap failures must be logged before process exit - CLI parse/validation failures that occur before a core service runs must be logged before process exit - retained command-service failures must emit structured failure diagnostics with stable ATM-owned error codes - degraded recovery warnings that continue the command must also log stable error codes - command success-only logging is insufficient for the retained architecture ## 15. 
Error Model Root public error: ```rust pub struct AtmError { pub code: AtmErrorCode, pub kind: AtmErrorKind, pub message: String, pub recovery: Option, pub source: Option>, } ``` ```rust pub enum AtmErrorCode { // single central registry re-exported from crates/atm-core/src/error_codes.rs } ``` Required families: - config - missing document - address - identity - team not found - agent not found - mailbox read - mailbox write - file policy - validation - serialization - timeout - observability emit - observability query - observability health Every public error must include: - a stable ATM-owned error code - a stable class - human-readable cause - recovery guidance when the user can act The single source of truth for ATM-owned error codes is: - [`atm-error-codes.md`](./atm-error-codes.md) Persisted-data errors should additionally carry file/entity/parser context so CLI surfaces can report the exact failing document and scope. Phase Q error-model rules: - `AtmErrorCode` must not use wildcard or catch-all variants where a more specific code can be named - every documented `AtmErrorCode` must carry one recoverability classification in the central registry so CLI, daemon, and doctor surfaces can reason about retry vs operator-action vs fail-closed behavior - pattern matches over `AtmErrorCode` at module/crate boundary surfaces must be exhaustive; wildcard `_` match arms are not permitted ## 16. Trait Policy The initial rewrite should avoid public extension traits. If a trait becomes necessary: - prefer a sealed trait - verify object safety before stabilization Phase Q boundary rule: - all I/O-owning boundary traits are sealed by default - opening a boundary for external implementation requires explicit design review and crate-level documentation of the exception ## 17. 
Testing Strategy

`atm-core` tests:

- address parsing
- config precedence
- tolerant team-config parsing for compatibility-only schema drift
- precise persisted-data diagnostics for non-recoverable config failures
- bridge hostname resolution for merged inbox reads
- settings resolution
- hook identity resolution
- file policy behavior
- team membership validation
- tolerant inbox parsing
- origin-inbox merge
- atomic append behavior
- duplicate suppression
- read-time duplicate collapse by `message_id`
- workflow axis classification
- workflow axis transitions
- task-linked ack-required classification
- seen-state behavior
- timeout behavior
- ack transition behavior
- clear eligibility behavior
- pending-ack clear override behavior
- observability port emission behavior
- observability port query/filter behavior
- observability port failure behavior
- doctor health projection behavior

`atm` tests:

- clap parsing
- JSON output shape
- human-readable output snapshots
- send/read/ack/clear integration behavior
- `atm log` integration behavior
- `atm doctor` integration behavior
- `atm teams` integration behavior
- `atm members` integration behavior

## 18. Mailbox File Locking (Phase M)

### 18.1 Problem Statement

`append_message` in `mailbox/mod.rs:23-27` performs an unlocked read-modify-write:

1. `read_messages(path)`: reads and deserializes the full inbox
2. `messages.push(envelope)`: appends the new record in memory
3. `atomic::write_messages(path, &messages)`: writes to temp file, fsyncs, renames over original

Step 3 is atomic with respect to partial writes but not concurrent callers. Two concurrent callers can both complete step 1 before either reaches step 3; the later rename silently overwrites the earlier, losing its appended message. The same race affects read writeback, ack transition, and clear set replacement.
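The lost-update interleaving can be reproduced deterministically without threads. The sketch below uses a hypothetical in-memory `Inbox` as a stand-in for the inbox file; the `Inbox` type and both function names are assumptions made for illustration, not ATM APIs.

```rust
/// In-memory stand-in for one inbox file (hypothetical; the real inbox lives on disk).
#[derive(Debug, Default)]
pub struct Inbox {
    contents: Vec<String>,
}

impl Inbox {
    pub fn with_messages(messages: &[&str]) -> Self {
        Inbox {
            contents: messages.iter().map(|m| m.to_string()).collect(),
        }
    }

    /// Step 1: read the full inbox.
    pub fn read_messages(&self) -> Vec<String> {
        self.contents.clone()
    }

    /// Step 3: replace the whole inbox in one atomic swap.
    pub fn write_messages(&mut self, messages: Vec<String>) {
        self.contents = messages;
    }
}

/// Both callers finish step 1 before either reaches step 3.
pub fn interleaved_unlocked_appends(inbox: &mut Inbox, first: &str, second: &str) -> Vec<String> {
    let mut snapshot_a = inbox.read_messages(); // caller A, step 1
    let mut snapshot_b = inbox.read_messages(); // caller B, step 1
    snapshot_a.push(first.to_string()); // caller A, step 2
    snapshot_b.push(second.to_string()); // caller B, step 2
    inbox.write_messages(snapshot_a); // caller A, step 3
    inbox.write_messages(snapshot_b); // caller B, step 3 overwrites A's result
    inbox.read_messages()
}

/// Each caller completes all three steps before the next one starts,
/// which is the ordering a per-inbox lock would enforce.
pub fn serialized_appends(inbox: &mut Inbox, first: &str, second: &str) -> Vec<String> {
    for message in [first, second] {
        let mut snapshot = inbox.read_messages();
        snapshot.push(message.to_string());
        inbox.write_messages(snapshot);
    }
    inbox.read_messages()
}
```

In the interleaved case each individual write is still atomic, yet the first caller's message is gone; atomic replacement alone does not prevent the lost update.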
### 18.2 Locking Primitive Decision **Decision: Use the `fs2` crate.** Rationale: - `fs2` provides `FileExt::lock_exclusive()` and `FileExt::try_lock_exclusive()` which map to `flock(2)` on Unix and `LockFileEx` on Windows - 98M+ downloads, maintained, compatible with the project's MSRV - avoids maintaining separate `cfg(unix)` / `cfg(windows)` implementations - the current `atm-core` Cargo.toml already carries `libc` and `windows-sys`, but only as low-level building blocks, not as a cross-platform mailbox-locking API Alternative rejected: direct `libc::flock` + `windows-sys::LockFileEx` — more control but duplicates what `fs2` already provides correctly. ### 18.3 Lock Architecture ``` +-----------------------+ | MailboxLockGuard | | (RAII, Drop releases) | +----------+------------+ | +----------v------------+ | lock.rs::acquire() | | open/create sentinel | | fs2::try_lock_excl() | +----------+------------+ | +-------------------+-------------------+ | | Unix: flock(fd, LOCK_EX) Windows: LockFileEx(handle) ``` - **Sentinel**: `{inbox_path}.lock` — pid-bearing runtime artifact, created lazily, removed on `MailboxLockGuard` drop, and best-effort evicted when the recorded pid is no longer alive - **Granularity**: per-inbox-file — concurrent sends to different recipients never contend - **Lock lifetime**: acquired before `read_messages`, held through `atomic::write_messages` durability boundary (temp-file write, rename, and any parent-directory sync), then the sentinel is unlinked and the guard is released - **Timeout**: bounded retry loop with `try_lock_exclusive()` + 50ms sleep, default 5s; on expiry returns `AtmError { code: MailboxLockTimeout }` - **Error classification**: only genuine "lock busy" results participate in the retry loop. Non-contention I/O and OS failures from the lock path fail fast as `MailboxLockFailed` with filesystem/permissions recovery guidance instead of being collapsed into a timeout. 
- **Cooperative limitation**: `fs2` locks are advisory and only coordinate ATM processes that participate in the same locking protocol. Direct file edits or other tools that bypass ATM locking are outside the protection boundary. This is an accepted limitation for the ATM shared-inbox model. ### 18.3.1 Stale-Sentinel Sweep Predicate The current `path.extension() == "lock"` filter is too narrow because it misses rotated sentinels such as `inbox.json.lock.old`. The executed P.10 design must match only filenames that still carry the sentinel suffix chain: ```rust let is_lock_sentinel_candidate = path .file_name() .and_then(|name| name.to_str()) .is_some_and(|name| name.ends_with(".lock") || name.contains(".lock.")); ``` Why this exact predicate: - `ends_with(".lock")` preserves the ordinary live sentinel path - `contains(".lock.")` catches rotated forms such as `.lock.old` and `.lock.replaced` - basename-only matching avoids broad false positives from parent directories - rejecting generic `contains("lock")` avoids matching unrelated files such as `locksmith.txt` Eviction remains conservative: - read the candidate contents as the documented `pid[:token]` owner record - if parsing fails, leave the file in place - if `process_is_alive(pid)` is true, leave the file in place - only then attempt removal This is still best-effort cleanup, not a second ownership protocol. The actual authority boundary remains the later `fs2` advisory lock plus the existing `lock_path_matches_file(...)` identity recheck after acquisition. 
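As a rough illustration of how the predicate and the conservative eviction checks compose, the sketch below wraps them in a pure decision function. `SweepDecision`, `parse_owner_pid`, and `sweep_decision` are hypothetical names, and the liveness check is injected as a closure so the logic can be exercised without real processes.

```rust
use std::path::Path;

/// The basename-only predicate from this section.
pub fn is_lock_sentinel_candidate(path: &Path) -> bool {
    path.file_name()
        .and_then(|name| name.to_str())
        .is_some_and(|name| name.ends_with(".lock") || name.contains(".lock."))
}

/// Parses the pid from a `pid[:token]` owner record; `None` when it does not parse.
pub fn parse_owner_pid(contents: &str) -> Option<u32> {
    contents.trim().split(':').next()?.parse().ok()
}

/// Hypothetical outcome type for one sweep candidate.
#[derive(Debug, PartialEq, Eq)]
pub enum SweepDecision {
    NotSentinel,
    KeepUnparseable,
    KeepOwnerAlive,
    Evict,
}

/// Applies the checks in the documented order; removal is attempted only for `Evict`.
pub fn sweep_decision(
    path: &Path,
    contents: &str,
    process_is_alive: impl Fn(u32) -> bool,
) -> SweepDecision {
    if !is_lock_sentinel_candidate(path) {
        return SweepDecision::NotSentinel;
    }
    match parse_owner_pid(contents) {
        None => SweepDecision::KeepUnparseable,
        Some(pid) if process_is_alive(pid) => SweepDecision::KeepOwnerAlive,
        Some(_) => SweepDecision::Evict,
    }
}
```

Keeping the decision separate from the actual `remove_file` call makes the "leave it in place when unsure" rule easy to test on every platform.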
Platform note: - Windows may not permit renaming a live locked sentinel the same way Unix does, so the broadened sweep is not a live-handoff mechanism - the predicate exists to clean up crash leftovers, repair leftovers, or externally rotated sentinel artifacts that otherwise evade the old exact `.lock` extension test ### 18.3.2 Read-Only Filesystem Classification P.10 should add a dedicated read-only-filesystem mailbox-lock code instead of overloading the generic non-contention lock failure bucket. Required platform mapping: - Linux: `libc::EROFS` (`30`) - macOS: `libc::EROFS` (`30`) - Windows: `windows_sys::Win32::Foundation::ERROR_WRITE_PROTECT` (`19`) The classification helper belongs at the lock-path error-conversion boundary, not duplicated ad hoc at individual call sites. The intended shape is: ```rust fn is_readonly_filesystem_error(error: &io::Error) -> bool ``` and then a shared mapper such as: ```rust fn mailbox_lock_path_error( operation: &'static str, lock_path: &Path, error: io::Error, ) -> AtmError ``` Call-graph decisions: - `open_lock_file(...)` maps read-only failures directly to `MailboxLockReadOnlyFilesystem` - `write_lock_owner_record(...)` maps both truncate and write failures through the same helper - `remove_lock_sentinel_with_retry(...)` explicitly does not retry read-only failures before the current permission-denied/backoff logic - public `sweep_stale_lock_sentinels(...)` surfaces the read-only diagnostic to the caller rather than logging and continuing - pre-acquisition stale eviction inside `acquire(...)` propagates the read-only diagnostic when the cleanup path hits it, because subsequent owner record writes cannot succeed on the same mount; this early-exit happens before any later `try_lock_exclusive()` attempt - each retry iteration must classify raw OS errors before consulting the timeout budget: `EROFS` / `ERROR_WRITE_PROTECT` exits immediately as `MailboxLockReadOnlyFilesystem`, while non-contention path failures such as `ENOSPC`, 
`EMFILE`, and `ESTALE` exit immediately as `MailboxLockFailed` - `MailboxLockGuard::drop` still warns only, because the successful mailbox mutation has already completed and `Drop` cannot change the command result Recommended recovery text: - message includes the attempted operation and lock path - recovery tells the operator to remount or move the ATM home to a writable filesystem before retrying, not merely to wait for another process Reason for a new code instead of enriching `MailboxLockFailed`: - read-only filesystem state is a stable, operator-actionable class with different remediation from ACL failures or transient path I/O - the retry policy must branch on this distinction - QA and integration tests need a stable machine-readable contract for it ### 18.4 Integration: Single-File Helper + Multi-File Lock Set `append_message` is a true single-file read-modify-write and should use one shared helper: ```rust pub fn locked_read_modify_write( path: &Path, timeout: Duration, mutate: F, ) -> Result<(), AtmError> where F: FnOnce(&mut Vec) -> Result<(), AtmError>, { let _guard = lock::acquire(path, timeout)?; let mut messages = read_messages(path)?; mutate(&mut messages)?; atomic::write_messages(path, &messages) } ``` That helper is the right shape for: - `append_message` - the missing-config team-lead notice path, because it also calls `append_message` It is **not** sufficient by itself for `read`, `ack`, and `clear`, because those commands call `load_source_files(...)` and compute a merged surface across the requested inbox plus any origin inboxes before writing back. 
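The `lock::acquire(path, timeout)` call in the helper above depends on the bounded retry loop and error classification described in §18.3 and §18.3.2. The following is a minimal sketch of that loop under stated assumptions: `Attempt`, `LockError`, `READ_ONLY_OS_CODE`, and `acquire_with_retry` are hypothetical names, and the single lock attempt is injected as a closure instead of calling `fs2` directly.

```rust
use std::io;
use std::thread;
use std::time::{Duration, Instant};

/// Raw OS code for a read-only filesystem, following the platform mapping in §18.3.2.
#[cfg(windows)]
pub const READ_ONLY_OS_CODE: i32 = 19; // ERROR_WRITE_PROTECT
#[cfg(not(windows))]
pub const READ_ONLY_OS_CODE: i32 = 30; // EROFS

pub fn is_readonly_filesystem_error(error: &io::Error) -> bool {
    error.raw_os_error() == Some(READ_ONLY_OS_CODE)
}

/// Outcome of one non-blocking lock attempt (hypothetical stand-in).
pub enum Attempt<G> {
    Acquired(G),
    Busy,
    Failed(io::Error),
}

/// Simplified stand-ins for the three mailbox-lock error codes.
#[derive(Debug, PartialEq, Eq)]
pub enum LockError {
    Timeout,
    ReadOnlyFilesystem,
    Failed,
}

/// Retries only on `Busy`; every other failure is classified and returned
/// before the timeout budget is consulted.
pub fn acquire_with_retry<G>(
    timeout: Duration,
    sleep: Duration,
    mut attempt: impl FnMut() -> Attempt<G>,
) -> Result<G, LockError> {
    let deadline = Instant::now() + timeout;
    loop {
        match attempt() {
            Attempt::Acquired(guard) => return Ok(guard),
            Attempt::Failed(error) if is_readonly_filesystem_error(&error) => {
                return Err(LockError::ReadOnlyFilesystem)
            }
            Attempt::Failed(_) => return Err(LockError::Failed),
            Attempt::Busy => {
                if Instant::now() >= deadline {
                    return Err(LockError::Timeout);
                }
                thread::sleep(sleep);
            }
        }
    }
}
```

With the documented defaults this would be called with a 5s timeout and a 50ms sleep; tests can pass much smaller values to stay bounded.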
To make those paths concurrency-safe, Phase M needs a second abstraction: ```rust pub fn acquire_many_sorted( paths: impl IntoIterator, timeout: Duration, ) -> Result, AtmError> ``` Required usage: - discover the full source-file set first - dedupe paths and sort them deterministically by canonical path string - source-file discovery must finish before the first inbox read - legitimately absent inbox paths at discovery time are excluded from the lock set rather than locked speculatively - source discovery must fail closed for mutation commands: unreadable `read_dir(...)` entries or equivalent enumeration faults are treated as source set instability, not as warnings that can be skipped - source discovery faults abort the command before lock acquisition; mutation commands never attempt a partial lock set after a discovery failure - acquire all locks against one total timeout budget - if any acquisition fails, drop every earlier lock immediately and abort before any source-file read - if a discovered file disappears or becomes unreadable after lock planning but before `load_source_files(...)` completes, abort without persisting any partial state; this remains a normal operator-actionable file-read failure, not a partial-lock degraded mode - then call `load_source_files(...)` - hold every guard until every source writeback completes This intentionally preserves a single logical merged-surface decision boundary for `read`, `ack`, and `clear`. Those commands are not allowed to degrade into partial-lock best-effort mutation, because doing so would mix snapshots from different logical times and make writeback correctness nondeterministic. ### 18.4.1 Cooperative Locking Contract For `ack_mail` `ack_mail` sometimes needs to mutate a source inbox set and append the reply to another inbox that was not part of the initial actor-source set. The accepted implementation does not use a subset-lock then upgrade-to-superset sequence. Instead it uses: 1. 
an unlocked observational snapshot of the actor-source set 2. unlocked validation of the pending-ack state and reply inbox path 3. one final acquisition of the full sorted superset that includes the reply inbox 4. re-discovery of source paths, reload of current source files, and re-validation of the pending-ack state under that final lock set 5. persistence of both the updated source message and reply while the superset locks are still held This avoids the deadlock risk of trying to expand a held subset into a larger sorted lock set. The unlocked preflight is acceptable only because `ack_mail` does not mutate from that preflight snapshot: the shared commit helper reloads and re-validates both the source-path set and the pending-ack state under the final superset lock before writing anything. If the state drifted, `ack_mail` aborts instead of mutating a stale snapshot. | Caller | Lock required | |--------|--------------| | `append_message` | `locked_read_modify_write` | | `send` missing-config notice append | `append_message` coverage | | source discovery fault (`read` / `ack` / `clear`) | abort before lock acquisition; no partial lock set attempted | | `read` writeback | initial selection load is unlocked; acquire the multi-file lock set only for the reload + writeback phase | | `ack` transition + reply | unlocked preflight, then one final cooperative superset lock including reply inbox; see §18.4.1 | | `clear` set replacement | multi-file lock set held from first read through persist | | `read_messages` (read-only, no writeback) | No | ### 18.4.2 Read-Only Vs Read-Modify-Write ATM now treats mailbox access as two distinct patterns: 1. Read-only snapshot: - discover source inbox paths - load and classify the current merged surface without mailbox locks - use this for display-only selection and timeout polling 2. 
Read-modify-write: - re-acquire the deterministic source lock set only when a command is about to persist mailbox state - re-discover and re-validate the source path set under lock - reload the mailbox state, recompute selection, apply transitions, and persist while the lock set is still held This keeps non-mutating reads out of the lock path while preserving a stable writeback boundary for commands that actually rewrite inbox files. Executed command mapping: - `read` uses an unlocked observational snapshot for display selection and timeout polling, then enters the shared lock+reload+recompute path only when display-state mutation is actually required - `ack` uses an unlocked preflight to resolve the reply target and candidate source message, then acquires one final sorted superset lock and re-validates the pending-ack state under that lock set before writing source/reply state - mutating `clear` acquires the shared lock plan before its mutating reread and holds it through removal computation, mailbox replacement, and workflow-state updates; `clear --dry-run` remains observational only ### 18.4.3 Executed Mailbox Workflow Migration Phase P completed the mailbox workflow-state migration. P.4 delivered the sidecar move, and the current architecture documents the post-P.5 executed state. 
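For reference, the deterministic lock planning that §18.4 requires of `read`, `ack`, and `clear` (dedupe, sort by path string, one total timeout budget, release everything on failure) might look like the sketch below. `plan_lock_set` and `acquire_all` are hypothetical names, the per-path acquisition is injected as a closure, and the inputs are assumed to be canonicalized already.

```rust
use std::path::{Path, PathBuf};
use std::time::{Duration, Instant};

/// Sorts by the raw path string and removes duplicates.
/// Assumes the caller has already canonicalized every path.
pub fn plan_lock_set<I>(paths: I) -> Vec<PathBuf>
where
    I: IntoIterator<Item = PathBuf>,
{
    let mut plan: Vec<PathBuf> = paths.into_iter().collect();
    plan.sort_by(|a, b| a.as_os_str().cmp(b.as_os_str()));
    plan.dedup_by(|a, b| a.as_os_str() == b.as_os_str());
    plan
}

/// Acquires every lock in plan order against one shared deadline.
/// `acquire_one` receives the budget that is still left for that path.
pub fn acquire_all<G, E>(
    plan: &[PathBuf],
    total_timeout: Duration,
    mut acquire_one: impl FnMut(&Path, Duration) -> Result<G, E>,
) -> Result<Vec<G>, E> {
    let deadline = Instant::now() + total_timeout;
    let mut guards = Vec::with_capacity(plan.len());
    for path in plan {
        let remaining = deadline.saturating_duration_since(Instant::now());
        // On error `?` returns early and `guards` is dropped,
        // which releases every lock acquired so far.
        guards.push(acquire_one(path, remaining)?);
    }
    Ok(guards)
}
```

Because every command sorts the same way, two commands that need overlapping inbox sets always request the shared paths in the same order, which is what rules out lock-order deadlocks.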
Current executed rule: - ATM-owned workflow durability for identified mailbox messages is written to `.claude/teams//.atm-state/workflow/.json` - `send` authors forward `metadata.atm.messageId` ULIDs for ATM-authored records and seeds the corresponding sidecar entry - QA criterion (a) for ULID assignment is verified through `send_mail` coverage; the helper that writes `metadata.atm.messageId` remains an internal `pub(crate)` workflow API and is not exposed to integration tests - `read` projects mailbox display state from the sidecar and only rewrites the inbox file for legacy compatibility records that still lack a stable ATM identity - `ack` writes the reply inbox file plus the source/reply workflow-state files under one deterministic lock plan - `clear` classifies removable messages from the projected workflow view and removes matching workflow-state entries when the inbox record is deleted Current executed limitation: - `send` and the missing-config team-lead notice path still seed workflow state via an atomic owner-routed `load -> mutate -> save` sequence instead of a dedicated freshness-proving helper - that means the sidecar family is already the source of truth, but concurrent same-recipient send-side seeding is not yet hardened to the same lock/reload/recompute standard used by mailbox read/ack/clear - P.6 is the tracked hardening continuation for that specific gap ### 18.5 New Error Codes - `MailboxLockFailed` / `ATM_MAILBOX_LOCK_FAILED` — lock-path creation, open, or acquisition failed for a non-contention filesystem or OS reason - `MailboxLockReadOnlyFilesystem` / `ATM_MAILBOX_LOCK_READ_ONLY_FILESYSTEM` — the lock path or lock sentinel lives on a read-only filesystem, so ATM cannot create, update, or remove the required mailbox-lock artifact - `MailboxLockTimeout` / `ATM_MAILBOX_LOCK_TIMEOUT` — lock not acquired within timeout - New `AtmErrorKind::MailboxLock` variant in `error.rs` ### 18.6 Shared Mutable File Atomicity Mailbox locking closes the 
concurrent lost-update race for inbox files, but it is only one part of the persistence contract. Phase M also treats atomic file replacement as a repo-wide rule for shared mutable ATM-owned structured state. Scope: - live inbox files - team `config.json` - ATM-owned task-bucket files restored or rewritten by team recovery - `.highwatermark` - shared persisted coordination/state files such as send-alert or restore-progress markers when they carry ATM-owned operator state - any future ATM-owned JSON/JSONL/state file rewritten by more than one ATM process or operator workflow Architectural rule: - no live shared mutable structured file may be rewritten in place - writers must use a temp-file + fsync + rename style replacement on the same filesystem, or a documented equivalent with the same atomicity guarantee - for rename-based replacement, the helper must also fsync the parent directory after the rename whenever the platform supports directory-sync semantics; this is the Phase M crash-durability boundary for mailbox/config/shared-state replacement - `atm-core` must own one shared low-level atomic persistence primitive and a small set of typed writer helpers layered on top of it, rather than open-code file replacement logic at individual call sites - existing helpers such as `atomic::write_messages(...)` and `write_team_config(...)` are the preferred integration points; new shared state added by Phase M should extend that helper pattern with typed helpers for task-bucket, highwatermark, and shared coordination files instead of open-coding direct `fs::write(...)` mutations Single-write-path guardrail: - each live file family should have one owning write boundary - low-level atomic replacement belongs in `persistence.rs` - file-family semantics belong in one owner-layer helper such as mailbox or team-admin - command handlers should express intent and call the owner-layer helper rather than assemble write mechanics locally - if a new write precondition appears, the 
default response should be to extend the shared helper or owner-layer helper rather than introducing a parallel write path Current owner-layer boundaries: - Claude-owned inbox compatibility surface: `mailbox::store::observe_source_files(...)` for observational snapshots, `mailbox::store::with_locked_source_files(...)` for shared mailbox read/ack/clear lock+reload orchestration, and `mailbox::store::commit_mailbox_state(...)` / `mailbox::store::commit_source_files(...)` as the persistence leaf - ATM-owned source-of-truth state: `workflow::{load_workflow_state(...), save_workflow_state(...), project_envelope(...), remember_initial_state(...), apply_projected_state(...), remove_message_state(...)}`, `read::seen_state::save_seen_watermark(...)`, `send::alert_state::{register_missing_team_config_alert(...), clear_missing_team_config_alert(...), save(...)}`, and `team_admin::write_team_config(...)` - ATM-owned restore/task state: `team_admin::restore::restore_task_state_from_backup(...)`, `team_admin::restore::write_restore_marker(...)`, and `team_admin::restore::clear_restore_marker(...)` - staging/scratch artifacts: `team_admin::restore::prepare_restore_workspace(...)` and `team_admin::restore::cleanup_restore_workspace(...)` Current architectural limitation: - mailbox replacement is atomic and lock-coordinated for concurrent ATM writers, but it is not yet compare-and-swap against non-cooperating Claude writers - therefore the current shared-inbox rewrite path is still a compatibility boundary, not the ideal long-term source-of-truth architecture for ATM-local workflow state - separately, send-side workflow seeding still lacks a dedicated freshness boundary across concurrent same-recipient sends; that is a post-P.5 hardening gap rather than a reason to move workflow durability back into Claude-owned inbox records This rule intentionally applies beyond mailbox files so future work does not reintroduce partial-write or torn-state risks through backup/restore or shared 
auxiliary state paths.

### 18.6.1 Deterministic Locking-Test Strategy

The follow-up locking fixes require failure-path tests, but those tests must not depend on races or hang-prone construction.

Test strategy:

- contention tests use a helper thread/process that acquires the target lock and signals readiness through a channel or barrier
- the command under test uses a short bounded lock timeout
- assertions use `recv_timeout(...)`, elapsed-time ceilings, and scoped guard teardown instead of indefinite `join()`/sleep loops
- source-discovery fault tests use a deterministic seam (for example, an injected directory-entry iterator/fault source) to force an unreadable origin entry without depending on filesystem timing or permission quirks
- non-contention lock error tests use a deterministic seam around the lock attempt/classifier rather than trying to synthesize platform-specific OS failures opportunistically
- durability tests validate helper sequencing and error propagation through deterministic seams; they do not attempt literal crash simulation in unit or integration test runs

This is intentionally stricter than the Phase M success-path deadlock tests so CI remains bounded and repeatable across macOS, Linux, and Windows.

## 19. Restore Transaction Atomicity (Phase M)

### 19.1 Problem Statement

`restore_team` in `team_admin.rs` currently mutates in this order:

1. Copy inbox files to the live inbox directory
2. Restore task bucket
3. Recompute highwatermark
4. Write `config.json`

If the process crashes between steps 1 and 4, inbox files for members not in config exist with no detection mechanism.

### 19.2 Revised Restore Ordering (Config-Last with Staging)

```
1. Validate backup and compute restore plan (no mutations)
2. Write .restore-in-progress marker to team directory
3. Stage inbox files to .restore-staging/inboxes/
4. Move staged files to live inboxes/ (fs::rename, atomic same-filesystem)
5. Restore task bucket
6. Recompute highwatermark
7. Write config.json + fsync (atomic temp+rename via write_team_config)
8. Remove .restore-in-progress marker
```

Key properties:

- crash at steps 2-6: config.json unchanged, extra inbox files harmless, marker signals re-run
- read-only failure during the pre-copy stale-sentinel sweep aborts before live inbox replacement begins, preserving the pre-restore team state
- crash at step 7: config write is itself atomic via the existing `write_team_config(...)` temp-file + rename path, so no partial config write is possible
- crash at step 8: config is written, stale marker cleaned up by next doctor/restore run

### 19.3 Staging Directory

- location: `{team_dir}/.restore-staging/inboxes/`
- lifecycle: created at step 3, contents moved at step 4, directory removed after config write
- failure path: staging directory cleaned up, no config written

### 19.4 Doctor Integration

New check: scan for `.restore-in-progress` in team directories.

- Severity: warning
- Recovery guidance: "A previous `atm teams restore` was interrupted. Re-run the restore command to complete it, or remove the marker file manually if the restore is no longer needed."

If `.restore-staging/` already exists at restore start, the implementation must either clean it before staging begins or fail with actionable recovery text. It must never merge old staging contents with the new restore attempt.

## 20. Phase M Minor Architecture Changes

### 20.1 AtmError Display Backtrace

`AtmError` keeps the user-facing `Display` output concise:

- `Display` renders only the primary message and recovery text
- captured backtraces stay available through Debug output and a dedicated accessor on `AtmError`

This avoids multi-kilobyte backtrace blobs in normal CLI/log output while preserving full diagnostic depth for explicit debugging.

### 20.2 resolve_actor_identity Consolidation

Duplicate function in `ack/mod.rs`, `clear/mod.rs`, and `read/mod.rs` moves to `identity/mod.rs` as `pub(crate) fn resolve_actor_identity(...)`.
All three call sites update to use the shared helper while preserving the existing override -> hook -> runtime identity resolution order.

### 20.3 normalize_json_number Panic Removal

`normalize_json_number(...)` must not panic on untrusted numeric text. Phase M replaces the old panic path with graceful fallback: on exponent parse failure or unsupported exponent range, return the raw string unchanged and emit `tracing::warn!`. A library function must not panic on potentially untrusted input.

### 20.4 Error-Surface Audit Methodology

Phase M uses an explicit audit methodology for `REQ-CORE-ERROR-DOC-001` and `REQ-CORE-ERROR-RECOVERY-001` so signoff does not depend on ad hoc review.

Method:

- grep the production source tree for `expect(` and bare `AtmError` construction sites
- review the resulting inventory manually against the explicit Phase M audit inventory in the sprint plan
- exclude:
  - test-only code
  - `#[cfg(test)]` modules embedded in production files
  - intentional invariant assertions that do not represent operator-actionable failures
- keep the remaining production-path sites in scope for either:
  - `# Errors` documentation updates
  - `.with_recovery()` additions
  - panic removal or other structural correction when the failure mode is not acceptable in library code

The initial planning audit identified 16 production-path `expect(...)` sites requiring review under this methodology. Phase M treats that number as a starting inventory, not as a substitute for a fresh grep during implementation.

### 20.5 Phase L.7 Build-On Notes

Phase M builds on the already-landed L.7 runtime surface (`team_members`, `aliases`, `post_send_hook`, doctor identity drift warning). Phase M does not re-open that feature set; it only adds the remaining concurrency, restore, and code-review hardening needed for 1.0.

### 20.6 Security Boundaries (Phase O)

Phase O adds three architecture-level hardening decisions:

1. **Address validation is the trust boundary for path construction**
   - team and agent names must be validated before any helper constructs `{ATM_HOME}/.claude/teams/{team}` or `{agent}.json`
   - `address.rs` and `home.rs` together form the boundary; downstream code must not attempt ad hoc sanitization after path joins are already built
2. **PID-file locking remains conservative by design**
   - the send-alert lock uses a PID-file-style stale-lock check
   - PID reuse is an accepted limitation: a reused PID can make a stale lock look alive, so ATM may conservatively preserve that stale lock until timeout or manual cleanup
   - this limitation favors false-alive availability loss over false-dead lock eviction
3. **Atomic writes must use collision-proof temp names**
   - temp files for atomic replacement must use UUID-based suffixes instead of timestamp-only suffixes
   - this keeps same-process rapid writes to the same target path from colliding on the temp-file name while preserving the target basename for operator debugging

## 21. Phase Q Runtime Architecture

Phase Q supersedes the mailbox-lock architecture as the target design for ATM mail correctness. The file-based mailbox line remains an interim compatibility surface only.

### 21.1 Authoritative State

ATM moves to a split state model:

- SQLite is the authoritative durable store for:
  - messages
  - ack/task state
  - read/clear visibility state
  - team roster
- daemon memory is the authoritative live runtime view for:
  - current agent status
  - `pid`: durable SQLite truth cached in daemon memory as the primary liveness field
  - `last_active_at`: daemon-memory-only runtime state used for live overlays

SQLite may persist last-observed status for diagnostics, but that snapshot is not the live truth.

### 21.1.1 SQLite Schema Contract

The Phase Q target architecture uses one authoritative schema contract.
Minimum tables:

- `messages`
- `ack_state`
- `message_visibility`
- `tasks`
- `team_roster`
- `inbox_ingest`

Minimum key rules:

- `message_key` is the canonical ATM durable message identity
- `message_key` must be source-typed:
  - `atm:` for ATM-authored rows
  - `ext:` for imported external rows without ATM ids
- imported legacy `message_id` / forward `metadata.atm.messageId` values map into that canonical identity model rather than replacing it

Minimum index/constraint rules:

- unique identity enforcement on `message_key`
- dedupe index for imported external/legacy identities
- lookup indexes for:
  - recipient/team mailbox projection
  - task lookup
  - visibility projection
  - ingest replay/high-water tracking

Minimum `team_roster` durable fields:

- `team_name`
- `agent_name`
- `recipient_pane_id TEXT NULL`
  - authoritative post-send-hook pane mapping when known
  - updated through the roster/registration path rather than rediscovered from local files after Phase Q migration
- `pid INTEGER NULL`
  - durable roster truth for the current owning process identity
  - cached by the daemon as the primary liveness field

### 21.1.2 SQLite Runtime Invariants

The SQLite runtime contract is part of the architecture, not an implementation detail.

Required invariants:

- `journal_mode = WAL`
- `foreign_keys = ON`
- mutating ATM flows use explicit transactions
- no normal command path relies on implicit autocommit as its correctness model

### 21.1.3 Crash Recovery And Replay

Crash recovery must preserve durable truth before compatibility export.
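As a rough illustration of "durable truth before compatibility export", the sketch below commits a row first, attempts export second, and replays any still-pending row by its key after a simulated restart. Everything here is an assumption made for illustration: `FakeStore`, `send`, and `replay` are hypothetical names, the store is an in-memory map standing in for SQLite, and retry expiry is not modeled.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory stand-in for the durable store (the real store is SQLite).
#[derive(Default)]
struct FakeStore {
    /// message_key -> (body, already exported?)
    rows: BTreeMap<String, (String, bool)>,
}

impl FakeStore {
    /// Durable commit; idempotent on `message_key`.
    fn commit(&mut self, message_key: &str, body: &str) {
        self.rows
            .entry(message_key.to_string())
            .or_insert_with(|| (body.to_string(), false));
    }

    fn mark_exported(&mut self, message_key: &str) {
        if let Some(row) = self.rows.get_mut(message_key) {
            row.1 = true;
        }
    }

    /// Rows that were committed but never exported, as (message_key, body).
    fn pending(&self) -> Vec<(String, String)> {
        self.rows
            .iter()
            .filter(|(_, row)| !row.1)
            .map(|(key, row)| (key.clone(), row.0.clone()))
            .collect()
    }
}

/// Commit first, then attempt export. A failed export leaves the row pending.
fn send<F>(store: &mut FakeStore, export: &mut F, message_key: &str, body: &str)
where
    F: FnMut(&str, &str) -> Result<(), String>,
{
    store.commit(message_key, body);
    if export(message_key, body).is_ok() {
        store.mark_exported(message_key);
    }
}

/// Re-export every pending row, keyed by `message_key`; returns how many succeeded.
fn replay<F>(store: &mut FakeStore, export: &mut F) -> usize
where
    F: FnMut(&str, &str) -> Result<(), String>,
{
    let mut exported = 0;
    for (key, body) in store.pending() {
        if export(&key, &body).is_ok() {
            store.mark_exported(&key);
            exported += 1;
        }
    }
    exported
}

fn main() {
    let mut store = FakeStore::default();

    // Simulate an export failure immediately after the durable commit.
    let mut failing = |_: &str, _: &str| -> Result<(), String> { Err("export unavailable".into()) };
    send(&mut store, &mut failing, "atm:0001", "hello");
    println!("pending after failed export: {}", store.pending().len());

    // Simulate the next start-up: replay picks the row up from durable state.
    let mut working = |_: &str, _: &str| -> Result<(), String> { Ok(()) };
    let replayed = replay(&mut store, &mut working);
    println!("replayed: {replayed}, still pending: {}", store.pending().len());
}
```

The point of the shape is that the export step never holds the only copy of a message: a crash between commit and export loses nothing, because replay works from the committed rows.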
Required architectural rules:

- the ordering rule is `SQLite commit -> export / remote handoff`
- re-export/replay is keyed by durable `message_key`
- if daemon-managed retry/re-export state must survive crash, it is stored in SQLite with a bounded expiry/deadline rather than remaining RAM-only
- WAL checkpoint is part of graceful shutdown, but recovery correctness must not depend on graceful shutdown having succeeded
- persisted retry state must not become a long-lived remote outbox; expired retry rows fail closed during replay

### 21.2 Compatibility Surfaces

Claude-owned inbox JSONL files remain required for:

- Claude context injection
- compatibility with direct Claude-native writers

Architectural rule:

- JSONL is ingress/egress compatibility only
- JSONL is not ATM's authoritative durable mail state

`config.json` remains a team-ingress surface, but roster truth moves to SQLite.

### 21.3 Information Flow

There are three distinct paths:

1. Claude / compatibility path
   - Claude or legacy writers append JSONL
   - ATM imports through one owned inbox-ingress boundary
   - imported records become durable in SQLite
   - replay is idempotent and parseable rows are not silently dropped
2. Native agent path
   - native agent/plugin traffic does not use JSONL
   - native agents talk to the local daemon API
   - the daemon commits through the SQLite store boundary
3. Remote host path
   - cross-host delivery is daemon-to-daemon only
   - routing expands from `agent@team` to `agent@team.host`
   - sender-side daemons do not write remote host JSONL directly
   - successful remote delivery requires remote daemon acceptance

### 21.4 One Interface, Two Transport Implementations

Phase Q uses one daemon API with two production transport adapters plus one test transport:

- same-host: Unix domain socket
- cross-host: TCP/TLS
- tests: in-process `test-socket`

This is one protocol with multiple implementations, not multiple systems.
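A minimal sketch of "one protocol, multiple implementations" follows. It assumes a hypothetical object-safe `Transport` trait that hands every request to a single `Dispatcher`, and shows only an in-process queue standing in for `test-socket`; the `Request`/`Response` variants and all type names are invented for illustration and are not the ATM wire protocol.

```rust
use std::collections::VecDeque;

/// Hypothetical request/response types; the real daemon API is not specified here.
#[derive(Debug, Clone, PartialEq)]
enum Request {
    Send { to: String, body: String },
    Health,
}

#[derive(Debug, Clone, PartialEq)]
enum Response {
    Accepted,
    Healthy,
}

/// The single dispatch contract that every transport hands off to.
trait Dispatcher {
    fn dispatch(&mut self, request: Request) -> Response;
}

/// Object-safe transport boundary: adapters differ only in how requests arrive.
trait Transport {
    fn name(&self) -> &'static str;
    /// Take the next request, if any, and hand it straight to the dispatcher.
    fn serve_one(&mut self, dispatcher: &mut dyn Dispatcher) -> Option<Response>;
}

/// In-process stand-in for `test-socket`: a queue instead of real socket I/O.
struct TestSocket {
    inbound: VecDeque<Request>,
}

impl Transport for TestSocket {
    fn name(&self) -> &'static str {
        "test-socket"
    }

    fn serve_one(&mut self, dispatcher: &mut dyn Dispatcher) -> Option<Response> {
        let request = self.inbound.pop_front()?;
        Some(dispatcher.dispatch(request))
    }
}

/// Toy dispatcher that records accepted sends; real handlers would sit behind it.
struct RecordingDispatcher {
    accepted: Vec<String>,
}

impl Dispatcher for RecordingDispatcher {
    fn dispatch(&mut self, request: Request) -> Response {
        match request {
            Request::Send { to, body } => {
                self.accepted.push(format!("{to}: {body}"));
                Response::Accepted
            }
            Request::Health => Response::Healthy,
        }
    }
}

fn main() {
    // Callers hold the transport as a trait object, so adapters stay swappable.
    let mut transport: Box<dyn Transport> = Box::new(TestSocket {
        inbound: VecDeque::from(vec![
            Request::Send { to: "agent@team".into(), body: "hello".into() },
            Request::Health,
        ]),
    });
    let mut dispatcher = RecordingDispatcher { accepted: Vec::new() };

    while let Some(response) = transport.serve_one(&mut dispatcher) {
        println!("{} -> {:?}", transport.name(), response);
    }
    println!("accepted: {:?}", dispatcher.accepted);
}
```

Under this shape, a Unix-socket or TCP/TLS adapter would implement the same `Transport` trait and reuse the same dispatcher, which is why a test that drives the in-process queue exercises the same handler path as production traffic.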
Test-transport rule:

- `test-socket` implements the same dispatcher/handler contract without real socket I/O so subsystem and daemon-boundary tests can exercise the transport boundary in process

Remote-delivery semantics:

- bounded transient retry is acceptable for short intermittent failures
- there is no durable long-lived remote outbox
- if the remote host remains unreachable after the bounded retry window, send fails rather than leaving stale pending delivery behind
- sender-side daemons do not treat a remote send as delivered until the remote daemon accepts it

### 21.5 Singleton Daemon

The daemon is required at runtime, but it must remain thin.

Hard invariant:

- it must be impossible for two active ATM daemons to run on one host at the same time

Daemon responsibilities:

- transport listeners
- route selection
- live status cache
- daemon-facing diagnostics and health queries used by `atm doctor`
- watch/reconcile runtime if enabled

Daemon non-responsibility:

- it must not become the only home of ATM business logic

Auto-start path:

- production ATM commands first attempt to connect to the already-running daemon
- if the daemon is absent, the CLI/runtime path may perform exactly one auto-start attempt
- after one auto-start attempt, the CLI/runtime path retries connect once
- if the daemon remains unavailable, the command fails with a typed actionable error
- there is no silent fallback from the production path to direct SQLite or inbox-file access after auto-start failure

### 21.6 Strict I/O Ownership

Phase Q's key architectural rule is strict ownership of all external I/O.
Required ownership model:

- only the store subsystem touches SQLite
- only the inbox ingress/export subsystem parses or writes inbox JSONL
- only the config-ingress subsystem parses team `config.json`
- only the transport subsystem touches sockets
- only the notifier/plugin subsystem talks to agent processes

This is the architectural mechanism intended to prevent the boundary leakage that made the old daemon line unmaintainable.

Privacy rule:

- each boundary must expose only the trait or façade needed by callers
- concrete implementations, helper constructors, and storage/transport details stay private to the owning module unless a later crate extraction makes the boundary stricter

### 21.6.1 Boundary Shapes

Each I/O-owning subsystem needs one explicit architectural boundary.

#### MailStore

Dispatch model:

- synchronous request/response from service code
- transaction-scoped mutating calls

Object-safety rule:

- callers depend on an object-safe store trait or façade, not concrete SQLite types

Minimum method set:

- open/bootstrap store
- run transaction
- upsert/load message rows
- upsert/load ack/visibility state
- record/load ingest replay state
- return health/readiness snapshot

Scope rule:

- `MailStore` owns message rows plus read/ack/visibility state tied directly to message lifecycle
- `MailStore` is not the long-term owner of generic task-orchestration or daemon-status domains

#### TaskStore

Dispatch model:

- synchronous request/response from service or task-handling code
- transaction-scoped mutating calls where task and mail state must commit together

Object-safety rule:

- callers depend on an object-safe task-store trait or façade, not concrete SQLite types

Minimum method set:

- create/load/update task rows
- attach/detach task linkage to `message_key`
- record acknowledgement-related task transitions
- query task metadata needed by mail/CLI projections

#### RosterStore

Dispatch model:

- synchronous request/response for roster replacement, lookup, and
readiness checks

Object-safety rule:

- callers depend on an object-safe roster-store trait or façade, not concrete SQLite types

Minimum method set:

- replace/load roster rows
- query roster membership for routing/validation
- return roster health/readiness snapshot

#### InboxIngress

Dispatch model:

- batch import from one changed inbox source

Object-safety rule:

- callers depend on an object-safe ingress trait or façade, not direct JSONL parser structs

Minimum method set:

- import changed inbox source
- compute canonical imported identity/fingerprint
- report degraded/skipped rows with structured diagnostics

#### InboxExport

Dispatch model:

- one-way export / re-export after durable commit

Object-safety rule:

- callers depend on an object-safe export trait or façade, not direct file writer implementations

Minimum method set:

- export ATM-authored Claude-compatible record
- re-export by durable `message_key`
- return typed export failure / retry-needed result

#### Transport

Dispatch model:

- request/response for same-host and remote daemon traffic
- the same dispatch contract must also support the in-process `test-socket` transport used by tests

Object-safety rule:

- callers depend on an object-safe transport trait or façade so local and remote adapters remain swappable

Minimum method set:

- serve local daemon API
- send remote daemon request
- query daemon health
- shut down listener/connection set gracefully
- construct or bind an in-process `test-socket` endpoint for transport-boundary tests

Dispatcher rule:

- transport hands off to one injected dispatcher boundary
- the dispatcher owns request-kind routing only
- request-family behavior lives in injectable handlers behind that dispatcher
- adding a new request type must not require embedding business logic into Unix-socket or TCP/TLS adapter code

Socket receive loop rule:

- the receive loop must stay intentionally small
- allowed responsibilities:
  - read one framed request
  - parse it into a qualified request
enum/value
  - validate/authenticate the transport envelope
  - dispatch immediately to the owning handler boundary
  - serialize one typed response
- forbidden responsibilities inside the receive loop:
  - direct SQL/store logic
  - watcher/reconcile logic
  - notifier/plugin delivery logic
  - embedded workflow/business-state transitions

#### Dispatcher

Dispatch model:

- qualified request -> handler routing inside the daemon/runtime service layer

Object-safety rule:

- transport adapters depend on an object-safe dispatcher trait or façade, not on concrete request-family handler implementations

Minimum method set:

- dispatch parsed request to the correct request-family handler
- return one typed response or typed error

Boundary rule:

- dispatcher owns routing, not business logic
- request-family behavior lives in injectable handlers behind the dispatcher
- adding a new request family should be an additive handler/registration change, not transport-adapter logic growth

#### Watcher / Reconcile

Dispatch model:

- watch-event/debounce driven trigger into owned reconcile handlers

Object-safety rule:

- callers depend on an object-safe watcher/reconcile trait or façade, not concrete filesystem watcher implementations

Minimum method set:

- subscribe/start watch set
- accept changed-path event
- debounce/coalesce reconcile request
- trigger owned ingress/reconcile handler
- shut down watcher cleanly

Boundary rule:

- the watcher/reconcile subsystem owns filesystem watch events only
- it must not perform SQL directly
- it must not send socket traffic directly
- it must not deliver notifier/plugin events directly
- it may dispatch to the owning ingress/store/notifier handlers through their boundaries only

#### Plugin / Notifier

Dispatch model:

- one-way notification plus status-reporting callbacks

Object-safety rule:

- callers depend on an object-safe notifier/plugin boundary, not agent-specific concrete implementations

Minimum method set:

- notify message/task delivery
- report live status
update
- return typed backpressure / unavailable results

### 21.6.2 Structured Error And Observability Boundaries

Phase Q must keep production runtime failure handling and observability structured at compile time.

Architectural rules:

- fallible production paths return typed `Result` / discriminated error enums across crate boundaries rather than relying on panic or unwrap
- pattern matches over `AtmErrorCode` at module/crate boundary surfaces must be exhaustive; wildcard `_` match arms are not permitted
- adapter layers may translate errors, but must preserve structured identity
- when reviewing transitional compatibility paths, apply these structured-error rules together with the pre-Phase-Q pipeline stage lists and their supersession notes; see Sections 8 and 9 for the Ack and Clear pipeline stage lists and the inline notes that supersede them under Phase Q
- SQLite-specific transaction, busy-timeout, shutdown-checkpoint, and `rusqlite` blocking-I/O rules are defined in [`docs/atm-rusqlite/architecture.md`](./atm-rusqlite/architecture.md) Sections 4, 5, and 6 and are part of this same Phase Q error boundary
- `atm` owns CLI-side `sc-observability` bootstrap and CLI event emission
- `atm-daemon` owns daemon/runtime/transport `sc-observability` emission
- `atm-core` owns ATM event and error models above the shared observability boundary
- native plugins may emit plugin-local diagnostics, but daemon-owned runtime, store, ingest, and transport events remain daemon-owned observability sinks
- production runtime diagnostics must not collapse into ad hoc stdout/stderr debugging

### 21.6.3 Doctor Health Interface

`atm doctor` remains a CLI command, but the Phase Q architecture requires one explicit daemon health interface.
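One possible shape for such an interface is sketched below, purely as an assumption: the `DaemonHealth` trait, the `HealthReport` fields, and the finding strings are hypothetical and do not describe the actual ATM types. The CLI side depends only on the trait, so it can be exercised with a canned implementation and never reaches into daemon internals.

```rust
/// Hypothetical runtime-health snapshot returned by the daemon.
#[derive(Debug, Clone, PartialEq)]
struct HealthReport {
    singleton_owned: bool,
    status_cache_healthy: bool,
    ingest_backlog: usize,
    sqlite_ready: bool,
}

/// The one request/response boundary that CLI doctor code is allowed to call.
trait DaemonHealth {
    fn query_health(&self) -> Result<HealthReport, String>;
}

/// Test double that returns a canned answer instead of talking to a daemon.
struct CannedHealth(Result<HealthReport, String>);

impl DaemonHealth for CannedHealth {
    fn query_health(&self) -> Result<HealthReport, String> {
        self.0.clone()
    }
}

/// CLI-side logic: turn a report into findings using only the boundary above.
fn doctor_findings(daemon: &dyn DaemonHealth) -> Vec<String> {
    let report = match daemon.query_health() {
        Ok(report) => report,
        Err(reason) => return vec![format!("error: daemon health query failed: {reason}")],
    };

    let mut findings = Vec::new();
    if !report.singleton_owned {
        findings.push("error: daemon does not hold singleton ownership".to_string());
    }
    if !report.sqlite_ready {
        findings.push("error: SQLite store is not ready".to_string());
    }
    if !report.status_cache_healthy {
        findings.push("warning: live status cache is degraded".to_string());
    }
    if report.ingest_backlog > 0 {
        findings.push(format!("warning: {} ingest item(s) pending", report.ingest_backlog));
    }
    findings
}

fn main() {
    let daemon = CannedHealth(Ok(HealthReport {
        singleton_owned: true,
        status_cache_healthy: true,
        ingest_backlog: 3,
        sqlite_ready: true,
    }));
    for finding in doctor_findings(&daemon) {
        println!("{finding}");
    }
}
```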
Architectural rules:

- CLI doctor code queries daemon/runtime state through one explicit request / response boundary
- the daemon owns collection of runtime-only health such as:
  - singleton ownership state
  - live status-cache health
  - ingest backlog / degraded-ingest state
  - SQLite readiness/openability as observed by the runtime
- CLI code must not inspect private daemon state directly to synthesize health answers

### 21.6.4 Shutdown, Signals, Timeouts, And Resource Caps

The daemon runtime must use one documented operational contract.

Required architectural defaults:

- graceful shutdown drain deadline: `5s`
- force-cancel deadline: `10s` total
- same-host daemon request deadline: `3s`
- per-leg TCP/TLS connect deadline: `5s`
- per-leg TCP/TLS read/write deadline: `5s`
- total remote retry budget: `30s`
- SQLite `busy_timeout`: `1500ms`
- ingest batch processing slice: `2s`
- doctor health query deadline: `3s`

Required caps:

- max concurrent accepted connections: `64`
- max per-connection inflight requests: `32`
- ingest queue depth: `1024`
- retry queue depth: `256`
- SQLite handle budget: `1..=4`
- status-cache cap: `4096`

Required signal behavior:

- install `SIGINT`/`SIGTERM`/`SIGHUP` handling before listeners accept
- `SIGINT` and `SIGTERM` enter graceful shutdown
- `SIGHUP` triggers bounded rescan/reload without dropping singleton ownership

### 21.7 Test Strategy

The daemon is not the test strategy.

Phase Q test architecture must keep:

- core service logic testable in-process
- transport/watch/runtime logic testable through fakes or harnesses
- daemon process spawning out of the core test path

If a capability cannot be tested without real daemon spawning, that is treated as a design smell rather than the default approach.

### 21.8 Lock Elimination

The lock-release gate proved the file-based line is acceptable only as interim relief.
The target Phase Q architecture removes mailbox-lock dependence from ATM mail correctness by moving durable state ownership to SQLite and treating JSONL as compatibility ingress/egress only.

### 21.9 Five-Stage Migration Model

Phase Q follows five architectural migration stages:

1. store and boundary foundation
2. compatibility ingest/export
3. ack/task migration
4. read/clear cutover plus thin daemon runtime
5. lock retirement and production gate

This ordering is intentional:

- durable truth moves first
- compatibility paths stay owned and explicit
- daemon runtime arrives only after service boundaries are proven
- lock retirement closes the phase after the daemon/runtime and store model are already in place