# State Model How MobileGym represents, persists, snapshots, diffs, and resets the entire device's state. This is the trick that makes deterministic judging cheap and parallel RL rollouts cloneable in milliseconds. > ๐Ÿงฌ **One sentence.** Every app separates a small, mutable **runtime overlay** from large, mostly-read-only **world data**. Only the overlay enters snapshots. The benchmark diffs initial vs. final overlays to judge. ## Where state lives ``` โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Per-app โ”‚ โ”‚ apps//state.ts Zustand store, persisted โ”‚ โ”‚ apps//data/defaults.json initial runtime values โ”‚ โ”‚ apps//data/*.json "world data" (lazy-loaded, โ”‚ โ”‚ NOT in snapshots) โ”‚ โ”‚ apps//data/loader.ts optional dynamic loader โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ OS โ”‚ โ”‚ os/OsStateStore.ts settings/hardware/perms/prefs โ”‚ โ”‚ os/managers/*Manager.ts write facades over OsStateStoreโ”‚ โ”‚ os/providers/*Provider.ts shared content (Contacts/SMS/ โ”‚ โ”‚ Media), each persists itself โ”‚ โ”‚ os/data/defaults.json initial OS values โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ Services / runtime state (volatile) โ”‚ โ”‚ NotificationService, KeyboardService, permission request UI, โ”‚ โ”‚ SystemShadeService, LocationService runtime, TaskManager โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ``` ## The two layers, per app Every app's data has two **horizontal** layers. This is the most important concept in MobileGym. ### World data (read-mostly, off-snapshot) Large public entities the app displays: posts, products, train stations, places, songs. They're shared across runs and across users. - Lives in `apps//data/*.json` or via a `loader.ts` that fetches on demand. - Loaded via `import data from './data/posts.json'` (Vite bundles it) or async via `loader.fetchPosts()`. - **Not persisted to localStorage.** **Not part of snapshots.** Refreshing the browser doesn't reset it because it's not stored โ€” it's just bundled assets. ### Runtime overlay (small, mutable, in-snapshot) Per-environment state the agent can affect: the current user's profile and settings, drafts, likes, sent messages, edits to specific world entities, created entities. - Lives in `apps//state.ts`, persisted to localStorage at the `` key. - Initialized from `apps//data/defaults.json` (also small โ€” only the seed values). - **Persisted.** **Part of `__SIM__.getState().apps.`.** **Reset by `__SIM__.reset()`.** ### Composition at render time The view layer combines world + overlay, with the overlay taking precedence. The typical pattern is a **resolver**: ```ts // apps/RedBook/data/index.ts (sketch) import basePosts from './posts.json'; // world data import { useRedBookStore } from '../state'; // overlay export function resolvePost(id: string) { const overlay = useRedBookStore.getState().postOverlays[id]; // partial mutation const base = basePosts.find(p => p.id === id); return { ...base, ...overlay }; // overlay wins } ``` So **base + overlay** is what the agent sees on-screen, but only **overlay** is what the judge inspects. ### Why this matters The benchmark snapshots an environment, runs an episode, snapshots again, diffs the two. If world data were in the snapshot, the diff would be huge and the judge would have to filter out signal from noise. By **excluding** world data from snapshots, we get: - Sub-millisecond serialization (snapshots are tiny โ€” kilobytes, not megabytes). - Sub-millisecond comparison (diff a few hundred fields, not a few hundred thousand). - A clean **side-effect** signal: any mutation outside the task's `expected_changes` was an unintended consequence, no matter how small. ## Layered entity rules A few rules govern how world entities and runtime entities relate. Get these wrong and your judge will read the wrong source of truth. ### Entity isomorphism World data and runtime data use the **same schema** for the same entity type. They differ in scope, not shape: ```text basePosts[id] public-world posts (apps//data/*.json, off-snapshot) state.posts[id] runtime overlay (in snapshot, persisted) โ€” create / scenario-injected / override / hide a base post ``` A resolver composes them with overlay-wins precedence: ```ts function resolvePost(id: string) { const overlay = state.posts[id]; if (overlay === null) return null; // tombstone โ€” base hidden if (overlay !== undefined) return overlay; // full overlay (covers base) return basePosts[id] ?? null; // base falls through } ``` Overlay-completeness rule โ€” overlays must store **complete entities** at all times: | `state.posts[id]` | `basePosts[id]` | Meaning | |---|---|---| | object (full entity) | exists | Full override of the base entity | | object (full entity) | absent | Runtime-only entity (must satisfy schema's required fields) | | `null` | exists | Tombstone โ€” base entity is hidden / deleted | | absent | exists | Plain base, no runtime change โ€” `view_post(id)` falls back to base | | absent | absent | No such entity | Why "complete entities only" matters: `__SIM__.setState({ ...patch }, { deep: true })` is a write command, **not** a long-term data shape. Deep-merging a partial onto an absent overlay would silently create a runtime entity that fails the schema. To inject a scenario, read the full entity through the App accessor first (`base_note("id")` or `view_note("id")`), mutate fields, then write the full entity back. `setState` deep-merge semantics: | Patch value | Effect | |---|---| | `object` | Deep merge into the existing object | | `undefined` | No-op (preserves existing value) | | `null` | Writes `null` explicitly โ€” usable as tombstone | | `array` | Whole-array replace | | primitive | Whole-value replace | The rule generalizes to every "public world + current-user runtime" entity: posts / notes / comments / users / videos / products / songs / playlists / places / orders. Docs may call the static side the "static data layer"; in App and bench accessors, prefer the `base_*` naming, **not** `static_*`. ### Field ownership: IDs vs. full copies If a runtime entry **refers to** a base entity, store **only the ID** (and the relationship state), not a copy of the entity: ```jsonc // โœ… โ€” runtime only holds the relationship "user": { "savedPostIds": ["post_1"], "followedUserIds": ["user_1"], "joinedCommunityIds":["com_games"] } ``` ```jsonc // โŒ โ€” copying full base entities into runtime "user": { "likedPosts": [ { "id": "post_1", "title": "...", "content": "..." } ] } ``` Copies create sync hazards: when titles / counts / authors are edited later, two places now disagree on what's true. If a runtime entry **is** an entity (created by the simulated user, or scenario-injected and not in base), store the **full entity** plus a user-side ID index: ```jsonc { "user": { "postIds": ["my_post_1"], "commentIds": ["my_comment_1"] }, "posts": { "my_post_1": { "id": "my_post_1", "subreddit": "r/self", "author": "...", "title": "...", "content": "..." } }, "comments": { "my_comment_1": { "id": "my_comment_1", "postId": "post_1", "author": "...", "body": "..." } } } ``` This is the **entity-table + user-index pattern** (the preferred shape). The alternative โ€” a `createdPosts: [{full entity}]` array โ€” is shorter to write but leads to one-off `createdX` fields per entity type, loses base/overlay symmetry, and complicates aggregation. Use the entity-table pattern for any non-trivial app. Hard rules that follow from the pattern: - **Business entities live at the store top level, not under `user`.** Tables like `posts` / `comments` / `chats` / `messages` / `orders` / `playlists` / `drafts` must be siblings of `user`, so overlays, scenarios, and view accessors can all address them by a stable path. **Forbidden**: nesting an entity table under `user`, e.g. `user.chatThreads = { โ€ฆ }`. - **User-side indexes are ID-only.** Use `user.commentIds: string[]`, never `user.commentList: Comment[]` with content copies. Entity content lives in the entity table or is computed via `view_*`. Embedding sub-entities inside a **non-user entity** (e.g. `Note.commentList?: Comment[]`) is a separate question โ€” it's an entity-to-entity relationship the user-index rule doesn't cover, though it still tends to duplicate truth between the entity and the entity table. - **Polymorphic-exclusive relations use a Record, not parallel arrays.** A three-state vote (`'up' | 'down' | 'none'`) is `votes: Record`, not `upvotedIds` + `downvotedIds`. Parallel arrays would need an extra mutex invariant ("same id never appears in both") that the type system cannot enforce. See `apps/Reddit/state.ts` (`postVotes`, `commentVotes`) for the canonical shape. ### User-managed content judgment Whether an entity "belongs to the current user" cannot be inferred from "does it appear in `state.`?" alone. `state.posts` can contain runtime overlays for *other* users (scenario data, base overrides). The ownership test is **both** conditions: 1. The user-side index lists the ID โ€” `user.postIds.includes(id)`, `user.commentIds.includes(id)`, etc. 2. The entity's author field matches the current user โ€” `post.author === user.username`. This means: deletion / edit / "show me my posts" judges must check both. It also means base-dataset posts are public world data โ€” they are **not** editable or deletable as if they were the user's, even if the simulated user is their nominal author. Implication for task design: if a task requires the user to delete or edit "their own" content, the seed data for that content must live in runtime state (`defaults.json` plus the user-side index), not in base JSON. ### Aggregated counts โ€” derive, don't write back Counts on public entities (`likes`, `comments`, `upvotes`, `retweets`) are public-world numbers. The current user's contribution is layered on top at render / judge time: ```text display_likes(id) = base_likes(id) + (id โˆˆ user.likedPostIds ? 1 : 0) display_comments(id) = base_comment_count(id) + runtime_comments_for_post(id).length ``` Rules: - **Never overwrite `base_post.likes` on user action.** Update `user.likedPostIds` instead. The view layer / accessor derives the visible total. - **`view_*` accessors apply the derivation.** Both the App UI and `bench_env/task//app.py` must agree on the formula. - **Operate judges read the user-side state, not the count.** "Did they like this post?" โ†’ `state.user.postVotes[post_id] == "up"`, never "did `base_post.upvotes` go up by 1?" โ€” base aggregates may be strings / abbreviations / async. If you want the initial UI to display N likes with the user already having liked, set the base count to N-1 and let `user.likedPostIds` contribute the final one. The same logic applies to **user-relation summaries** (`followingCount`, `followerCount`, `likesCount`, etc.): - **If the App stores the IDs** (`followingIds: string[]`), the count must be derived: `state.user.followingIds.length`. **Do not** also persist `followingCount` as a separate written field โ€” it's a duplicate source of truth that will drift. - **If the App stores only the count** (no `followingIds`), that's an acceptable simplification for apps where the actual ID list isn't meaningful to the bench. The count can be a raw runtime field. - **Never both.** ### View accessors (the bench-facing surface) `view_*` is the shared display contract between the simulator front-end and `bench_env/task//app.py`. Both sides resolve to the same view-of-the-world that the user / agent actually sees. | Accessor | When to use | |---|---| | `view_(id)` | Default task-side read (queries, answers, sampling candidates). Returns the resolved view: base + overlay + tombstone semantics. | | `state_user()` | Operate judging. Returns the runtime user record (`postIds`, `likedPostIds`, etc.). | | `state_(id)` | Entity-level create / update / delete judging. Returns the runtime-overlay entry only (no base fallback). | | `base_(id)` | Internal helper inside `app.py`. **Do not call from `task.py` / `defs.py`.** Use it inside the accessor to build search indexes, then return `view_(id)` to callers. | Task-side reading rule by intent: | Task intent | Read from | |---|---| | Pure query / answer task | `init.view_*` โ€” what the user sees at task start | | Answer after agent operates | `current.view_*` โ€” final visible state | | Operate judgment | `current.state_user()`, optionally compared against `init.state_user()` | | Create / edit / delete judgment | `state_(id)` or a dedicated `check_*` helper | Forbidden in `task.py` / `defs.py`: - Direct `json.load(...)` of `apps//data/*.json`. Only `app.py` is allowed to read base JSON. - Treating `state.[id]` as the resolved view (it isn't โ€” overlays may be partial, missing, or tombstoned). - Returning `base_` as the answer (bypasses overlay rules). The boundary is deliberate: short-term we read JSON from disk; medium-term `apps//data/*.json` may move behind `/api/sim-data/...` or SQLite. Task code that goes through `view_*` survives that migration; task code that calls `json.load` will break. ### Multi-user state shapes The default is a single sandboxed current user. Multi-user is **not** a separate data model โ€” it's the same one with an extra `currentUserId` pointer and a `users[userId]` map: ```jsonc { "currentUserId": "u1", "users": { "u1": { "id": "u1", "publishedNoteIds": ["note_u1_1"], "likedNoteIds": ["base_note_1"] }, "u2": { "id": "u2", "publishedNoteIds": ["note_u2_1"], "likedNoteIds": [] } }, "notes": { "note_u1_1": { "id": "note_u1_1", "authorId": "u1", "title": "u1 ็š„็ฌ”่ฎฐ" }, "note_u2_1": { "id": "note_u2_1", "authorId": "u2", "title": "u2 ็š„็ฌ”่ฎฐ" } } } ``` The entity tables (`notes`, `posts`, `comments`) remain App-level runtime overlays and may hold entities from multiple users. The ownership rule still applies: an entity is "u1's" iff `users.u1.publishedNoteIds` contains the ID **and** `note.authorId === 'u1'`. View accessors resolve relative to `currentUserId`. ## Large world data: the `loader.ts` pattern When an app's content dataset exceeds a few thousand JSON lines (RedBook posts, Bilibili videos, Ebay products), bundling it into `defaults.json` blows up Vite's ESM transform time. Split it out and lazy-load: ``` apps//data/ โ”œโ”€โ”€ defaults.json # initial runtime state (users, settings, small seed entities) โ”œโ”€โ”€ index.ts # exports _CONFIG โ€” synchronous, defaults + constants only โ”œโ”€โ”€ loader.ts # async fetch of large datasets, module-level cache โ””โ”€โ”€ .json # bundled-but-not-imported large datasets (posts.json, videos.jsonโ€ฆ) ``` `loader.ts` shape: ```ts let cachedData: PostData[] | null = null; export async function loadPosts(): Promise { if (cachedData) return cachedData; const url = new URL('./posts.json', import.meta.url); const res = await fetch(url); // local same-origin JSON โ€” no NetworkService needed const raw = await res.json(); cachedData = processPosts(raw.posts); return cachedData; } export function getPostsSync(): PostData[] | null { return cachedData; // null until first preload completes } export async function preload(): Promise { await loadPosts(); // called by bench_env / app boot } ``` **Bench accessibility depends on how the app consumes the loaded data**: | Consumer pattern | Where the data lives | Visible to `__SIM__.getState()`? | |---|---|---| | **Store action** (X, RedBook) | Loader runs, then a store action writes it into the Zustand store | โœ… yes | | **Hook** (Bilibili, Ebay, Spotify) | Loader caches in module scope; component hooks read from there | โŒ no | | **Service** (Railway12306) | Loader hands off to a service singleton | โŒ no | If a bench task needs to read or write a large-data field, the app must use the **store-action** pattern โ€” load it into Zustand on boot or on demand, so it appears in `apps..x` in `__SIM__.getState()`. Hook / service patterns are valid for display-only data. When to split โ€” rule of thumb: | `defaults.json` size | Action | |---|---| | < 100 KB (~2 000 JSON lines) | Keep everything in `defaults.json` | | > 100 KB, or crawled/generated content | Split out, add `loader.ts` | Hard rules: 1. `index.ts` does **not** import large datasets. It only processes `defaults.json` + `constants.ts` and stays synchronous. **Do not** hardcode default settings inside `index.ts`, define constants there (use `constants.ts`), or declare types there (use `types.ts`). 2. `loader.ts` is independent of `index.ts`. Loader is called by stores / hooks, not by `_CONFIG` construction. 3. Large datasets do **not** affect the type of `_CONFIG`. 4. `bench_env` only sees what's in the store. If a large-data field must be bench-visible, route it through a store action. 5. Large JSON files use **camelCase** content names matching the loader function (`videoComments.json`, `playlistTracks.json`, `searchableCities.json`). **Forbidden**: app-name prefixes (`reddit_data.json`), source prefixes (`crawled-users.json`, `imported-posts.json`), or dotted namespaces (`catalog.products.json`). 6. **Every `loader.ts` must export `preload(): Promise`** with that exact signature, even if its body just awaits the loader's other entry points. Bench-env boot and the app shell both rely on this uniform shape to warm caches. 7. Loader function names follow `load()` and `getSync()` (no app-name prefix, no `Map` / `Data` suffix: prefer `loadPosts` over `loadRedditPosts` or `loadPostsMap`). 8. `fetch(new URL('./xxx.json', import.meta.url))` inside a loader is the **one legitimate exception** to the "use `NetworkService` for HTTP" rule โ€” the URL resolves through Vite's dev server to a same-origin static asset, no CORS or scenario routing is needed. Image assets referenced inside loaded JSON are resolved via `resolveAssetsDeep()` to `/@app-assets//...`, served by the Vite `serveAppAssetsPlugin` from `apps//assets/`. ### Data consumption decision ```text Will the data be modified at runtime? โ”œโ”€โ”€ yes โ†’ must go into the store (store action or afterHydration) โ”‚ (only this path is visible to __SIM__.getState() and bench) โ””โ”€โ”€ no, read-only reference data โ”œโ”€โ”€ do non-React consumers need it (services, OS managers)? โ”‚ โ””โ”€โ”€ yes โ†’ service layer โ””โ”€โ”€ no โ†’ hook is simplest ``` Hook and service patterns are valid for display-only data; never for fields a bench task must read or assert against. ## The full snapshot shape `window.__SIM__.getState()` returns: ```jsonc { "os": { "settings": { "global": { "wifiEnabled": true, "bluetoothEnabled": false, "language": "zh-Hans", โ€ฆ }, "system": { "brightness": 80, "volume": 60, "fontScale": 1.0, โ€ฆ } }, "hardware": { "battery": {โ€ฆ}, "wifi": {โ€ฆ}, "cellular": {โ€ฆ}, "bluetooth": {โ€ฆ}, "storage": {โ€ฆ}, "hotspot": {โ€ฆ}, โ€ฆ }, "permissions": { "wechat": { "android.permission.ACCESS_FINE_LOCATION": "granted", "android.permission.CAMERA": "denied" }, โ€ฆ }, "preferences": {โ€ฆ}, "providers": { "contacts": {โ€ฆ}, "sms": {โ€ฆ}, "media": {โ€ฆ} }, "clipboard": {โ€ฆ}, "notifications": {โ€ฆ}, "shade": {โ€ฆ}, "services": {โ€ฆ} }, "apps": { "wechat": {โ€ฆ}, "alipay": {โ€ฆ}, "redbook": {โ€ฆ}, /* one entry per installed app โ€” runtime overlay only */ } } ``` Two top-level keys: `os` and `apps`. Everything the benchmark cares about is in there. **Nothing else is judging-visible**, by construction. ### Bench-facing path conventions Judges access state via path strings. The framework hands tasks a typed dict; index it directly: ```python # Raw indexing โ€” preferred when reading state once init_state["apps"]["wechat"]["contacts"] final_state["os"]["settings"]["global"]["wifiEnabled"] # CriteriaTask โ€” dotted paths, anchored at the App's runtime overlay root class ToggleDarkMode(CriteriaTask): apps = ["alipay"] criteria = { "settings.general.darkMode.mode": "{mode}", } ``` `CriteriaTask` path rules: - Paths start at the App's overlay root (no `apps..` prefix โ€” the suite already declares its app). - `.` separates nested levels. - Leaf values are the actual setting / field. - `settings.x.y.z` is the standard prefix for per-app settings. Schema changes are load-bearing: **after modifying any settings structure inside an App's store, grep `bench_env/task//tasks.py` and `bench_env/task//defs/*.py` for matching paths and update them in the same commit.** A silent path mismatch produces a green CI run with a broken judge. Don't write defensive `.get("wechat", {})` chains. If your task declares `apps = ["wechat"]`, the runner guarantees the key exists. If it isn't there, that's a bug in the registry, not something to paper over. For typed access from inside `app.py` accessors, prefer **`view_*` accessors** (see [Layered entity rules](#view-accessors-the-bench-facing-surface) above) over raw `.["posts"][id]` chains. ### Bench app accessors Each task suite can define a thin `BaseApp` subclass in `bench_env/task//app.py`. Treat it as the bench-facing read model for that app: ```python class Wechat(BaseApp): @property def settings(self) -> dict: return self.get("settings", {}) def find_contact(self, name: str) -> dict | None: for contact in self.get_list("contacts"): if contact.get("name") == name: return contact return None ``` Rules: - Accessors are coupled to the app store schema. If you rename or move a state field, update the matching `app.py`, `tasks.py`, and `defs/*.py` paths in the same change. - For content apps with base data + runtime overlays, expose `view_*` helpers as the default query surface. They should apply base fallback, runtime override, tombstone, and current-user relationship rules. - Use lower-level `state_*` / raw helpers only for mutation-specific checks such as "did this runtime entity get created, overwritten, or tombstoned?" - Samplers should also return candidates through the same view helpers the UI semantics imply. They may use base-data indexes internally for speed, but their public return value should match what the user could see. ### Transient `_temp` state Use a top-level `_temp` object for runtime-only UI state: loading/error flags, current drag/focus state, temporary category windows, and similar values. Current behavior: | Property | Behavior | |---|---| | Persistence | `createAppStoreWithActions` excludes `_temp` from localStorage by default. If you use a custom `partialize`, keep that exclusion. | | Snapshot | `_temp` is still visible in `__SIM__.getState()` because snapshots read the in-memory store and strip only functions. | | Bench diff | `BaseTask.always_ignore` includes `apps.*._temp`, so `_temp` changes are filtered from unexpected-side-effect warnings by default. | Do not put meaningful business state in `_temp`. Draft content, selected entities, generated records, and anything a task might judge belong in the normal runtime overlay. ### Top-level vs. `_temp` decision | Condition | Top level | `_temp` | |---|---|---| | The bench can mutate or assert against this value | โœ… | | | The bench needs to seed this value | โœ… | | | User-produced business state (drafts, selections, generated content) | โœ… | | | Loading / fetching mid-state | | โœ… | | Focus / drag / hover / open-vs-closed UI mode | | โœ… | | Animation progress, temporary overlays | | โœ… | Common mistakes: a `createDraft` field belongs at the top level (it represents in-progress user content the task may judge); a `queryLoading` flag belongs in `_temp` (it's a request lifecycle detail no task should care about). ### Persistence contract: snapshot tables must survive reload If a runtime entity table appears in the snapshot (i.e. it's reachable from `__SIM__.getState().apps..x`), `partialize` **must keep it**. Excluding a snapshot-visible table from persistence breaks two invariants at once: the bench's init/current diff becomes unstable across reloads, and any user-created content (posts, comments, threads) is lost on refresh. The default `partialize` from `createAppStoreWithActions` already does the right thing โ€” it strips functions and `_temp`, keeps everything else. If you write a custom `partialize`, make sure every business-entity field is in the allowlist. The audit rule is simple: **whatever is in the snapshot must be in `partialize`**, except `_temp` and functions. ### Snapshot exposure rules `__SIM__.getState()` walks each app's Zustand store and produces a JSON-serializable view. Two rules govern what makes the cut: - **Functions are filtered out unconditionally.** Action functions, computed properties, and any other function-typed field disappears from the snapshot. - **Derived values must be stored as fields or routed through an adapter, not implemented as store-action getters.** Because actions are filtered, a getter-style `state.isLiked(id)` returns `undefined` to bench inspectors. If a derived value must be visible to the bench, either materialize it into a state field after every relevant action, or expose it via `registerStateAdapter(appId, fn)` โ€” the adapter post-processes the snapshot view without affecting in-store reads. ## Store semantic fields, not display strings `defaults.json` and persisted runtime state must store the semantic fields the app owns, not a UI display string that later has to be parsed. Bad: ```json { "route": "Shanghai - Chengdu" } ``` Good: ```json { "fromCity": "Shanghai", "toCity": "Chengdu" } ``` Build labels in the UI layer, selectors, or app accessors. Do not make `bench_env` split strings such as `"A - B"`, `"A/B"`, `"A to B"`, or `"12:00-13:00"` to recover state. If both a label and structured fields are needed, treat the structured fields as authoritative and derive the label unless the label is itself user-authored content. This keeps sampling and judging stable when copy, locale, punctuation, or UI formatting changes. ## Read-only vs. mutable conventions Inside an app's runtime overlay, some fields are **logically read-only** (current user identity, account creation date) and some are user-mutable (likes, drafts, settings). The framework doesn't enforce a distinction โ€” it's a coding convention. But two patterns make life easier: - Keep **structural constants** (tab definitions, service grids, feature flags) in `constants.ts`, not `defaults.json`. They're the same for every user; bench_env will never write to them. - Use a `_temp` field (or skip persistence for specific keys via Zustand `partialize`) for ephemeral UI runtime state (open dialogs, current scroll position) that should not persist and should be ignored by side-effect checks. ```ts createAppStoreWithActions( 'habits', initialState, (set) => ({ /* actions */ }), { partialize: (state) => { const { _temp, ...persisted } = state; return persisted; // _temp never enters localStorage }, }, ); ``` ## Settings: structure and Action patterns Per-app settings live inside the app's own runtime overlay (the Zustand store). Two shapes show up across apps; pick by depth. ### Naming The top-level key for app-level settings is **`settings`**, in both `defaults.json` and `state.ts`. The TypeScript type is `Settings` (e.g. `AlipaySettings`). The store field is also `settings`. The **only legitimate exception** is the system Settings App (`system/Settings`), which exposes its bucket as `preferences` because that mirrors Android's `SharedPreferences` concept. No third-party app should use `preferences`, `initialSettings`, `defaultAlarms`, `config`, or any other variant โ€” `defaults.json` should fail review if such top-level keys appear. Apps that genuinely have no user-tunable settings can omit the `settings` field entirely. The rule is "if you have one, call it `settings`", not "every app must have one". ### Nesting follows the settings page route hierarchy When a settings page tree has multiple levels โ€” e.g. `/settings`, `/settings/notifications`, `/settings/notifications/sound` โ€” the `settings` object's nesting **must match** the route hierarchy one-for-one. A page at `/settings/notifications/sound` writes to `settings.notifications.sound.*`; flat denormalized keys like `settings.notificationSoundVolume` break the analyzer's ability to pair routes with the state they mutate. If an app has a single settings page, keep the shape flat (no artificial categories). ### Flat settings โ€” 1-level Partial spread Use this when an app has a single settings page and all toggles sit at the same level (Calendar, WechatReading, TencentMeeting, Railway12306): ```ts updateSettings: (patch: Partial) => { set(state => ({ settings: { ...state.settings, ...patch } })); } // Caller: updateSettings({ theme: 'dark' }); ``` ### Nested settings โ€” category-scoped updaters Use this for apps with multi-level settings pages (Map, Alipay). One updater per category, typed precisely to the sub-shape: ```ts updatePayOrder: (patch: Partial) => { set(state => ({ settings: { ...state.settings, payment: { ...state.settings.payment, payOrder: { ...state.settings.payment.payOrder, ...patch } }, }, })); }, updateDarkMode: (patch: Partial) => { set(state => ({ settings: { ...state.settings, general: { ...state.settings.general, darkMode: { ...state.settings.general.darkMode, ...patch } }, }, })); }, ``` ```ts updateDarkMode({ mode: 'dark' }); updatePayOrder({ mode: 'custom' }); ``` ### Tolerable but not recommended ```ts // โš ๏ธ Functional updater โ€” flexible but callers must spread manually setSettings: (updater: AppSettings | ((prev: AppSettings) => AppSettings)) => void; ``` ### Forbidden ```ts // โŒ One setter per leaf โ€” fragmented; 20 setters for 20 toggles setDarkMode: (mode: string) => void; setFontSize: (size: number) => void; // ... // โŒ Whole-object replace โ€” caller must pass the complete settings or fields are lost setSettings: (settings: AppSettings) => void; ``` ### Where settings live in snapshots | Storage | Path | Owner | |---|---|---| | **OS-level** (cross-app) | `os.settings.global.*` / `os.settings.system.*` | `OsStateStore`, written via Managers | | **OS manager preferences** | `os.preferences.*` | Flat manager-routed values such as volume aliases, build overrides, and feature toggles | | **App-internal** | `apps..settings.*` | The app's own `state.ts` | For `CriteriaTask` paths see "[Bench-facing path conventions](#bench-facing-path-conventions)" above. ## Store actions: no query-style getters > โ›” **This rule has caused multiple "UI doesn't update after interaction" bugs across apps. It is non-negotiable.** Zustand actions are plain closures created inside `create()`. Their **reference never changes** after the store is built. When a component subscribes via `useStore(s => s.isLiked)`, Zustand's `Object.is` comparison always returns `true`, so the component **never re-renders when underlying data changes**. ```ts // โŒ FORBIDDEN โ€” query getters defined as store actions interface MyActions { isLiked: (postId: string) => boolean; isFollowing: (userId: string) => boolean; getEventById: (id: string) => Event; checkInteractions:(vid: string) => { liked: boolean; coined: boolean }; toggleLike: (postId: string) => void; // โœ… this one is fine โ€” it's a mutation } ``` ```tsx // โŒ FORBIDDEN โ€” component subscribes to a stable function reference const isLiked = useStore(s => s.isLiked); // same reference forever, no re-render ``` The fix โ€” **subscribe to data, derive locally**: ```tsx // โœ… subscribe to the array โ†’ reference changes when data changes โ†’ re-renders const likedPostIds = useStore(s => s.likedPostIds); const isLiked = (id: string) => likedPostIds.includes(id); // โœ… memoSelector that returns a Set (O(1) membership for big lists) export const selectLikedSongIds = memoSelector( (s: Store) => s.likedSongs, (songs) => new Set(songs.map(t => t.id)), ); const likedIds = useStore(selectLikedSongIds); const isLiked = (id: string) => likedIds.has(id); // โœ… subscribe to events array, find locally const events = useStore(s => s.events); const event = events.find(e => e.id === eventId); ``` The principle: **store `actions` contain only mutations (state-changing operations).** Queries (read-only derivations) live in the component, in a `memoSelector`, or as a standalone util. The same logic forbids `useShallow(s => s.someGetter)` wrapped in `useMemo` โ€” the dep is still a stable reference, the memo never recomputes. The `scripts/lint_store_getters.mjs` linter catches violations of this rule. ## The OS state model โ€” Android four-layer mapping `OsStateStore` mirrors the Android settings model: | Android concept | MobileGym tree | Examples | |---|---|---| | `Settings.Global` | `os.settings.global` | wifi/bluetooth on-off, locale, mobile-data | | `Settings.System` | `os.settings.system` | brightness, ringer volume, font scale | | `Settings.Secure`-like data | dedicated stores | permission-like state lives in `os.permissions` / `os.preferences` rather than a separate secure bucket | | App-scoped settings | `apps..settings` | in-app prefs owned by the app store | | Hardware state | `os.hardware` | battery percent/charging, WiFi RSSI, cellular, Bluetooth, storage, hotspot | | Permission grants | `os.permissions` | `permissions[appId][permId] = 'not_requested' / 'granted' / 'denied' / 'denied_forever'` | | `ContentProvider` data | `os.providers` | contacts, SMS, media | This is persisted as a single `os_state` localStorage entry plus separate per-provider stores. **`build` info and `telephony` are managed via the `managers/registry.ts` override mechanism** so the benchmark can inject scenarios (different device profiles, carriers, etc.) without rewriting the store. ## Managers โ€” write facades Apps **must not** write directly to `os.hardware.wifi.enabled`. They go through a Manager: ```ts import { ConnectivityManager } from '@/os/managers/ConnectivityManager'; ConnectivityManager.setWifiEnabled(false); // โ†’ sets os.hardware.wifi.enabled = false // โ†’ if airplaneMode would conflict, cascades correctly // โ†’ emits a broadcast on os/BroadcastBus ``` Why: constraint logic (airplane-mode cascade, volume clamp, brightness clamp) is centralized. If the same write happened from three apps directly, you'd hit three different broken edge cases. Managers live in `os/managers/`. Today: `ConnectivityManager`, `BatteryManager`, `AudioManager`, `DisplayManager`. Add a manager (rather than expanding one) when the new constraint vocabulary doesn't overlap with an existing one. ## Providers โ€” shared content Things many apps need to read or write but no one app *owns*: contacts, SMS, media. These live in `os/providers/` and are exposed through `ContentResolver`: ```ts import { ContentResolver } from '@/os/ContentResolver'; const contacts = ContentResolver.query('content://contacts/contacts'); ContentResolver.insert('content://sms/messages?conversationId=thread-1', { content: 'Hi!', isOutgoing: false, }); ``` Each provider: - Is its own persisted Zustand store. - Is excluded from the `os.services` snapshot section because it's substantive content, not service runtime. - **Appears in `os.providers.`** in `__SIM__.getState()` so judges can read it. - Is reset by `__SIM__.reset()`. If your app needs to *share* data with others, add it as a provider. If your app needs *its own* data, put it in the app's runtime overlay. ## Snapshot, reset, inject The four operations the benchmark performs millions of times during an RL run: ### Snapshot ```js const snapshot = window.__SIM__.getState(); ``` `AppStateRegistry.getAllAppStates()` walks the registered stores, calls each `.getState()`, and assembles the tree. No copies โ€” Zustand stores are immutable, so the snapshot is a reference graph. For serialization use `JSON.stringify(snapshot)`; for a deep-cloned safe copy, `structuredClone(snapshot)`. ### Reset ```js await window.__SIM__.reset(); // factory reset + page reload await window.__SIM__.resetState(); // factory reset, no reload ``` Both go through `_resetStateCore()` in `os/OSContext.tsx`, which runs in a deliberate order so the cleared in-memory state can't be re-persisted by the browser's beforeunload flush. The full sequence: 1. **`beginPersistReset()`** โ€” set a guard so any pending zustand-persist writes are dropped instead of racing with the clear. 2. **`resetAllAppStores()`** โ€” reset every app-store registered via `createAppStore*` to its initial state. 3. **`resetAllOsStores()`** โ€” reset every OS-side service / provider store registered via `createOsStore*` (both `_registry` and `_providerRegistry`). 4. **`OsStateStore.reset()`** โ€” reset the merged OS settings / hardware / permissions / preferences bucket to `OS_DEFAULTS`. 5. **`TaskManager.reset()`** โ€” clear the task stack, return to the desktop. 6. **`cancelAllPendingPersistWrites()` + `localStorage.clear()`** โ€” only after the in-memory state is clean does localStorage get wiped, so the persist layer can't write the just-cleared state back. 7. **`TextSelectionService.hideSelectionMenu()` + `await clearFileSystemDB()` + a second `cancelAllPendingPersistWrites()` / `localStorage.clear()`** โ€” handle the async file-system clear and one final cancel/clear in case anything queued during teardown. `reset()` calls `_resetStateCore()` then `window.location.reload()`. `resetState()` does the same work without the reload โ€” useful when the test harness will reload the page itself. Provider stores reset through step 3 (the registry walk) but their localStorage keys (`provider_sms`, `provider_contacts`, `provider_media`) are also wiped by step 6's `localStorage.clear()`. ### Inject ```js window.__SIM__.setState( { apps: { wechat: { user: { id: 'u_42' } } } }, { deep: true }, // merge nested objects instead of overwriting ); window.__SIM__.setState(snapshot, { deep: true, reload: true }); // โ†‘ used for "restore from snapshot": writes the patch + reloads to repaint everything ``` Inject lets tasks set up arbitrary initial conditions: pre-populate a contact list, set an account balance, mark a message as read. Without `setState`, every task would need to drive the simulator UI just to get into position. With it, tasks can teleport. For `os.*` paths, `applyOsStatePatch()` routes the patch to the responsible Manager: - `patch.os.build` โ†’ `setBuildOverrides()` (writes to `__os_scenario_overrides__` localStorage key, not `os_state`). - `patch.os.telephony` โ†’ `setTelephonyOverrides()` (same scenario-override path). - `patch.os.settings.global.wifiEnabled` โ†’ `ConnectivityManager.setWifiEnabled()`, which honors the airplane-mode cascade (turning airplane on forces wifi/bt/cellular off automatically). - `patch.os.settings.system.volume*` โ†’ `AudioManager`, which clamps to 0โ€“100 and syncs DND โ†” silentMode. - `patch.os.hardware.battery.percent` โ†’ `BatteryManager` (clamp 0โ€“100). - `patch.os.settings.system.brightness` โ†’ `DisplayManager` (clamp 0โ€“100). In other words, **`setState` does not bypass Manager constraints**. If a task author writes `os.settings.global.wifiEnabled = true` while `airplaneMode = true`, the cascade fires and wifi stays off. Tests that rely on Manager-enforced invariants are still correct after a `setState` inject. The deep-merge semantics preserve the **null vs. undefined distinction**: `null` is a tombstone (delete the key / element), `undefined` is a no-op (leave the existing value alone). External patches need this to express "clear this field" โ€” a contract any future merge rewrite must keep. ### Clone (for parallel RL) For group-based RL like GRPO, you need many parallel instances of the same initial state. The pattern: ```js // Once, in the master process: window.__SIM__.setState(initialState, { deep: true, reload: true }); const snap = JSON.parse(JSON.stringify(window.__SIM__.getState())); // Then, in every parallel worker (each has its own browser tab): worker.__SIM__.setState(snap, { deep: true, reload: true }); ``` The cost is ~10โ€“50 ms per worker for a typical snapshot, dominated by JSON serialize/deserialize, not by the state shape itself. That's how MobileGym hits "256 parallel instances on a single server" โ€” the snapshot fits in cache and `setState` is fast. ## Side-effect detection The runtime overlay being small + structured lets the runner detect **side effects** for free: 1. Snapshot at task start (`init`). 2. Snapshot at task end (`final`). 3. Diff `init` vs. `final` at the path level. 4. Subtract the task's declared `expected_changes`. 5. Whatever remains is an **unexpected side effect**. In the leaderboard this shows up as the `USE` (Unexpected Side Effects) column. An agent that completes the task but silently follows three users or saves an unwanted draft will look successful on the task itself but bad on USE โ€” which is exactly what user-facing safety wants. A task author declares `expected_changes`: ```python class CreateNewNote(BaseTask): expected_changes = ["notes.notes"] # creating a note โ†’ notes.notes mutates ``` Anything outside this list is a side effect. List too few and innocent changes flag. List too many and real bugs hide. ## Auto-discovery of state Apps don't explicitly register their store in a central registry file. The OS eagerly imports state modules so their top-level `createAppStore*` calls can self-register: ```ts import.meta.glob(['../apps/*/state.ts', '../system/*/state.ts'], { eager: true }); ``` `createAppStore(appId, ...)` and `createAppStoreWithActions(appId, ...)` register the store under `` and use that same id as the default localStorage key. `AppStateRegistry.getAllAppStates()` collects every registered store and emits the final `apps.` slice. Bare Zustand `create(...persist(...))` stores are not visible to snapshots unless wrapped by the MobileGym factory. Two consequences: 1. **The `createAppStore*` appId must equal `manifest.id`.** Mismatch and your store appears under the wrong snapshot key. The first argument is also the localStorage key โ€” write the bare manifest id (`'wechat'`), not `'wechat-store'`, `'wechat-v2'`, or any other suffix. 2. **`registerStateAdapter`** is the escape hatch โ€” use it when you want a custom selector for snapshot extraction (e.g. you have several stores and want to flatten them into one app slice, or you want to exclude part of the store from snapshots). ### `registerStateAdapter` rules - Adapters affect **`getState()` output only** โ€” they don't change in-store reads. Components keep reading from the store directly. - Place the `registerStateAdapter(appId, fn)` call at the bottom of `state.ts`, immediately after the store is created. - Keep adapters cheap โ€” `getState()` invokes them on every snapshot call. Avoid heavy computation; pre-compute results into store fields when possible. - If the adapter reads anything outside the store (refs into other modules, derived values from `loader.ts` caches), call `invalidateStateCache(appId)` whenever those external sources change. Otherwise stale results will be served from the snapshot cache. ### Three app-store factories `os/createAppStore.ts` exports three factories โ€” pick by lifetime: | Factory | Persistence | Use for | |---|---|---| | `createAppStore(appId, initial, options?)` | localStorage | Simple stores with no actions, or where you provide your own action layer. | | `createAppStoreWithActions(appId, initial, actions, options?)` | localStorage | The default. Auto-excludes functions and `_temp` from `partialize`. Auto-registers for snapshot. | | `createVolatileAppStore(appId, initial)` | in-memory only | Runtime caches and UI state that should reset on reload. Still registers for snapshot. | ### How AppStateRegistry actually works `os/AppStateRegistry.ts` is now an 18-line shell. It does **not** maintain a parallel `persistentReaders` map; state lives in the Zustand store registry. `getAllAppStates()` walks the registry, runs each registered adapter, and returns the result. For apps that are installed (per `PackageManagerService`) but whose store hasn't loaded yet, `getAllAppStates` falls back to whole-object `JSON.parse(localStorage[manifest.id])` โ€” there is no field-level fallback to `_CONFIG`. New apps should rely on the store factory; the older reader pattern has been retired. ```ts // apps/Habits/state.ts (advanced โ€” most apps don't need this) import { registerStateAdapter } from '@/os/createAppStore'; registerStateAdapter('habits', (s) => { return { habits: s.habits, streak: s.streak }; // omits _temp, omits secondary stores }); ``` ## Where to go next - ๐Ÿ“ฑ The full app file/contract that owns the runtime overlay โ†’ [`../app/module-contract.md`](../app/module-contract.md) - ๐Ÿงช How tasks consume snapshots in their judges โ†’ [`../../../bench_env/docs/task/TASK_AUTHORING_GUIDE.md`](../../../bench_env/docs/task/TASK_AUTHORING_GUIDE.md) - ๐Ÿ”Œ The `__SIM__` and `__OS__` global APIs โ†’ [`../../api/runtime-api.md`](../../api/runtime-api.md) - ๐Ÿค– What about the Android equivalent? โ†’ [`../android-mapping.md`](../android-mapping.md)