# Known Issues Silent traps and host limits that affect `bench_env` at production scale. Each issue here produces wrong numbers or hangs *without* crashing, so it is worth scanning this page once before configuring a large run. --- ## 1. `--processes N --isolation contexts` silently lowers SR **Symptom.** No crash, no error logs. SR drops by roughly 4–5 percentage points compared to an equivalent `--isolation pages` run. The regression concentrates on tasks that inject seed state in `_post_sample`. **Mechanism.** Under multi-process sharding, all N contexts inside a single Chromium begin `reset` simultaneously through `asyncio.gather` with no stagger. IndexedDB hydrate completes *after* `_post_sample` has called `setState`, and silently overwrites the injected seed values. The race window does not fire on `--isolation pages` (pages share an IDB origin so hydrate is already warm when the second page resets) and does not fire on single-process `contexts` (natural ramp staggering keeps the window narrow). **Workaround.** Use `--isolation pages` for all multi-process runs. It is the production default and is faster than `contexts` in practice. --- ## 2. Page-per-browser limit: keep `N / B ≤ 8` **Symptom.** Beyond roughly 8 pages in one Chromium process: occasional pointer-event jitter, short GC pauses, and flaky page initialization. Not a hard failure, but stability degrades. **Mechanism.** A single Chromium process serializes parts of its rendering pipeline (renderer GC, main-thread compositor) across all its pages. Empirically, 6–8 concurrent pages per browser is the cleanly-handled range. **Workaround.** With `--isolation pages`, configure `--browsers B --parallel N` so that `N / B ≤ 8`. To grow total concurrency, add browsers — not pages per browser. For best fault isolation, **pair browsers and processes 1:1** (`--processes B --browsers B`): each Chromium then runs under its own Python worker, so a single browser crash or memory leak is contained to one shard and cannot disrupt the rest. Recommended layouts (all `--isolation pages`): | Target parallelism | Layout | |---|---| | ≤ 8 | 1 process × 1 browser × N pages | | 16 | 2 processes × 1 browser × 8 pages | | 32 | 4 processes × 1 browser × 8 pages | | 256 | 32 processes × 1 browser × 8 pages | --- ## 3. `--parallel ≥ 192` stalls on `_wait_ready __SIM__ timeout` **Symptom.** `errors.jsonl` fills with entries like ``` RuntimeError: [WN][page#1] _wait_ready phase=__SIM__ timeout: TimeoutError: Page.wait_for_function: Timeout 60000ms exceeded. ``` while CPU, GPU, network, and disk all sit around 10% idle. The pipeline looks starved but nothing is actually busy. **Mechanism.** Linux caps `fs.inotify.max_user_instances` per uid (default 128 on older kernels, 1024 on Ubuntu 22.04+). Each headless Chromium creates at least one inotify instance. Once the per-uid cap is reached, `inotify_init()` returns `EMFILE`, and the affected Chromium subsystems enter silent retry loops — deferring `__SIM__` exposure past the 60-second readiness timeout. ### Diagnostic While the run is stalling, in a separate shell: ```bash find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l ``` - The number ≈ `cat /proc/sys/fs/inotify/max_user_instances` → this is the cause. - The number ≪ the cap → look elsewhere. An idle system shows fewer than 50 inotify instances. ### Workaround (with sudo, preferred) ```bash # Temporary (resets on reboot) sudo sysctl -w fs.inotify.max_user_instances=8192 # Persistent echo "fs.inotify.max_user_instances = 8192" | sudo tee /etc/sysctl.d/99-mobilegym.conf sudo sysctl --system ``` 8192 is conservative; ML and CI hosts commonly run 32 768 – 524 288. The value only sets a kernel hash-table preallocation cap and has no security implications. ### Workaround (no sudo) The inotify cap is per uid, so two alternatives exist: 1. **Run on a different uid.** Two users sharing the host each get their own 128 instances. 2. **Stagger the launch.** Split one large run into chunks of ≤ 80 envs each, launched at least 60 s apart. Inotify usage stabilizes between chunks. Merge `results.jsonl` and `summary.json` after all chunks complete. ```bash python -m bench_env.run --parallel 80 --runs-dir runs/part1 ... & sleep 60 python -m bench_env.run --parallel 80 --runs-dir runs/part2 ... & sleep 60 python -m bench_env.run --parallel 80 --runs-dir runs/part3 ... & wait ``` ### What does *not* work User namespaces cannot bypass this limit. Although `unshare --user --map-root-user` grants `CAP_SYS_RESOURCE` inside the new namespace, modifying `fs.inotify.max_user_instances` requires the capability in the init namespace; child namespaces inherit the parent's cap and can only lower it. --- ## See also - [`FRAMEWORK.md`](FRAMEWORK.md) §6 — full isolation-level reference and multi-process sharding semantics.