# CubeSandbox bench methodology ## Host (read this first if you suspect nested virtualisation) Both forkd and CubeSandbox were measured on the same **bare-metal** host. There is **no nested virtualisation** in this setup: ``` $ systemd-detect-virt none $ grep "model name" /proc/cpuinfo | head -1 model name : 12th Gen Intel(R) Core(TM) i7-12700 $ grep -o vmx /proc/cpuinfo | head -1 vmx ``` 12th-gen Intel Core, VT-x available directly, Ubuntu 24.04 / Linux 6.14 running on the metal. Every microVM in either project is host → L1 KVM guest, same level for both. CubeSandbox was **not** run inside a dev-env VM or any other intermediate hypervisor; the one-click install script targets the host directly (see "Setup" below). ## TL;DR | Path | N=100 wall-clock | Success | Per-sandbox | |---|---:|---:|---:| | Fast (pool entry reused) | **1,056 ± 14 ms** (5-run mean) | **100 %** | **10.6 ms** | | Slow (live `mkfs.ext4` + reflink-copy) | 20,304 ms | 77 % | 263 ms | Same bare-metal host for both (i7-12700, 20 vCPU, no nested virt). The slow-path row is what shipped first because the bench template used a 2 GiB writable-layer size that didn't match `pool_default_format_size_list` (default `["1Gi"]`); the maintainer clarified the distinction at [#235](https://github.com/TencentCloud/CubeSandbox/issues/235) and we re-ran on 2026-05-14 with `["1Gi", "2Gi"]`. The "Fast-path replay" section at the bottom has the full small-N curve and the config tweak required to stabilise pool warm-up on this host. ## Result (slow path) CubeSandbox N=100 spawn measured at **20,304 ms** on the same dev box forkd was measured on (Ubuntu 24.04 / Linux 6.14 / 20 vCPU / 30 GiB / KVM). **77 of 100** sandboxes spawned cleanly; the rest hit `newExt4RawByReflinkCopy failed: e2fsck 1.47.0 (5-Feb-2023): bad magic number in superblock` under concurrent load. The wall-clock figure is the full N=100 run including the failed-spawn rollbacks. ## Setup ```bash # CubeSandbox v0.2.0 one-click install with custom ports. # Patches applied on this host (1Panel-occupied default ports): # CubeMaster/conf.yaml — replace 127.0.0.1:3306 → :13306 # CubeMaster/conf.yaml — replace 127.0.0.1:6379 → :16379 sudo bash /opt/cube-stage/cube-sandbox-one-click-9c16021/install.sh # After install, port + service patches above, then: sudo /usr/local/services/cubetoolbox/scripts/one-click/up.sh # Build a template once (cached afterwards): cubemastercli template create-from-image \ --image python:3.12-slim \ --template-id forkd-bench-pynp \ --writable-layer-size 2Gi \ --allow-internet-access ``` The cube-api listens on port `6000` (we overrode `CUBE_API_BIND`). ## Workload `bench/cube-bench.py` (see [`compare-all.py`](./compare-all.py)) issues N concurrent `POST /sandboxes {"templateID":"forkd-bench-pynp"}` via the cube-api REST endpoint, then `DELETE /sandboxes/:id` per successful spawn. The numpy import workload runs inside each sandbox but most fail before they get there because of the storage issue noted below. ## Why success rate is < 100 % on this host (slow path) Under concurrent load, `newExt4RawByReflinkCopy` reports a corrupt ext4 superblock on the per-sandbox writable layer. The XFS filesystem hosting `/data/cubelet` has `reflink=1` enabled (it's a loop-mounted `/var/cube-xfs.img`; `xfs_info` confirms) and the host has plenty of free space, so this isn't filesystem or capacity-driven. Subsequent investigation (see "Fast-path replay" below) traced the real cause to **`mkfs.ext4` timing out** under cubelet's default `pool_worker_num = 8` against the hard-coded `cmdTimeout = 3 s` in `storage/shell.go`. Two cubelet instances both formatting 2 GiB images concurrently can push individual `mkfs.ext4` invocations past 3 s, the `ExecV` context cancels the command mid-write, and the next reader sees a half-baked superblock. The "bad magic number" message is the visible symptom; the timeout race is the cause. PRs [#236](https://github.com/TencentCloud/CubeSandbox/pull/236) (make `cmd_timeout` configurable) and [#237](https://github.com/TencentCloud/CubeSandbox/pull/237) (diagnostic context on failure) target this directly. A second N=100 run measured 20,304 ms / 77 succeeded; the first run measured 19,788 ms / 36 succeeded. Wall-clock is stable; success rate is variable. The chart row uses the more recent figure. ## Notes Tencent's published numbers ("<60 ms" cold-start, "<150 ms under concurrent") would put CubeSandbox ahead of forkd on raw cold-start. On the specific Ubuntu 24.04 / Linux 6.14 / 20-vCPU host we tested, the storage path was the bottleneck, not VM boot. A cleaner host (no 1Panel co-tenancy, dedicated XFS partition for `/data/cubelet`) is likely to give CubeSandbox a substantially better number. ## Upstream response (2026-05-14) We filed the methodology + the reflink-copy race upstream: [TencentCloud/CubeSandbox#235](https://github.com/TencentCloud/CubeSandbox/issues/235). The maintainer's response confirmed two things that recontextualise the numbers above: 1. **The race is on a slow code path the original template inadvertently selected.** CubeSandbox pre-formats a pool of writable-layer ext4 images at sizes listed in `pool_default_format_size_list` (default `["1Gi"]`). A sandbox whose `writable_layer_size` matches one of those sizes reuses a pool entry — fast path, no `mkfs.ext4` or reflink-copy per sandbox. We passed `--writable-layer-size 2Gi`, which doesn't match the default pool, so every sandbox went through the live `mkfs.ext4 + reflink-copy` slow path. That's where the bad-magic race lives. 2. **Cube's published N=50/N=100 numbers are measured on a 96 vCPU server.** A 20 vCPU host (this dev box) is outside their tested matrix. Per the maintainer: P99 under 200 ms at N=100 on a 96-vCPU node. Cube also accepted the first two improvements from our issue (a configurable `cmdTimeout`, and richer diagnostic info on `newExt4RawByReflinkCopy` failures) and is reviewing the third (drop per-clone `e2fsck`). ## Small-N replay on the same (slow-path) configuration After the upstream exchange we re-ran with the same 2 GiB template at smaller N — staying on the slow path so the comparison is apples-to-apples with the N=100 row, but small enough to fit the 30 GiB host RAM budget (template spec = 2 GiB per sandbox → max ~14 concurrent). Script: [`bench/cube-replay.sh`](./cube-replay.sh). | N | Succeeded | Wall-clock | Per-sandbox | |---:|:---:|---:|---:| | 1 | 1/1 | 924 ms | 924 ms | | 5 | 5/5 | 2,207 ms | 441 ms | | 10 | 10/10 | 2,567 ms | 257 ms | Observations: - **100 % success rate at every size we measured.** The reflink-copy race only fired at N=100 with the 2 GiB writable layer; smaller N hit no failures. - Single-instance cold start ≈ **924 ms** here, vs Cube's published fast-path **<60 ms**. The ~15× gap is the combined cost of the slow path (live `mkfs.ext4` plus reflink-copy of a 2 GiB image) and the host being well outside their 96 vCPU testing matrix. - Per-sandbox cost shrinks substantially with concurrency (924 → 441 → 257 ms / sandbox) — pipelined work the original 20.3 s / 100 = 203 ms-per-sandbox number is consistent with. What we did **not** measure here: the fast path (`writable_layer_size` matching `pool_default_format_size_list`). Doing so would require either a new template with a 1 GiB writable layer or reconfiguring the pool for 2 GiB; we left it for whenever either Cube or a downstream user wants a head-to-head fast-path number on this host. ## Fast-path replay (2026-05-14) After the upstream exchange we reconfigured the pool to include the template's writable-layer size and re-ran the bench. Two config edits in `/usr/local/services/cubetoolbox/Cubelet/config/config.toml` under `[plugins."io.cubelet.internal.v1.storage"]`: ```toml pool_default_format_size_list = ["1Gi", "2Gi"] # was ["1Gi"] pool_worker_num = 1 # was 8 ``` The first edit is what the maintainer was pointing at — `2Gi` now takes the fast path (no per-sandbox `mkfs.ext4` or reflink-copy). The second is a workaround for the `cmdTimeout` race described above: with 8 workers, pool warm-up at ~2 GiB images races itself into corruption before the bench even starts. With one worker, each `mkfs.ext4` runs alone and finishes well inside the 3 s budget. PR [#236](https://github.com/TencentCloud/CubeSandbox/pull/236) makes the timeout itself configurable, which is the right long-term fix. After restart and pool warm to 100 entries, five consecutive runs of an improved `bench/cube-bench.py` against `forkd-bench-pynp`. The improved script pre-warms Python's default `ThreadPoolExecutor` (so its lazy-init isn't charged to N=1) and reports per-call latency on top of wall-clock: | Phase | N | Wall-clock (mean ± σ over 5 runs) | Notes | |---|---:|---:|---| | cold-server | 1 | **184 ± 17 ms** | first call after cubelet restart | | warm-server | 1 | **156 ± 7 ms** | repeated single-call | | ramp | 10 | **212 ± 3 ms** | ≈ cold N=1; 20 vCPUs still have headroom | | ramp | 50 | **542 ± 11 ms** | 20-vCPU ceiling starts to bind | | ramp | 100 | **1056 ± 14 ms** | per-sandbox amortised ≈ **10.6 ms** | 100 % success at every N, every run. Observations: - **N=1 ≈ N=10 wall-clock.** Below the 20-vCPU ceiling the wall-clock is dominated by the slowest single sandbox-boot, not by the number of concurrent boots. Once N saturates the cores (≥ 50), per-sandbox amortised cost stabilises around 10–11 ms — close to the wall-time of one warm VM boot divided across the available parallelism. - **~55 ms cold-start delta** on the first request after a quiet cubelet (184 → 156 ms). The CubeSandbox maintainer [noted at #235][m1] that cube **v0.2.0** shipped with a ~50 ms latency regression that PR [#234][pr234] fixes in **v0.2.1**. Our observed delta is consistent with that. Numbers in the table are valid for the **v0.2.0** baseline we tested; v0.2.1 would shift each row down by roughly that amount. We did not retest on v0.2.1. - **N=100 wall-clock 1.04–1.07 s** — about 19× faster than the slow-path run on the same host, well inside Cube's published "<150 ms under concurrent" envelope at ~10.6 ms / sandbox amortised. - The N=1 figures here are still well above Cube's advertised "<60 ms" single-instance cold-start — that number was measured on a 96 vCPU host with the snapshot/CoW path warm. We didn't retest that shape. [m1]: https://github.com/TencentCloud/CubeSandbox/issues/235#issuecomment-4450390541 [pr234]: https://github.com/TencentCloud/CubeSandbox/pull/234 A first round of measurements posted [earlier in #235][m0] reported N=100 = 1,439–1,480 ms / 100 % succ and N=1 = 385 ms. Both figures were inflated by two artifacts: - The original bench script lazy-initialized Python's default `ThreadPoolExecutor` on the first `run_in_executor` call, charging ~50–100 ms to the N=1 measurement. - A stale cubemaster reconcile-retry loop was burning CPU during the first batch of runs (we'd previously killed cubelet for debugging without taking down cubemaster), adding background contention to every measurement. The numbers in the table above remove both biases. [m0]: https://github.com/TencentCloud/CubeSandbox/issues/235#issuecomment-4448111076