# HoloStore — Design Notes This document reflects the current implementation (Accord consensus, batched WAL, async apply, and fjall-backed storage). It focuses on the high-level design and the concurrency/durability tradeoffs that exist today. ## Overview HoloStore is a distributed, strongly-consistent key/value store built on the Accord consensus algorithm. Each data shard maps to an Accord group. Clients talk to a Redis-compatible interface; commands are routed into the consensus engine, committed by quorum, and then executed against the storage engine. ## Accord Groups Accord is leaderless. Any node can coordinate proposals. Writes (`SET`): - Proposals go through pre-accept → accept → commit. - Dependencies between transactions are tracked per key. - A fast path is possible when dependencies converge. - Committed transactions are executed asynchronously by the executor loop. Reads (`GET`): - Reads can go through the protocol or other modes (see `READ_MODES.md`). ## Concurrency Model The system deliberately separates consensus, durability, and execution: - Consensus runs on the async runtime (RPC handling + proposal coordination). - Commit log appends are batched on a dedicated commit-log thread. - Storage applies are executed on a dedicated apply worker thread to keep the async runtime responsive. This separation reduces head-of-line blocking and allows batching to amortize IO. ## Execution Scheduling Committed transactions are tracked in two queues: - `committed_queue`: all committed txns ordered by sequence. - `committed_ready`: committed txns whose dependencies are already executed. When a transaction executes, its dependents are decremented in a pending-deps map and moved into `committed_ready` when unblocked. This avoids repeated scans of the entire committed queue under load. If cycles exist or ready items cannot be found, the executor falls back to a dependency SCC scan to resolve a consistent execution order. ## Batching Batching exists at multiple layers: - Client batching coalesces multiple SETs into a single proposal. - RPC batching coalesces pre-accept/accept/commit messages per peer. - Commit-log batching writes multiple commits per WAL append. - Apply batching writes multiple commands to the state machine at once. Batch sizes are tunable via environment variables and are critical to throughput/latency tradeoffs. ## Commit Log (WAL) The commit log is the durability source of truth. Design: - Commits are appended to the WAL before they are applied. - Appends are batched to amortize syscalls and fsync. - The WAL can run with different persistence modes: - `sync_data` (durable data), `sync_all`, or `buffer` (non-durable). - A background compaction path removes executed entries once safe. Key idea: the WAL retains committed commands, so a node can recover even if the storage engine lags behind execution. ## Storage Engine (Fjall) The state machine writes to fjall. Durability depends on fjall’s own fsync policy plus the WAL. Current behavior: - WAL is the authoritative durability source. - Fjall can be configured to fsync periodically (`HOLO_FJALL_FSYNC_MS`), or disabled. - On restart, WAL replay repopulates fjall state. ## Range Generations HoloStore's KV version is ordered as `(range_generation, seq, txn_id)`. `seq` is still the Accord execution sequence, but it is only meaningful inside the range/Accord group that produced it. `range_generation` is a durable range-ownership epoch that disambiguates writes across split and merge cutovers. Generations are not incremented on ordinary writes. They advance when range metadata creates a new ownership epoch: - A split assigns the child range a generation greater than all current ranges. - A merge assigns the merged range a generation greater than both inputs and greater than the current maximum. - Existing/legacy data decodes as generation `1`; the zero version sentinel remains generation `0`. This prevents an old parent range or retired group from replaying/applying a late command with a high group-local `seq` and overwriting a newer child-range value. The WAL command bytes carry the generation chosen at routing time so the same ordering is reconstructed during restart replay. ## Recovery Recovery uses three sources of truth: - WAL replay for committed commands on startup. - Executed-log values to satisfy lagging replicas that need command bytes. - Peer recovery RPCs when a node lacks command bytes locally. Executed-log GC: - Nodes gossip executed prefixes per origin. - A global minimum executed prefix is computed. - Executed-log entries at or below that prefix are eligible for GC. ## Observability `HOLOSTATS` exposes internal counters and timings via Redis: - Consensus progress counters and execution latency. - WAL batch and fsync metrics. - Queue lengths and executed-log sizes. These metrics are the primary tool for diagnosing stalls, batching behavior, and IO pressure. ## Key Tradeoffs - Larger batches increase throughput but raise tail latency. - More shards reduce per-group contention but increase parallelism and memory. - WAL persistence settings directly affect durability and performance. - Read mode choices affect consistency (see `READ_MODES.md`). ## Running Linearizability Checks Porcupine tests live under `scripts/porcupine.sh`; the suite is `scripts/check_linearizability.sh` and is wired to `make check-linearizability`. ## Partitioning HoloStore partitions data by **Accord group**. Each partition is a separate consensus group with its own log, execution queue, and state-machine application. At startup the node creates `HOLO_MAX_SHARDS` data group slots (IDs starting at a fixed base), and the Redis front-end routes each key to a shard. Current characteristics: - Range splitting is dynamic and can activate additional shard slots up to `HOLO_MAX_SHARDS`. - Each shard has its own commit log and executor. - Sharding reduces contention per group and improves parallelism. Remaining roadmap: - Rebalancing ranges and leaseholders across nodes automatically. - Range merge policy for shrinking/cold ranges. - Smoother live movement of data during reconfiguration.