# Dynoxide Benchmarks

Reproducible benchmarks comparing Dynoxide against DynamoDB Local and LocalStack.

## Quick Start

```bash
cd benchmarks

# Run all criterion micro-benchmarks (embedded mode)
cargo bench

# Run the full workload macro-benchmark (embedded mode)
cargo bench --bench embedded_macro

# Run the CI pipeline simulation (embedded + HTTP modes)
cargo run --release --bin ci_pipeline_bench

# Run against DynamoDB Local (requires Docker)
../benchmarks/scripts/start_dynamodb_local.sh
cargo run --release --bin ci_pipeline_bench -- --ddb-endpoint http://localhost:8000

# Run memory profiler
cargo run --release --bin memory_profiler

# Run startup benchmark
cargo run --release --bin startup_bench

# Run workload driver against a specific endpoint
cargo run --release --bin workload_driver -- --endpoint-url http://localhost:8123

# Run iai-callgrind instruction-count benchmarks (Linux only, requires Valgrind)
cargo bench --bench iai_core --features iai-callgrind

# Generate charts from results
python3 scripts/generate_report.py --results-dir results --output-dir results/charts
```

## Four-Tier Comparison Model

| Tier | What it is | Notes |
|------|------------|-------|
| **Dynoxide Embedded** | Direct Rust API calls via `Database::memory()` | No network, no HTTP, no serialisation overhead |
| **Dynoxide HTTP** | Axum server via `http-server` feature | Quantifies the HTTP overhead eliminated by embedded mode |
| **DynamoDB Local** | Docker container, JVM-based | The incumbent to beat |
| **LocalStack** | Docker container, Python + DynamoDB Local | Startup, image size, idle memory only |

LocalStack uses DynamoDB Local internally as its DynamoDB engine: its emulation layer is DynamoDB Local's JVM process wrapped with Python routing. This is why its startup and memory numbers are higher than DynamoDB Local alone: you're paying for both runtimes.
We benchmark LocalStack for startup time, image size, and idle memory only, because operation-level benchmarks would measure DynamoDB Local's performance through an additional network hop.

## Results: Local Development (Apple Silicon)

Results from a Mac Studio (M-series). These reflect the experience of a developer running Dynoxide locally as a dev server or running tests directly.

### Cold Startup

| Target | Mean | vs DynamoDB Local |
|--------|------|-------------------|
| Dynoxide Embedded | **~0.2ms** | **~10,046x faster** |
| Dynoxide HTTP | **~15ms** | **~148x faster** |
| DynamoDB Local | ~2,287ms | baseline |
| LocalStack | ~6,231ms | 2.7x slower |

### Per-Operation Comparison (Dynoxide HTTP vs DynamoDB Local)

| Operation | Dynoxide HTTP (p50) | DynamoDB Local (p50) | Speedup |
|-----------|---------------------|----------------------|---------|
| CreateTable | 0.40ms | 13.5ms | **34x** |
| GetItem | 0.12ms | 0.80ms | **6.7x** |
| PutItem | 0.14ms | 0.92ms | **6.5x** |
| Query (base) | 0.13ms | 1.7ms | **13x** |
| Query (GSI) | 0.13ms | 1.2ms | **8.7x** |
| Query (paginated) | 1.5ms | 4.7ms | **3.1x** |
| Scan (full table) | 45.6ms | 101.3ms | **2.2x** |
| UpdateItem | 0.15ms | 1.3ms | **9.1x** |
| TransactWriteItems | 0.28ms | 3.9ms | **14x** |
| BatchGetItem (100 keys) | 1.7ms | 11.7ms | **6.9x** |
| BatchWriteItem (25 items) | 1.1ms | 7.5ms | **6.7x** |
| DeleteItem | 0.12ms | 0.97ms | **7.9x** |
| **Total workload** | **1.3s** | **10.0s** | **7.5x** |

### CI Pipeline Simulation (50 integration tests)

Each speedup is relative to the DynamoDB Local row with the same concurrency.

| Mode | Wall Clock | Speedup vs DDB Local |
|------|------------|----------------------|
| Dynoxide Embedded (sequential) | ~484ms | **5.0x** |
| Dynoxide Embedded (4x parallel) | ~203ms | **5.9x** |
| Dynoxide HTTP (sequential) | ~569ms | **4.2x** |
| Dynoxide HTTP (4x parallel) | ~235ms | **5.1x** |
| DynamoDB Local (sequential) | ~2,407ms | baseline |
| DynamoDB Local (4x parallel) | ~1,189ms | baseline |

### Embedded Micro-benchmarks (criterion)

| Operation | Latency |
|-----------|---------|
| GetItem | 9µs |
| PutItem (small / medium / large) | 11µs / 19µs / 128µs |
| Query (base, ~50 hits) | 723µs |
| Query (GSI) | 14µs |
| Scan (filter, 1K items) | 6.1ms |
| UpdateItem | 72µs |
| DeleteItem | 24µs |
| BatchWrite (25) | 497µs |
| BatchGet (100) | 899µs |
| TransactWrite (4) | 112µs |

## Results: CI (GitHub Actions)

Results from `ubuntu-latest` (2-core AMD EPYC 7763, 8GB RAM). Commit [`e066fc0`](../../../commit/e066fc0c6837b8b3de6070323597c157eb06e0bf). Absolute wall-clock numbers vary between runners; ratios are stable across runs.

### Cold Startup

| Target | Mean | Stddev | vs DynamoDB Local |
|--------|------|--------|-------------------|
| Dynoxide Embedded | 0.3ms | ±0.2ms | **~9,523x faster** |
| Dynoxide HTTP | 2.2ms | ±2.6ms | **~1,192x faster** |
| DynamoDB Local | 2,596ms | ±519ms | baseline |
| LocalStack | 11,648ms | ±375ms | 4.5x slower |

DynamoDB Local warm start (after JVM JIT): ~2ms. The cold start cost is what CI pipelines actually pay.

### Per-Operation Comparison (Dynoxide HTTP vs DynamoDB Local)

| Operation | Dynoxide HTTP (p50) | DynamoDB Local (p50) | Speedup |
|-----------|---------------------|----------------------|---------|
| CreateTable | 0.73ms | 24.0ms | **33x** |
| GetItem | 0.33ms | 0.79ms | **2.4x** |
| PutItem | 0.40ms | 0.94ms | **2.4x** |
| Query (base) | 0.33ms | 1.5ms | **4.4x** |
| Query (GSI) | 0.37ms | 1.1ms | **3.1x** |
| Query (paginated) | 2.8ms | 6.9ms | **2.5x** |
| Scan (full table) | 74.6ms | 256.5ms | **3.4x** |
| UpdateItem | 0.43ms | 1.4ms | **3.3x** |
| TransactWriteItems | 0.71ms | 4.4ms | **6.3x** |
| BatchGetItem (100 keys) | 3.1ms | 12.4ms | **4.0x** |
| BatchWriteItem (25 items) | 2.5ms | 8.9ms | **3.6x** |
| DeleteItem | 0.35ms | 0.94ms | **2.7x** |
| **Total workload** | **3.0s** | **11.5s** | **3.8x** |

After CreateTable (33x), the largest speedups are on Query, Scan, and BatchGetItem reads and on multi-item writes (BatchWriteItem, TransactWriteItems), where Dynoxide avoids JVM
dispatch overhead and lock contention. Single-row writes (PutItem, DeleteItem) still show a clear win at 2-3x.

### CI Pipeline Simulation (50 integration tests)

Each speedup is relative to the DynamoDB Local row with the same concurrency.

| Mode | Wall Clock | Speedup vs DDB Local |
|------|------------|----------------------|
| Dynoxide Embedded (sequential) | 784ms | **3.2x** |
| Dynoxide Embedded (4x parallel) | 389ms | **5.0x** |
| Dynoxide HTTP (sequential) | 778ms | **3.2x** |
| Dynoxide HTTP (4x parallel) | 457ms | **4.2x** |
| DynamoDB Local (sequential) | 2,518ms | baseline |
| DynamoDB Local (4x parallel) | 1,929ms | baseline |

DynamoDB Local barely benefits from parallelism (2,518ms → 1,929ms). Under concurrent load, individual tests take 3-4x longer due to JVM contention: setup times spike to 200-1,000ms on some tests as `CreateTable` calls queue behind the JVM's single-threaded SQLite access. Dynoxide embedded scales better because each test gets its own isolated `Database::memory()` with no shared state.

### Embedded Micro-benchmarks (criterion)

These measure Dynoxide's embedded API directly: no HTTP, no serialisation. This is the performance you get when using `Database::memory()` in your Rust test suite.

| Operation | Latency |
|-----------|---------|
| GetItem | 14µs |
| PutItem (small / medium / large) | 25µs / 45µs / 281µs |
| Query (base, ~50 hits) | 1.1ms |
| Query (GSI) | 26µs |
| Scan (filter, 1K items) | 8.2ms |
| UpdateItem | 141µs |
| DeleteItem | 51µs |
| BatchWrite (25) | 1.1ms |
| BatchGet (100) | 1.4ms |
| TransactWrite (4) | 257µs |

## Memory & Disk

| Metric | Dynoxide In-Memory | Dynoxide File-Backed | DynamoDB Local (Docker) | LocalStack (Docker) |
|--------|--------------------|----------------------|-------------------------|---------------------|
| Idle | 4.9 MB RSS | 45.6 MB RSS | 185.2 MB RSS | 493.8 MB RSS |
| After 10K items (~1KB each) | 45.7 MB RSS | 45.6 MB RSS | n/a | n/a |
| Disk (10K items) | n/a | 15.6 MB | n/a | n/a |
| Disk (empty table) | n/a | 121 KB | n/a | n/a |

File-backed mode shows higher RSS at creation because SQLite memory-maps the file.
Both modes converge at 10K items. The ~46MB for 10K items is ~4.6KB per item, roughly 4.6x the ~1KB raw item size, which accounts for SQLite structures, indexes, and the GSI.

Docker idle memory is the mean of 3 samples taken 10s apart, starting 30s after the first successful health check, with no requests served.

## Why the Numbers Differ Between Local and CI

On the sequential embedded CI pipeline, local development (Apple Silicon) shows a 5.0x speedup over DynamoDB Local; GitHub Actions shows 3.2x. The total HTTP workload follows the same pattern (7.5x locally, 3.8x on CI). This is expected:

- Apple Silicon has significantly faster single-thread performance than the 2-core EPYC VM
- Dynoxide is CPU-bound (native code, in-process SQLite), so it scales roughly linearly with CPU speed
- DynamoDB Local is JVM-overhead-bound (class loading, JIT compilation, GC), so it benefits less from faster CPUs on cold start
- The net effect: faster hardware widens the gap between native code and JVM overhead

Both are real measurements of the same benchmark suite. The CI numbers are reproducible by anyone with a GitHub account; the local numbers reflect what developers actually experience day-to-day.

## Benchmark Binaries

| Binary | Purpose |
|--------|---------|
| `ci_pipeline_bench` | Simulates 50 integration tests across all modes |
| `workload_driver` | Configurable macro-benchmark against any HTTP endpoint |
| `startup_bench` | Cold/warm start measurement for all backends |
| `memory_profiler` | RSS and disk usage tracking over workload steps |
| `size_comparison` | Binary and Docker image size comparison |

## Criterion Benchmarks

| Benchmark | What it measures |
|-----------|------------------|
| `embedded_micro` | Individual operations (12 benchmarks): PutItem, GetItem, Query, Scan, etc. |
| `embedded_macro` | Full 13-step workload against in-memory database |
| `embedded_file_backed` | 11-step workload against file-backed SQLite database |
| `http_macro` | 11-step workload against Dynoxide HTTP server |
| `iai_core` | Iai-Callgrind instruction-count benchmarks (Linux only, requires Valgrind) |

Criterion generates HTML reports with detailed charts (probability density plots, regression plots, and violin plots for grouped benchmarks like PutItem small/medium/large). These are available in the CI artifacts under `target/criterion/*/report/`.

## Methodology

### Standard Workload (13 Steps)

1. CreateTable (pk:S HASH, sk:N RANGE, GSI on email:S)
2. BatchWriteItem: 10,000 medium items in batches of 25
3. GetItem: 1,000 reads by primary key
4. PutItem: 1,000 individual writes (mixed sizes)
5. Query (base table): 100 queries with key conditions + filters
6. Query (GSI): 100 queries on the email index
7. Query (paginated): 50 queries following LastEvaluatedKey
8. Scan: full table scan with filter
9. UpdateItem: 500 updates with condition expressions
10. TransactWriteItems: 50 transactions of 4 actions each
11. BatchGetItem: 100 batches of 100 keys
12. DeleteItem: 500 deletes
13. DeleteTable

### Item Definitions

| Size | Attributes | Approximate size |
|------|------------|------------------|
| Small | 3 (pk, sk, name) | ~200B |
| Medium (default) | 10 (pk, sk, name, email, age, address map, tags string set, scores list, active bool, metadata map) | ~1KB |
| Large | Medium + binary payload | ~50KB |

### JVM Warmup Protocol

DynamoDB Local runs on the JVM, which benefits from JIT compilation.
For fairness:

- **Cold start benchmarks** measure real-world CI experience (no warmup)
- **Warm start benchmarks** run 500 PutItem + 100 GetItem before timing
- **Workload driver** includes a warmup phase excluded from timing
- Both cold and warm numbers are reported and clearly labelled

### Build Profile

All benchmarks run with `--release` (`opt-level = 3`, `lto = "thin"`). Debug builds are 10-100x slower and would produce misleading numbers. Criterion uses release mode by default; custom binaries must be run with `cargo run --release`.

### CI Benchmark Philosophy

Wall-clock benchmarks on shared CI runners are noisy, with up to 3x variance between runs due to noisy neighbours. But **comparative** benchmarks (Dynoxide vs DynamoDB Local in the same job) are reliable because the noise affects both equally. The ratio is stable. We publish relative claims ("6.9x faster"), never absolute CI wall-clock numbers as headline figures.

Instruction-count benchmarks (Iai-Callgrind) provide deterministic regression detection independent of runner load.
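
To illustrate why a same-job ratio is less sensitive to runner noise than an absolute time, here is a minimal sketch. The function names `speedup` and `format_speedup` are hypothetical and are not part of the Dynoxide benchmark code; the sketch also assumes that runner noise scales both measurements by roughly the same factor, which is the premise stated above.

```rust
/// Speedup of `candidate_ms` relative to `baseline_ms`, both wall-clock
/// times measured in the same CI job. Hypothetical helper for illustration.
/// Returns `None` when the inputs cannot produce a meaningful ratio.
fn speedup(baseline_ms: f64, candidate_ms: f64) -> Option<f64> {
    if baseline_ms.is_finite() && candidate_ms.is_finite() && candidate_ms > 0.0 {
        Some(baseline_ms / candidate_ms)
    } else {
        None
    }
}

/// Formats a ratio with one decimal place, in the style used by the
/// result tables (for example "3.2x").
fn format_speedup(ratio: f64) -> String {
    format!("{:.1}x", ratio)
}
```

With the sequential numbers from the GitHub Actions pipeline table, `speedup(2518.0, 784.0)` is about 3.2. If a slow runner stretched both timings by the same factor, the ratio would be unchanged even though both absolute numbers moved.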
### Statistical Rigour

- Criterion benchmarks use configurable sample sizes (10-100)
- Startup benchmarks report mean and standard deviation over 5 repetitions
- CI pipeline benchmarks report per-test timing breakdowns (all 50 tests)
- Workload driver collects per-request latencies for p50/p95/p99 percentiles

## Honesty & Limitations

### What Dynoxide Does Better

- **Startup time**: No JVM, no Docker; microsecond-level embedded initialisation
- **CI pipeline speed**: Zero-cost per-test isolation in embedded mode
- **Resource usage**: ~3 MB download (~6 MB on disk) vs 225 MB Docker download (471 MB on disk); ~5 MB idle RSS vs ~185 MB
- **Embedded mode**: Eliminates HTTP overhead entirely for Rust and iOS consumers
- **Predictable latency**: No JVM GC pauses, no JIT warmup effects

### What DynamoDB Local Does Better

- **Feature completeness**: Full DynamoDB API surface with exact behaviour matching
- **AWS-maintained**: Official tooling with ongoing updates

### Known Limitations

- Dynoxide's `Arc<Mutex<…>>` serialises all operations, so parallel benchmarks on a single `Database` instance measure mutex contention, not true parallelism (the CI pipeline benchmark avoids this by giving each parallel test its own `Database::memory()`)
- JVM warmup means DynamoDB Local's warm-start performance is dramatically better than cold-start; both are reported
- CI wall-clock numbers vary between runners, so use relative ratios, not absolute numbers
- Apple Silicon and x86_64 show different absolute numbers, and the ratios are larger on Apple Silicon (see "Why the Numbers Differ Between Local and CI"); Dynoxide is faster in every comparison on both
- PutItem and DeleteItem show the smallest gains over DynamoDB Local (2-3x on CI runners); Dynoxide's advantage is largest on queries, transactions, and batch/scan workloads

## Reproducibility

Fork the repo, install Docker, and run:

```bash
cd benchmarks
cargo bench                                              # criterion
cargo run --release --bin ci_pipeline_bench              # embedded + HTTP
../benchmarks/scripts/start_dynamodb_local.sh            # start DDB Local
cargo run --release --bin ci_pipeline_bench -- --ddb-endpoint http://localhost:8000
```

Or run the full
CI workflow by pushing to a fork: `.github/workflows/benchmark-comparative.yml` runs everything and stores results in the `benchmark-data` branch.

## CI Workflows

### `benchmark-regression.yml` (on every PR)

Runs two suites:

- criterion micro-benchmarks: wall-clock, compared against the baseline from the `benchmark-data` branch; blocks merge on a >20% regression
- iai-callgrind instruction-count benchmarks: deterministic (<1% variance); detects algorithmic regressions independent of CI runner load

### `benchmark-comparative.yml` (on push to main)

Runs full comparative benchmarks (Dynoxide vs DynamoDB Local), stores results in the `benchmark-data` branch for historical tracking, and uploads criterion charts as workflow artifacts.

To view historical results:

```bash
git fetch origin benchmark-data
git log origin/benchmark-data --oneline
git show origin/benchmark-data:runs//run_summary.json
```

## System Requirements

- Rust stable toolchain
- Docker (for DynamoDB Local / LocalStack comparison)
- Python 3 + matplotlib (for chart generation: `pip install matplotlib`)
- Valgrind (for iai-callgrind, Linux only: `apt-get install valgrind`)
- `iai-callgrind-runner` (matching version: `cargo install iai-callgrind-runner --version 0.14.2`)
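
As a footnote to the merge gate described under `benchmark-regression.yml`, the >20% rule can be sketched as follows. This README does not show how the workflow implements the check, so the function names, the choice of nanoseconds, and the comparison of a single summary time per benchmark are all assumptions made for illustration.

```rust
/// Relative change of `current_ns` versus `baseline_ns`.
/// 0.25 means the current run is 25% slower; negative values mean faster.
/// Hypothetical helper; not taken from the Dynoxide workflow scripts.
fn relative_change(baseline_ns: f64, current_ns: f64) -> f64 {
    (current_ns - baseline_ns) / baseline_ns
}

/// Returns true when the slowdown exceeds `threshold` (0.20 for a 20% gate).
/// A non-positive baseline is treated as "no usable baseline", so it never
/// counts as a regression in this sketch.
fn is_regression(baseline_ns: f64, current_ns: f64, threshold: f64) -> bool {
    baseline_ns > 0.0 && relative_change(baseline_ns, current_ns) > threshold
}
```

For example, a benchmark that moves from 100ns to 125ns is a 25% slowdown and would trip a 0.20 threshold, while a move to 115ns would not.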