# High-Performance Order Matching Engine A production-grade, high-frequency trading (HFT) Limit Order Book (LOB) and Matching Engine written in C++20. Designed for extreme throughput, deterministic latency, and cache efficiency using modern lock-free techniques. ![C++](https://img.shields.io/badge/C++-20-blue.svg?style=flat&logo=c%2B%2B) ![License](https://img.shields.io/badge/License-MIT-green.svg) [![CMake Build](https://github.com/PIYUSH-KUMAR1809/order-matching-engine/actions/workflows/cmake.yml/badge.svg)](https://github.com/PIYUSH-KUMAR1809/order-matching-engine/actions/workflows/cmake.yml) [![Awesome Quant](https://awesome.re/badge.svg)](https://github.com/wilsonfreitas/awesome-quant) > **Performance Benchmark**: **~160,000,000 orders/second** (Average) | **~171,000,000** (Peak) on Apple M1 Pro. ### โšก๏ธ Key Takeaways * **Architecture**: Sharded "Share-by-Communicating" design avoids global locks. * **Memory**: `std::pmr` monotonic buffers on the stack = 0 heap allocations on hot path. * **Optimization**: `alignas(128)` (vs 64) reduced M1/M2 false sharing by ~5%. * **Latency**: Sub-microsecond matching latency within the engine core. --- ## ๐Ÿš€ Key Features * **Ultra-High Throughput**: Capable of processing **>160 million** distinct order operations per second on a single machine. * **Zero-Allocation Hot Path**: * **PMR (Polymorphic Memory Resources)**: Uses `std::pmr::monotonic_buffer_resource` with a pre-allocated 512MB stack buffer for nanosecond-level allocations. * **Soft Limits**: Gracefully falls back to heap allocation if the static buffer is exhausted (no crashes). * **Lock-Free Architecture**: * **SPSC Ring Buffer**: Custom cache-line aligned (`alignas(128)`) ring buffer for thread-safe, lock-free communication between producer and consumer. * **Shard-per-Core**: "Share by Communicating" design. Each CPU core owns a dedicated shard, eliminating mutex contention entirely. * **Thread Pinning**: Experimentally verified `pthread_setaffinity_np` / `thread_policy_set` guarantees exclusive core usage for worker threads. * **Cache Optimizations**: * **Flat OrderBook**: Replaces node-based maps with `std::vector` for linear memory access. * **Bitset Scanning**: Uses CPU intrinsics (`__builtin_ctzll`) to "teleport" to the next active price level, skipping empty levels instantly. Optimized to avoid "ghost level" checks. * **Compact Storage**: Order objects are PODs (Plain Old Data) optimized for `memcpy`. * **Smart Batching**: * **Producer-Side**: Thread-local accumulation of orders reduces atomic contention on the RingBuffer tail. * **Consumer-Side**: Workers pop commands in batches (up to 256) to amortize cache line invalidations. * **Verification & Safety**: * **Deterministic**: `--verify` mode runs a mathematically verifiable sequence (Matches == Min(Buys, Sells)). * **Instrumentation**: `--latency` mode enables wall-clock end-to-end latency tracking. --- ## ๐Ÿ— Architecture The system moves away from the traditional "Central Limit Order Book with Global Lock" to a **Partitioned/Sharded Model**. ```mermaid graph LR %% Styles for high visibility and contrast classDef client fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#0d47a1 classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#4a148c classDef core fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1b5e20 classDef memory fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#e65100 User([User / Network]) -->|TCP Input| Gateway{Symbol Mapping} class User client class Gateway infra %% Shard 0 Pipeline subgraph Core0 [Core 0: Pinned] direction TB RB0[Ring Buffer
SPSC Lock-Free]:::infra Matcher0[Matching Engine
Shard 0]:::core Mem0[PMR Arena
Stack Buffer]:::memory RB0 --> Matcher0 Matcher0 --> Mem0 end %% Shard 1 Pipeline subgraph Core1 [Core 1: Pinned] direction TB RB1[Ring Buffer
SPSC Lock-Free]:::infra Matcher1[Matching Engine
Shard 1]:::core Mem1[PMR Arena
Stack Buffer]:::memory RB1 --> Matcher1 Matcher1 --> Mem1 end %% Routing (No crossing) Gateway -->|Shard 0: GOOG, MSFT| RB0 Gateway -->|Shard 1: AAPL, TSLA| RB1 ``` 1. **Ingestion (Exchange)**: * Orders are received and hashed by `SymbolID`. * "Smart Gateway" logic routes the order to the specific Shard owning that symbol. 2. **Transport (Ring Buffer)**: * Orders are pushed into a lock-free Single-Producer Single-Consumer (SPSC) ring buffer. * **Union-Based Commands**: Uses a `union` structure to overlay `Add` and `Cancel` commands, saving memory and fitting more commands per cache line. 3. **Matching (Core)**: * **Flat OrderBook**: Bids and Asks are simple `std::pmr::vector`s indexed directly by price (O(1) lookup). * **Matcher**: Iterates linearly over the vector for maximum hardware prefetching efficiency. Active orders are tracked via a `Bitset`. 4. **Memory Management**: * A Monotonic Buffer (Arena) provides memory for new orders. It resets instantly (`release()`) between benchmark runs, preventing fragmentation. --- ## ๐Ÿ›  Build & Run ### Prerequisites * C++20 Compiler (GCC 10+ / Clang 12+) * CMake 3.14+ ### Compiling ```bash mkdir -p build && cd build cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-march=native" make -j$(nproc) ``` ### Running Benchmarks To replicate the >160M ops/s performance: ```bash ./build/src/benchmark ``` To measure **End-to-End Latency** (approximate P50/P99): ```bash ./build/src/benchmark --latency ``` *Note: Latency mode adds instrumentation overhead and runs usually at ~15-20M ops/sec on M1 Pro due to timestamp calls.* ### Running Verification To ensure the engine is actually matching correctly and not just dropping frames: ```bash ./build/src/benchmark --verify ``` ### Real-World Market Replay Test the engine against **live Binance L3 data** (Trade + Depth updates) to verify handling of realistic price clustering and bursty order flow. 1. **Record Data** (Requires Python 3 + `websocket-client`): ```bash # Record 60 seconds of live BTCUSDT data python3 scripts/record_l3_data.py 60 ``` 2. **Run Replay**: ```bash ./build/src/benchmark --replay data/market_data.csv ``` > **Result**: ~132,000,000 orders/sec (M1 Pro) on real-world data. ### Running the Server Start the engine networking layer (listens on port 8080): ```bash ./build/src/OrderMatchingEngine ``` **Populate the Book (Optional):** To quickly seed the book with Bids, Asks, and Trades for testing/visualization: ```bash python3 scripts/seed_orders.py ``` --- ## ๐Ÿ“Š Client Usage (TCP) Connect using `netcat` or any TCP client. **Submit Order:** ```text BUY AAPL 100 15000 > ORDER_ACCEPTED_ASYNC 1 ``` *(Format: SIDE SYMBOL QTY PRICE_INT)* **Subscribe to Market Data:** ```text SUBSCRIBE AAPL > SUBSCRIBED AAPL > TRADE AAPL 15000 50 ``` --- ## ๐Ÿงช Testing The project includes a comprehensive GoogleTest suite covering matching logic, cancellation scenarios, and partial fills. ```bash ./build/tests/unit_tests ``` ## ๐Ÿ“ˆ Roadmap (Future) * **Kernel Bypass**: Integration with DPDK/Solarflare for sub-microsecond wire latency. * **Market Data Distribution**: Multicast UDP feed for quote dissemination.