Author here.
This started as a quest to see how fast I could push a single core on an M1 Pro.
I built an order matching engine from scratch using C++20.
Initially, it did ~100k ops/sec. After a month of optimization, it now does ~156M ops/sec.
Key optimizations:
- Removed all mutexes (Shard-per-Core architecture).
- Custom lock-free SPSC Ring Buffer for thread communication.
- Replaced std::map with flat vectors + bitset scanning (using CTZ intrinsics).
- Zero-allocation hot path using std::pmr (Polymorphic Memory Resources) backed by stack buffers.
To check that it handles real market data (not just random numbers), I replayed captured Binance L3 market data through it; that run reached 132M ops/sec.
Detailed write-up of the optimization journey here:
https://medium.com/@kpiyush8826/how-i-optimized-a-c-matching...
Happy to answer questions!