My LSM tree was slower than a B-tree. Then I profiled it
3 days ago
- #LSM Tree
- #Performance Optimization
- #Database Engineering
- The author built an LSM tree in Go to understand storage engines like RocksDB, but it was initially slower than a B-tree, with only 250,000 writes per second.
- B-trees slow down on write-heavy workloads due to random writes, while LSM trees avoid this by writing to an in-memory MemTable first and then performing sequential writes to disk.
- Profiling revealed bottlenecks: the write syscall (34% CPU), garbage collection and sorting (48% CPU). Fixes included memory-mapping the WAL to reduce syscalls and using a streaming k-way merge to eliminate large allocations and sorting.
- A bug in the bloom filter sizing caused reads to slow down; fixing it by using a correct entry count restored read performance.
- Further optimizations included memory-mapping SSTable files for zero-copy reads, incremental CRC computation, buffered writes, and pre-allocating MemTable capacity.
- The final version achieved 1.98 million sequential writes per second, 5x faster than BoltDB on random writes, but still slower than BadgerDB and BoltDB on reads due to LSM trade-offs.
- Correctness was ensured through tests including key overwrites, crash recovery, and tombstone propagation, all passing with race detection.
- The author emphasized that profiling, not guessing, drove all optimizations, and the bottleneck shifted with each fix, highlighting the importance of measurement in performance tuning.