My LSM tree was slower than a B-tree. Then I profiled it

3 days ago

The author built an LSM tree in Go to understand storage engines like RocksDB, but it was initially slower than a B-tree, with only 250,000 writes per second.
B-trees slow down on write-heavy workloads due to random writes, while LSM trees avoid this by writing to an in-memory MemTable first and then performing sequential writes to disk.
Profiling revealed bottlenecks: the write syscall (34% CPU), garbage collection and sorting (48% CPU). Fixes included memory-mapping the WAL to reduce syscalls and using a streaming k-way merge to eliminate large allocations and sorting.
A bug in the bloom filter sizing caused reads to slow down; fixing it by using a correct entry count restored read performance.
Further optimizations included memory-mapping SSTable files for zero-copy reads, incremental CRC computation, buffered writes, and pre-allocating MemTable capacity.
The final version achieved 1.98 million sequential writes per second, 5x faster than BoltDB on random writes, but still slower than BadgerDB and BoltDB on reads due to LSM trade-offs.
Correctness was ensured through tests including key overwrites, crash recovery, and tombstone propagation, all passing with race detection.
The author emphasized that profiling, not guessing, drove all optimizations, and the bottleneck shifted with each fix, highlighting the importance of measurement in performance tuning.

Hasty Briefsbeta