Lessons from BF-Tree: Building a Concurrent Larger-Than-Memory Index in Rust
2 days ago
- #database
- #concurrency
- #performance
- BF-Tree introduces variable-size mini-pages (64-4096 bytes) to reduce write amplification and memory waste compared to traditional 4KB page caching.
- The Rust implementation features a custom ring buffer allocator with a 6-state machine, RAII guards, optimistic locks, and deterministic concurrency testing.
- Mini-pages serve three functions: buffering writes, caching hot records, and growing for scan workloads, optimizing memory usage based on access patterns.
- The design decouples in-memory representation from on-disk format, allowing BF-Tree to optimize for the working set rather than physical page layout.
- BF-Tree's benchmarks show it is 2x faster than B-trees and LSM-trees for point lookups, 6x faster than B-trees for writes, and 2.5x faster than RocksDB for scans.
- The implementation uses RAII guards (CircularBufferPtr and TombstoneHandle) to manage lifecycle states, ensuring safe transitions and memory reclamation.
- Concurrency is handled with optimistic versioned locks for inner nodes and custom RwLocks with try_upgrade for page table entries, minimizing contention.
- Testing leverages Shuttle for systematic concurrency testing, exploring thread interleavings to uncover hard-to-find bugs.
- Key patterns include RAII guards for state machines, typed errors for retry control flow, and compile-time test infrastructure swaps.