Two Bits Are Better Than One: making bloom filters 2x more accurate
3 days ago
- #Bloom Filter
- #Database Optimization
- #Probabilistic Data Structures
- Bloom filters are probabilistic data structures that speed up SQL queries by quickly determining if an element is not in a set.
- They can give false positives but never false negatives, making them efficient for database operations.
- Bloom filters are used in Floe for adaptive storage engine filtering and hash joins to reduce unnecessary data decompression and hash table probes.
- A fixed 256KB bloom filter per join is optimal for performance, balancing memory usage and efficiency.
- The false positive rate of a bloom filter increases as more elements are inserted, but optimizations can reduce this rate significantly.
- Floe's implementation uses two bits per element in a single uint32 to halve the false positive rate with minimal performance cost.
- Adaptive filtering in the storage engine can skip decompressing rows that won't be used, saving significant I/O operations.
- Using more than two bits per element offers diminishing returns, making two bits the sweet spot for efficiency and simplicity.
- Alternative structures like Cuckoo or XOR filters were considered but rejected due to their complexity and dynamic resizing needs.
- Future optimizations may include SIMD instructions for checking multiple elements simultaneously.