Hasty Briefsbeta

Bilingual

Two Bits Are Better Than One: making bloom filters 2x more accurate

3 days ago
  • #Bloom Filter
  • #Database Optimization
  • #Probabilistic Data Structures
  • Bloom filters are probabilistic data structures that speed up SQL queries by quickly determining if an element is not in a set.
  • They can give false positives but never false negatives, making them efficient for database operations.
  • Bloom filters are used in Floe for adaptive storage engine filtering and hash joins to reduce unnecessary data decompression and hash table probes.
  • A fixed 256KB bloom filter per join is optimal for performance, balancing memory usage and efficiency.
  • The false positive rate of a bloom filter increases as more elements are inserted, but optimizations can reduce this rate significantly.
  • Floe's implementation uses two bits per element in a single uint32 to halve the false positive rate with minimal performance cost.
  • Adaptive filtering in the storage engine can skip decompressing rows that won't be used, saving significant I/O operations.
  • Using more than two bits per element offers diminishing returns, making two bits the sweet spot for efficiency and simplicity.
  • Alternative structures like Cuckoo or XOR filters were considered but rejected due to their complexity and dynamic resizing needs.
  • Future optimizations may include SIMD instructions for checking multiple elements simultaneously.