Efficient String Compression for Modern Database Systems

6 days ago


Strings are the most common data type in practice, accounting for roughly 50% of stored data because they are flexible and convenient.
Compressing strings is crucial for reducing storage costs and improving query performance by minimizing data size and memory footprint.
CedarDB initially supported three string compression schemes: Uncompressed, Single Value, and Dictionary compression.
Dictionary compression replaces strings with fixed-size integer keys, enabling efficient random access and query evaluation on compressed data.
FSST (Fast Static Symbol Table) compresses strings by replacing frequent substrings with 1-byte tokens, offering better compression for low-entropy strings.
Integrating FSST with dictionary compression combines the benefits of both methods, improving compression ratios while maintaining efficient query processing.
Benchmarks show FSST reduces storage size by 20-60% and improves cold query runtimes by up to 40%, though hot runs may slow down due to decompression overhead.
Overall, activating FSST in CedarDB is a net win: the storage savings and faster cold queries outweigh the decompression overhead on hot runs.
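
To make the combination of dictionary and symbol-table compression concrete, here is a minimal Python sketch. It is not CedarDB's implementation and not the real FSST training algorithm: the function names, the gain heuristic for choosing symbols, and the sample URLs are all assumptions made for illustration. It borrows only the general shape of FSST (up to 255 symbols of at most 8 bytes, each replaced by a 1-byte code, with code 255 reserved as an escape for literal bytes) and applies it to the distinct strings stored in a dictionary.

```python
from collections import Counter

ESCAPE = 255  # code reserved to mark "next byte is a literal"


def dictionary_encode(values):
    """Replace each string with an integer key into a sorted list of distinct values."""
    dictionary = sorted(set(values))
    key_of = {s: i for i, s in enumerate(dictionary)}
    return dictionary, [key_of[v] for v in values]


def build_symbol_table(strings, max_symbols=255, max_len=8):
    """Toy heuristic (an assumption, not FSST's training): keep the substrings
    with the largest count * (length - 1)."""
    counts = Counter()
    for s in strings:
        data = s.encode("utf-8")
        for n in range(2, max_len + 1):
            for i in range(len(data) - n + 1):
                counts[data[i:i + n]] += 1
    ranked = sorted(counts.items(),
                    key=lambda kv: (-(kv[1] * (len(kv[0]) - 1)), kv[0]))
    return [sym for sym, _ in ranked[:max_symbols]]


def compress(data, symbols):
    """Greedy longest match: one code byte per symbol, escape + literal otherwise."""
    by_length = sorted(range(len(symbols)), key=lambda c: -len(symbols[c]))
    out = bytearray()
    i = 0
    while i < len(data):
        for code in by_length:
            if data.startswith(symbols[code], i):
                out.append(code)
                i += len(symbols[code])
                break
        else:  # no symbol matched at position i
            out += bytes([ESCAPE, data[i]])
            i += 1
    return bytes(out)


def decompress(blob, symbols):
    """Expand each code byte back into its symbol; copy escaped literals."""
    out = bytearray()
    i = 0
    while i < len(blob):
        if blob[i] == ESCAPE:
            out.append(blob[i + 1])
            i += 2
        else:
            out += symbols[blob[i]]
            i += 1
    return bytes(out)


# Hypothetical column of URLs with repeated values and shared prefixes.
urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",
    "https://example.org/c",
]

dictionary, keys = dictionary_encode(urls)      # fixed-size keys per row
symbols = build_symbol_table(dictionary)        # trained on distinct values only
packed = [compress(s.encode("utf-8"), symbols) for s in dictionary]


def lookup(row):
    """Random access: follow the row's key, then decompress one dictionary entry."""
    return decompress(packed[keys[row]], symbols).decode("utf-8")
```

The rows keep their fixed-size integer keys, so comparisons on keys still work without touching the strings, and only the dictionary entries pay the decompression cost when a value is materialized. A real system would also have to account for the size of the symbol table itself, which this sketch ignores.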
