Hasty Briefsbeta

Bilingual

Taking a Look at Compression Algorithms – Moncef Abboud

4 days ago
  • #Compression Algorithms
  • #Data Optimization
  • #Lossless Encoding
  • The author explores various compression algorithms (GZIP, Snappy, LZ4, ZSTD) after working on a Kafka broker project, diving into how they work rather than just using APIs.
  • Compression reduces data size for storage and transmission savings, with two main types: lossless (exact reconstruction) and lossy (approximation, e.g., JPEG).
  • Key compression techniques include Run-Length Encoding (RLE), Lempel-Ziv (LZ) with back-references, and Huffman coding for variable-length encoding based on symbol frequency.
  • GZIP uses the DEFLATE algorithm, combining LZ77 (sliding window with back-references) and Huffman encoding, with block types like uncompressed, fixed, and dynamic Huffman codes.
  • DEFLATE encodes literals and back-references with Huffman codes, using extra bits for less common values, and employs a chained hash table for efficient match searching.
  • Snappy is an LZ-based algorithm focused on speed, compressing/decompressing at high rates (e.g., 250/500 MB/s), with lower compression ratios than DEFLATE and a simple format using literal and copy chunks.
  • LZ4 is similar to Snappy but faster, with compression ratios around 2.1x, using a token-based block format with literal and match lengths encoded in 4-bit fields and extra bytes for longer lengths.
  • ZSTD (Zstandard) offers high compression ratios (e.g., ~2.9x) with speeds comparable to LZ4, combining LZ, Huffman, and Finite State Entropy (FSE), plus features like trainable dictionaries for specific data types.
  • Arithmetic coding improves on Huffman by representing symbols proportionally to their probability, but it's computationally intensive; FSE builds on these ideas for better efficiency in ZSTD.
  • Compression optimization involves trade-offs between compression ratio, speed, and decompression speed, often using hash tables and search strategies to balance CPU use and efficiency.