Taking a Look at Compression Algorithms – Moncef Abboud
4 days ago
- #Compression Algorithms
- #Data Optimization
- #Lossless Encoding
- The author explores various compression algorithms (GZIP, Snappy, LZ4, ZSTD) after working on a Kafka broker project, diving into how they work rather than just using APIs.
- Compression reduces data size for storage and transmission savings, with two main types: lossless (exact reconstruction) and lossy (approximation, e.g., JPEG).
- Key compression techniques include Run-Length Encoding (RLE), Lempel-Ziv (LZ) with back-references, and Huffman coding for variable-length encoding based on symbol frequency.
- GZIP uses the DEFLATE algorithm, combining LZ77 (a sliding window with back-references) and Huffman encoding; DEFLATE streams are split into blocks of three types: stored (uncompressed), fixed Huffman, and dynamic Huffman.
- DEFLATE encodes literals and back-references with Huffman codes, using extra bits for less common values, and employs a chained hash table for efficient match searching.
- Snappy is an LZ-based algorithm focused on speed, compressing at roughly 250 MB/s and decompressing at roughly 500 MB/s, with lower compression ratios than DEFLATE and a simple format built from literal and copy chunks.
- LZ4 is similar to Snappy but faster, with compression ratios around 2.1x, using a token-based block format with literal and match lengths encoded in 4-bit fields and extra bytes for longer lengths.
- ZSTD (Zstandard) offers high compression ratios (e.g., ~2.9x) with speeds comparable to LZ4, combining LZ, Huffman, and Finite State Entropy (FSE), plus features like trainable dictionaries for specific data types.
- Arithmetic coding improves on Huffman by representing symbols proportionally to their probability, but it's computationally intensive; FSE builds on these ideas for better efficiency in ZSTD.
- Compression optimization is a trade-off between compression ratio, compression speed, and decompression speed; implementations tune hash-table sizes and match-search effort to balance CPU cost against how much the data shrinks.
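The run-length encoding and Huffman coding techniques summarized above can be sketched in a few lines of Python (an illustrative sketch, not any library's actual implementation):

```python
import heapq

def rle_encode(data: bytes) -> list[tuple[int, int]]:
    """Run-length encoding: collapse each run into a (byte, count) pair."""
    runs: list[tuple[int, int]] = []
    for b in data:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)
        else:
            runs.append((b, 1))
    return runs

def rle_decode(runs: list[tuple[int, int]]) -> bytes:
    return bytes(b for b, n in runs for _ in range(n))

def huffman_codes(freqs: dict[str, int]) -> dict[str, str]:
    """Build prefix-free variable-length codes: rarer symbols get longer codes.
    Heap entries carry a tiebreak int so dicts are never compared."""
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]
```

Running `huffman_codes({"a": 5, "b": 2, "c": 1})` assigns the frequent symbol `a` a shorter code than the rare symbol `c`, which is the whole point of frequency-based variable-length encoding.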
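DEFLATE itself is available from Python's standard library; a quick roundtrip shows GZIP and raw zlib producing the same underlying DEFLATE stream with different wrappers:

```python
import gzip
import zlib

data = b"to be or not to be, that is the question " * 40

gz = gzip.compress(data)    # DEFLATE plus a gzip header and CRC32 trailer
raw = zlib.compress(data)   # the same DEFLATE stream with a smaller zlib wrapper

assert gzip.decompress(gz) == zlib.decompress(raw) == data
print(f"{len(data)} bytes -> gzip {len(gz)}, zlib {len(raw)}")
```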
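The chained-hash-table match search mentioned for DEFLATE can be sketched as a greedy LZ77-style parser: a table maps each 3-byte prefix to the positions where it occurred, and the encoder probes a few recent chain entries for the longest match (a toy sketch; real DEFLATE bounds the window and tunes chain depth):

```python
def find_matches(data: bytes, min_len: int = 3):
    """Greedy LZ77-style parse: emit literals (ints) and (distance, length)
    back-references, finding matches via a hash table of 3-byte prefixes."""
    table: dict[bytes, list[int]] = {}
    out: list = []
    i = 0
    while i < len(data):
        key = data[i:i + min_len]
        best_len, best_dist = 0, 0
        for j in reversed(table.get(key, [])[-8:]):  # probe recent chain entries
            l = 0
            while i + l < len(data) and data[j + l] == data[i + l]:
                l += 1
            if l > best_len:
                best_len, best_dist = l, i - j
        if best_len >= min_len:
            out.append((best_dist, best_len))
            step = best_len
        else:
            out.append(data[i])
            step = 1
        for k in range(i, i + step):          # index every position we consumed
            if k + min_len <= len(data):
                table.setdefault(data[k:k + min_len], []).append(k)
        i += step
    return out

def lz_decode(tokens) -> bytes:
    buf = bytearray()
    for t in tokens:
        if isinstance(t, tuple):
            dist, length = t
            for _ in range(length):           # byte-wise copy handles overlap
                buf.append(buf[-dist])
        else:
            buf.append(t)
    return bytes(buf)
```

On `b"abcabcabcabc"` the parser emits three literals followed by a single overlapping back-reference `(3, 9)`, the classic LZ trick for repeating patterns.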
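Snappy's literal-and-copy format can be illustrated with a simplified decoder. The sketch below handles only short literals and 2-byte-offset copies, and skips the varint length preamble a real Snappy stream starts with, so it is an approximation of the format rather than a compliant decoder:

```python
def snappy_like_decode(block: bytes) -> bytes:
    """Decode Snappy-style elements: each starts with a tag byte whose low
    2 bits pick the type (00 = literal, 10 = copy with 2-byte offset)."""
    out = bytearray()
    i = 0
    while i < len(block):
        tag = block[i]; i += 1
        kind = tag & 0b11
        if kind == 0b00:                       # literal: length-1 in upper 6 bits
            length = (tag >> 2) + 1
            out += block[i:i + length]
            i += length
        elif kind == 0b10:                     # copy: length-1 in upper 6 bits,
            length = (tag >> 2) + 1            # offset in next two LE bytes
            offset = block[i] | (block[i + 1] << 8)
            i += 2
            for _ in range(length):            # byte-wise copy allows overlap
                out.append(out[-offset])
        else:
            raise ValueError("element type not handled in this sketch")
    return bytes(out)
```

Feeding it a hand-built block (a 2-byte literal `"ab"` followed by a 4-byte copy at offset 2) reconstructs `"ababab"`, showing how copies can overlap their own output.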
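LZ4's token-based block format can likewise be sketched as a small decoder: each sequence is a token byte with 4-bit literal and match lengths (the value 15 signals extension bytes), the literals, a 2-byte little-endian offset, and a match whose real length is the 4-bit field plus the 4-byte minimum match. A simplified sketch, ignoring framing and end-of-block rules:

```python
def lz4_block_decode(block: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(block):
        token = block[i]; i += 1
        lit_len = token >> 4
        if lit_len == 15:                      # extension bytes: keep adding
            while True:                        # until one is not 255
                b = block[i]; i += 1
                lit_len += b
                if b != 255:
                    break
        out += block[i:i + lit_len]
        i += lit_len
        if i >= len(block):                    # last sequence: literals only
            break
        offset = block[i] | (block[i + 1] << 8)
        i += 2
        match_len = (token & 0x0F) + 4         # minimum match is 4 bytes
        if (token & 0x0F) == 15:
            while True:
                b = block[i]; i += 1
                match_len += b
                if b != 255:
                    break
        for _ in range(match_len):             # byte-wise copy handles overlap
            out.append(out[-offset])
    return bytes(out)
```

A hand-built sequence with token `0x20` (2 literals, match field 0) followed by `"ab"` and offset 2 expands to `"ababab"`: two literal bytes plus a 4-byte overlapping match.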
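ZSTD's trainable dictionaries are not in the standard library, but the underlying idea (priming the compressor with bytes that small payloads are expected to share) can be demonstrated with zlib's preset-dictionary support; the sample "dictionary" here is hand-written rather than trained:

```python
import zlib

# Bytes that typical messages share; ZSTD trains such dictionaries from
# sample payloads, here one is simply written by hand.
shared = b'{"user_id": , "event": "click", "ts": }'
msg = b'{"user_id": 42, "event": "click", "ts": 1700000000}'

comp = zlib.compressobj(zdict=shared)
with_dict = comp.compress(msg) + comp.flush()

plain = zlib.compress(msg)

decomp = zlib.decompressobj(zdict=shared)   # decoder needs the same dictionary
assert decomp.decompress(with_dict) == msg
print(f"plain {len(plain)} bytes, with dictionary {len(with_dict)} bytes")
```

The message compresses better with the dictionary because most of it becomes back-references into the preset bytes, which is exactly why dictionaries help with many small, similar records.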
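The core move of arithmetic coding, narrowing an interval in proportion to each symbol's probability, fits in a few lines; real coders emit the interval's bits incrementally instead of keeping exact floats, so this is only the conceptual half:

```python
def arithmetic_interval(message: str, probs: dict[str, float]) -> tuple[float, float]:
    """Narrow [0, 1) symbol by symbol; any number in the final interval
    identifies the whole message."""
    starts, acc = {}, 0.0            # cumulative start of each symbol's slot
    for s, p in probs.items():
        starts[s] = acc
        acc += p
    low, high = 0.0, 1.0
    for s in message:
        width = high - low
        high = low + width * (starts[s] + probs[s])
        low = low + width * starts[s]
    return low, high

lo, hi = arithmetic_interval("aab", {"a": 0.7, "b": 0.3})
```

The final interval's width equals the product of the symbol probabilities (0.7 × 0.7 × 0.3 here), so probable messages get wide intervals that need few bits to pin down, which is how arithmetic coding beats Huffman's whole-bit-per-symbol granularity.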
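The ratio-versus-speed trade-off is directly visible in zlib's compression levels, which mostly control how hard the match search works:

```python
import zlib

data = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n" * 200

fast = zlib.compress(data, level=1)    # shallow match search: favors speed
small = zlib.compress(data, level=9)   # deep match search: favors ratio

assert zlib.decompress(fast) == zlib.decompress(small) == data
print(f"{len(data)} bytes -> level 1: {len(fast)}, level 9: {len(small)}")
```

Both outputs decompress to identical bytes; only the encoder's effort differs, which is why decompression speed is usually much less sensitive to the chosen level than compression speed is.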