70% Size, 100% Accuracy: Lossless LLM Compression via Dynamic-Length Float
- #Model Compression
- #Machine Learning
- #GPU Inference
- Introduces Dynamic-Length Float (DFloat11), a lossless compression framework for LLMs.
- Reduces LLM size by 30% while maintaining bit-for-bit identical outputs.
- Leverages entropy coding to assign dynamic-length encodings to weights, with more frequent values getting shorter codes.
- Includes a custom GPU kernel for efficient online decompression.
- Achieves 1.9-38.8x higher token-generation throughput than the alternative of offloading parts of an uncompressed model to the CPU.
- Enables 5.3-13.17x longer context lengths within a fixed GPU memory budget.
- Supports lossless inference of large models like Llama-3.1-405B on a single node with 8x80GB GPUs.
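The core idea behind the compression ratio can be illustrated in miniature. BFloat16 stores 1 sign bit, 8 exponent bits, and 7 mantissa bits; in real LLM weights the exponent values are heavily skewed, so an entropy code (Huffman coding here) needs far fewer than 8 bits per exponent on average. The sketch below is illustrative only, not the paper's implementation: `huffman_code_lengths` and `bf16_exponent` are hypothetical helpers, and the Gaussian weights stand in for a real weight tensor.

```python
import heapq
import random
import struct
from collections import Counter

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    # Heap items: (total weight, tiebreak id, {symbol: current depth}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol one level deeper.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def bf16_exponent(x):
    """8-bit exponent field of the BFloat16 representation of a float."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return (bits >> 23) & 0xFF

# Stand-in for a weight tensor: small Gaussian values, as in typical LLM layers.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]
freqs = Counter(bf16_exponent(w) for w in weights)
lengths = huffman_code_lengths(freqs)

# Average coded bits per exponent vs. the fixed 8 bits in BFloat16.
avg_exp_bits = sum(freqs[s] * lengths[s] for s in freqs) / len(weights)
# Effective bits per weight: 1 sign + coded exponent + 7 mantissa bits.
bits_per_weight = 1 + avg_exp_bits + 7
print(f"avg exponent bits: {avg_exp_bits:.2f}")
print(f"effective bits/weight: {bits_per_weight:.2f} (vs. 16 for BFloat16)")
```

Because the coding is lossless, decoding recovers every exponent exactly, which is why outputs stay bit-for-bit identical; the paper's contribution is doing this decode efficiently on-GPU during inference.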