70% Size, 100% Accuracy: Lossless LLM Compression via Dynamic-Length Float
- #Model Compression
- #Machine Learning
- #GPU Inference
- Introduces Dynamic-Length Float (DFloat11), a lossless compression framework for LLMs.
- Reduces LLM size by 30% while maintaining bit-for-bit identical outputs.
- Leverages entropy coding to assign dynamic-length encodings to weights, with more frequent values getting shorter codes.
- Includes a custom GPU kernel for efficient online decompression.
- Achieves 1.9-38.8x higher token-generation throughput than the alternative of offloading parts of an uncompressed model to the CPU.
- Enables 5.3-13.17x longer context lengths within a fixed GPU memory budget.
- Supports lossless inference of large models like Llama-3.1-405B on a single node with 8x80GB GPUs.
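The core idea behind the compression ratio can be illustrated in miniature. BFloat16 stores 1 sign bit, 8 exponent bits, and 7 mantissa bits; in real LLM weights the exponent values are heavily skewed, so an entropy code (Huffman coding here) needs far fewer than 8 bits per exponent on average. The sketch below is illustrative only, not the paper's implementation: `huffman_code_lengths` and `bf16_exponent` are hypothetical helpers, and the Gaussian weights stand in for a real weight tensor.

```python
import heapq
import random
import struct
from collections import Counter

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    # Heap items: (total weight, tiebreak id, {symbol: current depth}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol one level deeper.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def bf16_exponent(x):
    """8-bit exponent field of the BFloat16 representation of a float."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return (bits >> 23) & 0xFF

# Stand-in for a weight tensor: small Gaussian values, as in typical LLM layers.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]
freqs = Counter(bf16_exponent(w) for w in weights)
lengths = huffman_code_lengths(freqs)

# Average coded bits per exponent vs. the fixed 8 bits in BFloat16.
avg_exp_bits = sum(freqs[s] * lengths[s] for s in freqs) / len(weights)
# Effective bits per weight: 1 sign + coded exponent + 7 mantissa bits.
bits_per_weight = 1 + avg_exp_bits + 7
print(f"avg exponent bits: {avg_exp_bits:.2f}")
print(f"effective bits/weight: {bits_per_weight:.2f} (vs. 16 for BFloat16)")
```

Because the coding is lossless, decoding recovers every exponent exactly, which is why outputs stay bit-for-bit identical; the paper's contribution is doing this decode efficiently on-GPU during inference.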