Hasty Briefs (beta)

70% Size, 100% Accuracy: Lossless LLM Compression via Dynamic-Length Float

a year ago
  • #Model Compression
  • #Machine Learning
  • #GPU Inference
  • Introduces Dynamic-Length Float (DFloat11), a lossless compression framework for LLMs.
  • Reduces LLM size by 30% while maintaining bit-for-bit identical outputs.
  • Leverages entropy coding for dynamic-length encodings based on weight frequency.
  • Includes a custom GPU kernel for efficient online decompression.
  • Achieves 1.9–38.8x higher token-generation throughput than CPU-offloading alternatives.
  • Enables 5.3–13.17x longer context lengths within a fixed GPU memory budget.
  • Supports lossless inference of large models like Llama-3.1-405B on a single node with 8x80GB GPUs.
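The entropy-coding idea in the bullets can be sketched in a few lines: BF16 exponent bits in LLM weights are highly concentrated, so Huffman-coding just the exponents (while keeping sign and mantissa bits as-is) yields lossless savings. The snippet below is an illustrative sketch, not the DFloat11 implementation; the exponent distribution is made up, and `huffman_code_lengths` is a hypothetical helper name.

```python
import heapq
import itertools

def huffman_code_lengths(freqs):
    """Return Huffman code length (in bits) for each symbol, given its frequency."""
    counter = itertools.count()  # tiebreaker so tuples never compare
    heap = [(f, next(counter), (sym,)) for sym, f in freqs.items()]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in freqs}
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:  # every symbol under the merged node gains one bit
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, next(counter), syms1 + syms2))
    return lengths

# Hypothetical peaked distribution over BF16 exponent values, mimicking the
# low-entropy exponents observed in real LLM weights (numbers are illustrative).
exp_freqs = {e: max(1, int(1e6 * 0.5 ** abs(e - 120))) for e in range(100, 140)}
lengths = huffman_code_lengths(exp_freqs)
total = sum(exp_freqs.values())
avg_exp_bits = sum(exp_freqs[e] * lengths[e] for e in exp_freqs) / total

# DFloat11 keeps the sign (1 bit) and mantissa (7 bits) verbatim and
# entropy-codes only the exponent, so bits per weight ~= 1 + 7 + avg_exp_bits,
# versus 16 bits in plain BF16 -- roughly the 70% figure in the title.
print(f"avg exponent bits: {avg_exp_bits:.2f} (vs 8 in BF16)")
print(f"~{(1 + 7 + avg_exp_bits) / 16:.0%} of original size")
```

Because the code lengths vary per weight, the frame format is "dynamic-length"; the paper's custom GPU kernel exists precisely to decode these variable-length codes fast enough for online inference.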