NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models

  • #Model Compression
  • #Machine Learning
  • #Quantization
  • NanoQuant is a novel post-training quantization (PTQ) method for compressing large language models (LLMs) to binary and sub-1-bit levels.
  • It formulates quantization as a low-rank binary factorization problem: each weight matrix is compressed into low-rank binary factors and floating-point scales (see the first sketch after this list).
  • An efficient ADMM (alternating direction method of multipliers) routine provides a precise initialization of the binary matrices and scales, which are then tuned via block-wise and whole-model reconstruction (see the ADMM template below).
  • Achieves state-of-the-art accuracy at sub-1-bit compression rates, enabling deployment of very large models on consumer hardware.
  • Compresses Llama2-70B by 25.8× in 13 hours on a single H100, allowing the 70B model to run on an 8 GB GPU (the memory arithmetic is worked out below).
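
To make the factorization concrete, here is a minimal sketch of how low-rank binary factors plus scales reconstruct a weight matrix, and why the format lands below one bit per weight. The shapes, rank, and per-row scale layout are illustrative assumptions, not NanoQuant's actual configuration.

```python
import numpy as np

def reconstruct(U, V, scales):
    """Approximate a weight matrix from binary low-rank factors.

    U: (m, r) with entries in {-1, +1}
    V: (r, n) with entries in {-1, +1}
    scales: (m, 1) per-row float scales (layout is an assumption)
    """
    return scales * (U @ V)

m, n, r = 4096, 4096, 128            # hypothetical layer shape and rank
U = np.sign(np.random.randn(m, r))   # stand-in binary factors
V = np.sign(np.random.randn(r, n))
scales = np.random.rand(m, 1)

W_hat = reconstruct(U, V, scales)

# Each binary factor entry costs 1 bit, so the effective rate per
# original weight (ignoring the small scale overhead) is:
bits_per_weight = (m * r + r * n) / (m * n)
print(f"{bits_per_weight:.4f} bits/weight")  # 0.0625 here -> well under 1 bit
```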
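
The summary does not spell out NanoQuant's exact ADMM splitting, so the following is a generic ADMM template for the simplest sub-problem: fitting a single binary matrix and scale to a weight matrix. For this single-matrix case a closed form exists (B = sign(W), alpha = mean|W|); ADMM earns its keep on the full low-rank factorization, where the sub-problems have no such shortcut. The rho value and iteration count are assumptions.

```python
import numpy as np

def admm_binary_init(W, rho=1.0, iters=50):
    """Fit W ≈ alpha * B with B in {-1, +1} via a generic ADMM splitting.

    A simplified template, not NanoQuant's exact update rules.
    """
    B = np.sign(W)              # continuous surrogate, warm-started
    Z = B.copy()                # binary-constrained copy
    U = np.zeros_like(W)        # scaled dual variable
    alpha = np.abs(W).mean()    # initial scale

    for _ in range(iters):
        # B-step: unconstrained least squares with proximal term (closed form)
        B = (2 * alpha * W + rho * (Z - U)) / (2 * alpha ** 2 + rho)
        # Z-step: projection onto the binary set {-1, +1}
        Z = np.sign(B + U)
        Z[Z == 0] = 1.0
        # Dual ascent on the consensus constraint B = Z
        U += B - Z
        # Refit the scale to the current binary matrix (least squares)
        alpha = (W * Z).sum() / Z.size

    return alpha, Z

W = np.random.randn(256, 256)
alpha, B = admm_binary_init(W)
err = np.linalg.norm(W - alpha * B) / np.linalg.norm(W)
print(f"relative reconstruction error: {err:.3f}")
```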
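
Finally, a back-of-the-envelope check on the headline numbers. Assuming an FP16 baseline at 2 bytes per parameter (the summary does not state the reference precision), a 25.8× reduction brings the weights comfortably under 8 GB:

```python
params = 70e9                    # Llama2-70B parameter count
fp16_gb = params * 2 / 1e9       # 140 GB at 16 bits per weight
compressed_gb = fp16_gb / 25.8   # ~5.4 GB after 25.8x compression
print(f"{fp16_gb:.0f} GB -> {compressed_gb:.1f} GB")  # fits on an 8 GB GPU
```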