NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models

  • #Model Compression
  • #Machine Learning
  • #Quantization
  • NanoQuant is a novel post-training quantization (PTQ) method for compressing large language models (LLMs) to binary and sub-1-bit levels.
  • It formulates quantization as a low-rank binary factorization problem: each weight matrix is compressed into low-rank binary factors and floating-point scales (see the first sketch after this list).
  • An efficient ADMM (alternating direction method of multipliers) routine provides a precise initialization of the binary matrices and scales, which are then tuned via block-wise and whole-model reconstruction (see the ADMM template below).
  • Achieves state-of-the-art accuracy at sub-1-bit compression rates, enabling deployment of very large models on consumer hardware.
  • Compresses Llama2-70B by 25.8× in 13 hours on a single H100, allowing the 70B model to run on an 8 GB GPU (the memory arithmetic is worked out below).
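
To make the factorization concrete, here is a minimal sketch of how low-rank binary factors plus scales reconstruct a weight matrix, and why the format lands below one bit per weight. The shapes, rank, and per-row scale layout are illustrative assumptions, not NanoQuant's actual configuration.

```python
import numpy as np

def reconstruct(U, V, scales):
    """Approximate a weight matrix from binary low-rank factors.

    U: (m, r) with entries in {-1, +1}
    V: (r, n) with entries in {-1, +1}
    scales: (m, 1) per-row float scales (layout is an assumption)
    """
    return scales * (U @ V)

m, n, r = 4096, 4096, 128            # hypothetical layer shape and rank
U = np.sign(np.random.randn(m, r))   # stand-in binary factors
V = np.sign(np.random.randn(r, n))
scales = np.random.rand(m, 1)

W_hat = reconstruct(U, V, scales)

# Each binary factor entry costs 1 bit, so the effective rate per
# original weight (ignoring the small scale overhead) is:
bits_per_weight = (m * r + r * n) / (m * n)
print(f"{bits_per_weight:.4f} bits/weight")  # 0.0625 here -> well under 1 bit
```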
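
The summary does not spell out NanoQuant's exact ADMM splitting, so the following is a generic ADMM template for the simplest sub-problem: fitting a single binary matrix and scale to a weight matrix. For this single-matrix case a closed form exists (B = sign(W), alpha = mean|W|); ADMM earns its keep on the full low-rank factorization, where the sub-problems have no such shortcut. The rho value and iteration count are assumptions.

```python
import numpy as np

def admm_binary_init(W, rho=1.0, iters=50):
    """Fit W ≈ alpha * B with B in {-1, +1} via a generic ADMM splitting.

    A simplified template, not NanoQuant's exact update rules.
    """
    B = np.sign(W)              # continuous surrogate, warm-started
    Z = B.copy()                # binary-constrained copy
    U = np.zeros_like(W)        # scaled dual variable
    alpha = np.abs(W).mean()    # initial scale

    for _ in range(iters):
        # B-step: unconstrained least squares with proximal term (closed form)
        B = (2 * alpha * W + rho * (Z - U)) / (2 * alpha ** 2 + rho)
        # Z-step: projection onto the binary set {-1, +1}
        Z = np.sign(B + U)
        Z[Z == 0] = 1.0
        # Dual ascent on the consensus constraint B = Z
        U += B - Z
        # Refit the scale to the current binary matrix (least squares)
        alpha = (W * Z).sum() / Z.size

    return alpha, Z

W = np.random.randn(256, 256)
alpha, B = admm_binary_init(W)
err = np.linalg.norm(W - alpha * B) / np.linalg.norm(W)
print(f"relative reconstruction error: {err:.3f}")
```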
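
Finally, a back-of-the-envelope check on the headline numbers. Assuming an FP16 baseline at 2 bytes per parameter (the summary does not state the reference precision), a 25.8× reduction brings the weights comfortably under 8 GB:

```python
params = 70e9                    # Llama2-70B parameter count
fp16_gb = params * 2 / 1e9       # 140 GB at 16 bits per weight
compressed_gb = fp16_gb / 25.8   # ~5.4 GB after 25.8x compression
print(f"{fp16_gb:.0f} GB -> {compressed_gb:.1f} GB")  # fits on an 8 GB GPU
```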