- NanoQuant is a novel post-training quantization (PTQ) method for compressing large language models (LLMs) to binary and sub-1-bit levels.
- It formulates quantization as a low-rank binary factorization problem, decomposing each weight matrix into low-rank binary matrices and accompanying scales (see the storage sketch after this list).
- Uses an efficient ADMM-based solver to precisely initialize the binary matrices and scales, followed by fine-tuning via block-wise and whole-model reconstruction (see the initialization sketch below).
- Achieves state-of-the-art accuracy at sub-1-bit compression rates, enabling large-scale deployment on consumer hardware.
- Compresses Llama2-70B by 25.8× in 13 hours on a single H100 GPU, allowing the 70B model to fit on an 8 GB GPU.
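
To see where sub-1-bit rates come from, here is a minimal storage calculation for the factorization referenced in the second bullet, assuming an m×n weight matrix is replaced by rank-r binary factors with one fp16 scale per output row; the rank and scale layout are illustrative assumptions, not values from the paper:

```python
def bits_per_weight(m: int, n: int, rank: int, scale_bits: int = 16) -> float:
    """Storage cost of W ~ diag(s) @ U @ V with U in {-1,+1}^(m x r),
    V in {-1,+1}^(r x n), and one fp16 scale per output row.
    The scale layout is an illustrative assumption."""
    factor_bits = rank * (m + n)   # 1 bit per entry of U and V
    scale_total = m * scale_bits   # per-row fp16 scales
    return (factor_bits + scale_total) / (m * n)

# A hypothetical 4096 x 4096 projection layer at rank 512:
print(bits_per_weight(4096, 4096, 512))  # ~0.254 bits per weight
```

Ignoring the scales, any rank below m·n/(m+n) lands under 1 bit per weight, which is how a low-rank binary factorization reaches sub-1-bit rates. Against an fp16 baseline this toy layer compresses roughly 16/0.254 ≈ 63×; whole-model ratios like the 25.8× above would be lower, presumably because some tensors stay at higher precision.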
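And a toy alternating least-squares loop with sign projection, standing in for the ADMM initialization in the third bullet; the function name, SVD warm start, and update order are assumptions for illustration, not the published algorithm:

```python
import torch

def binary_factor_init(W: torch.Tensor, rank: int, iters: int = 30):
    """Fit W ~ diag(s) @ (U @ V) with U, V constrained to {-1, +1}.
    A simplified stand-in for the ADMM initialization: each step solves
    an unconstrained least-squares update, then projects onto signs."""
    # Warm start from a truncated SVD of W (an assumed heuristic).
    Us, _, Vh = torch.linalg.svd(W, full_matrices=False)
    U = torch.sign(Us[:, :rank]); U[U == 0] = 1.0
    V = torch.sign(Vh[:rank, :]); V[V == 0] = 1.0
    for _ in range(iters):
        P = U @ V
        # Closed-form per-row scale minimizing ||W - diag(s) P||_F^2.
        s = (W * P).sum(1) / (P * P).sum(1).clamp_min(1e-8)
        s_safe = torch.where(s.abs() < 1e-8, torch.ones_like(s), s)
        Ws = W / s_safe.unsqueeze(1)  # fold the scales back into the target
        # Least-squares update of each factor, projected onto {-1, +1}.
        U = torch.sign(Ws @ torch.linalg.pinv(V)); U[U == 0] = 1.0
        V = torch.sign(torch.linalg.pinv(U) @ Ws); V[V == 0] = 1.0
    # Refit the scales to the final binary factors.
    P = U @ V
    s = (W * P).sum(1) / (P * P).sum(1).clamp_min(1e-8)
    return s, U, V

W = torch.randn(512, 512)
s, U, V = binary_factor_init(W, rank=128)
rel = torch.norm(W - s.unsqueeze(1) * (U @ V)) / torch.norm(W)
print(f"relative reconstruction error: {rel:.3f}")
```

This toy loop covers only the initialization step; in the pipeline summarized above, the result would then be refined by the block-wise and whole-model reconstruction tuning.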