Pushing the Limits of LLM Quantization via the Linearity Theorem
- #Machine Learning
- #Quantization
- #Large Language Models
- Introduces a 'linearity theorem' linking the layer-wise ℓ₂ reconstruction error to the model's perplexity increase under quantization (a schematic form is given after this list).
- Presents HIGGS, a data-free LLM quantization method that combines Hadamard rotations with MSE-optimal grids and outperforms prior data-free approaches such as NF4 (see the rotate-then-quantize sketch below).
- Derives an optimal assignment of non-uniform per-layer quantization levels under a global compression constraint, solved via dynamic programming (toy allocation example below).
- Demonstrates improved accuracy-compression trade-offs on Llama-3.1/3.2-family and Qwen-family models.
- Provides efficient GPU kernel support across batch sizes, advancing data-free and non-uniform quantization for LLMs.
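To make the first bullet concrete, here is a schematic rendering of the linearity statement. The per-layer coefficients a_ℓ and the exact error measure are simplifications assumed for illustration, not the paper's precise theorem:

```latex
% Schematic: the perplexity increase decomposes linearly over layers,
% weighted by per-layer sensitivities a_\ell (treated as given here).
\Delta \mathrm{PPL} \;\approx\; \sum_{\ell=1}^{L} a_\ell \,
    \bigl\lVert W_\ell - \widehat{W}_\ell \bigr\rVert_2^2
```

Because the total error is additive over layers, each layer's ℓ₂ reconstruction error can be minimized independently, which is what makes a data-free, per-layer method viable in the first place.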
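A minimal NumPy sketch of the rotate-then-quantize idea from the second bullet. The grid values are the classical Lloyd-Max 4-level quantizer for a unit Gaussian; the group size, per-group scaling, and function name are assumptions for illustration, not HIGGS's exact implementation:

```python
import numpy as np
from scipy.linalg import hadamard

# Illustrative 2-bit (4-level) Lloyd-Max grid for N(0, 1); HIGGS uses
# MSE-optimal grids per bit-width, simplified here to the 2-bit case.
GRID = np.array([-1.510, -0.4528, 0.4528, 1.510])

def higgs_like_quantize(w: np.ndarray, group: int = 64) -> np.ndarray:
    """Rotate-then-quantize sketch: an orthonormal Hadamard rotation makes
    each weight group look approximately i.i.d. Gaussian, so a fixed
    Gaussian-MSE-optimal grid applies without any calibration data."""
    assert w.size % group == 0, "pad the weight tensor to a group multiple"
    H = hadamard(group) / np.sqrt(group)      # orthonormal: H @ H.T = I
    g = w.reshape(-1, group) @ H              # rotate each group
    s = g.std(axis=1, keepdims=True) + 1e-12  # per-group scale, kept in fp
    idx = np.abs((g / s)[..., None] - GRID).argmin(axis=-1)
    q = GRID[idx] * s                         # snap to nearest grid point
    return (q @ H.T).reshape(w.shape)         # undo the rotation

# Quick check on synthetic Gaussian weights:
w = np.random.randn(4096)
print(np.mean((w - higgs_like_quantize(w)) ** 2))  # ~0.12, the 2-bit Lloyd-Max distortion
```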
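And a toy dynamic program in the spirit of the third bullet: since the theorem makes total error additive over layers, choosing per-layer bit-widths under a global bit budget is a knapsack-style DP. The error numbers, bit choices, and uniform per-layer cost are placeholder assumptions:

```python
def allocate_bits(errors, budget):
    """errors[l][b]: proxy error of layer l quantized to b bits (toy data).
    Returns (min total error, bits per layer) with sum(bits) <= budget.
    A real allocation would weight each layer by its parameter count."""
    dp = {0: (0.0, [])}                       # bits used -> (error, choices)
    for layer_err in errors:
        nxt = {}
        for used, (err, picks) in dp.items():
            for b, e in layer_err.items():
                u = used + b
                if u > budget:
                    continue                  # over the compression budget
                cand = (err + e, picks + [b])
                if u not in nxt or cand[0] < nxt[u][0]:
                    nxt[u] = cand             # keep the cheaper assignment
        dp = nxt
    return min(dp.values(), key=lambda t: t[0])

# Toy usage: four layers, 2/3/4-bit options, average budget of 3 bits/layer.
errors = [{2: 0.30, 3: 0.08, 4: 0.02},
          {2: 0.50, 3: 0.12, 4: 0.03},
          {2: 0.20, 3: 0.06, 4: 0.02},
          {2: 0.40, 3: 0.10, 4: 0.03}]
best_err, bits = allocate_bits(errors, budget=12)
print(best_err, bits)  # more sensitive layers get more bits
```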