

Pushing the Limits of LLM Quantization via the Linearity Theorem

a year ago
  • #Machine Learning
  • #Quantization
  • #Large Language Models
  • Introduces a 'linearity theorem' linking layer-wise ℓ₂ reconstruction error to the model's perplexity increase under quantization (a schematic form is sketched after this list).
  • Presents HIGGS, a data-free LLM quantization method combining Hadamard rotations with MSE-optimal grids, which outperforms prior data-free approaches such as NF4 (see the rotate-and-quantize sketch below).
  • Derives, via dynamic programming, an optimal assignment of non-uniform per-layer quantization levels under a given compression constraint (a DP sketch follows below).
  • Demonstrates improved accuracy-compression trade-offs on Llama-3.1- and Llama-3.2-family and Qwen-family models.
  • Shows efficient GPU kernel support for various batch sizes, advancing data-free and non-uniform quantization for LLMs.
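
The summary only says the theorem relates layer-wise ℓ₂ reconstruction error linearly to the perplexity increase. One schematic way to write such a relation, with the layer coefficients and the squared-norm error measure assumed here for illustration rather than taken from the paper, is:

```latex
% Schematic linearity relation. The coefficients a_l and the exact error
% normalization are illustrative assumptions, not the paper's statement.
\[
  \Delta\mathrm{PPL} \;\approx\; \sum_{l=1}^{L} a_l \,
  \bigl\lVert W_l - \widehat{W}_l \bigr\rVert_2^{2},
\]
% where W_l is layer l's original weight matrix, \widehat{W}_l its
% quantized counterpart, and a_l >= 0 a layer-dependent constant.
```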
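A minimal, data-free sketch of the rotate-then-quantize idea behind the HIGGS bullet: rotate weight groups with an orthonormal Hadamard matrix so their coordinates become approximately Gaussian, quantize them on a grid tuned for Gaussian data, then rotate back. The group size, the per-group scaling by the standard deviation, and the Lloyd-Max grid construction are assumptions made here for illustration; the paper's actual method and grids may differ.

```python
import numpy as np
from scipy.linalg import hadamard


def mse_optimal_grid(bits: int, n_samples: int = 200_000, iters: int = 50) -> np.ndarray:
    """Approximate an MSE-optimal grid for standard-normal data via Lloyd-Max."""
    rng = np.random.default_rng(0)
    samples = np.sort(rng.standard_normal(n_samples))
    levels = 2 ** bits
    grid = np.quantile(samples, (np.arange(levels) + 0.5) / levels)  # initial guess
    for _ in range(iters):
        idx = np.argmin(np.abs(samples[:, None] - grid[None, :]), axis=1)
        for k in range(levels):
            if np.any(idx == k):
                grid[k] = samples[idx == k].mean()  # centroid update
    return grid


def quantize_to_grid(x: np.ndarray, grid: np.ndarray) -> np.ndarray:
    """Snap each value to its nearest grid point (illustrative, not memory-tuned)."""
    return grid[np.argmin(np.abs(x[..., None] - grid), axis=-1)]


def hadamard_quantize(weight: np.ndarray, bits: int = 4, group: int = 64) -> np.ndarray:
    """Data-free sketch: rotate weight groups with an orthonormal Hadamard matrix,
    quantize the rotated coordinates on an MSE-optimal grid, then rotate back.
    Assumes weight.size is divisible by `group` (a power of two)."""
    H = hadamard(group) / np.sqrt(group)            # orthonormal rotation
    grid = mse_optimal_grid(bits)
    groups = weight.reshape(-1, group) @ H.T        # rotated groups, roughly Gaussian
    scale = groups.std(axis=1, keepdims=True) + 1e-12
    quantized = quantize_to_grid(groups / scale, grid) * scale
    return (quantized @ H).reshape(weight.shape)    # undo the rotation


if __name__ == "__main__":
    w = np.random.default_rng(1).standard_normal((256, 256))
    w_hat = hadamard_quantize(w, bits=4, group=64)
    print("relative L2 error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```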
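A small dynamic-programming sketch of the per-layer allocation idea from the third bullet: given a table of per-layer error proxies and storage costs for candidate configurations, pick one configuration per layer that minimizes total error under a storage budget. The inputs and the integer cost model are hypothetical; the paper defines its own error proxy and constraint.

```python
def allocate_bits(errors, costs, budget):
    """DP over layers: choose one quantization config per layer so the summed
    error proxy is minimal while the total integer storage cost stays <= budget.
    errors[l][k] / costs[l][k] describe candidate config k of layer l."""
    INF = float("inf")
    n_layers = len(errors)
    best = [INF] * (budget + 1)       # best[c]: min error with total cost exactly c
    best[0] = 0.0
    choice = [[None] * (budget + 1) for _ in range(n_layers)]
    for l in range(n_layers):
        new = [INF] * (budget + 1)
        for c in range(budget + 1):
            if best[c] == INF:
                continue
            for k, (err, cost) in enumerate(zip(errors[l], costs[l])):
                nc = c + cost
                if nc <= budget and best[c] + err < new[nc]:
                    new[nc] = best[c] + err
                    choice[l][nc] = k
        best = new
    # Pick the feasible end state with the smallest total error, then backtrack.
    c = min(range(budget + 1), key=lambda i: best[i])
    if best[c] == INF:
        raise ValueError("budget too small for any configuration")
    total, plan = best[c], []
    for l in reversed(range(n_layers)):
        k = choice[l][c]
        plan.append(k)
        c -= costs[l][k]
    return plan[::-1], total


if __name__ == "__main__":
    # Hypothetical per-layer error proxies at 2/3/4 bits and matching bit costs.
    errs = [[0.9, 0.3, 0.1], [0.8, 0.4, 0.2]]
    bits = [[2, 3, 4], [2, 3, 4]]
    print(allocate_bits(errs, bits, budget=6))  # -> ([1, 1], 0.7): 3 bits per layer
```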