Pushing the Limits of LLM Quantization via the Linearity Theorem
- #Machine Learning
- #Quantization
- #Large Language Models
- Introduces a 'linearity theorem' linking the layer-wise ℓ₂ reconstruction error to the model's perplexity increase under quantization (a schematic form is given after this list).
- Presents HIGGS, a data-free LLM quantization method that combines Hadamard rotations with MSE-optimal grids and outperforms prior data-free approaches such as NF4 (see the rotate-then-quantize sketch below).
- Derives an optimal assignment of non-uniform per-layer quantization levels under a global compression constraint, solved via dynamic programming (toy allocation example below).
- Demonstrates improved accuracy-compression trade-offs on Llama-3.1/3.2-family and Qwen-family models.
- Provides efficient GPU kernel support across batch sizes, advancing data-free and non-uniform quantization for LLMs.
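To make the first bullet concrete, here is a schematic rendering of the linearity statement. The per-layer coefficients a_ℓ and the exact error measure are simplifications assumed for illustration, not the paper's precise theorem:

```latex
% Schematic: the perplexity increase decomposes linearly over layers,
% weighted by per-layer sensitivities a_\ell (treated as given here).
\Delta \mathrm{PPL} \;\approx\; \sum_{\ell=1}^{L} a_\ell \,
    \bigl\lVert W_\ell - \widehat{W}_\ell \bigr\rVert_2^2
```

Because the total error is additive over layers, each layer's ℓ₂ reconstruction error can be minimized independently, which is what makes a data-free, per-layer method viable in the first place.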
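A minimal NumPy sketch of the rotate-then-quantize idea from the second bullet. The grid values are the classical Lloyd-Max 4-level quantizer for a unit Gaussian; the group size, per-group scaling, and function name are assumptions for illustration, not HIGGS's exact implementation:

```python
import numpy as np
from scipy.linalg import hadamard

# Illustrative 2-bit (4-level) Lloyd-Max grid for N(0, 1); HIGGS uses
# MSE-optimal grids per bit-width, simplified here to the 2-bit case.
GRID = np.array([-1.510, -0.4528, 0.4528, 1.510])

def higgs_like_quantize(w: np.ndarray, group: int = 64) -> np.ndarray:
    """Rotate-then-quantize sketch: an orthonormal Hadamard rotation makes
    each weight group look approximately i.i.d. Gaussian, so a fixed
    Gaussian-MSE-optimal grid applies without any calibration data."""
    assert w.size % group == 0, "pad the weight tensor to a group multiple"
    H = hadamard(group) / np.sqrt(group)      # orthonormal: H @ H.T = I
    g = w.reshape(-1, group) @ H              # rotate each group
    s = g.std(axis=1, keepdims=True) + 1e-12  # per-group scale, kept in fp
    idx = np.abs((g / s)[..., None] - GRID).argmin(axis=-1)
    q = GRID[idx] * s                         # snap to nearest grid point
    return (q @ H.T).reshape(w.shape)         # undo the rotation

# Quick check on synthetic Gaussian weights:
w = np.random.randn(4096)
print(np.mean((w - higgs_like_quantize(w)) ** 2))  # ~0.12, the 2-bit Lloyd-Max distortion
```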
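And a toy dynamic program in the spirit of the third bullet: since the theorem makes total error additive over layers, choosing per-layer bit-widths under a global bit budget is a knapsack-style DP. The error numbers, bit choices, and uniform per-layer cost are placeholder assumptions:

```python
def allocate_bits(errors, budget):
    """errors[l][b]: proxy error of layer l quantized to b bits (toy data).
    Returns (min total error, bits per layer) with sum(bits) <= budget.
    A real allocation would weight each layer by its parameter count."""
    dp = {0: (0.0, [])}                       # bits used -> (error, choices)
    for layer_err in errors:
        nxt = {}
        for used, (err, picks) in dp.items():
            for b, e in layer_err.items():
                u = used + b
                if u > budget:
                    continue                  # over the compression budget
                cand = (err + e, picks + [b])
                if u not in nxt or cand[0] < nxt[u][0]:
                    nxt[u] = cand             # keep the cheaper assignment
        dp = nxt
    return min(dp.values(), key=lambda t: t[0])

# Toy usage: four layers, 2/3/4-bit options, average budget of 3 bits/layer.
errors = [{2: 0.30, 3: 0.08, 4: 0.02},
          {2: 0.50, 3: 0.12, 4: 0.03},
          {2: 0.20, 3: 0.06, 4: 0.02},
          {2: 0.40, 3: 0.10, 4: 0.03}]
best_err, bits = allocate_bits(errors, budget=12)
print(best_err, bits)  # more sensitive layers get more bits
```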