Model-Preserving Adaptive Rounding
- #LLMs
- #Machine Learning
- #Quantization
- Introduces YAQA, an adaptive rounding algorithm for post-training quantization (PTQ) of LLMs.
- Uses Kronecker-factored approximations of each linear layer's Hessian with respect to the full-model KL divergence (a KFAC-style sketch of this factorization follows the list).
- YAQA has two components: Kronecker-factored sketches of the full layerwise Hessian and a quantizer-independent rounding algorithm that uses these sketches (see the rounding sketch after the list).
- Empirically reduces the KL divergence to the original model by ≈30% relative to prior rounding algorithms, while achieving state-of-the-art performance on downstream tasks.
- Scales to hundred-billion-parameter LLMs and works across a wide range of models and quantizers.
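
To make the Kronecker-factored Hessian idea concrete, here is a minimal NumPy sketch of a KFAC-style factorization for a single linear layer. This is not the paper's sketching procedure (YAQA constructs its own sketches of the full layerwise Hessian); the function name and arguments are illustrative, and it assumes layer inputs and output gradients of the full-model KL have already been collected on calibration data.

```python
import numpy as np

def kronecker_hessian_sketch(xs, gs):
    """Estimate Kronecker factors (A, B) of a linear layer's Hessian.

    For a layer y = W @ x, a Fisher / Gauss-Newton style approximation of the
    Hessian with respect to vec(W) factors as H ~= A (kron) B, where
        A = E[x x^T]   -- second moment of the layer's inputs
        B = E[g g^T]   -- second moment of gradients of the full-model KL
                          with respect to the layer's outputs.

    xs: (n_samples, d_in)  layer inputs from calibration data    (assumed given)
    gs: (n_samples, d_out) output gradients of the full-model KL (assumed given)
    """
    n = xs.shape[0]
    A = xs.T @ xs / n   # (d_in, d_in)
    B = gs.T @ gs / n   # (d_out, d_out)
    return A, B
```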
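
For the rounding side, the sketch below shows adaptive rounding against a Hessian-weighted quadratic proxy. This is not YAQA's rounding algorithm; it is a simple coordinate-descent baseline on the proxy loss tr((W_hat - W) A (W_hat - W)^T) that uses only the input-side factor A, with a hypothetical `adaptive_round` function and a made-up quantization grid, just to illustrate how the Hessian steers rounding decisions away from naive round-to-nearest.

```python
import numpy as np

def adaptive_round(W, A, grid, n_passes=3):
    """Round W onto `grid` by coordinate descent on the quadratic proxy loss
        sum over rows r of (w_hat_r - w_r)^T A (w_hat_r - w_r),
    a standard surrogate for the loss increase caused by quantizing a linear layer.

    W:    (d_out, d_in) original weights
    A:    (d_in, d_in)  input-side Hessian factor, e.g. E[x x^T]
    grid: 1-D array of representable values
    """
    # Start from round-to-nearest.
    W_hat = grid[np.abs(W[:, :, None] - grid[None, None, :]).argmin(-1)]
    delta = W_hat - W
    for _ in range(n_passes):
        for i in range(W.shape[1]):
            # Rounding error in the other columns couples into column i via A.
            coupling = delta @ A[:, i] - delta[:, i] * A[i, i]  # sum_{j != i} A_ij * delta_j
            # Unconstrained per-coordinate optimum, then snap to the nearest grid point.
            target = W[:, i] - coupling / A[i, i]
            W_hat[:, i] = grid[np.abs(target[:, None] - grid[None, :]).argmin(-1)]
            delta[:, i] = W_hat[:, i] - W[:, i]
    return W_hat

# Example usage with a toy symmetric 4-bit grid (all values here are made up):
# W = np.random.randn(64, 128); X = np.random.randn(1024, 128)
# A = X.T @ X / len(X)
# grid = np.linspace(-1, 1, 16) * np.abs(W).max()
# W_hat = adaptive_round(W, A, grid)
```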