Model-Preserving Adaptive Rounding
- #LLMs
- #Machine Learning
- #Quantization
- Introduces YAQA, an adaptive rounding algorithm for post-training quantization (PTQ) of LLMs.
- Uses Kronecker-factored approximations of each linear layer's Hessian with respect to the full-model KL divergence (a KFAC-style sketch of this factorization follows the list).
- YAQA has two components: Kronecker-factored sketches of the full layerwise Hessian and a quantizer-independent rounding algorithm that uses these sketches (see the rounding sketch after the list).
- Empirically reduces the KL divergence to the original model by ≈30% relative to prior rounding algorithms, while achieving state-of-the-art performance on downstream tasks.
- Scales to hundred-billion-parameter LLMs and works across a wide range of models and quantizers.
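
To make the Kronecker-factored Hessian idea concrete, here is a minimal NumPy sketch of a KFAC-style factorization for a single linear layer. This is not the paper's sketching procedure (YAQA constructs its own sketches of the full layerwise Hessian); the function name and arguments are illustrative, and it assumes layer inputs and output gradients of the full-model KL have already been collected on calibration data.

```python
import numpy as np

def kronecker_hessian_sketch(xs, gs):
    """Estimate Kronecker factors (A, B) of a linear layer's Hessian.

    For a layer y = W @ x, a Fisher / Gauss-Newton style approximation of the
    Hessian with respect to vec(W) factors as H ~= A (kron) B, where
        A = E[x x^T]   -- second moment of the layer's inputs
        B = E[g g^T]   -- second moment of gradients of the full-model KL
                          with respect to the layer's outputs.

    xs: (n_samples, d_in)  layer inputs from calibration data    (assumed given)
    gs: (n_samples, d_out) output gradients of the full-model KL (assumed given)
    """
    n = xs.shape[0]
    A = xs.T @ xs / n   # (d_in, d_in)
    B = gs.T @ gs / n   # (d_out, d_out)
    return A, B
```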
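
For the rounding side, the sketch below shows adaptive rounding against a Hessian-weighted quadratic proxy. This is not YAQA's rounding algorithm; it is a simple coordinate-descent baseline on the proxy loss tr((W_hat - W) A (W_hat - W)^T) that uses only the input-side factor A, with a hypothetical `adaptive_round` function and a made-up quantization grid, just to illustrate how the Hessian steers rounding decisions away from naive round-to-nearest.

```python
import numpy as np

def adaptive_round(W, A, grid, n_passes=3):
    """Round W onto `grid` by coordinate descent on the quadratic proxy loss
        sum over rows r of (w_hat_r - w_r)^T A (w_hat_r - w_r),
    a standard surrogate for the loss increase caused by quantizing a linear layer.

    W:    (d_out, d_in) original weights
    A:    (d_in, d_in)  input-side Hessian factor, e.g. E[x x^T]
    grid: 1-D array of representable values
    """
    # Start from round-to-nearest.
    W_hat = grid[np.abs(W[:, :, None] - grid[None, None, :]).argmin(-1)]
    delta = W_hat - W
    for _ in range(n_passes):
        for i in range(W.shape[1]):
            # Rounding error in the other columns couples into column i via A.
            coupling = delta @ A[:, i] - delta[:, i] * A[i, i]  # sum_{j != i} A_ij * delta_j
            # Unconstrained per-coordinate optimum, then snap to the nearest grid point.
            target = W[:, i] - coupling / A[i, i]
            W_hat[:, i] = grid[np.abs(target[:, None] - grid[None, :]).argmin(-1)]
            delta[:, i] = W_hat[:, i] - W[:, i]
    return W_hat

# Example usage with a toy symmetric 4-bit grid (all values here are made up):
# W = np.random.randn(64, 128); X = np.random.randn(1024, 128)
# A = X.T @ X / len(X)
# grid = np.linspace(-1, 1, 16) * np.abs(W).max()
# W_hat = adaptive_round(W, A, grid)
```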