Spikes in LLMs Are Bias Vectors: Spike-Free Quantization

Activation spikes in LLMs degrade quantization by stretching the dynamic range the quantizer must cover; the study argues that these spikes are structural vector biases carried by tokens, not just scalar biases.
The study shows that tokens converge to constant vectors after normalization, which drives both attention sink and value-state drain, and it explains the geometric coordination of projection weights.
Models actively preserve structural biases against RoPE (rotary position embedding) perturbations by using low-frequency bands and coherent channel pairs that sit in zones of rotational stability.
INSERTQUANT is proposed as a post-training quantization (PTQ) framework that clamps spikes and restores their function with pre-computed template vectors, enabling spike-free, low-bit quantization with high fidelity.
The method is reported to achieve state-of-the-art results on LLMs and to generalize to other modalities, such as vision with Vision Transformers (ViTs).
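
To make the dynamic-range problem concrete, here is a minimal NumPy sketch of why one activation spike hurts low-bit quantization and why clamping helps the remaining values. It is not the INSERTQUANT implementation: the symmetric per-tensor quantizer, the spike magnitude (500), the clamp threshold (6), and every function and variable name are assumptions chosen for illustration. Restoring the spike's function with template vectors is not modeled.

```python
import numpy as np

def quantize_dequantize(x, bits=4):
    """Symmetric per-tensor uniform quantization, then dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax          # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=(64, 128))   # toy activations: 64 tokens x 128 channels

# Inject a single hypothetical spike into one token/channel.
spiked = acts.copy()
spiked[0, 5] = 500.0

# Measure error only on the ordinary (non-spike) entries.
normal = np.ones_like(acts, dtype=bool)
normal[0, 5] = False

def normal_error(original, reconstructed):
    return float(np.mean(np.abs(original[normal] - reconstructed[normal])))

err_clean = normal_error(acts, quantize_dequantize(acts))

# With the spike present, the scale is so coarse that ordinary values round to zero.
deq_spiked = quantize_dequantize(spiked)
err_spiked = normal_error(spiked, deq_spiked)

# Clamping the spike before quantization restores a usable scale
# (the threshold 6.0 is an assumed value, not taken from the paper).
clamped = np.clip(spiked, -6.0, 6.0)
err_clamped = normal_error(spiked, quantize_dequantize(clamped))

print(f"no spike:      {err_clean:.3f}")
print(f"with spike:    {err_spiked:.3f}")
print(f"spike clamped: {err_clamped:.3f}")
```

Clamping alone discards whatever the spike was doing for the model, which is the gap the summary says the pre-computed template vectors are meant to fill.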