Spikes in LLMs Are Bias Vectors: Spike-Free Quantization
a day ago
- #Large Language Models
- #Activation Spikes
- #Quantization
- Activation spikes in LLMs degrade quantization by inflating the dynamic range the quantizer has to cover; the study argues that these spikes are structural bias vectors carried by tokens, not merely scalar biases.
- The study shows that tokens converge to constant vectors after normalization, which drives the attention sink and value-state drain phenomena, and it explains the geometric coordination of the projection weights.
- Models actively preserve structural biases against RoPE perturbations using low-frequency bands and coherent channel pairs in zones of rotational stability.
- The study proposes INSERTQUANT, a post-training quantization (PTQ) framework that clamps spikes and restores their function with pre-computed template vectors, enabling spike-free, low-bit quantization with high fidelity.
- The method is reported to achieve state-of-the-art results on LLMs and to generalize beyond language models, for example to Vision Transformers (ViTs).
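
To make the dynamic-range argument and the clamp-and-restore idea concrete, here is a minimal NumPy sketch. It is not the INSERTQUANT implementation: the symmetric per-tensor quantizer, the threshold `tau`, the helper names, and the toy activations are all assumptions chosen for illustration. It shows how a single spike token can inflate the quantization scale for every other token, and how clamping followed by adding back a pre-computed template vector could avoid that.

```python
import numpy as np

def quantize_dequantize(x, bits=4):
    """Symmetric per-tensor fake quantization: one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

def clamp_and_restore(x, spike_rows, template, tau, bits=4):
    """Hypothetical helper: clamp to [-tau, tau], quantize, then add the
    full-precision template back onto the rows known to carry the spike."""
    clamped = np.clip(x, -tau, tau)
    out = quantize_dequantize(clamped, bits)
    out[spike_rows] += template
    return out

rng = np.random.default_rng(0)

# Toy activations: 16 tokens x 8 channels of unit-variance noise.
x = rng.standard_normal((16, 8))

# Assumption: token 0 carries a constant spike vector (two huge channels).
spike_vector = np.zeros(8)
spike_vector[[2, 5]] = [80.0, -60.0]
x[0] = spike_vector

tau = 4.0  # illustrative clamp threshold, not a value from the study
# Template computed ahead of time: the part of the spike that clamping removes.
template = spike_vector - np.clip(spike_vector, -tau, tau)

naive = quantize_dequantize(x)
restored = clamp_and_restore(x, [0], template, tau)

# Mean absolute error on the ordinary (non-spike) tokens.
err_naive = np.abs(naive[1:] - x[1:]).mean()
err_clamped = np.abs(restored[1:] - x[1:]).mean()

print(f"error on normal tokens, naive 4-bit:      {err_naive:.3f}")
print(f"error on normal tokens, clamp + template: {err_clamped:.3f}")
print("spike token recovered:", np.allclose(restored[0], x[0], atol=1e-6))
```

In this toy setup the spike sets the per-tensor scale, so most ordinary activations round to zero under naive quantization. After clamping, the scale is set by `tau` instead, and the spike token's large channels are reinstated from the template in full precision. How the actual method detects spikes, chooses thresholds, and builds its templates is not described in the summary above.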