Hasty Briefs

Spikes in LLMs Are Bias Vectors: Spike-Free Quantization

a day ago
  • #Large Language Models
  • #Activation Spikes
  • #Quantization
  • Activation spikes in LLMs degrade quantization by stretching the dynamic range the quantizer has to cover; the study argues these spikes are structural vector biases carried by tokens, not just scalar biases.
  • The study shows tokens converge to constant vectors after normalization, driving attention sink and value-state drain, and explains the geometric coordination of projection weights.
  • Models actively preserve structural biases against RoPE perturbations using low-frequency bands and coherent channel pairs in zones of rotational stability.
  • INSERTQUANT is proposed as a PTQ framework that clamps spikes and restores their function with pre-computed template vectors, enabling spike-free, low-bit quantization with high fidelity.
  • The method reportedly achieves state-of-the-art results on LLMs and generalizes to other modalities, such as vision with ViTs.
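
To make the dynamic-range point in the summary concrete, here is a toy sketch in plain Python. It is not the INSERTQUANT implementation: the 4-bit symmetric absmax quantizer, the spike value of 500, the clip threshold of 6, and the `template` lookup are all assumptions chosen for illustration. It shows how one spike inflates the quantization step for every other value, and how clamping the spike and writing back a pre-computed value keeps the remaining values at a much finer resolution.

```python
import random

def quantize_absmax(values, bits=4):
    """Symmetric per-tensor fake quantization: the step size is set by the largest magnitude."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        return list(values)
    # Round to the integer grid, clip to the representable range, then dequantize.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mean_abs_error(reference, approx):
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)

random.seed(0)
normal = [random.gauss(0.0, 1.0) for _ in range(1024)]  # ordinary activations
spike_index = len(normal)
spiked = normal + [500.0]                                # one hypothetical spike channel

# Error on the ordinary activations, with and without the spike in the tensor.
err_no_spike = mean_abs_error(normal, quantize_absmax(normal))
err_with_spike = mean_abs_error(normal, quantize_absmax(spiked)[:spike_index])

# Assumed workaround: clamp before quantizing, then restore the spike
# from a value computed ahead of time (a stand-in for a template vector).
clip = 6.0
template = {spike_index: 500.0}
clamped = [max(-clip, min(clip, v)) for v in spiked]
restored = quantize_absmax(clamped)
for idx, value in template.items():
    restored[idx] = value
err_restored = mean_abs_error(spiked, restored)

print(f"no spike:          {err_no_spike:.3f}")
print(f"with spike:        {err_with_spike:.3f}")
print(f"clamp + template:  {err_restored:.3f}")
```

The restore step only works here because the spike's position and value are assumed to be known in advance, which is the property the summary attributes to spikes being constant, structural vectors.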