Using group theory to explore the space of positional encodings for attention

  • #One-Parameter Groups
  • #Attention Mechanisms
  • #Positional Encoding
  • Attention mechanisms in language models require positional encoding to incorporate sequence order, since dot products alone carry no positional information (a quick permutation check is sketched after this list).
  • Good positional encodings must be linear, translation-invariant, and continuous; together these constraints force a one-parameter group structure, so every such encoding takes matrix-exponential form (made precise below).
  • Diagonalizable generators recover the familiar encodings: real eigenvalues produce exponential decay (common in linear attention), while complex-conjugate pairs yield RoPE, optionally damped (as used in RetNet and Mamba-3); see the damped-rotation sketch below.
  • Defective (non-diagonalizable) generators produce positional encodings with polynomial terms, a theoretically possible but unexplored and likely impractical class; the Jordan-block example below shows where the polynomials come from.
  • ALiBi, though nonlinear at first glance, can be implemented with linear encodings by using a defective generator to grow one component linearly with position, giving a concrete instance of this class (see the final sketch below).
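
As a quick check on the first bullet: a minimal numpy sketch (the `attention` helper, shapes, and seed are illustrative, not from the article) showing that bare dot-product attention is permutation-equivariant, so token order carries no signal on its own.

```python
import numpy as np

def attention(Q, K, V):
    """Plain scaled dot-product attention with no positional encoding."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))   # 5 tokens, model dim 8
perm = rng.permutation(5)

# Shuffling the tokens merely shuffles the outputs: without positional
# encoding, the mechanism cannot tell one ordering from another.
assert np.allclose(attention(X, X, X)[perm],
                   attention(X[perm], X[perm], X[perm]))
```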
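The one-parameter group claim in the second bullet compresses a short argument; stated compactly (with the generator written as A, notation assumed here): linearity lets position t act as a matrix P(t), translation invariance forces the group law, and continuity pins down the exponential form.

```latex
P(s)\,P(t) = P(s + t), \qquad P(0) = I, \qquad t \mapsto P(t)\ \text{continuous}
\;\Longrightarrow\;
P(t) = e^{tA}, \qquad A = \left.\tfrac{d}{dt} P(t)\right|_{t=0}.
```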
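For the diagonalizable case, a small sketch using `scipy.linalg.expm` (parameter values illustrative): a real generator with complex-conjugate eigenvalues exponentiates to a damped rotation, i.e. plain RoPE when the decay rate is zero and a RetNet-style decaying rotation otherwise.

```python
import numpy as np
from scipy.linalg import expm

theta, gamma = 0.7, 0.05  # rotation frequency and decay rate (illustrative)

# Real 2x2 generator whose eigenvalues are the conjugate pair -gamma ± i*theta.
A = np.array([[-gamma, -theta],
              [ theta, -gamma]])

for t in [0.0, 1.0, 2.0]:
    R = np.array([[np.cos(theta * t), -np.sin(theta * t)],
                  [np.sin(theta * t),  np.cos(theta * t)]])
    # exp(tA) factors into a scalar decay times a rotation: e^{-gamma*t} R(theta*t).
    assert np.allclose(expm(t * A), np.exp(-gamma * t) * R)

# gamma = 0 recovers plain RoPE; gamma > 0 gives the damped rotations
# associated here with RetNet-style decay.
```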
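The polynomial terms in the defective case come straight from exponentiating a Jordan block; for a 3x3 block with eigenvalue lambda:

```latex
A = \begin{pmatrix} \lambda & 1 & 0 \\ 0 & \lambda & 1 \\ 0 & 0 & \lambda \end{pmatrix}
\quad\Longrightarrow\quad
e^{tA} = e^{\lambda t}
\begin{pmatrix} 1 & t & t^2/2 \\ 0 & 1 & t \\ 0 & 0 & 1 \end{pmatrix}.
```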
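Finally, a sketch of the ALiBi construction (the bias-channel vectors `q` and `k` are an illustrative choice, not the article's exact setup): a nilpotent generator exponentiates to I + tN, so one coordinate grows linearly with relative offset, and contracting it against fixed query/key components reproduces ALiBi's linear distance penalty.

```python
import numpy as np
from scipy.linalg import expm

# Defective generator: eigenvalue 0, a single 2x2 Jordan block (nilpotent).
N = np.array([[0.0, 1.0],
              [0.0, 0.0]])

slope = 0.5  # stands in for an ALiBi head slope m (value illustrative)

# Because N @ N = 0, the exponential series truncates: exp(t*N) = I + t*N,
# so the off-diagonal entry grows linearly with the offset t.
for t in [0.0, 3.0, 7.0]:
    assert np.allclose(expm(t * N), np.eye(2) + t * N)

# Give queries and keys a 2-dim bias channel (hypothetical choice of vectors):
q = np.array([-slope, 0.0])
k = np.array([0.0, 1.0])

# The channel's contribution to the attention score at offset t is
#   q @ exp(t*N) @ k = q @ k + t * (q @ N @ k) = -slope * t,
# i.e. ALiBi's linear penalty on distance, from a purely linear encoding.
for t in [0.0, 3.0, 7.0]:
    assert np.isclose(q @ expm(t * N) @ k, -slope * t)
```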