Using group theory to explore the space of positional encodings for attention
- #One-Parameter Groups
- #Attention Mechanisms
- #Positional Encoding
- Attention mechanisms in language models require positional encoding to incorporate sequence order, as dot products alone lack positional information.
- Requiring positional encodings to be linear, translation-invariant, and continuous forces them into a one-parameter group, so every such encoding takes the matrix-exponential form P(t) = exp(tG) for some generator matrix G.
- Diagonalizable generators recover familiar encodings: real eigenvalues produce exponential decay (common in linear attention), while complex-conjugate eigenvalue pairs produce rotations, i.e. RoPE, possibly with damping (as used in RetNet and Mamba-3).
- Defective (non-diagonalizable) generators produce positional encodings with polynomial terms, a class that is theoretically possible but unexplored and likely impractical.
- ALiBi, though nonlinear, can be implemented with linear encodings by using a defective matrix that grows one component linearly with position, providing a concrete example of such a generator in use.
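The claims above can be checked numerically. The sketch below is a minimal illustration, not any model's actual parametrization: `matrix_exp` is a truncated Taylor series, and the generators are toy 2×2 examples chosen to exhibit each class. It verifies the one-parameter group law P(s+t) = P(s)·P(t), the rotation (RoPE-style) case, the decay case, and the defective (nilpotent) case whose exponential grows a component linearly with position.

```python
import numpy as np

def matrix_exp(A, terms=30):
    """Truncated Taylor series exp(A) = sum_k A^k / k!.
    Adequate for the small, well-scaled matrices used here."""
    out = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

# 1) One-parameter group law: P(s + t) = P(s) @ P(t),
#    since sG and tG commute for any single generator G.
G = np.array([[0.0, -1.3], [1.3, -0.1]])  # arbitrary toy generator
P = lambda t: matrix_exp(t * G)
assert np.allclose(P(0.7 + 0.5), P(0.7) @ P(0.5))

# 2) Complex-conjugate eigenvalues: a skew-symmetric generator
#    exponentiates to a rotation, the RoPE building block.
theta = 0.3
J = np.array([[0.0, -theta], [theta, 0.0]])
t = 2.0
assert np.allclose(matrix_exp(t * J),
                   [[np.cos(t * theta), -np.sin(t * theta)],
                    [np.sin(t * theta),  np.cos(t * theta)]])

# 3) Real negative eigenvalue: exponential decay with position,
#    the gating familiar from linear attention / RetNet-style damping.
assert np.allclose(matrix_exp(np.array([[-0.5]]) * 3.0), np.exp(-1.5))

# 4) Defective generator: a nilpotent block (N @ N = 0) gives
#    exp(tN) = I + tN, i.e. one component grows linearly in t,
#    which is the trick behind a linear implementation of ALiBi.
N = np.array([[0.0, 1.0], [0.0, 0.0]])
t = 4.0
assert np.allclose(matrix_exp(t * N), [[1.0, t], [0.0, 1.0]])

print("all group-theoretic encoding identities verified")
```

Note that case 4 could not arise from any diagonalizable generator: linear growth in t is exactly the polynomial term contributed by a nontrivial Jordan block.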