Using group theory to explore the space of positional encodings for attention
- #One-Parameter Groups
- #Attention Mechanisms
- #Positional Encoding
- Attention mechanisms in language models require positional encoding to incorporate sequence order, as dot products alone lack positional information.
- Requiring positional encodings to be linear, translation-invariant, and continuous forces them into a one-parameter group, so every such encoding takes the matrix-exponential form P(t) = exp(tG) for some generator matrix G.
- Diagonalizable generators recover familiar encodings: real eigenvalues produce exponential decay (common in linear attention), while complex-conjugate eigenvalue pairs produce rotations, i.e. RoPE, possibly with damping (as used in RetNet and Mamba-3).
- Defective (non-diagonalizable) generators produce positional encodings with polynomial terms, a class that is theoretically possible but unexplored and likely impractical.
- ALiBi, though nonlinear, can be implemented with linear encodings by using a defective matrix that grows one component linearly with position, providing a concrete example of such a generator in use.
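The claims above can be checked numerically. The sketch below is a minimal illustration, not any model's actual parametrization: `matrix_exp` is a truncated Taylor series, and the generators are toy 2×2 examples chosen to exhibit each class. It verifies the one-parameter group law P(s+t) = P(s)·P(t), the rotation (RoPE-style) case, the decay case, and the defective (nilpotent) case whose exponential grows a component linearly with position.

```python
import numpy as np

def matrix_exp(A, terms=30):
    """Truncated Taylor series exp(A) = sum_k A^k / k!.
    Adequate for the small, well-scaled matrices used here."""
    out = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

# 1) One-parameter group law: P(s + t) = P(s) @ P(t),
#    since sG and tG commute for any single generator G.
G = np.array([[0.0, -1.3], [1.3, -0.1]])  # arbitrary toy generator
P = lambda t: matrix_exp(t * G)
assert np.allclose(P(0.7 + 0.5), P(0.7) @ P(0.5))

# 2) Complex-conjugate eigenvalues: a skew-symmetric generator
#    exponentiates to a rotation, the RoPE building block.
theta = 0.3
J = np.array([[0.0, -theta], [theta, 0.0]])
t = 2.0
assert np.allclose(matrix_exp(t * J),
                   [[np.cos(t * theta), -np.sin(t * theta)],
                    [np.sin(t * theta),  np.cos(t * theta)]])

# 3) Real negative eigenvalue: exponential decay with position,
#    the gating familiar from linear attention / RetNet-style damping.
assert np.allclose(matrix_exp(np.array([[-0.5]]) * 3.0), np.exp(-1.5))

# 4) Defective generator: a nilpotent block (N @ N = 0) gives
#    exp(tN) = I + tN, i.e. one component grows linearly in t,
#    which is the trick behind a linear implementation of ALiBi.
N = np.array([[0.0, 1.0], [0.0, 0.0]])
t = 4.0
assert np.allclose(matrix_exp(t * N), [[1.0, t], [0.0, 1.0]])

print("all group-theoretic encoding identities verified")
```

Note that case 4 could not arise from any diagonalizable generator: linear growth in t is exactly the polynomial term contributed by a nontrivial Jordan block.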