You could have designed state of the art positional encoding
- #machine-learning
- #positional-encoding
- #transformers
- The post discusses the iterative development of positional encoding in transformer models, leading to Rotary Positional Encoding (RoPE).
- Positional encoding is essential in transformers because self-attention alone is permutation-equivariant: it has no built-in notion of token order.
- A motivating example shows that, without positional encoding, identical tokens at different positions produce identical outputs, so order-dependent meaning is lost (see the first sketch after this list).
- Desirable properties for positional encoding include unique encodings per position, linear relations between positions, generalization to longer sequences, deterministic generation, and extensibility to multiple dimensions.
- Early attempts such as adding the raw integer position or its binary representation fall short: integer values grow without bound and can swamp the token embeddings, while binary encodings jump discontinuously from one position to the next.
- Sinusoidal positional encoding, introduced in the 'Attention is All You Need' paper, uses sine and cosine functions at geometrically spaced frequencies to provide smooth, continuous encodings (see the sinusoidal sketch after this list).
- Rotary Positional Encoding (RoPE) improves on additive sinusoidal encoding by rotating the query and key vectors by position-dependent angles, so the self-attention dot product depends only on relative position while the semantic content of the vectors is preserved (see the RoPE sketch after this list).
- RoPE can be extended to higher-dimensional inputs such as images by handling each axis independently, e.g. rotating one half of the channels by the row coordinate and the other half by the column coordinate, which maintains the structure of the space (also shown in the RoPE sketch).
- Despite its advantages, RoPE has limitations, and future improvements may draw from signal processing or hierarchical implementations.
- Positional encoding is a critical but often overlooked component of transformers, and the post encourages viewing it as a key area for innovation.
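
A minimal sketch of the motivating example, assuming a toy single-head attention with identity Q/K/V projections in NumPy (not the post's actual code): without positional information, two occurrences of the same token get identical representations.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Toy single-head attention with identity Q/K/V projections:
    # scores depend only on token content, never on position.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

rng = np.random.default_rng(0)
dog, the, chased = rng.normal(size=(3, 4))  # made-up token embeddings

# "dog ... dog": the same token appears at positions 0 and 3.
seq = np.stack([dog, the, chased, dog])
out = self_attention(seq)

# Without positional encoding the two occurrences are indistinguishable.
print(np.allclose(out[0], out[3]))  # True
```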
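A short sketch of the sinusoidal scheme; the function name and NumPy implementation are my own, but the formula PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d)) is the one from 'Attention is All You Need'.

```python
import numpy as np

def sinusoidal_encoding(seq_len: int, d_model: int, base: float = 10000.0) -> np.ndarray:
    """Sinusoidal positional encodings (d_model assumed even)."""
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # (1, d_model/2)
    angles = positions / np.power(base, dims / d_model)    # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even channels: sine
    pe[:, 1::2] = np.cos(angles)  # odd channels: cosine
    return pe

pe = sinusoidal_encoding(seq_len=128, d_model=64)
print(pe.shape)  # (128, 64); each row is a smooth, unique position code
```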
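A sketch of RoPE and its 2D extension, assuming NumPy and illustrative helper names (`rope`, `rope_2d`) rather than the post's code; it checks the key property that the rotated dot product depends only on the relative offset between positions.

```python
import numpy as np

def rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive dimension pairs of x (seq_len, d) by position-dependent angles."""
    d = x.shape[-1]
    freqs = 1.0 / np.power(base, np.arange(0, d, 2) / d)  # (d/2,) geometric frequencies
    angles = positions[:, None] * freqs[None, :]           # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                        # pair up dimensions
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# The rotated dot product depends only on the relative offset m - n,
# not on the absolute positions m and n.
rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, 8))
score_a = rope(q[None], np.array([3]))[0] @ rope(k[None], np.array([1]))[0]
score_b = rope(q[None], np.array([10]))[0] @ rope(k[None], np.array([8]))[0]
print(np.allclose(score_a, score_b))  # True: both pairs are 2 positions apart

# 2D extension for images: rotate the first half of the channels by the
# row coordinate and the second half by the column coordinate.
def rope_2d(x: np.ndarray, rows: np.ndarray, cols: np.ndarray) -> np.ndarray:
    half = x.shape[-1] // 2
    return np.concatenate([rope(x[:, :half], rows), rope(x[:, half:], cols)], axis=-1)

patches = rng.normal(size=(4, 16))                 # a 2x2 grid of patch vectors
rows, cols = np.array([0, 0, 1, 1]), np.array([0, 1, 0, 1])
print(rope_2d(patches, rows, cols).shape)          # (4, 16)
```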