Log-Linear Attention
a year ago
- #Transformers
- #Machine Learning
- #Attention Mechanism
- The attention mechanism in Transformers is central to sequence modeling but incurs quadratic compute and linear memory cost in sequence length.
- Linear attention and state-space models offer linear-time, constant-memory sequence modeling, but their fixed-size hidden state limits how much of the context they can retain (see the first sketch after this list).
- Log-linear attention is introduced as a middle ground between efficiency and expressiveness: the single fixed-size state is replaced by a logarithmically growing set of hidden states, each summarizing a segment of the prefix (second sketch below).
- Log-linear attention can be applied to existing linear attention variants and maintains matmul-rich parallelization with log-linear compute cost.
- Case studies show log-linear variants of Mamba-2 and Gated DeltaNet perform well compared to their linear-time counterparts.
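- A minimal sketch of the fixed-state bottleneck (my own illustration, not code from the paper; the function name `linear_attention` and the unnormalized form are assumed for exposition): causal linear attention squashes the whole prefix into one state matrix `S` of shape `(d_k, d_v)`, updated by a rank-1 outer product per token, so memory is constant in sequence length and compute is linear.
  ```python
  import numpy as np

  def linear_attention(Q, K, V):
      """Q, K: (T, d_k); V: (T, d_v). Causal, unnormalized linear attention."""
      T, d_k = Q.shape
      d_v = V.shape[1]
      S = np.zeros((d_k, d_v))       # single fixed-size state holding all history
      out = np.zeros((T, d_v))
      for t in range(T):
          S += np.outer(K[t], V[t])  # rank-1 update: S accumulates k_t v_t^T
          out[t] = Q[t] @ S          # o_t = q_t^T * sum_{s<=t} k_s v_s^T
      return out
  ```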
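- A second sketch of the log-linear idea (again my illustration under assumptions, not the paper's exact algorithm: it uses a binary-counter / Fenwick-tree-style partition and uniform per-level weights, whereas the paper uses data-dependent weights): the prefix is covered by O(log t) power-of-two segments, each with its own state, and the query reads all of them.
  ```python
  import numpy as np

  def log_linear_attention(Q, K, V, level_weight=None):
      """Q, K: (T, d_k); V: (T, d_v). O(log t) states per step instead of one."""
      T, d_k = Q.shape
      d_v = V.shape[1]
      stack = []                     # list of (level, state); a level-l state covers 2**l tokens
      out = np.zeros((T, d_v))
      for t in range(T):
          stack.append((0, np.outer(K[t], V[t])))   # fresh level-0 state for token t
          # merge adjacent equal-level segments, like carrying in a binary counter,
          # so at most one state per level survives -> O(log t) states in total
          while len(stack) >= 2 and stack[-1][0] == stack[-2][0]:
              lvl, s1 = stack.pop()
              _, s0 = stack.pop()
              stack.append((lvl + 1, s0 + s1))
          # the query reads every surviving state; weights are uniform here,
          # data-dependent per-level scalars in the paper
          o = np.zeros(d_v)
          for lvl, S in stack:
              w = 1.0 if level_weight is None else level_weight(t, lvl)
              o += w * (Q[t] @ S)
          out[t] = o
      return out
  ```
  With all weights fixed at 1.0 the segment states simply sum back to the single linear-attention state, so this reduces to the previous sketch; the added expressiveness comes from weighting recent, fine-grained segments differently from older, coarser ones.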