Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long Context Horizons
a year ago
- #Machine Learning
- #Efficient Architectures
- #Large Language Models
- Introduces a novel non-attention-based architecture for large language models (LLMs) that can handle ultra-long context windows (hundreds of thousands to millions of tokens).
- Avoids the quadratic memory and compute overhead of traditional Transformer designs by eliminating token-to-token attention entirely.
- Combines four components (see the sketch after this list):
  - State Space blocks (inspired by S4) for near-linear scaling with sequence length;
  - Multi-Resolution Convolution layers to capture local context;
  - a lightweight Recurrent Supervisor that maintains a compact global hidden state;
  - Retrieval-Augmented External Memory that stores and retrieves high-level chunk embeddings.
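A minimal PyTorch sketch of how these four pieces could fit together, under loud assumptions: the class names, shapes, toy diagonal recurrence, and pooled "supervisor" update are illustrative placeholders rather than the paper's implementation, and a real S4 layer uses a structured state matrix with an FFT/parallel-scan computation instead of a Python loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResConv(nn.Module):
    """Depthwise causal convolutions at several dilations for local context (assumed design)."""
    def __init__(self, d_model, kernel_size=3, dilations=(1, 2, 4)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(d_model, d_model, kernel_size, groups=d_model,
                      padding=(kernel_size - 1) * d, dilation=d)
            for d in dilations
        ])

    def forward(self, x):                 # x: (batch, seq, d_model)
        y = x.transpose(1, 2)             # -> (batch, d_model, seq)
        out = sum(conv(y)[..., : y.size(-1)] for conv in self.convs)  # trim to causal length
        return out.transpose(1, 2)

class ToySSM(nn.Module):
    """Toy diagonal state-space layer: h_t = a * h_{t-1} + b * x_t, linear in sequence length."""
    def __init__(self, d_model):
        super().__init__()
        self.a_logit = nn.Parameter(torch.zeros(d_model))  # per-channel decay, kept in (0, 1)
        self.b = nn.Parameter(torch.ones(d_model))

    def forward(self, x):                 # x: (batch, seq, d_model)
        a = torch.sigmoid(self.a_logit)
        h = x.new_zeros(x.size(0), x.size(2))
        outs = []
        for t in range(x.size(1)):        # sequential scan; real S4 parallelizes this step
            h = a * h + self.b * x[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)

class ChunkMemory:
    """Toy external memory: store chunk embeddings, retrieve nearest by cosine similarity."""
    def __init__(self):
        self.keys = []

    def add(self, emb):                   # emb: (d_model,)
        self.keys.append(emb.detach())

    def retrieve(self, query, k=4):       # query: (d_model,) -> (k', d_model)
        if not self.keys:
            return query.new_zeros(0, query.size(-1))
        keys = torch.stack(self.keys)
        sims = F.cosine_similarity(keys, query.unsqueeze(0), dim=-1)
        idx = sims.topk(min(k, keys.size(0))).indices
        return keys[idx]

class HybridBlock(nn.Module):
    """SSM for long-range mixing + multi-resolution conv for local detail,
    with a GRU 'supervisor' carrying a compact global state across chunks."""
    def __init__(self, d_model):
        super().__init__()
        self.ssm = ToySSM(d_model)
        self.conv = MultiResConv(d_model)
        self.supervisor = nn.GRUCell(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, global_state):   # x: (batch, seq, d_model); global_state: (batch, d_model)
        y = self.norm(x + self.ssm(x) + self.conv(x))
        # Update the global hidden state from a pooled summary of the current chunk.
        global_state = self.supervisor(y.mean(dim=1), global_state)
        return y + global_state.unsqueeze(1), global_state
```

In this sketch, a long document would be processed chunk by chunk: each chunk's pooled representation is written to `ChunkMemory`, and retrieved neighbors can be fed into the next chunk's input, so no step ever touches more than one chunk plus a handful of retrieved embeddings.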