Hasty Briefs

Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long Context Horizons

a year ago
  • #Machine Learning
  • #Efficient Architectures
  • #Large Language Models
  • Introduces a novel non-attention based architecture for large language models (LLMs) capable of handling ultra-long context windows (hundreds of thousands to millions of tokens).
  • Avoids the quadratic memory and compute costs of traditional Transformer designs by eliminating token-to-token attention entirely.
  • Combines four components:
    • State Space blocks (inspired by S4) for near-linear scaling with sequence length;
    • Multi-Resolution Convolution layers for capturing local context;
    • a lightweight Recurrent Supervisor that maintains a global hidden state;
    • Retrieval-Augmented External Memory that stores and retrieves high-level chunk embeddings.
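The four components above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, kernel choices, and the exponential-moving-average "supervisor" are all simplifying assumptions, chosen only to show why each piece scales linearly (or near-linearly) with sequence length instead of quadratically.

```python
import numpy as np

def ssm_scan(x, a=0.9, b=0.1):
    """Diagonal state-space recurrence h[t] = a*h[t-1] + b*x[t].
    O(T) in sequence length, unlike O(T^2) attention. (Toy stand-in for S4.)"""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(len(x)):
        h = a * h + b * x[t]
        out[t] = h
    return out

def multires_conv(x, widths=(3, 7, 15)):
    """Causal depthwise convolutions at several kernel widths, averaged.
    Captures local context at multiple scales; cost is linear in T."""
    T, d = x.shape
    out = np.zeros_like(x)
    for w in widths:
        k = np.ones(w) / w  # simple averaging kernel as a placeholder
        for j in range(d):
            padded = np.concatenate([np.zeros(w - 1), x[:, j]])
            out[:, j] += np.convolve(padded, k, mode="valid")
    return out / len(widths)

def recurrent_supervisor(x):
    """Lightweight global summary: final state of a slow-decay scan
    serves as a single global hidden vector. (Assumed form.)"""
    return ssm_scan(x, a=0.99, b=0.01)[-1]

def chunk_memory(x, chunk=16):
    """External memory: mean-pooled chunk embeddings, one row per chunk."""
    return np.stack([x[i:i + chunk].mean(axis=0)
                     for i in range(0, len(x), chunk)])

def retrieve(memory, query, k=2):
    """Cosine-similarity top-k retrieval of stored chunk embeddings."""
    sims = memory @ query / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(query) + 1e-9)
    return memory[np.argsort(sims)[-k:]]

# Toy forward pass over a length-64 sequence of 8-dim embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))
y = ssm_scan(x) + multires_conv(x)   # sequence-level features
g = recurrent_supervisor(x)          # global hidden state
mem = chunk_memory(x)                # external memory of chunk embeddings
top = retrieve(mem, g)               # chunks most relevant to the global state
```

Note that every step touches each token a constant number of times, so total work grows linearly with context length; the external memory only stores one vector per chunk, keeping retrieval cheap even at millions of tokens.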