End of Transformer Era Approaches

6 months ago
  • #AI Research
  • #LLM
  • #Power Retention
  • Brumby-14B-Base is a completely attention-free LLM whose performance is competitive with state-of-the-art models.
  • The model replaces attention layers with power retention layers and is available on Hugging Face.
  • Training cost was $4,000 over 60 hours on 32 H100s, significantly cheaper than the roughly $200k typical for similar models.
  • Initial weights came from Qwen3-14B-Base, using a retraining technique that repurposes the Transformer's weights for power retention.
  • Power retention is a layer analogous to attention, but it operates as a true RNN, updating a recurrent state under gating signals (see the sketch after this list).
  • The model supports fast long-context inference, with kernel updates and long-context SFT coming soon.
  • Future plans include vLLM integration for improved inference speed and memory efficiency.
  • Brumby-14B-Base is the first in a family of models, with more sizes from 1B to >100B parameters coming soon.
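The summary does not spell out the exact power retention equations, so the snippet below is only a minimal PyTorch sketch of a generic gated, linear-attention-style recurrence that illustrates the idea: a fixed-size state is updated once per token under a gating signal and read out through query/key/value projections like those of an attention layer (which is also what makes seeding such a layer from Transformer weights plausible). All class, method, and parameter names are illustrative assumptions, not Brumby's actual API or formulation.

```python
import torch
import torch.nn as nn


class GatedRetentionSketch(nn.Module):
    """Illustrative recurrent layer: a constant-size state is updated per token
    with an outer-product write and a learned gate, then read with the query.
    This is a generic linear-attention-style recurrence, NOT the exact power
    retention layer used in Brumby-14B-Base."""

    def __init__(self, d_model: int):
        super().__init__()
        # Same projections an attention layer would use, which is what makes
        # reusing Transformer weights plausible (an assumption, see lead-in).
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        self.gate = nn.Linear(d_model, 1)  # per-token forget/retention gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, d = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        state = x.new_zeros(b, d, d)  # recurrent state, O(d^2) regardless of t
        outputs = []
        for i in range(t):
            g = torch.sigmoid(self.gate(x[:, i])).unsqueeze(-1)  # (b, 1, 1)
            # State update: decay the old state, write a key/value outer product.
            state = g * state + k[:, i].unsqueeze(-1) * v[:, i].unsqueeze(1)
            # Readout: query against the state, like attention without softmax.
            y = torch.einsum("bd,bde->be", q[:, i], state)
            outputs.append(y)
        return self.o_proj(torch.stack(outputs, dim=1))


# Quick smoke test with illustrative sizes.
layer = GatedRetentionSketch(d_model=64)
out = layer(torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```

Because the state is a fixed d×d matrix per layer regardless of sequence length, per-token compute and memory stay constant instead of growing with a KV cache, which is the property behind the fast long-context inference mentioned above.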