End of Transformer Era Approaches
6 months ago
- #AI Research
- #LLM
- #Power Retention
- Brumby-14B-Base is a completely attention-free LLM with performance competitive with state-of-the-art models.
- The model uses power retention layers in place of attention layers and is available on Hugging Face.
- Training cost was $4,000 over 60 hours on 32 H100s (about 1,920 GPU-hours, roughly $2/GPU-hour), far cheaper than the ~$200k typically spent on comparable models.
- Initial weights came from Qwen3-14B-Base; a retraining technique repurposes the pretrained Transformer weights for power retention.
- Power retention is a layer that plays a role similar to attention but operates as a true RNN, maintaining a recurrent state that is updated per token and modulated by gating signals.
- The model supports fast long-context inference, with kernel updates and long-context SFT planned.
- Future plans include vLLM integration for faster inference and better memory efficiency.
- Brumby-14B-Base is the first in a family of models, with more sizes from 1B to >100B parameters coming soon.
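To make the "true RNN with state updates and gating" point concrete, here is a minimal toy sketch of a gated linear-recurrence layer in NumPy. This is an illustrative assumption about the general layer family, not the actual power retention kernel: the function name, shapes, and the simple scalar gate are all invented for this example. The key property it demonstrates is that the state has fixed size regardless of sequence length, unlike attention's growing KV cache.

```python
import numpy as np

def gated_recurrent_mix(q, k, v, gates):
    """Toy gated linear-recurrence layer (illustrative only, not power retention).

    A (d_k, d_v) state matrix S is updated once per token:
        S_t = g_t * S_{t-1} + k_t (outer) v_t
        y_t = S_t^T q_t
    Memory is O(d_k * d_v), independent of sequence length.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))      # fixed-size recurrent state
    out = np.empty((T, d_v))
    for t in range(T):
        # Gating signal g_t in (0, 1] decays old state; outer product writes new token.
        S = gates[t] * S + np.outer(k[t], v[t])
        out[t] = S.T @ q[t]       # read out against the current query
    return out

rng = np.random.default_rng(0)
T, d_k, d_v = 8, 4, 4
q = rng.standard_normal((T, d_k))
k = rng.standard_normal((T, d_k))
v = rng.standard_normal((T, d_v))
gates = rng.uniform(0.8, 1.0, size=T)  # hypothetical per-token retention gates
y = gated_recurrent_mix(q, k, v, gates)
print(y.shape)  # (8, 4)
```

Because inference only ever touches the fixed-size state, long-context decoding costs constant memory per token, which is the property behind the fast long-context inference claim above.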