End of Transformer Era Approaches

6 months ago
  • #AI Research
  • #LLM
  • #Power Retention
  • Brumby-14B-Base is a completely attention-free LLM whose performance is competitive with state-of-the-art models.
  • The model replaces attention layers with power retention layers and is available on Hugging Face.
  • Training cost was $4,000 over 60 hours on 32 H100s, significantly cheaper than the roughly $200k typical for similar models.
  • Initial weights came from Qwen3-14B-Base, using a retraining technique that repurposes the Transformer's weights for power retention.
  • Power retention is a layer analogous to attention, but it operates as a true RNN, updating a recurrent state under gating signals (see the sketch after this list).
  • The model supports fast long-context inference, with kernel updates and long-context SFT coming soon.
  • Future plans include vLLM integration for improved inference speed and memory efficiency.
  • Brumby-14B-Base is the first in a family of models, with more sizes from 1B to >100B parameters coming soon.
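The summary does not spell out the exact power retention equations, so the snippet below is only a minimal PyTorch sketch of a generic gated, linear-attention-style recurrence that illustrates the idea: a fixed-size state is updated once per token under a gating signal and read out through query/key/value projections like those of an attention layer (which is also what makes seeding such a layer from Transformer weights plausible). All class, method, and parameter names are illustrative assumptions, not Brumby's actual API or formulation.

```python
import torch
import torch.nn as nn


class GatedRetentionSketch(nn.Module):
    """Illustrative recurrent layer: a constant-size state is updated per token
    with an outer-product write and a learned gate, then read with the query.
    This is a generic linear-attention-style recurrence, NOT the exact power
    retention layer used in Brumby-14B-Base."""

    def __init__(self, d_model: int):
        super().__init__()
        # Same projections an attention layer would use, which is what makes
        # reusing Transformer weights plausible (an assumption, see lead-in).
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        self.gate = nn.Linear(d_model, 1)  # per-token forget/retention gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, d = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        state = x.new_zeros(b, d, d)  # recurrent state, O(d^2) regardless of t
        outputs = []
        for i in range(t):
            g = torch.sigmoid(self.gate(x[:, i])).unsqueeze(-1)  # (b, 1, 1)
            # State update: decay the old state, write a key/value outer product.
            state = g * state + k[:, i].unsqueeze(-1) * v[:, i].unsqueeze(1)
            # Readout: query against the state, like attention without softmax.
            y = torch.einsum("bd,bde->be", q[:, i], state)
            outputs.append(y)
        return self.o_proj(torch.stack(outputs, dim=1))


# Quick smoke test with illustrative sizes.
layer = GatedRetentionSketch(d_model=64)
out = layer(torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```

Because the state is a fixed d×d matrix per layer regardless of sequence length, per-token compute and memory stay constant instead of growing with a KV cache, which is the property behind the fast long-context inference mentioned above.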