Bamba: An open-source LLM that crosses a transformer with an SSM
- #AI
- #LLM
- #Transformer
- IBM Research, in collaboration with CMU, Princeton, and University of Illinois, developed Bamba, an open-source LLM combining transformer expressiveness with SSM speed.
- Transformers face a 'quadratic bottleneck': the cost of self-attention grows quadratically with context length, so long conversations add latency and redundant computation.
- State-space models (SSMs) instead maintain a fixed-size compressed hidden state that summarizes the sequence so far, reducing memory overhead and enabling faster inference than transformers (see the attention-vs-SSM sketch after this list).
- Bamba-9B cuts KV cache memory requirements and runs twice as fast as similar-sized transformers while maintaining accuracy (a rough KV-cache estimate also follows the list).
- SSMs, long used in electrical engineering and time-series analysis, were adapted for deep learning by researchers at CMU and Princeton, IBM's academic collaborators on Bamba.
- Mamba2, a gated SSM variant, inspired hybrid architectures such as Samba and MambaFormer, and later Nvidia's Nemotron-H.
- IBM trained Bamba on 3 trillion tokens, quantized it to 8-bit precision, and achieved performance comparable to Meta's Llama-3.1 8B (a hedged loading sketch appears after this list).
- Bamba currently handles 32,000-token conversations and may scale to 1 million tokens as vLLM support matures, potentially running five times faster than comparable transformers (see the vLLM sketch below).
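
The scaling contrast between attention and an SSM can be made concrete with a toy sketch. Nothing below comes from the Bamba codebase; the matrices, dimensions, and update rule are illustrative stand-ins for a real (selective, discretized) SSM layer.

```python
import numpy as np

d, n = 64, 16           # model width and SSM state size (illustrative values)
A = np.eye(n) * 0.9     # toy state-transition matrix
B = np.random.randn(n, d) * 0.01
C = np.random.randn(d, n) * 0.01

def ssm_step(h, x):
    """One SSM step: constant work and constant memory per token."""
    h = A @ h + B @ x       # fold the new token into a fixed-size state
    return h, C @ h         # read the output off the state

def attention_step(cache_k, cache_v, q, k, v):
    """One causal-attention step: work and cache grow with every token."""
    cache_k.append(k)
    cache_v.append(v)                            # KV cache grows by one entry per token
    K, V = np.stack(cache_k), np.stack(cache_v)  # shape (t, d)
    w = np.exp(q @ K.T / np.sqrt(d))
    w /= w.sum()
    return w @ V                                 # O(t) per token, O(t^2) over the sequence

h = np.zeros(n)
cache_k, cache_v = [], []
for t in range(1_000):
    x = np.random.randn(d)
    h, y_ssm = ssm_step(h, x)                            # state stays size n
    y_attn = attention_step(cache_k, cache_v, x, x, x)   # cache now holds t + 1 vectors
```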
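
To see why the KV cache dominates long-context memory, here is a back-of-the-envelope estimate. The layer counts, head counts, and the hybrid's attention-layer ratio are assumed for illustration, not Bamba-9B's actual configuration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Keys and values cached for every attention layer, KV head, and token (fp16 = 2 bytes)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative numbers for a ~9B dense transformer at a 32k-token context:
full_transformer = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=32_000)

# In a hybrid, only the layers that keep full attention pay this cost (ratio assumed):
hybrid = kv_cache_bytes(layers=4, kv_heads=8, head_dim=128, seq_len=32_000)

print(f"dense transformer KV cache: {full_transformer / 1e9:.1f} GB")   # ~4.2 GB
print(f"hybrid, attention layers only: {hybrid / 1e9:.1f} GB")          # ~0.5 GB
```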
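
The article mentions an 8-bit quantized build; below is a hedged sketch of how such a checkpoint might be loaded with Hugging Face transformers plus bitsandbytes. The model id and the 8-bit loading path are assumptions, so check the actual release for the supported route.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "ibm-ai-platform/Bamba-9B"   # assumed Hugging Face id; verify against the release

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights via bitsandbytes
)

prompt = "Explain why state-space models scale well to long contexts."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```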
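
The 1-million-token claim depends on vLLM support that the article describes as still in progress. Assuming a vLLM build that recognizes the Bamba architecture, serving a 32k-token context might look like this; the model id and max_model_len are assumptions.

```python
from vllm import LLM, SamplingParams

# Assumes a vLLM version with Bamba/hybrid-SSM support and the Hugging Face id below.
llm = LLM(model="ibm-ai-platform/Bamba-9B", max_model_len=32_768)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Summarize the trade-offs between attention layers and SSM layers."],
    params,
)
print(outputs[0].outputs[0].text)
```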