Bamba: An open-source LLM that crosses a transformer with an SSM
- #AI
- #LLM
- #Transformer
- IBM Research, in collaboration with CMU, Princeton, and University of Illinois, developed Bamba, an open-source LLM combining transformer expressiveness with SSM speed.
- Transformers face a 'quadratic bottleneck': the cost of self-attention grows quadratically with context length, so long conversations add latency and redundant computation.
- State-space models (SSMs) instead maintain a fixed-size compressed hidden state that summarizes the sequence so far, reducing memory overhead and enabling faster inference than transformers (see the attention-vs-SSM sketch after this list).
- Bamba-9B cuts KV cache memory requirements and runs twice as fast as similar-sized transformers while maintaining accuracy (a rough KV-cache estimate also follows the list).
- SSMs, long used in electrical engineering and time-series analysis, were adapted for deep learning by researchers at CMU and Princeton, IBM's academic collaborators on Bamba.
- Mamba2, a gated SSM variant, inspired hybrid architectures such as Samba and MambaFormer, and later Nvidia's Nemotron-H.
- IBM trained Bamba on 3 trillion tokens, quantized it to 8-bit precision, and achieved performance comparable to Meta's Llama-3.1 8B (a hedged loading sketch appears after this list).
- Bamba currently handles 32,000-token conversations and may scale to 1 million tokens as vLLM support matures, potentially running five times faster than comparable transformers (see the vLLM sketch below).
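
The scaling contrast between attention and an SSM can be made concrete with a toy sketch. Nothing below comes from the Bamba codebase; the matrices, dimensions, and update rule are illustrative stand-ins for a real (selective, discretized) SSM layer.

```python
import numpy as np

d, n = 64, 16           # model width and SSM state size (illustrative values)
A = np.eye(n) * 0.9     # toy state-transition matrix
B = np.random.randn(n, d) * 0.01
C = np.random.randn(d, n) * 0.01

def ssm_step(h, x):
    """One SSM step: constant work and constant memory per token."""
    h = A @ h + B @ x       # fold the new token into a fixed-size state
    return h, C @ h         # read the output off the state

def attention_step(cache_k, cache_v, q, k, v):
    """One causal-attention step: work and cache grow with every token."""
    cache_k.append(k)
    cache_v.append(v)                            # KV cache grows by one entry per token
    K, V = np.stack(cache_k), np.stack(cache_v)  # shape (t, d)
    w = np.exp(q @ K.T / np.sqrt(d))
    w /= w.sum()
    return w @ V                                 # O(t) per token, O(t^2) over the sequence

h = np.zeros(n)
cache_k, cache_v = [], []
for t in range(1_000):
    x = np.random.randn(d)
    h, y_ssm = ssm_step(h, x)                            # state stays size n
    y_attn = attention_step(cache_k, cache_v, x, x, x)   # cache now holds t + 1 vectors
```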
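
To see why the KV cache dominates long-context memory, here is a back-of-the-envelope estimate. The layer counts, head counts, and the hybrid's attention-layer ratio are assumed for illustration, not Bamba-9B's actual configuration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Keys and values cached for every attention layer, KV head, and token (fp16 = 2 bytes)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative numbers for a ~9B dense transformer at a 32k-token context:
full_transformer = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=32_000)

# In a hybrid, only the layers that keep full attention pay this cost (ratio assumed):
hybrid = kv_cache_bytes(layers=4, kv_heads=8, head_dim=128, seq_len=32_000)

print(f"dense transformer KV cache: {full_transformer / 1e9:.1f} GB")   # ~4.2 GB
print(f"hybrid, attention layers only: {hybrid / 1e9:.1f} GB")          # ~0.5 GB
```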
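
The article mentions an 8-bit quantized build; below is a hedged sketch of how such a checkpoint might be loaded with Hugging Face transformers plus bitsandbytes. The model id and the 8-bit loading path are assumptions, so check the actual release for the supported route.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "ibm-ai-platform/Bamba-9B"   # assumed Hugging Face id; verify against the release

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights via bitsandbytes
)

prompt = "Explain why state-space models scale well to long contexts."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```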
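
The 1-million-token claim depends on vLLM support that the article describes as still in progress. Assuming a vLLM build that recognizes the Bamba architecture, serving a 32k-token context might look like this; the model id and max_model_len are assumptions.

```python
from vllm import LLM, SamplingParams

# Assumes a vLLM version with Bamba/hybrid-SSM support and the Hugging Face id below.
llm = LLM(model="ibm-ai-platform/Bamba-9B", max_model_len=32_768)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Summarize the trade-offs between attention layers and SSM layers."],
    params,
)
print(outputs[0].outputs[0].text)
```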