Zebra-Llama: Towards Efficient Hybrid Models
- #Efficiency Optimization
- #Machine Learning
- #Large Language Models
- Proposes Zebra-Llama, a family of hybrid language models (1B, 3B, and 8B) that combine State Space Model (SSM) layers with Multi-head Latent Attention (MLA) layers; a stripped-down sketch of MLA-style latent KV caching follows this list.
- Achieves Transformer-level accuracy with near-SSM efficiency using only 7-11B training tokens and an 8B teacher.
- Shrinks the KV cache to 3.9%, 2%, and 2.73% of the original size for the 1B, 3B, and 8B variants, respectively, while maintaining high zero-shot performance (see the back-of-the-envelope cache arithmetic after this list).
- Outperforms MambaInLlama, X-EcoMLA, Minitron, and Llamba in accuracy while using fewer training tokens, smaller teacher models, and less KV cache memory.
- Zebra-Llama-8B surpasses Minitron-8B by 7% in few-shot accuracy while using 8x fewer training tokens and an over 12x smaller KV cache.
- Achieves 2.6x-3.8x higher throughput than MambaInLlama at context lengths up to 32k.
- Code and model checkpoints will be released upon acceptance.
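
As a rough illustration of the MLA side of the hybrid, the sketch below implements a single attention layer that down-projects hidden states into a small shared latent, caches only that latent, and re-expands it into per-head keys and values at attention time. This is a minimal sketch, not the authors' implementation: the class name `LatentKVAttention`, the dimensions (`d_model`, `n_heads`, `d_latent`), and the omission of RoPE decoupling and query compression are all simplifying assumptions.

```python
# Minimal sketch of latent-KV (MLA-style) attention with assumed dimensions.
# Only the small latent is cached, instead of full per-head K and V tensors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    def __init__(self, d_model=2048, n_heads=16, d_latent=128):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        # Down-project hidden states to a small shared latent that is cached...
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)
        # ...and up-project the latent back to per-head keys/values at attention time.
        self.k_up = nn.Linear(d_latent, d_model, bias=False)
        self.v_up = nn.Linear(d_latent, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        # x: (batch, new_tokens, d_model); latent_cache: (batch, past_tokens, d_latent).
        # When a cache is passed, the sketch assumes one new token at a time (decode step).
        b, t, _ = x.shape
        latent = self.kv_down(x)                                  # (b, t, d_latent)
        if latent_cache is not None:
            latent = torch.cat([latent_cache, latent], dim=1)
        s = latent.shape[1]
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=(latent_cache is None))
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out), latent                           # cache only the latent

if __name__ == "__main__":
    attn = LatentKVAttention()
    x = torch.randn(1, 5, 2048)
    y, cache = attn(x)                           # prefill: cache is (1, 5, 128)
    x_new = torch.randn(1, 1, 2048)
    y2, cache = attn(x_new, latent_cache=cache)  # decode: 128 cached values per token,
                                                 # vs. 2 * 16 * 128 = 4096 for standard MHA
```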
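
The reported KV-cache reductions come from replacing most attention layers with SSM layers (whose recurrent state does not grow with sequence length) and compressing the KV entries of the remaining attention layers via MLA. The arithmetic below shows the effect with made-up layer counts and latent sizes; none of these constants are taken from the paper, so the resulting percentage only illustrates the mechanism, not the reported 2-4%.

```python
# Back-of-the-envelope KV-cache comparison for a hypothetical 8B-scale config:
# a pure-Transformer stack caches per-head K and V in every layer, while a
# Zebra-Llama-style hybrid keeps only a few MLA layers (the SSM layers carry a
# constant-size recurrent state instead). All numbers below are illustrative
# assumptions, not the paper's actual configuration.

n_layers     = 32      # total decoder layers (assumed)
n_heads      = 32      # attention heads per layer (assumed)
d_head       = 128     # head dimension (assumed)
d_latent     = 256     # MLA compressed-KV dimension per layer (assumed)
n_mla_layers = 4       # attention layers kept in the hybrid (assumed)

# Cached values per token, per layer:
mha_kv_per_layer = 2 * n_heads * d_head   # full K and V for every head
mla_kv_per_layer = d_latent               # only the shared latent

baseline = n_layers * mha_kv_per_layer
hybrid   = n_mla_layers * mla_kv_per_layer

print(f"baseline cache per token: {baseline} values")
print(f"hybrid cache per token  : {hybrid} values")
print(f"hybrid as % of baseline : {100 * hybrid / baseline:.2f}%")
# With these assumed numbers the hybrid cache is ~0.39% of the baseline; the
# paper's reported 2-4% reflects its actual layer mix and latent sizes.
```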