Hasty Briefs

Zebra-Llama: Towards Efficient Hybrid Models

  • #Efficiency Optimization
  • #Machine Learning
  • #Large Language Models
  • Proposes Zebra-Llama, a family of hybrid language models (1B, 3B, and 8B) that combine State Space Model (SSM) layers with Multi-head Latent Attention (MLA) layers (a rough layer-stack sketch follows this list).
  • Achieves Transformer-level accuracy with near-SSM efficiency using only 7-11B training tokens and an 8B teacher.
  • Shrinks the KV cache to 3.9%, 2%, and 2.73% of the original size for the 1B, 3B, and 8B variants, respectively, while maintaining high zero-shot performance (see the cache-size arithmetic after this list).
  • Outperforms models such as MambaInLlama, X-EcoMLA, Minitron, and Llamba in accuracy while using fewer training tokens, smaller teachers, and less KV cache memory.
  • Zebra-Llama-8B surpasses Minitron-8B by 7% in few-shot accuracy while using 8x fewer training tokens and a KV cache over 12x smaller.
  • Achieves 2.6x-3.8x higher throughput than MambaInLlama at context lengths up to 32k.
  • Code and model checkpoints will be released upon acceptance.
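The brief itself contains no code; as a rough illustration of what a hybrid SSM/MLA decoder might look like, the sketch below interleaves a simplified SSM-style block with a simplified MLA block whose keys and values are reconstructed from a small cached latent. All module definitions, dimensions, and the interleaving pattern are assumptions made for illustration, not the released Zebra-Llama architecture.

```python
# Minimal sketch of a hybrid SSM + MLA decoder stack (illustrative only; not the
# released Zebra-Llama code). All sizes and the layer pattern are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleSSMBlock(nn.Module):
    """Stand-in for an SSM/Mamba-style mixer: gated causal depthwise convolution."""

    def __init__(self, d_model: int, d_conv: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.conv = nn.Conv1d(d_model, d_model, d_conv, padding=d_conv - 1, groups=d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        h, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        h = self.conv(h.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)  # causal
        return x + self.out_proj(F.silu(gate) * h)


class SimpleMLABlock(nn.Module):
    """Stand-in for Multi-head Latent Attention: keys/values are reconstructed from
    a small per-token latent, so only that latent would need to be cached."""

    def __init__(self, d_model: int, n_heads: int = 8, d_latent: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)   # output of this is the KV cache
        self.kv_up = nn.Linear(d_latent, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, t, d = x.shape
        h = self.norm(x)
        q = self.q_proj(h).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        latent = self.kv_down(h)                      # (b, t, d_latent)
        k, v = self.kv_up(latent).chunk(2, dim=-1)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return x + self.out_proj(out.transpose(1, 2).reshape(b, t, d))


class HybridDecoder(nn.Module):
    """Mostly SSM blocks, with an MLA block every few layers (hypothetical pattern)."""

    def __init__(self, d_model: int = 512, n_layers: int = 12, mla_every: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            SimpleMLABlock(d_model) if (i + 1) % mla_every == 0 else SimpleSSMBlock(d_model)
            for i in range(n_layers)
        )

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x


if __name__ == "__main__":
    model = HybridDecoder()
    print(model(torch.randn(2, 128, 512)).shape)  # torch.Size([2, 128, 512])
```

In a stack like this, only the MLA layers keep a cache that grows with sequence length (the small per-token latent), while the SSM-style layers carry constant-size state, which is the combination behind the near-SSM efficiency and tiny KV cache the brief describes.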
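To make the cache-size claims concrete, here is back-of-the-envelope arithmetic comparing a standard multi-head attention KV cache with an MLA-style latent cache. The layer counts and dimensions are hypothetical placeholders, not Zebra-Llama's actual configuration; combined with SSM layers that need no KV cache at all, compression of this order is what yields reductions in the reported 2-4% range.

```python
# Hypothetical dimensions, chosen only to illustrate the mechanism.
def mha_cache_bytes(n_layers, n_heads, d_head, seq_len, bytes_per_elem=2):
    # Standard attention caches full K and V per head, per layer, per token.
    return n_layers * seq_len * 2 * n_heads * d_head * bytes_per_elem

def mla_cache_bytes(n_layers, d_latent, seq_len, bytes_per_elem=2):
    # MLA-style attention caches one small latent vector per layer, per token.
    return n_layers * seq_len * d_latent * bytes_per_elem

full = mha_cache_bytes(n_layers=32, n_heads=32, d_head=128, seq_len=32_768)
latent = mla_cache_bytes(n_layers=32, d_latent=256, seq_len=32_768)
print(f"latent cache = {100 * latent / full:.1f}% of the full KV cache")  # ~3.1%
```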