Hasty Briefs


Kimi Linear: An Expressive, Efficient Attention Architecture

5 months ago
  • #linear-attention
  • #machine-learning
  • #efficiency
  • Kimi Linear scores 51.0 on MMLU-Pro at speeds comparable to full attention.
  • On RULER at 128k context length, it is Pareto-optimal (84.3) with a 3.98x speedup.
  • Kimi Linear delivers up to 6.3x faster TPOT (time per output token) than MLA, especially on long sequences (1M tokens).
  • Kimi Delta Attention (KDA) is a refined version of Gated DeltaNet with an efficient gating mechanism.
  • It reduces KV cache usage by up to 75% and substantially boosts decoding throughput.
  • The team open-sourced the KDA kernel in FLA and released two model checkpoints trained on 5.7T tokens.
  • Hybrid architecture with a 3:1 KDA-to-global MLA ratio reduces memory usage while maintaining quality.
  • It shows superior performance on long-context and RL-style benchmarks in 1.4T-token training runs.
  • It achieves up to 6x faster decoding by cutting time per output token (TPOT).
  • Example code is provided for using the Kimi Linear model with Python and for deployment via vLLM.
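The up-to-75% KV-cache reduction follows directly from the 3:1 hybrid ratio: only one layer in every four is global MLA and keeps a KV cache that grows with context, while KDA layers carry fixed-size recurrent state. A minimal back-of-the-envelope sketch (the function name and the assumption that KDA state is negligible at long context are illustrative, not from the release):

```python
def kv_cache_fraction(kda_per_group: int = 3, mla_per_group: int = 1) -> float:
    """Fraction of a full-attention stack's KV cache kept by the hybrid stack.

    Assumes each KDA layer holds O(1) state (negligible at long context)
    and each global MLA layer caches as much as a full-attention layer.
    """
    total = kda_per_group + mla_per_group
    return mla_per_group / total

# With the 3:1 KDA-to-MLA ratio, only 1/4 of layers cache KV:
fraction = kv_cache_fraction()      # 0.25
reduction = 1 - fraction            # 0.75, i.e. up to 75% less KV cache
```

The same arithmetic explains why decoding throughput improves most at long contexts, where KV-cache reads dominate per-token latency.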