Kimi Linear: An Expressive, Efficient Attention Architecture
5 months ago
- #linear-attention
- #machine-learning
- #efficiency
- Kimi Linear scores 51.0 on MMLU-Pro while matching the speed of full attention.
- On RULER at 128k context length, it is Pareto-optimal, scoring 84.3 with a 3.98x speedup.
- Kimi Linear delivers up to 6.3x faster time per output token (TPOT) than MLA, with the gap widest at long sequences (1M tokens).
- Kimi Delta Attention (KDA) refines Gated DeltaNet with a finer-grained, efficient gating mechanism.
- Reduces KV cache size by up to 75% and substantially boosts decoding throughput.
- Open-sources the KDA kernel in FLA and releases two model checkpoints trained on 5.7T tokens.
- Hybrid architecture interleaves KDA and global MLA layers at a 3:1 ratio, reducing memory usage while maintaining quality.
- Shows superior performance on long-context and RL-style benchmarks in 1.4T-token training runs.
- Achieves up to 6x faster decoding by reducing time per output token (TPOT).
- Example code is provided for running the Kimi Linear model in Python and for deployment via vLLM.
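To make the KDA bullet concrete, here is a minimal NumPy sketch of a delta-rule state update with channel-wise decay, which is the general idea behind KDA's fine-grained gating. This is an illustrative toy, not the released kernel: the function name `kda_step`, the tensor shapes, and the exact gating placement are assumptions for exposition.

```python
import numpy as np

def kda_step(S, q, k, v, alpha, beta):
    """One recurrent step of a delta-rule update with channel-wise decay
    (a KDA-style sketch; the real kernel is a chunked, hardware-efficient
    implementation, not this naive loop).

    S:     (d_k, d_v) recurrent state ("fast-weight" matrix)
    q, k:  (d_k,) query and key vectors
    v:     (d_v,) value vector
    alpha: (d_k,) per-channel decay gate in (0, 1) -- fine-grained gating
    beta:  scalar delta-rule step size in (0, 1)
    """
    S = alpha[:, None] * S                  # channel-wise forgetting
    pred = S.T @ k                          # current prediction for key k
    S = S + beta * np.outer(k, v - pred)    # delta-rule correction toward v
    o = S.T @ q                             # output for query q
    return S, o

# Toy run: the state stays a fixed-size matrix no matter how long the
# sequence is, which is why linear-attention layers need no growing KV cache.
rng = np.random.default_rng(0)
d_k, d_v, T = 4, 3, 8
S = np.zeros((d_k, d_v))
for _ in range(T):
    q, k = rng.normal(size=d_k), rng.normal(size=d_k)
    v = rng.normal(size=d_v)
    alpha = rng.uniform(0.9, 1.0, size=d_k)
    S, o = kda_step(S, q, k, v, alpha, beta=0.5)
print(S.shape)  # -> (4, 3): O(1) state per head, independent of T
```

Note that with no decay (`alpha = 1`) and `beta = 1`, a single step stores `v` exactly at key `k`, which is the classical delta rule; the per-channel `alpha` lets the model forget individual state channels at different rates.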
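The 75% KV-cache reduction follows directly from the 3:1 layer ratio, assuming KDA layers keep only a constant-size recurrent state while each MLA layer keeps a growing KV cache. A back-of-envelope check (illustrative arithmetic only, not the paper's exact accounting):

```python
def kv_cache_fraction(kda_per_mla: int) -> float:
    """Fraction of a full-attention model's KV cache that remains when only
    1 of every (kda_per_mla + 1) layers stores a growing KV cache.
    KDA layers are treated as contributing ~0, since their recurrent state
    is constant-size and does not grow with sequence length."""
    return 1 / (kda_per_mla + 1)

frac = kv_cache_fraction(3)  # 3 KDA layers per global MLA layer
print(f"KV cache kept: {frac:.0%}, reduced by {1 - frac:.0%}")
# -> KV cache kept: 25%, reduced by 75%
```

This is why the reduction is "up to" 75%: at short sequence lengths the fixed per-layer recurrent state is not negligible relative to the KV cache, so the realized saving is smaller.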