Kimi Linear: An Expressive, Efficient Attention Architecture
5 months ago
- #linear-attention
- #machine-learning
- #efficiency
- Kimi Linear scores 51.0 on MMLU-Pro while matching the speed of full attention.
- On RULER at 128k context length, it is Pareto-optimal, scoring 84.3 with a 3.98x speedup.
- Kimi Linear delivers up to 6.3x faster time per output token (TPOT) than MLA, with the gap widest at long sequences (1M tokens).
- Kimi Delta Attention (KDA) refines Gated DeltaNet with a finer-grained, efficient gating mechanism.
- Reduces KV cache size by up to 75% and substantially boosts decoding throughput.
- Open-sources the KDA kernel in FLA and releases two model checkpoints trained on 5.7T tokens.
- Hybrid architecture interleaves KDA and global MLA layers at a 3:1 ratio, reducing memory usage while maintaining quality.
- Shows superior performance on long-context and RL-style benchmarks in 1.4T-token training runs.
- Achieves up to 6x faster decoding by reducing time per output token (TPOT).
- Example code is provided for running the Kimi Linear model in Python and for deployment via vLLM.
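To make the KDA bullet concrete, here is a minimal NumPy sketch of a delta-rule state update with channel-wise decay, which is the general idea behind KDA's fine-grained gating. This is an illustrative toy, not the released kernel: the function name `kda_step`, the tensor shapes, and the exact gating placement are assumptions for exposition.

```python
import numpy as np

def kda_step(S, q, k, v, alpha, beta):
    """One recurrent step of a delta-rule update with channel-wise decay
    (a KDA-style sketch; the real kernel is a chunked, hardware-efficient
    implementation, not this naive loop).

    S:     (d_k, d_v) recurrent state ("fast-weight" matrix)
    q, k:  (d_k,) query and key vectors
    v:     (d_v,) value vector
    alpha: (d_k,) per-channel decay gate in (0, 1) -- fine-grained gating
    beta:  scalar delta-rule step size in (0, 1)
    """
    S = alpha[:, None] * S                  # channel-wise forgetting
    pred = S.T @ k                          # current prediction for key k
    S = S + beta * np.outer(k, v - pred)    # delta-rule correction toward v
    o = S.T @ q                             # output for query q
    return S, o

# Toy run: the state stays a fixed-size matrix no matter how long the
# sequence is, which is why linear-attention layers need no growing KV cache.
rng = np.random.default_rng(0)
d_k, d_v, T = 4, 3, 8
S = np.zeros((d_k, d_v))
for _ in range(T):
    q, k = rng.normal(size=d_k), rng.normal(size=d_k)
    v = rng.normal(size=d_v)
    alpha = rng.uniform(0.9, 1.0, size=d_k)
    S, o = kda_step(S, q, k, v, alpha, beta=0.5)
print(S.shape)  # -> (4, 3): O(1) state per head, independent of T
```

Note that with no decay (`alpha = 1`) and `beta = 1`, a single step stores `v` exactly at key `k`, which is the classical delta rule; the per-channel `alpha` lets the model forget individual state channels at different rates.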
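The 75% KV-cache reduction follows directly from the 3:1 layer ratio, assuming KDA layers keep only a constant-size recurrent state while each MLA layer keeps a growing KV cache. A back-of-envelope check (illustrative arithmetic only, not the paper's exact accounting):

```python
def kv_cache_fraction(kda_per_mla: int) -> float:
    """Fraction of a full-attention model's KV cache that remains when only
    1 of every (kda_per_mla + 1) layers stores a growing KV cache.
    KDA layers are treated as contributing ~0, since their recurrent state
    is constant-size and does not grow with sequence length."""
    return 1 / (kda_per_mla + 1)

frac = kv_cache_fraction(3)  # 3 KDA layers per global MLA layer
print(f"KV cache kept: {frac:.0%}, reduced by {1 - frac:.0%}")
# -> KV cache kept: 25%, reduced by 75%
```

This is why the reduction is "up to" 75%: at short sequence lengths the fixed per-layer recurrent state is not negligible relative to the KV cache, so the realized saving is smaller.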