TransMLA: Multi-Head Latent Attention Is All You Need
a year ago
- #Machine Learning
- #Attention Mechanisms
- #Large Language Models
- Modern large language models (LLMs) are bottlenecked less by raw compute than by communication on current hardware, chiefly the memory traffic of moving the key-value (KV) cache during inference.
- Multi-Head Latent Attention (MLA) uses low-rank projections in its KV layers to compress keys and values into a small latent state per token, shrinking the KV cache and speeding up inference (sketched in the first code block below).
- At attention time, MLA applies an up-projection matrix to restore per-head expressiveness, trading extra computation for reduced communication overhead.
- MLA has proven effective in DeepSeek V2, V3, and R1, yet most major model providers still rely on Grouped-Query Attention (GQA).
- Any GQA configuration can be expressed as MLA with the same KV cache overhead, but the reverse does not hold, making MLA strictly more expressive (see the second code block below).
- TransMLA is introduced as a post-training method that converts GQA-based pre-trained models (e.g., LLaMA, Qwen, Mixtral) into MLA-based models (a conversion sketch closes out the code blocks below).
- Converted models can undergo additional training to improve expressiveness without increasing KV cache size.
- Future plans include developing MLA-specific inference acceleration techniques to maintain low latency in transformed models.
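
To make the cache-size claim concrete, here is a minimal sketch of MLA-style KV compression. It ignores RoPE decoupling and other DeepSeek implementation details, and the dimensions and weight names are illustrative assumptions, not taken from any released code.

```python
# Minimal sketch of MLA-style KV compression (illustrative dimensions; ignores
# RoPE decoupling and other DeepSeek implementation details).
import torch

d_model, d_latent, n_heads, d_head = 4096, 512, 32, 128

W_down = torch.randn(d_model, d_latent) / d_model ** 0.5            # compress to latent
W_up_k = torch.randn(d_latent, n_heads * d_head) / d_latent ** 0.5  # expand for keys
W_up_v = torch.randn(d_latent, n_heads * d_head) / d_latent ** 0.5  # expand for values

x = torch.randn(1, 16, d_model)                   # (batch, seq, d_model)
c_kv = x @ W_down                                 # latent KV state: the only thing cached
k = (c_kv @ W_up_k).view(1, 16, n_heads, d_head)  # up-projected at attention time
v = (c_kv @ W_up_v).view(1, 16, n_heads, d_head)

# Per-token cache cost: d_latent floats instead of 2 * n_heads * d_head.
print(d_latent, "vs", 2 * n_heads * d_head)       # 512 vs 8192
```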
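The containment claim can be checked directly: replicating each KV head across its query group is just an up-projection by a fixed block-identity matrix, so every GQA layer is an MLA layer whose up-projection happens to be frozen at that copy matrix. The sketch below verifies this numerically; all names and dimensions are illustrative.

```python
# GQA's head replication rewritten as an MLA up-projection with a fixed
# block-identity ("copy") matrix: same cache, identical output.
import torch

n_heads, n_kv, d_head = 8, 2, 4
group = n_heads // n_kv                      # query heads served by each KV head

# GQA keys for one token: one vector per KV head, repeated across its group.
k_latent = torch.randn(n_kv * d_head)        # what GQA caches (also a valid MLA latent)
k_gqa = k_latent.view(n_kv, d_head).repeat_interleave(group, dim=0)  # (n_heads, d_head)

# The same replication expressed as a single up-projection matrix W_up.
W_up = torch.zeros(n_kv * d_head, n_heads * d_head)
for h in range(n_heads):
    kv = h // group                          # KV head assigned to query head h
    W_up[kv * d_head:(kv + 1) * d_head, h * d_head:(h + 1) * d_head] = torch.eye(d_head)

k_mla = (k_latent @ W_up).view(n_heads, d_head)
assert torch.allclose(k_gqa, k_mla)          # GQA is MLA with a frozen copy matrix
```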
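Finally, a hedged sketch of what a GQA-to-MLA weight conversion in the spirit of TransMLA could look like: absorb the replication matrix into the pretrained key projection, then SVD-factorize the result into a down-projection (which defines the cached latent) and an up-projection that further training can move away from the frozen copy structure. This is an assumption-laden illustration, not the authors' exact procedure; variable names and the truncation rank are my own.

```python
# Hypothetical GQA -> MLA weight conversion: absorb replication into W_k,
# then SVD-factorize into a down/up pair with the original GQA cache width.
import torch

d_model, n_heads, n_kv, d_head = 256, 8, 2, 32
group = n_heads // n_kv
r = n_kv * d_head                                  # latent rank = original GQA cache width

W_k = torch.randn(d_model, n_kv * d_head)          # pretrained GQA key projection
repl = torch.zeros(n_kv * d_head, n_heads * d_head)
for h in range(n_heads):
    kv = h // group
    repl[kv * d_head:(kv + 1) * d_head, h * d_head:(h + 1) * d_head] = torch.eye(d_head)

W_full = W_k @ repl                                # full-width view; rank <= r by construction
U, S, Vh = torch.linalg.svd(W_full, full_matrices=False)
W_down = U[:, :r] * S[:r]                          # (d_model, r): produces the cached latent
W_up = Vh[:r, :]                                   # (r, n_heads * d_head): now freely trainable

x = torch.randn(3, d_model)
assert torch.allclose(x @ W_full, (x @ W_down) @ W_up, atol=1e-4)
# The same treatment would apply to the value projection W_v; post-conversion
# training of W_up is what adds expressiveness without growing the KV cache.
```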