TransMLA: Multi-Head Latent Attention Is All You Need
a year ago
- #Machine Learning
- #Attention Mechanisms
- #Large Language Models
- Modern large language models (LLMs) are bottlenecked less by raw compute than by communication on current hardware, chiefly the memory traffic of moving the key-value (KV) cache during inference.
- Multi-Head Latent Attention (MLA) uses low-rank projections in its KV layers to compress keys and values into a small latent state per token, shrinking the KV cache and speeding up inference (sketched in the first code block below).
- At attention time, MLA applies an up-projection matrix to restore per-head expressiveness, trading extra computation for reduced communication overhead.
- MLA has proven effective in DeepSeek V2, V3, and R1, yet most major model providers still rely on Grouped-Query Attention (GQA).
- Any GQA configuration can be expressed as MLA with the same KV cache overhead, but the reverse does not hold, making MLA strictly more expressive (see the second code block below).
- TransMLA is introduced as a post-training method that converts GQA-based pre-trained models (e.g., LLaMA, Qwen, Mixtral) into MLA-based models (a conversion sketch closes out the code blocks below).
- Converted models can undergo additional training to improve expressiveness without increasing KV cache size.
- Future plans include developing MLA-specific inference acceleration techniques to maintain low latency in transformed models.
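
To make the cache-size claim concrete, here is a minimal sketch of MLA-style KV compression. It ignores RoPE decoupling and other DeepSeek implementation details, and the dimensions and weight names are illustrative assumptions, not taken from any released code.

```python
# Minimal sketch of MLA-style KV compression (illustrative dimensions; ignores
# RoPE decoupling and other DeepSeek implementation details).
import torch

d_model, d_latent, n_heads, d_head = 4096, 512, 32, 128

W_down = torch.randn(d_model, d_latent) / d_model ** 0.5            # compress to latent
W_up_k = torch.randn(d_latent, n_heads * d_head) / d_latent ** 0.5  # expand for keys
W_up_v = torch.randn(d_latent, n_heads * d_head) / d_latent ** 0.5  # expand for values

x = torch.randn(1, 16, d_model)                   # (batch, seq, d_model)
c_kv = x @ W_down                                 # latent KV state: the only thing cached
k = (c_kv @ W_up_k).view(1, 16, n_heads, d_head)  # up-projected at attention time
v = (c_kv @ W_up_v).view(1, 16, n_heads, d_head)

# Per-token cache cost: d_latent floats instead of 2 * n_heads * d_head.
print(d_latent, "vs", 2 * n_heads * d_head)       # 512 vs 8192
```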
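The containment claim can be checked directly: replicating each KV head across its query group is just an up-projection by a fixed block-identity matrix, so every GQA layer is an MLA layer whose up-projection happens to be frozen at that copy matrix. The sketch below verifies this numerically; all names and dimensions are illustrative.

```python
# GQA's head replication rewritten as an MLA up-projection with a fixed
# block-identity ("copy") matrix: same cache, identical output.
import torch

n_heads, n_kv, d_head = 8, 2, 4
group = n_heads // n_kv                      # query heads served by each KV head

# GQA keys for one token: one vector per KV head, repeated across its group.
k_latent = torch.randn(n_kv * d_head)        # what GQA caches (also a valid MLA latent)
k_gqa = k_latent.view(n_kv, d_head).repeat_interleave(group, dim=0)  # (n_heads, d_head)

# The same replication expressed as a single up-projection matrix W_up.
W_up = torch.zeros(n_kv * d_head, n_heads * d_head)
for h in range(n_heads):
    kv = h // group                          # KV head assigned to query head h
    W_up[kv * d_head:(kv + 1) * d_head, h * d_head:(h + 1) * d_head] = torch.eye(d_head)

k_mla = (k_latent @ W_up).view(n_heads, d_head)
assert torch.allclose(k_gqa, k_mla)          # GQA is MLA with a frozen copy matrix
```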
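Finally, a hedged sketch of what a GQA-to-MLA weight conversion in the spirit of TransMLA could look like: absorb the replication matrix into the pretrained key projection, then SVD-factorize the result into a down-projection (which defines the cached latent) and an up-projection that further training can move away from the frozen copy structure. This is an assumption-laden illustration, not the authors' exact procedure; variable names and the truncation rank are my own.

```python
# Hypothetical GQA -> MLA weight conversion: absorb replication into W_k,
# then SVD-factorize into a down/up pair with the original GQA cache width.
import torch

d_model, n_heads, n_kv, d_head = 256, 8, 2, 32
group = n_heads // n_kv
r = n_kv * d_head                                  # latent rank = original GQA cache width

W_k = torch.randn(d_model, n_kv * d_head)          # pretrained GQA key projection
repl = torch.zeros(n_kv * d_head, n_heads * d_head)
for h in range(n_heads):
    kv = h // group
    repl[kv * d_head:(kv + 1) * d_head, h * d_head:(h + 1) * d_head] = torch.eye(d_head)

W_full = W_k @ repl                                # full-width view; rank <= r by construction
U, S, Vh = torch.linalg.svd(W_full, full_matrices=False)
W_down = U[:, :r] * S[:r]                          # (d_model, r): produces the cached latent
W_up = Vh[:r, :]                                   # (r, n_heads * d_head): now freely trainable

x = torch.randn(3, d_model)
assert torch.allclose(x @ W_full, (x @ W_down) @ W_up, atol=1e-4)
# The same treatment would apply to the value projection W_v; post-conversion
# training of W_up is what adds expressiveness without growing the KV cache.
```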