Hasty Briefs

TransMLA: Multi-head latent attention is all you need

a year ago
  • #Machine Learning
  • #Attention Mechanisms
  • #Large Language Models
  • Modern large language models (LLMs) face communication bottlenecks on current hardware.
  • Multi-Head Latent Attention (MLA) uses low-rank matrices in key-value (KV) layers to compress latent KV states, reducing cache size and speeding up inference.
  • MLA employs an up-projection matrix to enhance expressiveness, trading computation for reduced communication overhead.
  • MLA has proven effective in DeepSeek V2/V3/R1, yet most major model providers still rely on Grouped-Query Attention (GQA).
  • Any GQA configuration can be represented exactly by an MLA model with the same KV cache overhead, but the reverse does not hold, making MLA strictly more expressive.
  • TransMLA is introduced as a post-training method to convert GQA-based pre-trained models (e.g., LLaMA, Qwen, Mixtral) into MLA-based models.
  • Converted models can undergo additional training to improve expressiveness without increasing KV cache size.
  • Future plans include developing MLA-specific inference acceleration techniques to maintain low latency in transformed models.
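The GQA-to-MLA equivalence above can be sketched numerically: a GQA key projection is an MHA projection whose per-head weights repeat each KV group's slice, so its rank is bounded by the GQA KV width, and an SVD factors it exactly into MLA's down-projection (the cached latent) and up-projection. The sizes and variable names below are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen small for illustration.
d_model, d_head = 64, 8
n_q_heads, n_kv_heads = 8, 2           # GQA: 2 KV groups shared by 8 query heads
group = n_q_heads // n_kv_heads

# A GQA key projection: one d_head-wide slice per KV group.
W_k = rng.standard_normal((d_model, n_kv_heads * d_head))

# GQA is equivalent to MHA whose per-head key weights repeat each group's slice.
W_k_mha = np.concatenate(
    [np.tile(W_k[:, g * d_head:(g + 1) * d_head], (1, group))
     for g in range(n_kv_heads)],
    axis=1,
)  # shape (d_model, n_q_heads * d_head), rank <= n_kv_heads * d_head

# Factor into MLA form: down-projection (cached latent) x up-projection.
U, s, Vt = np.linalg.svd(W_k_mha, full_matrices=False)
r = n_kv_heads * d_head                # latent dim = GQA KV width -> same cache size
W_down = U[:, :r] * s[:r]              # d_model -> r; this output is what gets cached
W_up = Vt[:r]                          # r -> n_q_heads * d_head; applied at read time

# The factorization is exact: the MLA form reproduces the GQA keys.
x = rng.standard_normal((5, d_model))  # a batch of hidden states
assert np.allclose(x @ W_k_mha, (x @ W_down) @ W_up, atol=1e-8)
```

Because the up-projection can then be trained further (as TransMLA proposes), the latent dimension stays fixed while the effective key/value maps gain expressiveness beyond what the replicated GQA weights allow.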