From Multi-Head to Latent Attention: The Evolution of Attention Mechanisms
- #Transformer Models
- #Natural Language Processing
- #Attention Mechanisms
- Attention mechanisms allow models to selectively focus on the most relevant parts of the input context.
- The key components of attention are the Query (Q), Key (K), and Value (V) vectors, together with the attention scores computed between queries and keys (a minimal sketch follows this list).
- Multi-Head Attention (MHA) runs many attention heads in parallel, but keeping separate Key and Value vectors for every head makes it expensive in compute and memory.
- Multi-Query Attention (MQA) reduces this overhead by sharing a single set of Key and Value vectors across all query heads.
- Grouped Query Attention (GQA) strikes a balance between MHA and MQA by splitting the query heads into groups, with each group sharing one Key-Value head (see the grouped-query sketch below, which covers all three variants).
- Multi-Head Latent Attention (MHLA) compresses the Key and Value vectors into a smaller latent space, shrinking the per-token state that must be kept around (see the latent-attention sketch below).
- KV caching stores the Key and Value vectors of already-processed tokens so they are not recomputed at every decoding step, speeding up inference (see the caching sketch below).
- Attention mechanisms are evolving to improve scalability, speed, and memory efficiency.
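To make the Q/K/V and attention-score terms concrete, here is a minimal sketch of scaled dot-product attention in PyTorch. The function name, tensor shapes, and optional mask argument are illustrative, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Minimal sketch: q, k, v are (batch, seq_len, d_k) tensors (shapes illustrative)."""
    d_k = q.size(-1)
    # Attention scores: similarity between each query and every key, scaled by sqrt(d_k).
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5      # (batch, seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                 # normalize over the keys
    return weights @ v                                  # weighted sum of values

# Toy usage with random tensors.
q = torch.randn(1, 4, 64)
k = torch.randn(1, 4, 64)
v = torch.randn(1, 4, 64)
out = scaled_dot_product_attention(q, k, v)             # (1, 4, 64)
```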
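The MHA/MQA/GQA trade-off can be seen in a single module where the number of Key/Value heads is the only knob: `num_kv_heads == num_heads` recovers MHA, `num_kv_heads == 1` recovers MQA, and anything in between is GQA. This is a simplified sketch (no masking, dropout, or caching); all names and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GroupedQueryAttention(nn.Module):
    """Sketch: num_kv_heads == num_heads -> MHA, num_kv_heads == 1 -> MQA, otherwise GQA."""
    def __init__(self, d_model, num_heads, num_kv_heads):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.num_heads, self.num_kv_heads = num_heads, num_kv_heads
        self.head_dim = d_model // num_heads
        self.q_proj = nn.Linear(d_model, num_heads * self.head_dim)
        self.k_proj = nn.Linear(d_model, num_kv_heads * self.head_dim)  # fewer K heads
        self.v_proj = nn.Linear(d_model, num_kv_heads * self.head_dim)  # fewer V heads
        self.o_proj = nn.Linear(num_heads * self.head_dim, d_model)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of query heads shares one Key/Value head.
        group = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(group, dim=1)            # (b, num_heads, t, head_dim)
        v = v.repeat_interleave(group, dim=1)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = torch.softmax(scores, dim=-1) @ v
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```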
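A simplified sketch of the latent-compression idea behind MHLA: each token's hidden state is down-projected to a small latent vector, which is the only thing that would need to be cached; Keys and Values are re-expanded from it on the fly. This omits details of real implementations (e.g. how positional information is handled), and the projection names and latent dimension are assumptions.

```python
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    """Sketch: K and V are reconstructed from a small shared latent per token,
    so only the latent needs to be cached (names/dims are illustrative)."""
    def __init__(self, d_model, num_heads, latent_dim):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, latent_dim)    # compress hidden state to latent
        self.k_up = nn.Linear(latent_dim, d_model)       # expand latent -> Keys
        self.v_up = nn.Linear(latent_dim, d_model)       # expand latent -> Values
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        latent = self.kv_down(x)                                   # (b, t, latent_dim)
        if latent_cache is not None:
            latent = torch.cat([latent_cache, latent], dim=1)      # reuse cached latents
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5  # no causal mask in this sketch
        out = (torch.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out), latent                            # return latent for caching
```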
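Finally, a minimal sketch of KV caching at decode time: each step's new Key/Value tensors are appended to a growing cache so earlier tokens are never recomputed. The cache layout and helper name are illustrative.

```python
import torch

def decode_step(k_new, v_new, cache):
    """Append this step's K/V to the cache; tensors are (batch, heads, seq, head_dim)."""
    if cache["k"] is None:
        cache["k"], cache["v"] = k_new, v_new
    else:
        cache["k"] = torch.cat([cache["k"], k_new], dim=2)
        cache["v"] = torch.cat([cache["v"], v_new], dim=2)
    return cache["k"], cache["v"]

cache = {"k": None, "v": None}
for _ in range(3):                        # three decode steps of one new token each
    k_new = torch.randn(1, 8, 1, 64)
    v_new = torch.randn(1, 8, 1, 64)
    k, v = decode_step(k_new, v_new, cache)
print(k.shape)                            # torch.Size([1, 8, 3, 64])
```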