Do transformers need three projections? Systematic study of QKV variants
6 hours ago
- #Attention Mechanisms
- #Model Efficiency
- #Transformers
- Transformers rely on query, key, and value (QKV) projections, but their individual contributions and the effects of sharing them are not well understood.
- The study systematically explores three projection-sharing variants: shared key-value (Q-K=V), shared query-key (Q=K-V), and a single shared projection (Q=K=V). Where sharing would otherwise make attention symmetric, 2D positional encodings are used to obtain asymmetric attention.
- Experiments across synthetic tasks, vision datasets (MNIST, CIFAR, TinyImageNet, anomaly detection), and language modeling show that these variants perform comparably to standard QKV transformers, and occasionally better.
- In language modeling, Q-K=V sharing reduces the KV cache by 50% at the cost of a 3.1% perplexity increase. Combining it with head sharing (GQA/MQA) yields cache reductions of up to 96.9%, enabling on-device inference.
- Q-K=V works effectively because keys and values occupy similar representational spaces and attention operates in a low-rank regime, whereas Q=K-V disrupts attention directionality.
- The study presents projection sharing as an underexplored form of weight tying in attention that offers quantifiable memory savings for edge deployment. The code is publicly available.
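
The four projection layouts can be illustrated with a minimal single-head NumPy sketch. This is not the study's code: the function names (`attention`, `make_weights`), the random initialization, and the omission of masking, multiple heads, and positional encodings are all assumptions made for illustration. It shows one property that follows directly from the definitions: when queries and keys share a projection, the pre-softmax score matrix is symmetric, which is the loss of directionality the summary attributes to Q=K-V.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, w_q, w_k, w_v):
    # Plain scaled dot-product attention for one head, no mask.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v, scores

def make_weights(variant, d_model, d_head, rng):
    # Hypothetical helper: returns (w_q, w_k, w_v), reusing the same
    # matrix wherever the variant ties two projections together.
    def new():
        return rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
    if variant == "QKV":      # standard: three independent projections
        return new(), new(), new()
    if variant == "Q-K=V":    # keys and values share one projection
        w_q, w_kv = new(), new()
        return w_q, w_kv, w_kv
    if variant == "Q=K-V":    # queries and keys share one projection
        w_qk, w_v = new(), new()
        return w_qk, w_qk, w_v
    if variant == "Q=K=V":    # a single projection for all three roles
        w = new()
        return w, w, w
    raise ValueError(f"unknown variant: {variant}")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((5, 8))  # 5 tokens, d_model = 8 (arbitrary sizes)
    for variant in ("QKV", "Q-K=V", "Q=K-V", "Q=K=V"):
        _, scores = attention(x, *make_weights(variant, 8, 4, rng))
        print(variant, "symmetric scores:", np.allclose(scores, scores.T))
```

With Q-K=V the score matrix stays asymmetric because queries keep their own projection, and only one tensor per token needs to be cached since keys and values are identical.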
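
The cache figures above compose multiplicatively, which a short calculation makes concrete. The sketch below is an assumption-laden illustration, not the study's accounting: `kv_cache_fraction` is a hypothetical helper, and the 16-head configuration is chosen only because it is one setup consistent with the quoted 96.9% figure (the summary does not state the actual model configuration). It assumes the cache scales with the number of KV heads and that tying K and V halves the tensors stored per head.

```python
def kv_cache_fraction(num_heads, num_kv_heads, shared_kv):
    """Fraction of the baseline multi-head KV cache that is retained.

    num_heads:    query heads in the baseline model
    num_kv_heads: distinct key/value heads actually cached (GQA/MQA)
    shared_kv:    True if K and V are one tensor (Q-K=V sharing)
    """
    head_factor = num_kv_heads / num_heads    # saving from head sharing
    tensor_factor = 0.5 if shared_kv else 1.0  # one cached tensor instead of two
    return head_factor * tensor_factor

if __name__ == "__main__":
    NUM_HEADS = 16  # assumed for illustration; not stated in the summary
    configs = [("MHA", 16), ("GQA, 4 KV heads", 4), ("MQA", 1)]
    for label, kv_heads in configs:
        for shared in (False, True):
            kept = kv_cache_fraction(NUM_HEADS, kv_heads, shared)
            suffix = " + Q-K=V" if shared else ""
            print(f"{label}{suffix}: cache reduced by {(1 - kept) * 100:.3f}%")
```

Under these assumptions, Q-K=V alone keeps half the cache, and MQA with 16 query heads plus Q-K=V keeps 1/32 of it, a reduction of 96.875%.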