Do transformers need three projections? Systematic study of QKV variants

6 hours ago
  • #Attention Mechanisms
  • #Model Efficiency
  • #Transformers
  • Transformers rely on query, key, and value (QKV) projections, but their individual contributions and the effects of sharing them are not well understood.
  • The study systematically explores three projection-sharing variants: shared key-value (Q-K=V), shared query-key (Q=K-V), and single projection (Q=K=V), with asymmetric attention addressed via 2D positional encodings.
  • Experiments on synthetic tasks, vision benchmarks (MNIST, CIFAR, TinyImageNet, anomaly detection), and language modeling show that these variants perform comparably to, and occasionally better than, standard QKV transformers.
  • In language modeling, Q-K=V sharing reduces the KV cache by 50% with only a 3.1% perplexity increase, and combining it with head sharing (GQA/MQA) achieves cache reductions up to 96.9%, enabling on-device inference.
  • Q-K=V works effectively because keys and values occupy similar representational spaces and attention operates in a low-rank regime, whereas Q=K-V disrupts attention directionality.
  • Projection sharing is highlighted as an underexplored form of weight tying in attention, offering quantifiable memory benefits for edge deployment, with code publicly available.
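
The projection-sharing variants listed above can be illustrated with a minimal single-head sketch in NumPy. This is not the authors' code: the function names (`make_projections`, `attention`), the mode strings, and the random initialization are assumptions made for illustration, and the 2D positional encodings mentioned in the summary are left out. The sketch also shows one reason Q=K sharing affects directionality: when queries and keys come from the same projection, the pre-softmax score matrix is symmetric.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def make_projections(d_model, d_head, mode, rng):
    """Return (wq, wk, wv) for one head; shared matrices are the same object.

    mode is a hypothetical label for the variants in the summary:
    "qkv" (standard), "q-k=v", "q=k-v", or "q=k=v".
    """
    def new_weight():
        return rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)

    if mode == "qkv":
        wq, wk, wv = new_weight(), new_weight(), new_weight()
    elif mode == "q-k=v":
        wq = new_weight()
        wk = wv = new_weight()       # keys and values share one projection
    elif mode == "q=k-v":
        wq = wk = new_weight()       # queries and keys share one projection
        wv = new_weight()
    elif mode == "q=k=v":
        wq = wk = wv = new_weight()  # a single projection for all three roles
    else:
        raise ValueError(f"unknown mode: {mode}")
    return wq, wk, wv


def attention(x, wq, wk, wv):
    """Scaled dot-product attention; returns (output, pre-softmax scores)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v, scores


rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))  # 5 tokens, model width 8

for mode in ("qkv", "q-k=v", "q=k-v", "q=k=v"):
    wq, wk, wv = make_projections(8, 4, mode, rng)
    _, scores = attention(x, wq, wk, wv)
    n_matrices = len({id(w) for w in (wq, wk, wv)})
    print(mode, "distinct matrices:", n_matrices,
          "symmetric scores:", np.allclose(scores, scores.T))
```

In the Q-K=V case the key and value tensors are identical, so an inference cache needs to keep only one of them per token.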
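
The cache figures quoted above follow from simple counting, sketched below. The helper names and the head counts are assumptions; the summary does not state the model configuration. One configuration consistent with the 96.9% figure is 16 query heads reduced to a single shared key-value head (MQA) on top of Q-K=V sharing, which leaves 1/32 of the baseline cache.

```python
def kv_cache_fraction(n_heads, n_kv_heads, shared_kv):
    """Fraction of the baseline cache (full multi-head, separate K and V) that remains.

    n_heads:    number of query heads in the baseline model
    n_kv_heads: number of cached key/value heads (n_heads for MHA, 1 for MQA)
    shared_kv:  True when K and V are the same tensor (Q-K=V sharing)
    """
    if n_kv_heads <= 0 or n_heads % n_kv_heads != 0:
        raise ValueError("n_kv_heads must be positive and divide n_heads")
    tensors_per_head = 1 if shared_kv else 2   # cache K only, or both K and V
    return (n_kv_heads * tensors_per_head) / (n_heads * 2)


def reduction_percent(n_heads, n_kv_heads, shared_kv):
    """Cache reduction relative to the baseline, in percent."""
    return 100.0 * (1.0 - kv_cache_fraction(n_heads, n_kv_heads, shared_kv))


# Q-K=V alone, all 16 heads kept: half the cache is removed.
print(reduction_percent(16, 16, True))   # 50.0
# Q-K=V combined with MQA (assumed 16 query heads, 1 KV head).
print(reduction_percent(16, 1, True))    # 96.875
```

Per-token cache size also scales with layer count, head dimension, and numeric precision, but those factors cancel in the ratio, so they are omitted here.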