Do transformers need three projections? Systematic study of QKV variants

6 hours ago

Transformers rely on query, key, and value (QKV) projections, but their individual contributions and the effects of sharing them are not well understood.
The study systematically explores three projection-sharing variants: shared key-value (Q-K=V), shared query-key (Q=K-V), and a single shared projection (Q=K=V). Because tying queries and keys makes the pre-softmax attention scores symmetric, the study restores asymmetric attention with 2D positional encodings.
Experiments across synthetic tasks, vision datasets (MNIST, CIFAR, TinyImageNet, anomaly detection), and language modeling show that these variants perform similarly to, and occasionally better than, standard QKV transformers.
In language modeling, Q-K=V sharing reduces the KV cache by 50% with only a 3.1% perplexity increase, and combining it with head sharing (GQA/MQA) yields cache reductions of up to 96.9%, enabling on-device inference.
Q-K=V works effectively because keys and values occupy similar representational spaces and attention operates in a low-rank regime, whereas Q=K-V disrupts attention directionality.
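The directionality point can be illustrated with a minimal single-head sketch in plain Python. It is not the study's implementation: the dimensions, the random weights, and the helper names (`attention`, `is_symmetric`, `random_matrix`) are all assumptions made for illustration, and no masking or positional encoding is applied. It shows that sharing one matrix between queries and keys forces the pre-softmax scores to be symmetric (token i scores token j exactly as j scores i), while sharing it between keys and values leaves the scores asymmetric.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(x, w_q, w_k, w_v):
    """Scaled dot-product attention; returns (output, pre-softmax scores)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v), scores

def is_symmetric(m, tol=1e-9):
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))

def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes, chosen only to keep the example small.
rng = random.Random(0)
n_tokens, d_model, d_head = 5, 16, 8
x = random_matrix(rng, n_tokens, d_model)
w_a = random_matrix(rng, d_model, d_head)
w_b = random_matrix(rng, d_model, d_head)

# Q-K=V: separate query projection, one matrix shared by keys and values.
out_kv, scores_kv = attention(x, w_a, w_b, w_b)

# Q=K-V: one matrix shared by queries and keys, separate value projection.
out_qk, scores_qk = attention(x, w_a, w_a, w_b)

print(is_symmetric(scores_qk))  # True: tied queries and keys give symmetric scores
print(is_symmetric(scores_kv))  # False for generic random weights
```

In the Q-K=V case only the value path changes, so the score matrix keeps the same form as in standard attention; in the Q=K-V case some extra signal, such as the positional encodings the study mentions, is needed to tell the two directions apart.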
The study highlights projection sharing as an underexplored form of weight tying in attention that offers quantifiable memory benefits for edge deployment. The code is publicly available.
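The memory figures can be checked with back-of-the-envelope arithmetic. The sketch below assumes a hypothetical model (12 layers, 16 attention heads, head dimension 64, 2048 cached tokens, 2 bytes per element); the brief does not state the study's actual configuration, and `kv_cache_bytes` is an illustrative helper, not a library function. Storing one tensor per token instead of two halves the cache, and with 16 heads collapsed to a single KV head (MQA) the combined reduction is 1 - 1/32, or about 96.9%.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, shared_kv=False):
    """Approximate KV cache size for one sequence.

    With shared_kv=True, keys and values come from one projection,
    so only one tensor per token needs to be cached instead of two.
    """
    tensors_per_token = 1 if shared_kv else 2
    return (n_layers * n_kv_heads * head_dim * seq_len
            * bytes_per_elem * tensors_per_token)

# Hypothetical configuration (an assumption, not taken from the study).
config = dict(n_layers=12, head_dim=64, seq_len=2048)
n_heads = 16

baseline = kv_cache_bytes(n_kv_heads=n_heads, **config)

variants = {
    "MHA + K=V": (n_heads, True),
    "GQA (4 KV heads) + K=V": (4, True),
    "MQA (1 KV head) + K=V": (1, True),
}

reductions = {}
for label, (kv_heads, shared) in variants.items():
    size = kv_cache_bytes(n_kv_heads=kv_heads, shared_kv=shared, **config)
    reductions[label] = 100 * (1 - size / baseline)
    print(f"{label}: {reductions[label]:.1f}% smaller than the baseline cache")
```

The percentages depend only on the ratio of KV heads to attention heads and on whether keys and values are shared, so other layer counts, head dimensions, or sequence lengths give the same reductions.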
