HySparse: A Hybrid Sparse Attention Architecture
- #attention mechanisms
- #efficiency
- #machine learning
- HySparse is a hybrid sparse attention architecture combining full and sparse attention layers.
- It uses full attention layers as oracles for token selection, eliminating the need for additional proxies.
- HySparse lets sparse layers reuse the KV caches produced by the full attention layers, reducing both computation and memory (a minimal sketch of the idea follows this list).
- Evaluated on a 7B dense model and an 80B MoE model, HySparse outperforms full-attention and hybrid sliding-window attention (SWA) baselines.
- In an 80B MoE model, HySparse reduces KV cache storage by nearly 10x while maintaining performance.
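Below is a minimal, hypothetical sketch of the mechanism described above, not the paper's implementation: a full attention layer produces an importance score per token (here, simply how much attention each key position receives), and a subsequent sparse layer reuses the same KV cache while attending only to the top-k tokens under that score. Single-head, non-causal attention and the specific selection rule are assumptions made for brevity.

```python
# Hypothetical sketch of the hybrid full/sparse attention idea (not HySparse's actual code).
# Assumptions: single-head attention, no causal mask, and "attention received per key
# position" as the oracle signal that drives top-k token selection in sparse layers.
import torch


def full_attention_oracle(q, k, v):
    """Standard full attention; also returns a per-token importance score
    (attention mass each key position receives, summed over queries)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5          # (B, T, T)
    attn = scores.softmax(dim=-1)
    out = attn @ v                                      # (B, T, D)
    importance = attn.sum(dim=-2)                       # (B, T): oracle signal
    return out, importance


def sparse_attention_reusing_kv(q, k, v, importance, top_k):
    """Sparse attention that reuses the full layer's KV cache and restricts
    queries to the top_k tokens ranked by the oracle's importance scores."""
    idx = importance.topk(top_k, dim=-1).indices        # (B, top_k)
    k_sel = torch.gather(k, 1, idx.unsqueeze(-1).expand(-1, -1, k.size(-1)))
    v_sel = torch.gather(v, 1, idx.unsqueeze(-1).expand(-1, -1, v.size(-1)))
    d = q.size(-1)
    scores = q @ k_sel.transpose(-2, -1) / d**0.5       # (B, T, top_k)
    out = scores.softmax(dim=-1) @ v_sel                # (B, T, D)
    return out


if __name__ == "__main__":
    B, T, D, TOP_K = 2, 128, 64, 16
    q = k = v = torch.randn(B, T, D)
    full_out, importance = full_attention_oracle(q, k, v)
    # A following sparse layer reuses (k, v) and the oracle's token ranking,
    # so it stores no KV cache of its own and attends to only TOP_K tokens.
    sparse_out = sparse_attention_reusing_kv(q, k, v, importance, TOP_K)
    print(full_out.shape, sparse_out.shape)
```

Under this reading, the full layers play two roles at once: they provide exact attention where it is kept, and their attention patterns double as the token-selection oracle, which is why no separate proxy or predictor network is needed and why the sparse layers add essentially no KV-cache storage of their own.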