DashAttention: Differentiable and Adaptable Sparse Hierarchical Attention

a day ago

DashAttention is a new hierarchical attention method that is fully differentiable and adaptive, using α-entmax to select variable numbers of relevant KV blocks per query.
It addresses limitations of existing methods like NSA and InfLLMv2, which rely on top-k selection with a fixed token budget and which block gradient flow between the sparse and dense stages.
DashAttention is non-dispersive, enhancing long-context modeling ability compared to other hierarchical attention methods.
Experiments show it achieves accuracy comparable to full attention at 75% sparsity and outperforms NSA and InfLLMv2, especially in high-sparsity regimes.
An efficient GPU-aware implementation in Triton provides a speedup over FlashAttention-3 at inference time, offering a cost-effective strategy for long contexts.
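
To make the contrast between adaptive and fixed-budget block selection concrete, here is a minimal NumPy sketch. It is not the DashAttention implementation: it uses sparsemax, the α = 2 special case of α-entmax, and the per-block scores and the `topk_mask` helper are hypothetical values and names chosen only for illustration. It shows that a sparse normalizer can keep a different number of KV blocks for each query, while top-k always keeps exactly k.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax (alpha-entmax with alpha = 2): Euclidean projection of the
    scores onto the probability simplex. Low scores get exactly zero weight."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]            # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum    # true for a prefix of the sorted scores
    k_z = k[support][-1]                   # size of the support
    tau = (cumsum[k_z - 1] - 1) / k_z      # threshold subtracted from every score
    return np.maximum(z - tau, 0.0)

def topk_mask(z, k):
    """Fixed-budget baseline (hypothetical helper): keep the k highest scores."""
    z = np.asarray(z, dtype=float)
    mask = np.zeros(z.size, dtype=bool)
    mask[np.argsort(z)[::-1][:k]] = True
    return mask

# Hypothetical query-to-block relevance scores for two queries over 4 KV blocks.
peaked = np.array([4.0, 0.1, 0.0, -0.2])   # one block clearly dominates
flat = np.array([1.0, 0.9, 0.8, 0.7])      # all blocks similarly relevant

p_peaked = sparsemax(peaked)
p_flat = sparsemax(flat)

# Adaptive selection: the number of nonzero weights depends on the scores.
n_peaked = int(np.count_nonzero(p_peaked))  # 1 block selected
n_flat = int(np.count_nonzero(p_flat))      # 4 blocks selected

# Top-k selection: always exactly k blocks, whatever the scores look like.
n_topk_peaked = int(topk_mask(peaked, 2).sum())
n_topk_flat = int(topk_mask(flat, 2).sum())

print(n_peaked, n_flat, n_topk_peaked, n_topk_flat)
```

Because sparsemax is piecewise linear in the scores, gradients pass through the selected blocks' weights, whereas a hard top-k mask provides no gradient to the scores that produced it. Other values of α interpolate between softmax (α = 1) and sparsemax (α = 2).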
