DashAttention: Differentiable and Adaptable Sparse Hierarchical Attention
a day ago
- #attention mechanism
- #long context modeling
- #sparse attention
- DashAttention is a new hierarchical attention method that is fully differentiable and adaptive, using α-entmax to select variable numbers of relevant KV blocks per query.
- It addresses two limitations of existing methods such as NSA and InfLLMv2: they rely on top-k selection, which keeps a fixed number of tokens per query, and they block gradient flow between the sparse and dense stages.
- DashAttention is non-dispersive, enhancing long-context modeling ability compared to other hierarchical attention methods.
- Experiments show that at 75% sparsity it achieves accuracy comparable to full attention, and that it outperforms NSA and InfLLMv2, especially in high-sparsity regimes.
- An efficient, GPU-aware Triton implementation delivers a speedup over FlashAttention-3 at inference time, making it a cost-effective strategy for long contexts.
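
To make the contrast with top-k selection concrete, here is a minimal NumPy sketch of adaptive block selection with α-entmax. It is an illustration under stated assumptions, not DashAttention's actual kernel: the mean-pooled block representatives, the bisection solver, and the names `entmax_bisect` and `select_kv_blocks` are all hypothetical choices made for this example. The only ingredient taken from the summary is that α-entmax assigns exactly zero weight to irrelevant blocks, so the number of selected KV blocks can differ per query.

```python
import numpy as np


def entmax_bisect(scores, alpha=1.5, n_iter=50):
    """alpha-entmax over the last axis (alpha > 1), solved by bisection.

    Finds tau such that sum_i max((alpha-1)*s_i - tau, 0)^(1/(alpha-1)) = 1.
    Generic solver used for illustration; not the paper's implementation.
    """
    z = (alpha - 1.0) * np.asarray(scores, dtype=np.float64)
    n = z.shape[-1]
    z_max = z.max(axis=-1, keepdims=True)
    lo = z_max - 1.0                          # here the total mass is >= 1
    hi = z_max - (1.0 / n) ** (alpha - 1.0)   # here the total mass is <= 1
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        p = np.clip(z - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
        too_big = p.sum(axis=-1, keepdims=True) >= 1.0
        lo = np.where(too_big, tau, lo)
        hi = np.where(too_big, hi, tau)
    p = np.clip(z - lo, 0.0, None) ** (1.0 / (alpha - 1.0))
    return p / p.sum(axis=-1, keepdims=True)  # remove residual bisection error


def select_kv_blocks(q, k, block_size, alpha=1.5):
    """Score KV blocks per query and keep those with nonzero entmax weight.

    Assumption: each block is summarized by the mean of its keys, and
    len(k) is divisible by block_size.
    """
    n, d = k.shape
    block_repr = k.reshape(n // block_size, block_size, d).mean(axis=1)
    block_scores = q @ block_repr.T / np.sqrt(d)
    weights = entmax_bisect(block_scores, alpha=alpha)
    return weights, weights > 0.0             # mask marks the selected blocks


rng = np.random.default_rng(0)
q = rng.normal(size=(4, 16)) * 3.0            # 4 queries, head dim 16
k = rng.normal(size=(64, 16))                 # 64 keys -> 8 blocks of 8
weights, mask = select_kv_blocks(q, k, block_size=8)
print("blocks selected per query:", mask.sum(axis=-1))
```

Unlike top-k, nothing in this sketch fixes the number of selected blocks in advance: a query with one dominant block score keeps only that block, while a query with nearly flat scores keeps many. Because the weights are a continuous function of the block scores wherever the support is unchanged, a gradient can in principle flow through the selection step, which is the property the summary describes as fully differentiable.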