DeepSeek's NSA paper won the Best Paper Award at ACL 2025
- #Efficient Computing
- #Machine Learning
- #Natural Language Processing
- NSA (Native Sparse Attention) is introduced for efficient long-context modeling in language models.
- It combines a dynamic hierarchical sparse strategy (coarse-grained token compression, fine-grained blockwise token selection, and a sliding window) with hardware-aligned optimizations for speed and efficiency; see the sketch after this list.
- Key innovations include arithmetic-intensity-balanced algorithm design and end-to-end trainability, so the sparsity pattern is learned during pretraining rather than bolted on at inference time.
- NSA matches or exceeds Full Attention model performance across general benchmarks, long-context tasks, and reasoning evaluations.
- Substantial speedups over Full Attention are achieved on 64k-length sequences across decoding, forward propagation, and backward propagation.
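
For intuition, the hierarchical strategy above can be viewed as three attention branches over the same KV cache, mixed by learned gates: compressed coarse tokens, a few selected fine-grained blocks, and a local sliding window. Below is a minimal single-head PyTorch sketch of one decoding step under that reading; it is not DeepSeek's implementation (the paper uses a learnable compression function, selection scores reused from the compression branch's attention weights, grouped-query-aware selection, and custom kernels), and the block size, top-n count, and window length here are arbitrary placeholders.

```python
# Minimal sketch of NSA-style hierarchical sparse attention for one decoding
# step. Simplifications vs. the paper: single head, mean-pooled compression
# (the paper learns a compression function), block selection scored directly
# by q . K_cmp, and no custom kernels. block/top_n/window are illustrative.
import torch
import torch.nn.functional as F

def attend(q, k, v):
    """Scaled dot-product attention for a single query vector.
    q: (d,)  k, v: (n, d)  ->  (d,)"""
    scores = k @ q / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def nsa_decode_step(q, K, V, gates, block=64, top_n=4, window=512):
    """q: (d,) current query; K, V: (t, d) KV cache with t >= block;
    gates: (3,) per-branch gate values in [0, 1]."""
    t, d = K.shape
    n_blocks = t // block  # a trailing partial block is covered by the window branch

    # Branch 1 -- compression: pool each KV block into one coarse token.
    K_cmp = K[: n_blocks * block].reshape(n_blocks, block, d).mean(dim=1)
    V_cmp = V[: n_blocks * block].reshape(n_blocks, block, d).mean(dim=1)
    out_cmp = attend(q, K_cmp, V_cmp)

    # Branch 2 -- selection: pick the top-n blocks by coarse relevance,
    # then attend over their original fine-grained tokens.
    top = torch.topk(K_cmp @ q, k=min(top_n, n_blocks)).indices
    idx = (top[:, None] * block + torch.arange(block, device=K.device)).reshape(-1)
    out_sel = attend(q, K[idx], V[idx])

    # Branch 3 -- sliding window: attend over the most recent tokens only.
    out_win = attend(q, K[-window:], V[-window:])

    # Gated mixture of the three branches.
    return gates[0] * out_cmp + gates[1] * out_sel + gates[2] * out_win

# Example: 4096 cached tokens, head dim 64.
q, K, V = torch.randn(64), torch.randn(4096, 64), torch.randn(4096, 64)
out = nsa_decode_step(q, K, V, gates=torch.sigmoid(torch.randn(3)))  # (64,)
```

In the paper the gates are produced per query by a small MLP with a sigmoid activation rather than passed in as constants; because every branch (including the compression and selection machinery) is differentiable in this way, the sparsity pattern can be trained end to end, which is the property the bullets above highlight.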