DeepSeek's NSA paper won the Best Paper Award at ACL 2025
- #Efficient Computing
- #Machine Learning
- #Natural Language Processing
- NSA (Native Sparse Attention) is introduced for efficient long-context modeling in language models.
- It combines a dynamic hierarchical sparse strategy (coarse-grained token compression, fine-grained blockwise token selection, and a sliding window) with hardware-aligned optimizations for speed and efficiency; see the sketch after this list.
- Key innovations include arithmetic-intensity-balanced algorithm design and end-to-end trainability, so the sparsity pattern is learned during pretraining rather than bolted on at inference time.
- NSA matches or exceeds Full Attention model performance across general benchmarks, long-context tasks, and reasoning evaluations.
- Substantial speedups over Full Attention are achieved on 64k-length sequences across decoding, forward propagation, and backward propagation.
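
For intuition, the hierarchical strategy above can be viewed as three attention branches over the same KV cache, mixed by learned gates: compressed coarse tokens, a few selected fine-grained blocks, and a local sliding window. Below is a minimal single-head PyTorch sketch of one decoding step under that reading; it is not DeepSeek's implementation (the paper uses a learnable compression function, selection scores reused from the compression branch's attention weights, grouped-query-aware selection, and custom kernels), and the block size, top-n count, and window length here are arbitrary placeholders.

```python
# Minimal sketch of NSA-style hierarchical sparse attention for one decoding
# step. Simplifications vs. the paper: single head, mean-pooled compression
# (the paper learns a compression function), block selection scored directly
# by q . K_cmp, and no custom kernels. block/top_n/window are illustrative.
import torch
import torch.nn.functional as F

def attend(q, k, v):
    """Scaled dot-product attention for a single query vector.
    q: (d,)  k, v: (n, d)  ->  (d,)"""
    scores = k @ q / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def nsa_decode_step(q, K, V, gates, block=64, top_n=4, window=512):
    """q: (d,) current query; K, V: (t, d) KV cache with t >= block;
    gates: (3,) per-branch gate values in [0, 1]."""
    t, d = K.shape
    n_blocks = t // block  # a trailing partial block is covered by the window branch

    # Branch 1 -- compression: pool each KV block into one coarse token.
    K_cmp = K[: n_blocks * block].reshape(n_blocks, block, d).mean(dim=1)
    V_cmp = V[: n_blocks * block].reshape(n_blocks, block, d).mean(dim=1)
    out_cmp = attend(q, K_cmp, V_cmp)

    # Branch 2 -- selection: pick the top-n blocks by coarse relevance,
    # then attend over their original fine-grained tokens.
    top = torch.topk(K_cmp @ q, k=min(top_n, n_blocks)).indices
    idx = (top[:, None] * block + torch.arange(block, device=K.device)).reshape(-1)
    out_sel = attend(q, K[idx], V[idx])

    # Branch 3 -- sliding window: attend over the most recent tokens only.
    out_win = attend(q, K[-window:], V[-window:])

    # Gated mixture of the three branches.
    return gates[0] * out_cmp + gates[1] * out_sel + gates[2] * out_win

# Example: 4096 cached tokens, head dim 64.
q, K, V = torch.randn(64), torch.randn(4096, 64), torch.randn(4096, 64)
out = nsa_decode_step(q, K, V, gates=torch.sigmoid(torch.randn(3)))  # (64,)
```

In the paper the gates are produced per query by a small MLP with a sigmoid activation rather than passed in as constants; because every branch (including the compression and selection machinery) is differentiable in this way, the sparsity pattern can be trained end to end, which is the property the bullets above highlight.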