Hasty Briefs (beta)


DeepSeek V4's indexer OOMs at 65K context. We got it to 1M in 6 GB

20 hours ago
  • #Compressed Sparse Attention
  • #Top-k Selection
  • #Memory Optimization
  • StreamIndex is a Triton implementation of Compressed Sparse Attention (CSA) that uses a chunked partition-merge top-k driver to avoid materializing the full intermediate score tensor.
  • It extends the memory-bounded regime by 32x, handling sequences up to 1,048,576 tokens using only 6.21 GB of HBM, compared to the materialize path which OOMs at 65,536 tokens.
  • Set-overlap recall against the materialize ground truth is near-perfect (mean 1.0000, min 0.9980 across design sweeps), indicating the chunked driver selects essentially the same top-k set as the full computation.
  • StreamIndex composes with TileLang's pipelined attention kernel, enabling execution at S=262,144 in 1.97 s while the materialize version OOMs.
  • The contribution focuses on the indexer step, not attention kernel speed or end-to-end real-checkpoint behavior.
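The core trick described above — a chunked partition-merge top-k that never materializes the full score tensor — can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not StreamIndex's Triton kernel; the function names, the `score_fn` callback, and the chunk size are all illustrative assumptions.

```python
import numpy as np

def chunked_topk(score_fn, seq_len, k, chunk=4096):
    """Partition-merge top-k: score the sequence chunk by chunk and keep
    only a running top-k, so the full (seq_len,) score tensor is never
    resident at once. `score_fn(lo, hi)` is a hypothetical stand-in for
    the indexer's per-position scoring over positions [lo, hi)."""
    best_vals = np.full(k, -np.inf)
    best_idx = np.full(k, -1, dtype=np.int64)
    for lo in range(0, seq_len, chunk):
        hi = min(lo + chunk, seq_len)
        vals = score_fn(lo, hi)            # scores for this chunk only
        idx = np.arange(lo, hi)
        # merge chunk candidates with the running top-k, then re-select k
        merged_vals = np.concatenate([best_vals, vals])
        merged_idx = np.concatenate([best_idx, idx])
        keep = np.argpartition(-merged_vals, k - 1)[:k]
        best_vals, best_idx = merged_vals[keep], merged_idx[keep]
    # return indices sorted by descending score
    return best_idx[np.argsort(-best_vals)]
```

Peak memory is O(k + chunk) per query rather than O(seq_len), which is why the memory-bounded regime extends to much longer sequences: the merge buffer stays constant-sized no matter how long the context grows.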