DeepSeek releases 'sparse attention' model that cuts API costs in half
10 hours ago
- #DeepSeek
- #AI
- #InferenceCosts
- DeepSeek released a new experimental model, V3.2-exp, designed to lower inference costs in long-context operations.
- The model features DeepSeek Sparse Attention, which uses a 'lightning indexer' and a 'fine-grained token selection system' so that attention is spent only on the most relevant portions of the context window, reducing server load (see the sketch after this list).
- Preliminary testing shows API call costs could be reduced by up to 50% in long-context scenarios.
- The model is open-weight and available on Hugging Face, allowing third-party verification.
- The release is part of DeepSeek's broader effort to make the transformer architecture more efficient at inference time.
- DeepSeek, based in China, previously gained attention with its R1 model but has since receded from the spotlight.
- The new sparse attention approach may help U.S. providers lower inference costs.
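The article doesn't spell out the mechanism's internals, but the general pattern it describes is: a cheap scoring pass ("indexer") picks a small subset of keys per query, and full attention runs only over that subset instead of the whole context. The PyTorch sketch below is a minimal illustration under that assumption; the function and parameter names (`sparse_attention`, `indexer_q`, `indexer_k`, `top_k`) are hypothetical and not DeepSeek's actual implementation.

```python
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, indexer_q, indexer_k, top_k):
    """Illustrative top-k sparse attention for a single head.

    q, k, v:              (seq_len, d_model) query/key/value vectors
    indexer_q, indexer_k: (seq_len, d_index) cheap low-dimensional
                          projections used only to score key relevance
    top_k:                number of keys each query actually attends to
    """
    seq_len, d_model = q.shape

    # Cheap pass: score every key for every query in the small index space.
    index_scores = indexer_q @ indexer_k.T              # (seq_len, seq_len)

    # Each query keeps only its top_k highest-scoring keys.
    top_idx = index_scores.topk(top_k, dim=-1).indices  # (seq_len, top_k)

    # Expensive pass: standard scaled-dot-product attention, but only
    # over the selected keys and values.
    k_sel = k[top_idx]                                  # (seq_len, top_k, d_model)
    v_sel = v[top_idx]                                  # (seq_len, top_k, d_model)
    scores = torch.einsum("qd,qkd->qk", q, k_sel) / d_model ** 0.5
    weights = F.softmax(scores, dim=-1)
    return torch.einsum("qk,qkd->qd", weights, v_sel)

# Toy usage: with a 4096-token context and top_k=64, the expensive
# attention step touches only ~1.6% of the full score matrix.
seq, d, d_idx = 4096, 64, 16
q, k, v = (torch.randn(seq, d) for _ in range(3))
iq, ik = torch.randn(seq, d_idx), torch.randn(seq, d_idx)
out = sparse_attention(q, k, v, iq, ik, top_k=64)
print(out.shape)  # torch.Size([4096, 64])
```

Under this reading, the quadratic attention cost drops to roughly O(seq_len × top_k) for the expensive step, which is where the long-context savings would come from; the open question such a design must answer is whether the cheap indexer scores reliably agree with full attention about which tokens matter.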