Nvidia releases 8B model with learned 8x KV cache compression
3 months ago
- #AI
- #NVIDIA
- #Machine Learning
- Qwen3-8B-DMS-8x is a derivative of Qwen3-8B trained with Dynamic Memory Sparsification (DMS), enabling 8x KV cache compression during inference.
- Optimized for reduced KV cache memory footprint, improving throughput and latency in long-context and reasoning tasks.
- Released under the NVIDIA License for non-commercial research and educational use only.
- Supports global deployment with advanced reasoning capabilities.
- The architecture is an autoregressive transformer with 8.2B parameters.
- Requires transformers==4.57.3, torch, and flash-attn for operation.
- Evaluation shows competitive performance across benchmarks like GPQA Diamond, MMLU-Pro, and HumanEval.
- Includes ethical considerations and encourages responsible AI development.
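To get a feel for what 8x KV cache compression buys, here is a back-of-envelope sizing sketch. The architecture numbers (36 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16 cache) are assumptions based on a typical Qwen3-8B-class configuration, not published DMS specifics:

```python
# Rough KV cache sizing for an 8B Qwen3-style model.
# Layer/head/dim values below are assumptions, not confirmed DMS numbers.

def kv_cache_bytes(tokens, layers=36, kv_heads=8, head_dim=128,
                   bytes_per_elem=2, compression=1):
    """Bytes of KV cache needed to hold `tokens` cached positions.

    The leading factor of 2 covers both the K and V tensors;
    `bytes_per_elem=2` assumes an fp16/bf16 cache.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token // compression

ctx = 32_768
full = kv_cache_bytes(ctx)                # uncompressed cache
dms = kv_cache_bytes(ctx, compression=8)  # with 8x compression
print(f"full: {full / 2**30:.2f} GiB, 8x compressed: {dms / 2**30:.2f} GiB")
# → full: 4.50 GiB, 8x compressed: 0.56 GiB
```

Under these assumptions, an 8x-compressed cache shrinks a 32K-token context from about 4.5 GiB to well under 1 GiB per sequence, which is where the claimed throughput and batch-size gains in long-context workloads come from.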