Nvidia releases 8B model with learned 8x KV cache compression
3 months ago
- #AI
- #NVIDIA
- #Machine Learning
- Qwen3-8B-DMS-8x is a derivative of Qwen3-8B trained with Dynamic Memory Sparsification (DMS), enabling 8x KV cache compression during inference.
- Optimized for reduced KV cache memory footprint, improving throughput and latency in long-context and reasoning tasks.
- Released under the NVIDIA License for non-commercial research and educational use only.
- Supports global deployment with advanced reasoning capabilities.
- The architecture is an autoregressive transformer with 8.2B parameters.
- Requires transformers==4.57.3, torch, and flash-attn for operation.
- Evaluation shows competitive performance across benchmarks like GPQA Diamond, MMLU-Pro, and HumanEval.
- Includes ethical considerations and encourages responsible AI development.
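To get a feel for what 8x KV cache compression buys, here is a back-of-envelope sizing sketch. The architecture numbers (36 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16 cache) are assumptions based on a typical Qwen3-8B-class configuration, not published DMS specifics:

```python
# Rough KV cache sizing for an 8B Qwen3-style model.
# Layer/head/dim values below are assumptions, not confirmed DMS numbers.

def kv_cache_bytes(tokens, layers=36, kv_heads=8, head_dim=128,
                   bytes_per_elem=2, compression=1):
    """Bytes of KV cache needed to hold `tokens` cached positions.

    The leading factor of 2 covers both the K and V tensors;
    `bytes_per_elem=2` assumes an fp16/bf16 cache.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token // compression

ctx = 32_768
full = kv_cache_bytes(ctx)                # uncompressed cache
dms = kv_cache_bytes(ctx, compression=8)  # with 8x compression
print(f"full: {full / 2**30:.2f} GiB, 8x compressed: {dms / 2**30:.2f} GiB")
# → full: 4.50 GiB, 8x compressed: 0.56 GiB
```

Under these assumptions, an 8x-compressed cache shrinks a 32K-token context from about 4.5 GiB to well under 1 GiB per sequence, which is where the claimed throughput and batch-size gains in long-context workloads come from.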