Mercury 2: Diffusion Reasoning Model
8 hours ago
- #AI
- #LLM
- #Diffusion
- Mercury 2 is introduced as the world's fastest reasoning language model, built to make production AI feel instant.
- Speed is critical in production AI due to compounding latency in loops like agents and retrieval pipelines.
- Mercury 2 uses diffusion-based parallel refinement for faster generation, producing multiple tokens simultaneously.
- It offers >5x faster generation than autoregressive models, shifting the usual trade-off between reasoning depth and latency.
- Key features include a generation speed of 1,009 tokens/sec, competitive output quality, a 128K-token context window, and native tool use.
- Optimized for real-time responsiveness with low p95 latency under high concurrency.
- NVIDIA highlights Mercury 2's performance on its GPUs, surpassing 1,000 tokens/sec.
- Excels in latency-sensitive applications like coding, agentic loops, real-time voice, and search pipelines.
- Partners and customers praise its speed, quality, and impact on workflows.
- Mercury 2 is OpenAI API compatible and available now for enterprise evaluations.
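The compounding-latency point above can be made concrete with simple arithmetic: in a sequential agent or retrieval loop, per-call generation time multiplies by the number of steps. In the sketch below, the 1,009 tok/s figure comes from the summary, while the step count, token budget, and 100 tok/s autoregressive baseline are illustrative assumptions, not published benchmarks.

```python
# Illustrative latency arithmetic for a sequential agent loop.
# Assumed workload: 8 chained LLM calls, 500 generated tokens each.
STEPS = 8
TOKENS_PER_STEP = 500

def loop_latency(tokens_per_sec: float) -> float:
    """Total generation time in seconds across all sequential steps."""
    return STEPS * TOKENS_PER_STEP / tokens_per_sec

# 100 tok/s is a hypothetical autoregressive baseline;
# 1,009 tok/s is the Mercury 2 figure quoted in the summary.
baseline = loop_latency(100.0)
mercury = loop_latency(1009.0)
print(f"baseline: {baseline:.1f}s, mercury: {mercury:.2f}s")
```

With these assumed numbers the loop drops from tens of seconds to a few seconds end to end, which is why per-token speed compounds so visibly in agentic pipelines.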
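The parallel-refinement idea can be sketched with a toy decoder: instead of committing one token per sequential model call (autoregressive decoding), a diffusion-style decoder starts from a fully masked sequence and commits a subset of positions simultaneously on each pass, so the sequential cost scales with the number of passes rather than the sequence length. This is a schematic illustration of the concept only, not Mercury's actual sampler; the position-selection rule here is a stand-in for the model's confidence scores.

```python
def parallel_refine(target: list[str], passes: int) -> list[list[str]]:
    """Toy diffusion-style decoding: begin fully masked ("_") and, on
    each pass, commit several positions at once rather than decoding
    left to right. Schematic only -- not Mercury's real algorithm."""
    seq = ["_"] * len(target)
    history = []
    for p in range(passes):
        # A real model would pick positions by confidence; here we
        # deterministically commit every `passes`-th position per pass.
        seq = [target[i] if seq[i] != "_" or i % passes == p else "_"
               for i in range(len(target))]
        history.append(seq[:])
    return history

tokens = "Mercury 2 decodes tokens in parallel".split()
for step in parallel_refine(tokens, passes=3):
    print(" ".join(step))
```

The six-token sequence finishes in three passes regardless of length; an autoregressive decoder would need one sequential call per token, which is the gap the >5x speedup claim points at.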
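Because the API is OpenAI-compatible, existing client code should need only a different base URL and model name. The sketch below builds a standard chat-completions request body; the endpoint and model identifier are placeholders, not documented values.

```python
import json

# Hypothetical endpoint and model id -- check the provider's docs for
# the real values; only the request schema (OpenAI chat completions)
# is the point of this sketch.
BASE_URL = "https://api.example.com/v1"
payload = {
    "model": "mercury-2",  # assumed model identifier
    "messages": [
        {"role": "user", "content": "Summarize this log in one line."}
    ],
    "stream": True,  # streaming fits the latency-sensitive use cases above
}
body = json.dumps(payload)
print(body)
```

Any OpenAI-style SDK or plain HTTP client that can POST this body to a `/chat/completions` route should work unchanged, which is what "drop-in compatible" means in practice.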