Mercury 2: Diffusion Reasoning Model
8 hours ago
- #AI
- #LLM
- #Diffusion
- Mercury 2 is introduced as the world's fastest reasoning language model, built to make production AI feel instant.
- Speed is critical in production AI due to compounding latency in loops like agents and retrieval pipelines.
- Mercury 2 uses diffusion-based parallel refinement for faster generation, producing multiple tokens simultaneously.
- It offers >5x faster generation than autoregressive models, shifting the usual trade-off between reasoning depth and latency.
- Key features include a generation speed of 1,009 tokens/sec, competitive output quality, a 128K-token context window, and native tool use.
- Optimized for real-time responsiveness with low p95 latency under high concurrency.
- NVIDIA highlights Mercury 2's performance on its GPUs, surpassing 1,000 tokens/sec.
- Excels in latency-sensitive applications like coding, agentic loops, real-time voice, and search pipelines.
- Partners and customers praise its speed, quality, and impact on workflows.
- Mercury 2 is OpenAI API compatible and available now for enterprise evaluations.
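The compounding-latency point above can be made concrete with simple arithmetic: in a sequential agent or retrieval loop, per-call generation time multiplies by the number of steps. In the sketch below, the 1,009 tok/s figure comes from the summary, while the step count, token budget, and 100 tok/s autoregressive baseline are illustrative assumptions, not published benchmarks.

```python
# Illustrative latency arithmetic for a sequential agent loop.
# Assumed workload: 8 chained LLM calls, 500 generated tokens each.
STEPS = 8
TOKENS_PER_STEP = 500

def loop_latency(tokens_per_sec: float) -> float:
    """Total generation time in seconds across all sequential steps."""
    return STEPS * TOKENS_PER_STEP / tokens_per_sec

# 100 tok/s is a hypothetical autoregressive baseline;
# 1,009 tok/s is the Mercury 2 figure quoted in the summary.
baseline = loop_latency(100.0)
mercury = loop_latency(1009.0)
print(f"baseline: {baseline:.1f}s, mercury: {mercury:.2f}s")
```

With these assumed numbers the loop drops from tens of seconds to a few seconds end to end, which is why per-token speed compounds so visibly in agentic pipelines.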
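The parallel-refinement idea can be sketched with a toy decoder: instead of committing one token per sequential model call (autoregressive decoding), a diffusion-style decoder starts from a fully masked sequence and commits a subset of positions simultaneously on each pass, so the sequential cost scales with the number of passes rather than the sequence length. This is a schematic illustration of the concept only, not Mercury's actual sampler; the position-selection rule here is a stand-in for the model's confidence scores.

```python
def parallel_refine(target: list[str], passes: int) -> list[list[str]]:
    """Toy diffusion-style decoding: begin fully masked ("_") and, on
    each pass, commit several positions at once rather than decoding
    left to right. Schematic only -- not Mercury's real algorithm."""
    seq = ["_"] * len(target)
    history = []
    for p in range(passes):
        # A real model would pick positions by confidence; here we
        # deterministically commit every `passes`-th position per pass.
        seq = [target[i] if seq[i] != "_" or i % passes == p else "_"
               for i in range(len(target))]
        history.append(seq[:])
    return history

tokens = "Mercury 2 decodes tokens in parallel".split()
for step in parallel_refine(tokens, passes=3):
    print(" ".join(step))
```

The six-token sequence finishes in three passes regardless of length; an autoregressive decoder would need one sequential call per token, which is the gap the >5x speedup claim points at.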
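Because the API is OpenAI-compatible, existing client code should need only a different base URL and model name. The sketch below builds a standard chat-completions request body; the endpoint and model identifier are placeholders, not documented values.

```python
import json

# Hypothetical endpoint and model id -- check the provider's docs for
# the real values; only the request schema (OpenAI chat completions)
# is the point of this sketch.
BASE_URL = "https://api.example.com/v1"
payload = {
    "model": "mercury-2",  # assumed model identifier
    "messages": [
        {"role": "user", "content": "Summarize this log in one line."}
    ],
    "stream": True,  # streaming fits the latency-sensitive use cases above
}
body = json.dumps(payload)
print(body)
```

Any OpenAI-style SDK or plain HTTP client that can POST this body to a `/chat/completions` route should work unchanged, which is what "drop-in compatible" means in practice.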