Two different tricks for fast LLM inference
3 months ago
- #LLM
- #AI-inference
- #fast-mode
- Anthropic and OpenAI have both introduced a 'fast mode' for LLM inference, but they take strikingly different approaches.
- Anthropic's fast mode delivers up to 2.5x the tokens per second while still serving the full Opus 4.6 model.
- OpenAI's fast mode provides over 1000 tokens per second but uses a less capable model, GPT-5.3-Codex-Spark.
- Anthropic's approach likely involves running inference at low batch sizes, trading hardware utilization (and therefore cost per token) for per-request speed; see the batch-size sketch after this list.
- OpenAI's method leverages Cerebras chips, whose 44 GB of on-chip SRAM keeps model weights right next to the compute for ultra-low latency; a back-of-envelope capacity check follows below.
- OpenAI's achievement is arguably the more impressive engineering feat, combining model distillation with integration onto Cerebras hardware; a minimal distillation-loss sketch closes the post.
- Fast but less capable inference may not be universally useful, since increased error rates can offset the speed gains.
- Both labs' efforts suggest exploration rather than a major shift toward fast inference as a primary goal.
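To make the batch-size tradeoff concrete, here is a toy throughput model in Python. Every constant (weight footprint, memory bandwidth, per-token compute time) is an illustrative assumption, not a measurement of Anthropic's actual deployment.

```python
# Toy model of the speed/cost tradeoff in batched LLM decoding.
# All constants below are illustrative assumptions, not measured values.

WEIGHT_BYTES = 200e9        # hypothetical model weight footprint (200 GB)
MEM_BANDWIDTH = 26.8e12     # assumed aggregate HBM bandwidth, bytes/s (~8x H100)
COMPUTE_PER_TOKEN = 1e-4    # assumed compute time per token per step, seconds

def step_time(batch_size: int) -> float:
    """One decode step: the weights must stream from memory once per step
    (a floor independent of batch size), while compute scales with it."""
    memory_time = WEIGHT_BYTES / MEM_BANDWIDTH
    compute_time = batch_size * COMPUTE_PER_TOKEN
    return max(memory_time, compute_time)

for b in (1, 8, 64, 256):
    t = step_time(b)
    print(f"batch={b:4d}  per-request={1 / t:7.1f} tok/s  "
          f"aggregate={b / t:8.1f} tok/s")
```

At small batches each request gets the full memory-bandwidth floor of per-request speed, but the hardware emits few total tokens, so cost per token balloons: that is the sense in which low-batch 'fast mode' buys speed with money.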
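The 44 GB figure matches the published on-chip SRAM of the Cerebras WSE-3, and it bounds how large a model can live entirely on the wafer with no off-chip weight traffic. A rough capacity check follows; the parameter counts and precisions are hypothetical, since OpenAI hasn't published GPT-5.3-Codex-Spark's size.

```python
# Back-of-envelope check of what fits in 44 GB of on-chip SRAM.
# Parameter counts and precisions are assumptions for illustration only.

SRAM_BYTES = 44e9  # Cerebras WSE-3 on-chip memory

def fits(params_billions: float, bytes_per_param: float) -> bool:
    """True if the weights alone fit in on-chip SRAM."""
    return params_billions * 1e9 * bytes_per_param <= SRAM_BYTES

for params in (8, 20, 40, 70):
    for bits in (16, 8, 4):
        verdict = "fits" if fits(params, bits / 8) else "does not fit"
        print(f"{params:3d}B params @ {bits:2d}-bit: {verdict} in 44 GB SRAM")
```

Keeping the whole (distilled) model in SRAM eliminates off-chip weight transfers entirely, which is what makes sustained rates above 1000 tokens per second plausible.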
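For the distillation half of OpenAI's trick, here is a minimal sketch of the classic Hinton-style knowledge-distillation objective, where a small student is trained to match a large teacher's next-token distribution. The shapes, temperature, and loss variant are illustrative assumptions, not OpenAI's actual training recipe.

```python
# Minimal knowledge-distillation loss: the student matches the teacher's
# softened output distribution. Pure numpy; toy shapes for illustration.
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z -= z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over next-token distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = (p_teacher * (np.log(p_teacher + 1e-12)
                       - np.log(p_student + 1e-12))).sum(-1)
    return temperature**2 * kl.mean()  # T^2 keeps gradient scale comparable

# Toy example: a batch of 4 positions over a 10-token vocabulary.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 10))
student = rng.normal(size=(4, 10))
print(f"distillation loss: {distillation_loss(student, teacher):.4f}")
```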