Two different tricks for fast LLM inference
3 months ago
- #LLM
- #AI-inference
- #fast-mode
- Anthropic and OpenAI have both introduced a 'fast mode' for LLM inference, but they take strikingly different approaches.
- Anthropic's fast mode delivers up to 2.5x the tokens per second while still serving the full Opus 4.6 model.
- OpenAI's fast mode provides over 1000 tokens per second but uses a less capable model, GPT-5.3-Codex-Spark.
- Anthropic's approach likely involves running inference at low batch sizes, trading hardware utilization (and therefore cost per token) for per-request speed; see the batch-size sketch after this list.
- OpenAI's method leverages Cerebras chips, whose 44 GB of on-chip SRAM keeps model weights right next to the compute for ultra-low latency; a back-of-envelope capacity check follows below.
- OpenAI's achievement is arguably the more impressive engineering feat, combining model distillation with integration onto Cerebras hardware; a minimal distillation-loss sketch closes the post.
- Fast but less capable inference may not be universally useful, since increased error rates can offset the speed gains.
- Both labs' efforts suggest exploration rather than a major shift toward fast inference as a primary goal.
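To make the batch-size tradeoff concrete, here is a toy throughput model in Python. Every constant (weight footprint, memory bandwidth, per-token compute time) is an illustrative assumption, not a measurement of Anthropic's actual deployment.

```python
# Toy model of the speed/cost tradeoff in batched LLM decoding.
# All constants below are illustrative assumptions, not measured values.

WEIGHT_BYTES = 200e9        # hypothetical model weight footprint (200 GB)
MEM_BANDWIDTH = 26.8e12     # assumed aggregate HBM bandwidth, bytes/s (~8x H100)
COMPUTE_PER_TOKEN = 1e-4    # assumed compute time per token per step, seconds

def step_time(batch_size: int) -> float:
    """One decode step: the weights must stream from memory once per step
    (a floor independent of batch size), while compute scales with it."""
    memory_time = WEIGHT_BYTES / MEM_BANDWIDTH
    compute_time = batch_size * COMPUTE_PER_TOKEN
    return max(memory_time, compute_time)

for b in (1, 8, 64, 256):
    t = step_time(b)
    print(f"batch={b:4d}  per-request={1 / t:7.1f} tok/s  "
          f"aggregate={b / t:8.1f} tok/s")
```

At small batches each request gets the full memory-bandwidth floor of per-request speed, but the hardware emits few total tokens, so cost per token balloons: that is the sense in which low-batch 'fast mode' buys speed with money.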
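The 44 GB figure matches the published on-chip SRAM of the Cerebras WSE-3, and it bounds how large a model can live entirely on the wafer with no off-chip weight traffic. A rough capacity check follows; the parameter counts and precisions are hypothetical, since OpenAI hasn't published GPT-5.3-Codex-Spark's size.

```python
# Back-of-envelope check of what fits in 44 GB of on-chip SRAM.
# Parameter counts and precisions are assumptions for illustration only.

SRAM_BYTES = 44e9  # Cerebras WSE-3 on-chip memory

def fits(params_billions: float, bytes_per_param: float) -> bool:
    """True if the weights alone fit in on-chip SRAM."""
    return params_billions * 1e9 * bytes_per_param <= SRAM_BYTES

for params in (8, 20, 40, 70):
    for bits in (16, 8, 4):
        verdict = "fits" if fits(params, bits / 8) else "does not fit"
        print(f"{params:3d}B params @ {bits:2d}-bit: {verdict} in 44 GB SRAM")
```

Keeping the whole (distilled) model in SRAM eliminates off-chip weight transfers entirely, which is what makes sustained rates above 1000 tokens per second plausible.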
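For the distillation half of OpenAI's trick, here is a minimal sketch of the classic Hinton-style knowledge-distillation objective, where a small student is trained to match a large teacher's next-token distribution. The shapes, temperature, and loss variant are illustrative assumptions, not OpenAI's actual training recipe.

```python
# Minimal knowledge-distillation loss: the student matches the teacher's
# softened output distribution. Pure numpy; toy shapes for illustration.
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z -= z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over next-token distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = (p_teacher * (np.log(p_teacher + 1e-12)
                       - np.log(p_student + 1e-12))).sum(-1)
    return temperature**2 * kl.mean()  # T^2 keeps gradient scale comparable

# Toy example: a batch of 4 positions over a 10-token vocabulary.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 10))
student = rng.normal(size=(4, 10))
print(f"distillation loss: {distillation_loss(student, teacher):.4f}")
```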