Hasty Briefs (beta)

Two different tricks for fast LLM inference

3 months ago
  • #LLM
  • #AI-inference
  • #fast-mode
  • Anthropic and OpenAI have both introduced a 'fast mode' for LLM inference, but with very different approaches.
  • Anthropic's fast mode delivers up to 2.5x the tokens per second while serving the real Opus 4.6 model.
  • OpenAI's fast mode delivers over 1,000 tokens per second, but via a less capable model, GPT-5.3-Codex-Spark.
  • Anthropic's approach likely relies on low-batch-size inference, which raises per-request speed at a higher serving cost.
  • OpenAI's method leverages Cerebras chips, whose large on-chip memory (44GB of SRAM) enables ultra-low-latency compute.
  • OpenAI's achievement is technically more impressive, combining model distillation with Cerebras integration.
  • Fast but less capable inference may not be universally useful, since higher error rates can offset the speed gains.
  • Both labs' efforts look like exploration rather than a decisive shift toward fast inference as a primary goal.
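
The bandwidth argument behind the two hardware bullets can be made concrete with a back-of-envelope sketch: single-request decoding is typically memory-bandwidth-bound, since every generated token requires streaming all model weights from memory. The hardware figures below are public spec-sheet numbers; the 40GB weight footprint is an assumption (a model sized to fit the WSE-3's 44GB of SRAM), and real systems land well below these theoretical ceilings, which is why OpenAI reports roughly 1,000 tokens per second rather than the ceiling computed here.

```python
def decode_tokens_per_sec(mem_bandwidth_bytes_per_s: float,
                          weight_bytes: float) -> float:
    """Upper bound on batch-size-1 decode speed: one full pass over the
    weights per token, limited only by memory bandwidth."""
    return mem_bandwidth_bytes_per_s / weight_bytes

WEIGHTS = 40e9       # assumed 40 GB of weights (hypothetical; fits in 44 GB SRAM)
H100_HBM = 3.35e12   # NVIDIA H100 SXM HBM3 bandwidth: ~3.35 TB/s
WSE3_SRAM = 21e15    # Cerebras WSE-3 on-chip SRAM bandwidth: ~21 PB/s

gpu_ceiling = decode_tokens_per_sec(H100_HBM, WEIGHTS)   # ~84 tok/s
wse_ceiling = decode_tokens_per_sec(WSE3_SRAM, WEIGHTS)  # ~525,000 tok/s

print(f"GPU HBM ceiling:       {gpu_ceiling:,.0f} tok/s")
print(f"Cerebras SRAM ceiling: {wse_ceiling:,.0f} tok/s")
```

The gap of several orders of magnitude in the memory-bandwidth ceiling is what makes keeping the whole model in on-chip SRAM attractive for latency, even though practical throughput is far lower once compute, interconnect, and sampling overheads are accounted for.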