Gemma 4 on Cerebras - The Fastest Inference Is Now Multimodal

8 hours ago

Gemma 4 31B runs at over 1,800 tokens per second on Cerebras Inference, making it the fastest multimodal model for applications like computer use and image-driven workflows.
On Cerebras, the model reaches a record 1,851 output tokens per second with a time to first token of 1.5 seconds, which is fast enough for real-time use and ahead of typical GPU endpoints and models such as Claude Haiku.
Gemma 4 31B is an open-weight model released under Apache 2.0. It is comparable in intelligence to Claude Haiku 4.5 and is positioned as a reference medium-size model for teams looking at alternatives to Haiku, GPT-OSS, or Llama.
It supports image understanding (e.g., screenshots, charts, UI states), unlocking new product experiences like real-time insight generation, long-context summarization, and UI patching.
The model is available now in public preview on the Cerebras Inference Cloud, for workloads that require multimodal reasoning, fast document processing, or real-time audio/video.
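
To put the quoted speed figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes total response time is simply the time to first token plus the output length divided by throughput, which is a simplification and not a measurement. The function name and the sample output lengths are illustrative; only the 1,851 tokens/sec and 1.5 second figures come from the summary above.

```python
def estimated_response_seconds(
    output_tokens: int,
    tokens_per_second: float = 1851.0,  # output throughput quoted above
    first_token_seconds: float = 1.5,   # time to first token quoted above
) -> float:
    """Estimate end-to-end response time under a simple linear model.

    Assumption: total time = time to first token + tokens / throughput.
    Real deployments vary with prompt size, load, and network latency.
    """
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return first_token_seconds + output_tokens / tokens_per_second


if __name__ == "__main__":
    # Illustrative output lengths, not taken from the announcement.
    for n in (100, 500, 2000):
        print(f"{n} output tokens: ~{estimated_response_seconds(n):.2f} s")
```

Under this model the fixed time to first token dominates short replies, so throughput matters most for long outputs such as document summaries.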
