Gemma 4 on Cerebras - The Fastest Inference Is Now Multimodal

3 days ago

Gemma 4 31B runs at over 1,800 tokens per second on Cerebras Inference, enabling fast multimodal applications.
It is the first Google DeepMind model on Cerebras with image input support for screenshots, documents, charts, and UI states.
Cerebras achieves record speed and low latency, making Gemma 4 suitable for real-time visual and agentic workflows.
Gemma 4 31B matches Claude Haiku 4.5 in intelligence but runs 18x faster and is open-weight under Apache 2.0.
The model's high speed transforms product development, allowing instant iteration and complex multimodal loops.
As a dense, efficient model, Gemma 4 31B is designed for quality without the large footprint of MoE models.
Example applications include real-time screenshot analysis, long-context summarization, and UI debugging.
Gemma 4 31B is available on the Cerebras Inference Cloud in public preview for multimodal workloads.

Hasty Briefsbeta