Step 3.7 Flash – 198B-A11B MoE vision-language model

6 hours ago

Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts vision-language model (the "A11B" in its name indicates about 11B active parameters) with native image understanding.
It supports a 256k context window and three selectable reasoning levels for balancing speed, cost, and cognitive depth.
The model scores 79.2 on SimpleVQA and 67.1 on ClawEval-1.1, results cited as evidence of strong visual grounding and tool orchestration.
It can be deployed with Transformers, vLLM, SGLang, or llama.cpp, and supports local inference on high-memory devices.
Input pricing is tiered by cache status: $0.20 per million tokens on a cache miss and $0.04 per million on a cache hit. Output tokens cost $1.15 per million.
Step 3.7 Flash is available on the StepFun Open Platform, OpenRouter, and NVIDIA NIM, with availability planned for partners such as DeepInfra and Fireworks AI.
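
To make the pricing tiers concrete, here is a minimal sketch of a cost estimate for a single request, using only the three per-million-token prices quoted above. The function name `estimate_cost` and the example token counts are hypothetical, chosen for illustration; this is not part of any StepFun API, and actual billing may round or meter differently.

```python
from decimal import Decimal

# Prices in USD per million tokens, as quoted in the summary.
PRICE_INPUT_CACHE_MISS = Decimal("0.20")
PRICE_INPUT_CACHE_HIT = Decimal("0.04")
PRICE_OUTPUT = Decimal("1.15")
TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost(cache_miss_tokens: int, cache_hit_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one request (hypothetical helper, not an official API)."""
    total = (
        Decimal(cache_miss_tokens) * PRICE_INPUT_CACHE_MISS
        + Decimal(cache_hit_tokens) * PRICE_INPUT_CACHE_HIT
        + Decimal(output_tokens) * PRICE_OUTPUT
    )
    return total / TOKENS_PER_MILLION


if __name__ == "__main__":
    # Illustrative workload: 200k uncached input, 800k cached input, 50k output tokens.
    cost = estimate_cost(200_000, 800_000, 50_000)
    print(f"${cost:.4f}")  # prints "$0.1295"
```

In this example the 50k output tokens account for $0.0575 of the $0.1295 total, so output length can dominate the bill even when most of the input is served from cache.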
