Step 3.7 Flash – 198B-A11B MoE vision-language model
- #Model Deployment
- #Multimodal AI
- #AI Model
- Step 3.7 Flash is a sparse Mixture-of-Experts vision-language model with 198B total parameters (roughly 11B active per token, going by the A11B designation) and native image understanding.
- It supports a 256K-token context window and three selectable reasoning levels that trade off speed, cost, and reasoning depth.
- The model scores 79.2 on SimpleVQA and 67.1 on ClawEval-1.1, results presented as evidence of strong visual grounding and tool orchestration.
- It can be deployed with Transformers, vLLM, SGLang, or llama.cpp, and supports local inference on high-memory devices.
- Input pricing depends on caching: $0.20/M tokens on a cache miss and $0.04/M tokens on a cache hit. Output tokens cost $1.15/M.
- Step 3.7 Flash is available on the StepFun Open Platform, OpenRouter, and NVIDIA NIM, with availability planned for partners such as DeepInfra and Fireworks AI.
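
To make the pricing above concrete, here is a minimal sketch of a per-request cost estimate. The three rates are the ones quoted in the summary. The function name `estimate_cost` and its parameters are hypothetical, and the sketch assumes billing is strictly linear in token counts with no minimums, rounding, or per-level surcharges, which the summary does not confirm.

```python
from decimal import Decimal

# USD per one million tokens, as quoted in the summary above.
PRICE_INPUT_CACHE_MISS = Decimal("0.20")
PRICE_INPUT_CACHE_HIT = Decimal("0.04")
PRICE_OUTPUT = Decimal("1.15")
TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost(uncached_input_tokens: int,
                  cached_input_tokens: int,
                  output_tokens: int) -> Decimal:
    """Estimate the cost of one request in USD.

    Hypothetical helper: assumes cost is linear in token counts,
    which is an assumption rather than a documented billing rule.
    """
    counts = (uncached_input_tokens, cached_input_tokens, output_tokens)
    if min(counts) < 0:
        raise ValueError("token counts must be non-negative")

    total = (
        uncached_input_tokens * PRICE_INPUT_CACHE_MISS
        + cached_input_tokens * PRICE_INPUT_CACHE_HIT
        + output_tokens * PRICE_OUTPUT
    )
    return total / TOKENS_PER_MILLION


if __name__ == "__main__":
    # 100k uncached input tokens, 400k cached input tokens, 20k output tokens.
    print(estimate_cost(100_000, 400_000, 20_000))
```

For that example request the terms are $0.020 for uncached input, $0.016 for cached input, and $0.023 for output, or $0.059 in total. `Decimal` is used instead of floats so that small per-token amounts add up without binary rounding error.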