Qwen3.7-Max Ran for 35 Hours on Unknown Hardware and Achieved a 10× Speedup

a month ago

Qwen3.7-Max autonomously optimized a kernel on unfamiliar hardware (T-Head ZW-M890 PPUs) over 35 hours, achieving a 10x speedup.
The model made 1,158 tool calls and performed 432 kernel evaluations, diagnosing failures and redesigning the architecture multiple times without human guidance.
On the same task, GLM 5.1 reached a 7.3x speedup, Kimi K2.6 reached 5x, and DeepSeek V4 Pro reached 3.3x.
Benchmark results show Qwen3.7-Max trades blows with top models in coding (e.g., SWE-Verified) and leads in reasoning tasks like GPQA Diamond and HLE.
Training via 'environment scaling' across diverse agentic environments enables cross-harness generalization and robust problem-solving.
Limitations include being a proprietary API model (no open weights) and potential gaps in complex instruction following compared to competitors.
The model suits agentic workflows that can use a proprietary API, but not those requiring open weights or local deployment.
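
To put the reported speedups side by side, here is a minimal Python sketch. It assumes each figure is the usual ratio of baseline runtime to optimized runtime, measured against the same baseline kernel; the summary does not state how the speedups were measured, so treat this as an illustration only. The helper names (speedup, relative_advantage) are hypothetical and not from the original article.

```python
# Speedup factors over the baseline kernel, as reported in the summary.
reported_speedups = {
    "Qwen3.7-Max": 10.0,
    "GLM 5.1": 7.3,
    "Kimi K2.6": 5.0,
    "DeepSeek V4 Pro": 3.3,
}


def speedup(baseline_time: float, optimized_time: float) -> float:
    """Conventional speedup: baseline runtime divided by optimized runtime.

    Assumption: the article's figures follow this definition.
    """
    if baseline_time <= 0 or optimized_time <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_time / optimized_time


def relative_advantage(speedups: dict, reference: str) -> dict:
    """Ratio of the reference model's speedup to every other model's speedup.

    Only meaningful if all speedups share the same baseline (assumed here).
    """
    ref = speedups[reference]
    return {name: ref / s for name, s in speedups.items() if name != reference}


advantage = relative_advantage(reported_speedups, "Qwen3.7-Max")
for name, ratio in advantage.items():
    print(f"Qwen3.7-Max kernel vs {name} kernel: {ratio:.2f}x faster")
```

Under that shared-baseline assumption, a 10x result against a 5x result means the first kernel runs in half the time of the second, which is the kind of comparison the loop prints for each model.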
