MiniMax M2.5 is beating Claude Opus 4.6, and MiniMax is 17x-20x cheaper

a day ago

SWE-bench has several filtered subsets: Verified (500 instances), Multilingual (300 tasks across 9 languages), Lite (less costly evaluation), and Multimodal (visual elements).
Each leaderboard entry reports the % Resolved metric: the percentage of a benchmark's instances that the system solved.
Recent news includes the introduction of CodeClash, mini-SWE-agent's 65% score on SWE-bench Verified, SWE-smith for training models, and SWE-agent 1.0's SOTA performance on SWE-bench Lite.
Key milestones: SWE-bench Multimodal introduction (10/2024), SWE-bench Verified collaboration with OpenAI (08/2024), Docker-ized SWE-bench (06/2024), and SWE-bench Lite release (03/2024).
The project acknowledges support from Open Philanthropy, AWS, Modal, Andreessen Horowitz, OpenAI, and Anthropic.
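
The % Resolved metric mentioned above is a simple ratio, and a minimal sketch can make the arithmetic concrete. The function name `percent_resolved` and the count of 325 resolved instances are assumptions for illustration only. The count was chosen so that the result matches the 65% figure on the 500-instance Verified subset, and it is not a reported number.

```python
def percent_resolved(resolved: int, total: int) -> float:
    """Return the percentage of benchmark instances that were resolved.

    Hypothetical helper for illustration; not part of any SWE-bench tooling.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return 100.0 * resolved / total


if __name__ == "__main__":
    # Illustrative count only: 325 of the 500 Verified instances.
    score = percent_resolved(325, 500)
    print(f"{score:.1f}% resolved")
```

Because the subsets differ in size (500 instances for Verified, 300 tasks for Multilingual), the same percentage corresponds to a different number of solved instances on each subset.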
