MiniMax M2.5 beats Claude Opus 4.6 and is 17x-20x cheaper
a day ago
- #SWE-bench
- #AI evaluation
- #software engineering
- SWE-bench has several filtered subsets: Verified (500 instances), Multilingual (300 tasks across 9 languages), Lite (less costly evaluation), and Multimodal (visual elements).
- Each entry reports the % Resolved metric: the percentage of instances solved on the given benchmark.
- Recent news includes the introduction of CodeClash, mini-SWE-agent's 65% score on SWE-bench Verified, SWE-smith for training models, and SWE-agent 1.0's SOTA performance on SWE-bench Lite.
- Key milestones: SWE-bench Multimodal introduction (10/2024), SWE-bench Verified collaboration with OpenAI (08/2024), Docker-ized SWE-bench (06/2024), and SWE-bench Lite release (03/2024).
- Acknowledgements to Open Philanthropy, AWS, Modal, Andreessen Horowitz, OpenAI, and Anthropic for their support.
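
The % Resolved metric in the summary above is a simple ratio, and a minimal sketch may make it concrete. The function name `percent_resolved`, the instance IDs, and the outcomes below are all hypothetical, chosen for illustration; this is not the official SWE-bench evaluation harness, which also runs each instance's tests to decide whether it counts as resolved.

```python
def percent_resolved(results: dict[str, bool]) -> float:
    """Return the share of instances marked resolved, as a percentage.

    `results` maps an instance ID to True if the instance was resolved.
    """
    if not results:
        raise ValueError("no instances to score")
    # True counts as 1 and False as 0 when summed.
    return 100.0 * sum(results.values()) / len(results)


# Hypothetical instance IDs and outcomes, for illustration only.
example = {
    "task-001": True,
    "task-002": False,
    "task-003": True,
    "task-004": True,
}
print(f"{percent_resolved(example):.1f}% resolved")
```

Under this definition, scores are only comparable within one subset: a given % Resolved on Verified (500 instances) and the same figure on Multilingual (300 tasks) describe different numbers of solved instances.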