Hasty Briefs

MiniMax M2.5 is beating Claude Opus 4.6 and MiniMax is 17x-20x cheaper

a day ago
  • #SWE-bench
  • #AI evaluation
  • #software engineering
  • SWE-bench has several filtered subsets: Verified (500 instances), Multilingual (300 tasks across 9 languages), Lite (less costly evaluation), and Multimodal (visual elements).
  • Each leaderboard entry reports the % Resolved metric: the percentage of a benchmark's instances that the system solved.
  • Recent news includes the introduction of CodeClash, mini-SWE-agent's 65% score on SWE-bench Verified, SWE-smith for training models, and SWE-agent 1.0's SOTA performance on SWE-bench Lite.
  • Key milestones: SWE-bench Multimodal introduction (10/2024), SWE-bench Verified collaboration with OpenAI (08/2024), Docker-ized SWE-bench (06/2024), and SWE-bench Lite release (03/2024).
  • The project acknowledges support from Open Philanthropy, AWS, Modal, Andreessen Horowitz, OpenAI, and Anthropic.
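
The % Resolved metric mentioned above is a simple ratio, and a minimal sketch may make it concrete. The function name `percent_resolved` and the resolved count of 325 are illustrative assumptions, not taken from the SWE-bench tooling or leaderboard; only the 500-instance size of the Verified subset comes from the summary.

```python
def percent_resolved(resolved: int, total: int) -> float:
    """Return the share of benchmark instances resolved, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return 100.0 * resolved / total


# Hypothetical count: 325 resolved out of the 500 Verified instances.
score = percent_resolved(325, 500)
print(f"{score:.1f}% Resolved")
```

Because the subsets differ in size (500 instances for Verified, 300 tasks for Multilingual), the same percentage corresponds to a different number of solved instances on each, so scores are best compared within a single subset.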