New #1 SOTA on SWE-bench uses Claude 3.7 and o1

a year ago

SWE-bench is a dataset for testing AI systems' ability to solve GitHub issues automatically.
The dataset includes 2,294 Issue-Pull Request pairs from 12 popular Python repositories.
Evaluation runs each repository's unit tests, using the post-PR behavior as the reference solution.
SWE-bench Lite is a curated subset for less costly and more accessible evaluation.
SWE-bench Verified is a human-annotated subset in which every instance is judged solvable, so its ceiling is a 100% resolution rate.
SWE-bench Multimodal features issues with visual elements from JavaScript repositories.
The % Resolved metric indicates the percentage of instances solved by the model.
Submissions marked with 'Open' have open-source code, but the underlying model may not be open-source.
Resources include downloadable datasets from HuggingFace and pre-processed datasets for fine-tuning.
SWE-bench is for research purposes only, with a disclaimer for unexpected results.
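
As a rough illustration of how unit-test verification feeds the % Resolved metric, the sketch below counts an instance as resolved only when every test that the reference PR was meant to fix now passes and every previously passing test still passes. This is a minimal sketch under stated assumptions, not the official evaluation harness: the `InstanceResult` class, its field names, and the sample data are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class InstanceResult:
    """Hypothetical per-instance test outcomes (not the official harness format)."""
    instance_id: str
    # Tests expected to go from failing to passing once the issue is fixed.
    fail_to_pass: dict[str, bool] = field(default_factory=dict)
    # Tests that passed before the fix and must keep passing.
    pass_to_pass: dict[str, bool] = field(default_factory=dict)


def is_resolved(result: InstanceResult) -> bool:
    """An instance counts as resolved only if all target tests pass and nothing regresses."""
    has_targets = bool(result.fail_to_pass)
    return (
        has_targets
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def percent_resolved(results: list[InstanceResult]) -> float:
    """Percentage of instances resolved; 0.0 for an empty list."""
    if not results:
        return 0.0
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / len(results)


# Made-up sample data: two of the four instances are resolved.
sample = [
    InstanceResult("repo-a-1", {"test_fix": True}, {"test_old": True}),
    InstanceResult("repo-a-2", {"test_fix": False}, {"test_old": True}),   # fix did not work
    InstanceResult("repo-b-1", {"test_fix": True}, {"test_old": False}),   # regression
    InstanceResult("repo-b-2", {"test_fix": True, "test_edge": True}, {}),
]

print(f"{percent_resolved(sample):.1f}% resolved")  # prints "50.0% resolved"
```

Requiring both conditions means a patch that fixes the issue but breaks existing behavior does not count toward the score.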
