New #1 SOTA on SWE-bench uses Claude 3.7 and o1
a year ago
- #GitHub
- #AI
- #Software Engineering
- SWE-bench is a dataset for testing AI systems' ability to solve GitHub issues automatically.
- The dataset includes 2,294 Issue-Pull Request pairs from 12 popular Python repositories.
- Evaluation is based on unit test verification using post-PR behavior as the reference solution.
- SWE-bench Lite is a curated subset for less costly and more accessible evaluation.
- SWE-bench Verified is a human-annotated subset whose instances have been judged solvable, so its resolution-rate ceiling is 100%.
- SWE-bench Multimodal features issues with visual elements from JavaScript repositories.
- The % Resolved metric indicates the percentage of instances solved by the model.
- Submissions marked with 'Open' have open-source code, but the underlying model may not be open-source.
- Resources include downloadable datasets from HuggingFace and pre-processed datasets for fine-tuning.
- SWE-bench is intended for research purposes only and carries a disclaimer about unexpected results.
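
To make the % Resolved metric concrete, here is a minimal Python sketch of how it could be computed from per-instance test outcomes. The `InstanceResult` class, its field names, and the sample data are hypothetical illustrations, not the official SWE-bench harness schema. The sketch assumes an instance counts as resolved only when every target unit test passes after the model's patch is applied.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Hypothetical per-instance outcome; not the official harness schema."""
    instance_id: str
    tests_passed: int  # target unit tests that pass after the model's patch
    tests_total: int   # target unit tests derived from the post-PR behavior

    @property
    def resolved(self) -> bool:
        # Assumption: resolved means every target test passes.
        return self.tests_total > 0 and self.tests_passed == self.tests_total


def percent_resolved(results: list[InstanceResult]) -> float:
    """Return the percentage of resolved instances, from 0.0 to 100.0."""
    if not results:
        return 0.0
    solved = sum(1 for r in results if r.resolved)
    return 100.0 * solved / len(results)


# Made-up outcomes for four instances: two fully pass, two do not.
results = [
    InstanceResult("repo-a-1", 5, 5),
    InstanceResult("repo-a-2", 3, 4),
    InstanceResult("repo-b-7", 2, 2),
    InstanceResult("repo-b-9", 0, 1),
]
print(f"{percent_resolved(results):.1f}% resolved")  # prints "50.0% resolved"
```

Partial credit is deliberately absent here: an instance with three of four tests passing contributes nothing to the score, which matches the all-or-nothing reading of "solved" in the summary above.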