New #1 SOTA on SWE-bench uses Claude 3.7 and o1

a year ago
  • #GitHub
  • #AI
  • #Software Engineering
  • SWE-bench is a dataset for testing AI systems' ability to solve GitHub issues automatically.
  • The dataset includes 2,294 Issue-Pull Request pairs from 12 popular Python repositories.
  • Evaluation is based on unit tests: a proposed fix is verified against the repository's behavior after the original pull request, which serves as the reference solution.
  • SWE-bench Lite is a curated subset for less costly and more accessible evaluation.
  • SWE-bench Verified is a subset filtered by human annotators so that every instance is considered solvable, giving it a ceiling of 100% resolution rate.
  • SWE-bench Multimodal features issues with visual elements from JavaScript repositories.
  • The % Resolved metric indicates the percentage of instances solved by the model.
  • Submissions marked with 'Open' have open-source code, but the underlying model may not be open-source.
  • Resources include the datasets, downloadable from Hugging Face, and pre-processed versions for fine-tuning.
  • SWE-bench is intended for research purposes only, and it carries a disclaimer that models trained or evaluated on it can produce unexpected results.
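
To make the evaluation and the % Resolved metric above concrete, here is a minimal sketch in Python. It is not the official SWE-bench harness. It assumes each instance is scored by two groups of unit tests: tests the reference pull request is expected to fix, and tests that passed before and must keep passing. The `InstanceResult` record, its field names, and the instance IDs are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InstanceResult:
    """Hypothetical per-instance test outcome; not the official result schema."""
    instance_id: str
    # Tests expected to go from failing to passing once the issue is fixed.
    fail_to_pass: Dict[str, bool] = field(default_factory=dict)
    # Tests that passed before the fix and must still pass afterwards.
    pass_to_pass: Dict[str, bool] = field(default_factory=dict)


def is_resolved(result: InstanceResult) -> bool:
    """An instance counts as resolved only if every test in both groups passes."""
    if not result.fail_to_pass:
        # Assumption: with no target tests there is nothing to verify the fix.
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def percent_resolved(results: List[InstanceResult]) -> float:
    """Percentage of instances resolved, out of all instances evaluated."""
    if not results:
        return 0.0
    solved = sum(1 for r in results if is_resolved(r))
    return 100.0 * solved / len(results)


if __name__ == "__main__":
    results = [
        InstanceResult("example__repo-1", {"test_fix": True}, {"test_old": True}),
        InstanceResult("example__repo-2", {"test_fix": False}, {"test_old": True}),
        InstanceResult("example__repo-3", {"test_fix": True}, {"test_old": False}),
        InstanceResult("example__repo-4", {"test_fix": True}),
    ]
    print(f"% Resolved: {percent_resolved(results):.1f}")
```

In this sketch the third instance is not resolved even though its target test passes, because the patch broke a test that used to pass. Instances that a system never attempts would still count in the denominator, so skipping hard cases does not raise the score.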