Hasty Briefs (beta)

SWE-Bench Pro

4 hours ago
  • #Software Engineering
  • #SWE-Bench Pro
  • #LLM Evaluation
  • SWE-Bench Pro is a benchmark for evaluating LLMs/Agents on long-horizon software engineering tasks.
  • The dataset is inspired by SWE-Bench and involves generating patches to resolve issues in a given codebase.
  • Access SWE-Bench Pro using `load_dataset('ScaleAI/SWE-bench_Pro', split='test')`.
  • Docker is required for reproducible evaluations, and Modal is needed for scaling evaluations.
  • Install Docker and set up Modal credentials using `pip install modal` and `modal setup`.
  • Prebuilt Docker images are available on Docker Hub under `jefzda/sweap-images`.
  • Evaluate patch predictions using `sweap_pro_eval_modal.py` with specified parameters.
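Before running `sweap_pro_eval_modal.py`, model patches are typically collected into a predictions file. A minimal sketch of building one in JSONL form, assuming the SWE-Bench-style record schema (`instance_id`, `model_name_or_path`, `model_patch`); the exact fields and IDs SWE-Bench Pro expects may differ, and the values below are placeholders:

```python
import json
import tempfile

# Hypothetical prediction records in the SWE-Bench-style format
# (instance_id / model_name_or_path / model_patch). The schema
# accepted by sweap_pro_eval_modal.py is an assumption here.
predictions = [
    {
        "instance_id": "example__repo-1234",  # placeholder task ID
        "model_name_or_path": "my-model",     # placeholder model name
        "model_patch": (
            "diff --git a/app.py b/app.py\n"
            "--- a/app.py\n"
            "+++ b/app.py\n"
            "@@ -1 +1 @@\n"
            "-print('bug')\n"
            "+print('fixed')\n"
        ),
    }
]

# Write one JSON object per line (JSONL), the usual SWE-Bench convention.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    for record in predictions:
        f.write(json.dumps(record) + "\n")
    predictions_path = f.name
```

A file like this would then be passed to the evaluation script along with the run parameters it requires.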