SWE-Bench Pro
- #Software Engineering
- #SWE-Bench Pro
- #LLM Evaluation
- SWE-Bench Pro is a benchmark for evaluating LLMs/Agents on long-horizon software engineering tasks.
- The dataset is inspired by SWE-Bench and involves generating patches to resolve issues in a given codebase.
- Access SWE-Bench Pro using `load_dataset('ScaleAI/SWE-bench_Pro', split='test')`.
- Docker is required for reproducible evaluations, and Modal is needed for scaling evaluations.
- Install Docker, then install Modal and configure its credentials with `pip install modal` and `modal setup`.
- Prebuilt Docker images are available on Docker Hub under `jefzda/sweap-images`.
- Evaluate a file of patch predictions by running `sweap_pro_eval_modal.py` with the appropriate parameters.
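The workflow above can be sketched as follows. This is a minimal sketch: the prediction-record field names (`instance_id`, `model_name_or_path`, `model_patch`) follow the original SWE-Bench convention and are an assumption for SWE-Bench Pro, as are the example identifiers.

```python
import json


def load_tasks():
    """Download the SWE-Bench Pro test split (requires network access)."""
    from datasets import load_dataset  # pip install datasets
    return load_dataset("ScaleAI/SWE-bench_Pro", split="test")


def make_prediction(instance_id, model_name, patch):
    """Build one prediction record.

    Field names are assumed from the SWE-Bench convention; check the
    SWE-Bench Pro harness for the exact schema it expects.
    """
    return {
        "instance_id": instance_id,
        "model_name_or_path": model_name,
        "model_patch": patch,
    }


def write_predictions(records, path="predictions.jsonl"):
    """Write one JSON object per line, the usual input format for
    SWE-Bench-style evaluation scripts."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")


if __name__ == "__main__":
    # Hypothetical instance id and patch, for illustration only.
    recs = [make_prediction("example__repo-123", "my-agent",
                            "diff --git a/f.py b/f.py\n...")]
    write_predictions(recs)
```

The resulting `predictions.jsonl` would then be passed to the evaluation script, with Docker (or Modal, for scaled runs) providing the reproducible execution environment.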