SWE-Bench Pro
- #Software Engineering
- #SWE-Bench Pro
- #LLM Evaluation
- SWE-Bench Pro is a benchmark for evaluating LLMs/Agents on long-horizon software engineering tasks.
- The dataset is inspired by SWE-Bench and involves generating patches to resolve issues in a given codebase.
- Access SWE-Bench Pro using `load_dataset('ScaleAI/SWE-bench_Pro', split='test')`.
- Docker is required for reproducible evaluations, and Modal is needed for scaling evaluations.
- Install Docker, then install Modal and configure its credentials with `pip install modal` and `modal setup`.
- Prebuilt Docker images are available on Docker Hub under `jefzda/sweap-images`.
- Evaluate a file of patch predictions by running `sweap_pro_eval_modal.py` with the appropriate parameters.
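The workflow above can be sketched as follows. This is a minimal sketch: the prediction-record field names (`instance_id`, `model_name_or_path`, `model_patch`) follow the original SWE-Bench convention and are an assumption for SWE-Bench Pro, as are the example identifiers.

```python
import json


def load_tasks():
    """Download the SWE-Bench Pro test split (requires network access)."""
    from datasets import load_dataset  # pip install datasets
    return load_dataset("ScaleAI/SWE-bench_Pro", split="test")


def make_prediction(instance_id, model_name, patch):
    """Build one prediction record.

    Field names are assumed from the SWE-Bench convention; check the
    SWE-Bench Pro harness for the exact schema it expects.
    """
    return {
        "instance_id": instance_id,
        "model_name_or_path": model_name,
        "model_patch": patch,
    }


def write_predictions(records, path="predictions.jsonl"):
    """Write one JSON object per line, the usual input format for
    SWE-Bench-style evaluation scripts."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")


if __name__ == "__main__":
    # Hypothetical instance id and patch, for illustration only.
    recs = [make_prediction("example__repo-123", "my-agent",
                            "diff --git a/f.py b/f.py\n...")]
    write_predictions(recs)
```

The resulting `predictions.jsonl` would then be passed to the evaluation script, with Docker (or Modal, for scaled runs) providing the reproducible execution environment.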