SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via CI
19 hours ago
- #Continuous Integration
- #Software Engineering
- #LLM Agents
- LLM-powered agents show strong capabilities in automating software engineering tasks like static bug fixing.
- SWE-CI is a new benchmark focusing on dynamic, long-term maintainability of codebases, moving beyond static, short-term functional correctness.
- The benchmark includes 100 tasks; on average, each task corresponds to 233 days and 71 commits of history in a real-world repository.
- Agents must resolve each task over multiple rounds of analysis and coding rather than in a single attempt.
- SWE-CI provides insights into agents' ability to maintain code quality over long-term evolution.
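
The multi-round workflow described above can be pictured as a loop that alternates between running checks (analysis) and editing the code (coding). The following is only a minimal sketch under stated assumptions, not the benchmark's actual harness: `Codebase`, `run_ci`, `maintain`, and `toy_agent` are hypothetical names invented for illustration, and a repository is reduced to a plain dictionary.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins, not part of SWE-CI itself:
# a "codebase" is a dict of feature name -> state, and a "test" is a predicate on it.
Codebase = Dict[str, int]
Test = Callable[[Codebase], bool]
Agent = Callable[[Codebase, List[str]], Codebase]


def run_ci(codebase: Codebase, tests: Dict[str, Test]) -> List[str]:
    """Return the names of the tests that fail on the given codebase."""
    return [name for name, test in tests.items() if not test(codebase)]


def maintain(
    codebase: Codebase,
    tests: Dict[str, Test],
    agent: Agent,
    max_rounds: int = 5,
) -> Tuple[Codebase, List[int]]:
    """Alternate analysis (run CI) and coding (agent edit) until CI passes
    or the round budget is exhausted. Returns the final codebase and the
    number of failing tests observed in each round."""
    history: List[int] = []
    for _ in range(max_rounds):
        failing = run_ci(codebase, tests)
        history.append(len(failing))
        if not failing:
            break
        codebase = agent(codebase, failing)
    return codebase, history


def toy_agent(codebase: Codebase, failing: List[str]) -> Codebase:
    """Toy agent (assumption): repairs only the first failing check per round."""
    patched = dict(codebase)
    patched[failing[0]] = 1
    return patched


tests: Dict[str, Test] = {
    "parse": lambda cb: cb.get("parse", 0) == 1,
    "render": lambda cb: cb.get("render", 0) == 1,
}
final, history = maintain({"parse": 0, "render": 0}, tests, toy_agent)
print(final, history)
```

Because the toy agent fixes one check per round, the failure count falls across rounds instead of reaching zero in one step, which is the kind of iterative behaviour a single-shot bug-fixing setup would not exercise.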