CursorBench 3.1
14 hours ago
- #Code Agents
- #AI Benchmark
- #Performance Evaluation
- CursorBench 3.1 evaluates agents on ambiguous, multi-file tasks from real Cursor sessions, with higher scores indicating better performance.
- Reported scores range from 72.9% (Fable 5 Max) and 72.0% (Fable 5 Extra High) down to 31.9% (Kimi 2.5), with other models in between.
- Average cost per task is computed by applying each model's published per-million-token pricing to the tokens it used on each CursorBench 3.1 task, then averaging across tasks.
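
The cost calculation described above can be sketched in a few lines. This is only an illustration of the arithmetic, not the benchmark's actual implementation: the names `TokenUsage`, `Pricing`, `task_cost`, and `avg_cost_per_task` are hypothetical, the split into separate input and output token prices is an assumption (the source only mentions "per-million-token pricing"), and the prices in the usage example are made up.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TokenUsage:
    """Tokens consumed by a model on one task (hypothetical structure)."""
    input_tokens: int
    output_tokens: int


@dataclass(frozen=True)
class Pricing:
    """Published prices in dollars per million tokens.

    Assumption: input and output tokens are priced separately.
    """
    input_per_million: float
    output_per_million: float


def task_cost(usage: TokenUsage, pricing: Pricing) -> float:
    """Dollar cost of a single task: tokens times per-million price, scaled."""
    return (
        usage.input_tokens * pricing.input_per_million
        + usage.output_tokens * pricing.output_per_million
    ) / 1_000_000


def avg_cost_per_task(usages: Sequence[TokenUsage], pricing: Pricing) -> float:
    """Mean of the per-task costs across all tasks."""
    if not usages:
        raise ValueError("at least one task is required")
    return sum(task_cost(u, pricing) for u in usages) / len(usages)


if __name__ == "__main__":
    # Illustrative numbers only; not real model prices or token counts.
    pricing = Pricing(input_per_million=2.0, output_per_million=10.0)
    usages = [TokenUsage(120_000, 8_000), TokenUsage(45_000, 3_500)]
    print(f"${avg_cost_per_task(usages, pricing):.4f} per task")
```

Averaging per-task costs, rather than dividing total cost by total tokens, matches the "averaged across tasks" wording and gives every task equal weight regardless of its length.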