CursorBench 3.1

a month ago

CursorBench 3.1 evaluates agents on ambiguous, multi-file tasks from real Cursor sessions, with higher scores indicating better performance.
Listed results include Fable 5 Max (72.9%) and Fable 5 Extra High (72.0%), with scores ranging down to 31.9% for Kimi 2.5.
Average cost per task is computed by applying each model's published per-million-token pricing to the tokens it used on CursorBench 3.1 tasks, then averaging across tasks.
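
As a rough illustration of that cost calculation, here is a minimal sketch in Python. It assumes input and output tokens are priced separately, which the summary does not state. The names `Pricing`, `TaskUsage`, `task_cost`, and `avg_cost_per_task` are hypothetical, and the prices and token counts are made-up placeholders rather than benchmark data.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Pricing:
    """Hypothetical published prices, in dollars per million tokens."""
    input_per_million: float
    output_per_million: float


@dataclass(frozen=True)
class TaskUsage:
    """Hypothetical token counts consumed by one benchmark task."""
    input_tokens: int
    output_tokens: int


def task_cost(usage: TaskUsage, pricing: Pricing) -> float:
    """Dollar cost of a single task: tokens times per-million price."""
    dollars_times_million = (
        usage.input_tokens * pricing.input_per_million
        + usage.output_tokens * pricing.output_per_million
    )
    return dollars_times_million / 1_000_000


def avg_cost_per_task(usages: Sequence[TaskUsage], pricing: Pricing) -> float:
    """Mean of the per-task costs (not total cost divided by total tokens)."""
    if not usages:
        raise ValueError("at least one task is required")
    return sum(task_cost(u, pricing) for u in usages) / len(usages)


# Placeholder numbers for illustration only.
example_pricing = Pricing(input_per_million=2.0, output_per_million=10.0)
example_usages = [
    TaskUsage(input_tokens=1_000_000, output_tokens=100_000),
    TaskUsage(input_tokens=500_000, output_tokens=50_000),
]
example_average = avg_cost_per_task(example_usages, example_pricing)
```

Averaging per-task costs, as here, weights every task equally regardless of how many tokens it consumed. That matches the "averaged across tasks" wording, though the benchmark's exact procedure may differ.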

Hasty Briefs (beta)