LongCodeBench: Evaluating Coding LLMs at 1M Context Windows
- #Benchmark
- #Code-Comprehension
- #LLM
- LongCodeBench (LCB) is introduced as a benchmark for evaluating coding LLMs in long-context scenarios, with contexts scaling up to 1M tokens.
- The benchmark tests comprehension and repair capabilities using real-world GitHub issues, comprising a QA task (LongCodeQA) and a bug-fixing task (LongSWE-Bench); a sketch of how such long-context prompts might be assembled follows this list.
- Sharp performance drops are observed in long-context settings, e.g., Claude 3.5 Sonnet falls from 29% to 3%, and Qwen2.5 from 70.2% to 40%.
- The benchmark is stratified by complexity so that models of different scales can be evaluated, from Qwen2.5 14B Instruct up to Google's flagship Gemini model.
- Long-context remains a challenge for all models despite advancements in context length capabilities.
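
A minimal sketch of how a LongCodeQA-style prompt could be assembled: concatenate a repository's source files up to a token budget, then append a question drawn from a GitHub issue. The file selection, the chars-per-token heuristic, and the prompt layout are illustrative assumptions, not the paper's actual pipeline.

```python
from pathlib import Path

def build_long_context_prompt(repo_dir: str, question: str,
                              token_budget: int = 1_000_000) -> str:
    """Concatenate repository files until an approximate token budget is reached,
    then append a repository-level question (hypothetical construction)."""
    chunks = []
    used = 0
    for path in sorted(Path(repo_dir).rglob("*.py")):
        text = path.read_text(errors="ignore")
        approx_tokens = len(text) // 4  # rough heuristic: ~4 characters per token
        if used + approx_tokens > token_budget:
            break
        chunks.append(f"# File: {path}\n{text}")
        used += approx_tokens
    context = "\n\n".join(chunks)
    return f"{context}\n\nQuestion about this repository:\n{question}\nAnswer:"

# Hypothetical usage: the resulting prompt is sent to the model under evaluation.
# prompt = build_long_context_prompt("./some_repo", "Why does the reloader crash on Windows?")
```

Varying the token budget (and how much of the repository is included) is one natural way a benchmark like this could be stratified by complexity and scaled to models with different context limits.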