LongCodeBench: Evaluating Coding LLMs at 1M Context Windows
- #Benchmark
- #Code-Comprehension
- #LLM
- LongCodeBench (LCB) is introduced as a benchmark for evaluating coding LLMs in long-context scenarios, with contexts scaling up to 1M tokens.
- The benchmark tests comprehension and repair capabilities using real-world GitHub issues, comprising a QA task (LongCodeQA) and a bug-fixing task (LongSWE-Bench); a sketch of how such long-context prompts might be assembled follows this list.
- Sharp performance drops are observed in long-context settings, e.g., Claude 3.5 Sonnet falls from 29% to 3%, and Qwen2.5 from 70.2% to 40%.
- The benchmark is stratified by complexity so that models of different scales can be evaluated, from Qwen2.5 14B Instruct up to Google's flagship Gemini model.
- Long-context remains a challenge for all models despite advancements in context length capabilities.
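
A minimal sketch of how a LongCodeQA-style prompt could be assembled: concatenate a repository's source files up to a token budget, then append a question drawn from a GitHub issue. The file selection, the chars-per-token heuristic, and the prompt layout are illustrative assumptions, not the paper's actual pipeline.

```python
from pathlib import Path

def build_long_context_prompt(repo_dir: str, question: str,
                              token_budget: int = 1_000_000) -> str:
    """Concatenate repository files until an approximate token budget is reached,
    then append a repository-level question (hypothetical construction)."""
    chunks = []
    used = 0
    for path in sorted(Path(repo_dir).rglob("*.py")):
        text = path.read_text(errors="ignore")
        approx_tokens = len(text) // 4  # rough heuristic: ~4 characters per token
        if used + approx_tokens > token_budget:
            break
        chunks.append(f"# File: {path}\n{text}")
        used += approx_tokens
    context = "\n\n".join(chunks)
    return f"{context}\n\nQuestion about this repository:\n{question}\nAnswer:"

# Hypothetical usage: the resulting prompt is sent to the model under evaluation.
# prompt = build_long_context_prompt("./some_repo", "Why does the reloader crash on Windows?")
```

Varying the token budget (and how much of the repository is included) is one natural way a benchmark like this could be stratified by complexity and scaled to models with different context limits.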