Hasty Briefs

LongCodeBench: Evaluating Coding LLMs at 1M Context Windows

a year ago
  • #Benchmark
  • #Code-Comprehension
  • #LLM
  • LongCodeBench (LCB) is a benchmark for evaluating coding LLMs in long-context scenarios.
  • Built from real-world GitHub issues, it tests both comprehension and repair capabilities through question-answering (LongCodeQA) and bug-fixing (LongSWE-Bench) tasks.
  • Performance drops sharply in long-context scenarios: Claude 3.5 Sonnet falls from 29% to 3%, and Qwen2.5 from 70.2% to 40%.
  • The benchmark is stratified by complexity to evaluate models across different scales, from Qwen2.5 14B Instruct to Google's Gemini model.
  • Long-context reasoning remains a challenge for all models, despite rapid growth in advertised context-window sizes.
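To make the stratified-evaluation idea above concrete, here is a minimal sketch of how one might bucket task results by context length and compute per-bucket accuracy. The records, bucket edges, and function names are illustrative assumptions, not LCB's actual data or harness.

```python
from collections import defaultdict

# Hypothetical (context_tokens, correct) records; LCB itself draws
# its tasks from real GitHub issues, not synthetic data like this.
RESULTS = [
    (8_000, True), (8_000, True), (8_000, False),
    (128_000, True), (128_000, False), (128_000, False),
    (1_000_000, True), (1_000_000, False), (1_000_000, False),
]

# Context-length bucket edges (illustrative, not the paper's strata).
BUCKETS = [32_000, 256_000, 1_000_000]

def bucket_for(tokens: int) -> int:
    """Return the smallest bucket edge that fits the context size."""
    for edge in BUCKETS:
        if tokens <= edge:
            return edge
    return BUCKETS[-1]

def stratified_accuracy(results):
    """Accuracy per context-length bucket, as {bucket_edge: fraction}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for tokens, correct in results:
        b = bucket_for(tokens)
        totals[b] += 1
        hits[b] += int(correct)
    return {b: hits[b] / totals[b] for b in sorted(totals)}

print(stratified_accuracy(RESULTS))
```

Reporting accuracy per bucket, rather than a single aggregate score, is what exposes the kind of degradation the benchmark observes as context length grows.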