Hasty Briefs (beta)

DeepCodeBench: Real-World Codebase Understanding by Q&A Benchmarking

9 hours ago
  • #code-understanding
  • #retrieval-systems
  • #benchmarking
  • Qodo has created DeepCodeBench, a benchmark dataset for real-world codebase understanding derived from large, complex repositories.
  • The dataset includes 1,144 question-answer pairs generated from pull requests (PRs) in eight open-source repositories.
  • Questions require deep retrieval across multiple files, reflecting realistic developer queries.
  • PRs were used as sources for generating questions because they naturally link related code changes.
  • The dataset includes metadata, context, and prompts used for question and answer generation.
  • Evaluation uses 'fact recall', which scores a model objectively by checking whether its answer contains the discrete facts in the reference answer.
  • Baselines include the ground-truth answers, an LLM given full context, and an LLM given no context.
  • Qodo's deep-research agent achieved the highest fact recall (~76%), outperforming Codex (~74%) and Claude (~64%).
  • The dataset is designed to challenge retrieval systems with broad and deep questions about codebase functionality.
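
The 'fact recall' metric mentioned above can be sketched as the share of reference facts that a verifier finds in a model's answer. This is a minimal illustration and not Qodo's implementation: the names `fact_recall` and `naive_judge` are hypothetical, the sample facts are made up, and the substring check is only a stand-in for whatever verifier (for example an LLM judge) the benchmark actually uses.

```python
from typing import Callable, Iterable


def fact_recall(
    facts: Iterable[str],
    answer: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Return the fraction of reference facts the judge finds in the answer."""
    facts = list(facts)
    if not facts:
        raise ValueError("need at least one reference fact")
    hits = sum(1 for fact in facts if is_supported(fact, answer))
    return hits / len(facts)


def naive_judge(fact: str, answer: str) -> bool:
    # Assumption: a case-insensitive substring match stands in for a real verifier.
    return fact.lower() in answer.lower()


# Hypothetical reference facts and model answer, for illustration only.
facts = [
    "retries are capped at 3",
    "backoff is exponential",
    "errors are logged",
]
answer = "Retries are capped at 3 and backoff is exponential."

score = fact_recall(facts, answer, naive_judge)
print(f"fact recall: {score:.2f}")
```

Passing the judge in as a callable keeps the scoring arithmetic separate from the verification step, so a stricter verifier could be swapped in without touching `fact_recall`.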