DeepCodeBench: Real-World Codebase Understanding by Q&A Benchmarking
9 hours ago
- #code-understanding
- #retrieval-systems
- #benchmarking
- Qodo has created DeepCodeBench, a benchmark dataset for real-world codebase understanding derived from large, complex repositories.
- The dataset includes 1,144 question-answer pairs generated from pull requests (PRs) in eight open-source repositories.
- Questions require deep retrieval across multiple files, reflecting realistic developer queries.
- PRs were used as sources for generating questions because they naturally link related code changes.
- The dataset includes metadata, context, and prompts used for question and answer generation.
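- Evaluation uses 'fact recall': each ground-truth answer is broken into discrete facts, and a model's answer is scored by the share of those facts it contains, which gives a more objective measure than grading free-form text as a whole.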
- Baselines include ground truth answers, LLM with full context, and LLM with no context.
- Qodo's deep-research agent achieved the highest fact recall (~76%), outperforming Codex (~74%) and Claude (~64%).
- The dataset is designed to challenge retrieval systems with broad and deep questions about codebase functionality.
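
The 'fact recall' metric mentioned above can be illustrated with a minimal sketch. The function names and the word-overlap check are assumptions made for illustration only; the summary does not describe how the benchmark actually verifies each fact (in practice this would likely be a model-based judge), so `naive_fact_supported` is just a hypothetical stand-in that can be swapped out.

```python
import re
from typing import Callable, List


def _words(text: str) -> set:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def naive_fact_supported(fact: str, answer: str) -> bool:
    """Hypothetical stand-in for a judge: a fact counts as supported
    only if every word in it also appears somewhere in the answer."""
    return _words(fact) <= _words(answer)


def fact_recall(
    facts: List[str],
    answer: str,
    is_supported: Callable[[str, str], bool] = naive_fact_supported,
) -> float:
    """Return the fraction of ground-truth facts found in the answer."""
    if not facts:
        raise ValueError("at least one ground-truth fact is required")
    supported = sum(1 for fact in facts if is_supported(fact, answer))
    return supported / len(facts)


# Illustrative data, not taken from the dataset.
facts = ["cache uses lru eviction", "default size is 128"]
answer = "The cache uses LRU eviction by default."
print(fact_recall(facts, answer))
```

Passing the judge as a parameter keeps the scoring arithmetic separate from the question of how a single fact is verified, so a stricter or model-based check can replace the naive one without changing `fact_recall`.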