DeepCodeBench: Real-World Codebase Understanding by Q&A Benchmarking

9 hours ago

Qodo has created DeepCodeBench, a benchmark dataset for real-world codebase understanding derived from large, complex repositories.
The dataset includes 1,144 question-answer pairs generated from pull requests (PRs) in eight open-source repositories.
Questions require deep retrieval across multiple files, reflecting realistic developer queries.
PRs were used as sources for generating questions because they naturally link related code changes.
The dataset includes metadata, context, and prompts used for question and answer generation.
Evaluation uses 'fact recall' to objectively assess model performance by verifying discrete facts in answers.
Baselines include the ground-truth answers, an LLM given the full context, and an LLM given no context.
Qodo's deep-research agent achieved the highest fact recall (~76%), outperforming Codex (~74%) and Claude (~64%).
The dataset is designed to challenge retrieval systems with broad and deep questions about codebase functionality.
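
The 'fact recall' metric mentioned above can be illustrated with a minimal sketch: the share of discrete reference facts that are verified as present in a model's answer. Everything named here is an assumption for illustration and not Qodo's implementation: the function names `fact_recall` and `naive_judge`, the sample facts, and in particular the substring check, which merely stands in for whatever verifier the benchmark actually uses.

```python
from typing import Callable, Iterable


def naive_judge(fact: str, answer: str) -> bool:
    """Hypothetical stand-in verifier: case-insensitive substring match.

    A real evaluation would need a far more robust check; this is only
    a placeholder so the sketch runs end to end.
    """
    return fact.lower() in answer.lower()


def fact_recall(
    facts: Iterable[str],
    answer: str,
    judge: Callable[[str, str], bool] = naive_judge,
) -> float:
    """Return the fraction of reference facts the judge finds in the answer."""
    facts = list(facts)
    if not facts:
        raise ValueError("at least one reference fact is required")
    verified = sum(1 for fact in facts if judge(fact, answer))
    return verified / len(facts)


# Made-up example data, not taken from the dataset.
facts = [
    "uses a retry queue",
    "logs to stderr",
    "timeout is 30 seconds",
]
answer = "The worker uses a retry queue and the timeout is 30 seconds."
score = fact_recall(facts, answer)
print(f"fact recall: {score:.2f}")
```

Passing the verifier in as a callable keeps the scoring arithmetic separate from the question of how a single fact is checked, so a stricter judge can be swapped in without changing the metric itself.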
