EsoLang-Bench: Evaluating Genuine Reasoning in LLMs via Esoteric Languages
- #benchmark
- #code-generation
- #LLM
- Current benchmarks for LLM code generation focus on mainstream languages like Python, whose abundant training data inflates accuracy scores.
- EsoLang-Bench introduces 80 programming problems across five esoteric languages (Brainfuck, Befunge-98, Whitespace, Unlambda, Shakespeare) with scarce training data.
- Frontier models achieve only 3.8% overall accuracy on esoteric languages compared to ~90% on Python tasks.
- All models score 0% on problems above the Easy tier, with Whitespace remaining completely unsolved (0% across all configurations).
- Self-reflection provides no benefit, and few-shot prompting yields no significant improvement over zero-shot.
- Direct interpreter feedback outperforms multi-agent approaches, while tool-augmented agents achieve ~2× the accuracy of prompting-only approaches.
- Models exhibit distinct failure profiles per language, with logic, compile, and runtime errors dominating.
- The benchmark's 80 problems span four difficulty tiers; each problem includes six test cases and is implemented in all five esoteric languages.
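- The paper's harness isn't shown; as a minimal sketch, a tiny Brainfuck interpreter (one of the five languages) illustrates how a generated program can be checked against a test case, the kind of direct interpreter feedback the bullets describe. The `run_bf` helper and the sample "print A" task are illustrative assumptions, not artifacts of the benchmark.

```python
def run_bf(code: str, tape_len: int = 30000) -> str:
    """Minimal Brainfuck interpreter (illustrative, not the benchmark's)."""
    # Precompute matching bracket positions for O(1) loop jumps.
    jumps, stack = {}, []
    for i, c in enumerate(code):
        if c == '[':
            stack.append(i)
        elif c == ']':
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    tape = [0] * tape_len
    ptr = pc = 0
    out = []
    while pc < len(code):
        c = code[pc]
        if c == '+':
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == '-':
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == '>':
            ptr += 1
        elif c == '<':
            ptr -= 1
        elif c == '.':
            out.append(chr(tape[ptr]))
        elif c == '[' and tape[ptr] == 0:
            pc = jumps[pc]  # skip loop body when cell is zero
        elif c == ']' and tape[ptr] != 0:
            pc = jumps[pc]  # jump back to loop start when cell is nonzero
        pc += 1
    return ''.join(out)

# A candidate solution for a hypothetical "print A" task, checked the way
# each benchmark test case would be: run it, compare observed vs expected.
program = "++++++++[>++++++++<-]>+."  # 8*8 + 1 = 65 = 'A'
assert run_bf(program) == "A"
```

  Feeding the mismatch between observed and expected output back to the model is what the interpreter-feedback configuration above amounts to.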