Hasty Briefs

EsoLang-Bench: Evaluating Genuine Reasoning in LLMs via Esoteric Languages

6 hours ago
  • #benchmark
  • #code-generation
  • #LLM
  • Current benchmarks for LLM code generation focus on mainstream languages like Python, leading to inflated accuracy scores.
  • EsoLang-Bench introduces 80 programming problems across five esoteric languages (Brainfuck, Befunge-98, Whitespace, Unlambda, Shakespeare) with scarce training data.
  • Frontier models achieve only 3.8% overall accuracy on esoteric languages compared to ~90% on Python tasks.
  • All models score 0% on problems above the Easy tier, with Whitespace remaining completely unsolved (0% across all configurations).
  • Self-reflection provides no benefit, and few-shot prompting yields no significant improvement over zero-shot.
  • Direct interpreter feedback outperforms multi-agent approaches, while tool-augmented agents achieve ~2× the accuracy of prompting-only approaches.
  • Models exhibit distinct failure profiles per language, with logic, compile, and runtime errors dominating.
  • The benchmark's 80 problems span four difficulty tiers, with each problem carrying 6 test cases and implemented in all five esoteric languages.
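
The summary notes that direct interpreter feedback beats multi-agent setups, which implies running each candidate program against test cases and returning the raw output or error to the model. The brief does not show the paper's actual harness; as an illustration, here is a minimal Brainfuck interpreter sketch (one of the five benchmark languages) of the kind such a feedback loop could be built on. The function name `run_bf` and the step/tape limits are assumptions, not from the paper.

```python
def run_bf(code: str, input_bytes: bytes = b"", max_steps: int = 100_000) -> str:
    """Minimal Brainfuck interpreter: 8 commands over a 30k-cell byte tape."""
    # Precompute matching bracket positions so loops jump in O(1).
    jumps, stack = {}, []
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i

    tape = [0] * 30_000
    ptr = pc = inp = steps = 0
    out = []
    while pc < len(code) and steps < max_steps:  # step cap guards non-halting programs
        c = code[pc]
        if c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256  # byte cells wrap around
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":
            out.append(chr(tape[ptr]))
        elif c == ",":
            tape[ptr] = input_bytes[inp] if inp < len(input_bytes) else 0
            inp += 1
        elif c == "[" and tape[ptr] == 0:
            pc = jumps[pc]  # skip loop body when cell is zero
        elif c == "]" and tape[ptr] != 0:
            pc = jumps[pc]  # jump back while cell is nonzero
        pc += 1
        steps += 1
    return "".join(out)


# Example: set cell 0 to 8, add 8 to cell 1 eight times (64), then +1 -> 65 ('A').
print(run_bf("++++++++[>++++++++<-]>+."))  # prints "A"
```

In a feedback loop, the harness would compare this output against the expected test-case output and hand any mismatch or crash back to the model verbatim, which is the "direct interpreter feedback" configuration the summary describes.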