Hasty Briefs

EsoLang-Bench: Evaluating Genuine Reasoning in LLMs via Esoteric Languages

6 hours ago
  • #benchmark
  • #code-generation
  • #LLM
  • Current benchmarks for LLM code generation focus on mainstream languages like Python, leading to inflated accuracy scores.
  • EsoLang-Bench introduces 80 programming problems across five esoteric languages (Brainfuck, Befunge-98, Whitespace, Unlambda, Shakespeare) with scarce training data.
  • Frontier models achieve only 3.8% overall accuracy on esoteric languages compared to ~90% on Python tasks.
  • All models score 0% on problems above the Easy tier, with Whitespace remaining completely unsolved (0% across all configurations).
  • Self-reflection provides no benefit, and few-shot prompting yields no significant improvement over zero-shot.
  • Direct interpreter feedback outperforms multi-agent approaches, while tool-augmented agents achieve ~2× the accuracy of prompting-only approaches.
  • Models exhibit distinct failure profiles per language, with logic, compile, and runtime errors dominating.
  • The benchmark's 80 problems span four difficulty tiers, with each problem carrying 6 test cases and implemented in all five esoteric languages.
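
The summary notes that direct interpreter feedback beats multi-agent setups, which implies running each candidate program against test cases and returning the raw output or error to the model. The brief does not show the paper's actual harness; as an illustration, here is a minimal Brainfuck interpreter sketch (one of the five benchmark languages) of the kind such a feedback loop could be built on. The function name `run_bf` and the step/tape limits are assumptions, not from the paper.

```python
def run_bf(code: str, input_bytes: bytes = b"", max_steps: int = 100_000) -> str:
    """Minimal Brainfuck interpreter: 8 commands over a 30k-cell byte tape."""
    # Precompute matching bracket positions so loops jump in O(1).
    jumps, stack = {}, []
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i

    tape = [0] * 30_000
    ptr = pc = inp = steps = 0
    out = []
    while pc < len(code) and steps < max_steps:  # step cap guards non-halting programs
        c = code[pc]
        if c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256  # byte cells wrap around
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":
            out.append(chr(tape[ptr]))
        elif c == ",":
            tape[ptr] = input_bytes[inp] if inp < len(input_bytes) else 0
            inp += 1
        elif c == "[" and tape[ptr] == 0:
            pc = jumps[pc]  # skip loop body when cell is zero
        elif c == "]" and tape[ptr] != 0:
            pc = jumps[pc]  # jump back while cell is nonzero
        pc += 1
        steps += 1
    return "".join(out)


# Example: set cell 0 to 8, add 8 to cell 1 eight times (64), then +1 -> 65 ('A').
print(run_bf("++++++++[>++++++++<-]>+."))  # prints "A"
```

In a feedback loop, the harness would compare this output against the expected test-case output and hand any mismatch or crash back to the model verbatim, which is the "direct interpreter feedback" configuration the summary describes.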