Show HN: A new benchmark for testing LLMs for deterministic outputs
- #Benchmarking
- #Structured Output
- #LLM Evaluation
- Current benchmarks for LLM structured output often fall short by focusing only on schema compliance, missing crucial aspects like value accuracy and real-world input diversity.
- SOB (Structured Output Benchmark) introduces a comprehensive evaluation across three modalities (text, image, audio) with seven metrics to isolate pure extraction capability.
- Key metrics include Value Accuracy (most important for production), Faithfulness, JSON Pass Rate, and Perfect Response, revealing significant gaps between parsing success and correct value extraction.
- Results show top models like GPT-5.4, GLM-4.7, and Qwen3.5-35B score closely overall, but rankings shift by metric and modality, with audio proving the most challenging.
- Future directions for SOB include expanding schemas, adding reasoning chains, multilingual support, and real-time updates to improve structured output measurement and model development.
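To make the distinction between parsing success and value correctness concrete, here is a minimal sketch of how two of the metrics above could be computed. The function names and the exact-match scoring rule are illustrative assumptions, not SOB's actual implementation:

```python
import json

def json_pass(raw: str):
    """JSON Pass Rate component: does the raw model output parse as JSON?

    Returns the parsed object, or None if parsing fails.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None

def value_accuracy(pred: dict, gold: dict) -> float:
    """Value Accuracy component: fraction of gold fields whose extracted
    value matches exactly (a simplifying assumption; a real benchmark
    may normalize strings, numbers, and nested structures)."""
    if not gold:
        return 1.0
    return sum(pred.get(k) == v for k, v in gold.items()) / len(gold)

# Hypothetical example: the output is schema-valid JSON (counts toward
# JSON Pass Rate) yet extracts one of two values incorrectly, so Value
# Accuracy is only 0.5 -- the gap the post highlights.
gold = {"name": "Ada Lovelace", "year": 1815}
raw = '{"name": "Ada Lovelace", "year": 1816}'
parsed = json_pass(raw)
accuracy = value_accuracy(parsed, gold)  # 0.5
```

A "Perfect Response" in this framing would require both a successful parse and a Value Accuracy of 1.0 on every field.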