Hasty Briefs (beta)

Show HN: A new benchmark for testing LLMs for deterministic outputs

5 hours ago
  • #Benchmarking
  • #Structured Output
  • #LLM Evaluation
  • Current benchmarks for LLM structured output often fall short by focusing only on schema compliance, missing crucial aspects like value accuracy and real-world input diversity.
  • SOB (Structured Output Benchmark) introduces a comprehensive evaluation across three modalities (text, image, audio) with seven metrics to isolate pure extraction capability.
  • Key metrics include Value Accuracy (most important for production), Faithfulness, JSON Pass Rate, and Perfect Response, revealing significant gaps between parsing success and correct value extraction.
  • Results show top models like GPT-5.4, GLM-4.7, and Qwen3.5-35B score closely overall, but rankings shift by metric and modality, with audio proving the most challenging.
  • Future directions for SOB include expanding schemas, adding reasoning chains, multilingual support, and real-time updates to improve structured output measurement and model development.
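The gap the benchmark highlights between parsing success and correct value extraction can be sketched with a toy example. This is not SOB's actual code; the schema, helper names, and scoring are hypothetical, illustrating only why a JSON pass rate can look healthy while value accuracy lags:

```python
import json

# Hypothetical toy schema (assumed for illustration only).
SCHEMA_KEYS = {"name": str, "price": float}

def json_pass(raw: str):
    """Schema-compliance check: does the output parse and fit the schema?"""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if set(obj) != set(SCHEMA_KEYS):
        return None
    if not all(isinstance(obj[k], t) for k, t in SCHEMA_KEYS.items()):
        return None
    return obj

def value_accuracy(pred: dict, gold: dict) -> float:
    """Fraction of fields whose extracted value matches the reference."""
    return sum(pred.get(k) == v for k, v in gold.items()) / len(gold)

gold = {"name": "Widget", "price": 9.99}
responses = [
    '{"name": "Widget", "price": 9.99}',   # parses, all values correct
    '{"name": "Widget", "price": 19.99}',  # parses, but price is wrong
    'name: Widget, price: 9.99',           # not valid JSON at all
]

parsed = [json_pass(r) for r in responses]
pass_rate = sum(p is not None for p in parsed) / len(responses)
accuracy = sum(value_accuracy(p, gold) for p in parsed if p) / len(responses)

print(pass_rate)  # 2 of 3 responses are schema-compliant
print(accuracy)   # yet average value accuracy is only 0.5
```

Scoring schema compliance alone would rate the second response a success; a value-accuracy metric catches the wrong price, which is the distinction the benchmark's Value Accuracy metric is meant to capture.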