Show HN: A new benchmark for testing LLMs for deterministic outputs
- #Benchmarking
- #Structured Output
- #LLM Evaluation
- Current benchmarks for LLM structured output often fall short by focusing only on schema compliance, missing crucial aspects like value accuracy and real-world input diversity.
- SOB (Structured Output Benchmark) introduces a comprehensive evaluation across three modalities (text, image, audio) with seven metrics to isolate pure extraction capability.
- Key metrics include Value Accuracy (most important for production), Faithfulness, JSON Pass Rate, and Perfect Response, revealing significant gaps between parsing success and correct value extraction.
- Results show top models like GPT-5.4, GLM-4.7, and Qwen3.5-35B score closely overall, but rankings shift by metric and modality, with audio proving the most challenging.
- Future directions for SOB include expanding schemas, adding reasoning chains, multilingual support, and real-time updates to improve structured output measurement and model development.
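To make the distinction between parsing success and value correctness concrete, here is a minimal sketch of how two of the metrics above could be computed. The function names and the exact-match scoring rule are illustrative assumptions, not SOB's actual implementation:

```python
import json

def json_pass(raw: str):
    """JSON Pass Rate component: does the raw model output parse as JSON?

    Returns the parsed object, or None if parsing fails.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None

def value_accuracy(pred: dict, gold: dict) -> float:
    """Value Accuracy component: fraction of gold fields whose extracted
    value matches exactly (a simplifying assumption; a real benchmark
    may normalize strings, numbers, and nested structures)."""
    if not gold:
        return 1.0
    return sum(pred.get(k) == v for k, v in gold.items()) / len(gold)

# Hypothetical example: the output is schema-valid JSON (counts toward
# JSON Pass Rate) yet extracts one of two values incorrectly, so Value
# Accuracy is only 0.5 -- the gap the post highlights.
gold = {"name": "Ada Lovelace", "year": 1815}
raw = '{"name": "Ada Lovelace", "year": 1816}'
parsed = json_pass(raw)
accuracy = value_accuracy(parsed, gold)  # 0.5
```

A "Perfect Response" in this framing would require both a successful parse and a Value Accuracy of 1.0 on every field.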