OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)
- #OpenTelemetry
- #Distributed Tracing
- #AI Benchmarking
- Benchmarking OpenTelemetry instrumentation with AI models across 11 programming languages.
- OpenTelemetry (OTel) is the industry standard for collecting telemetry data and is the foundation of distributed tracing (a minimal instrumentation sketch follows this list).
- 14 frontier LLMs tested on 23 realistic OpenTelemetry tasks, costing $522 in LLM tokens.
- The top model, Claude Opus 4.5, achieved only a 29% success rate on these simple tasks.
- Common failure: AI models mechanically instrument HTTP calls without capturing the surrounding business context (contrasted in the second sketch below).
- Language gaps observed, with models failing completely on Java, Ruby, and Swift.
- Cost efficiency: Gemini 3 Flash outperformed Gemini 3 Pro at a fraction of the cost.
- AI SRE capabilities in 2026 are still limited, mirroring findings from ClickHouse.
- Need for standardized benchmarks in distributed systems, similar to SWE-Bench.
- Open-source OTelBench released for community contributions and further testing.
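For readers unfamiliar with what "instrumentation" means here, below is a minimal sketch of manual OpenTelemetry instrumentation in Python, roughly the kind of change the benchmark tasks ask a model to make. It assumes the `opentelemetry-api` and `opentelemetry-sdk` packages; the service name, span name, and attributes are illustrative and not taken from OTelBench itself.

```python
# Minimal manual OpenTelemetry instrumentation sketch (illustrative, not benchmark code).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider that prints finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def process_order(order_id: str, amount: float) -> None:
    # Wrap the business operation in a span so it shows up in a distributed trace.
    with tracer.start_as_current_span("process_order") as span:
        # Business-context attributes: the part an SRE actually searches for.
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.amount", amount)
        span.add_event("order processed")

process_order("ord-123", 49.99)
```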
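And a sketch of the failure mode described above: a purely mechanical HTTP span versus one that also records the business operation. The function names and the `payment.*` attributes are hypothetical; `http.request.method` and `url.full` follow the OpenTelemetry HTTP semantic conventions.

```python
# Contrast sketch for the "mechanical instrumentation" failure mode (illustrative only).
from opentelemetry import trace

tracer = trace.get_tracer("payments-service")

def charge_card_mechanical(url: str) -> None:
    # What the post describes models producing: the outbound call is traced,
    # but nothing in the span says why the call happens.
    with tracer.start_as_current_span("HTTP POST") as span:
        span.set_attribute("http.request.method", "POST")
        span.set_attribute("url.full", url)
        # ... perform the request ...

def charge_card_with_context(url: str, customer_id: str, amount_cents: int) -> None:
    # The same call, but the span name and attributes carry the business meaning
    # needed to debug a payment incident from the trace alone.
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("http.request.method", "POST")
        span.set_attribute("url.full", url)
        span.set_attribute("payment.customer_id", customer_id)
        span.set_attribute("payment.amount_cents", amount_cents)
        # ... perform the request ...
```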