OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)
- #OpenTelemetry
- #Distributed Tracing
- #AI Benchmarking
- Benchmarking OpenTelemetry instrumentation with AI models across 11 programming languages.
- OpenTelemetry (OTel) is the industry standard for collecting telemetry data and is the foundation of distributed tracing (a minimal instrumentation sketch follows this list).
- 14 frontier LLMs tested on 23 realistic OpenTelemetry tasks, costing $522 in LLM tokens.
- The top model, Claude Opus 4.5, achieved only a 29% success rate on these simple tasks.
- Common failure: AI models mechanically instrument HTTP calls without capturing the surrounding business context (contrasted in the second sketch below).
- Language gaps observed, with models failing completely on Java, Ruby, and Swift.
- Cost efficiency: Gemini 3 Flash outperformed Gemini 3 Pro at a fraction of the cost.
- AI SRE capabilities in 2026 are still limited, mirroring findings from ClickHouse.
- Need for standardized benchmarks in distributed systems, similar to SWE-Bench.
- Open-source OTelBench released for community contributions and further testing.
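For readers unfamiliar with what "instrumentation" means here, below is a minimal sketch of manual OpenTelemetry instrumentation in Python, roughly the kind of change the benchmark tasks ask a model to make. It assumes the `opentelemetry-api` and `opentelemetry-sdk` packages; the service name, span name, and attributes are illustrative and not taken from OTelBench itself.

```python
# Minimal manual OpenTelemetry instrumentation sketch (illustrative, not benchmark code).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider that prints finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def process_order(order_id: str, amount: float) -> None:
    # Wrap the business operation in a span so it shows up in a distributed trace.
    with tracer.start_as_current_span("process_order") as span:
        # Business-context attributes: the part an SRE actually searches for.
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.amount", amount)
        span.add_event("order processed")

process_order("ord-123", 49.99)
```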
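And a sketch of the failure mode described above: a purely mechanical HTTP span versus one that also records the business operation. The function names and the `payment.*` attributes are hypothetical; `http.request.method` and `url.full` follow the OpenTelemetry HTTP semantic conventions.

```python
# Contrast sketch for the "mechanical instrumentation" failure mode (illustrative only).
from opentelemetry import trace

tracer = trace.get_tracer("payments-service")

def charge_card_mechanical(url: str) -> None:
    # What the post describes models producing: the outbound call is traced,
    # but nothing in the span says why the call happens.
    with tracer.start_as_current_span("HTTP POST") as span:
        span.set_attribute("http.request.method", "POST")
        span.set_attribute("url.full", url)
        # ... perform the request ...

def charge_card_with_context(url: str, customer_id: str, amount_cents: int) -> None:
    # The same call, but the span name and attributes carry the business meaning
    # needed to debug a payment incident from the trace alone.
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("http.request.method", "POST")
        span.set_attribute("url.full", url)
        span.set_attribute("payment.customer_id", customer_id)
        span.set_attribute("payment.amount_cents", amount_cents)
        # ... perform the request ...
```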