
OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)

9 days ago
  • #OpenTelemetry
  • #Distributed Tracing
  • #AI Benchmarking
  • OTelBench benchmarks how well AI models add OpenTelemetry instrumentation to code across 11 programming languages.
  • OpenTelemetry (OTel) is the industry standard for telemetry data, aiding in distributed tracing.
  • 14 frontier LLMs were tested on 23 realistic OpenTelemetry tasks, at a total cost of $522 in LLM tokens.
  • The top model, Claude Opus 4.5, achieved only a 29% success rate on simple tasks.
  • Common failure: AI models mechanically instrument HTTP calls without understanding business context.
  • Language gaps were observed: models failed completely on Java, Ruby, and Swift.
  • Cost efficiency: Gemini 3 Flash outperformed Gemini 3 Pro at a fraction of the cost.
  • AI SRE capabilities in 2026 are still limited, mirroring findings from ClickHouse.
  • The results point to a need for standardized benchmarks in distributed systems, similar to SWE-Bench.
  • OTelBench has been released as open source for community contributions and further testing.
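
The "common failure" bullet above can be made concrete with a small sketch. This is an illustration under stated assumptions, not code from OTelBench and not the OpenTelemetry SDK: `start_span`, `http_get`, `checkout_mechanical`, and `checkout_with_context` are hypothetical names, the URLs are placeholders, and the tracer is a stdlib-only toy that models only parent/child span relationships. It assumes that "business context" means wrapping related calls in one span named after the business operation.

```python
import contextvars
import uuid
from contextlib import contextmanager

# Toy stand-in for a tracer: NOT the OpenTelemetry SDK, only the parent/child idea.
_current_span = contextvars.ContextVar("current_span", default=None)
finished_spans = []  # spans are appended here when they end


@contextmanager
def start_span(name, **attributes):
    """Open a span; it joins the active span's trace if one exists."""
    parent = _current_span.get()
    span = {
        "name": name,
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "attributes": attributes,
    }
    token = _current_span.set(span)
    try:
        yield span
    finally:
        _current_span.reset(token)
        finished_spans.append(span)


def http_get(url):
    # Hypothetical HTTP helper; no network call is made.
    with start_span("HTTP GET", url=url):
        return 200


def checkout_mechanical(order_id):
    # Only the HTTP calls are instrumented: each one becomes its own
    # root span, so the two requests land in unrelated traces.
    http_get("http://inventory.local/reserve")
    http_get("http://payments.local/charge")


def checkout_with_context(order_id):
    # One business-level span groups both calls into a single trace
    # and records which order they belong to.
    with start_span("checkout", order_id=order_id):
        http_get("http://inventory.local/reserve")
        http_get("http://payments.local/charge")
```

Calling `checkout_mechanical` leaves two spans with different trace IDs in `finished_spans`, while `checkout_with_context` leaves three spans that share one trace ID, with the `checkout` span as the parent of both HTTP spans. A real implementation would use the OpenTelemetry API for the target language instead of this toy tracer.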