Optimizing Tail Sampling in OpenTelemetry with Retroactive Sampling
4 days ago
- #Distributed Tracing
- #OpenTelemetry
- #Performance Optimization
- VictoriaMetrics presented retroactive sampling at KubeCon Europe 2026 to reduce costs in OpenTelemetry trace collection.
- Retroactive sampling sends only the minimal span attributes needed for a decision (e.g., trace_id, status_code) to the collector, while the raw span data stays buffered on edge agents.
- It lowers network traffic by up to 70% and reduces CPU and memory usage by 60–70% compared to tail sampling.
- Edge agents use an on-disk FIFO queue instead of in-memory buffers, cutting memory pressure and enabling efficient data retrieval for sampled traces.
- In a benchmark at 15,000–30,000 spans/s, retroactive sampling used 1.7 GB of disk, versus 4 GB of memory for tail sampling.
- Limitation: the collector has less context for its decisions, which matters when many attributes are needed; hybrid approaches can combine agent-side and collector-side decisions.
- Disk-based designs (like Pebble in OpenTelemetry) also reduce memory usage, but at the cost of higher CPU usage.
- VictoriaMetrics plans to donate retroactive sampling as an OpenTelemetry processor and integrate it into vtagent in 2026.
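
The flow described in the points above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the VictoriaMetrics implementation: the `EdgeAgent` and `Collector` classes, the span field names, and the "keep any trace with an ERROR span" policy are all hypothetical. An append-only file stands in for the on-disk FIFO queue, and eviction of old entries is omitted.

```python
import json
import os
import tempfile
from collections import defaultdict


class EdgeAgent:
    """Hypothetical agent: buffers raw spans on disk, emits small digests."""

    def __init__(self, path):
        self.path = path
        self.end = 0  # current file size in bytes
        self.offsets = defaultdict(list)  # trace_id -> byte offsets of its spans
        open(path, "wb").close()  # create an empty buffer file

    def record(self, span):
        """Append the raw span to disk; return only the decision attributes."""
        line = (json.dumps(span) + "\n").encode("utf-8")
        with open(self.path, "ab") as f:
            f.write(line)
        self.offsets[span["trace_id"]].append(self.end)
        self.end += len(line)
        return {"trace_id": span["trace_id"], "status_code": span["status_code"]}

    def fetch(self, sampled_trace_ids):
        """Read back the raw spans of the traces the collector chose to keep."""
        spans = []
        with open(self.path, "rb") as f:
            for trace_id in sampled_trace_ids:
                for offset in self.offsets.get(trace_id, []):
                    f.seek(offset)
                    spans.append(json.loads(f.readline()))
        return spans


class Collector:
    """Hypothetical collector: decides from digests only (assumed policy)."""

    def __init__(self):
        self.keep = set()

    def observe(self, digest):
        # Assumed policy: keep every trace that contains an ERROR span.
        if digest["status_code"] == "ERROR":
            self.keep.add(digest["trace_id"])

    def decisions(self):
        return sorted(self.keep)


def demo():
    spans = [
        {"trace_id": "t1", "span_id": "a", "status_code": "OK", "payload": "x" * 100},
        {"trace_id": "t2", "span_id": "b", "status_code": "ERROR", "payload": "y" * 100},
        {"trace_id": "t2", "span_id": "c", "status_code": "OK", "payload": "z" * 100},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        agent = EdgeAgent(os.path.join(tmp, "spans.buf"))
        collector = Collector()
        for span in spans:
            # Only the digest crosses the network; the payload stays on the agent.
            collector.observe(agent.record(span))
        return agent.fetch(collector.decisions())


if __name__ == "__main__":
    for span in demo():
        print(span["trace_id"], span["span_id"])
```

Only the two-field digest reaches the collector here, which is where the network savings would come from. The same sketch also shows the limitation noted above: a policy that needs more attributes would require a larger digest.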