The Mirror Is Part of the Machine
8 hours ago
- #telemetry-cost
- #system-design
- #observability
- Telemetry problems often start with fields added after incidents. The additions pile up as clutter until someone cleans them up, and the add-then-clean cycle repeats.
- Telemetry is often treated as exhaust: logs and metrics are seen as byproducts and are not reviewed as rigorously as other architectural changes, which hides their overhead.
- High telemetry costs mirror architectural issues such as too many components and excessive retries, and the bill is usually separated from the decisions that create it.
- High cardinality in metrics, especially from unbounded labels like user_id, drives up costs because each unique label combination creates a new time series. Such labels also carry data security risks.
- Logs accumulate because teams fear missing critical information during an incident, which leads to bloated, uncurated log dumps that are inefficient to store and search.
- The real unit of telemetry is the decision it supports, such as driving an alert or aiding an investigation, rather than its volume. Signals that support no decision only add cost.
- Telemetry tooling alone doesn't address governance; clear ownership and processes are needed to decide what telemetry to add and who can reject unnecessary data.
- Scalable telemetry management relies on guardrails (like standard libraries and CI checks) and on retention policies assigned by signal purpose. Platform teams set the boundaries, and service teams own their signals.
- Bad telemetry is worse than no telemetry, as it misguides decisions and entrenches unreliable data, ultimately altering organizational behavior based on noise.
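
Several of the points above (unbounded labels, a declared purpose per signal, clear ownership, retention by purpose) could be combined into a single CI-style check. The following is a minimal sketch under stated assumptions, not a reference implementation: the `Signal` type, the label denylist, the purpose names, and the retention values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical policy values, for illustration only.
UNBOUNDED_LABELS = {"user_id", "request_id", "session_id"}
RETENTION_DAYS = {"alert": 90, "investigation": 14, "debug": 3}


@dataclass(frozen=True)
class Signal:
    name: str
    labels: tuple       # label names attached to the metric
    purpose: str        # the decision this signal supports; "" if none declared
    owner: str          # the service team responsible for it; "" if none


def review(signal: Signal) -> list:
    """Return the reasons to reject a signal; an empty list means it passes."""
    problems = []
    banned = sorted(UNBOUNDED_LABELS.intersection(signal.labels))
    if banned:
        problems.append(f"unbounded labels: {', '.join(banned)}")
    if signal.purpose not in RETENTION_DAYS:
        problems.append("no recognized purpose declared")
    if not signal.owner:
        problems.append("no owner")
    return problems


def retention_days(signal: Signal) -> int:
    """Look up retention from the signal's purpose, not from its volume."""
    return RETENTION_DAYS[signal.purpose]


def series_upper_bound(values_per_label: dict) -> int:
    """Worst-case time series count: the product of distinct values per label."""
    total = 1
    for count in values_per_label.values():
        total *= count
    return total


if __name__ == "__main__":
    good = Signal("http_requests_total", ("route", "status"), "alert", "checkout-team")
    bad = Signal("http_requests_total", ("route", "user_id"), "", "")
    print(review(good))                                            # []
    print(review(bad))
    print(retention_days(good))                                    # 90
    print(series_upper_bound({"route": 50, "status": 5}))          # 250
    print(series_upper_bound({"route": 50, "user_id": 100_000}))   # 5000000
```

In this sketch the platform team would own the denylist and the retention table, while each service team fills in `purpose` and `owner` for its own signals. The last two lines show why a single unbounded label dominates the series count.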