The Mirror Is Part of the Machine

8 hours ago

#telemetry-cost
#system-design
#observability

Telemetry problems often start when fields are added after an incident and never removed, so clutter accumulates. Periodic cleanups followed by new additions turn this into a recurring cycle.
Telemetry is often treated as exhaust: logs and metrics are seen as byproducts and are not reviewed as rigorously as other architectural changes, which hides their overhead.
High telemetry costs usually reflect architectural problems, such as too many components or excessive retries, yet the cost shows up separately from the decisions that created it.
High cardinality in metrics, especially from unbounded labels like user_id, drives up costs because each distinct combination of label values becomes its own time series. Putting identifiers in labels also creates data security risks.
Logs accumulate because teams fear missing critical information during an incident, and the result is bloated, uncurated log dumps that are expensive to keep and hard to use.
The real unit of telemetry is the decision it supports, such as driving an alert or aiding an investigation, not the volume of data. A signal that supports no decision only adds cost.
Telemetry tooling alone doesn't address governance; clear ownership and processes are needed to decide what telemetry to add and who can reject unnecessary data.
Scalable telemetry management relies on guardrails, such as standard libraries and CI checks, and on retention policies assigned by signal purpose. Platform teams set the boundaries and service teams own their signals.
Bad telemetry is worse than no telemetry, as it misguides decisions and entrenches unreliable data, ultimately altering organizational behavior based on noise.
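
The points above about cardinality, purpose, and guardrails can be made concrete with a small sketch. It assumes a hypothetical CI check that reads metric declarations and rejects risky ones. Everything here is illustrative and not taken from the article: the MetricSpec type, the forbidden label list, the series limit, and the retention values are all assumptions. The worst-case series count is the product of the distinct values of each label.

```python
from dataclasses import dataclass
from math import prod

# Hypothetical policy values a platform team might set (assumptions).
FORBIDDEN_LABELS = {"user_id", "session_id", "request_id", "email"}
MAX_SERIES_PER_METRIC = 10_000
RETENTION_DAYS_BY_PURPOSE = {"alerting": 30, "investigation": 14, "audit": 365}


@dataclass(frozen=True)
class MetricSpec:
    name: str
    owner: str                    # service team that owns the signal
    purpose: str                  # the decision this signal supports
    label_values: dict[str, int]  # label name -> estimated distinct values


def estimated_series(spec: MetricSpec) -> int:
    """Worst-case series count: product of distinct values per label."""
    return prod(spec.label_values.values())


def review(spec: MetricSpec) -> list[str]:
    """Return the reasons a CI check would reject this metric, if any."""
    problems = []
    unbounded = sorted(l for l in spec.label_values if l in FORBIDDEN_LABELS)
    if unbounded:
        problems.append(f"{spec.name}: unbounded labels {unbounded}")
    series = estimated_series(spec)
    if series > MAX_SERIES_PER_METRIC:
        problems.append(
            f"{spec.name}: ~{series} series exceeds {MAX_SERIES_PER_METRIC}"
        )
    if spec.purpose not in RETENTION_DAYS_BY_PURPOSE:
        problems.append(f"{spec.name}: no retention for purpose {spec.purpose!r}")
    return problems


bounded = MetricSpec(
    name="http_requests_total",
    owner="checkout",
    purpose="alerting",
    label_values={"method": 5, "status": 8, "route": 40},
)
risky = MetricSpec(
    name="login_total",
    owner="auth",
    purpose="debugging",  # not a declared purpose, so no retention applies
    label_values={"user_id": 1_000_000, "status": 8},
)

print(estimated_series(bounded))  # 5 * 8 * 40 = 1600
for problem in review(risky):
    print(problem)
```

In this sketch the bounded metric passes, while the risky one is flagged three times: for the user_id label, for its estimated series count, and for a purpose that maps to no retention policy. Tying retention to a declared purpose is one way to make each signal name the decision it supports.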
