The Mirror Is Part of the Machine

9 hours ago
  • #telemetry-cost
  • #system-design
  • #observability
  • Telemetry problems often begin with fields added after incidents, which accumulate into clutter. Teams then alternate between adding fields and cleaning them up, and the cycle repeats.
  • Telemetry is often treated as exhaust: outputs like logs and metrics are not reviewed as rigorously as other architectural changes, which creates hidden overhead.
  • High telemetry costs reflect architectural issues such as excessive components and retries, yet the costs are separated from the decisions that create them.
  • Cardinality in metrics drives up costs: unbounded labels like user_id create large numbers of unique time series and also expose data security risks.
  • Logs tend to accumulate out of fear of missing critical information during incidents, which produces bloated, uncurated log dumps that are inefficient.
  • The real unit of telemetry is the decision it supports, such as driving an alert or aiding an investigation, not raw volume. Signals that are never used only add cost.
  • Telemetry tooling alone doesn't address governance; clear ownership and processes are needed to decide what telemetry to add and who can reject unnecessary data.
  • Scalable telemetry management combines guardrails (like standard libraries and CI checks) with retention policies assigned by signal purpose. Platform teams set the boundaries, and service teams own their signals.
  • Bad telemetry is worse than no telemetry: it misguides decisions, entrenches unreliable data, and ultimately shapes organizational behavior around noise.
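
The points above about cardinality and CI guardrails can be made concrete with a small sketch. It is not taken from the article: the names `UNBOUNDED_LABELS`, `estimated_series`, and `check_metric`, the default budget of 10,000 series, and the sample label counts are all hypothetical. It assumes the worst case, where every combination of label values actually occurs, so the series count for one metric is the product of the distinct values per label.

```python
from __future__ import annotations

from math import prod

# Hypothetical denylist: labels whose value sets grow without bound.
UNBOUNDED_LABELS = {"user_id", "request_id", "session_id"}


def estimated_series(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series count for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


def check_metric(
    name: str, label_cardinalities: dict[str, int], budget: int = 10_000
) -> list[str]:
    """Return a list of violations; an empty list means the metric passes."""
    problems = []
    # Flag any label that appears on the denylist.
    for label in sorted(UNBOUNDED_LABELS.intersection(label_cardinalities)):
        problems.append(f"{name}: label '{label}' is unbounded")
    # Flag metrics whose worst-case series count exceeds the budget (assumed value).
    total = estimated_series(label_cardinalities)
    if total > budget:
        problems.append(f"{name}: ~{total} series exceeds budget of {budget}")
    return problems


if __name__ == "__main__":
    bounded = {"method": 5, "status": 6, "region": 4}
    print(estimated_series(bounded))  # 5 * 6 * 4 = 120

    # Adding one unbounded label multiplies the series count by its cardinality.
    risky = {**bounded, "user_id": 100_000}
    for problem in check_metric("http_requests_total", risky):
        print(problem)
```

A check like this could run in CI against declared metric definitions, which is one way a platform team might set a boundary while leaving the choice of signals to the service team.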