Tracing Discord's Elixir Systems (Without Melting Everything)
6 hours ago
- #Discord
- #Observability
- #Elixir
- Discord aims to make user interactions feel instant, relying on Elixir's concurrency to run each guild (Discord's term for a server) independently.
- When guilds lag or fail, on-call engineers use observability tools to diagnose and prevent recurrence.
- Initial investigations rely on metrics and logs, which may hint at bursty activity but do not show what users actually experienced.
- For deeper insight, engineers use "guild timings," a custom tool that records minute by minute how each guild's time is spent processing actions, though the data is volatile.
- Distributed tracing (APM) shows each operation in detail, but it required a custom integration because Elixir's message passing between processes has no built-in way to carry trace context alongside a message.
- Discord successfully integrated distributed tracing without downtime, enhancing performance monitoring.
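
One common way to carry trace context across process messages, when the transport has no header mechanism like HTTP's, is to wrap each message in an envelope. The sketch below illustrates that idea in Python rather than Elixir, using a `queue.Queue` as a stand-in for a process mailbox. It is not Discord's code: `TraceContext`, `Envelope`, `new_trace`, `send`, and `receive` are hypothetical names chosen for illustration, and accepting bare messages is an assumption about how such a scheme might stay compatible with senders that have not been upgraded yet.

```python
import queue
import uuid
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass(frozen=True)
class TraceContext:
    """Identifiers that tie one unit of work to the trace it belongs to."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str] = None

    def child(self) -> "TraceContext":
        # Same trace, fresh span, with the current span recorded as the parent.
        return TraceContext(self.trace_id, uuid.uuid4().hex[:16], self.span_id)


@dataclass(frozen=True)
class Envelope:
    """Hypothetical wrapper: the original message plus optional trace context."""
    message: Any
    trace: Optional[TraceContext] = None


def new_trace() -> TraceContext:
    """Start a new trace with a root span."""
    return TraceContext(uuid.uuid4().hex, uuid.uuid4().hex[:16])


def send(mailbox: queue.Queue, message: Any,
         trace: Optional[TraceContext] = None) -> None:
    """Wrap the message so the trace context travels with it."""
    mailbox.put(Envelope(message, trace))


def receive(mailbox: queue.Queue) -> Tuple[Any, Optional[TraceContext]]:
    """Unwrap the next message and derive a child span for the receiver.

    Bare (unwrapped) messages are passed through untraced, so senders that
    do not know about envelopes keep working (an assumption of this sketch).
    """
    item = mailbox.get()
    if isinstance(item, Envelope):
        trace = item.trace.child() if item.trace is not None else None
        return item.message, trace
    return item, None


if __name__ == "__main__":
    mailbox: queue.Queue = queue.Queue()
    root = new_trace()
    send(mailbox, {"op": "send_message"}, root)
    message, ctx = receive(mailbox)
    print(message, ctx.trace_id == root.trace_id, ctx.parent_span_id == root.span_id)
```

The receiver continues the sender's trace by creating a child span, which is what lets a tracing backend stitch work done in separate processes into one timeline. Keeping the trace field optional means untraced traffic pays only the cost of the wrapper.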