Hasty Briefsbeta

Bilingual

Bluesky April 2026 Outage Post-Mortem

5 hours ago
  • #outage
  • #root-cause
  • #observability
  • Bluesky experienced an 8-hour intermittent outage affecting half its users, caused by an internal service sending large batches of post URI lookups to the GetPostRecord endpoint.
  • The root issue was a missing bounded concurrency limit (errgroup.SetLimit) in the GetPostRecord handler, which led to spawning thousands of goroutines that exhausted TCP ports due to connections stuck in TIME_WAIT state.
  • A death spiral occurred as memcached errors triggered excessive logging, causing blocking syscalls and increased OS threads, which stressed the garbage collector and led to OOM crashes and further port exhaustion.
  • A temporary fix involved using a custom dialer to randomize loopback IPs for memcached connections, expanding the client IP+port space to break the death spiral until the root cause was addressed.
  • Lessons include the need for improved observability per-client, metrics for small numbers of large requests, and favoring Prometheus metrics or OTEL tracing over excessive logging in high-scale systems.