Bluesky April 2026 Outage Post-Mortem
5 hours ago
- #outage
- #root-cause
- #observability
- Bluesky experienced an 8-hour intermittent outage affecting half its users, caused by an internal service sending large batches of post URI lookups to the GetPostRecord endpoint.
- The root issue was a missing bounded concurrency limit (errgroup.SetLimit) in the GetPostRecord handler, which led to spawning thousands of goroutines that exhausted TCP ports due to connections stuck in TIME_WAIT state.
- A death spiral occurred as memcached errors triggered excessive logging, causing blocking syscalls and increased OS threads, which stressed the garbage collector and led to OOM crashes and further port exhaustion.
- A temporary fix involved using a custom dialer to randomize loopback IPs for memcached connections, expanding the client IP+port space to break the death spiral until the root cause was addressed.
- Lessons include the need for improved observability per-client, metrics for small numbers of large requests, and favoring Prometheus metrics or OTEL tracing over excessive logging in high-scale systems.