Addressing GitHub's recent availability issues
a day ago
- #System Resilience
- #GitHub
- #Incident Report
- GitHub experienced significant availability and performance issues as rapid usage growth exposed scaling limitations.
- Key incidents occurred on February 2, February 9, and March 5, affecting authentication, user management, and GitHub Actions.
- The February 9 incident was caused by a database cluster overload, driven by increased traffic from client apps and a cache TTL change.
- The GitHub Actions incidents on February 2 and March 5 stemmed from inadequate failover and latent configuration problems.
- Contributing factors included insufficient isolation, inadequate load shedding, and gaps in monitoring and validation.
- GitHub is implementing near-term mitigations like redesigning the user cache system and isolating key dependencies.
- Long-term solutions include migrating infrastructure to Azure and breaking apart the monolith for better scalability and resilience.
- GitHub commits to transparency by publishing incident summaries and monthly availability reports.
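
To make the "inadequate load shedding" point above concrete, here is a minimal sketch of one common form of load shedding: capping in-flight requests and rejecting the excess early, while reserving some capacity for critical traffic. It is an illustration only, not GitHub's implementation. The names `LoadShedder`, `handle`, `max_in_flight`, and `reserved_for_critical` are hypothetical, and the limits are arbitrary.

```python
import threading


class LoadShedder:
    """Hypothetical in-flight limiter: sheds non-critical work first."""

    def __init__(self, max_in_flight: int, reserved_for_critical: int = 0):
        if not 0 <= reserved_for_critical <= max_in_flight:
            raise ValueError("reserved_for_critical must be within max_in_flight")
        self._max = max_in_flight
        self._reserved = reserved_for_critical
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self, critical: bool = False) -> bool:
        # Non-critical requests see a lower ceiling, so some slots
        # always remain available for critical traffic.
        limit = self._max if critical else self._max - self._reserved
        with self._lock:
            if self._in_flight >= limit:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle(shedder: LoadShedder, work, critical: bool = False):
    """Run work() if capacity allows; otherwise fail fast with a 503-style result."""
    if not shedder.try_acquire(critical):
        return 503, "overloaded, retry later"
    try:
        return 200, work()
    finally:
        shedder.release()
```

Rejecting early keeps an overloaded dependency from accumulating queued work it cannot finish. The reserved slots are one simple way to keep high-priority paths, such as authentication, responsive while lower-priority traffic is turned away.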