Addressing GitHub's recent availability issues

a day ago
Tags: #System Resilience, #GitHub, #Incident Report

  • Rapid usage growth exposed scaling limitations at GitHub, leading to significant availability and performance issues.
  • Key incidents occurred on February 2, February 9, and March 5, affecting authentication, user management, and GitHub Actions.
  • The February 9 incident was caused by an overloaded database cluster, the result of increased traffic from client apps combined with a cache TTL change.
  • GitHub Actions was disrupted on February 2 and March 5 because of insufficient failover mechanisms and latent configuration problems.
  • Contributing factors included insufficient isolation, inadequate load shedding, and gaps in monitoring and validation.
  • Near-term mitigations include redesigning the user cache system and isolating key dependencies.
  • Long-term plans include migrating infrastructure to Azure and breaking apart the monolith to improve scalability and resilience.
  • GitHub commits to transparency by publishing incident summaries and monthly availability reports.
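
One of the contributing factors listed above is inadequate load shedding. As a rough illustration of the general idea only (not GitHub's implementation), the sketch below rejects non-critical requests once concurrency passes a threshold, keeping headroom for critical traffic such as authentication. The class name `LoadShedder`, the `capacity` and `shed_fraction` parameters, and the 80% threshold are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class LoadShedder:
    """Hypothetical admission controller: sheds non-critical work under load."""

    capacity: int                # assumed hard limit on concurrent requests
    shed_fraction: float = 0.8   # assumed point at which non-critical work is shed
    in_flight: int = 0           # requests currently being served

    def try_acquire(self, critical: bool) -> bool:
        """Admit the request if there is room for its priority class."""
        # Critical requests may use the full capacity; others only a fraction of it.
        limit = self.capacity if critical else int(self.capacity * self.shed_fraction)
        if self.in_flight >= limit:
            return False  # shed: the caller should fail fast or retry later
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Mark one request as finished."""
        self.in_flight = max(0, self.in_flight - 1)


shedder = LoadShedder(capacity=10)
# Non-critical traffic fills the first 8 slots, then starts being rejected.
background = [shedder.try_acquire(critical=False) for _ in range(9)]
# Critical traffic can still use the remaining headroom.
login_ok = shedder.try_acquire(critical=True)
```

The design choice here is to fail fast on low-priority work rather than let every request queue up against a struggling dependency. A production system would more likely base the decision on measured latency or error rates than on a fixed counter.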