An Update on GitHub Availability
3 hours ago
- #Incident Response
- #Scalability
- #GitHub Availability
- GitHub apologizes for two recent incidents affecting availability and outlines ongoing reliability improvements.
- The capacity target rose from 10X to 30X, driven by rapid growth in agentic workflows and monorepos.
- Priorities are availability first, then capacity, then new features, with a focus on reducing bottlenecks and isolating services.
- Short-term fixes included migrating webhooks, redesigning caching, and leveraging Azure for more compute.
- Long-term measures involve moving to a multi-cloud strategy and migrating performance-sensitive code to Go.
- The April 23 incident involved a merge queue regression affecting squash merges; it impacted 230 repositories and 2,092 pull requests.
- The April 27 incident was an overload of the search subsystem, likely caused by a botnet attack; it disrupted the UI but caused no data loss.
- GitHub is improving transparency via status updates, incident categorization, and customer reporting channels.
- GitHub commits to improving availability, resilience, scalability, and communication for developers.