We found a bug in Go's ARM64 compiler
6 hours ago
- #arm64
- #race-condition
- #Go
- Cloudflare discovered a bug in Go's arm64 compiler that causes a race condition in generated code; the bug is rare enough that it only surfaced because of Cloudflare's massive scale.
- The first symptom was sporadic panics on arm64 machines, which were linked to stack corruption encountered during stack unwinding.
- The crashes were initially correlated with recovered panics and with an old Go issue (#73259), so the team temporarily mitigated them by avoiding panic/recover for error handling.
- Fatal panics later returned at a higher rate and without a clear trigger, prompting a deeper investigation.
- Two classes of bugs were identified: crashes due to invalid memory access and explicit fatal errors during stack unwinding.
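- The root cause was traced to async preemption occurring between the separate instructions into which Go's arm64 compiler splits a stack pointer adjustment; a goroutine interrupted at that point has a partially adjusted stack pointer, so unwinding sees an invalid stack state.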
- A minimal reproducer was created, confirming that the bug was in Go itself and not specific to Cloudflare's code or environment.
- The bug was fixed in Go versions 1.23.12, 1.24.6, and 1.25.0 by ensuring stack pointer adjustments are atomic.
- The investigation highlighted the challenges of debugging rare race conditions at scale and the importance of understanding low-level runtime behaviors.