A Higgs-Bugson in the Linux Kernel
10 months ago
- #Kerberos
- #Debugging
- #NFS
- A Higgs-bugson (a bug that is difficult to reproduce) was found in Gord, a system that stores and distributes trading activity data.
- The bug involved rare -EACCES (Permission denied) errors during large file copies with NFS and Kerberos, despite correct permissions.
- Debugging revealed the issue was related to Kerberos credentials and GSS sequence numbers in NFS requests and responses.
- A test setup was created using a FUSE filesystem with in-memory random data to reproduce the bug.
- eBPF and bpftrace were used to trace kernel functions and pinpoint when the bug occurred.
- The root cause was a GSS sequence number mismatch: when an NFS request was retransmitted, the sequence number the kernel validated against did not match the one covered by the reply's checksum, so checksum validation failed.
- A Wireshark plugin was developed to analyze packet checksums, which confirmed that the kernel was validating them incorrectly.
- The fix was a set of kernel patches that handle sequence number mismatches properly; they are now upstream in Linux 6.16.
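
The sequence number mismatch summarized above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the kernel's code or the actual patch: HMAC-SHA256 stands in for the Kerberos MIC, and `SESSION_KEY`, `mic`, `server_reply`, `verify_latest_only`, and `verify_any_sent` are hypothetical names invented for the illustration. It assumes the reply verifier is a checksum over the sequence number of the request being answered, and that a retransmission carries a fresh sequence number.

```python
import hashlib
import hmac
import struct

# Hypothetical stand-in for the key of the established Kerberos context.
SESSION_KEY = b"hypothetical-session-key"


def mic(seq_num: int) -> bytes:
    """Toy checksum over a 32-bit big-endian sequence number."""
    return hmac.new(SESSION_KEY, struct.pack(">I", seq_num), hashlib.sha256).digest()


def server_reply(request_seq: int) -> bytes:
    """The server signs the sequence number of the request it actually answered."""
    return mic(request_seq)


def verify_latest_only(sent_seqs: list[int], verifier: bytes) -> bool:
    """Fragile pattern: compare only against the most recent transmission."""
    return hmac.compare_digest(mic(sent_seqs[-1]), verifier)


def verify_any_sent(sent_seqs: list[int], verifier: bytes) -> bool:
    """Tolerant pattern: accept a verifier matching any transmission of this request."""
    return any(hmac.compare_digest(mic(s), verifier) for s in sent_seqs)


# The request is first sent with sequence number 41, then retransmitted with 42.
sent = [41, 42]
# The reply to the first transmission arrives after the retransmission.
late_reply = server_reply(41)

print(verify_latest_only(sent, late_reply))  # False
print(verify_any_sent(sent, late_reply))     # True
```

In this model the late reply is perfectly valid, yet the latest-only check rejects it, which is the kind of spurious failure that could surface to a caller as a permission error. Remembering every sequence number used for a request is one plausible way to tolerate the race; whether the upstream patches take exactly this approach is not stated here.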