Hung by a thread
3 months ago
- #deadlock
- #debugging
- #robotics
- The author's robot control loop froze consistently 16 seconds after a client connected, despite no crashes or errors.
- Changing the thread handling and the mutex type made no difference: the loop still froze at iteration 1,615 every time.
- A heartbeat thread revealed the loop was blocked, not slow or starved, indicating a deadlock.
- GDB identified unexpected Rayon worker threads, traced back to the Rerun visualization SDK used for telemetry.
- The deadlock occurred because the author called Rerun's `recorder.log()` while holding a mutex. Holding a lock across a call that runs on Rayon's work-stealing threads is a known deadlock hazard.
- The solution was to reduce the time the mutex was held, fixing the issue with minimal code changes.
- Key lessons: use GDB to diagnose deadlocks, be wary of unexpected threads, understand the threading models of your dependencies, and keep a heartbeat thread handy.
- The author submitted a PR to Rerun to document the issue, hoping to prevent future occurrences.
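
The fix described above, holding the mutex for less time, can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the author's actual code: `RobotState`, `step_and_log`, and the `log` callback are hypothetical names, and the callback stands in for a telemetry call such as Rerun's `recorder.log()`. The idea is to copy what the logger needs while the lock is held, release the lock, and only then call into the logging code.

```rust
use std::sync::Mutex;

/// Hypothetical shared state for a control loop (not taken from the original post).
#[derive(Clone, Debug, PartialEq)]
pub struct RobotState {
    pub iteration: u64,
    pub joint_angles: [f64; 3],
}

/// Advances the state under the lock, then logs a snapshot after the lock is released.
/// `log` is a stand-in for a telemetry call that may run work on other threads.
pub fn step_and_log<F: FnOnce(&RobotState)>(shared: &Mutex<RobotState>, log: F) {
    let snapshot = {
        // Critical section: mutate and copy, nothing else.
        let mut state = shared.lock().unwrap();
        state.iteration += 1;
        state.clone()
    }; // The guard is dropped here, so the mutex is free again.

    // The logger runs without the mutex held, so it cannot block
    // another thread that is waiting for this lock.
    log(&snapshot);
}
```

Cloning the state costs a copy per iteration, which is usually acceptable for small telemetry payloads. For larger state, copying only the fields the logger needs would be the natural variation.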
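
A heartbeat thread like the one mentioned above might look something like this. It is a sketch under assumptions, not the author's implementation: `spawn_heartbeat`, the shared `iterations` counter, and the `stop` flag are all hypothetical names. The control loop would increment the counter once per iteration, and the heartbeat thread reports whenever the counter has not moved between two samples. Because the heartbeat thread keeps running, a silent counter points to a blocked loop instead of a starved process.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Spawns a watchdog that samples `iterations` once per `period`.
/// Returns a handle whose result is the number of samples with no progress.
pub fn spawn_heartbeat(
    iterations: Arc<AtomicU64>,
    stop: Arc<AtomicBool>,
    period: Duration,
) -> thread::JoinHandle<u32> {
    thread::spawn(move || {
        let mut last = iterations.load(Ordering::Relaxed);
        let mut stalls = 0u32;
        loop {
            thread::sleep(period);
            let now = iterations.load(Ordering::Relaxed);
            if now == last {
                // No progress since the previous sample: the loop is stuck.
                stalls += 1;
                eprintln!("heartbeat: control loop stuck at iteration {now}");
            }
            last = now;
            // Check the stop flag after sampling so at least one sample is taken.
            if stop.load(Ordering::Relaxed) {
                break;
            }
        }
        stalls
    })
}
```

The control loop would call `iterations.fetch_add(1, Ordering::Relaxed)` at the end of each iteration and set `stop` to `true` on shutdown. The sampling period should be comfortably longer than one loop iteration to avoid false alarms.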