The bug that taught me more about PyTorch than years of using it
- #Debugging
- #PyTorch
- #MPS
- A training loss plateau was initially mistaken for a hyperparameter issue but was actually caused by a PyTorch bug.
- The culprit was PyTorch's MPS backend on Apple Silicon: the in-place `addcmul_` and `addcdiv_` operations silently failed when writing to non-contiguous tensors (a minimal repro sketch follows this list).
- Debugging involved tracing through PyTorch's layers of abstraction, from optimizer internals to GPU kernel implementations.
- The issue was isolated by comparing a working operation (`mul_`) against the failing one (`addcmul_`) and analyzing how each handles non-contiguous tensors.
- A fix was implemented by explicitly routing non-contiguous tensors through temporary contiguous buffers and copying the result back into place (see the workaround sketch below).
- The bug was already patched upstream in PyTorch v2.4, but similar issues persist in in-place random operations (`normal_`, `uniform_`, etc.) on older macOS versions (a quick probe follows below).
- Key lessons: isolate a measurable symptom, check tensor metadata such as contiguity and strides (see the final sketch below), and document each debugging step for future reference.
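To make the failure mode concrete, here is a minimal repro sketch in the spirit of the `mul_` vs. `addcmul_` comparison. It is an illustration under assumptions, not the exact script from the debugging session: it assumes an Apple Silicon machine running an MPS build predating the v2.4 fix, where `mul_` writes through a non-contiguous view correctly while `addcmul_` silently drops the update.

```python
import torch

# Assumes an Apple Silicon machine with an affected PyTorch build (< v2.4).
device = "mps" if torch.backends.mps.is_available() else "cpu"

base = torch.ones(4, 4, device=device)
view = base.t()                      # transposed view: same storage, non-contiguous
print(view.is_contiguous())          # False

# mul_ writes through the non-contiguous view correctly...
view.mul_(2.0)

# ...while addcmul_ on affected MPS builds computed into a contiguous
# scratch layout and never wrote the result back into `view`'s storage.
t1 = torch.ones(4, 4, device=device)
t2 = torch.ones(4, 4, device=device)
view.addcmul_(t1, t2, value=1.0)

# Fixed build: every element is 3.0. Buggy MPS build: still 2.0,
# because the addcmul_ update was silently lost.
print(base)
```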
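The fix follows a standard pattern for strided outputs. Below is a hedged sketch of that pattern at the Python level (the actual patch lives in the MPS kernel code); `safe_addcmul_` is a hypothetical helper name, not a PyTorch API.

```python
import torch

def safe_addcmul_(out: torch.Tensor,
                  t1: torch.Tensor,
                  t2: torch.Tensor,
                  value: float = 1.0) -> torch.Tensor:
    """Hypothetical workaround: route non-contiguous outputs through a
    temporary contiguous buffer, then copy the result back into place."""
    if out.is_contiguous():
        return out.addcmul_(t1, t2, value=value)
    scratch = out.contiguous()             # contiguous copy of the current values
    scratch.addcmul_(t1, t2, value=value)  # safe: destination is contiguous
    return out.copy_(scratch)              # copy_ performs the strided write-back
```

This mirrors the gather-compute-scatter shape of the upstream fix: do the work in a layout the kernel handles, then let `copy_` deal with the strides.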
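For the lingering random-op issue, a quick probe like the one below can tell you whether your build is affected. This is an assumed test, not an official check: it relies on the same silent-dropped-write symptom as the `addcmul_` bug.

```python
import torch

# Probe (assumption: same failure mode as addcmul_) for in-place RNG
# ops writing through a non-contiguous view on MPS.
if torch.backends.mps.is_available():
    base = torch.zeros(4, 4, device="mps")
    view = base.t()                  # non-contiguous view of base
    view.normal_()                   # should fill base with Gaussian noise
    if base.abs().sum().item() == 0.0:
        print("affected: normal_ silently dropped the strided write")
    else:
        print("ok: the write reached the underlying storage")
```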
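Finally, "check tensor metadata" is cheap to act on. The snippet below shows the few attributes worth printing first when an in-place op misbehaves; it runs on CPU and uses only standard PyTorch.

```python
import torch

x = torch.arange(12.0).reshape(3, 4)
print(x.is_contiguous(), x.stride())     # True  (4, 1): row-major layout

v = x[:, ::2]                            # strided slice: a non-contiguous view
print(v.is_contiguous(), v.stride())     # False (4, 2): skips every other column
print(v.data_ptr() == x.data_ptr())      # True: same storage, different layout
```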