The bug that taught me more about PyTorch than years of using it
- #Debugging
- #PyTorch
- #MPS
- A training loss plateau was initially mistaken for a hyperparameter issue but was actually caused by a PyTorch bug.
- The culprit was PyTorch's MPS backend on Apple Silicon: the in-place `addcmul_` and `addcdiv_` operations silently failed when writing to non-contiguous tensors (a minimal repro sketch follows this list).
- Debugging involved tracing through PyTorch's layers of abstraction, from optimizer internals to GPU kernel implementations.
- The issue was isolated by comparing a working operation (`mul_`) against the failing one (`addcmul_`) and analyzing how each handles non-contiguous tensors.
- A fix was implemented by explicitly routing non-contiguous tensors through temporary contiguous buffers and copying the result back into place (see the workaround sketch below).
- The bug was already patched upstream in PyTorch v2.4, but similar issues persist in in-place random operations (`normal_`, `uniform_`, etc.) on older macOS versions (a quick probe follows below).
- Key lessons: isolate a measurable symptom, check tensor metadata such as contiguity and strides (see the final sketch below), and document each debugging step for future reference.
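To make the failure mode concrete, here is a minimal repro sketch in the spirit of the `mul_` vs. `addcmul_` comparison. It is an illustration under assumptions, not the exact script from the debugging session: it assumes an Apple Silicon machine running an MPS build predating the v2.4 fix, where `mul_` writes through a non-contiguous view correctly while `addcmul_` silently drops the update.

```python
import torch

# Assumes an Apple Silicon machine with an affected PyTorch build (< v2.4).
device = "mps" if torch.backends.mps.is_available() else "cpu"

base = torch.ones(4, 4, device=device)
view = base.t()                      # transposed view: same storage, non-contiguous
print(view.is_contiguous())          # False

# mul_ writes through the non-contiguous view correctly...
view.mul_(2.0)

# ...while addcmul_ on affected MPS builds computed into a contiguous
# scratch layout and never wrote the result back into `view`'s storage.
t1 = torch.ones(4, 4, device=device)
t2 = torch.ones(4, 4, device=device)
view.addcmul_(t1, t2, value=1.0)

# Fixed build: every element is 3.0. Buggy MPS build: still 2.0,
# because the addcmul_ update was silently lost.
print(base)
```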
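The fix follows a standard pattern for strided outputs. Below is a hedged sketch of that pattern at the Python level (the actual patch lives in the MPS kernel code); `safe_addcmul_` is a hypothetical helper name, not a PyTorch API.

```python
import torch

def safe_addcmul_(out: torch.Tensor,
                  t1: torch.Tensor,
                  t2: torch.Tensor,
                  value: float = 1.0) -> torch.Tensor:
    """Hypothetical workaround: route non-contiguous outputs through a
    temporary contiguous buffer, then copy the result back into place."""
    if out.is_contiguous():
        return out.addcmul_(t1, t2, value=value)
    scratch = out.contiguous()             # contiguous copy of the current values
    scratch.addcmul_(t1, t2, value=value)  # safe: destination is contiguous
    return out.copy_(scratch)              # copy_ performs the strided write-back
```

This mirrors the gather-compute-scatter shape of the upstream fix: do the work in a layout the kernel handles, then let `copy_` deal with the strides.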
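For the lingering random-op issue, a quick probe like the one below can tell you whether your build is affected. This is an assumed test, not an official check: it relies on the same silent-dropped-write symptom as the `addcmul_` bug.

```python
import torch

# Probe (assumption: same failure mode as addcmul_) for in-place RNG
# ops writing through a non-contiguous view on MPS.
if torch.backends.mps.is_available():
    base = torch.zeros(4, 4, device="mps")
    view = base.t()                  # non-contiguous view of base
    view.normal_()                   # should fill base with Gaussian noise
    if base.abs().sum().item() == 0.0:
        print("affected: normal_ silently dropped the strided write")
    else:
        print("ok: the write reached the underlying storage")
```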
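Finally, "check tensor metadata" is cheap to act on. The snippet below shows the few attributes worth printing first when an in-place op misbehaves; it runs on CPU and uses only standard PyTorch.

```python
import torch

x = torch.arange(12.0).reshape(3, 4)
print(x.is_contiguous(), x.stride())     # True  (4, 1): row-major layout

v = x[:, ::2]                            # strided slice: a non-contiguous view
print(v.is_contiguous(), v.stride())     # False (4, 2): skips every other column
print(v.data_ptr() == x.data_ptr())      # True: same storage, different layout
```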