Hasty Briefs


The bug that taught me more about PyTorch than years of using it

6 months ago
  • #Debugging
  • #PyTorch
  • #MPS
  • A training loss plateau was initially mistaken for a hyperparameter issue but was actually caused by a PyTorch bug.
  • The root cause was in PyTorch's MPS backend on Apple Silicon: the in-place `addcmul_` and `addcdiv_` operations silently failed to write their results when the destination tensor was non-contiguous.
  • Debugging involved tracing through PyTorch's layers of abstraction, from optimizer internals to GPU kernel implementations.
  • The issue was identified by comparing working and non-working operations (`mul_` vs. `addcmul_`) and analyzing their handling of non-contiguous tensors.
  • A fix was implemented by ensuring non-contiguous tensors are explicitly handled with temporary contiguous buffers and result copying.
  • The bug itself was already patched upstream in PyTorch v2.4, but similar silent failures persist for in-place random operations (`normal_`, `uniform_`, etc.) on older macOS versions.
  • Key lessons include isolating measurable symptoms, checking tensor metadata, and documenting debugging steps for future reference.
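The comparison described above can be sketched on CPU, where both ops behave correctly. A transposed view is the simplest way to get a non-contiguous destination, and a contiguous copy serves as the reference result; on the affected MPS builds, the `addcmul_` call on the view would silently leave it unchanged (this sketch only demonstrates the setup and the reference check, not the bug itself):

```python
import torch

# A transposed view shares storage with its base but is non-contiguous:
# its strides no longer match a row-major layout.
base = torch.arange(6.0).reshape(2, 3)
view = base.t()                     # shape (3, 2), non-contiguous
print(view.is_contiguous())         # False
print(view.stride())                # (1, 3) instead of the contiguous (2, 1)

a = torch.ones(3, 2)
b = torch.full((3, 2), 2.0)

# Reference result computed on a contiguous copy.
expected = view.contiguous()
expected.addcmul_(a, b, value=0.5)  # expected += 0.5 * a * b

# In-place on the non-contiguous view: agrees on CPU; on the buggy
# MPS path this write was silently dropped.
view.addcmul_(a, b, value=0.5)
print(torch.equal(view, expected))  # True on CPU
```

Checking `.is_contiguous()` and `.stride()` like this is exactly the "tensor metadata" inspection the lessons above recommend.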
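The workaround pattern from the fix (round-tripping through a temporary contiguous buffer, then copying the result back) can be sketched as a user-level helper; the helper name is hypothetical, and the real patch lives inside PyTorch's MPS kernels rather than in Python:

```python
import torch

def addcmul_noncontig_safe(out, t1, t2, value=1.0):
    """In-place addcmul that tolerates backends mishandling non-contiguous
    destinations: compute into a contiguous copy, then copy the result
    back through the original view. (Illustrative helper, not PyTorch API.)"""
    if out.is_contiguous():
        out.addcmul_(t1, t2, value=value)
    else:
        tmp = out.contiguous()               # temporary contiguous buffer
        tmp.addcmul_(t1, t2, value=value)    # write into the safe buffer
        out.copy_(tmp)                       # copy results back into the view
    return out

dst = torch.zeros(2, 3).t()                  # non-contiguous destination
addcmul_noncontig_safe(dst, torch.ones(3, 2), torch.ones(3, 2), value=2.0)
print(dst)                                   # every element is 2.0
```

The extra allocation and copy are the cost of correctness here, which is why the proper fix belongs in the kernel rather than in user code.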
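Since `normal_`, `uniform_`, and friends can reportedly still fail silently on non-contiguous tensors on older macOS versions, a cheap sanity probe is to fill the tensor with a sentinel value and verify the op actually overwrote it. A minimal sketch (the probe function is an assumption, not part of PyTorch):

```python
import torch

def inplace_op_writes(op_name):
    """Heuristic check that an in-place random op really writes to a
    non-contiguous tensor: seed with a sentinel and see if it changed.
    (Illustrative helper, not PyTorch API.)"""
    sentinel = 12345.0
    t = torch.full((4, 4), sentinel).t()   # non-contiguous view
    getattr(t, op_name)()                  # e.g. t.normal_()
    # If the backend silently dropped the write, every element
    # still equals the sentinel.
    return not torch.all(t == sentinel).item()

print(inplace_op_writes("normal_"))    # True on a healthy backend
print(inplace_op_writes("uniform_"))   # True on a healthy backend
```

This is the "isolate a measurable symptom" lesson in miniature: turn a silent failure into a boolean you can assert on.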