PyTorch 2.12 Release

8 hours ago
  • #CUDA Performance
  • #Distributed Training
  • #PyTorch 2.12
  • PyTorch 2.12 introduces significant performance enhancements, including up to 100x faster batched eigendecomposition on CUDA and a fused Adagrad optimizer.
  • New compilation and export features include a device-agnostic torch.accelerator.Graph API, support for Microscaling quantization formats in torch.export, and torch.cond control flow capture within CUDA Graphs.
  • Distributed training improvements include ProcessGroup support in custom ops, multi-GPU profiling enhancements with NCCL sequence numbers, and FlightRecorder support for additional backends like ncclx and gloo.
  • Platform updates cover CUDA (kernel annotations, Green Context workqueue limits), ROCm (expandable memory segments, rocSHMEM, hipSPARSELt, FlexAttention pipelining), and Apple MPS (Metal-4 offline shader compilation).
  • Deprecations and breaking changes: torchcomms is being integrated into PyTorch Distributed, with upcoming changes to ProcessGroup initialization and P2P operations; the deprecation of TorchScript continues in favor of torch.export and ExecuTorch; and the CUDA 12.8 wheel is deprecated.
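
To make the batched eigendecomposition item concrete, here is a minimal sketch of what such a workload looks like. It assumes the speedup applies to symmetric inputs passed to torch.linalg.eigh; the summary does not say which routine is affected, and the batch size, matrix size, and helper name batched_eigh are illustrative choices, not taken from the release notes. The code falls back to CPU when no CUDA device is present and makes no performance claim.

```python
import torch


def batched_eigh(batch_size=64, n=16, seed=0):
    """Eigendecompose a batch of random symmetric matrices in one call.

    Hypothetical helper for illustration only; sizes are arbitrary.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    gen = torch.Generator().manual_seed(seed)
    # Generate on CPU for reproducibility, then move to the target device.
    a = torch.randn(batch_size, n, n, generator=gen, dtype=torch.float64).to(device)
    # Symmetrize so that eigh (symmetric/Hermitian solver) applies.
    sym = (a + a.transpose(-1, -2)) / 2
    # A single call handles the whole batch: leading dimensions are batch dims.
    eigenvalues, eigenvectors = torch.linalg.eigh(sym)
    return sym, eigenvalues, eigenvectors


sym, w, v = batched_eigh()
# Check the decomposition: V diag(w) V^T should reproduce the input.
recon = v @ torch.diag_embed(w) @ v.transpose(-1, -2)
max_err = (recon - sym).abs().max().item()
print(f"device={sym.device}, eigenvalues shape={tuple(w.shape)}")
```

The point of the batched form is that all matrices are passed as one tensor with leading batch dimensions, rather than looping over them in Python; any backend speedup would apply to that single call.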