A Gentle Introduction to CUDA PTX
a day ago
- #CUDA
- #GPU
- #PTX
- PTX (Parallel Thread Execution) is the intermediate layer between CUDA code and NVIDIA GPU hardware. Understanding it is essential for deep performance analysis and for accessing the latest hardware features.
- PTX serves as the ISA of a virtual machine. The ptxas assembler translates it into SASS (Streaming Assembler), the native machine code for a specific GPU architecture, and this indirection is what provides forward compatibility.
- The post introduces a PTX playground with a simple kernel example, demonstrating how to write and run PTX code using the CUDA Driver API.
- Key PTX concepts include register declarations, data movement instructions (ld, st, mov), computation and control flow (mad, setp, bra), and special registers.
- PTX's two-stage compilation (CUDA C++ → PTX → SASS) enables forward compatibility: when a binary carries no SASS for the GPU it runs on, the driver can JIT-compile the embedded PTX for that architecture.
- The post walks through a complete PTX kernel for vector addition, explaining each instruction and its role in the computation.
- Appendix A covers controlling what goes into the fatbin (fat binary) with nvcc flags (-arch, -gencode) and inspecting the embedded PTX/SASS using cuobjdump.
- Appendix B explains the full compilation pipeline, including NVVM IR as an intermediate representation based on LLVM.
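
To make the instruction families listed above concrete, here is a minimal sketch of what a vector-addition kernel can look like in PTX. It is not the post's own listing: the kernel name `vec_add`, the parameter names, the register numbering, and the `.version 7.0` / `.target sm_70` header are assumptions chosen for illustration, and it assumes float32 arrays and a one-dimensional launch.

```ptx
.version 7.0              // assumed PTX ISA version
.target sm_70             // assumed minimum target architecture
.address_size 64

.visible .entry vec_add(  // hypothetical kernel name
    .param .u64 a_ptr,    // const float* a
    .param .u64 b_ptr,    // const float* b
    .param .u64 c_ptr,    // float* c
    .param .u32 n         // element count
)
{
    // Register declarations: typed virtual registers, numbered from 0.
    .reg .pred %p<2>;
    .reg .b32  %r<6>;
    .reg .f32  %f<4>;
    .reg .b64  %rd<11>;

    // Load kernel arguments from parameter space.
    ld.param.u64 %rd1, [a_ptr];
    ld.param.u64 %rd2, [b_ptr];
    ld.param.u64 %rd3, [c_ptr];
    ld.param.u32 %r1,  [n];

    // Convert generic addresses to global-space addresses.
    cvta.to.global.u64 %rd4, %rd1;
    cvta.to.global.u64 %rd5, %rd2;
    cvta.to.global.u64 %rd6, %rd3;

    // Special registers: block index, block size, thread index.
    mov.u32 %r2, %ctaid.x;
    mov.u32 %r3, %ntid.x;
    mov.u32 %r4, %tid.x;

    // idx = ctaid.x * ntid.x + tid.x
    mad.lo.s32 %r5, %r2, %r3, %r4;

    // Bounds check: skip the body when idx >= n.
    setp.ge.u32 %p1, %r5, %r1;
    @%p1 bra DONE;

    // Byte offset = idx * sizeof(float), widened to 64 bits.
    mul.wide.u32 %rd7, %r5, 4;
    add.s64 %rd8,  %rd4, %rd7;
    add.s64 %rd9,  %rd5, %rd7;
    add.s64 %rd10, %rd6, %rd7;

    // c[idx] = a[idx] + b[idx]
    ld.global.f32 %f1, [%rd8];
    ld.global.f32 %f2, [%rd9];
    add.f32 %f3, %f1, %f2;
    st.global.f32 [%rd10], %f3;

DONE:
    ret;
}
```

Each group in the sketch corresponds to one of the concepts named in the summary: `.reg` declarations, data movement (`ld`, `st`, `mov`), computation and control flow (`mad`, `setp`, `bra`), and the special registers `%tid`, `%ntid`, and `%ctaid`.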