A Gentle Introduction to CUDA PTX

a day ago

PTX (Parallel Thread Execution) is the intermediate layer between CUDA code and NVIDIA GPU hardware; understanding it is essential for deep performance analysis and for accessing the latest hardware features.
PTX serves as the ISA (instruction set architecture) of a virtual machine; ptxas translates it to SASS (Streaming Assembler), the native code of a specific GPU, which is what provides forward compatibility.
The post introduces a PTX playground with a simple kernel example, demonstrating how to write and run PTX code using the CUDA Driver API.
Key PTX concepts include register declarations (.reg), data movement instructions (ld, st, mov), computation and control flow (mad, setp, bra), and special registers such as %tid, %ntid, and %ctaid.
Compilation is two-stage (CUDA C++ → PTX → SASS): when a binary carries no SASS for a newer GPU architecture, the driver JIT-compiles the embedded PTX for it at load time, which is how forward compatibility is achieved.
The post walks through a complete PTX kernel for vector addition, explaining each instruction and its role in the computation.
Appendix A covers controlling what goes into the fatbin (fat binary) with nvcc flags (-arch, -gencode) and inspecting the embedded PTX/SASS using cuobjdump.
Appendix B explains the full compilation pipeline, including NVVM IR as an intermediate representation based on LLVM.
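
To make the vector-addition walkthrough concrete, here is a minimal sketch in plain Python (not PTX, and not the post's own code) of what one thread of such a kernel does. The function names `vec_add_thread` and `launch` are hypothetical, and the PTX mnemonics in the comments show the usual instruction for each step rather than the post's exact listing. The model assumes a 1-D grid and values small enough that 32-bit overflow never occurs.

```python
def vec_add_thread(ctaid_x, ntid_x, tid_x, a, b, c, n):
    """Model the work of one GPU thread in a 1-D vector-add kernel."""
    # mov.u32 copies the special registers %ctaid.x, %ntid.x and %tid.x
    # into ordinary registers; mad.lo.s32 then computes the global index
    # as block index * block size + thread index.
    i = ctaid_x * ntid_x + tid_x

    # setp.ge.s32 sets a predicate register when i >= n; a predicated
    # bra (e.g. "@%p1 bra DONE;") jumps past the body for such threads.
    out_of_range = i >= n
    if out_of_range:
        return

    # Two ld.global.f32 loads, one add.f32, one st.global.f32 store.
    c[i] = a[i] + b[i]


def launch(grid_dim, block_dim, a, b, c, n):
    """Run every thread of the grid sequentially (a real GPU runs them in parallel)."""
    for ctaid_x in range(grid_dim):
        for tid_x in range(block_dim):
            vec_add_thread(ctaid_x, block_dim, tid_x, a, b, c, n)


n = 10
a = [float(i) for i in range(n)]
b = [2.0 * i for i in range(n)]
c = [0.0] * n

# 3 blocks of 4 threads = 12 threads for 10 elements; the last two
# threads fail the bounds check and do nothing.
launch(3, 4, a, b, c, n)
print(c)
```

The bounds check is why the kernel needs setp and bra at all: the grid is normally rounded up to a whole number of blocks, so some threads compute an index past the end of the arrays and must exit early.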
