What happens when you run a CUDA kernel?
3 days ago
- #CUDA
- #Kernel-Launch
- #GPU-Execution
- nvcc splits a CUDA program into host code and device code. The device code is first lowered to PTX, a virtual instruction set, and then translated to SASS, the GPU's native machine code.
- On the host, a stub packs the kernel arguments and passes the launch through the CUDA runtime to the driver, which reaches the GPU through ioctls and a doorbell register.
- On the GPU, a work distributor assigns thread blocks to SMs. Each SM schedules warps using control bits that the compiler encodes in the instructions to manage dependencies and stalls.
- Memory accesses from the threads of a warp are coalesced and served through the caches and DRAM. Kernels with low arithmetic intensity are often limited by memory bandwidth rather than by compute.
- Completion is signaled via semaphores, so the kernel runs asynchronously with respect to the host. The results are then copied back to the host for output.
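
To make the memory-bandwidth point concrete, here is a back-of-the-envelope roofline estimate for a float32 vector add. It is only a sketch: the helper names are my own, and `PEAK_GFLOPS` and `BANDWIDTH_GBS` describe a hypothetical GPU with round numbers picked for easy arithmetic, not any real part. The model is the standard roofline bound, where attainable throughput is the lower of peak compute and arithmetic intensity times memory bandwidth.

```python
"""Back-of-the-envelope roofline estimate for a memory-bound kernel."""


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to or from DRAM."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def attainable_gflops(intensity: float, peak_gflops: float,
                      bandwidth_gbs: float) -> float:
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, intensity * bandwidth_gbs)


def is_memory_bound(intensity: float, peak_gflops: float,
                    bandwidth_gbs: float) -> bool:
    """True when the intensity sits below the ridge point peak / bandwidth."""
    return intensity < peak_gflops / bandwidth_gbs


# Hypothetical GPU (assumed values, not a real device).
PEAK_GFLOPS = 10_000.0    # peak float32 throughput, GFLOP/s
BANDWIDTH_GBS = 1_000.0   # peak DRAM bandwidth, GB/s

# Vector add c[i] = a[i] + b[i] on float32:
# 1 FLOP per element; 2 loads + 1 store of 4 bytes each = 12 bytes.
vector_add_ai = arithmetic_intensity(flops=1, bytes_moved=12)

if __name__ == "__main__":
    bound = attainable_gflops(vector_add_ai, PEAK_GFLOPS, BANDWIDTH_GBS)
    print(f"intensity:    {vector_add_ai:.4f} FLOP/byte")
    print(f"attainable:   {bound:.1f} GFLOP/s of {PEAK_GFLOPS:.0f} peak")
    print(f"memory bound: "
          f"{is_memory_bound(vector_add_ai, PEAK_GFLOPS, BANDWIDTH_GBS)}")
```

With these assumed numbers the vector add does 1 FLOP for every 12 bytes it moves, so the memory roof caps it at under 1% of peak compute. Swap in the figures for your own GPU and kernel to see which side of the ridge point you land on.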