What happens when you run a CUDA kernel?

3 days ago

CUDA programs are compiled by nvcc into host code and device code; the device code is lowered first to PTX, a virtual instruction set, and then to SASS, the machine code for a specific GPU architecture.
The host code calls a compiler-generated stub that packs the kernel arguments and hands the launch to the CUDA runtime and driver, which reach the GPU through a path involving ioctls and a doorbell register.
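
The argument-packing step can be illustrated without a GPU. The following is a minimal host-only sketch, assuming the launch entry point receives an array of pointers to the arguments (the `void**` shape that `cudaLaunchKernel` takes). The names `saxpy_stub`, `fake_launch`, and `saxpy_from_packed` are hypothetical, and the "kernel" runs on the CPU purely for illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical CPU stand-in for a kernel saxpy(n, a, x, y).
// It unpacks its arguments from an array of pointers-to-arguments.
inline void saxpy_from_packed(void** args) {
    int n = *static_cast<int*>(args[0]);
    float a = *static_cast<float*>(args[1]);
    const float* x = *static_cast<const float**>(args[2]);
    float* y = *static_cast<float**>(args[3]);
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical stand-in for the runtime's launch entry point.
// A real launch would enqueue work for the GPU instead of calling directly.
inline void fake_launch(void (*kernel)(void**), void** args) {
    kernel(args);
}

// Sketch of what a host stub does: collect the address of every
// argument into one array and pass that array to the launch call.
inline void saxpy_stub(int n, float a, const float* x, float* y) {
    void* args[] = {&n, &a, &x, &y};
    fake_launch(saxpy_from_packed, args);
}
```

Calling `saxpy_stub(3, 2.0f, x.data(), y.data())` on two three-element vectors updates `y` in place. In a real launch the array would travel with the kernel handle and the grid and block dimensions.
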
The GPU executes the kernel through a work distributor that assigns thread blocks to SMs; within each SM, warps are scheduled using control bits that the compiler encodes in the instructions to manage dependencies and stalls.
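
To make the block and warp bookkeeping concrete, here is a small host-only sketch of the arithmetic, assuming a one-dimensional launch and a warp size of 32 threads. The names `launch_shape` and `LaunchShape` are hypothetical helpers, not part of any CUDA API.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;  // threads per warp (assumed, as on NVIDIA GPUs)

// Integer division rounded up.
inline int ceil_div(int a, int b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

struct LaunchShape {
    int blocks;           // thread blocks handed to the work distributor
    int warps_per_block;  // warps each SM must schedule per resident block
    int total_warps;      // blocks * warps_per_block
};

// Number of blocks and warps needed to cover n elements, one thread each.
inline LaunchShape launch_shape(int n, int threads_per_block) {
    assert(n >= 0 && threads_per_block > 0);
    LaunchShape s{};
    s.blocks = ceil_div(n, threads_per_block);
    s.warps_per_block = ceil_div(threads_per_block, kWarpSize);
    s.total_warps = s.blocks * s.warps_per_block;
    return s;
}
```

For 1000 elements with 256 threads per block this gives 4 blocks of 8 warps each. The last block is only partly useful, which is why kernels usually guard with an index check against `n`.
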
Memory accesses from the threads of a warp are coalesced into fewer transactions and served by the caches and DRAM; kernels with low arithmetic intensity are typically limited by memory bandwidth rather than by compute.
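
Both effects can be estimated with simple arithmetic. The sketch below is host-only and rests on stated assumptions: a 32-thread warp, 32-byte memory sectors, and a roofline-style bound where attainable throughput is the smaller of peak compute and arithmetic intensity times bandwidth. The function names and any hardware figures passed in are illustrative, not measurements.

```cpp
#include <algorithm>
#include <cassert>
#include <set>

// Count the distinct sectors one warp touches when lane i reads the
// element at index i * stride_elems. Fewer sectors means better coalescing.
inline int sectors_touched(int stride_elems, int elem_bytes,
                           int sector_bytes = 32) {
    assert(stride_elems > 0 && elem_bytes > 0 && sector_bytes > 0);
    std::set<long long> sectors;
    for (int lane = 0; lane < 32; ++lane) {
        long long addr = static_cast<long long>(lane) * stride_elems * elem_bytes;
        sectors.insert(addr / sector_bytes);
    }
    return static_cast<int>(sectors.size());
}

// Roofline-style bound: throughput cannot exceed peak compute, nor the
// rate at which memory can feed the arithmetic (FLOP/byte * GB/s).
inline double attainable_gflops(double flops_per_byte, double peak_gflops,
                                double bandwidth_gbs) {
    return std::min(peak_gflops, flops_per_byte * bandwidth_gbs);
}
```

With 4-byte floats, a unit stride touches 4 sectors per warp while a stride of 8 elements touches 32. A float vector add performs 1 FLOP per 12 bytes moved, so on a hypothetical device with 1000 GB/s of bandwidth it is capped near 83 GFLOP/s no matter how high the compute peak is.
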
Completion is signaled through semaphores, which lets the host run asynchronously to the GPU and copy the results back for output once the kernel has finished.
