What happens when you run a CUDA kernel?

3 hours ago

A CUDA program is compiled in several steps. nvcc hands the host code to the host compiler, while the device code is lowered by cicc to PTX and then assembled by ptxas into SASS. The SASS is bundled into a fatbin together with the PTX, which is kept for forward compatibility.
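
To make the forward-compatibility point concrete, here is a minimal sketch of how a loader might choose between the SASS and PTX entries of a fatbin. It is a simplified model, not the driver's actual logic: the function name `pick_image` and the tuple encoding of architectures are assumptions made for illustration. It relies on two general rules: SASS runs only on devices of the same major compute capability with an equal or higher minor revision, and PTX can be JIT-compiled for any device whose compute capability is at least the PTX target.

```python
def pick_image(device_cc, sass_archs, ptx_archs):
    """Choose a fatbin entry for a device (simplified, hypothetical model).

    device_cc  -- (major, minor) compute capability of the GPU
    sass_archs -- list of (major, minor) targets with precompiled SASS
    ptx_archs  -- list of (major, minor) targets with embedded PTX
    Returns ("sass", arch), ("ptx", arch), or None if nothing is usable.
    """
    major, minor = device_cc
    # SASS is only binary compatible within one major version,
    # and only when built for an equal or lower minor revision.
    usable_sass = [a for a in sass_archs if a[0] == major and a[1] <= minor]
    if usable_sass:
        return ("sass", max(usable_sass))
    # Otherwise fall back to PTX, which the driver can JIT-compile
    # for any device at least as new as the PTX target.
    usable_ptx = [a for a in ptx_archs if a <= device_cc]
    if usable_ptx:
        return ("ptx", max(usable_ptx))
    return None


# A fatbin built with SASS and PTX for compute capability 8.0:
print(pick_image((8, 9), [(8, 0)], [(8, 0)]))  # same major: SASS is used
print(pick_image((9, 0), [(8, 0)], [(8, 0)]))  # newer major: PTX is JIT-compiled
print(pick_image((7, 5), [(8, 0)], [(8, 0)]))  # older device: nothing usable
```

Architecture-specific targets and other special cases are deliberately left out of this model.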
Launching a kernel starts in a host stub that packs the arguments into a buffer whose layout matches the kernel's constant-memory offsets. The user-space driver, libcuda.so.1, then communicates with the kernel-mode driver through ioctl calls on the NVIDIA device files.
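
The argument-packing step can be sketched as follows. This is an illustration under assumptions, not the stub that nvcc generates: the helper `pack_kernel_args`, the kernel signature, and the pointer values are all hypothetical. The only rule it applies is natural alignment, where each scalar or pointer is placed at the next offset that is a multiple of its own size.

```python
import struct


def pack_kernel_args(args):
    """Pack (struct_format, value) pairs with natural alignment.

    Returns (buffer, offsets), where offsets[i] is the byte offset
    at which argument i was placed.
    """
    buf = bytearray()
    offsets = []
    for fmt, value in args:
        size = struct.calcsize("<" + fmt)
        # Pad so the argument starts at a multiple of its own size.
        padding = (-len(buf)) % size
        buf.extend(b"\x00" * padding)
        offsets.append(len(buf))
        buf.extend(struct.pack("<" + fmt, value))
    return bytes(buf), offsets


# Hypothetical kernel: saxpy(float a, const float* x, float* y, int n)
# The pointer values are made-up device addresses.
args = [
    ("f", 2.0),             # float a
    ("Q", 0x7F0000000000),  # const float* x (64-bit pointer)
    ("Q", 0x7F0000001000),  # float* y
    ("i", 1024),            # int n
]
buf, offsets = pack_kernel_args(args)
print(offsets, len(buf))  # [0, 8, 16, 24] 28
```

Note the 4 bytes of padding after `a`: the first pointer has to start on an 8-byte boundary, so it lands at offset 8 rather than 4.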
The driver loads the kernel code lazily on first launch, builds a Queue Meta Data (QMD) structure describing the launch configuration, sends it to the GPU through pushbuffer methods (e.g., SET_INLINE_QMD_ADDRESS_A), and rings a doorbell register to trigger execution.
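
The order of those steps can be mimicked with a toy model. Everything in it is invented for illustration: the class `ToyDriver`, the descriptor fields, and the fake addresses do not reflect the real QMD layout or pushbuffer encoding. It only shows the sequence: load on first use, build a launch descriptor, queue it, then advance a doorbell value so the consumer knows new work is available.

```python
class ToyDriver:
    """Toy model of the launch path (not the real driver data structures)."""

    def __init__(self):
        self.loaded = {}      # kernel name -> fake device address of its code
        self.pushbuffer = []  # commands queued for the GPU
        self.doorbell = 0     # number of queued commands the GPU may consume
        self._next_addr = 0x1000

    def _load(self, name):
        # Lazy loading: code is uploaded only the first time it is launched.
        if name not in self.loaded:
            self.loaded[name] = self._next_addr
            self._next_addr += 0x1000
        return self.loaded[name]

    def launch(self, name, grid, block, args):
        # Stand-in for the QMD: entry point plus launch configuration.
        descriptor = {
            "entry": self._load(name),
            "grid": grid,
            "block": block,
            "args": args,
        }
        self.pushbuffer.append(("LAUNCH", descriptor))
        # "Ring the doorbell": publish how far the queue has been filled.
        self.doorbell = len(self.pushbuffer)
        return descriptor


driver = ToyDriver()
first = driver.launch("add", grid=(4096, 1, 1), block=(256, 1, 1), args=b"")
second = driver.launch("add", grid=(4096, 1, 1), block=(256, 1, 1), args=b"")
print(first["entry"] == second["entry"], len(driver.loaded), driver.doorbell)
```

The second launch reuses the address recorded by the first, which is the whole point of loading lazily and caching the result.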
On the GPU, the compute work distributor assigns blocks to SMs subject to resource constraints (e.g., on an RTX 4090 with 128 SMs, each SM fits 6 blocks of 256 threads). Within an SM, the warp schedulers issue instructions one warp at a time, relying on compiler-specified stall counts and scoreboard barriers to manage dependencies.
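
The figure of 6 blocks follows from simple resource arithmetic, sketched below. The per-SM limits used as defaults (1536 threads, 24 blocks, 65536 registers, 100 KB of shared memory) are assumed values for compute capability 8.9 and should be checked against NVIDIA's documentation. The register count per thread is a hypothetical input, and the model ignores allocation granularity.

```python
def blocks_per_sm(threads_per_block, regs_per_thread=32, smem_per_block=0,
                  max_threads=1536, max_blocks=24,
                  max_regs=65536, max_smem=100 * 1024):
    """Estimate how many blocks fit on one SM (simplified model).

    The tightest of the thread, block-slot, register, and shared-memory
    limits decides the result. Default limits are assumptions.
    """
    limits = [
        max_threads // threads_per_block,                    # thread slots
        max_blocks,                                          # block slots
        max_regs // (regs_per_thread * threads_per_block),   # register file
    ]
    if smem_per_block > 0:
        limits.append(max_smem // smem_per_block)            # shared memory
    return min(limits)


per_sm = blocks_per_sm(256)
print(per_sm)        # 6, limited by 1536 // 256 thread slots
print(per_sm * 128)  # 768 blocks resident across 128 SMs
```

In this model, raising register use to 64 per thread drops the result to 4, because the register file becomes the tighter limit.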
Memory accesses are coalesced (e.g., an LDG.E executed by a warp loads 32 consecutive floats as a single 128-byte request served through the L1/L2 caches). Profiling shows low arithmetic intensity, so performance is bounded by DRAM bandwidth. After a semaphore signals completion, the results are copied back to the CPU via DMA.
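
Both the coalescing example and the bandwidth-bound conclusion come down to arithmetic, sketched here. The function names are illustrative, and the element-wise addition used for the intensity figure is a hypothetical workload, not necessarily the kernel that was profiled.

```python
def cache_lines_touched(base_addr, stride_elems, elem_size=4,
                        warp_size=32, line_size=128):
    """Count the distinct 128-byte lines one warp-wide load touches."""
    lines = {
        (base_addr + lane * stride_elems * elem_size) // line_size
        for lane in range(warp_size)
    }
    return len(lines)


def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


# 32 lanes reading consecutive 4-byte floats: 32 * 4 = 128 bytes, one line.
print(cache_lines_touched(0, 1))   # 1
# A stride of 32 floats puts every lane on its own line.
print(cache_lines_touched(0, 32))  # 32

# Hypothetical element-wise add c[i] = a[i] + b[i]:
# 1 FLOP per element against 12 bytes moved (two loads, one store).
print(arithmetic_intensity(1, 12))
```

At roughly 0.08 FLOP per byte, a kernel like this spends nearly all its time waiting on memory, which is consistent with a profile dominated by DRAM bandwidth.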
