Compiling models to megakernels
- #Inference Compiler
- #GPU Optimization
- #Megakernels
- Luminal is an inference compiler focused on maximizing GPU utilization by addressing compute and bandwidth limitations.
- Traditional kernel-by-kernel execution suffers from kernel launch overhead, wave quantization, and SMs sitting idle while the first weights load.
- Megakernels fuse an entire model's operations into a single kernel, eliminating the synchronization gaps between launches and improving hardware utilization.
- Dynamic scheduling in megakernels uses a global instruction queue: each SM opportunistically fetches the next available task as it frees up, so no SM sits idle waiting on a fixed schedule.
- Barrier counters provide fine-grained synchronization: an instruction starts as soon as the counters for its inputs reach their expected values, with no full-kernel synchronization.
- Luminal transforms compute graphs into instruction queues with optimized data dependencies and barrier strides.
- Symbolic work queues keep dimensions (such as sequence length) symbolic inside each instruction, so dynamic shapes can change at runtime without rebuilding the queue.
- Megakernels point toward a new style of GPU programming that minimizes unnecessary synchronization and keeps the hardware busy.
- Luminal's work is open-source, inviting contributions and collaboration in advancing inference compiler technology.
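The dynamic-scheduling bullet can be made concrete with a small simulation. This is a minimal sketch, not Luminal's implementation: Python threads stand in for SMs, and a locked fetch-and-increment stands in for the single `atomicAdd` a GPU would use on a global queue counter, so faster workers naturally pick up more instructions.

```python
import threading

class InstructionQueue:
    """Toy model of a megakernel's global instruction queue (illustrative)."""

    def __init__(self, instructions):
        self.instructions = instructions
        self._next = 0
        self._lock = threading.Lock()

    def fetch(self):
        # Fetch-and-increment; on a GPU this would be one atomicAdd
        # on a counter in global memory.
        with self._lock:
            idx = self._next
            self._next += 1
        return self.instructions[idx] if idx < len(self.instructions) else None

def worker(queue, executed, sm_id):
    # Each "SM" keeps pulling work until the queue runs dry.
    while (instr := queue.fetch()) is not None:
        executed.append((sm_id, instr))

instructions = [f"op{i}" for i in range(8)]
queue = InstructionQueue(instructions)
executed = []
threads = [threading.Thread(target=worker, args=(queue, executed, sm))
           for sm in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every instruction is claimed exactly once, regardless of how the four workers interleave; that exactly-once property is what lets SMs schedule themselves opportunistically.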
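The barrier-counter bullet can be sketched the same way. The data structures below (`waits`, `signals`) are assumptions for illustration, not Luminal's actual layout: each instruction waits until the counters for its inputs reach an expected value, then runs and bumps its own counter, so a consumer can start the moment its producers finish rather than at a kernel-wide sync point.

```python
from collections import defaultdict

# Barrier counters, keyed by name; all start at zero.
counters = defaultdict(int)

def ready(instr):
    # An instruction is ready once every barrier it waits on has
    # reached the expected count.
    return all(counters[b] >= need for b, need in instr["waits"])

def run(instr):
    # Completing an instruction increments the barrier it signals.
    counters[instr["signals"]] += 1

# Two independent producers feed one consumer: "add" may start as
# soon as both producer counters hit 1, with no global barrier.
instrs = [
    {"name": "matmul_a", "waits": [], "signals": "A"},
    {"name": "matmul_b", "waits": [], "signals": "B"},
    {"name": "add", "waits": [("A", 1), ("B", 1)], "signals": "C"},
]

order = []
pending = list(instrs)
while pending:
    instr = next(i for i in pending if ready(i))  # opportunistic pick
    run(instr)
    order.append(instr["name"])
    pending.remove(instr)
```

The dependent `add` always runs last, but the two producers could complete in either order on different SMs; only the counters they touch are inspected.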
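The graph-lowering bullet amounts to flattening a DAG into a dependency-ordered queue. A hypothetical sketch, assuming a toy four-node model graph (the real pass would also compute barrier strides and data layouts): topologically sort the nodes, then turn each edge into a barrier the downstream instruction waits on.

```python
from graphlib import TopologicalSorter

# Compute graph: node -> set of nodes it depends on (assumed example).
graph = {
    "embed": set(),
    "attn": {"embed"},
    "mlp": {"attn"},
    "logits": {"mlp"},
}

# Topological order guarantees every instruction appears after
# everything it depends on.
order = list(TopologicalSorter(graph).static_order())

# One barrier per node: an instruction waits on the barriers of all
# of its dependencies and signals its own.
queue = [
    {"op": node, "waits": sorted(graph[node]), "signals": node}
    for node in order
]
```

With the queue in this form, the runtime never consults the graph again; dependencies are fully encoded in the per-instruction barrier lists.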
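Finally, the symbolic-queue bullet: a minimal sketch, assuming a single symbolic dimension named `seq` (the name and instruction fields are illustrative, not Luminal's API). Dimensions stay symbolic in the queue and are bound at dispatch, so decoding one more token changes a binding rather than rebuilding the queue.

```python
def resolve(instr, bindings):
    # Substitute any symbolic (string) field found in the bindings;
    # concrete values and unbound strings pass through unchanged.
    return {k: bindings.get(v, v) if isinstance(v, str) else v
            for k, v in instr.items()}

# The queue is built once, with "seq" left symbolic.
queue = [
    {"op": "attn", "rows": "seq", "cols": 64},
    {"op": "mlp", "rows": "seq", "cols": 256},
]

# Each decode step just rebinds "seq"; the queue itself is reused.
step1 = [resolve(i, {"seq": 5}) for i in queue]
step2 = [resolve(i, {"seq": 6}) for i in queue]
```

Only the binding table changes between steps, which is what makes dynamic sequence lengths cheap in this scheme.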