Hasty Briefs (beta)


Compiling LLMs into a MegaKernel: A Path to Low-Latency Inference

10 months ago
  • #GPU
  • #LLM
  • #compiler
  • The Mirage Persistent Kernel (MPK) compiler transforms LLM inference into a single megakernel, reducing latency by 1.2–6.7×.
  • MPK fuses computation and communication across layers and GPUs into one GPU kernel, eliminating launch overhead and enabling pipelining.
  • The compiler lowers the model into a fine-grained task graph in which tasks represent units of computation or communication and events encode the dependencies and synchronization between them.
  • MPK's runtime executes the task graph inside the megakernel, dedicating some streaming multiprocessors (SMs) to workers that execute tasks and others to schedulers that dispatch them.
  • Future work includes support for newer GPU architectures, dynamic workloads like MoE models, and advanced scheduling policies.
  • MPK is open-source and aims to simplify high-performance LLM inference with minimal manual effort.
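To make the task-graph idea above concrete, here is a minimal Python sketch of event-based dependency tracking: each event counts down as its producer tasks finish, and when it fires, the tasks waiting on it become ready. The `Event`/`Task` names and the countdown scheme are illustrative assumptions, not MPK's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    name: str
    remaining: int                                  # producer tasks not yet finished
    consumers: list = field(default_factory=list)   # tasks released when this fires

@dataclass
class Task:
    name: str
    waits_on: list   # events that must fire before this task can run
    triggers: list   # events this task decrements on completion

def run_graph(tasks):
    """Execute a task graph sequentially, honoring event dependencies."""
    pending = {t.name: len(t.waits_on) for t in tasks}
    ready = [t for t in tasks if pending[t.name] == 0]
    order = []
    while ready:
        task = ready.pop(0)
        order.append(task.name)
        for ev in task.triggers:
            ev.remaining -= 1
            if ev.remaining == 0:                   # event fires
                for consumer in ev.consumers:
                    pending[consumer.name] -= 1
                    if pending[consumer.name] == 0:
                        ready.append(consumer)
    return order
```

For example, two partial-matmul tasks can both trigger a single event that releases an all-reduce task; because the event counts individual producers, the all-reduce starts as soon as both finish, without a kernel-launch boundary in between.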
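The worker/scheduler split can likewise be sketched as a toy simulation (assumed for illustration, not MPK's actual runtime): a scheduler role assigns ready tasks to per-worker queues, and each worker loops persistently over its queue, so no new kernel launch is needed between tasks.

```python
from collections import deque

def schedule(tasks, num_workers):
    """Round-robin dispatch of tasks to persistent worker loops (toy model)."""
    queues = [deque() for _ in range(num_workers)]
    # Scheduler role: assign each ready task to a worker's queue.
    for i, task in enumerate(tasks):
        queues[i % num_workers].append(task)
    log = []
    # Worker role: each worker repeatedly drains its queue in place.
    while any(queues):
        for worker_id, q in enumerate(queues):
            if q:
                log.append((worker_id, q.popleft()))
    return log
```

Separating dispatch from execution this way is what lets work from one layer begin while another layer is still finishing: scheduling decisions happen concurrently with task execution rather than at kernel-launch boundaries.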