Compiling LLMs into a MegaKernel: A Path to Low-Latency Inference
10 months ago
- #GPU
- #LLM
- #compiler
- Mirage Persistent Kernel (MPK) is a compiler that transforms LLM inference into a single megakernel, reducing latency by 1.2-6.7x.
- MPK fuses computation and communication across layers and GPUs into one GPU kernel, eliminating launch overhead and enabling pipelining.
- The compiler lowers the computation into a fine-grained task graph, where tasks represent units of computation or communication and events encode the dependencies between them, making synchronization explicit.
- MPK's on-GPU runtime executes the task graph entirely inside the megakernel, partitioning SMs into workers that execute tasks and schedulers that dispatch tasks as their dependencies are satisfied.
- Future work includes support for newer GPU architectures, dynamic workloads like MoE models, and advanced scheduling policies.
- MPK is open-source and aims to simplify high-performance LLM inference with minimal manual effort.
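The task-and-event mechanism above can be sketched in a few lines of Python. This is an illustrative model, not MPK's actual data structures or API: each event keeps a count of unfinished producer tasks, and when the count reaches zero the event fires and activates its dependent tasks.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Event:
    remaining: int  # producer tasks that have not yet finished
    activates: List[int] = field(default_factory=list)  # task ids unblocked when this fires

@dataclass
class Task:
    name: str
    triggers: Optional[int] = None  # index of the event signaled on completion

def execute(tasks: List[Task], events: List[Event], ready: List[int]) -> List[str]:
    """Run tasks in dependency order; return task names in completion order."""
    order = []
    while ready:
        task = tasks[ready.pop(0)]
        order.append(task.name)
        if task.triggers is not None:
            ev = events[task.triggers]
            ev.remaining -= 1
            if ev.remaining == 0:  # all producers done: event fires
                ready.extend(ev.activates)
    return order
```

For example, two partial-matmul tasks can both trigger one event that activates a downstream allreduce task, so the allreduce starts only after both partial results exist.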
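The worker/scheduler split can likewise be modeled with a small single-threaded simulation; the structure is an assumption for illustration, and the real runtime runs these roles concurrently on dedicated SMs inside the megakernel. Here a scheduler loop distributes runnable tasks across per-worker queues round-robin, and each worker step executes one task and reports completions that may unblock successors.

```python
from collections import deque

def simulate(deps, num_workers=2):
    """deps: task name -> set of prerequisite task names.
    Returns a log of (worker_id, task) pairs in execution order."""
    remaining = {t: set(d) for t, d in deps.items()}
    succs = {t: [] for t in deps}
    for t, d in deps.items():
        for p in d:
            succs[p].append(t)
    workers = [deque() for _ in range(num_workers)]
    runnable = [t for t, d in remaining.items() if not d]
    log, rr = [], 0
    while runnable or any(workers):
        # "scheduler SM": assign runnable tasks to worker queues round-robin
        for t in runnable:
            workers[rr % num_workers].append(t)
            rr += 1
        runnable = []
        # "worker SMs": each executes one task per step
        for wid, q in enumerate(workers):
            if q:
                t = q.popleft()
                log.append((wid, t))
                for s in succs[t]:  # completion may unblock successors
                    remaining[s].discard(t)
                    if not remaining[s]:
                        runnable.append(s)
    return log
```

Because scheduling decisions never leave the kernel, a task can start the moment its last dependency completes, which is what enables the fine-grained pipelining described above.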