Hasty Briefs (beta)


Compiling LLMs into a MegaKernel: A Path to Low-Latency Inference

10 months ago
  • #GPU
  • #LLM
  • #compiler
  • The Mirage Persistent Kernel (MPK) compiler transforms LLM inference into a single megakernel, reducing latency by 1.2–6.7×.
  • MPK fuses computation and communication across layers and GPUs into one GPU kernel, eliminating launch overhead and enabling pipelining.
  • The compiler lowers the model into a fine-grained task graph in which tasks represent units of computation or communication and events encode the dependencies and synchronization between them.
  • MPK's runtime executes the task graph inside the megakernel, dedicating some streaming multiprocessors (SMs) to workers that execute tasks and others to schedulers that dispatch them.
  • Future work includes support for newer GPU architectures, dynamic workloads like MoE models, and advanced scheduling policies.
  • MPK is open-source and aims to simplify high-performance LLM inference with minimal manual effort.
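To make the task-graph idea above concrete, here is a minimal Python sketch of event-based dependency tracking: each event counts down as its producer tasks finish, and when it fires, the tasks waiting on it become ready. The `Event`/`Task` names and the countdown scheme are illustrative assumptions, not MPK's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    name: str
    remaining: int                                  # producer tasks not yet finished
    consumers: list = field(default_factory=list)   # tasks released when this fires

@dataclass
class Task:
    name: str
    waits_on: list   # events that must fire before this task can run
    triggers: list   # events this task decrements on completion

def run_graph(tasks):
    """Execute a task graph sequentially, honoring event dependencies."""
    pending = {t.name: len(t.waits_on) for t in tasks}
    ready = [t for t in tasks if pending[t.name] == 0]
    order = []
    while ready:
        task = ready.pop(0)
        order.append(task.name)
        for ev in task.triggers:
            ev.remaining -= 1
            if ev.remaining == 0:                   # event fires
                for consumer in ev.consumers:
                    pending[consumer.name] -= 1
                    if pending[consumer.name] == 0:
                        ready.append(consumer)
    return order
```

For example, two partial-matmul tasks can both trigger a single event that releases an all-reduce task; because the event counts individual producers, the all-reduce starts as soon as both finish, without a kernel-launch boundary in between.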
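The worker/scheduler split can likewise be sketched as a toy simulation (assumed for illustration, not MPK's actual runtime): a scheduler role assigns ready tasks to per-worker queues, and each worker loops persistently over its queue, so no new kernel launch is needed between tasks.

```python
from collections import deque

def schedule(tasks, num_workers):
    """Round-robin dispatch of tasks to persistent worker loops (toy model)."""
    queues = [deque() for _ in range(num_workers)]
    # Scheduler role: assign each ready task to a worker's queue.
    for i, task in enumerate(tasks):
        queues[i % num_workers].append(task)
    log = []
    # Worker role: each worker repeatedly drains its queue in place.
    while any(queues):
        for worker_id, q in enumerate(queues):
            if q:
                log.append((worker_id, q.popleft()))
    return log
```

Separating dispatch from execution this way is what lets work from one layer begin while another layer is still finishing: scheduling decisions happen concurrently with task execution rather than at kernel-launch boundaries.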