Show HN: Luminal – Open-source, search-based GPU compiler
- #rust
- #deep-learning
- #compiler
- Luminal is a deep learning library using search-based compilation for high performance.
- To run the demo on a Mac, clone the repo and follow the given commands.
- Transitioning to a '2.0' design built around large-scale kernel search, which simplifies the compiler stack.
- Example code shows how to set up a graph and run a matrix multiplication (see the first sketch after this list).
- Llama 3 8B can be run locally using Luminal, with setup and run instructions provided.
- Luminal aims to be the fastest ML framework, and already runs Q8 Llama 3 8B on M-series MacBooks.
- Core library is minimal: 12 primitive ops are enough to support transformers and convnets (an illustrative op set is sketched below).
- These primitive ops get compiled into complex GPU kernels for high performance.
- Uses exhaustive search for optimizations, so complex rewrites are derived automatically rather than hand-written (a toy search loop is sketched below).
- Written in Rust, interacting directly with the CUDA and Metal APIs with no abstraction layers in between.
- Emphasizes correctness with extensive testing against PyTorch implementations.
- Ahead-of-time compilation approach, similar to XLA and tinygrad, for better performance than eager execution.
- Supports aggressive kernel fusion (illustrated below) and shape-specific kernels, and handles devices and dtypes through compiler passes.
- Current features include Metal/CUDA support, full training, and implementations of models like Llama 3.
- Roadmap includes expanding the search space, improving CUDA support, adding Blackwell intrinsics, and more.
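The graph/matmul example mentioned above looks roughly like the following. This is a sketch in the spirit of the project's README example; the exact API surface (`Graph::new`, `cx.tensor(...).set(...)`, `retrieve`, `compile`, `execute`, and the compiler types) is assumed here and may differ between Luminal versions and backends:

```rust
use luminal::prelude::*;

fn main() {
    // Build a computation graph; nothing executes yet.
    let mut cx = Graph::new();
    let a = cx.tensor((3, 1)).set([[1.0], [2.0], [3.0]]);
    let b = cx.tensor((1, 4)).set([[1.0, 2.0, 3.0, 4.0]]);

    // Record a matmul and mark its output to be kept after execution.
    let mut c = a.matmul(b).retrieve();

    // Compile the graph for a backend (Metal/CUDA compilers exist too),
    // then run it.
    cx.compile(<(GenericCompiler, CPUCompiler)>::default(), &mut c);
    cx.execute();

    println!("Result: {:?}", c);
}
```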
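The "12 primitive ops" claim is easiest to picture as a type. The enum below is an illustrative guess at what such a set looks like (unary math, binary math, reductions, and one memory-movement op); the names and exact membership are an assumption, not copied from the crate:

```rust
// Hypothetical sketch of a ~12-op primitive set, not Luminal's actual enum.
#[derive(Clone, Copy, Debug)]
enum PrimitiveOp {
    // Unary elementwise
    Log2,
    Exp2,
    Sin,
    Sqrt,
    Recip,
    // Binary elementwise
    Add,
    Mul,
    Mod,
    LessThan,
    // Reductions along one dimension
    SumReduce,
    MaxReduce,
    // Memory movement: materialize a strided view
    Contiguous,
}
```

Everything else composes out of a set like this: subtraction is an add of a negated operand, division is a multiply by a reciprocal, and matmul is a broadcasted multiply followed by a sum-reduce.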
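Kernel fusion, listed among the features, means collapsing a chain of elementwise ops into a single pass over memory. The toy CPU version below (plain Rust, not Luminal's codegen) shows the transformation a fusing compiler performs; on a GPU the same idea turns three kernel launches into one:

```rust
// Unfused: three passes over memory and two temporary buffers,
// one per op, the way a naive eager framework executes it.
fn unfused(x: &[f32], y: &[f32]) -> Vec<f32> {
    let t1: Vec<f32> = x.iter().map(|v| v.exp2()).collect();
    let t2: Vec<f32> = t1.iter().map(|v| v.sin()).collect();
    t2.iter().zip(y).map(|(a, b)| a * b).collect()
}

// Fused: one pass, no intermediate buffers.
fn fused(x: &[f32], y: &[f32]) -> Vec<f32> {
    x.iter().zip(y).map(|(a, b)| a.exp2().sin() * b).collect()
}
```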
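Finally, a toy illustration of what "exhaustive search for optimizations" means. A real search-based compiler searches over kernel-level rewrites scored by a hardware cost model; the self-contained sketch below shrinks that to algebraic rewrites scored by op count, an assumption made purely for brevity:

```rust
use std::collections::{HashSet, VecDeque};

// Toy expression IR, just enough to demonstrate search over rewrites.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
enum Expr {
    Var(&'static str),
    Zero,
    One,
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}
use Expr::*;

// Cost model: count arithmetic ops (a stand-in for real hardware cost).
fn cost(e: &Expr) -> usize {
    match e {
        Var(_) | Zero | One => 0,
        Add(a, b) | Mul(a, b) => 1 + cost(a) + cost(b),
    }
}

// All expressions reachable by applying one rewrite rule anywhere in `e`.
fn step(e: &Expr) -> Vec<Expr> {
    let mut out = Vec::new();
    // Rules at the root: x*1 => x, x*0 => 0, x+0 => x.
    match e {
        Mul(a, b) if **b == One => out.push((**a).clone()),
        Mul(_, b) if **b == Zero => out.push(Zero),
        Add(a, b) if **b == Zero => out.push((**a).clone()),
        _ => {}
    }
    // Recurse so rules can fire in any subexpression.
    match e {
        Add(a, b) => {
            for a2 in step(a) { out.push(Add(Box::new(a2), b.clone())); }
            for b2 in step(b) { out.push(Add(a.clone(), Box::new(b2))); }
        }
        Mul(a, b) => {
            for a2 in step(a) { out.push(Mul(Box::new(a2), b.clone())); }
            for b2 in step(b) { out.push(Mul(a.clone(), Box::new(b2))); }
        }
        _ => {}
    }
    out
}

// Exhaustive breadth-first search over the rewrite space; keep the cheapest.
fn search(start: Expr) -> Expr {
    let mut best = start.clone();
    let mut seen = HashSet::new();
    let mut queue = VecDeque::from([start]);
    while let Some(e) = queue.pop_front() {
        if !seen.insert(e.clone()) {
            continue; // already explored this expression
        }
        if cost(&e) < cost(&best) {
            best = e.clone();
        }
        queue.extend(step(&e));
    }
    best
}

fn main() {
    // (x * 1) + (y * 0) should simplify all the way down to x.
    let e = Add(
        Box::new(Mul(Box::new(Var("x")), Box::new(One))),
        Box::new(Mul(Box::new(Var("y")), Box::new(Zero))),
    );
    println!("{:?}", search(e)); // prints Var("x")
}
```

The point is that no rewrite sequence is hand-scheduled: the search enumerates every reachable form and keeps the cheapest, which is how composite optimizations can be derived automatically rather than written by hand.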