Speeding up PyTorch inference on Apple devices with AI-generated Metal kernels
- #AI
- #Performance
- #PyTorch
- AI-generated Metal kernels sped up PyTorch inference by an average of 87% on Apple devices.
- Frontier models (Anthropic, DeepSeek, OpenAI) were used to generate optimized GPU kernels.
- The approach requires no kernel engineering expertise and works off the shelf, producing usable kernels immediately.
- Some optimizations resulted in speedups of 10-100X, with one case showing a 9000X improvement.
- An agentic swarm approach was used: multiple models generate candidate kernels, and the fastest correct one is selected (see the selection sketch after this list).
- Adding context such as CUDA reference code and profiling information improved the quality of the generated kernels (a prompt-assembly sketch follows below).
- The median speedup was 1.35X, with an average (geometric mean) speedup of 1.87X (a worked example of that average appears after this list).
- The technique can be extended to other platforms like ROCm, CUDA, and SYCL.
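The post does not include the selection harness itself, but the swarm's best-of-N step can be sketched in plain PyTorch: each candidate (a callable wrapping a generated Metal kernel; the function and parameter names here are hypothetical) is first checked for numerical correctness against the stock implementation, then timed, and the fastest correct one wins. This is a minimal sketch, not the authors' implementation.

```python
import time

import torch


def pick_best_kernel(baseline_fn, candidates, example_input, atol=1e-4, trials=50):
    """Benchmark candidate kernels against a PyTorch baseline and keep the
    fastest one that is numerically correct. Falls back to the baseline if
    no generated candidate is both correct and faster."""
    reference = baseline_fn(example_input)
    best_fn, best_time = None, float("inf")

    # Time the baseline too, so a losing swarm falls back to stock PyTorch.
    for fn in [baseline_fn, *candidates]:
        try:
            out = fn(example_input)
        except Exception:
            continue  # a generated kernel may fail to compile or run at all
        if not torch.allclose(out, reference, atol=atol):
            continue  # reject kernels that produce incorrect results
        if torch.backends.mps.is_available():
            torch.mps.synchronize()  # drain queued GPU work before timing
        start = time.perf_counter()
        for _ in range(trials):
            fn(example_input)
        if torch.backends.mps.is_available():
            torch.mps.synchronize()  # wait for GPU completion before stopping the clock
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_fn, best_time = fn, elapsed

    return best_fn
```

The synchronize calls matter on MPS: GPU work is queued asynchronously, so timing without them would measure dispatch overhead rather than kernel runtime.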
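The context-enrichment step can likewise be illustrated as simple prompt assembly: optional CUDA reference code and profiler output are appended to the base request when available. The template and field names below are assumptions for illustration, not the authors' actual prompt.

```python
def build_kernel_prompt(op_name, pytorch_source, cuda_reference=None, profile_report=None):
    """Assemble a kernel-generation prompt, adding optional context sections.
    All section wording here is illustrative."""
    sections = [
        f"Write an optimized Metal compute kernel for the PyTorch op `{op_name}`.",
        f"PyTorch reference implementation:\n{pytorch_source}",
    ]
    if cuda_reference:
        # A known-good CUDA kernel gives the model a concrete optimization target.
        sections.append(f"Existing CUDA kernel for reference:\n{cuda_reference}")
    if profile_report:
        # Profiling output points the model at the actual bottlenecks.
        sections.append(f"GPU profile of the current implementation:\n{profile_report}")
    return "\n\n".join(sections)
```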
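A note on why the summary's average is a geometric rather than arithmetic mean: speedups are ratios, and an arithmetic mean would be dominated by outliers like the 9000X case. The numbers below are illustrative, not the article's dataset.

```python
import math


def geometric_mean(ratios):
    """Geometric mean: the appropriate average for multiplicative
    quantities such as speedup ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Illustrative per-op speedups with one huge outlier.
speedups = [1.1, 1.35, 1.35, 2.0, 9000.0]
print(f"geometric mean:  {geometric_mean(speedups):.2f}x")       # ~8.16x
print(f"arithmetic mean: {sum(speedups) / len(speedups):.0f}x")  # ~1801x, swamped by the outlier
```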