Speeding up PyTorch inference on Apple devices with AI-generated Metal kernels
- #AI
- #Performance
- #PyTorch
- AI-generated Metal kernels sped up PyTorch inference by an average of 87% on Apple devices.
- Frontier models (Anthropic, DeepSeek, OpenAI) were used to generate optimized GPU kernels.
- The approach requires no kernel engineering expertise and works off the shelf, producing usable kernels immediately.
- Some optimizations resulted in speedups of 10-100X, with one case showing a 9000X improvement.
- An agentic swarm approach was used: multiple models generate candidate kernels, and the fastest correct one is selected (see the selection sketch after this list).
- Adding context such as CUDA reference code and profiling information improved the quality of the generated kernels (a prompt-assembly sketch follows below).
- The median speedup was 1.35X, with an average (geometric mean) speedup of 1.87X (a worked example of that average appears after this list).
- The technique can be extended to other platforms like ROCm, CUDA, and SYCL.
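The post does not include the selection harness itself, but the swarm's best-of-N step can be sketched in plain PyTorch: each candidate (a callable wrapping a generated Metal kernel; the function and parameter names here are hypothetical) is first checked for numerical correctness against the stock implementation, then timed, and the fastest correct one wins. This is a minimal sketch, not the authors' implementation.

```python
import time

import torch


def pick_best_kernel(baseline_fn, candidates, example_input, atol=1e-4, trials=50):
    """Benchmark candidate kernels against a PyTorch baseline and keep the
    fastest one that is numerically correct. Falls back to the baseline if
    no generated candidate is both correct and faster."""
    reference = baseline_fn(example_input)
    best_fn, best_time = None, float("inf")

    # Time the baseline too, so a losing swarm falls back to stock PyTorch.
    for fn in [baseline_fn, *candidates]:
        try:
            out = fn(example_input)
        except Exception:
            continue  # a generated kernel may fail to compile or run at all
        if not torch.allclose(out, reference, atol=atol):
            continue  # reject kernels that produce incorrect results
        if torch.backends.mps.is_available():
            torch.mps.synchronize()  # drain queued GPU work before timing
        start = time.perf_counter()
        for _ in range(trials):
            fn(example_input)
        if torch.backends.mps.is_available():
            torch.mps.synchronize()  # wait for GPU completion before stopping the clock
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_fn, best_time = fn, elapsed

    return best_fn
```

The synchronize calls matter on MPS: GPU work is queued asynchronously, so timing without them would measure dispatch overhead rather than kernel runtime.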
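The context-enrichment step can likewise be illustrated as simple prompt assembly: optional CUDA reference code and profiler output are appended to the base request when available. The template and field names below are assumptions for illustration, not the authors' actual prompt.

```python
def build_kernel_prompt(op_name, pytorch_source, cuda_reference=None, profile_report=None):
    """Assemble a kernel-generation prompt, adding optional context sections.
    All section wording here is illustrative."""
    sections = [
        f"Write an optimized Metal compute kernel for the PyTorch op `{op_name}`.",
        f"PyTorch reference implementation:\n{pytorch_source}",
    ]
    if cuda_reference:
        # A known-good CUDA kernel gives the model a concrete optimization target.
        sections.append(f"Existing CUDA kernel for reference:\n{cuda_reference}")
    if profile_report:
        # Profiling output points the model at the actual bottlenecks.
        sections.append(f"GPU profile of the current implementation:\n{profile_report}")
    return "\n\n".join(sections)
```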
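A note on why the summary's average is a geometric rather than arithmetic mean: speedups are ratios, and an arithmetic mean would be dominated by outliers like the 9000X case. The numbers below are illustrative, not the article's dataset.

```python
import math


def geometric_mean(ratios):
    """Geometric mean: the appropriate average for multiplicative
    quantities such as speedup ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Illustrative per-op speedups with one huge outlier.
speedups = [1.1, 1.35, 1.35, 2.0, 9000.0]
print(f"geometric mean:  {geometric_mean(speedups):.2f}x")       # ~8.16x
print(f"arithmetic mean: {sum(speedups) / len(speedups):.0f}x")  # ~1801x, swamped by the outlier
```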