Hasty Briefs (beta)

Speeding up PyTorch inference on Apple devices with AI-generated Metal kernels

7 days ago
  • #AI
  • #Performance
  • #PyTorch
  • AI-generated Metal kernels improved PyTorch inference by 87% on Apple devices.
  • Frontier models (Anthropic, DeepSeek, OpenAI) were used to generate optimized GPU kernels.
  • The approach requires no kernel engineering expertise and produces working kernels without manual tuning.
  • Some optimizations resulted in speedups of 10-100X, with one case showing a 9000X improvement.
  • An agentic swarm approach was used, where multiple models generate kernels and the best is selected.
  • Adding context like CUDA reference code and profiling information improved kernel performance.
  • The median speedup was 1.35X, with an average (geometric mean) speedup of 1.87X.
  • The technique can be extended to other platforms like ROCm, CUDA, and SYCL.
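The agentic-swarm selection described above can be sketched as a simple loop: each model proposes a kernel for the same operation, every candidate is checked for correctness against a reference implementation, the survivors are benchmarked, and the fastest one is kept. The sketch below is a minimal, illustrative version in plain Python (no Metal or PyTorch); the candidate functions and the toy operation stand in for model-generated GPU kernels and are not from the original article.

```python
import timeit

def reference_op(xs):
    """Ground-truth implementation used as the correctness oracle."""
    return sum(x * x for x in xs)

# Hypothetical "model-generated" candidates for the same operation.
def candidate_a(xs):
    total = 0.0
    for x in xs:
        total += x * x
    return total

def candidate_b(xs):
    return sum(map(lambda x: x * x, xs))

def pick_best_kernel(candidates, test_input, tol=1e-9, runs=200):
    """Return the fastest candidate that matches the reference output."""
    expected = reference_op(test_input)
    best, best_time = None, float("inf")
    for fn in candidates:
        # Discard candidates that fail the correctness check.
        if abs(fn(test_input) - expected) > tol:
            continue
        elapsed = timeit.timeit(lambda: fn(test_input), number=runs)
        if elapsed < best_time:
            best, best_time = fn, elapsed
    return best

kernels = [candidate_a, candidate_b]
best = pick_best_kernel(kernels, list(range(1000)))
```

In the real system the correctness check would compare kernel output against the eager PyTorch result on sample tensors, and the timing step would profile the compiled Metal kernel on-device; extra context (CUDA reference code, profiler traces) would be fed to the models before they propose candidates.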