AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

16 hours ago

AutoKernel is an autonomous, agent-based framework for optimizing GPU kernels in PyTorch models.
It identifies bottlenecks via profiling, ranks improvements using Amdahl's law, and iteratively refines kernel implementations through automated experiments.
It ensures correctness with a five-stage harness: smoke tests, shape sweeps, numerical stability checks, determinism verification, and edge-case handling.
It supports Triton and CUDA C++ backends, with 18 starter kernels and nine kernel types crucial for modern transformer architectures.
It outperforms PyTorch eager and torch.compile (max-autotune) on an NVIDIA H100, with speedups such as 5.29x on RMSNorm, and leads in community benchmarks.
The framework is open source, with its code and integrations publicly available online.
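
The Amdahl's-law ranking mentioned above can be illustrated with a minimal sketch. It is not AutoKernel's actual code: the `KernelProfile` type, the `assumed_speedup` field, and all the numbers are hypothetical, chosen only to show how a kernel's share of total runtime limits the end-to-end gain from speeding it up.

```python
from dataclasses import dataclass


@dataclass
class KernelProfile:
    name: str
    time_ms: float          # time attributed to this kernel by a profiler
    assumed_speedup: float  # hypothetical guess at the achievable local speedup


def amdahl_speedup(fraction: float, local_speedup: float) -> float:
    """End-to-end speedup when `fraction` of runtime gets `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def rank_kernels(profiles):
    """Sort kernels by estimated end-to-end speedup, best candidate first."""
    total_ms = sum(p.time_ms for p in profiles)
    scored = [
        (p.name, amdahl_speedup(p.time_ms / total_ms, p.assumed_speedup))
        for p in profiles
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Illustrative numbers only, not measurements from AutoKernel.
profiles = [
    KernelProfile("matmul", time_ms=60.0, assumed_speedup=1.2),
    KernelProfile("softmax", time_ms=30.0, assumed_speedup=2.0),
    KernelProfile("rmsnorm", time_ms=10.0, assumed_speedup=5.0),
]
for name, speedup in rank_kernels(profiles):
    print(f"{name}: {speedup:.3f}x end to end")
```

With these made-up inputs the ranking is softmax, matmul, rmsnorm: a 5x local speedup on a kernel that takes 10% of runtime yields only about 1.09x end to end, which is why ranking by whole-model impact rather than by per-kernel speedup matters.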
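
The five-stage correctness harness described above might be organized along the following lines. The source names only the stages, so what each one checks here is an assumption, and a pure-Python RMSNorm over a list of floats stands in for a real GPU kernel so the sketch runs without a GPU. `rmsnorm_ref`, `close`, and `run_harness` are hypothetical names.

```python
import math
import random


def rmsnorm_ref(x, eps=1e-6):
    """Reference RMSNorm over a flat list of floats (no learned weight)."""
    if not x:
        return []
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def close(a, b, tol=1e-9):
    """Element-wise comparison with relative and absolute tolerance."""
    return len(a) == len(b) and all(
        math.isclose(p, q, rel_tol=tol, abs_tol=tol) for p, q in zip(a, b)
    )


def run_harness(candidate, reference=rmsnorm_ref, seed=0):
    """Run five assumed stages and return {stage_name: passed}."""
    rng = random.Random(seed)

    def rand(n, scale=1.0):
        return [rng.uniform(-scale, scale) for _ in range(n)]

    def smoke():
        x = rand(8)
        return close(candidate(x), reference(x))

    def shape_sweep():
        return all(close(candidate(x), reference(x))
                   for x in (rand(n) for n in (1, 3, 64, 1000)))

    def numerical_stability():
        # Very small and very large magnitudes must stay finite and match.
        for scale in (1e-8, 1e8):
            x = rand(64, scale)
            out = candidate(x)
            if not all(math.isfinite(v) for v in out) or not close(out, reference(x)):
                return False
        return True

    def determinism():
        x = rand(128)
        return candidate(x) == candidate(x)  # bit-identical on repeat

    def edge_cases():
        zeros = [0.0] * 4
        return candidate([]) == [] and close(candidate(zeros), reference(zeros))

    results = {}
    for stage in (smoke, shape_sweep, numerical_stability, determinism, edge_cases):
        try:
            results[stage.__name__] = bool(stage())
        except Exception:  # a crash counts as a failed stage
            results[stage.__name__] = False
    return results


print(run_harness(rmsnorm_ref))
```

Running every stage and recording a per-stage result, rather than stopping at the first failure, is a design choice of this sketch: it shows which class of input a candidate kernel breaks on.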
