Surprisingly Fast AI-Generated Kernels We Didn't Mean to Publish (Yet)
a year ago
- #CUDA optimization
- #AI-generated kernels
- #machine learning
- AI-generated CUDA-C kernels outperform expert-optimized PyTorch kernels in some cases.
- Synthetic data generation for training kernel generation models unexpectedly produced high-performing kernels.
- The method uses structured exploratory search with parallel evaluation: each step branches on multiple natural-language optimization ideas, increasing the diversity of strategies explored.
- Optimization strategies include converting GEMMs to FP16 Tensor-Core operations, double-buffering, and shared-memory caching.
- An example Conv2D optimization trajectory improves performance from 20.1% to 179.9% of the PyTorch reference.
- The approach combines strong reasoning with parallel hypothesis exploration, aligning with recent AI research trends.
- Current limitations include challenges with FP32 precision and complex kernels like Flash Attention.
- The authors are optimistic about future improvements, citing steady progress and the potential for self-improving AI systems.
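The search loop summarized above can be sketched as a simple generate-branch-evaluate procedure. This is a hypothetical illustration, not the authors' implementation: `propose_ideas` and `benchmark` stand in for the LLM proposing optimization ideas and for compiling/timing a candidate kernel, and the population and branching sizes are made up.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def propose_ideas(parent, n):
    # Stand-in for an LLM proposing n distinct optimization ideas for a
    # kernel (e.g. "use shared memory", "double-buffer global loads").
    return [f"{parent['code']}+idea{random.randint(0, 999)}" for _ in range(n)]

def benchmark(code):
    # Stand-in for compiling and timing a candidate kernel; returns a
    # score such as speedup vs. the reference implementation.
    return random.random()

def exploratory_search(seed_code, rounds=3, branch=4):
    population = [{"code": seed_code, "score": benchmark(seed_code)}]
    with ThreadPoolExecutor() as pool:
        for _ in range(rounds):
            # Branch: every surviving candidate spawns several new ideas,
            # rather than refining one best guess serially.
            candidates = [c for p in population for c in propose_ideas(p, branch)]
            # Evaluate all candidates in parallel.
            scores = list(pool.map(benchmark, candidates))
            scored = [{"code": c, "score": s} for c, s in zip(candidates, scores)]
            # Keep only the best few to seed the next round.
            population = sorted(population + scored,
                                key=lambda x: x["score"], reverse=True)[:branch]
    return population[0]

best = exploratory_search("baseline_conv2d")
```

Because the previous round's survivors are re-ranked alongside the new candidates, the best score never regresses across rounds; the parallel evaluation is what makes it cheap to try many divergent ideas instead of one chain of refinements.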