Surprisingly Fast AI-Generated Kernels We Didn't Mean to Publish (Yet)
a year ago
- #CUDA optimization
- #AI-generated kernels
- #machine learning
- AI-generated CUDA-C kernels outperform expert-optimized PyTorch kernels in some cases.
- Synthetic data generation for training kernel generation models unexpectedly produced high-performing kernels.
- The method uses structured exploratory search with parallel evaluation: each step branches on multiple natural-language optimization ideas, increasing the diversity of strategies explored.
- Optimization strategies include converting GEMMs to FP16 Tensor-Core operations, double-buffering, and shared-memory caching.
- An example Conv2D optimization trajectory improves performance from 20.1% to 179.9% of the PyTorch reference.
- The approach combines strong reasoning with parallel hypothesis exploration, aligning with recent AI research trends.
- Current limitations include challenges with FP32 precision and complex kernels like Flash Attention.
- The authors are optimistic about future improvements, citing steady progress and the potential for self-improving AI systems.
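The search loop summarized above can be sketched as a simple generate-branch-evaluate procedure. This is a hypothetical illustration, not the authors' implementation: `propose_ideas` and `benchmark` stand in for the LLM proposing optimization ideas and for compiling/timing a candidate kernel, and the population and branching sizes are made up.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def propose_ideas(parent, n):
    # Stand-in for an LLM proposing n distinct optimization ideas for a
    # kernel (e.g. "use shared memory", "double-buffer global loads").
    return [f"{parent['code']}+idea{random.randint(0, 999)}" for _ in range(n)]

def benchmark(code):
    # Stand-in for compiling and timing a candidate kernel; returns a
    # score such as speedup vs. the reference implementation.
    return random.random()

def exploratory_search(seed_code, rounds=3, branch=4):
    population = [{"code": seed_code, "score": benchmark(seed_code)}]
    with ThreadPoolExecutor() as pool:
        for _ in range(rounds):
            # Branch: every surviving candidate spawns several new ideas,
            # rather than refining one best guess serially.
            candidates = [c for p in population for c in propose_ideas(p, branch)]
            # Evaluate all candidates in parallel.
            scores = list(pool.map(benchmark, candidates))
            scored = [{"code": c, "score": s} for c, s in zip(candidates, scores)]
            # Keep only the best few to seed the next round.
            population = sorted(population + scored,
                                key=lambda x: x["score"], reverse=True)[:branch]
    return population[0]

best = exploratory_search("baseline_conv2d")
```

Because the previous round's survivors are re-ranked alongside the new candidates, the best score never regresses across rounds; the parallel evaluation is what makes it cheap to try many divergent ideas instead of one chain of refinements.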