Surprisingly Fast AI-Generated Kernels We Didn't Mean to Publish (Yet)

a year ago
  • #CUDA optimization
  • #AI-generated kernels
  • #machine learning
  • AI-generated CUDA-C kernels outperform expert-optimized PyTorch kernels in some cases.
  • Synthetic data generation for training kernel generation models unexpectedly produced high-performing kernels.
  • The method uses structured exploratory search with parallel evaluation, which increases the diversity of optimization ideas explored.
  • Optimization strategies include FP16 Tensor-Core GEMM conversion, double-buffering, and shared memory caching.
  • Example Conv2D optimization trajectory shows performance improving from 20.1% to 179.9% of reference.
  • The approach combines strong reasoning with parallel hypothesis exploration, aligning with recent AI research trends.
  • Current limitations include challenges with FP32 precision and complex kernels like Flash Attention.
  • The authors are optimistic about future improvements, citing steady progress and the potential for self-improving AI systems.
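
The search loop summarized above (propose several optimization ideas, evaluate the resulting candidates in parallel, keep the best, branch again) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the idea catalogue, the speedup multipliers, the timing numbers, and the names `propose`, `evaluate`, and `search` are all hypothetical, and the toy `evaluate` function stands in for compiling and benchmarking a real CUDA kernel.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical idea catalogue: idea name -> assumed speedup multiplier.
# The names echo the summary; the numbers are made up for illustration.
IDEA_SPEEDUPS = {
    "fp16_tensor_core_gemm": 2.0,
    "double_buffering": 1.3,
    "shared_memory_caching": 1.5,
}


@dataclass(frozen=True)
class Candidate:
    ideas: tuple  # optimization ideas applied so far, in order


def propose(parent):
    """Branch: one child per idea not yet applied.

    Stand-in for a model proposing optimization ideas in natural language.
    """
    return [
        Candidate(parent.ideas + (idea,))
        for idea in IDEA_SPEEDUPS
        if idea not in parent.ideas
    ]


def evaluate(candidate, reference_ms=10.0, baseline_ms=40.0):
    """Toy benchmark: performance as a percentage of the reference.

    100% means the candidate matches the reference runtime; higher is faster.
    """
    runtime_ms = baseline_ms
    for idea in candidate.ideas:
        runtime_ms /= IDEA_SPEEDUPS[idea]
    return 100.0 * reference_ms / runtime_ms


def search(rounds=3, beam_width=2, workers=4):
    """Keep the top `beam_width` candidates each round and branch from them."""
    beam = [Candidate(())]
    best = (evaluate(beam[0]), beam[0])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(rounds):
            children = [child for parent in beam for child in propose(parent)]
            if not children:
                break
            # Evaluate every candidate of this round in parallel.
            scores = list(pool.map(evaluate, children))
            ranked = sorted(zip(scores, children), key=lambda p: p[0], reverse=True)
            beam = [child for _, child in ranked[:beam_width]]
            best = max(best, ranked[0], key=lambda p: p[0])
    return best


if __name__ == "__main__":
    score, candidate = search()
    print(f"best: {score:.1f}% of reference using {candidate.ideas}")
```

Keeping more than one candidate per round (`beam_width > 1`) is what lets the search follow several hypotheses at once instead of committing to a single edit sequence, which is the diversity benefit the summary describes.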