
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication Through RL

  • #CUDA
  • #HGEMM
  • #Optimization
  • CUDA-L2 combines large language models (LLMs) with reinforcement learning (RL) to optimize HGEMM CUDA kernels, outperforming torch.matmul and NVIDIA's libraries.
  • Released A100-optimized HGEMM kernels for 1,000 matrix configurations, with 32-bit accumulator support planned.
  • Future goals include denser matrix configurations, support for more GPUs (Ada Lovelace, Hopper, Blackwell), and easier deployment for open-source LLMs.
  • The released kernels are tuned specifically for the A100; speedups on other GPUs are not guaranteed (see the device check after this list).
  • For unsupported matrix dimensions, users can zero-pad inputs to the nearest supported configuration (see the padding sketch after this list) or request new configurations via GitHub issues.
  • Requirements include Python, PyTorch ≥2.6.0, and NVIDIA CUTLASS v4.2.1.
  • The environment variables CUTLASS_DIR and TORCH_CUDA_ARCH_LIST must be set before building or running (see the environment sketch after this list).
  • Evaluation can be run in offline or server mode using eval_one_file.sh with the appropriate parameters.
  • Questions can be raised via GitHub issues or email ([email protected]).
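
Since the released kernels are tuned for the A100 (compute capability 8.0), a quick runtime check can catch mismatched hardware early. A minimal sketch using PyTorch's device-capability query; the warning text is illustrative and not part of CUDA-L2:

```python
import torch

# The released CUDA-L2 kernels target the A100, i.e. compute capability 8.0 (sm_80).
major, minor = torch.cuda.get_device_capability(0)
if (major, minor) != (8, 0):
    print(f"Warning: detected sm_{major}{minor}; the A100-tuned kernels "
          "are not guaranteed to be faster on this GPU.")
```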
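Zero-padding works because extra zero rows and columns contribute nothing to the product: the top-left block of the padded result equals the original product. A minimal sketch, assuming a hypothetical 1024x1024x1024 supported configuration; padded_matmul and the sizes are illustrative, not part of the CUDA-L2 API:

```python
import torch
import torch.nn.functional as F

def padded_matmul(a, b, sup_m, sup_n, sup_k):
    """Zero-pad an (M, K) @ (K, N) problem up to a supported (sup_m, sup_n, sup_k)
    configuration, multiply, then slice back to the original (M, N) shape."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    # F.pad pads the last dimension first: (left, right, top, bottom) for 2-D tensors.
    a_pad = F.pad(a, (0, sup_k - k, 0, sup_m - m))
    b_pad = F.pad(b, (0, sup_n - n, 0, sup_k - k))
    c_pad = a_pad @ b_pad  # swap in the CUDA-L2 kernel for the supported shape here
    return c_pad[:m, :n]

# Hypothetical example: run a 1000x1000x1000 problem as 1024x1024x1024.
a = torch.randn(1000, 1000, dtype=torch.half, device="cuda")
b = torch.randn(1000, 1000, dtype=torch.half, device="cuda")
c = padded_matmul(a, b, 1024, 1024, 1024)
```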
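Both variables have to be visible before the extension is compiled or loaded; they can be exported in the shell, or, as sketched below, set from Python before importing torch. The path is a placeholder, and 8.0 is the A100's compute capability:

```python
import os

# Placeholder path: point CUTLASS_DIR at a CUTLASS v4.2.1 checkout.
os.environ["CUTLASS_DIR"] = "/path/to/cutlass"
# Compile only for the A100 architecture (compute capability 8.0).
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"

import torch  # imported after the environment is configured

print(torch.version.cuda, torch.cuda.get_device_name(0))
```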