CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication Through RL
- #CUDA
- #HGEMM
- #Optimization
- CUDA-L2 combines LLMs and reinforcement learning (RL) to optimize HGEMM CUDA kernels, outperforming torch.matmul and NVIDIA's cuBLAS.
- Released A100-optimized HGEMM kernels for 1,000 matrix configurations; 32-bit accumulator support is planned.
- Future goals include denser coverage of matrix configurations, support for more GPU architectures (Ada Lovelace, Hopper, Blackwell), and easier deployment for open-source LLMs.
- The released kernels are tuned specifically for the A100; speedups on other GPUs are not guaranteed (see the benchmark sketch after this list).
- For unsupported matrix dimensions, users can zero-pad the inputs or request new configurations via GitHub issues (see the padding sketch below).
- Requirements include Python, PyTorch ≥2.6.0, and NVIDIA CUTLASS v4.2.1.
- The environment variables CUTLASS_DIR and TORCH_CUDA_ARCH_LIST must be set before building or running (see the setup sketch below).
- Evaluation can be run in offline or server mode via the eval_one_file.sh script with the appropriate parameters.
- Questions can be raised via GitHub issues or email ([email protected]).
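
Since the speedup is only guaranteed on the A100, it is worth timing the kernels on your own hardware. Below is a minimal benchmark sketch using standard CUDA event timing in PyTorch; the 4096×4096 FP16 sizes are illustrative, and you would pass your built CUDA-L2 kernel binding to `bench` in place of `torch.matmul`.

```python
import torch

def bench(fn, a, b, iters=100):
    """Average per-call latency of fn(a, b) in milliseconds."""
    # Warm up so one-time CUDA context and autotuning costs are excluded.
    for _ in range(10):
        fn(a, b)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(a, b)
    end.record()
    torch.cuda.synchronize()  # wait for all queued kernels to finish
    return start.elapsed_time(end) / iters

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
print(f"torch.matmul: {bench(torch.matmul, a, b):.3f} ms")
# Compare against the CUDA-L2 kernel the same way once it is built.
```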
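For unsupported dimensions, a zero-padding wrapper is straightforward. The sketch below is a hypothetical helper, not part of CUDA-L2: the multiple-of-64 target is an assumed illustration of "nearest supported configuration", and `torch.matmul` stands in for the optimized kernel call. Zero-padding the shared K dimension of both operands adds only zero contributions, so slicing the output back to the original shape recovers the exact product.

```python
import torch
import torch.nn.functional as F

def padded_matmul(a: torch.Tensor, b: torch.Tensor, multiple: int = 64) -> torch.Tensor:
    """Zero-pad (M, K) x (K, N) operands up to multiples of `multiple`,
    multiply, then slice the result back to the original (M, N) shape."""
    M, K = a.shape
    K2, N = b.shape
    assert K == K2, "inner dimensions must match"
    pad = lambda x, m: (m - x % m) % m  # extra rows/cols needed to reach a multiple
    # F.pad on a 2D tensor takes (left, right, top, bottom): last dim first.
    a_p = F.pad(a, (0, pad(K, multiple), 0, pad(M, multiple)))
    b_p = F.pad(b, (0, pad(N, multiple), 0, pad(K, multiple)))
    out = a_p @ b_p  # replace with the CUDA-L2 kernel call
    return out[:M, :N]
```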
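A sketch of the required environment setup, assuming the build is driven from the same Python process (e.g., via torch.utils.cpp_extension, which reads TORCH_CUDA_ARCH_LIST at build time); the CUTLASS path is a placeholder. When building from the command line, export the same variables in your shell instead.

```python
import os

# Placeholder path: point this at your CUTLASS v4.2.1 checkout.
os.environ["CUTLASS_DIR"] = "/path/to/cutlass"
# Restrict compilation to the A100's compute capability (sm_80).
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"
```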