CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication Through RL
- #CUDA
- #HGEMM
- #Optimization
- CUDA-L2 combines LLMs and reinforcement learning (RL) to optimize HGEMM CUDA kernels, outperforming torch.matmul and NVIDIA's cuBLAS.
- Released A100-optimized HGEMM kernels for 1,000 matrix configurations; 32-bit accumulator support is planned.
- Future goals include denser coverage of matrix configurations, support for more GPU architectures (Ada Lovelace, Hopper, Blackwell), and easier deployment for open-source LLMs.
- The released kernels are tuned specifically for the A100; speedups on other GPUs are not guaranteed (see the benchmark sketch after this list).
- For unsupported matrix dimensions, users can zero-pad the inputs or request new configurations via GitHub issues (see the padding sketch below).
- Requirements include Python, PyTorch ≥2.6.0, and NVIDIA CUTLASS v4.2.1.
- The environment variables CUTLASS_DIR and TORCH_CUDA_ARCH_LIST must be set before building or running (see the setup sketch below).
- Evaluation can be run in offline or server mode via the eval_one_file.sh script with the appropriate parameters.
- Questions can be raised via GitHub issues or email ([email protected]).
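
Since the speedup is only guaranteed on the A100, it is worth timing the kernels on your own hardware. Below is a minimal benchmark sketch using standard CUDA event timing in PyTorch; the 4096×4096 FP16 sizes are illustrative, and you would pass your built CUDA-L2 kernel binding to `bench` in place of `torch.matmul`.

```python
import torch

def bench(fn, a, b, iters=100):
    """Average per-call latency of fn(a, b) in milliseconds."""
    # Warm up so one-time CUDA context and autotuning costs are excluded.
    for _ in range(10):
        fn(a, b)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(a, b)
    end.record()
    torch.cuda.synchronize()  # wait for all queued kernels to finish
    return start.elapsed_time(end) / iters

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
print(f"torch.matmul: {bench(torch.matmul, a, b):.3f} ms")
# Compare against the CUDA-L2 kernel the same way once it is built.
```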
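For unsupported dimensions, a zero-padding wrapper is straightforward. The sketch below is a hypothetical helper, not part of CUDA-L2: the multiple-of-64 target is an assumed illustration of "nearest supported configuration", and `torch.matmul` stands in for the optimized kernel call. Zero-padding the shared K dimension of both operands adds only zero contributions, so slicing the output back to the original shape recovers the exact product.

```python
import torch
import torch.nn.functional as F

def padded_matmul(a: torch.Tensor, b: torch.Tensor, multiple: int = 64) -> torch.Tensor:
    """Zero-pad (M, K) x (K, N) operands up to multiples of `multiple`,
    multiply, then slice the result back to the original (M, N) shape."""
    M, K = a.shape
    K2, N = b.shape
    assert K == K2, "inner dimensions must match"
    pad = lambda x, m: (m - x % m) % m  # extra rows/cols needed to reach a multiple
    # F.pad on a 2D tensor takes (left, right, top, bottom): last dim first.
    a_p = F.pad(a, (0, pad(K, multiple), 0, pad(M, multiple)))
    b_p = F.pad(b, (0, pad(N, multiple), 0, pad(K, multiple)))
    out = a_p @ b_p  # replace with the CUDA-L2 kernel call
    return out[:M, :N]
```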
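A sketch of the required environment setup, assuming the build is driven from the same Python process (e.g., via torch.utils.cpp_extension, which reads TORCH_CUDA_ARCH_LIST at build time); the CUTLASS path is a placeholder. When building from the command line, export the same variables in your shell instead.

```python
import os

# Placeholder path: point this at your CUTLASS v4.2.1 checkout.
os.environ["CUTLASS_DIR"] = "/path/to/cutlass"
# Restrict compilation to the A100's compute capability (sm_80).
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"
```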