GitHub - deepseek-ai/DeepGEMM: DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
- #Kernel Library
- #CUDA
- #LLM Optimization
- DeepGEMM is a unified, high-performance CUDA kernel library for modern LLMs, featuring GEMMs (FP8, FP4, BF16), fused MoE with overlapped communication (Mega MoE), MQA scoring, and HyperConnection.
- Kernels are JIT-compiled at runtime by a lightweight module, so no CUDA compilation is needed at install time; performance matches or exceeds expert-tuned libraries.
- Supports Hopper (SM90) and Blackwell (SM100) architectures, with environment variables for configuration (e.g., DG_JIT_USE_NVRTC), and includes utility functions for alignment, scaling-factor transformation, and memory management.
- Provides specialized APIs for grouped GEMMs (M-axis and K-axis) for MoE models, masked GEMMs for inference decoding, and MQA logits kernels for attention mechanisms.
- Mega MoE fuses multiple operations into a single kernel, overlapping NVLink communication and computation, and requires symmetric memory allocation.
- Released under the MIT License, with ongoing updates and performance comparisons documented via GitHub issues.