GitHub - deepseek-ai/DeepGEMM: DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
- #Kernel Library
- #CUDA
- #LLM Optimization
- DeepGEMM is a unified, high-performance CUDA kernel library for modern LLMs, featuring GEMMs (FP8, FP4, BF16), fused MoE with overlapped communication (Mega MoE), MQA scoring, and HyperConnection.
- Kernels are JIT-compiled at runtime by a lightweight module, so no CUDA compilation is needed at install time; performance matches or exceeds expert-tuned libraries.
- Supports Hopper (SM90) and Blackwell (SM100) architectures, with environment variables for configuration (e.g., DG_JIT_USE_NVRTC), and includes utility functions for alignment, scaling-factor transformation, and memory management.
- Provides specialized APIs for grouped GEMMs (M-axis and K-axis) for MoE models, masked GEMMs for inference decoding, and MQA logits kernels for attention mechanisms.
- Mega MoE fuses multiple operations into a single kernel, overlapping NVLink communication and computation, and requires symmetric memory allocation.
- Released under the MIT License, with ongoing updates and performance comparisons documented via GitHub issues.