Llama.cpp: Deterministic Inference Mode (CUDA): RMSNorm, MatMul, Attention
- #CUDA
- #deterministic-inference
- #machine-learning
- Adds an opt-in deterministic mode for CUDA inference to ensure bit-identical results for identical inputs.
- Includes deterministic implementations for RMSNorm, dense MatMul, and Attention with batch-invariant kernels.
- Uses a stable, padded KV-cache layout to maintain consistency.
- Can be enabled via CMake option GGML_DETERMINISTIC, environment variable, or CLI flag --deterministic.
- New tests verify batch invariance and cross-run determinism for all components.
- Performance impact is acknowledged but not quantified in the summary.
- Scope covers F32, F16, and BF16 data types and a range of matrix sizes.
- Documentation updates include DETERMINISM.md with details on MatMul and Attention.
- Tested on NVIDIA GPUs, including a dual-A4000 setup and an RTX 2000E Ada.
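Based on the option and flag named above, enabling the mode might look like the following sketch. The build directory, model path, and prompt are placeholders; the environment variable's name is not given in this summary, so it is omitted here.

```shell
# Build llama.cpp with CUDA and the deterministic mode compiled in
# (GGML_DETERMINISTIC is the CMake option named in the summary).
cmake -B build -DGGML_CUDA=ON -DGGML_DETERMINISTIC=ON
cmake --build build --config Release

# Enable determinism at runtime via the CLI flag
# (model path and prompt are placeholders).
./build/bin/llama-cli -m model.gguf --deterministic -p "hello" -n 32
```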
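The batch-invariance property the new tests verify can be illustrated with a small NumPy sketch (this is not the project's test code, and `rmsnorm_row` is a hypothetical reference implementation): when the per-row reduction uses a fixed order, a row normalized alone is bit-identical to the same row normalized inside a larger batch.

```python
import numpy as np

def rmsnorm_row(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Reference RMSNorm over one row with a fixed, sequential reduction order.

    Because the accumulation order never depends on batch size, the output
    bits are the same whether the row is processed alone or inside a batch.
    """
    acc = np.float32(0.0)
    for v in x:
        acc += v * v
    inv = np.float32(1.0) / np.sqrt(acc / np.float32(len(x)) + np.float32(eps))
    return x * inv

rng = np.random.default_rng(0)
batch = rng.standard_normal((8, 64)).astype(np.float32)

solo = rmsnorm_row(batch[0])                              # batch size 1
in_batch = np.stack([rmsnorm_row(r) for r in batch])[0]   # same row, batch of 8
rerun = rmsnorm_row(batch[0])                             # cross-run determinism

print(np.array_equal(solo, in_batch))  # True: batch-invariant, bit for bit
print(np.array_equal(solo, rerun))     # True: identical across runs
```

GPU kernels lose this property when the reduction strategy (tile split, atomics, split-K) changes with batch size, which is why batch-invariant kernels are the core of the deterministic mode.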