
Llama.cpp: Deterministic Inference Mode (CUDA): RMSNorm, MatMul, Attention

9 hours ago
  • #CUDA
  • #deterministic-inference
  • #machine-learning
  • Adds an opt-in deterministic mode for CUDA inference to ensure bit-identical results for identical inputs.
  • Includes deterministic implementations of RMSNorm, dense MatMul, and Attention built on batch-invariant kernels (a minimal RMSNorm sketch follows this list).
  • Uses a stable, padded KV-cache layout so results stay consistent across runs and batch sizes.
  • Can be enabled via the GGML_DETERMINISTIC CMake option, an environment variable, or the --deterministic CLI flag.
  • New tests verify batch invariance and cross-run determinism for all components (the bitwise comparison idea is sketched below the list).
  • A performance impact is noted but not quantified in the summary.
  • Scope covers multiple data types (F32/F16/BF16) and a range of matrix sizes.
  • Documentation updates include DETERMINISM.md with details on MatMul and Attention.
  • Tested on NVIDIA GPUs, including 2x A4000 and an RTX 2000E Ada.
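
To give a sense of what "batch-invariant" means for a kernel like RMSNorm, here is a minimal CUDA sketch: one thread block reduces one row, and the reduction order depends only on the compile-time block size, never on how many rows are in the batch. The kernel name, block size, and launch shape are illustrative assumptions, not the PR's actual code.

```cuda
#include <cuda_runtime.h>

// One thread block per row. The reduction order within a row is fixed by
// BLOCK_SIZE alone (power of two), so the result for a given row is the same
// regardless of batch size or grid shape. Illustrative sketch only.
template <int BLOCK_SIZE>
__global__ void rmsnorm_batch_invariant(const float* __restrict__ x,
                                        const float* __restrict__ weight,
                                        float* __restrict__ y,
                                        int ncols, float eps) {
    const int row = blockIdx.x;
    const float* xr = x + (size_t)row * ncols;
    float*       yr = y + (size_t)row * ncols;

    // Per-thread partial sum over a strided slice of the row; the slice
    // assignment depends only on threadIdx.x and ncols.
    float sumsq = 0.0f;
    for (int col = threadIdx.x; col < ncols; col += BLOCK_SIZE) {
        const float v = xr[col];
        sumsq += v * v;
    }

    // Fixed-shape tree reduction in shared memory: the combination order is
    // determined entirely by BLOCK_SIZE, so it is bitwise reproducible.
    __shared__ float buf[BLOCK_SIZE];
    buf[threadIdx.x] = sumsq;
    __syncthreads();
    for (int offset = BLOCK_SIZE / 2; offset > 0; offset >>= 1) {
        if (threadIdx.x < offset) {
            buf[threadIdx.x] += buf[threadIdx.x + offset];
        }
        __syncthreads();
    }

    const float inv_rms = rsqrtf(buf[0] / ncols + eps);
    for (int col = threadIdx.x; col < ncols; col += BLOCK_SIZE) {
        yr[col] = xr[col] * inv_rms * weight[col];
    }
}

// Example launch: grid size = number of rows, block size fixed at compile time.
// rmsnorm_batch_invariant<256><<<nrows, 256>>>(d_x, d_w, d_y, ncols, 1e-6f);
```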
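
The "bit-identical" requirement the tests enforce is stronger than the usual approximate floating-point comparison: two runs must produce exactly the same bytes, not values within a tolerance. The self-contained host-side sketch below shows only that comparison discipline; `bitwise_repeatable` and the stand-in computation are hypothetical, not llama.cpp's test code.

```cpp
#include <cstdio>
#include <cstring>
#include <functional>
#include <vector>

// Run the same computation twice on the same input and require bit-identical
// output bytes. `compute` stands in for any inference step (RMSNorm, MatMul,
// Attention); the actual llama.cpp tests are more involved.
bool bitwise_repeatable(
        const std::function<std::vector<float>(const std::vector<float>&)>& compute,
        const std::vector<float>& input) {
    const std::vector<float> out1 = compute(input);
    const std::vector<float> out2 = compute(input);
    // memcmp, not an epsilon tolerance: determinism means the exact same
    // bytes every run, not merely "close enough" floats.
    return out1.size() == out2.size() &&
           std::memcmp(out1.data(), out2.data(), out1.size() * sizeof(float)) == 0;
}

int main() {
    // Trivial stand-in computation so the sketch runs end to end.
    auto square_all = [](const std::vector<float>& v) {
        std::vector<float> out(v.size());
        for (size_t i = 0; i < v.size(); ++i) out[i] = v[i] * v[i];
        return out;
    };
    std::printf("repeatable: %d\n",
                bitwise_repeatable(square_all, {1.0f, 2.0f, 3.0f}) ? 1 : 0);
    return 0;
}
```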