
Llama.cpp: Deterministic Inference Mode (CUDA): RMSNorm, MatMul, Attention

9 hours ago
  • #CUDA
  • #deterministic-inference
  • #machine-learning
  • Adds an opt-in deterministic mode for CUDA inference to ensure bit-identical results for identical inputs.
  • Includes deterministic implementations of RMSNorm, dense MatMul, and Attention built on batch-invariant kernels (a minimal RMSNorm sketch follows this list).
  • Uses a stable, padded KV-cache layout so results stay consistent across runs and batch sizes.
  • Can be enabled via the GGML_DETERMINISTIC CMake option, an environment variable, or the --deterministic CLI flag.
  • New tests verify batch invariance and cross-run determinism for all components (the bitwise comparison idea is sketched below the list).
  • A performance impact is noted but not quantified in the summary.
  • Scope covers multiple data types (F32/F16/BF16) and a range of matrix sizes.
  • Documentation updates include DETERMINISM.md with details on MatMul and Attention.
  • Tested on NVIDIA GPUs, including 2x A4000 and an RTX 2000E Ada.
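
To give a sense of what "batch-invariant" means for a kernel like RMSNorm, here is a minimal CUDA sketch: one thread block reduces one row, and the reduction order depends only on the compile-time block size, never on how many rows are in the batch. The kernel name, block size, and launch shape are illustrative assumptions, not the PR's actual code.

```cuda
#include <cuda_runtime.h>

// One thread block per row. The reduction order within a row is fixed by
// BLOCK_SIZE alone (power of two), so the result for a given row is the same
// regardless of batch size or grid shape. Illustrative sketch only.
template <int BLOCK_SIZE>
__global__ void rmsnorm_batch_invariant(const float* __restrict__ x,
                                        const float* __restrict__ weight,
                                        float* __restrict__ y,
                                        int ncols, float eps) {
    const int row = blockIdx.x;
    const float* xr = x + (size_t)row * ncols;
    float*       yr = y + (size_t)row * ncols;

    // Per-thread partial sum over a strided slice of the row; the slice
    // assignment depends only on threadIdx.x and ncols.
    float sumsq = 0.0f;
    for (int col = threadIdx.x; col < ncols; col += BLOCK_SIZE) {
        const float v = xr[col];
        sumsq += v * v;
    }

    // Fixed-shape tree reduction in shared memory: the combination order is
    // determined entirely by BLOCK_SIZE, so it is bitwise reproducible.
    __shared__ float buf[BLOCK_SIZE];
    buf[threadIdx.x] = sumsq;
    __syncthreads();
    for (int offset = BLOCK_SIZE / 2; offset > 0; offset >>= 1) {
        if (threadIdx.x < offset) {
            buf[threadIdx.x] += buf[threadIdx.x + offset];
        }
        __syncthreads();
    }

    const float inv_rms = rsqrtf(buf[0] / ncols + eps);
    for (int col = threadIdx.x; col < ncols; col += BLOCK_SIZE) {
        yr[col] = xr[col] * inv_rms * weight[col];
    }
}

// Example launch: grid size = number of rows, block size fixed at compile time.
// rmsnorm_batch_invariant<256><<<nrows, 256>>>(d_x, d_w, d_y, ncols, 1e-6f);
```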
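
The "bit-identical" requirement the tests enforce is stronger than the usual approximate floating-point comparison: two runs must produce exactly the same bytes, not values within a tolerance. The self-contained host-side sketch below shows only that comparison discipline; `bitwise_repeatable` and the stand-in computation are hypothetical, not llama.cpp's test code.

```cpp
#include <cstdio>
#include <cstring>
#include <functional>
#include <vector>

// Run the same computation twice on the same input and require bit-identical
// output bytes. `compute` stands in for any inference step (RMSNorm, MatMul,
// Attention); the actual llama.cpp tests are more involved.
bool bitwise_repeatable(
        const std::function<std::vector<float>(const std::vector<float>&)>& compute,
        const std::vector<float>& input) {
    const std::vector<float> out1 = compute(input);
    const std::vector<float> out2 = compute(input);
    // memcmp, not an epsilon tolerance: determinism means the exact same
    // bytes every run, not merely "close enough" floats.
    return out1.size() == out2.size() &&
           std::memcmp(out1.data(), out2.data(), out1.size() * sizeof(float)) == 0;
}

int main() {
    // Trivial stand-in computation so the sketch runs end to end.
    auto square_all = [](const std::vector<float>& v) {
        std::vector<float> out(v.size());
        for (size_t i = 0; i < v.size(); ++i) out[i] = v[i] * v[i];
        return out;
    };
    std::printf("repeatable: %d\n",
                bitwise_repeatable(square_all, {1.0f, 2.0f, 3.0f}) ? 1 : 0);
    return 0;
}
```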