Defeating Nondeterminism in LLM Inference
- #inference
- #LLM
- #determinism
- LLM inference is nondeterministic in practice because floating-point arithmetic is non-associative and inference-server batch sizes vary with concurrent load.
- Floating-point non-associativity causes numerical differences when the same operations are performed in different orders (first sketch after this list).
- Batch-size variations in inference servers make results nondeterministic from a user's perspective: which batch a request lands in depends on concurrent load, and most kernels are not batch-invariant (second sketch below).
- Achieving deterministic LLM inference therefore requires batch-invariant kernels for operations like RMSNorm, matrix multiplication, and attention (an RMSNorm sketch follows the list).
- Batch-invariant attention requires a consistent reduction order over the KV cache regardless of how many tokens are processed at once or how the cache was filled (fixed-split sketch below).
- Deterministic inference enables true on-policy reinforcement learning by making sampling numerics bitwise-identical to training numerics.
- The performance cost of batch-invariant kernels is manageable, and attention kernels in particular leave room for further optimization.
- The community is encouraged to address nondeterminism in ML systems for reproducibility and reliability.
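
A minimal demonstration of the non-associativity point, using nothing from the post itself, just standard IEEE-754 behavior: the same operands grouped differently, then one multiset of float32 values summed in two orders.

```python
import numpy as np

# Same three operands, different grouping: 0.1 is absorbed by 1e20 in the
# first expression but survives in the second.
print((0.1 + 1e20) - 1e20)  # 0.0
print(0.1 + (1e20 - 1e20))  # 0.1

# The same effect at scale: summing one multiset of float32 values in two
# different orders usually disagrees in the low-order bits.
rng = np.random.default_rng(0)
vals = (rng.standard_normal(10_000) * 10.0 ** rng.integers(-6, 7, 10_000)).astype(np.float32)
fwd = np.float32(0.0)
for v in vals:
    fwd += v
bwd = np.float32(0.0)
for v in reversed(vals):
    bwd += v
print(fwd, bwd, fwd == bwd)  # typically two different sums
```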
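
To see why batch size matters, compare one row computed alone against the same row computed inside a larger batch. The shapes here are arbitrary choices of mine; whether a difference actually appears depends on the backend's kernel-selection and tiling heuristics, so it is most visible on GPU.

```python
import torch

torch.manual_seed(0)
W = torch.randn(2048, 2048)                      # stand-in weight matrix
x = torch.randn(1, 2048)                         # one token's activations
batch = torch.cat([x, torch.randn(127, 2048)])   # same token as row 0 of a batch

alone = x @ W               # computed at batch size 1
in_batch = (batch @ W)[:1]  # computed inside a batch of 128

# A kernel that picks its tiling or split strategy based on batch size
# reduces in a different order in these two cases, so row 0 can differ in
# the low-order bits even though its inputs are identical.
print((alone - in_batch).abs().max())  # often nonzero, especially on GPU
```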
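
A sketch of what batch invariance means for RMSNorm; the function name and details are mine, not the post's. The requirement is that each row's reduction order be fixed and independent of how many rows share the batch, which at the kernel level means never switching to split-reduction strategies at small batch sizes. Eager PyTorch is used here only to illustrate the shape of the computation.

```python
import torch

def rmsnorm_batch_invariant(x: torch.Tensor, weight: torch.Tensor,
                            eps: float = 1e-6) -> torch.Tensor:
    # Each row reduces over its own hidden dimension in one fixed pass,
    # accumulating in float32. Because the reduction touches only that row,
    # adding more rows to the batch must not change its order; a real
    # kernel enforces this by keeping one reduction strategy per row
    # regardless of batch size.
    ms = x.float().pow(2).mean(dim=-1, keepdim=True)  # mean of squares per row
    return (x.float() * torch.rsqrt(ms + eps) * weight.float()).to(x.dtype)
```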
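
For attention, the consistent-reduction-order requirement can be met by splitting the KV cache into fixed-size blocks and folding them into a running online-softmax state in a fixed order. The single-query sketch below is my own illustration (block size 256 is an arbitrary choice): because block boundaries depend only on KV positions, the reduction order for a given query is the same whether the cache is attended to in one prefill pass or token by token during decoding.

```python
import torch

def attention_fixed_kv_splits(q, k, v, block_size: int = 256):
    # q: (d,), k and v: (n, d). The KV cache is consumed in fixed-size
    # blocks, each folded into a running online-softmax state strictly
    # left to right. Block boundaries depend only on KV indices, never on
    # batch size or decode progress, so the reduction order is fixed.
    scale = q.shape[-1] ** -0.5
    m = torch.tensor(float("-inf"))  # running max of attention scores
    l = torch.tensor(0.0)            # running softmax denominator
    o = torch.zeros_like(q)          # running weighted sum of values
    for start in range(0, k.shape[0], block_size):
        ks, vs = k[start:start + block_size], v[start:start + block_size]
        s = ks @ q * scale                   # scores for this block
        m_new = torch.maximum(m, s.max())
        p = torch.exp(s - m_new)
        corr = torch.exp(m - m_new)          # rescale the old state
        l = l * corr + p.sum()
        o = o * corr + p @ vs
        m = m_new
    return o / l

# Agrees with a reference softmax attention up to float tolerance:
torch.manual_seed(0)
q, k, v = torch.randn(64), torch.randn(1000, 64), torch.randn(1000, 64)
ref = torch.softmax(k @ q * 64 ** -0.5, dim=0) @ v
print(torch.allclose(attention_fixed_kv_splits(q, k, v), ref, atol=1e-5))
```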