Defeating Nondeterminism in LLM Inference
- #inference
- #LLM
- #determinism
- LLM inference is nondeterministic in practice because floating-point arithmetic is non-associative and inference-server batch sizes vary with concurrent load.
- Floating-point non-associativity causes numerical differences when the same operations are performed in different orders (first sketch after this list).
- Batch-size variations in inference servers make results nondeterministic from a user's perspective: which batch a request lands in depends on concurrent load, and most kernels are not batch-invariant (second sketch below).
- Achieving deterministic LLM inference therefore requires batch-invariant kernels for operations like RMSNorm, matrix multiplication, and attention (an RMSNorm sketch follows the list).
- Batch-invariant attention requires a consistent reduction order over the KV cache regardless of how many tokens are processed at once or how the cache was filled (fixed-split sketch below).
- Deterministic inference enables true on-policy reinforcement learning by making sampling numerics bitwise-identical to training numerics.
- The performance cost of batch-invariant kernels is manageable, and attention kernels in particular leave room for further optimization.
- The community is encouraged to address nondeterminism in ML systems for reproducibility and reliability.
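
A minimal demonstration of the non-associativity point, using nothing from the post itself, just standard IEEE-754 behavior: the same operands grouped differently, then one multiset of float32 values summed in two orders.

```python
import numpy as np

# Same three operands, different grouping: 0.1 is absorbed by 1e20 in the
# first expression but survives in the second.
print((0.1 + 1e20) - 1e20)  # 0.0
print(0.1 + (1e20 - 1e20))  # 0.1

# The same effect at scale: summing one multiset of float32 values in two
# different orders usually disagrees in the low-order bits.
rng = np.random.default_rng(0)
vals = (rng.standard_normal(10_000) * 10.0 ** rng.integers(-6, 7, 10_000)).astype(np.float32)
fwd = np.float32(0.0)
for v in vals:
    fwd += v
bwd = np.float32(0.0)
for v in reversed(vals):
    bwd += v
print(fwd, bwd, fwd == bwd)  # typically two different sums
```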
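
To see why batch size matters, compare one row computed alone against the same row computed inside a larger batch. The shapes here are arbitrary choices of mine; whether a difference actually appears depends on the backend's kernel-selection and tiling heuristics, so it is most visible on GPU.

```python
import torch

torch.manual_seed(0)
W = torch.randn(2048, 2048)                      # stand-in weight matrix
x = torch.randn(1, 2048)                         # one token's activations
batch = torch.cat([x, torch.randn(127, 2048)])   # same token as row 0 of a batch

alone = x @ W               # computed at batch size 1
in_batch = (batch @ W)[:1]  # computed inside a batch of 128

# A kernel that picks its tiling or split strategy based on batch size
# reduces in a different order in these two cases, so row 0 can differ in
# the low-order bits even though its inputs are identical.
print((alone - in_batch).abs().max())  # often nonzero, especially on GPU
```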
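
A sketch of what batch invariance means for RMSNorm; the function name and details are mine, not the post's. The requirement is that each row's reduction order be fixed and independent of how many rows share the batch, which at the kernel level means never switching to split-reduction strategies at small batch sizes. Eager PyTorch is used here only to illustrate the shape of the computation.

```python
import torch

def rmsnorm_batch_invariant(x: torch.Tensor, weight: torch.Tensor,
                            eps: float = 1e-6) -> torch.Tensor:
    # Each row reduces over its own hidden dimension in one fixed pass,
    # accumulating in float32. Because the reduction touches only that row,
    # adding more rows to the batch must not change its order; a real
    # kernel enforces this by keeping one reduction strategy per row
    # regardless of batch size.
    ms = x.float().pow(2).mean(dim=-1, keepdim=True)  # mean of squares per row
    return (x.float() * torch.rsqrt(ms + eps) * weight.float()).to(x.dtype)
```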
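
For attention, the consistent-reduction-order requirement can be met by splitting the KV cache into fixed-size blocks and folding them into a running online-softmax state in a fixed order. The single-query sketch below is my own illustration (block size 256 is an arbitrary choice): because block boundaries depend only on KV positions, the reduction order for a given query is the same whether the cache is attended to in one prefill pass or token by token during decoding.

```python
import torch

def attention_fixed_kv_splits(q, k, v, block_size: int = 256):
    # q: (d,), k and v: (n, d). The KV cache is consumed in fixed-size
    # blocks, each folded into a running online-softmax state strictly
    # left to right. Block boundaries depend only on KV indices, never on
    # batch size or decode progress, so the reduction order is fixed.
    scale = q.shape[-1] ** -0.5
    m = torch.tensor(float("-inf"))  # running max of attention scores
    l = torch.tensor(0.0)            # running softmax denominator
    o = torch.zeros_like(q)          # running weighted sum of values
    for start in range(0, k.shape[0], block_size):
        ks, vs = k[start:start + block_size], v[start:start + block_size]
        s = ks @ q * scale                   # scores for this block
        m_new = torch.maximum(m, s.max())
        p = torch.exp(s - m_new)
        corr = torch.exp(m - m_new)          # rescale the old state
        l = l * corr + p.sum()
        o = o * corr + p @ vs
        m = m_new
    return o / l

# Agrees with a reference softmax attention up to float tolerance:
torch.manual_seed(0)
q, k, v = torch.randn(64), torch.randn(1000, 64), torch.randn(1000, 64)
ref = torch.softmax(k @ q * 64 ** -0.5, dim=0) @ v
print(torch.allclose(attention_fixed_kv_splits(q, k, v), ref, atol=1e-5))
```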