Hasty Briefs (beta)

  • #AI
  • #NVIDIA
  • #Quantization
  • AI workloads have grown rapidly, driven by the deployment of large language models (LLMs) and the volume of tokens processed during pretraining and post-training.
  • NVIDIA's NVFP4, a 4-bit floating-point format, reduces inference latency and improves throughput and efficiency while maintaining accuracy.
  • NVFP4 is now being extended to pretraining, offering significant improvements in training efficiency and scalability.
  • 4-bit quantization reduces model weights and activations to 4-bit precision, which requires specialized techniques to maintain accuracy.
  • NVFP4's pretraining recipe includes micro-block scaling, high-precision block encoding, tensor reshaping, and stochastic rounding to ensure stability and accuracy.
  • Experiments show NVFP4 matches FP8 performance in large-scale pretraining, validating its effectiveness for trillion-token models.
  • NVFP4 enables AI factories to scale more efficiently, reducing power and compute costs while accelerating model development.
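The micro-block scaling and stochastic rounding steps above can be sketched in NumPy. This is an illustrative sketch, not NVIDIA's implementation: it assumes 16-element blocks, the FP4 E2M1 magnitude grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}, a simple per-block absolute-max scale, and unbiased stochastic rounding between the two nearest grid points; the function name `quantize_nvfp4_block` is hypothetical.

```python
import numpy as np

# FP4 E2M1 representable magnitudes (the grid a 4-bit float can encode)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_block(x, rng, block_size=16):
    """Sketch of micro-block 4-bit quantization with stochastic rounding.

    Per block: scale so the block's max magnitude maps to 6.0 (the
    largest FP4 value), then round each element to the FP4 grid,
    choosing between the two bracketing grid points at random so the
    result is unbiased in expectation.
    """
    x = np.asarray(x, dtype=np.float32)
    assert x.size % block_size == 0
    blocks = x.reshape(-1, block_size)
    out = np.empty_like(blocks)
    for i, blk in enumerate(blocks):
        amax = np.abs(blk).max()
        scale = amax / FP4_GRID[-1] if amax > 0 else 1.0
        scaled = np.abs(blk) / scale                    # magnitudes in [0, 6]
        # indices of the grid points bracketing each value
        hi = np.clip(np.searchsorted(FP4_GRID, scaled), 1, len(FP4_GRID) - 1)
        lo = hi - 1
        lo_v, hi_v = FP4_GRID[lo], FP4_GRID[hi]
        # round up with probability proportional to the distance from the
        # lower grid point: E[q] equals the unquantized value
        p_up = (scaled - lo_v) / (hi_v - lo_v)
        q = np.where(rng.random(block_size) < p_up, hi_v, lo_v)
        out[i] = np.sign(blk) * q * scale
    return out.reshape(x.shape)
```

Stochastic rounding is what keeps tiny gradient updates from being systematically lost at 4-bit precision: deterministic round-to-nearest always discards sub-grid changes, while random rounding preserves them on average across steps.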