Model FLOPs Utilization Beyond 6ND
- #AI Efficiency
- #GPU Utilization
- #Model Training
- Model FLOPs Utilization (MFU) is a key efficiency metric for how well AI training and inference workloads use GPU compute, but the way it is usually calculated is highly approximate.
- The common 6ND approximation assumes compute-bound training (roughly 2ND for the forward pass plus 4ND for the backward pass over dense parameters), but modern large-scale training pipelines often spend much of their time on inference, which that formula does not model (see the 6ND sketch after this list).
- Attention complicates MFU accounting: its FLOPs grow quadratically with sequence length, so the cost per token is not a fixed function of the parameter count (the 6ND sketch below includes the usual attention correction term).
- Mixture-of-Experts (MoE) models add further complexity: FLOP estimates must be based on the parameters activated per token rather than the total parameter count (see the MoE sketch after this list).
- Parallelism in training and serving on thousands of GPUs adds layers of complexity to MFU calculations.
- Continuous batching in inference services makes MFU a moving target: the in-flight batch changes as asynchronous requests arrive and finish, so utilization has to be measured over a window rather than as a single static number (see the windowed-MFU sketch after this list).
- Disaggregating prefill and decode onto separate hardware pools in modern serving stacks means MFU diverges between the two phases, since prefill is compute-bound while decode is typically memory-bandwidth-bound.
- KV caches and speculative decoding complicate the accounting further: cached attention changes the FLOPs spent per generated token, and rejected draft tokens burn FLOPs on output that is discarded.
- Two approaches to improving the FLOP count behind MFU: a simple estimate from active parameter counts, and an exact count derived from the PyTorch computation graph (see the FlopCounterMode sketch after this list).
- MFU is a useful but limited metric: it doesn't pinpoint bottlenecks, and it would benefit from per-module breakdowns or a roofline-model view.
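
A minimal 6ND sketch, under assumed numbers: the dense 7B model shape, token counts, step time, and H100-class peak throughput below are illustrative, and the attention correction uses the widely cited 12 · layers · d_model · seq_len FLOPs-per-token term, not a value taken from the original article.

```python
# Rough FLOP accounting for dense-transformer training: the plain 6ND
# estimate versus an attention-aware estimate, plus an MFU calculation.
# All model and hardware numbers are illustrative assumptions.

def training_flops_6nd(n_params: float, n_tokens: float) -> float:
    """Classic rule of thumb: 6 FLOPs per parameter per token (fwd + bwd)."""
    return 6.0 * n_params * n_tokens

def training_flops_attn(n_params: float, n_tokens: float,
                        n_layers: int, d_model: int, seq_len: int) -> float:
    """Adds ~12 * layers * d_model * seq_len FLOPs per token for the
    attention score/value matmuls (PaLM-style accounting)."""
    per_token = 6.0 * n_params + 12.0 * n_layers * d_model * seq_len
    return per_token * n_tokens

def mfu(achieved_flops: float, elapsed_s: float,
        n_gpus: int, peak_flops_per_gpu: float) -> float:
    """MFU = achieved FLOP/s divided by aggregate peak FLOP/s."""
    return achieved_flops / (elapsed_s * n_gpus * peak_flops_per_gpu)

if __name__ == "__main__":
    N, layers, d_model, seq = 7e9, 32, 4096, 8192    # assumed 7B model
    step_tokens = 4e6                                # assumed tokens per step
    print(f"6ND per step:        {training_flops_6nd(N, step_tokens):.3e}")
    step_flops = training_flops_attn(N, step_tokens, layers, d_model, seq)
    print(f"with attention term: {step_flops:.3e}")
    # 256 GPUs at an assumed 989 TFLOP/s BF16 dense peak, 2.0 s per step
    print(f"MFU: {mfu(step_flops, 2.0, 256, 989e12):.1%}")
```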
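
For the MoE point, a back-of-the-envelope count of active versus total parameters makes the difference concrete; the layer shapes, expert count, and top-k routing below are hypothetical.

```python
# Active- versus total-parameter counting for one MoE FFN layer.
# d_model, d_ff, expert count, and top_k are illustrative assumptions.

def moe_layer_params(d_model: int, d_ff: int, n_experts: int, top_k: int):
    """Return (total, active) parameter counts for a MoE FFN layer with a
    simple two-matrix expert (up- and down-projection) and a linear router."""
    per_expert = 2 * d_model * d_ff
    router = d_model * n_experts
    total = n_experts * per_expert + router
    active = top_k * per_expert + router     # only top_k experts run per token
    return total, active

if __name__ == "__main__":
    total, active = moe_layer_params(d_model=4096, d_ff=14336,
                                     n_experts=8, top_k=2)
    print(f"total params per layer:   {total / 1e6:.1f}M")
    print(f"active params per token:  {active / 1e6:.1f}M")
    # Forward FLOPs per token: ~2 FLOPs per *active* parameter, not per total.
    print(f"naive (total) estimate:   {2 * total:.3e} FLOPs/token")
    print(f"active-param estimate:    {2 * active:.3e} FLOPs/token")
```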
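
One way to handle continuous batching is to accumulate tokens over a sliding time window and report utilization per window. The windowed-MFU monitor below is a hypothetical sketch, not the API of any particular serving framework; it assumes roughly 2 FLOPs per active parameter per decoded token.

```python
import time
from collections import deque

class WindowedMFU:
    """Report MFU over a sliding time window for a continuously batched
    decode loop, since the in-flight batch changes every scheduler step."""

    def __init__(self, flops_per_token: float, peak_flops_per_s: float,
                 window_s: float = 10.0):
        self.flops_per_token = flops_per_token   # ~2 * active params (approx.)
        self.peak = peak_flops_per_s             # aggregate peak of the replica
        self.window_s = window_s
        self.events = deque()                    # (timestamp, tokens in step)

    def record_step(self, tokens_processed: int) -> None:
        now = time.monotonic()
        self.events.append((now, tokens_processed))
        # Drop events that have fallen out of the window.
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()

    def current_mfu(self) -> float:
        if len(self.events) < 2:
            return 0.0
        elapsed = self.events[-1][0] - self.events[0][0]
        if elapsed <= 0:
            return 0.0
        # Count tokens processed since the first timestamp in the window.
        tokens = sum(t for _, t in list(self.events)[1:])
        return tokens * self.flops_per_token / (elapsed * self.peak)

# Usage with hypothetical numbers: a 7B dense model on one GPU.
monitor = WindowedMFU(flops_per_token=2 * 7e9, peak_flops_per_s=989e12)
# In the serving loop, after each scheduler iteration:
#     monitor.record_step(tokens_processed=len(batch))
#     log(monitor.current_mfu())
```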
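
For the graph-based approach, recent PyTorch releases include torch.utils.flop_counter.FlopCounterMode, which counts FLOPs for the operators actually dispatched during forward and backward. The FlopCounterMode sketch below uses a toy model as a placeholder; substitute the real module under measurement, and note that option names may vary across PyTorch versions.

```python
import torch
from torch import nn
from torch.utils.flop_counter import FlopCounterMode

# Placeholder model; swap in the real module whose FLOPs you want to measure.
model = nn.Sequential(nn.Linear(4096, 11008), nn.GELU(), nn.Linear(11008, 4096))
x = torch.randn(8, 512, 4096)

with FlopCounterMode(display=False) as counter:
    out = model(x)
    out.sum().backward()   # backward ops inside the context are counted too

measured = counter.get_total_flops()
n_params = sum(p.numel() for p in model.parameters())
n_tokens = x.shape[0] * x.shape[1]

print(f"measured FLOPs:      {measured:.3e}")
print(f"6 * params * tokens: {6 * n_params * n_tokens:.3e}")
```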