Model FLOPs Utilization Beyond 6ND
- #AI Efficiency
- #GPU Utilization
- #Model Training
- Model FLOPs Utilization (MFU) is a key efficiency metric for how well AI training and inference workloads use GPU compute, but the way it is usually calculated is highly approximate.
- The common 6ND approximation assumes compute-bound training (roughly 2ND for the forward pass plus 4ND for the backward pass over dense parameters), but modern large-scale training pipelines often spend much of their time on inference, which that formula does not model (see the 6ND sketch after this list).
- Attention complicates MFU accounting: its FLOPs grow quadratically with sequence length, so the cost per token is not a fixed function of the parameter count (the 6ND sketch below includes the usual attention correction term).
- Mixture-of-Experts (MoE) models add further complexity: FLOP estimates must be based on the parameters activated per token rather than the total parameter count (see the MoE sketch after this list).
- Parallelism in training and serving on thousands of GPUs adds layers of complexity to MFU calculations.
- Continuous batching in inference services makes MFU a moving target: the in-flight batch changes as asynchronous requests arrive and finish, so utilization has to be measured over a window rather than as a single static number (see the windowed-MFU sketch after this list).
- Disaggregating prefill and decode onto separate hardware pools in modern serving stacks means MFU diverges between the two phases, since prefill is compute-bound while decode is typically memory-bandwidth-bound.
- KV caches and speculative decoding complicate the accounting further: cached attention changes the FLOPs spent per generated token, and rejected draft tokens burn FLOPs on output that is discarded.
- Two approaches to improving the FLOP count behind MFU: a simple estimate from active parameter counts, and an exact count derived from the PyTorch computation graph (see the FlopCounterMode sketch after this list).
- MFU is a useful but limited metric: it doesn't pinpoint bottlenecks, and it would benefit from per-module breakdowns or a roofline-model view.
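
A minimal 6ND sketch, under assumed numbers: the dense 7B model shape, token counts, step time, and H100-class peak throughput below are illustrative, and the attention correction uses the widely cited 12 · layers · d_model · seq_len FLOPs-per-token term, not a value taken from the original article.

```python
# Rough FLOP accounting for dense-transformer training: the plain 6ND
# estimate versus an attention-aware estimate, plus an MFU calculation.
# All model and hardware numbers are illustrative assumptions.

def training_flops_6nd(n_params: float, n_tokens: float) -> float:
    """Classic rule of thumb: 6 FLOPs per parameter per token (fwd + bwd)."""
    return 6.0 * n_params * n_tokens

def training_flops_attn(n_params: float, n_tokens: float,
                        n_layers: int, d_model: int, seq_len: int) -> float:
    """Adds ~12 * layers * d_model * seq_len FLOPs per token for the
    attention score/value matmuls (PaLM-style accounting)."""
    per_token = 6.0 * n_params + 12.0 * n_layers * d_model * seq_len
    return per_token * n_tokens

def mfu(achieved_flops: float, elapsed_s: float,
        n_gpus: int, peak_flops_per_gpu: float) -> float:
    """MFU = achieved FLOP/s divided by aggregate peak FLOP/s."""
    return achieved_flops / (elapsed_s * n_gpus * peak_flops_per_gpu)

if __name__ == "__main__":
    N, layers, d_model, seq = 7e9, 32, 4096, 8192    # assumed 7B model
    step_tokens = 4e6                                # assumed tokens per step
    print(f"6ND per step:        {training_flops_6nd(N, step_tokens):.3e}")
    step_flops = training_flops_attn(N, step_tokens, layers, d_model, seq)
    print(f"with attention term: {step_flops:.3e}")
    # 256 GPUs at an assumed 989 TFLOP/s BF16 dense peak, 2.0 s per step
    print(f"MFU: {mfu(step_flops, 2.0, 256, 989e12):.1%}")
```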
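
For the MoE point, a back-of-the-envelope count of active versus total parameters makes the difference concrete; the layer shapes, expert count, and top-k routing below are hypothetical.

```python
# Active- versus total-parameter counting for one MoE FFN layer.
# d_model, d_ff, expert count, and top_k are illustrative assumptions.

def moe_layer_params(d_model: int, d_ff: int, n_experts: int, top_k: int):
    """Return (total, active) parameter counts for a MoE FFN layer with a
    simple two-matrix expert (up- and down-projection) and a linear router."""
    per_expert = 2 * d_model * d_ff
    router = d_model * n_experts
    total = n_experts * per_expert + router
    active = top_k * per_expert + router     # only top_k experts run per token
    return total, active

if __name__ == "__main__":
    total, active = moe_layer_params(d_model=4096, d_ff=14336,
                                     n_experts=8, top_k=2)
    print(f"total params per layer:   {total / 1e6:.1f}M")
    print(f"active params per token:  {active / 1e6:.1f}M")
    # Forward FLOPs per token: ~2 FLOPs per *active* parameter, not per total.
    print(f"naive (total) estimate:   {2 * total:.3e} FLOPs/token")
    print(f"active-param estimate:    {2 * active:.3e} FLOPs/token")
```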
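
One way to handle continuous batching is to accumulate tokens over a sliding time window and report utilization per window. The windowed-MFU monitor below is a hypothetical sketch, not the API of any particular serving framework; it assumes roughly 2 FLOPs per active parameter per decoded token.

```python
import time
from collections import deque

class WindowedMFU:
    """Report MFU over a sliding time window for a continuously batched
    decode loop, since the in-flight batch changes every scheduler step."""

    def __init__(self, flops_per_token: float, peak_flops_per_s: float,
                 window_s: float = 10.0):
        self.flops_per_token = flops_per_token   # ~2 * active params (approx.)
        self.peak = peak_flops_per_s             # aggregate peak of the replica
        self.window_s = window_s
        self.events = deque()                    # (timestamp, tokens in step)

    def record_step(self, tokens_processed: int) -> None:
        now = time.monotonic()
        self.events.append((now, tokens_processed))
        # Drop events that have fallen out of the window.
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()

    def current_mfu(self) -> float:
        if len(self.events) < 2:
            return 0.0
        elapsed = self.events[-1][0] - self.events[0][0]
        if elapsed <= 0:
            return 0.0
        # Count tokens processed since the first timestamp in the window.
        tokens = sum(t for _, t in list(self.events)[1:])
        return tokens * self.flops_per_token / (elapsed * self.peak)

# Usage with hypothetical numbers: a 7B dense model on one GPU.
monitor = WindowedMFU(flops_per_token=2 * 7e9, peak_flops_per_s=989e12)
# In the serving loop, after each scheduler iteration:
#     monitor.record_step(tokens_processed=len(batch))
#     log(monitor.current_mfu())
```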
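
For the graph-based approach, recent PyTorch releases include torch.utils.flop_counter.FlopCounterMode, which counts FLOPs for the operators actually dispatched during forward and backward. The FlopCounterMode sketch below uses a toy model as a placeholder; substitute the real module under measurement, and note that option names may vary across PyTorch versions.

```python
import torch
from torch import nn
from torch.utils.flop_counter import FlopCounterMode

# Placeholder model; swap in the real module whose FLOPs you want to measure.
model = nn.Sequential(nn.Linear(4096, 11008), nn.GELU(), nn.Linear(11008, 4096))
x = torch.randn(8, 512, 4096)

with FlopCounterMode(display=False) as counter:
    out = model(x)
    out.sum().backward()   # backward ops inside the context are counted too

measured = counter.get_total_flops()
n_params = sum(p.numel() for p in model.parameters())
n_tokens = x.shape[0] * x.shape[1]

print(f"measured FLOPs:      {measured:.3e}")
print(f"6 * params * tokens: {6 * n_params * n_tokens:.3e}")
```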