
Model FLOPs Utilization Beyond 6ND

  • #AI Efficiency
  • #GPU Utilization
  • #Model Training
  • Model FLOPs Utilization (MFU) is a key metric for how efficiently GPUs are used in AI training and inference, but the common ways of computing it are highly approximate.
  • The common 6ND approximation (training FLOPs ≈ 6 × N parameters × D tokens, sketched in code after this list) assumes compute-bound training, but modern large-scale training is often inference-bound, e.g. when generation-heavy post-training spends most of its time producing rollouts.
  • Attention complicates the count: its FLOPs grow quadratically with sequence length, so per-token cost is not constant and 6ND undercounts long-context work (see the attention sketch below).
  • Mixture-of-Experts (MoE) models add another wrinkle: only the routed experts do work per token, so FLOPs estimates must use the active parameter count rather than the total (see the MoE sketch below).
  • Parallelism in training and serving on thousands of GPUs adds layers of complexity to MFU calculations.
  • Continuous batching makes inference MFU a moving target: batch size and composition shift constantly as asynchronous requests arrive and complete (see the decode sketch below).
  • Disaggregating prefill and decode onto separately optimized hardware pools makes MFU diverge between the two phases: prefill is compute-heavy, while decode is memory-bandwidth-bound.
  • KV caches and speculative decoding muddy the accounting further, adding extra FLOPs and wasted (rejected) tokens that may or may not count as useful work.
  • Two approaches to better MFU estimates: a simple analytic count based on active parameters, and an accurate FLOPs count derived from the PyTorch execution graph (both sketched after this list).
  • MFU is a useful but limited metric: it doesn't pinpoint bottlenecks, and per-module breakdowns or a roofline model would make it more actionable.
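
To make the 6ND bullet concrete, here is a minimal sketch of a training-MFU calculation. Everything in it is an illustrative assumption rather than a figure from the article: the function name, the hypothetical 7B model, and the ~989 TFLOP/s BF16 peak of an H100.

```python
# Minimal sketch: training MFU from the 6ND approximation.
# 6ND ~= 6 FLOPs per parameter per token (2 forward + 4 backward).

def training_mfu_6nd(n_params: float, tokens_per_step: float,
                     step_time_s: float, peak_flops_per_s: float) -> float:
    achieved_flops_per_s = 6.0 * n_params * tokens_per_step / step_time_s
    return achieved_flops_per_s / peak_flops_per_s

# Hypothetical run: 7B dense model, 4M-token batches, 2 s per step,
# 256 GPUs at ~989e12 FLOP/s peak each (H100 BF16, dense).
mfu = training_mfu_6nd(n_params=7e9, tokens_per_step=4e6,
                       step_time_s=2.0, peak_flops_per_s=256 * 989e12)
print(f"MFU ~ {mfu:.0%}")  # ~33%
```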
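Where sequence length matters, a common correction (popularized in the PaLM paper's appendix) adds roughly 12 · n_layers · d_model · seq_len attention FLOPs per token on top of 6N. A sketch with hypothetical model dimensions:

```python
# Per-token training FLOPs with the quadratic attention term included.
def flops_per_token(n_params: float, n_layers: int,
                    d_model: int, seq_len: int) -> float:
    dense = 6.0 * n_params                      # parameter matmuls, fwd + bwd
    attn = 12.0 * n_layers * d_model * seq_len  # QK^T and attn @ V, fwd + bwd
    return dense + attn

# Hypothetical 7B model: 32 layers, d_model = 4096.
print(f"{flops_per_token(7e9, 32, 4096, 8_192):.2e}")    # 8k ctx: attn adds ~30%
print(f"{flops_per_token(7e9, 32, 4096, 131_072):.2e}")  # 128k ctx: attn dominates
```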
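For MoE models, the same estimates should use active rather than total parameters, since only the routed experts run per token. A toy illustration with hypothetical sizes:

```python
# Toy MoE accounting: per token, only shared weights plus the top-k
# routed experts do work, so FLOPs estimates use the active count.
def moe_param_counts(shared: float, n_experts: int,
                     per_expert: float, top_k: int) -> tuple[float, float]:
    total = shared + n_experts * per_expert
    active = shared + top_k * per_expert
    return total, active

total, active = moe_param_counts(shared=2e9, n_experts=64,
                                 per_expert=0.5e9, top_k=2)
print(f"total {total:.1e}, active {active:.1e}")  # total 3.4e+10, active 3.0e+09
```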
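On the serving side, a rough way to see why continuous batching makes MFU dynamic, and why decode MFU sits far below prefill MFU: each decode step does about 2 · N_active · B forward FLOPs for B in-flight requests, and B changes constantly. All numbers below are hypothetical:

```python
# Sketch: instantaneous decode MFU under continuous batching.
def decode_mfu(n_active_params: float, batch_size: int,
               step_time_s: float, peak_flops_per_s: float) -> float:
    step_flops = 2.0 * n_active_params * batch_size  # forward only, no backward
    return (step_flops / step_time_s) / peak_flops_per_s

# As asynchronous requests join and finish, the batch (and MFU) drifts:
for b in (4, 32, 128):
    print(f"batch {b:3d}: MFU {decode_mfu(7e9, b, 0.02, 989e12):.1%}")
```

Even at batch 128 this hypothetical decoder sits near 9% MFU, which is why decode is usually memory-bandwidth-bound rather than compute-bound.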
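For the graph-based approach, PyTorch ships a FLOP counter (torch.utils.flop_counter.FlopCounterMode) that tallies FLOPs from the ops actually dispatched rather than from an analytic formula. A minimal example with a toy stand-in model:

```python
import torch
from torch.utils.flop_counter import FlopCounterMode

# Toy stand-in; in practice you would wrap a real training or inference step.
model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
x = torch.randn(4, 1024, 512)  # (batch, seq_len, d_model)

with FlopCounterMode(display=False) as counter:
    model(x).sum().backward()  # forward and backward are both counted

print(f"measured fwd+bwd FLOPs: {counter.get_total_flops():.3e}")
```

Because the counter sees attention, MoE routing, and everything else in the graph, it sidesteps most of the approximation issues above, at the cost of actually running the model.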