Reading MAI's efficiency gain. How to pick architectures like serious people
4 hours ago
- #model architecture
- #efficiency metric
- #computational trade-offs
- The MAI-Thinking-1 report introduces a method for comparing model architectures using Efficiency Gain (EG), a metric that accounts for the trade-off between compute budget and final loss.
- EG measures how much better or worse a candidate design is than a baseline. It can be computed on different cost axes, such as FLOPs or wall-clock time, and the two axes often favor different models.
- FLOP counting is implementation-independent, which makes it useful for evaluating new ideas before they are optimized, while wall-clock time reflects real-world costs such as cloud rental or cluster sharing.
- A key insight is that architectures that are cheap in FLOPs may underperform in wall-clock time because of inefficient kernels, which makes EG crucial for avoiding costly mistakes in design choices.
- Example from Table 2: an MoE variant with 7+1 shared layers shows a 3% EG win in FLOPs but an 18% loss in wall-clock time, so the interleaved layout is preferred even though the FLOPs comparison suggests otherwise.
- EG is computed by fitting a power law to the baseline runs, inverting the fit to get cost as a function of loss, and comparing that cost with the candidate's; values above 1 indicate an efficiency gain.
- The method generalizes to any architectural change, helping assess whether reduced FLOPs justify the engineering effort and whether an idea is viable on actual hardware.
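
The fit, invert, compare recipe summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the report's code. It assumes a pure power law `loss = a * cost**(-b)` with no irreducible-loss term, and it takes EG to be the baseline's cost at the candidate's loss divided by the candidate's actual cost. The function names are hypothetical and all numbers are synthetic.

```python
import math


def fit_power_law(costs, losses):
    """Least-squares fit of loss = a * cost**(-b), done in log-log space.

    Assumption: a pure power law with no irreducible-loss offset.
    Returns the pair (a, b).
    """
    xs = [math.log(c) for c in costs]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def baseline_cost_for_loss(a, b, loss):
    """Invert the fit: the cost the baseline would need to reach `loss`."""
    return (a / loss) ** (1.0 / b)


def efficiency_gain(a, b, candidate_cost, candidate_loss):
    """Assumed EG definition: baseline cost at the candidate's loss,
    divided by what the candidate actually spent. Above 1 is a gain."""
    return baseline_cost_for_loss(a, b, candidate_loss) / candidate_cost


# Synthetic baseline runs (made-up numbers, not taken from the report).
baseline_costs = [1e18, 1e19, 1e20]
baseline_losses = [100.0 * c ** -0.05 for c in baseline_costs]
a, b = fit_power_law(baseline_costs, baseline_losses)

# Hypothetical candidate: spends 1e19 but matches the loss the baseline
# would only reach at a cost of 1.2e19.
candidate_cost = 1e19
candidate_loss = 100.0 * (1.2e19) ** -0.05
eg = efficiency_gain(a, b, candidate_cost, candidate_loss)
print(f"EG = {eg:.3f}")
```

The same functions apply whether `cost` is measured in FLOPs or in wall-clock hours. Running them once per axis is what would expose a case where the two disagree.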