Which one is more important: more parameters or more computation? (2021)
- #Computation Efficiency
- #Deep Learning
- #Model Architecture
- The power of a deep learning model is often measured by its parameter count, but the amount of computation it performs per input is just as crucial and often overlooked.
- Two new methods, Hash Layers and Staircase Attention, help separate computation from model size, showing that increasing one without the other can boost performance.
- Hash Layers replace learned routing in sparse mixture-of-experts (MoE) models with a fixed hashing-based assignment of tokens to experts, growing model size without extra per-token computation and improving performance on language tasks.
- Staircase Attention increases computation without adding parameters by stacking or recurrently reapplying the same Transformer layers, improving performance on tasks such as language modeling and state tracking.
- Combining Hash Layers and Staircase Attention yields orthogonal improvements, offering fine-grained control over parameter and computation sizes for more powerful models.
- These methods challenge the conventional coupling of parameters and computation, suggesting new architectural approaches for deep learning research.
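The decoupling described above can be sketched in toy form. This is a minimal illustration under assumed shapes and names (`hash_route`, `hash_moe_ffn`, `staircase_like` are all invented for this example), not the papers' actual implementations: a fixed hash routes each token to one expert FFN, so adding experts grows parameters but not per-token FLOPs, while recurrently reapplying the same block grows computation but not parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, vocab = 16, 32, 4, 100

# One FFN weight pair per expert: parameter count scales with n_experts,
# but each token only ever passes through one expert.
W1 = rng.standard_normal((n_experts, d_model, d_ff)) * 0.1
W2 = rng.standard_normal((n_experts, d_ff, d_model)) * 0.1

def hash_route(token_id: int) -> int:
    # Fixed, non-learned hash of the token id (toy stand-in for
    # the hashing-based routing idea).
    return token_id % n_experts

def hash_moe_ffn(x: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
    # Each token is processed by exactly one expert, so per-token
    # computation is independent of n_experts.
    out = np.empty_like(x)
    for i, tid in enumerate(token_ids):
        e = hash_route(int(tid))
        h = np.maximum(x[i] @ W1[e], 0.0)  # ReLU FFN of the chosen expert
        out[i] = h @ W2[e]
    return out

def staircase_like(x: np.ndarray, token_ids: np.ndarray, n_steps: int) -> np.ndarray:
    # Recurrently reapply the SAME block: computation grows with n_steps
    # while the parameter count stays fixed (staircase-style recurrence,
    # heavily simplified).
    for _ in range(n_steps):
        x = x + hash_moe_ffn(x, token_ids)  # residual reapplication
    return x

token_ids = rng.integers(0, vocab, size=8)
x = rng.standard_normal((8, d_model))
y = staircase_like(x, token_ids, n_steps=3)
print(y.shape)  # (8, 16)
```

Note how the two knobs are independent here: `n_experts` scales parameters at constant FLOPs, and `n_steps` scales FLOPs at constant parameters, which is the fine-grained control the combined methods aim for.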