Which one is more important: more parameters or more computation? (2021)
- #Computation Efficiency
- #Deep Learning
- #Model Architecture
- The power of a deep learning model is often measured by its parameter count, but the amount of computation it performs per input is just as crucial and often overlooked.
- Two new methods, Hash Layers and Staircase Attention, help separate computation from model size, showing that increasing one without the other can boost performance.
- Hash Layers replace learned routing in sparse mixture-of-experts (MoE) models with a fixed hashing-based assignment of tokens to experts, growing model size without extra per-token computation and improving performance on language tasks.
- Staircase Attention increases computation without adding parameters by stacking or recurrently reapplying the same Transformer layers, improving performance on tasks such as language modeling and state tracking.
- Combining Hash Layers and Staircase Attention yields orthogonal improvements, offering fine-grained control over parameter and computation sizes for more powerful models.
- These methods challenge the conventional coupling of parameters and computation, suggesting new architectural approaches for deep learning research.
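The decoupling described above can be sketched in toy form. This is a minimal illustration under assumed shapes and names (`hash_route`, `hash_moe_ffn`, `staircase_like` are all invented for this example), not the papers' actual implementations: a fixed hash routes each token to one expert FFN, so adding experts grows parameters but not per-token FLOPs, while recurrently reapplying the same block grows computation but not parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, vocab = 16, 32, 4, 100

# One FFN weight pair per expert: parameter count scales with n_experts,
# but each token only ever passes through one expert.
W1 = rng.standard_normal((n_experts, d_model, d_ff)) * 0.1
W2 = rng.standard_normal((n_experts, d_ff, d_model)) * 0.1

def hash_route(token_id: int) -> int:
    # Fixed, non-learned hash of the token id (toy stand-in for
    # the hashing-based routing idea).
    return token_id % n_experts

def hash_moe_ffn(x: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
    # Each token is processed by exactly one expert, so per-token
    # computation is independent of n_experts.
    out = np.empty_like(x)
    for i, tid in enumerate(token_ids):
        e = hash_route(int(tid))
        h = np.maximum(x[i] @ W1[e], 0.0)  # ReLU FFN of the chosen expert
        out[i] = h @ W2[e]
    return out

def staircase_like(x: np.ndarray, token_ids: np.ndarray, n_steps: int) -> np.ndarray:
    # Recurrently reapply the SAME block: computation grows with n_steps
    # while the parameter count stays fixed (staircase-style recurrence,
    # heavily simplified).
    for _ in range(n_steps):
        x = x + hash_moe_ffn(x, token_ids)  # residual reapplication
    return x

token_ids = rng.integers(0, vocab, size=8)
x = rng.standard_normal((8, d_model))
y = staircase_like(x, token_ids, n_steps=3)
print(y.shape)  # (8, 16)
```

Note how the two knobs are independent here: `n_experts` scales parameters at constant FLOPs, and `n_steps` scales FLOPs at constant parameters, which is the fine-grained control the combined methods aim for.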