Hasty Briefs

Which one is more important: more parameters or more computation? (2021)

  • #Computation Efficiency
  • #Deep Learning
  • #Model Architecture
  • The power of a deep learning model is often measured by its number of parameters, but the amount of computation is also crucial, though often overlooked.
  • Two new methods, Hash Layers and Staircase Attention, help separate computation from model size, showing that increasing one without the other can boost performance.
  • Hash Layers replace learned routing in sparse mixture-of-experts (MoE) models with a fixed, hashing-based routing of tokens to experts, growing model size without extra per-token computation and improving performance on language tasks.
  • Staircase Attention increases computation without adding parameters by recurrently applying the same Transformer layers, improving performance on tasks such as language modeling and state tracking.
  • Combining Hash Layers and Staircase Attention yields orthogonal improvements, offering fine-grained control over parameter and computation sizes for more powerful models.
  • These methods challenge the conventional coupling of parameters and computation, suggesting new architectural approaches for deep learning research.
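The hash-routing idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `% num_experts` hash and the single-matrix experts are toy stand-ins for the paper's hash functions and feed-forward experts. The point it demonstrates is that each token pays for exactly one expert's computation, no matter how many experts (parameters) exist.

```python
import numpy as np

def hash_route(token_ids, num_experts):
    # Fixed, non-learned routing: each token id deterministically
    # hashes to one expert (toy hash for illustration).
    return token_ids % num_experts

def hash_layer(x, token_ids, expert_weights):
    """Sparse MoE feed-forward layer with hash-based routing.

    x: (seq_len, d_model) activations
    token_ids: (seq_len,) integer token ids
    expert_weights: list of (d_model, d_model) matrices, one per expert
    """
    num_experts = len(expert_weights)
    routes = hash_route(token_ids, num_experts)
    out = np.empty_like(x)
    for e in range(num_experts):
        mask = routes == e
        if mask.any():
            # Only tokens routed to expert e run through its weights,
            # so per-token compute is constant as num_experts grows.
            out[mask] = x[mask] @ expert_weights[e]
    return out
```

Adding experts here enlarges the parameter count (more weight matrices) while each token still multiplies through a single matrix, which is the decoupling the bullets describe.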
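The complementary direction, extra computation with no extra parameters, can be sketched by recurrently applying one shared block. This is only the parameter-sharing skeleton of the idea, assuming a toy residual map in place of a real Transformer layer; Staircase Attention itself additionally processes tokens in a staircase of overlapping chunks, which is not shown here.

```python
import numpy as np

def shared_block(x, W):
    # One parameter-shared step: a residual nonlinear map standing in
    # for a full Transformer layer, purely for illustration.
    return x + np.tanh(x @ W)

def recurrent_stack(x, W, num_steps):
    """Apply the SAME weights num_steps times.

    Computation scales linearly with num_steps while the parameter
    count stays fixed at W.size -- the opposite knob to Hash Layers.
    """
    for _ in range(num_steps):
        x = shared_block(x, W)
    return x
```

Turning `num_steps` up buys more sequential processing (useful for state tracking) without changing the model's size, which is why the two methods compose orthogonally.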